1. Introduction
Due to the complexity of the causes of electrical accidents, it is usually difficult to trace the causes from electrical accident images; even for professionals, the analysis takes a long time[1]. In recent years, with the emergence of new technologies in the fields of computer
vision and deep learning, vision-based defect recognition and detection technologies
have made great progress. However, due to problems such as high structural complexity,
small sample size, and strong noise interference, it is still challenging to use deep
learning models to accurately analyze these images[8].
Knowledge transfer technology[9] leverages knowledge learned from one model or dataset to improve performance on a target task. Knowledge distillation has greater flexibility in adjusting the student network architecture: it allows the student network to learn from different types of teacher networks according to the requirements of the target task, and it has therefore been regarded as a potential solution. Traditional distillation schemes usually use two isomorphic models, but they do not perform well when the two models have heterogeneous architectures.
This paper proposes a structure in which a semantic segmentation teacher network (U-Net3+[18]) guides a classification student network (ResNet-18[7]). By distilling intermediate-layer features, the model's ability to perceive the structure of small-sample electrical melting images is improved, which in turn improves classification accuracy and prediction confidence. The dataset of our experiment is very small, with only 117 training samples and an uneven class distribution. U-Net3+ is used as the teacher network and ResNet-18 as the student network. A feature alignment module[2] is introduced at the intermediate layer[20], and a composite distillation loss function[13] is designed, adopting a feature learning method[17] to enhance visual structure preservation. In addition, we use label smoothing to regularize the output-level distillation, and we design a dynamic weighting scheme that gradually increases the influence of distillation as training progresses. A learning rate schedule based on cosine annealing and an early stopping strategy are used to ensure training stability[14].
The reliability of the model prediction for unlabeled data is explored by measuring
the average confidence and average entropy on a small-scale image dataset. The results
show that under the conditions of small data samples of electrical cable melting images
and heterogeneous teacher and student networks, choosing the right model and distillation
scheme can also maintain high reliability, which also verifies the effectiveness and
potential of our proposed method.
2. Related Work
2.1 Dataset and Evaluation Metrics
In the analysis of electrical fire accidents, the study of cable melting morphology
is of great significance for determining the cause of the accident. However, the melting
morphologies caused by different causes are highly similar in appearance, which increases
the challenge of the classification task.
The dataset constructed in this study covers three typical types of electrical cable
melting morphologies: (1) melting due to a short circuit; (2) melting caused by a fire-induced short circuit; and (3) melting under the direct action of the high temperature of a flame. The dataset contains 117 images in total: 55 images of wire melting caused by combustion, 32 caused by a short circuit, and 29 caused by a fire-induced short circuit. As shown in Figure 1, short-circuit melting and fire-induced short-circuit melting are often accompanied by metal beads and surface carbonization, while cable deformation caused by fire may also present a morphology similar to arc burning, making it difficult to obtain stable classification results with traditional image discrimination methods on small-sample datasets.
The first type is the primary short-circuit melting mark. When electrical circuits are subjected to external physical forces, the sheath temperature rises to 2000℃~3000℃, causing a short circuit; some of the metal fragments disintegrate and, over time, leave a metallic mesh-like scar.
The second type involves electrical wires that, while energized, lose their protective coating due to the high temperature of a fire. When these wires come into contact with other wires, they are strongly affected by the intensity of the flame: the copper wire softens under the heat, causing the mesh to sag and leaving short-circuit marks. Primary short-circuit marks have less visible sheen; they may also show thread-like features and only faint markings, making them difficult to distinguish with the naked eye.
The third type is the copper wire melting mark caused by the heat of a fire when the power is off. The melting area is large, and because these marks are not directly caused by electricity, their shape differs from primary and secondary marks and is easily distinguishable with the naked eye[21].
Fig. 1. Electrical cable melting image
In order to further enhance the rigor and credibility of model evaluation, this paper introduces average confidence and average entropy as unsupervised indicators for the prediction results. Relevant studies have shown that the probability
distribution characteristics reflected by average confidence and average entropy reflect
the model's subjective confidence and uncertainty expression ability in sample prediction
to a certain extent, which helps to verify the generalization ability of the proposed
distillation model in unsupervised scenarios.
Assuming that the softmax probability vector output by the model is $p = [p_1, p_2, \dots, p_K]$, where K is the total number of categories, the average confidence is calculated as in (1):
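The formula is not reproduced in this excerpt; a standard reconstruction consistent with the description below, where N denotes the number of samples (our notation), is:

$\mathrm{AC} = \dfrac{1}{N}\sum_{i=1}^{N}\max_{k}\, p_{ik}$  (1)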
This indicator reflects the strongest confidence value of the model prediction. The
higher the value, the more concentrated and certain the model output is. The calculation
method of average entropy is as in (2):
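A corresponding reconstruction of Eq. (2), the mean Shannon entropy of the predicted distributions (again with N denoting the number of samples), is:

$\mathrm{AE} = -\dfrac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} p_{ik}\,\ln p_{ik}$  (2)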
Entropy reflects the uncertainty of the model output distribution; the lower the value, the more certain and reliable the model prediction is. Here, log denotes the natural logarithm, and $p_{ik}$ represents the predicted probability of the i-th sample for the k-th category.
The average confidence measures the strongest category preference of the model under
unsupervised input, which is suitable for quantifying whether the model "dares to
make a clear judgment"; while the average entropy characterizes the overall uncertainty
of the model output from the perspective of information entropy, which helps to discover
potential fuzzy decision areas. The combination of the two can fully reflect the stability
of the model on unknown domain samples, as an evaluation method complementary to traditional
supervised indicators such as accuracy and F1-score.
2.2 Model Selection
U-Net3+ is an improved image segmentation model in the U-Net series. By introducing full-scale skip connections and a deep decoder structure, it significantly improves the model's perception of boundaries and fine-grained structures. Unlike the original U-Net, U-Net3+ fuses feature maps from multiple scales
in the decoder part, which can more effectively capture hot spot textures, bifurcated
edges, and complex backgrounds in electrical images. U-Net3+ was selected as the teacher
network, on the one hand because of its strong representation ability that has been
widely verified in scenarios such as industrial image segmentation, and on the other
hand because of its clear encoder structure, which facilitates cross-scale intermediate
layer alignment with the ResNet-18 student network.
ResNet-18 is a classic lightweight residual network. It has a compact structure, containing only 18 layers of convolution and residual modules, and is suitable for lightweight deployment scenarios. ResNet-18 performs
more stably in small sample tasks and has high inference efficiency. Therefore, ResNet-18
was selected as the student network in this study. In addition, it has a higher transfer
efficiency when receiving knowledge distillation from the teacher network. Although
its parameter volume is significantly smaller than that of U-Net3+, with the feature
distillation strategy of the intermediate layer proposed in this study, ResNet-18
can still achieve better performance than the original model in complex electrical
image tasks.
2.3 Heterogeneous Feature Middle Layer Strategy
Knowledge distillation was proposed by Hinton et al. in 2015[11]. The goal is to transfer knowledge from a more powerful and complex teacher network to a smaller, more compact student network by matching certain outputs. Some research focuses on logit-based methods to enhance student predictions, which in turn rely on model isomorphism to improve distillation results. However, if the student network and the teacher network share the same complete architecture, overfitting may occur[3]. TAKD proposes to reduce the gap between the teacher and the student by inserting intermediate-sized assistant models, and DGKD further improves TAKD by aggregating all intermediate models to strengthen the guidance of the student[19]. Another solution is the reverse distillation architecture[3], in which the features extracted by the teacher network are first compressed and
then decoded by the student network. However, there are relatively few studies on
the exploration and utilization of the intermediate layer, which often contains rich
information. Due to the structural differences between different models, this paper
uses a lightweight intermediate feature alignment module to solve the inconsistency
problem between the intermediate layer features of students and teachers in the semantic
space and channel dimensions in heterogeneous architectures. This module does not
require additional supervision signals and can automatically adjust the expression
of student features during training to better approximate the distribution of teacher
features.
2.4 Dynamic Distillation Weight Adjustment Strategy
Many knowledge distillation-related methods use fixed distillation loss weights, but
this static weight strategy has certain limitations in its application. Especially
in the early stages of training, the student network has not yet established a stable
representation capability. If it relies too much on the teacher signal, it may limit
the learning and optimization of its own feature space, and even lead to convergence
difficulties or performance degradation[10].
To this end, this paper uses a dynamic distillation weight adjustment strategy based
on training progress, which aims to adaptively adjust the guidance strength of the
teacher signal at different stages. Reducing the distillation loss weight in the early
stages of training helps the student network to converge stably; gradually strengthening
the distillation effect in the later stages can more fully guide feature migration
and knowledge absorption. This strategy not only improves the final performance of
the model, but also enhances its generalization ability and training stability.
2.5 Distillation Loss Design and Output-Level Regularization
For the knowledge distillation task, this paper adopts a fusion design of three distillation
losses: mean square error (MSE), cosine similarity, and structural similarity index
(SSIM). MSE loss is used to measure the absolute error between the output of the student
network and the teacher network to ensure the consistency of the numerical level;
cosine similarity focuses on the directional consistency of the feature vectors of
the two, which helps to capture the similarity of high-level semantic features; and
SSIM introduces the constraint of structural information [15], further enhancing the model's ability to perceive image structural details. This
multi-dimensional loss fusion design not only improves the distillation effect, but
also makes the student network perform better in multi-level feature alignment.
The output probability distribution of the teacher network in knowledge distillation
not only reflects the probability of the target category, but also contains the similar
structure between categories, providing the student network with richer and more detailed
supervision information than traditional hard labels. However, the prediction distribution
of the teacher network usually shows a highly confident feature, that is, the probability
of the target category is close to 1, while the probability of the non-target category
is close to 0. This sharp probability distribution may cause the student network to
overfit during the training process, limiting the generalization performance of the
model.
To alleviate the above problems, label smoothing was introduced into the distillation paradigm. This technique effectively slows down the extreme phenomenon
of output probability and suppresses the overconfidence of the model by adjusting
the label probability of the target category from 1 to 1−ε and evenly distributing
the remaining probability to other categories. It provides a softer and regularized
supervision signal for the student network, promoting its better generalization of
the data distribution[4].
2.6 Training Strategy and Optimization Techniques
In the process of deep neural network training, the learning rate scheduling strategy
has an important impact on the model convergence speed and performance. Traditional
learning rate scheduling methods mostly use fixed step decay or exponential decay,
which can easily lead to a sudden drop in learning rate during training, thereby causing
optimization oscillations. This paper adopts a cosine annealing learning rate scheduling strategy that incorporates a warm-up phase. A smooth transition of the learning rate is achieved through a piecewise function, which shows excellent optimization performance
in the feature distillation experiment of this paper and is suitable for distillation
scenarios that require fine feature alignment.
In order to effectively prevent the model from overfitting during training and improve
training efficiency, this paper introduces an early stopping mechanism. This mechanism
automatically stops training when performance fails to improve over a set number of consecutive training rounds, thereby avoiding overfitting to the training set and the resulting performance degradation on the test set.
3. Proposed Method
3.1 Overall Framework
This paper proposes a teacher-student network distillation framework based on intermediate
layer feature alignment, and its overall structure is shown in Figure 2. The framework
consists of two parallel information processing paths, corresponding to the teacher
network and the student network respectively. The teacher network adopts the complex
U-Net3+ architecture with strong representation ability and multi-scale feature extraction
ability, while the student network adopts the lightweight ResNet-18.
The input image is passed to the teacher and student networks at the same time. The
teacher network keeps the parameters frozen during the inference process, and its
output intermediate layer features are used to guide the learning of the student network.
In view of the structural heterogeneity of the teacher and student networks, a feature
alignment module is introduced to narrow the differences between the two in channel
dimension and feature space distribution. The aligned student features are matched
with the teacher features at two levels on the intermediate layer. By designing a
sophisticated loss function, the student network is guided to gradually approach the
teacher network in feature expression. While ensuring the model compression effect,
this framework significantly improves the expression ability and generalization performance
of the student network.
Fig. 2. Distillation experiment flow chart
3.2 Total Loss Function
The total loss function during the entire training process is defined as in (3):
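The equation itself is missing from this excerpt; one plausible form consistent with the description below and with Table 2 (the authors' exact combination is not shown here) is:

$\mathcal{L}_{total} = \mathcal{L}_{ce} + \lambda(t)\,\mathcal{L}_{distill}$  (3)

where $\mathcal{L}_{ce}$ is the (label-smoothed) cross-entropy loss and $\mathcal{L}_{distill}$ is the feature distillation loss of Eq. (6).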
λ(t) is a dynamic weight function that gradually increases with the training epochs, growing as a quadratic function, so that the model focuses on label supervision in the early stage of training and gradually strengthens the transfer of teacher knowledge in the later stage. It is defined as in (4):
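A reconstruction of Eq. (4) consistent with the quadratic ramp described below and the 0.1 → 0.5 range used in this work (the endpoint notation $\lambda_{min}, \lambda_{max}$ is ours):

$\lambda(t) = \lambda_{min} + (\lambda_{max} - \lambda_{min})\left(\dfrac{epoch}{epochs}\right)^{2}, \quad \lambda_{min}=0.1,\ \lambda_{max}=0.5$  (4)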
The squared form (epoch/epochs)² yields a nonlinear weighting curve (slow start, late acceleration), which is a reasonable and effective learning schedule, as shown in Figure 3.
Fig. 3. Dynamic Distillation Weight Graph
The initial distillation loss weight is small (0.1) to prevent the student network from relying on the teacher's features too early and to retain a certain degree of autonomous learning ability. The weight is gradually increased to 0.5 in the later stage; as the student's ability improves, the teacher's intermediate features are utilized more fully.
L(ce) represents the cross-entropy loss, which is used to measure the difference between
the model's predicted distribution and the true label distribution. Through the learning
of intermediate features, the student network can more stably obtain the semantic
information in the teacher network, significantly improving the overall performance.
3.3 Feature Alignment Module
This paper uses a lightweight feature alignment module. This lightweight convolution module automatically learns the feature channel transformation, avoiding manual dimension adjustment and improving versatility and flexibility. The module automatically maps the student's intermediate
layer features to the channel dimension that matches the teacher's features through
two layers of convolution and normalization operations, making the subsequent distillation
loss calculation more efficient and stable. Compared with directly aligning the number
of channels, this design has higher flexibility and versatility and is suitable for
distillation tasks with different architecture combinations. Two layers of convolution
are used to automatically align the intermediate feature channel dimensions of the
teacher and student networks. The conversion formula is as in (5):
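A reconstruction of Eq. (5) from the layer-by-layer description that follows ($F_s$ and $F'_s$ denote the student feature before and after alignment; the notation is ours):

$F'_{s} = \mathrm{Conv}_{1\times 1}\big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{5\times 5}(F_{s})))\big)$  (5)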
Conv2d(kernel_size=5) uses a 5×5 convolution kernel to perform convolution operations
on the input feature map. The 5×5 convolution kernel can capture a wider range of
spatial information and local context, and extract high-level semantic information
from the input features through multiple convolution kernels. Batch Normalization
(BN) performs batch normalization on the output of the convolution layer. This accelerates training: by normalizing the input distribution and reducing internal covariate shift, the model converges faster. It also has a slight regularization effect, which helps prevent overfitting[6], and it allows a larger learning rate to be used, enhancing the robustness and stability of the model. The ReLU (Rectified Linear Unit) function applies the nonlinear activation f(x) = max(0, x). It introduces nonlinearity, enabling the network to learn complex nonlinear mapping relationships; it produces sparse activation by setting negative values to 0, making some neurons "silent" and increasing the sparsity of the model; and it is simple to compute and alleviates the gradient vanishing problem.
Conv2d(kernel_size=1) uses a 1×1 convolution kernel to perform point-by-point convolution
on the feature map. It performs channel shuffling and dimensionality reduction. By
adjusting the number of output channels, it achieves channel compression or expansion
of the feature map. It fuses information between different channels without changing
the spatial dimension. The number of parameters and calculations of 1×1 convolution
is significantly reduced, reducing computational complexity.
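The description above can be summarized in a minimal PyTorch sketch of the alignment module; the class name, the padding choice (padding=2 to preserve spatial size), and the channel arrangement are our assumptions rather than the authors' exact implementation:

import torch
import torch.nn as nn

class FeatureAlign(nn.Module):
    # Maps student intermediate features to the teacher's channel dimension
    # via Conv(5x5) -> BN -> ReLU -> Conv(1x1), as described in Section 3.3.
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.align = nn.Sequential(
            nn.Conv2d(student_channels, teacher_channels, kernel_size=5, padding=2),
            nn.BatchNorm2d(teacher_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(teacher_channels, teacher_channels, kernel_size=1),
        )

    def forward(self, student_feat):
        return self.align(student_feat)

# Example: align a 128-channel student feature map to 320 teacher channels
# (the channel counts here are illustrative only).
aligner = FeatureAlign(student_channels=128, teacher_channels=320)
aligned = aligner(torch.randn(1, 128, 28, 28))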
3.4 Feature-level Distillation Based on Multi-loss Fusion
In order to effectively improve the learning ability of the lightweight student network
on the intermediate features of the teacher network, this paper designs a feature-level
distillation method based on multi-loss fusion. This method introduces three different
loss functions to supervise the alignment of intermediate layer features, including
mean square error (MSE), cosine similarity (Cosine Similarity) and structural similarity
index (SSIM). Among them, MSE is used to measure pixel-level error, cosine similarity
focuses on the directional consistency of feature vectors, and SSIM can capture structural
information, thereby improving the semantic preservation ability of features. The
feature distillation loss function of multi-loss fusion is as in (6):
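A reconstruction of Eq. (6) consistent with the weights discussed below; whether the cosine and SSIM terms enter as similarities or as (1 − similarity) losses is our assumption:

$\mathcal{L}_{feat} = \mathcal{L}_{MSE}(F'_{s}, F_{t}) + \beta\big(1-\cos(F'_{s}, F_{t})\big) + \gamma\big(1-\mathrm{SSIM}(F'_{s}, F_{t})\big)$  (6)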
Here β is 0.5 and γ is 0.1 in this experiment; they are the weight hyperparameters of the cosine similarity and SSIM losses, respectively. This setting is based on the following considerations. The dominant term is composed of MSE plus cosine similarity, and cosine similarity can effectively capture the directionality of the feature distribution; its effect is usually stronger than that of structural information, so it is given the higher weight (0.5). SSIM is used as a structural compensation term: because SSIM has a large impact on the gradient, too high a weight may lead to unstable optimization, so it is set to a smaller value (0.1) to provide moderate perceptual regularization without interfering with the optimization of the main loss. This parameter combination has been verified to have good robustness and generalization ability; in the experiments, β = 0.5 and γ = 0.1 showed stable convergence and performance improvement, as detailed in the ablation study in Table 5.
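As an illustration only, the fused feature loss can be sketched in PyTorch as follows; the function name is ours, the teacher and student features are assumed to be already aligned to the same shape, and ssim_fn stands for any SSIM implementation returning a similarity in [0, 1] (not a specific library call):

import torch
import torch.nn.functional as F

def feature_distill_loss(student_feat, teacher_feat, ssim_fn, beta=0.5, gamma=0.1):
    # Pixel-level agreement
    mse = F.mse_loss(student_feat, teacher_feat)
    # Directional agreement of flattened feature vectors (1 - cosine as a loss)
    cos = F.cosine_similarity(student_feat.flatten(1),
                              teacher_feat.flatten(1), dim=1).mean()
    # Structural agreement (1 - SSIM as a loss)
    ssim_val = ssim_fn(student_feat, teacher_feat)
    return mse + beta * (1.0 - cos) + gamma * (1.0 - ssim_val)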
3.5 Output-level Distillation with Label Smoothing
This paper introduces the label smoothing technology, which alleviates the problem
of the model's overconfidence in a certain category by smoothing the distribution
of true labels, and effectively improves the robustness of the model on the validation
set. The label smoothing loss can be expressed as in (7):
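A reconstruction of Eq. (7), the cross entropy against the smoothed target distribution:

$\mathcal{L}_{LS} = -\sum_{i=1}^{K}\hat{y}_{i}\,\log p_{i}$  (7)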
Where pi represents the probability of the softmax output class i, and $\hat{y_{i}}$ is the
target distribution after label smoothing, which is defined as in (8):
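A reconstruction of Eq. (8) matching the description in Section 2.5 (the target class keeps 1 − ε, and the remainder is spread evenly over the other K − 1 classes; y denotes the true class):

$\hat{y}_{i} = \begin{cases} 1-\varepsilon, & i = y \\ \varepsilon/(K-1), & i \neq y \end{cases}$  (8)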
Among them, ε ∈ [0,1] is a smoothing parameter, which is set to 0.1 in this paper,
and K is the total number of categories. Through this process, the model will not
be overconfident about a certain category, which helps to improve generalization ability
and reduce overfitting.
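As a minimal sketch, PyTorch's built-in cross entropy already supports this kind of smoothing; note that torch.nn.CrossEntropyLoss spreads ε uniformly over all K classes, which differs slightly from the ε/(K − 1) split in Eq. (8):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # epsilon = 0.1 as in the paper
logits = torch.randn(16, 3)           # batch of 16, K = 3 classes (illustrative values)
targets = torch.randint(0, 3, (16,))  # ground-truth class indices
loss = criterion(logits, targets)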
3.6 Learning Rate Dynamic Scheduling Strategy Based on Warm-up Cosine Annealing
This paper introduces a learning rate dynamic scheduling strategy that combines warm-up
mechanism[16] with cosine annealing. This strategy makes the training process smoother by adjusting
the learning rate in stages, which is especially suitable for scenarios with high
requirements for model convergence accuracy in tasks such as feature distillation. This strategy mainly consists of two stages:
1. Linear warm-up stage: In the early stage of training, in order to avoid sudden
large gradient updates leading to parameter oscillation, the learning rate is gradually
increased from 0 to the initial learning rate lr₀ in a linear increment. Specifically,
in the first w training cycles, the learning rate is updated as in (9):
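A reconstruction of Eq. (9), matching the warm-up branch of the pseudocode later in this section:

$\eta_{t} = lr_{0}\cdot\dfrac{t}{w}, \qquad 0 \le t < w$  (9)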
Where 0 ≤ t < w, t represents the current training round, and w represents the number
of warm-up rounds. In this stage, the learning rate is slowly increased to make the
model more stably adapt to the gradient changes at the beginning of training.
2. Cosine annealing stage: When the number of training rounds reaches or exceeds the
number of warm-up rounds w, the model enters the cosine annealing stage. In this stage,
the learning rate slowly decays to the set minimum value η_min according to the cosine
function. The specific expression is as in (10):
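A reconstruction of Eq. (10), matching the cosine-annealing branch of the pseudocode later in this section:

$\eta_{t} = \eta_{min} + \dfrac{1}{2}\,(lr_{0}-\eta_{min})\left(1+\cos\!\left(\pi\,\dfrac{t-w}{T-w}\right)\right), \qquad w \le t \le T$  (10)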
Where w ≤ t ≤ T, T is the total number of training cycles. This decay strategy is
smoother than the traditional step-by-step descent method, which helps the model converge
to a better solution in the later stage of training and improves the generalization
ability on unseen samples.
In this study, for the feature distillation task, the following specific parameter
configurations are used: warm-up rounds w = 10, total training cycle T = 100, initial
learning rate lr₀ = 1e-4, minimum learning rate η_min=0. By integrating this strategy
with the custom scheduling module, the current learning rate is automatically updated
after each round of training to ensure the dynamic adaptability of the training process.
The core logic of this strategy can be simplified as shown in the following Python-style pseudocode:

# Warm-up + cosine-annealing schedule (runnable form of the pseudocode;
# optimizer, T, w, lr_0 and eta_min are assumed to be defined as described above).
import math

def compute_learning_rate(t, T, w, lr_0, eta_min=0.0):
    if t < w:
        # Linear warm-up phase
        return lr_0 * (t / w)
    # Cosine annealing decay phase
    progress = (t - w) / (T - w)
    return eta_min + 0.5 * (lr_0 - eta_min) * (1 + math.cos(math.pi * progress))

# Main loop
for t in range(T):
    lr = compute_learning_rate(t, T, w, lr_0, eta_min)
    for group in optimizer.param_groups:   # update optimizer learning rate
        group["lr"] = lr
    # ... forward and backward propagation over the training set for this epoch,
    #     calling optimizer.step() per batch ...
To summarize, the proposed method consists of a U-Net3+ teacher and a ResNet-18 student network. The student learns from the teacher via intermediate feature alignment using a lightweight alignment module, guided by a multi-loss fusion strategy. The training process is regularized by label smoothing, enhanced by a dynamic loss weighting mechanism, and optimized using a warm-up cosine learning rate scheduler. Together, these components contribute to stable and efficient distillation learning under small-sample constraints.
4. Experimental Study and Results
The experiment in this paper is divided into two stages: one is to use supervised
indicators for evaluation during the training model stage, and the other is to use
the trained model to classify a set of unlabeled electrical cable melting images,
using unsupervised indicators for evaluation. The results of supervised and unsupervised
indicators in the two stages complement each other to verify the feasibility of the
solution.
4.1 Experimental Parameters, Environment, and Dataset
The environment of this experiment is shown in Table 1:
Table 1. Experimental environment parameter table
Operating system | Windows 10 Pro (64-bit)
Central Processing Unit (CPU) | AMD Ryzen 7 9800X3D
Graphics Processing Unit (GPU) | NVIDIA GeForce RTX 5080
Memory | 16 GB × 2 (dual channel, 32 GB)
Software environment | Programming language: Python 3.10
The complete experimental parameters of this experiment are shown in Table 2.
Table 2. Experimental parameters
Parameter Category | Setting Value
Optimizer | AdamW
Initial learning rate | 1 × 10⁻⁴
Weight decay (L2 regularization) | 5 × 10⁻⁴
Loss function combination | Label smoothing cross entropy + distillation loss
Distillation loss weight α | Increases gradually from 0.1 to 0.5
Cosine similarity loss coefficient β | 0.5
Structural similarity loss coefficient γ | 0.1
Input image size | 224 × 224
Batch size | 16
Number of training rounds | 100
Learning rate scheduler | Warm-up with cosine annealing dynamic learning rate scheduling strategy
This paper introduces a variety of image enhancement strategies during the training
phase. These enhancement methods are designed to simulate the various changes that
may occur in images in practical applications (such as rotation, flipping, illumination
changes, blur, etc.), thereby increasing the diversity of training data, alleviating
the overfitting problem, and improving the model's adaptability in complex environments.
Table 3 systematically summarizes the enhancement methods used in the training process and their main parameters.
Table 3. Table of data enhancement strategies used in the experiment
Enhancement Method | PyTorch Operation | Main Parameters
Random rotation | transforms.RandomRotation(15) | Rotation angle range: ±15°
Random horizontal flip | transforms.RandomHorizontalFlip() | Flip probability: 0.5
Random vertical flip | transforms.RandomVerticalFlip() | Flip probability: 0.5
Color jitter | transforms.ColorJitter(...) | Brightness/contrast/saturation: ±0.15, hue: ±0.08
Random cropping | transforms.RandomCrop(224, padding=4) | Crop size: 224×224, padding: 4 pixels
Gaussian blur | transforms.GaussianBlur(3, (0.1, 1.5)) | Kernel size: 3×3, σ ∈ [0.1, 1.5]
4.2 Construction of the Teacher Network U-Net3+
Data source and label format: This study uses the image dataset after data augmentation. The mask is a single-channel label map with integer values 0–3.
Data augmentation strategy: The augmentation pipeline implements the image enhancement strategies shown in Table 3, which significantly improves the diversity and robustness of the training samples. The expanded training data is about four times the original data.
Preprocessing process (a minimal OpenCV sketch follows the list):
1. Use an adaptive Gaussian-weighted mean threshold (block size 11, offset constant C=2) to segment suspected defect areas;
2. Eliminate isolated noise through a morphological opening operation (kernel size 3×3), combined with contour-area filtering (retain regions with area > 100 pixels);
3. Convert the color mask into a single-channel semantic map.
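A minimal OpenCV sketch of the three steps above; the variable names, the sample input path, and the way the filtered mask is assembled are ours:

import cv2
import numpy as np

image = cv2.imread("sample.jpg")  # hypothetical input path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# 1. Adaptive Gaussian-weighted threshold (block size 11, offset C = 2)
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 11, 2)

# 2. Morphological opening (3×3 kernel) to remove isolated noise
kernel = np.ones((3, 3), np.uint8)
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# 3. Contour-area filtering: keep regions larger than 100 pixels
contours, _ = cv2.findContours(opened, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
mask = np.zeros_like(opened)
for c in contours:
    if cv2.contourArea(c) > 100:
        cv2.drawContours(mask, [c], -1, 255, thickness=cv2.FILLED)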
Model structure: The U-Net3+ architecture is used as the segmentation model, which
has multi-scale feature aggregation paths and dense skip connections, significantly
enhancing the ability to characterize complex structural defects. The model accepts
three-channel RGB input images and outputs single-channel pixel-level segmentation
prediction maps.
Training parameter configuration:
Optimizer: Adam, initial learning rate set to 1×10⁻³;
Loss function: BCEWithLogitsLoss, which integrates sigmoid activation with binary cross entropy to improve numerical stability[5];
Batch size: 4;
Total number of training rounds: 10;
Computing platform: GPU is used on devices with CUDA support, otherwise CPU.
Semantic mapping design: The color mask is uniformly converted into a single-channel
label map, which simplifies the semantic mapping logic and improves the computational
efficiency, while facilitating the subsequent alignment of intermediate features with
the student network.
Model structure advantages: The U-Net3+ architecture enhances its responsiveness to
small targets and locally blurred areas through deep multi-scale fusion and full-scale
skip connections, and is particularly suitable for image segmentation tasks with complex
structures and diverse defect types.
After training is complete, the optimal model is saved as a .pth file. This
model serves as the teacher network in the distillation process, providing semantically
rich intermediate layer supervision for the student network (ResNet-18), which helps
improve the feature expression ability and generalization performance in downstream
tasks.
4.3 Baseline and Comparison
To verify the effectiveness of the feature distillation framework proposed in this
paper, we built multiple sets of baseline models as performance references. All models
were trained under the same dataset partitioning (70% for training, 15% for validation,
and 15% for testing) and data augmentation strategies. The comparison models include:
Baseline model 1: ResNet-18 model.
Baseline model 2: Student network is replaced with ShuffleNetV2.
Baseline model 3: Student network is replaced with MobileNetV3.
Table 4. Baseline model data
(Supervised indicators: Acc, F1, Loss; unsupervised indicators: AE = average entropy, AC = average confidence.)
Type | Acc | F1 | Loss | AE | AC
ResNet-18 | 96.82% | 97.65% | 0.4275 | 1.0211 | 72.18%
Student network: ShuffleNetV2 | 98.88% | 98.67% | 0.0435 | 1.2442 | 51.76%
Student network: MobileNetV3 | 98.88% | 98.41% | 0.0766 | 1.0957 | 60.83%
Complete model | 100.00% | 100.00% | 0.0291 | 0.3287 | 90.90%
Table 4 shows that the original ResNet-18 model achieves 96.82% accuracy and a 97.65% F1
value on the test set when trained without distillation guidance, but its loss value
is relatively high (0.4275), the average entropy is 1.0211, and the output average
confidence is 72.18%. After replacing the student network with ShuffleNetV2, although
the model parameters are greatly reduced, the accuracy is improved to 98.88% and the
F1 value is improved to 98.67% under the guidance of distillation, and the loss is
significantly reduced to 0.0435, indicating that the distillation mechanism effectively
enhances the discriminative ability of the lightweight model. However, its average
entropy rises to 1.2442 and the output average confidence drops to 51.76%, indicating
that the uncertainty of the model output has increased. Similarly, using MobileNetV3
as the student network also achieved similar accuracy (98.88%) and F1 value (98.41%)
as ShuffleNetV2, with a loss of 0.0766, an average entropy of 1.0957, and an average
confidence of 60.83%.
The complete distillation model performs best in all indicators, with accuracy and
F1 value both reaching 100%, loss reduced to 0.0291, average entropy significantly
reduced to 0.3287, and average confidence reaching 90.90%. The above results fully
prove that the distillation strategy proposed in this paper can not only improve the
performance of the student network, but also effectively enhance the certainty and
stability of the model output, especially showing good adaptability and generalization
ability on the lightweight model.
4.4 Ablation Experiment
To further verify the specific role and performance contribution of each module in
the proposed distillation framework, this paper designed nine systematic ablation
experiments. All experiments are based on the complete distillation model, gradually
removing or modifying specific modules to analyze their impact on the final performance.
The ablation settings and results are shown in Table 5.
Table 5. Ablation experiment data
(Supervised indicators: Acc, F1, Loss; unsupervised indicators: AE = average entropy, AC = average confidence.)
Type | Acc | F1 | Loss | AE | AC
FeatureAlign + MSE | 98.88% | 98.45% | 0.0278 | 0.4733 | 86.70%
FeatureAlign + MSE + cosine similarity | 98.88% | 98.86% | 0.0435 | 0.3600 | 89.73%
FeatureAlign + MSE + SSIM | 98.85% | 98.38% | 0.1075 | 0.5438 | 84.72%
FeatureAlign + cosine similarity | 97.70% | 97.04% | 0.2293 | 0.5316 | 85.35%
FeatureAlign + cosine similarity + SSIM | 96.55% | 95.82% | 0.0877 | 0.5477 | 84.08%
FeatureAlign + SSIM | 98.85% | 98.71% | 0.0599 | 0.4619 | 86.67%
Fixed distillation weights | 100.00% | 100.00% | 0.0183 | 0.4595 | 87.38%
Without label smoothing | 100.00% | 100.00% | 0.0377 | 0.4038 | 88.85%
Basic cosine annealing | 98.88% | 98.67% | 0.0740 | 0.3795 | 89.09%
Complete model | 100.00% | 100.00% | 0.0291 | 0.3287 | 90.90%
Table 5 shows that the complete model integrates FeatureAlign feature alignment, label smoothing, MSE, and structural losses (SSIM and cosine similarity), and achieves the best results on the supervised indicators, with the highest accuracy (100.00%) and F1 value (100.00%) and a low loss (0.0291). At the same time, it shows significant advantages in the unsupervised evaluation: the average entropy is the lowest (0.3287), indicating that the model output is more stable and the distribution more concentrated, and the average confidence is the highest (90.90%), reflecting highly confident predictions of the sample categories.
In contrast, removing any structural constraint (such as SSIM or cosine similarity) degrades model performance to varying degrees. For example, the FeatureAlign + MSE combination can still maintain a high accuracy (98.88%), but its average entropy rises to 0.4733 and its average confidence drops to 86.70%, indicating that the lack of structural perception increases output uncertainty; when SSIM or cosine constraints are used alone, the F1 value and average confidence cannot reach those of the complete model. Likewise, removing the FeatureAlign module reduces F1 to 96.45%, further verifying the importance of feature alignment as a bridge for intermediate representation transfer.
In addition, for the experimental groups with fixed distillation weights and without label smoothing, although the supervised indicators remain high (Acc/F1 = 100%), the unsupervised indicators weaken noticeably (average entropies of 0.4595 and 0.4038, respectively), indicating that the model is less robust when facing uncertain samples or potential interference information.
In summary, the complete model significantly outperforms all other ablation combinations
in all key indicators and has strong generalization ability in small sample datasets.
4.5 Comparison of Unsupervised Indicators
We compare the average confidence and average entropy obtained when the distillation model and the standalone ResNet-18 model process the electrical cable melting dataset:
Fig. 4. Comparison of the average entropy of the distillation model and the ResNet-18
model after classifying electrical cable images
Figure 4 shows the average entropy of the model predictions. In the left chart (distillation model), the average entropy of Class 0 is 0.4388, Class 1 is 0.2958, and Class 2 is 0.3320, with an overall average of 0.3287, indicating that prediction uncertainty differs across categories. In the right chart (ResNet-18), the average entropy of Class 0 is 1.1421, Class 1 is 0.9689, and Class 2 is 1.3208, with an overall average of 1.0993. The per-category differences in average entropy between the two models show that the distillation model has stronger prediction stability and lower uncertainty than ResNet-18 across categories.
Fig. 5. Comparison of average confidence between the distillation model and the ResNet-18
model after classifying electrical cable images
In Figure 5, the dashed line at 90.90% in the left box plot marks the average confidence of the distillation model, reflecting its confidence distribution and the skewness of the data. In the right box plot, the average is marked at 67.75%, measuring the average confidence level of the ResNet-18 model. From the box positions and interquartile ranges, it can be seen intuitively that the distillation model's average confidence in predicting the electrical cable melting data is more reliable.
4.6 Comparison of Supervised Indicators
Fig. 6. Comparison of Loss\Accuracy\F1\Recall Score between distillation model and
ResNet-18 model
Fig. 7. Receiver Operating Characteristic (ROC) Curve
Fig. 8. Confusion Matrix of distillation model
Figure 6 shows the comparison results of the distillation model and the original ResNet-18 model in terms of loss, accuracy, and F1 score in the training and validation stages.
The first image in each set of comparison images (with a dotted line in the background)
corresponds to the training results of the distillation model, from which the performance
improvement brought by the distillation strategy can be clearly observed. In the first
set of comparison graphs, the distillation model converges quickly at the beginning
of training and maintains a low and stable validation loss throughout the training
process. The final test set loss is significantly lower than that of the ResNet-18
model (0.0291 vs. 0.4417), reflecting the significant improvement of the distillation
model in feature fitting and generalization capabilities. In the second set of comparison
graphs, the distillation model achieved near-saturation accuracy performance at an
earlier stage of training, and achieved better accuracy performance than ResNet-18
on both the validation set and the test set (test accuracy reached 1.0000), indicating
that distillation learning effectively alleviated the underfitting problem of the
original model and improved the model's discrimination ability. The third set of comparison
graphs further verified the optimization of the overall performance of the model by
the distillation strategy. The distilled model showed higher consistency and stability
during the training process, and the F1 score quickly reached and maintained a level
close to 1.0000, which was much better than the test set performance of the original
model (1.0000 vs. 0.9720), indicating that it has stronger classification robustness
in the case of unbalanced samples or blurred boundaries.
Figure 7 shows the ROC curves. After distillation, the student network successfully learned
the distinguishing features of different categories while maintaining performance
close to that of the teacher network. The model's ability to identify categories 1,
2, and 3 is almost perfect, especially with AUC = 1.0, indicating almost no false
positives or false negatives.
Figure 8 shows the confusion matrix, indicating that the distillation model almost perfectly
classifies each category. Even with few-sample training, the student network can stably
learn the features of each category through distillation by the teacher network.
In summary, the distillation model on the left side of the figure significantly outperforms
the undistilled ResNet-18 model in terms of loss convergence speed, accuracy improvement,
and F1 score. Furthermore, the strong results on the ROC curves and the confusion matrix support the reliability of these findings. This fully verifies the effectiveness
of the proposed teacher-student network structure and multi-loss fusion distillation
strategy in enhancing the model's generalization ability and classification performance.
5. Conclusion
In this paper, a heterogeneous model distillation framework is designed to solve the
problem of limited performance of lightweight models in the classification of small
sample electrical cable melting images. By using the highly expressive U-Net3+ as the teacher network and transferring its intermediate-layer information to the structurally different, lightweight ResNet-18 student network, the generalization ability of the lightweight model on complex electrical images is effectively improved with only 117 training images. A multi-level, multi-loss fusion knowledge distillation framework
is constructed. The main innovations include: a multi-scale intermediate-layer feature distillation design with an intermediate feature alignment module; a multi-loss fusion feature distillation mechanism; a soft-label optimization strategy based on label smoothing; joint scheduling of a dynamic learning rate strategy and an early stopping mechanism, which automatically selects the optimal weights and avoids overfitting; and a multi-dimensional visual evaluation and diagnosis mechanism that introduces both supervised and unsupervised indicators. In addition, comparative experiments and ablation experiments illustrate the rationality and credibility of the proposed method.
In summary, this paper makes systematic improvements and integrations across multiple dimensions, including distillation structure design, loss function fusion, training scheduling, and the evaluation system, which significantly improve the performance and generalization ability of the lightweight student network on the small-sample electrical cable melting dataset under complex conditions.