
  1. (Dept. of IT Applied Engineering, Jeonbuk National University, Jeonju, Republic of Korea.)
  2. (Dept. of Energy Engineering, Jeonbuk National University, Jeonju, Republic of Korea.)



Keywords: Electrical cable melting images, Small sample dataset, Knowledge distillation, Heterogeneous model, Multi-loss fusion

1. Introduction

Because the causes of electrical accidents are complex, it is usually difficult to trace them through electrical accident images, and even when professionals examine the images, the analysis still takes a long time[1]. In recent years, with the emergence of new techniques in computer vision and deep learning, vision-based defect recognition and detection have made great progress. However, owing to high structural complexity, small sample sizes, and strong noise interference, it remains challenging to analyze these images accurately with deep learning models[8].

Knowledge transfer technology[9] trains a model on one dataset or task and reuses the learned knowledge to improve performance on a related target task. Knowledge distillation offers greater flexibility in adjusting the student network architecture: it allows the student network to learn from different types of teacher networks according to the requirements of the target task, and it has therefore been regarded as a promising solution. Traditional distillation schemes usually use two isomorphic models, but they do not perform well when the two models have heterogeneous architectures.

This paper proposes a structure in which a semantic segmentation teacher network (U-Net3+[18]) guides a classification student network (ResNet-18[7]). The model's ability to perceive the structure of small-sample electrical melting images is improved by distilling intermediate-layer features, thereby improving classification accuracy and prediction confidence. The dataset used in our experiments is very small, with only 117 training samples and an uneven class distribution. U-Net3+ is used as the teacher network and ResNet-18 as the student network. A feature alignment module[2] is introduced at the intermediate layer[20], and a composite distillation loss function[13] is designed, adopting a feature learning method[17] to enhance visual structure preservation. In addition, we use label smoothing to regularize the output-level distillation, and we design a dynamic weighting scheme that gradually increases the influence of distillation as training progresses. A cosine-annealing learning rate schedule and an early stopping strategy are used to ensure training stability[14].

The reliability of the model's predictions for unlabeled data is explored by measuring the average confidence and average entropy on a small-scale image dataset. The results show that, under the conditions of a small sample of electrical cable melting images and heterogeneous teacher and student networks, choosing an appropriate model and distillation scheme can still maintain high reliability, which verifies the effectiveness and potential of the proposed method.

2. Related Work

2.1 Dataset and Evaluation Metrics

In the analysis of electrical fire accidents, the study of cable melting morphology is of great significance for determining the cause of the accident. However, the melting morphologies caused by different causes are highly similar in appearance, which increases the challenge of the classification task.

The dataset constructed in this study covers three typical types of electrical cable melting morphologies: (1) melting due to short circuit, (2) melting caused by fire-induced short circuit, and (3) melting under the direct action of high flame temperature. It contains a total of 117 images: 55 images of wire melting caused by combustion, 32 images of wire melting caused by short circuit, and 29 images of wire melting caused by fire-induced short circuit. As shown in Figure 1, short-circuit melting and fire-induced short-circuit melting are often accompanied by metal beads and surface carbonization, while cable deformation caused by fire may also present a morphology similar to arc burning, making it difficult to achieve stable classification with traditional image discrimination methods on small-sample datasets.

The first type is the primary short-circuit mark. When electrical circuits are subjected to external physical forces, the sheath temperature rises to 2000℃~3000℃ and a short circuit occurs, causing some of the metal fragments to disintegrate and leaving a metallic mesh-like scar over time.

The second type involves electrical wires that, while energized, lose their protective coating due to the high temperature of a fire. When these wires come into contact with other wires, the resulting short circuit is strongly affected by the intensity of the flame: the copper softens under the heat, the strands sag, and short-circuit marks are formed. These marks have a less visible sheen than primary short-circuit marks and may retain thread-like structures; because they carry only basic markings, they are difficult to distinguish with the naked eye.

The third type is the copper wire melting mark caused by the heat of a fire when the power is off. The melted area is large, and the shape differs from both primary and secondary marks. Because these marks are not directly caused by an electrical fault, they are easily distinguishable with the naked eye[21].

그림 1. 전기 케이블 녹는 이미지

Fig. 1. Electrical cable melting image

../../Resources/kiee/KIEE.2026.75.1.204/fig1.png

To further enhance the scientific rigor and credibility of model evaluation, this paper introduces average confidence and average entropy as unsupervised indicators for assessing the predictions. Related studies have shown that the probability distribution characteristics reflected by average confidence and average entropy capture, to a certain extent, the model's subjective confidence and its ability to express uncertainty in sample prediction, which helps to verify the generalization ability of the proposed distillation model in unsupervised scenarios.

Assuming that the Softmax probability vector output by the model for sample i is p_i = [p_{i,1}, p_{i,2}, ..., p_{i,K}], where K is the total number of categories and N is the number of samples, the average confidence is calculated as in (1):

(1)
$$ Average\; Confidence=\dfrac{1}{N}\sum_{i=1}^{N}\max_{k}\; p_{i,\: k} $$

This indicator reflects the strongest confidence value of the model prediction. The higher the value, the more concentrated and certain the model output is. The calculation method of average entropy is as in (2):

(2)
$$ Average\; Entropy=-\dfrac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K}p_{i,\: k}\cdot\log(p_{i,\: k}) $$

Entropy reflects the uncertainty of the model's output distribution: the lower the value, the more certain and reliable the model prediction. Here log denotes the natural logarithm, and p_{i,k} represents the predicted probability of the i-th sample for the k-th category.

The average confidence measures the strongest category preference of the model under unsupervised input, which is suitable for quantifying whether the model "dares to make a clear judgment"; while the average entropy characterizes the overall uncertainty of the model output from the perspective of information entropy, which helps to discover potential fuzzy decision areas. The combination of the two can fully reflect the stability of the model on unknown domain samples, as an evaluation method complementary to traditional supervised indicators such as accuracy and F1-score.
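For clarity, Eqs. (1) and (2) can be computed from the Softmax outputs with a short NumPy routine such as the sketch below (probs is assumed to be an N×K array of per-sample probabilities; the function name is illustrative):

import numpy as np

def confidence_and_entropy(probs, eps=1e-12):
    # probs: (N, K) array of Softmax probabilities, one row per sample
    probs = np.asarray(probs, dtype=np.float64)
    avg_confidence = probs.max(axis=1).mean()              # Eq. (1): mean of per-sample max probability
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)   # per-sample entropy (natural log)
    avg_entropy = entropy.mean()                           # Eq. (2)
    return avg_confidence, avg_entropy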

2.2 Model Selection

U-Net3+ is a structure-enhanced image segmentation model and an improved version of the U-Net family. By introducing full-scale skip connections and a deep decoder structure, it significantly improves the perception of boundaries and fine-grained structures. Unlike the original U-Net, U-Net3+ fuses feature maps from multiple scales in the decoder, which can more effectively capture hot-spot textures, bifurcated edges, and complex backgrounds in electrical images. U-Net3+ was selected as the teacher network, on the one hand because its strong representation ability has been widely verified in scenarios such as industrial image segmentation, and on the other hand because its clear encoder structure facilitates cross-scale intermediate-layer alignment with the ResNet-18 student network.

ResNet-18 is one of the classic lightweight residual networks. It has a compact structure, containing only 18 convolutional layers organized into residual modules, and is suitable for lightweight deployment scenarios. ResNet-18 performs stably in small-sample tasks and has high inference efficiency, so it was selected as the student network in this study; it also transfers knowledge efficiently when receiving distillation from the teacher network. Although its parameter count is significantly smaller than that of U-Net3+, with the intermediate-layer feature distillation strategy proposed in this study, ResNet-18 can still achieve better performance than the original model in complex electrical image tasks.

2.3 Heterogeneous Feature Middle Layer Strategy

Knowledge distillation was proposed by Hinton et al. in 2015 [11]. The goal is to transfer knowledge from a more powerful and complex teacher network to a smaller, more compact student network by matching selected outputs. Some research focuses on logit-based methods to enhance student predictions, which in turn rely on model isomorphism to improve distillation results; however, if the student and teacher networks share an identical architecture, overfitting may occur [3]. TAKD proposes to reduce the gap between teacher and student by inserting assistant models of intermediate size, and DGKD further improves TAKD by aggregating all intermediate models to strengthen the guidance of the student [19]. Another solution is the reverse distillation architecture [3], in which the features extracted by the teacher network are first compressed and then decoded by the student network. However, relatively few studies explore and utilize the intermediate layers, which often contain rich information. Because of the structural differences between models, this paper uses a lightweight intermediate feature alignment module to resolve the inconsistency between the student's and teacher's intermediate-layer features in semantic space and channel dimension under heterogeneous architectures. This module requires no additional supervision signals and automatically adjusts the expression of the student features during training to better approximate the distribution of the teacher features.

2.4 Dynamic Distillation Weight Adjustment Strategy

Many knowledge distillation methods use fixed distillation loss weights, but this static weighting strategy has limitations. In particular, in the early stages of training the student network has not yet established a stable representation; if it relies too heavily on the teacher signal, it may be prevented from learning and optimizing its own feature space, and may even suffer convergence difficulties or performance degradation[10].

To this end, this paper uses a dynamic distillation weight adjustment strategy based on training progress, which aims to adaptively adjust the guidance strength of the teacher signal at different stages. Reducing the distillation loss weight in the early stages of training helps the student network to converge stably; gradually strengthening the distillation effect in the later stages can more fully guide feature migration and knowledge absorption. This strategy not only improves the final performance of the model, but also enhances its generalization ability and training stability.

2.5 Distillation Loss Design and Output-Level Regularization

For the knowledge distillation task, this paper adopts a fusion design of three distillation losses: mean square error (MSE), cosine similarity, and structural similarity index (SSIM). MSE loss is used to measure the absolute error between the output of the student network and the teacher network to ensure the consistency of the numerical level; cosine similarity focuses on the directional consistency of the feature vectors of the two, which helps to capture the similarity of high-level semantic features; and SSIM introduces the constraint of structural information [15], further enhancing the model's ability to perceive image structural details. This multi-dimensional loss fusion design not only improves the distillation effect, but also makes the student network perform better in multi-level feature alignment.

The output probability distribution of the teacher network in knowledge distillation not only reflects the probability of the target category but also encodes the similarity structure between categories, providing the student network with richer and more detailed supervision than traditional hard labels. However, the teacher network's predicted distribution is usually highly confident: the probability of the target category is close to 1 while the probabilities of the non-target categories are close to 0. Such a sharp probability distribution may cause the student network to overfit during training, limiting the generalization performance of the model.

To alleviate these problems, label smoothing is introduced into the distillation paradigm. This technique moderates extreme output probabilities and suppresses overconfidence by reducing the label probability of the target category from 1 to 1−ε and distributing the remaining probability evenly among the other categories. It provides a softer, regularized supervision signal for the student network, promoting better generalization to the data distribution[4].

2.6 Training Strategy and Optimization Techniques

In deep neural network training, the learning rate scheduling strategy has an important impact on convergence speed and final performance. Traditional schedules mostly use fixed step decay or exponential decay, which can cause a sudden drop in learning rate during training and lead to optimization oscillation. This paper adopts a cosine annealing learning rate schedule that integrates a warm-up phase. A smooth transition of the learning rate is achieved through a piecewise function, which shows excellent optimization performance in the feature distillation experiments of this paper and is suitable for distillation scenarios that require fine feature alignment.

To effectively prevent the model from overfitting during training and to improve training efficiency, this paper introduces an early stopping mechanism. Training is stopped automatically when the validation performance fails to improve for a set number of consecutive epochs, thereby avoiding overfitting to the training set and the resulting performance degradation on the test set.
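As an illustration of this mechanism, a minimal early-stopping helper might look like the sketch below; the patience value and the monitored quantity (validation loss) are assumptions, since the paper does not specify them:

import copy

class EarlyStopping:
    # Stop training when the monitored validation loss stops improving.
    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience        # epochs to wait without improvement
        self.min_delta = min_delta      # minimum change that counts as improvement
        self.best_loss = float("inf")
        self.counter = 0
        self.best_state = None          # snapshot of the best model weights

    def step(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            self.best_state = copy.deepcopy(model.state_dict())  # keep best weights
            return False                # keep training
        self.counter += 1
        return self.counter >= self.patience  # True -> stop training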

3. Proposed Method

3.1 Overall Framework

This paper proposes a teacher-student network distillation framework based on intermediate-layer feature alignment; its overall structure is shown in Fig. 2. The framework consists of two parallel information processing paths, corresponding to the teacher network and the student network. The teacher network adopts the complex U-Net3+ architecture, with strong representation and multi-scale feature extraction ability, while the student network adopts the lightweight ResNet-18.

The input image is passed to the teacher and student networks at the same time. The teacher network keeps the parameters frozen during the inference process, and its output intermediate layer features are used to guide the learning of the student network. In view of the structural heterogeneity of the teacher and student networks, a feature alignment module is introduced to narrow the differences between the two in channel dimension and feature space distribution. The aligned student features are matched with the teacher features at two levels on the intermediate layer. By designing a sophisticated loss function, the student network is guided to gradually approach the teacher network in feature expression. While ensuring the model compression effect, this framework significantly improves the expression ability and generalization performance of the student network.

그림 2. 증류 실험 흐름도

Fig. 2. Distillation experiment flow chart

../../Resources/kiee/KIEE.2026.75.1.204/fig2.png

3.2 Total Loss Function

The total loss function during the entire training process is defined as in (3):

(3)
$$ L_{total}= L_{ce}+ \lambda(t)\times L_{distill} $$

λ(t) is a dynamic weight that gradually increases with the training epoch in the form of a quadratic function, so that the model focuses on label supervision in the early stage of training and gradually strengthens the transfer of teacher knowledge later on. It is defined as in (4):

(4)
$$ \lambda(t)= 0.1 +(0.5 - 0.1)\times\left(\dfrac{epoch}{epochs}\right)^{2} $$

The squared form (epoch/epochs)² produces a nonlinear weighting curve (slow start, late acceleration), which is a reasonable and effective learning schedule, as shown in Figure 3.

그림 3. 동적 증류 중량 그래프

Fig. 3. Dynamic Distillation Weight Graph

../../Resources/kiee/KIEE.2026.75.1.204/fig3.png

The initial distillation loss weight is small (0.1) to prevent the student network from relying on the teacher's features too early and to preserve a degree of autonomous learning. The weight gradually increases to 0.5 in the later stages; as the student's ability improves, the teacher's intermediate features are exploited more fully.

L_ce denotes the cross-entropy loss, which measures the difference between the model's predicted distribution and the true label distribution. Through learning the intermediate features, the student network obtains the semantic information of the teacher network more stably, significantly improving overall performance.
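For reference, Eqs. (3) and (4) reduce to a few lines of Python; loss_ce and loss_distill below stand for the cross-entropy and feature-distillation terms computed elsewhere, and the names are illustrative:

def distill_weight(epoch, total_epochs, lam_min=0.1, lam_max=0.5):
    # Eq. (4): quadratic ramp-up of the distillation weight from 0.1 to 0.5
    return lam_min + (lam_max - lam_min) * (epoch / total_epochs) ** 2

# Eq. (3): total loss for the current epoch
# total_loss = loss_ce + distill_weight(epoch, total_epochs) * loss_distill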

3.3 Feature Alignment Module

This paper uses a lightweight feature alignment module. This lightweight convolution module automatically learns the feature channel transformation, avoiding manual dimension adjustment and improving versatility and flexibility. Through two convolution layers and a normalization operation, it automatically maps the student's intermediate-layer features to the channel dimension that matches the teacher's features, making the subsequent distillation loss calculation more efficient and stable. Compared with matching the number of channels directly by hand, this design is more flexible and general and suits distillation tasks with different architecture combinations. Two convolution layers are used to automatically align the intermediate feature channel dimensions of the teacher and student networks. The transformation is as in (5):

(5)
$$ Conv2d(k=5)\rightarrow BatchNorm\rightarrow ReLU\rightarrow Conv2d(k=1) $$

Conv2d(kernel_size=5) applies a 5×5 convolution kernel to the input feature map. The 5×5 kernel captures a wider range of spatial information and local context, and extracts high-level semantic information from the input features through multiple kernels. Batch Normalization (BN) normalizes the output of the convolution layer: by normalizing the input distribution and reducing internal covariate shift, it accelerates convergence, provides a slight regularization effect that helps prevent overfitting[6], and allows a larger learning rate, enhancing the robustness and stability of the model. ReLU (Rectified Linear Unit) applies the nonlinear activation f(x) = max(0, x): it introduces nonlinearity so the network can learn complex mappings, produces sparse activations by setting negative values to 0, is simple to compute, and alleviates the vanishing-gradient problem.

Conv2d(kernel_size=1) applies a 1×1 convolution to the feature map point by point. It mixes and re-weights channels and adjusts dimensionality: by changing the number of output channels it compresses or expands the feature map, fusing information across channels without changing the spatial dimensions. The parameter count and computation of the 1×1 convolution are small, reducing computational complexity.
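The transformation in Eq. (5) corresponds to a small PyTorch module along the lines of the sketch below; the padding of the 5×5 convolution and the example channel counts are assumptions made to keep the spatial size unchanged, not values taken from the paper:

import torch
import torch.nn as nn

class FeatureAlign(nn.Module):
    # Maps student features to the teacher's channel dimension, as in Eq. (5).
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.align = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=5, padding=2),  # 5x5 conv for local context
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=1),            # 1x1 conv for channel mixing
        )

    def forward(self, student_feat):
        return self.align(student_feat)

# Example: align a 128-channel student feature map to 320 teacher channels
# aligned = FeatureAlign(128, 320)(torch.randn(4, 128, 28, 28))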

3.4 Feature-level Distillation Based on Multi-loss Fusion

To effectively improve the lightweight student network's ability to learn the intermediate features of the teacher network, this paper designs a feature-level distillation method based on multi-loss fusion. Three loss functions supervise the alignment of the intermediate-layer features: mean square error (MSE), cosine similarity, and the structural similarity index (SSIM). MSE measures the pixel-level error, cosine similarity focuses on the directional consistency of the feature vectors, and SSIM captures structural information, thereby improving the semantic preservation of the features. The multi-loss fusion feature distillation loss is given in (6):

(6)
$$ L_{distill}=\sum_{i}\left(L_{MSE}^{(i)}+\beta\times L_{cos}^{(i)}+\gamma\times L_{SSIM}^{(i)}\right) $$

In this experiment β is set to 0.5 and γ to 0.1; they are the weight hyperparameters of the cosine similarity and SSIM losses, respectively. This setting is based on the following considerations: the dominant term is MSE plus cosine similarity, and cosine similarity effectively captures the directionality of the feature distribution, whose effect is usually stronger than that of structural information, so it is given the higher weight (0.5). SSIM serves as a structural compensation term: because SSIM has a large impact on the gradient, too high a weight may destabilize optimization, so it is set to a smaller value (0.1) to provide moderate perceptual regularization without interfering with the main loss. This parameter combination has been verified to have good robustness and generalization ability; in the experiments, β = 0.5 and γ = 0.1 showed stable convergence and performance improvement, as detailed in the ablation study of Table 5.
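A minimal sketch of the fused loss in Eq. (6) is given below. The SSIM term is assumed to come from a third-party implementation (for example pytorch_msssim) passed in as ssim_fn, and the feature pairs are assumed to be already channel-aligned by the alignment module:

import torch
import torch.nn.functional as F

def feature_distill_loss(student_feats, teacher_feats, ssim_fn, beta=0.5, gamma=0.1):
    # Eq. (6): sum of MSE + beta*cosine + gamma*SSIM losses over aligned feature pairs
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):        # (B, C, H, W) tensors, channel-aligned
        mse = F.mse_loss(s, t)
        cos = 1.0 - F.cosine_similarity(s.flatten(1), t.flatten(1), dim=1).mean()
        ssim_loss = 1.0 - ssim_fn(s, t)                   # ssim_fn assumed to return similarity in [0, 1]
        total = total + mse + beta * cos + gamma * ssim_loss
    return total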

3.5 Output-level Distillation with Label Smoothing

This paper introduces label smoothing, which alleviates the model's overconfidence in a particular category by smoothing the true label distribution and effectively improves the robustness of the model on the validation set. The label smoothing loss can be expressed as in (7):

(7)
$$ L_{LS}=-\sum_{i=1}^{K}\hat{y_{i}}\log p_{i} $$

Where pi represents the probability of the softmax output class i, and $\hat{y_{i}}$ is the target distribution after label smoothing, which is defined as in (8):

(8)
$$ \hat{y_{i}}=\begin{cases}1-\epsilon, & \text{if }i\text{ is the target class}\\[4pt]\dfrac{\epsilon}{K-1}, & \text{otherwise}\end{cases} $$

Among them, ε ∈ [0,1] is a smoothing parameter, which is set to 0.1 in this paper, and K is the total number of categories. Through this process, the model will not be overconfident about a certain category, which helps to improve generalization ability and reduce overfitting.
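In PyTorch (version 1.10 or later), the label-smoothed cross entropy of Eqs. (7)–(8) is available directly through the built-in loss; the sketch below shows its use with ε = 0.1 and K = 3, with random tensors standing in for real batches:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # epsilon = 0.1 as in the paper

logits = torch.randn(16, 3)             # batch of 16 samples, K = 3 classes
targets = torch.randint(0, 3, (16,))    # hard integer labels
loss = criterion(logits, targets)       # Eq. (7) with the smoothed targets of Eq. (8)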

3.6 Learning Rate Dynamic Scheduling Strategy Based on Warm-up Cosine Annealing

This paper introduces a dynamic learning rate scheduling strategy that combines a warm-up mechanism[16] with cosine annealing. By adjusting the learning rate in stages, this strategy makes the training process smoother, and it is especially suitable for tasks such as feature distillation that place high demands on convergence accuracy. The strategy consists of two stages:

1. Linear warm-up stage: In the early stage of training, in order to avoid sudden large gradient updates leading to parameter oscillation, the learning rate is gradually increased from 0 to the initial learning rate lr₀ in a linear increment. Specifically, in the first w training cycles, the learning rate is updated as in (9):

(9)
$$ lr(t)= lr_{0}×(t / w) $$

Where 0 ≤ t < w, t represents the current training round, and w represents the number of warm-up rounds. In this stage, the learning rate is slowly increased to make the model more stably adapt to the gradient changes at the beginning of training.

2. Cosine annealing stage: When the number of training rounds reaches or exceeds the number of warm-up rounds w, the model enters the cosine annealing stage. In this stage, the learning rate slowly decays to the set minimum value η_min according to the cosine function. The specific expression is as in (10):

(10)
$$ lr(t)=\eta_{\min}+0.5\times(lr_{0}-\eta_{\min})\times\left(1+\cos\left(\dfrac{t-w}{T-w}\cdot\pi\right)\right) $$

Where w ≤ t ≤ T, T is the total number of training cycles. This decay strategy is smoother than the traditional step-by-step descent method, which helps the model converge to a better solution in the later stage of training and improves the generalization ability on unseen samples.

In this study, for the feature distillation task, the following specific parameter configurations are used: warm-up rounds w = 10, total training cycle T = 100, initial learning rate lr₀ = 1e-4, minimum learning rate η_min=0. By integrating this strategy with the custom scheduling module, the current learning rate is automatically updated after each round of training to ensure the dynamic adaptability of the training process.

The core logic of this strategy can be expressed as the following Python sketch, which implements Eqs. (9)–(10) and writes the computed learning rate directly into the optimizer's parameter groups:

import math

T, w = 100, 10               # total training epochs, warm-up epochs
lr_0, eta_min = 1e-4, 0.0    # initial and minimum learning rates

def compute_learning_rate(t):
    # Warm-up cosine annealing schedule, Eqs. (9)-(10)
    if t < w:
        return lr_0 * (t / w)                     # linear warm-up phase
    progress = (t - w) / (T - w)                  # cosine annealing decay phase
    return eta_min + 0.5 * (lr_0 - eta_min) * (1 + math.cos(math.pi * progress))

# Main training loop (optimizer, model, and train_one_epoch come from the existing training code)
for t in range(T):
    lr = compute_learning_rate(t)
    for group in optimizer.param_groups:          # update the optimizer learning rate
        group["lr"] = lr
    train_one_epoch(model, train_loader, optimizer)   # forward/backward pass and optimizer.step()

To summarize, the proposed method consists of a U-Net3+ teacher and a ResNet-18 student network. The student learns from the teacher via intermediate feature alignment using the lightweight feature alignment module, guided by a multi-loss fusion strategy. The training process is regularized by label smoothing, enhanced by a dynamic loss weighting mechanism, and optimized with a warm-up cosine learning rate scheduler. Together, these components enable stable and efficient distillation under small-sample constraints.

4. Experimental Study and Results

The experiments in this paper are divided into two stages: first, supervised indicators are used for evaluation during the model training stage; second, the trained model classifies a set of unlabeled electrical cable melting images and is evaluated with unsupervised indicators. The supervised and unsupervised results of the two stages complement each other to verify the feasibility of the proposed solution.

4.1 Experimental Parameters, Environment, and Dataset

The environment of this experiment is shown in Table 1:

표 1. 실험 환경 매개변수 표

Table 1. Experimental environment parameter table

Operating system | Windows 10 Pro (64-bit)
Central Processing Unit (CPU) | AMD Ryzen 7 9800X3D
Graphics Processing Unit (GPU) | NVIDIA GeForce RTX 5080
Memory | 16 GB × 2 (dual channel, 32 GB)
Software environment | Programming language: Python 3.10

The complete experimental parameters are shown in Table 2.

표 2. 실험 매개변수

Table 2. Experimental parameters

Parameter category | Setting value
Optimizer | AdamW
Initial learning rate | 1 × 10⁻⁴
Weight decay (L2 regularization) | 5 × 10⁻⁴
Loss function combination | Label smoothing cross entropy + distillation loss
Distillation loss weight λ(t) | Increases gradually from 0.1 to 0.5
Cosine similarity loss coefficient β | 0.5
Structural similarity loss coefficient γ | 0.1
Input image size | 224 × 224
Batch size | 16
Number of training epochs | 100
Learning rate scheduler | Warm-up + cosine annealing dynamic learning rate scheduling

This paper introduces a variety of image augmentation strategies during the training phase. These augmentations are designed to simulate the changes that may occur in images in practical applications (such as rotation, flipping, illumination changes, and blur), thereby increasing the diversity of the training data, alleviating overfitting, and improving the model's adaptability in complex environments. Table 3 summarizes the augmentation methods used and their main parameters (a torchvision sketch follows the table).

표 3. 실험에 사용된 데이터 향상 전략 표

Table 3. Table of data enhancement strategies used in the experiment

Enhancement method | PyTorch operation | Main parameters
Random rotation | transforms.RandomRotation(15) | Rotation angle range: ±15°
Random horizontal flip | transforms.RandomHorizontalFlip() | Flip probability: 0.5
Random vertical flip | transforms.RandomVerticalFlip() | Flip probability: 0.5
Color jitter | transforms.ColorJitter(...) | Brightness/contrast/saturation: ±0.15, hue: ±0.08
Random cropping | transforms.RandomCrop(224, padding=4) | Crop size: 224×224, padding: 4 pixels
Gaussian blur | transforms.GaussianBlur(3, (0.1, 1.5)) | Kernel size: 3×3, σ ∈ [0.1, 1.5]
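As a concrete reference, the augmentations in Table 3 can be assembled into a torchvision pipeline roughly as in the sketch below; the composition order and the final ToTensor step are assumptions rather than details stated in the paper:

from torchvision import transforms

# Training-time augmentation pipeline matching Table 3
train_transform = transforms.Compose([
    transforms.RandomRotation(15),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.15, contrast=0.15, saturation=0.15, hue=0.08),
    transforms.RandomCrop(224, padding=4),
    transforms.GaussianBlur(3, sigma=(0.1, 1.5)),
    transforms.ToTensor(),
])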

4.2 Construction of the Teacher Network U-Net3+

Data source and label format: This study uses the image dataset after data augmentation. The mask is a single-channel label map with integer values 0–3.

Data augmentation strategy: The data augmentation library implements the image augmentation strategies shown in Table 3, which significantly improve the diversity and robustness of the training samples. The expanded training data is about four times the original data.

Preprocessing process (a code sketch follows this list):

1. Use an adaptive Gaussian weighted-average threshold (block size 11, offset constant C = 2) to segment suspected defect areas;

2. Eliminate isolated noise through a morphological opening operation (kernel size 3×3), combined with contour-area filtering (retain regions with area > 100 pixels);

3. Convert the color mask into a single-channel semantic map.
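The first two preprocessing steps map directly onto standard OpenCV calls; a minimal sketch is shown below (the color-mask-to-semantic-map conversion of step 3 is omitted because it depends on the chosen color coding, and the function name is illustrative):

import cv2
import numpy as np

def preprocess_mask(gray_image):
    # Step 1: adaptive Gaussian threshold, block size 11, offset constant C = 2
    binary = cv2.adaptiveThreshold(gray_image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 11, 2)
    # Step 2: morphological opening with a 3x3 kernel to remove isolated noise
    kernel = np.ones((3, 3), np.uint8)
    opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
    # Keep only contours whose area exceeds 100 pixels
    contours, _ = cv2.findContours(opened, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    mask = np.zeros_like(opened)
    for c in contours:
        if cv2.contourArea(c) > 100:
            cv2.drawContours(mask, [c], -1, 255, thickness=-1)
    return mask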

Model structure: The U-Net3+ architecture is used as the segmentation model, which has multi-scale feature aggregation paths and dense skip connections, significantly enhancing the ability to characterize complex structural defects. The model accepts three-channel RGB input images and outputs single-channel pixel-level segmentation prediction maps.

Training parameter configuration (a training-loop sketch follows this list):

Optimizer: Adam, initial learning rate set to 1×10⁻³

Loss function: BCEWithLogitsLoss, which integrates the Sigmoid activation with binary cross entropy to improve numerical stability[5]

Batch size: 4

Total number of training rounds: 10

Computing platform: GPU is preferred on devices with CUDA support; otherwise the CPU is used.
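Under these settings, the teacher training loop reduces to a few lines; the sketch below is illustrative only, and UNet3Plus, seg_loader, and the checkpoint file name are assumed placeholders rather than the paper's actual code:

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher = UNet3Plus(in_channels=3, out_channels=1).to(device)  # hypothetical U-Net3+ implementation
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(teacher.parameters(), lr=1e-3)

best_loss = float("inf")
for epoch in range(10):                       # total training rounds: 10
    epoch_loss = 0.0
    for images, masks in seg_loader:          # seg_loader: DataLoader with batch size 4 (assumed)
        images, masks = images.to(device), masks.to(device).float()
        optimizer.zero_grad()
        loss = criterion(teacher(images), masks)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    if epoch_loss < best_loss:                # keep the best weights for later distillation
        best_loss = epoch_loss
        torch.save(teacher.state_dict(), "teacher_unet3plus.pth")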

Semantic mapping design: The color mask is uniformly converted into a single-channel label map, which simplifies the semantic mapping logic and improves the computational efficiency, while facilitating the subsequent alignment of intermediate features with the student network.

Model structure advantages: The U-Net3+ architecture enhances its responsiveness to small targets and locally blurred areas through deep multi-scale fusion and full-scale skip connections, and is particularly suitable for image segmentation tasks with complex structures and diverse defect types.

After training is complete, the optimal model is saved as a .pth file. This model serves as the teacher network in the distillation process, providing semantically rich intermediate-layer supervision for the student network (ResNet-18), which helps improve feature expression and generalization in downstream tasks.

4.3 Baseline and Comparison

To verify the effectiveness of the feature distillation framework proposed in this paper, we built multiple sets of baseline models as performance references. All models were trained under the same dataset partitioning (70% for training, 15% for validation, and 15% for testing) and data augmentation strategies, including:

Baseline model 1: ResNet-18 model.

Baseline model 2: Student network is replaced with ShuffleNetV2.

Baseline model 3: Student network is replaced with MobileNetV3.

표 4. 기준 모델 데이터

Table 4. Baseline model data

Type | Acc | F1 | Loss | AE | AC
ResNet-18 | 96.82% | 97.65% | 0.4275 | 1.0211 | 72.18%
Student network: ShuffleNetV2 | 98.88% | 98.67% | 0.0435 | 1.2442 | 51.76%
Student network: MobileNetV3 | 98.88% | 98.41% | 0.0766 | 1.0957 | 60.83%
Complete model | 100.00% | 100.00% | 0.0291 | 0.3287 | 90.90%

Note: Acc = Accuracy, AE = Average Entropy, AC = Average Confidence. Acc, F1, and Loss are supervised indicators; AE and AC are unsupervised indicators.
Table 4 shows that the original ResNet-18 model, trained without distillation guidance, achieves 96.82% accuracy and a 97.65% F1 value on the test set, but its loss is relatively high (0.4275), its average entropy is 1.0211, and its average output confidence is 72.18%. After replacing the student network with ShuffleNetV2, although the model parameters are greatly reduced, the accuracy improves to 98.88% and the F1 value to 98.67% under distillation guidance, and the loss drops significantly to 0.0435, indicating that the distillation mechanism effectively enhances the discriminative ability of the lightweight model. However, its average entropy rises to 1.2442 and its average output confidence drops to 51.76%, indicating that the uncertainty of the model output has increased. Similarly, using MobileNetV3 as the student network achieves accuracy (98.88%) and an F1 value (98.41%) similar to ShuffleNetV2, with a loss of 0.0766, an average entropy of 1.0957, and an average confidence of 60.83%.

The complete distillation model performs best in all indicators, with accuracy and F1 value both reaching 100%, loss reduced to 0.0291, average entropy significantly reduced to 0.3287, and average confidence reaching 90.90%. The above results fully prove that the distillation strategy proposed in this paper can not only improve the performance of the student network, but also effectively enhance the certainty and stability of the model output, especially showing good adaptability and generalization ability on the lightweight model.

4.4 Ablation Experiment

To further verify the specific role and performance contribution of each module in the proposed distillation framework, this paper designed nine systematic ablation experiments. All experiments start from the complete distillation model and gradually remove or modify specific modules to analyze their impact on final performance. The experimental results are shown in Table 5.

표 5. 절제 실험 데이터

Table 5. Ablation experiment data

Type | Acc | F1 | Loss | AE | AC
FeatureAlign + MSE | 98.88% | 98.45% | 0.0278 | 0.4733 | 86.70%
FeatureAlign + MSE + cosine similarity | 98.88% | 98.86% | 0.0435 | 0.3600 | 89.73%
FeatureAlign + MSE + SSIM | 98.85% | 98.38% | 0.1075 | 0.5438 | 84.72%
FeatureAlign + cosine similarity | 97.70% | 97.04% | 0.2293 | 0.5316 | 85.35%
FeatureAlign + cosine similarity + SSIM | 96.55% | 95.82% | 0.0877 | 0.5477 | 84.08%
FeatureAlign + SSIM | 98.85% | 98.71% | 0.0599 | 0.4619 | 86.67%
Fixed distillation weights | 100.00% | 100.00% | 0.0183 | 0.4595 | 87.38%
Without label smoothing | 100.00% | 100.00% | 0.0377 | 0.4038 | 88.85%
Basic cosine annealing | 98.88% | 98.67% | 0.0740 | 0.3795 | 89.09%
Complete model | 100.00% | 100.00% | 0.0291 | 0.3287 | 90.90%

Note: Acc = Accuracy, AE = Average Entropy, AC = Average Confidence. Acc, F1, and Loss are supervised indicators; AE and AC are unsupervised indicators.

Table 5 shows that the complete model, which integrates FeatureAlign feature alignment, label smoothing, MSE, and the structural similarity losses (SSIM and cosine similarity), achieves the best supervised results: accuracy of 100.00%, F1 value of 100.00%, and a loss of 0.0291. It also shows significant advantages in the unsupervised evaluation: its average entropy is the lowest (0.3287), indicating that the model output is more stable and concentrated, and its average confidence is the highest (90.90%), reflecting high-confidence predictions of sample categories.

In contrast, removing any structural constraint (such as SSIM or cosine similarity) degrades model performance to varying degrees. For example, the FeatureAlign + MSE combination alone maintains a high accuracy (98.88%) with a loss of 0.0278, but its average entropy rises to 0.4733, indicating that the lack of structural perception increases output uncertainty; when SSIM or the cosine constraint is used alone, the F1 value and average confidence do not reach those of the complete model. Likewise, removing the FeatureAlign module reduces F1 to 96.45%, further verifying the importance of feature alignment as a bridge for transferring intermediate representations.

In addition, for the experimental groups with fixed distillation weights and without label smoothing, although the supervised indicators remain high (Acc/F1 = 100%), the unsupervised indicators weaken noticeably (average entropy of 0.4595 and 0.4038, respectively), indicating that the model is less robust when facing uncertain samples or potential interference.

In summary, the complete model significantly outperforms all other ablation combinations in all key indicators and has strong generalization ability in small sample datasets.

4.5 Comparison of Unsupervised Indicators

Comparison of the average confidence and average entropy obtained when the distillation model and the standalone ResNet-18 model process the electrical cable melting dataset:

그림 4. 전기 케이블 이미지 분류 후 증류 모델과 ResNet-18 모델의 평균 엔트로피 비교

Fig. 4. Comparison of the average entropy of the distillation model and the ResNet-18 model after classifying electrical cable images

../../Resources/kiee/KIEE.2026.75.1.204/fig4-1.png../../Resources/kiee/KIEE.2026.75.1.204/fig4-2.png

Figure 4 shows the average entropy of the model predictions. In the left chart (distillation model), the average entropy of Class 0 is 0.4388, Class 1 is 0.2958, and Class 2 is 0.3320, with an overall average of 0.3287, indicating that the prediction uncertainty differs between categories. In the right chart (ResNet-18), the average entropy of Class 0 is 1.1421, Class 1 is 0.9689, and Class 2 is 1.3208, with an overall average of 1.0993. The per-category differences show that the distillation model has stronger prediction stability and lower uncertainty than the ResNet-18 model for samples of every category.

그림 5. 전기 케이블 이미지 분류 후 증류 모델과 ResNet-18 모델 간 평균 신뢰도 비교

Fig. 5. Comparison of average confidence between the distillation model and the ResNet-18 model after classifying electrical cable images

../../Resources/kiee/KIEE.2026.75.1.204/fig5-1.png../../Resources/kiee/KIEE.2026.75.1.204/fig5-2.png

In Figure 5, the dashed line at 90.90% in the left box plot marks the average confidence of the distillation model and shows the skewness of its confidence distribution. In the right box plot, the mean is marked at 67.75%, measuring the average confidence of the ResNet-18 model. Comparing the box positions and interquartile ranges shows intuitively that the distillation model's confidence in predicting the electrical cable melting data is more reliable.

4.6 Comparison of Supervised Indicators

그림 6. 증류 모델과 ResNet-18 모델의 손실\정확도\F1\재현율 점수 비교

Fig. 6. Comparison of Loss\Accuracy\F1\Recall Score between distillation model and ResNet-18 model

../../Resources/kiee/KIEE.2026.75.1.204/fig6-1.png../../Resources/kiee/KIEE.2026.75.1.204/fig6-2.png../../Resources/kiee/KIEE.2026.75.1.204/fig6-3.png../../Resources/kiee/KIEE.2026.75.1.204/fig6-4.png
그림 7. 수신기 작동 특성(ROC) 곡선

Fig. 7. Receiver Operating Characteristic (ROC) Curve

../../Resources/kiee/KIEE.2026.75.1.204/fig7-1.png../../Resources/kiee/KIEE.2026.75.1.204/fig7-2.png../../Resources/kiee/KIEE.2026.75.1.204/fig7-3.png
그림 8. 증류 모델의 혼동 행렬

Fig. 8. Confusion Matrix of distillation model

../../Resources/kiee/KIEE.2026.75.1.204/fig8.png

Figure 6 and Figure 7 show the comparison between the distillation model and the original ResNet-18 model in terms of loss, accuracy, and F1 score during the training and validation stages. The first image in each comparison group (with a dotted line in the background) corresponds to the training results of the distillation model, from which the performance gain brought by the distillation strategy can be clearly observed. In the first group, the distillation model converges quickly at the beginning of training and maintains a low, stable validation loss throughout; its final test-set loss is significantly lower than that of the ResNet-18 model (0.0291 vs. 0.4417), reflecting the improvement in feature fitting and generalization. In the second group, the distillation model reaches near-saturated accuracy at an earlier stage of training and achieves better accuracy than ResNet-18 on both the validation and test sets (test accuracy of 1.0000), indicating that distillation learning effectively alleviates the underfitting of the original model and improves its discrimination ability. The third group further verifies the optimization of overall performance by the distillation strategy: the distilled model shows higher consistency and stability during training, and its F1 score quickly reaches and maintains a level close to 1.0000, well above the test-set performance of the original model (1.0000 vs. 0.9720), indicating stronger classification robustness under unbalanced samples or blurred boundaries.

Figure 7 shows the ROC curves. After distillation, the student network successfully learned the distinguishing features of the different categories while maintaining performance close to that of the teacher network. With AUC = 1.0, the model's ability to identify the three categories is almost perfect, with almost no false positives or false negatives.

Figure 8 shows the confusion matrix, indicating that the distillation model almost perfectly classifies each category. Even with few-sample training, the student network can stably learn the features of each category through distillation by the teacher network.

In summary, the distillation model on the left side of the figure significantly outperforms the undistilled ResNet-18 model in terms of loss convergence speed, accuracy improvement, and F1 score performance. Furthermore, the good results in ROC curve and confusion matrix ensure the reliability of the results. This fully verifies the effectiveness of the proposed teacher-student network structure and multi-loss fusion distillation strategy in enhancing the model's generalization ability and classification performance.

5. Conclusion

In this paper, a heterogeneous model distillation framework is designed to address the limited performance of lightweight models in classifying small-sample electrical cable melting images. By using the highly expressive U-Net3+ as the teacher network and transferring its intermediate-layer information to the structurally different, lightweight ResNet-18 student network, the generalization ability of the lightweight model on complex electrical images is effectively improved with only 117 training images. A multi-level, multi-loss fusion knowledge distillation framework is constructed. The main innovations include: a multi-scale intermediate-layer feature distillation design with an intermediate feature alignment module; a multi-loss fusion feature distillation mechanism; a soft-label optimization strategy based on label smoothing; joint scheduling of a dynamic learning rate strategy with an early stopping mechanism, which automatically selects the optimal weights and avoids overfitting; and a multi-dimensional evaluation and diagnosis mechanism that introduces both supervised and unsupervised indicators. Comparative and ablation experiments further demonstrate the rationality and credibility of the proposed method.

In summary, this paper makes systematic improvements and integrations across several dimensions, including distillation structure design, loss function fusion, training scheduling, and the evaluation system, which significantly improve the performance and generalization ability of the student network on small-sample electrical cable melting datasets under lightweight-model constraints.

Acknowledgements

This work was supported by the Korea Institute of Energy Technology Evaluation and Planning (KETEP) grant funded by the Korea government (MCEE) (RS-2022-KP002707, Jeonbuk Regional Energy Cluster Training of human resources).

References

[1] R. Tîrnovan, M. Cristea, 2019, Advanced techniques for fault detection and classification in electrical power transmission systems: An overview, pp. 1-6.
[2] Z. Huang, Y. Wei, X. Wang, W. Liu, T. S. Huang, H. Shi, 2020, AlignSeg: Feature-Aligned Segmentation Networks, arXiv preprint arXiv:2003.00872.
[3] H. Deng, X. Li, 2022, Anomaly detection via reverse distillation from one-class embedding, pp. 9737-9746.
[4] G. Pereyra, 2017, Regularizing neural networks by penalizing confident output distributions, arXiv preprint arXiv:1701.06548.
[5] K. P. Murphy, 2012, Machine Learning: A Probabilistic Perspective.
[6] S. Wager, S. Wang, P. Liang, 2013, Advances in Neural Information Processing Systems, Vol. 26.
[7] K. He, X. Zhang, S. Ren, J. Sun, 2016, Deep residual learning for image recognition, pp. 770-778.
[8] R. Namdar, 2022, Improving generalization in deep learning models under noisy and small sample conditions via multitask learning, IEEE Access, Vol. 10, pp. 12345-12358.
[9] F. Zhuang, 2021, A comprehensive survey on transfer learning, Proceedings of the IEEE, Vol. 109, No. 1, pp. 43-76.
[10] P. Lu, 2021, RW-KD: Sample-wise loss terms re-weighting for knowledge distillation, Findings of ACL: EMNLP 2021, pp. 3145-3152.
[11] Y. Ren, 2023, Tailoring instructions to student's learning levels boosts knowledge distillation, arXiv preprint arXiv:2305.09651.
[12] A. Romero, 2015, FitNets: Hints for thin deep nets, arXiv preprint arXiv:1412.6550.
[13] S. Park, J. Lee, H. Kim, 2024, Cosine similarity-guided knowledge distillation for robust object detectors, Scientific Reports, Vol. 14, No. 1, pp. 12345.
[14] I. Loshchilov, F. Hutter, 2017, SGDR: Stochastic gradient descent with warm restarts, arXiv preprint arXiv:1608.03983.
[15] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, 2004, Image quality assessment: From error visibility to structural similarity, IEEE Trans. Image Process., Vol. 13, No. 4, pp. 600-612.
[16] P. Goyal, 2017, Accurate, large minibatch SGD: Training ImageNet in 1 hour, arXiv preprint arXiv:1706.02677.
[17] H.-J. Jung, D. Kim, S.-H. Na, K. Kim, 2023, Feature structure distillation with centered kernel alignment in BERT transferring, Expert Syst. Appl.
[18] H. Huang, 2020, UNet 3+: A full-scale connected UNet for medical image segmentation, pp. 1055-1059.
[19] C.-B. Zhang, 2020, Delving deep into label smoothing, arXiv preprint arXiv:2011.12562.
[20] E. D. Gireesh, V. P. Gurupur, 2023, Information entropy measures for evaluation of reliability of deep neural network results, Entropy, Vol. 25, No. 4, pp. 573.
[21] H-G. Park, J-H. Bang, J-H. Kim, B-M. So, J-H. Song, K-M. Park, 2023, A Study on the Comparative Analysis of the Performance of CNN-Based Algorithms for the Determination of Arc Beads and Molten Mark by Model, The Journal of Next-generation Convergence Technology Association, Vol. 7, No. 4, pp. 543-552.

저자소개 (About the Authors)

조걸(Zhao Jie)
../../Resources/kiee/KIEE.2026.75.1.204/au1.png

He holds a Master of Science degree in Computer Technology Engineering from Nanchang University in China and is currently pursuing a PhD in the Department of IT Applied System at Jeonbuk National University. His main research areas are deep learning, blockchain, and fire detection.

방준호(Junho Bang)
../../Resources/kiee/KIEE.2026.75.1.204/au2.png

He received B.S., M.S., and Ph.D. degrees from the Department of Electrical Engineering at Jeonbuk National University in 1989, 1991, and 1996, respectively. He was a research engineer with LG Semiconductor from 1997 to 1998. He is currently a professor in the Division of Convergence Technology Engineering and the Department of Energy/Conversion Engineering of the Graduate School, Jeonbuk National University, Jeonju, Republic of Korea. His main research interests include IT convergence system design.

최철영(Chul-Young Choi)
../../Resources/kiee/KIEE.2026.75.1.204/au3.png

He received an M.S. degree from the Department of Energy Engineering at Jeonbuk National University and is currently enrolled in the doctoral course in the Department of Energy Engineering at Jeonbuk National University. His main research areas are energy and offshore wind.

선로빈(Robin Sun)
../../Resources/kiee/KIEE.2026.75.1.204/au4.png

He received an M.S. degree from the Department of Energy Storage Conversion Engineering at Jeonbuk National University and is currently enrolled in the doctoral course in the Department of IT Applied System at Jeonbuk National University. His main research areas are energy and offshore wind.

박소연(Soyeon Park)
../../Resources/kiee/KIEE.2026.75.1.204/au5.png

She received M.S. degrees in IT application system engineering at Jeonbuk National University and currently enrolled in a doctor’s course. Her main research interests are electrical energy storage system and machine learning.