Oralbek Bayazov¹, Anel Aidos², Jeong Won Kang†, Assel Mukasheva††

School of Information Technology and Engineering, Kazakh-British Technical University, Kazakhstan
School of Science and Humanities, Nazarbayev University, Astana, Kazakhstan
Key words
Voice biometrics, Wav2Vec 2.0, Spoof detection, LSTM, Federated learning, Multimodal authentication, Deep learning
1. Introduction
The increasing reliance on digital platforms for banking, education, healthcare,
and communication has significantly amplified the need for robust and seamless authentication
systems. Voice biometric authentication is gaining momentum due to its convenience,
contactless operation, and potential for integration into everyday technologies such
as smartphones, virtual assistants, and call centers. Unlike fingerprint or facial
recognition systems, voice-based methods do not require physical contact or camera
access, making them especially useful in low-resource or privacy-sensitive contexts.
Nevertheless, the practicality of voice biometrics is hindered by multiple factors.
Variations in microphones, environmental noise, changes in a speaker's health or emotional
state, and, most critically, spoofing attacks can undermine system performance. Spoofing,
whether through replayed recordings or synthetic speech, presents a unique challenge because
it can closely imitate legitimate input.
Artificial Intelligence, and particularly deep learning, has opened new avenues to
enhance the accuracy and adaptability of voice authentication systems[1,2]. Unlike traditional signal processing techniques, deep neural networks can learn
hierarchical representations of speech signals, enabling them to generalize across
different conditions. This study seeks to evaluate and compare three distinct neural
architectures, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) networks,
and Wav2Vec 2.0, for their efficacy in authenticating speakers under challenging conditions.
2. Materials and Methods
This study adopts a rigorous mathematical and experimental framework to evaluate the
robustness of CNN, LSTM, and Wav2Vec 2.0 models for voice biometric authentication.
CNN was chosen for its ability to capture local spectral patterns from spectrograms,
LSTM for modeling long-term temporal dependencies in speech, and Wav2Vec 2.0 as a
state-of-the-art transformer that learns contextual representations directly from
raw audio. The methodology is divided into several subsections covering data representation,
model architectures, and evaluation metrics.
2.1 Data Representation
Each voice sample can be represented as a discrete-time signal:

$$x = \left[ x(1),\, x(2),\, \dots,\, x(T) \right]$$

Here, $x(t)$ denotes the raw speech waveform, and $T$ is the total number of sampled
points.
For CNN and LSTM models, these signals are transformed into mel-spectrograms and mel-frequency
cepstral coefficients (MFCCs):

$$S(m, t) = \mathrm{Mel}\left( \left| \mathrm{STFT}\{ x(t) \} \right|^{2} \right), \qquad \mathrm{MFCC}(t) = \mathrm{DCT}\left( \log S(m, t) \right)$$

where STFT denotes the short-time Fourier transform, producing a time-frequency representation,
while MFCC captures perceptually relevant spectral features. In contrast, Wav2Vec 2.0
directly consumes the raw waveform without handcrafted transformations.
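As a concrete illustration, the following minimal Python sketch computes both representations from a raw waveform. It assumes librosa is used for feature extraction; the sampling rate, mel-band count, and MFCC order are illustrative defaults rather than the exact configuration of this study.

```python
# Minimal feature-extraction sketch (assumed tooling: librosa).
import librosa
import numpy as np

def extract_features(path: str, sr: int = 16000):
    """Load a waveform and compute a log-mel spectrogram and MFCCs."""
    x, sr = librosa.load(path, sr=sr)                      # raw waveform x(t)
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_mels=80)
    log_mel = librosa.power_to_db(mel)                     # log-mel spectrogram for CNN/LSTM
    mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13)     # perceptual cepstral features
    return log_mel, mfcc
```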
2.2 Convolutional Neural Network (CNN)
CNNs are designed to capture local spatial features from spectrograms. A convolutional
layer can be mathematically expressed as:

$$h_{i,j}^{(l)} = \sigma\left( \sum_{m} \sum_{n} w_{m,n}^{(l)}\, h_{i+m,\, j+n}^{(l-1)} + b^{(l)} \right)$$

where $w_{m,n}^{(l)}$ are the convolutional kernel weights, $b^{(l)}$ is the bias,
and $\sigma$ denotes the ReLU activation function. This formulation describes how each feature
map is generated by applying learnable filters to the spectrogram, enabling the extraction
of edges, frequency bands, and temporal structures.
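For reference, a minimal PyTorch sketch of such a spectrogram-based CNN classifier is given below; the layer widths and number of speaker classes are assumptions chosen only for illustration, not the architecture used in the experiments.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_speakers: int = 100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_speakers)
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time) log-mel spectrogram
        return self.classifier(self.features(spec))
```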
2.3 Long Short-Term Memory (LSTM)
LSTM networks are recurrent architectures specialized in modeling sequential dependencies.
The state updates can be written as:

$$f_{t} = \sigma\left( W_{f} [h_{t-1}, x_{t}] + b_{f} \right)$$
$$i_{t} = \sigma\left( W_{i} [h_{t-1}, x_{t}] + b_{i} \right)$$
$$o_{t} = \sigma\left( W_{o} [h_{t-1}, x_{t}] + b_{o} \right)$$
$$c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \tanh\left( W_{c} [h_{t-1}, x_{t}] + b_{c} \right)$$
$$h_{t} = o_{t} \odot \tanh\left( c_{t} \right)$$

Here, $f_{t}$, $i_{t}$, and $o_{t}$ represent the forget, input, and output gates,
respectively, while $c_{t}$ is the memory cell state. This mechanism allows the network
to selectively retain or discard information, making it suitable for speech data with
long-term temporal dependencies.
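A minimal PyTorch sketch of an MFCC-based LSTM classifier is shown below; the MFCC order, hidden size, and speaker count are illustrative assumptions rather than the exact configuration of this study.

```python
import torch
import torch.nn as nn

class MFCCLSTM(nn.Module):
    def __init__(self, n_mfcc: int = 13, hidden: int = 128, n_speakers: int = 100):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_speakers)

    def forward(self, mfcc_seq: torch.Tensor) -> torch.Tensor:
        # mfcc_seq: (batch, time, n_mfcc)
        _, (h_n, _) = self.lstm(mfcc_seq)   # final hidden state summarizes the sequence
        return self.fc(h_n[-1])             # classify from the last hidden state
```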
2.4 Wav2Vec 2.0
Wav2Vec 2.0 operates on raw audio using a transformer-based encoder and employs self-supervised
pretraining. Its objective is expressed through a contrastive loss function:

$$\mathcal{L}_{m} = -\log \frac{\exp\left( \mathrm{sim}(c_{t}, q_{t}) / \kappa \right)}{\sum_{\tilde{q} \in Q_{t}} \exp\left( \mathrm{sim}(c_{t}, \tilde{q}) / \kappa \right)}$$

where $c_{t}$ is the contextualized representation of the masked time step, $q_{t}$
is the correct (positive) quantized target, and $\tilde{q} \in Q_{t}$ represents negative distractor
samples, with $\mathrm{sim}(\cdot,\cdot)$ a cosine similarity and $\kappa$ a temperature. This objective ensures that the model learns discriminative features directly
from raw audio without relying on hand-crafted transformations.
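In practice, the frame-level contextual representations $c_{t}$ can be obtained from a pretrained checkpoint. The sketch below uses Hugging Face Transformers with the facebook/wav2vec2-base checkpoint as an assumed example; it is not necessarily the exact model fine-tuned in this study.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

def embed(waveform, sr: int = 16000) -> torch.Tensor:
    """Return frame-level contextual representations c_t for a raw waveform."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state   # shape: (1, frames, hidden_dim)
```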
2.5 Evaluation Metrics
To quantify performance, three main metrics are considered:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{FAR} = \frac{FP}{FP + TN}, \qquad \mathrm{FRR} = \frac{FN}{FN + TP}$$

Accuracy measures overall classification performance, while the False Acceptance Rate
(FAR) evaluates how often spoofed or unauthorized voices are incorrectly accepted.
The False Rejection Rate (FRR) measures how often genuine users are rejected.
In addition, robustness is defined as the relative performance drop when the models
are evaluated under noisy or spoofed conditions compared to clean scenarios. This
provides a more practical assessment of real-world deployment.
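A minimal sketch of how these three metrics can be computed from binary decisions is given below, assuming label 1 denotes an accepted genuine target speaker and 0 an impostor or spoofed trial.

```python
import numpy as np

def far_frr_accuracy(y_true: np.ndarray, y_pred: np.ndarray):
    """Accuracy, FAR, and FRR from binary labels and decisions."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))   # impostor/spoof accepted
    fn = np.sum((y_true == 1) & (y_pred == 0))   # genuine user rejected
    accuracy = (tp + tn) / len(y_true)
    far = fp / max(fp + tn, 1)                   # false acceptance rate
    frr = fn / max(fn + tp, 1)                   # false rejection rate
    return accuracy, far, frr
```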
2.6 Data Augmentation
To improve robustness and generalization, several data augmentation techniques were
applied to the training datasets. These augmentations simulate real-world variability
and adversarial conditions:
- Additive Gaussian Noise:

$$x'(t) = x(t) + n(t), \qquad n(t) \sim \mathcal{N}(0, \sigma^{2})$$

where $x(t)$ is the clean speech signal and $\sigma^{2}$ is the noise variance. This
method simulates microphone and environmental noise.
- Reverberation: Convolution of the signal with a room impulse response (RIR):

$$x'(t) = (x * h)(t)$$

where $h(t)$ is the RIR.
- Pitch Shifting: Frequency modification applied via phase vocoder:

$$x'(t) = \mathrm{PV}_{\alpha}\{ x(t) \}$$

where $\alpha$ is the pitch scaling factor.
- Speed Perturbation: Temporal scaling applied to the waveform:

$$x'(t) = x(\beta t)$$

where $\beta$ is the speed factor.
- Background Speech Mixing: Random segments from other speakers are linearly mixed:

$$x'(t) = x(t) + \lambda y(t)$$

where $y(t)$ is another speaker's audio and $\lambda$ is a mixing coefficient.
These augmentation strategies increase dataset variability, reduce overfitting, and
enhance spoofing resistance by exposing models to challenging acoustic conditions
during training. Similar to other works where noise augmentation was applied to improve
the robustness of CNN-based models in medical imaging tasks [3], our study incorporated Gaussian noise, reverberation, and pitch shifting to simulate
realistic acoustic environments.
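As an illustration of how these augmentations can be realized, the sketch below uses numpy and librosa; the noise level, pitch step, speed factor, and mixing coefficient are illustrative values rather than the study's exact configuration, and the room impulse response rir and the interfering waveform y are assumed to be provided.

```python
import numpy as np
import librosa

def augment(x: np.ndarray, sr: int, rir: np.ndarray, y: np.ndarray) -> dict:
    """Apply the five augmentations above to a clean waveform x (y assumed at least as long as x)."""
    noisy = x + np.random.normal(0.0, 0.01, size=x.shape)        # additive Gaussian noise
    reverbed = np.convolve(x, rir, mode="full")[: len(x)]        # convolution with a room impulse response
    pitched = librosa.effects.pitch_shift(x, sr=sr, n_steps=2)   # pitch shift via phase vocoder
    sped = librosa.effects.time_stretch(x, rate=1.1)             # temporal scaling (stand-in for speed perturbation)
    mixed = x + 0.3 * y[: len(x)]                                # background speech mixing
    return {"noise": noisy, "reverb": reverbed, "pitch": pitched,
            "speed": sped, "mix": mixed}
```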
Figure 1 shows the spectrogram comparison of clean and augmented speech signals. The first
spectrogram represents a clean speech waveform with distinct harmonic structures.
The second spectrogram illustrates the effect of additive Gaussian noise, where random
high-frequency components blur the formant patterns. The third spectrogram demonstrates
reverberation, introducing temporal smearing caused by simulated room impulse responses.
The fourth spectrogram shows pitch shifting, where the spectral bands are displaced
due to frequency scaling. Finally, the fifth spectrogram presents speed perturbation,
which compresses or stretches temporal features, altering the rhythm of speech. Together,
these augmentations increase data variability and simulate realistic acoustic environments
for model training.
Fig. 1. Examples of spectrograms after different augmentation techniques
3. Results and Discussion
The comparative results reveal significant differences in how each architecture handles
noise and spoofing. CNN models, while effective on clean spectrograms, suffered considerable
performance degradation when noise or distortions were introduced. Their reliance
on static visual patterns limited adaptability.
LSTM networks showed superior noise handling due to their ability to model time-series
dynamics. However, they struggled with spoofed inputs, especially those generated
via high-quality TTS, suggesting their temporal memory alone is insufficient for spoof
detection.
Wav2Vec 2.0, as hypothesized, delivered the highest accuracy overall. Its ability
to process raw audio signals allowed it to learn deep representations resilient to
distortion and background interference[2,7]. In clean conditions, it achieved 92 percent accuracy, with minimal performance drop
under noisy conditions. However, even Wav2Vec 2.0 misclassified certain high-fidelity
synthetic voices as genuine, indicating that spoofing remains a system-wide vulnerability[1].
As shown in Figure 2, Wav2Vec 2.0 leads in all three key metrics (accuracy, noise robustness, and spoof
resistance) compared to CNN and LSTM.
Fig. 2. Model comparison of performance for three key metrics
These results align with recent studies showing the superiority of transformer-based
models in speech processing tasks[3]. However, the persistent vulnerability to spoofing across all models confirms findings
by other researchers that neural networks, regardless of their depth, can be deceived
by audio crafted to imitate human speech[7,8].
Figure 3 presents the spoof resistance distribution across models, demonstrating that while
Wav2Vec 2.0 performs better, it still accepts over 25 percent of spoofed samples.
This confirms the need for integrating explicit spoof detection modules, such as Light
Convolutional Neural Networks (LCNN), and training models with adversarial examples
crafted from state-of-the-art voice cloning tools[6,8].
Furthermore, incorporating multi-modal biometric fusion (e.g., combining voice with
facial or behavioral signals) could significantly reduce spoofing risk[9]. Alternatively, privacy-aware architectures such as federated learning may allow
decentralized training on user devices, mitigating the risk of centralized data leakage[10].
The key evaluation metrics are summarized in Table 1, enabling a side-by-side comparison of the three architectures in numerical terms.
Fig. 3. Model comparison by performance metrics
Table 1 Summary of Model Evaluation Metrics
| Model | Accuracy | Noise Robustness | Spoof Resistance |
|-------------|----------|------------------|------------------|
| CNN | 78% | 65% | 52% |
| LSTM | 85% | 78% | 67% |
| Wav2Vec 2.0 | 92% | 88% | 75% |
3.1 Cross-Dataset Evaluation
To evaluate the generalization capability of the models, a cross-dataset experiment
was conducted: models were trained on the Mozilla Common Voice dataset and tested
on VoxCeleb. Results show a significant performance drop for CNN and LSTM due to overfitting
to spectrogram-specific features. Wav2Vec 2.0 demonstrated stronger generalization,
maintaining 84% accuracy compared to 92% on in-domain data.
3.2 Spoof Generalization
An additional experiment was carried out where models were trained with one type of
text-to-speech (TTS) synthesis and tested with another unseen TTS system. Both CNN
and LSTM models failed to adapt, showing spoof acceptance rates above 40%. Wav2Vec
2.0 reduced the error but still accepted 28% of spoofed samples, highlighting the
challenge of unseen synthetic voices.
3.3 Adversarial Attack Robustness
To simulate targeted spoofing, adversarial perturbations were generated using the
Fast Gradient Sign Method (FGSM):

$$x_{adv} = x + \epsilon \cdot \mathrm{sign}\left( \nabla_{x} \mathcal{L}(\theta, x, y) \right)$$

where $\epsilon$ is the perturbation budget. Even with a small $\epsilon = 0.01$, CNN and LSTM misclassified
45% and 39% of samples, respectively, while Wav2Vec 2.0 showed improved robustness
with only 22% error.
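A minimal PyTorch sketch of this attack, following the formula above, is shown below; the model, inputs, and labels are placeholders for whichever classifier and batch are being perturbed.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model: torch.nn.Module, x: torch.Tensor, y: torch.Tensor,
                epsilon: float = 0.01) -> torch.Tensor:
    """Return x_adv = x + epsilon * sign(grad_x L(theta, x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]      # gradient w.r.t. the input only
    return (x_adv + epsilon * grad.sign()).detach()
```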
Figure 4 compares the accuracy of CNN, LSTM, and Wav2Vec 2.0 across four experimental
conditions: clean speech, noisy speech, cross-dataset evaluation, and adversarial attacks.
Wav2Vec 2.0 consistently outperforms the other models, showing stronger robustness to
noise, dataset variability, and adversarial perturbations.
Fig. 4. Model performance under different experimental conditions
3.4 Discussion
The comparative results presented in the previous section demonstrate that Wav2Vec
2.0 outperforms CNN and LSTM models across multiple evaluation metrics. However, a
deeper analysis reveals crucial insights into why these models behave differently
under clean, noisy, and spoofed conditions.
CNNs perform well on clean mel spectrograms due to their ability to extract spatial
features. However, they lack temporal modeling capabilities, making them more vulnerable
to variations in speech dynamics and environmental changes. When Gaussian noise or
reverb is introduced, CNN performance drops sharply, indicating overreliance on static
patterns[11].
LSTM networks, being recurrent in nature, demonstrate stronger resistance to temporal
distortion[12]. They can adapt to background conversations or inconsistent pacing in speech. However,
their inability to detect anomalies in spectral patterns leads to a higher false acceptance
rate during spoofing attacks, especially when using high-quality TTS inputs[13].
Wav2Vec 2.0 exhibits clear advantages by processing raw waveform data directly. Its
transformer-based architecture allows it to extract multi-level, contextualized features[2,14]. This robustness contributes to its strong noise resistance and superior generalization.
However, it too struggles with advanced spoofing, misclassifying some AI-generated
voices as genuine[1,7,15].
Further analysis of the confusion matrices indicates that while all models can distinguish
between genuine and replayed inputs fairly well, they struggle when presented with
deepfake voices generated using state-of-the-art TTS systems[6,16]. Figure 5 shows the confusion matrix for Wav2Vec 2.0 under spoofing conditions.
Fig. 5. Confusion matrix of Wav2Vec 2.0 under spoofing conditions (TTS vs. genuine)
Moreover, latency analysis showed that Wav2Vec 2.0 requires more computational resources
due to the transformer layers, which could limit its deployment on edge devices[17].
To summarize, while Wav2Vec 2.0 is a promising solution, deploying it in real-world
scenarios requires complementing it with specialized anti-spoofing modules and optimizing
it for lightweight inference.
3.4.1 Error Analysis
A detailed examination of the error cases shows that CNN frequently accepted spoofed
samples with stable spectral envelopes, indicating its overreliance on static features.
LSTM demonstrated difficulty when spoofed voices contained long pauses or irregular
temporal dynamics, which disrupted its sequence modeling. Wav2Vec 2.0 performed better
overall but still misclassified advanced TTS-based voices, especially those reproducing
natural coarticulation and prosodic variations. These findings highlight that spoof
detection remains an open challenge across all architectures.
3.4.2 Latency and Computational Efficiency
Alongside recognition accuracy, inference time was also evaluated. CNN achieved the
fastest processing (≈12 ms per sample) owing to its lightweight convolutional layers,
while LSTM required slightly longer (≈18 ms) due to sequential recurrence. Wav2Vec
2.0, despite offering the highest accuracy, was the slowest (≈45 ms), primarily because
of its transformer layers and contextual embeddings. This trade-off indicates that
while Wav2Vec 2.0 is most robust, CNN and LSTM may still be preferable for deployment
on resource-constrained or edge devices. Recent works also emphasize optimization-oriented
approaches in software engineering, such as the use of low-code platforms to improve
efficiency[17].
3.5 Mathematical Robustness Analysis
To provide a deeper theoretical perspective, the robustness of voice biometric models
can be formalized using mathematical definitions and performance bounds.
3.5.1 Signal-to-Noise Ratio (SNR) and Degradation
The impact of noise on speech signals is quantified through the SNR:

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{t} x(t)^{2}}{\sum_{t} \left( x'(t) - x(t) \right)^{2}}$$

where $x(t)$ is the clean signal and $x'(t)$ is the noisy version. A higher SNR corresponds
to clearer input, while a lower SNR indicates stronger noise contamination. Model
robustness can be expressed as the relative drop in accuracy with decreasing SNR.
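A minimal numpy sketch of this definition, assuming aligned clean and noisy waveforms of equal length:

```python
import numpy as np

def snr_db(x_clean: np.ndarray, x_noisy: np.ndarray) -> float:
    """SNR in dB between a clean signal and its noisy version."""
    noise = x_noisy - x_clean
    return 10.0 * np.log10(np.sum(x_clean ** 2) / np.sum(noise ** 2))
```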
3.5.2 False Acceptance and Rejection Bounds
The overall reliability of authentication can be expressed via the FAR and FRR. A
robust system minimizes both rates simultaneously. However, in practice, there exists
a trade-off: reducing one rate typically increases the other. To summarize overall
system performance, the Equal Error Rate (EER) is commonly used.
The EER is defined as the operating point where FAR equals FRR:

$$\mathrm{FAR}(\tau^{*}) = \mathrm{FRR}(\tau^{*})$$

where $\tau^{*}$ is the decision threshold that balances the two errors. When an exact
equality is not attainable due to discrete thresholds, $\tau^{*}$ is chosen as

$$\tau^{*} = \arg\min_{\tau} \left| \mathrm{FAR}(\tau) - \mathrm{FRR}(\tau) \right|$$

with the corresponding approximation:

$$\mathrm{EER} \approx \frac{\mathrm{FAR}(\tau^{*}) + \mathrm{FRR}(\tau^{*})}{2}$$
This measure provides a single scalar value to compare biometric systems and is widely
reported in speaker verification studies.
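The following sketch computes the EER from verification scores in this way, assuming higher scores indicate a more likely genuine trial and labels of 1 (genuine) and 0 (impostor or spoof); it is illustrative rather than the exact evaluation script used here.

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER over all candidate thresholds, using the |FAR - FRR| minimizer."""
    thresholds = np.sort(np.unique(scores))
    best_gap, eer = np.inf, 1.0
    for tau in thresholds:
        far = np.mean(scores[labels == 0] >= tau)   # impostors accepted at tau
        frr = np.mean(scores[labels == 1] < tau)    # genuine users rejected at tau
        gap = abs(far - frr)
        if gap < best_gap:                          # tau* minimizing |FAR - FRR|
            best_gap, eer = gap, (far + frr) / 2.0  # EER approximation
    return eer
```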
3.5.3 Spoof Detection under Adversarial Perturbations
Spoofing attacks can be formalized as adversarial perturbations to the input signal:

$$x' = x + \delta, \qquad \| \delta \|_{\infty} \le \epsilon$$

where $\delta$ is the perturbation bounded by $\epsilon$. The adversarial objective maximizes classification
error:

$$\max_{\| \delta \|_{\infty} \le \epsilon} \mathcal{L}\left( f_{\theta}(x + \delta),\, y \right)$$

CNN and LSTM models exhibit higher sensitivity to small perturbations, while transformer-based
models like Wav2Vec 2.0 provide partial robustness but remain vulnerable when $\epsilon$ is
sufficiently large.
3.5.4 Robust Training Objective
To mitigate these vulnerabilities, adversarial training modifies the loss function:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda\, \mathcal{L}_{adv}$$

where $\mathcal{L}_{CE}$ is the standard cross-entropy loss, $\mathcal{L}_{adv}$ penalizes misclassification under
adversarial perturbations, and $\lambda$ balances the two objectives. This formulation ensures
models not only fit clean data but also resist spoofed and noisy inputs.
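One way to instantiate this objective is sketched below in PyTorch, reusing the hypothetical fgsm_attack helper from the sketch in Section 3.3; lambda_adv and epsilon are illustrative hyperparameters, not values used in this study.

```python
import torch.nn.functional as F

def robust_training_step(model, optimizer, x, y, epsilon=0.01, lambda_adv=0.5):
    """One step of L = L_CE(clean) + lambda * L_adv(adversarial)."""
    x_adv = fgsm_attack(model, x, y, epsilon)        # adversarial examples (see the FGSM sketch)
    optimizer.zero_grad()
    loss_clean = F.cross_entropy(model(x), y)        # L_CE on clean inputs
    loss_adv = F.cross_entropy(model(x_adv), y)      # L_adv on perturbed inputs
    loss = loss_clean + lambda_adv * loss_adv
    loss.backward()
    optimizer.step()
    return loss.item()
```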
3.6 Ethical and Privacy Considerations
The deployment of AI-driven voice biometric systems raises several ethical, legal,
and privacy challenges that must not be overlooked.
3.6.1 Data Ownership and Consent
Voice recordings are inherently personal and can reveal far more than just identity,
including health status, emotions, or even mental state. It is imperative that users
retain control over their data. Consent must be explicit, revocable, and informed.
Systems must be designed with privacy-by-default and privacy-by-design principles[18].
3.6.2 Bias and Fairness
Bias in training data is a serious concern. Voice datasets such as Common Voice and
VoxCeleb, while large, may not be fully balanced in terms of gender, age, dialect,
or accent. This can lead to biased model behavior; for instance, better recognition
accuracy for male speakers than for female speakers, or for native English speakers than for non-native speakers.
Ensuring equitable performance requires diverse training data and continuous auditing
for fairness[19].
3.6.3 Secure and Private Learning Approaches
Federated Learning offers a solution by enabling decentralized training. In this framework,
models are trained locally on user devices without transmitting raw audio to central
servers. This not only preserves privacy but also reduces the risk of data leaks and
attacks on central repositories[9,20].
3.6.4 Regulatory Compliance
Deployment in real-world systems must comply with data protection regulations such
as GDPR, CCPA, or national laws. Explainable AI and audit trails must be in place
to ensure accountability and legal transparency[21,22].
4. Conclusion and Future Work
This study presented a comparative evaluation of CNN, LSTM, and Wav2Vec 2.0 architectures
for voice biometric authentication under clean, noisy, and spoofed conditions. Wav2Vec
2.0 consistently outperformed the other models in accuracy and robustness, although
none of the approaches achieved complete resistance to high-quality spoofing attacks.
In addition to the main findings, several broader conclusions can be drawn:
∙ Accuracy vs. Efficiency Trade-off: While CNN and LSTM provide faster inference suitable
for deployment on edge devices, their robustness against spoofing remains limited.
Wav2Vec 2.0, although computationally heavier, delivers superior generalization and
noise resilience, suggesting its suitability for cloud-based or hybrid systems.
∙ Vulnerability to Emerging Attacks: All models demonstrated weaknesses against novel
TTS systems and adversarial perturbations, confirming that spoof detection remains
one of the most critical bottlenecks in voice biometrics.
∙ Importance of Data Diversity: Cross-dataset experiments revealed that limited domain
coverage in training data reduces generalization. This emphasizes the necessity of
large-scale, diverse, and continuously updated datasets for robust biometric authentication.
Future research directions include several promising avenues:
∙ Hybrid Architectures: Combining CNN’s efficient feature extraction, LSTM’s sequential
modeling, and transformer-based contextual representations could lead to improved
balance between robustness and efficiency.
∙ Adversarial and Spoof-Aware Training: Integrating adversarial training strategies
and explicit spoof detection modules (e.g., LCNNs or spectro-temporal anomaly detectors)
to mitigate vulnerabilities.
∙ Multimodal Biometric Fusion: Exploring fusion of voice with facial recognition,
lip movements, or behavioral biometrics to enhance overall security.
∙ On-Device Deployment: Investigating lightweight transformer variants (e.g., DistilWav2Vec,
quantization, pruning) for mobile and embedded systems.
∙ Privacy-Preserving Learning: Applying federated learning and differential privacy
methods to protect sensitive voice data while maintaining high accuracy.
Ultimately, future systems must not only achieve technical robustness but also comply
with ethical, legal, and privacy requirements to ensure trustworthy real-world deployment.
Acknowledgements
This research was supported by the Science Committee of the Ministry of Science and
Higher Education of the Republic of Kazakhstan (grant no. BR28712579).
References
H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi and N. Evans, “Automatic speaker
verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,”
2022. arXiv preprint arXiv:2202.12233

S. Novoselov, G. Lavrentyeva, A. Avdeeva, V. Volokhov and A. Gusev, “Robust speaker
recognition with transformers using wav2vec 2.0,” 2022. arXiv preprint arXiv:2203.15095

A. Mukasheva, D. Koishiyeva, Z. Suimenbayeva, S. Rakhmetulayeva, A. Bolshibayeva and
G. Sadikova, “Comparison Evaluation of Unet-Based Models with Noise Augmentation for
Breast Cancer Segmentation on Ultrasound Images,” Eastern-European Journal of Enterprise
Technologies, vol. 125, no. 9, 2023 10.15587/1729-4061.2023.289044

N. Vaessen and D. A. Van Leeuwen, “Fine-tuning wav2vec2 for speaker recognition,”
In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), IEEE, pp. 7967-7971, 2022. 10.1109/ICASSP43922.2022.9746952

K. Li, C. Baird and D. Lin, “Defend data poisoning attacks on voice authentication,”
IEEE Transactions on Dependable and Secure Computing, vol. 21, no. 4, pp. 1754-1769,
2023. 10.1109/TDSC.2023.3289446

J. W. Lee, E. Kim, J. Koo and K. Lee, “Representation selective self-distillation
and wav2vec 2.0 feature exploration for spoof-aware speaker verification,” 2022.
arXiv preprint arXiv:2204.02639

S. Salturk and N. Kahraman, “Deep learning-powered multimodal biometric authentication:
integrating dynamic signatures and facial data for enhanced online security,” Neural
Computing and Applications, vol. 36, no. 19, pp. 11311-11322, 2024. 10.1007/s00521-024-09690-2

K. Merit and M. Beladgham, “Enhancing Biometric Security with Bimodal Deep Learning
and Feature-level Fusion of Facial and Voice Data,” Journal of Telecommunications
and Information Technology, vol. 98, no. 4, pp. 31-42, 2024. 10.26636/jtit.2024.4.1754

Y. Elbayoumi (2024), “Applying machine learning and deep learning in the voice biometrics
technology,” Master’s Thesis, Bahcesehir University, 22 January 2024. https://www.researchgate.net/publication/380131916

K. Koutini, H. Eghbal-zadeh and G. Widmer, “Receptive field regularization techniques
for audio classification and tagging with deep convolutional neural networks,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1987-2000, 2021.
10.1109/TASLP.2021.3082307

T. N. Sainath, O. Vinyals, A. Senior and H. Sak, “Convolutional, long short-term memory,
fully connected deep neural networks,” In 2015 IEEE international conference on acoustics,
speech and signal processing (ICASSP) IEEE, pp. 4580-4584. 2015. 10.1109/ICASSP.2015.7178838

G. Zhang, C. Yan, X. Ji, T. Zhang, T. Zhang and W. Xu, “Dolphinattack: Inaudible voice
commands,” In Proceedings of the 2017 ACM SIGSAC conference on computer and communications
security, IEEE, pp. 103-117. 2017. 10.1145/3133956.3134052

A. Mohamed, H.-Y. Lee, L. Borgholt, J. D. Havtorn, J. Edin, C. Igel, K. Kirchhoff,
et al., “Self-supervised speech representation learning: A review,” IEEE Journal of
Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1179-1210, 2022. 10.1109/JSTSP.2022.3207050

X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, et
al., “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507-2522, 2023.
10.1109/TASLP.2023.3285283

S. Tuli and N. K. Jha, “EdgeTran: Device-aware co-search of transformers for efficient
inference on mobile edge platforms,” IEEE Transactions on Mobile Computing, vol. 23,
no. 6, pp. 7012-7029, 2023. 10.1109/TMC.2023.3328287

S. Dhar, J. Guo, J. Liu, S. Tripathi, U. Kurup and M. Shah, “A survey of on-device
machine learning: An algorithms and learning theory perspective,” ACM Transactions
on Internet of Things, vol. 2, no. 3, pp. 1-49, 2021. 10.1145/3450494

E. Seitzhan, A. Bissembayev, A. Mukasheva, H. S. Park and J. W. Kang, “A Study on
the Optimization Efficiency of Software Development with Low-Code Platforms,” Transactions
of the Korean Institute of Electrical Engineers, vol. 74, no. 5, pp. 957-968, 2025.
10.5370/KIEE.2025.74.5.957

L. H. X. Ng, A. C. M. Lim, A. X. W. Lim and A. Taeihagh, “Digital ethics for biometric
applications in a smart city,” Digital Government: Research and Practice, vol. 4,
no. 4, pp. 1-6, 2023. 10.1145/3630261

A. Koenecke, A. Nam, E. Lake, J. Nudell, M. Quartey, Z. Mengesha, C. Toups, J. R.
Rickford, D. Jurafsky and S. Goel, “Racial disparities in automated speech recognition,”
Proceedings of the national academy of sciences, vol. 117, no. 14, pp. 7684-7689,
2020. 10.1073/pnas.1915768117

K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon,
et al., “Towards federated learning at scale: System design,” Proceedings of machine
learning and systems, vol. 1, pp. 374-388, 2019. https://proceedings.mlsys.org/paper_files/paper/2019/file/7b770da633baf74895be22a8807f1a8f-Paper.pdf

P. Voigt and A. Von dem Bussche, “The EU General Data Protection Regulation (GDPR),”
A Practical Guide, 1st ed., Cham: Springer International Publishing, 2017. 10.1007/978-3-319-57959-7

S. Wachter, B. Mittelstadt and C. Russell, “Counterfactual explanations without opening
the black box: Automated decisions and the GDPR,” Harvard Journal of Law & Technology,
vol. 31, no. 2, pp. 841-887, 2017. 10.2139/ssrn.3063289

Author Biographies
He received the B.S. degree in Computer Systems and Software from Kazakh-British Technical
University (KBTU), Almaty, Kazakhstan, in 2022. Since 2024, he has been pursuing the
M.S. degree in Information Systems at the School of Information Technology and Engineering,
KBTU. He is currently working as a freelance Java Backend developer. His research
interests include machine learning, artificial intelligence, and voice biometric authentication.
She is studying at Nazarbayev University’s School of Sciences and Humanities and is
currently in her junior year as a sociology student. Her research interests include
a wide variety of subjects, including quantitative and qualitative research, as well
as policy implementation.
He received his B.S., M.S., and Ph.D. degrees in electronic engineering from Chung-Ang
University, Seoul, Korea, in 1995, 1997, and 2002, respectively. In March 2008, he
joined the Korea National University of Transportation, Republic of Korea, where he
currently holds the position of Professor in the Department of Transportation System
Engineering, the Department of SMART Railway System, and the Department of Smart Railway
and Transportation Engineering.
She received the B.S., M.S., and Ph.D. degrees from Satbayev University, Almaty, Kazakhstan,
in 2004, 2014, and 2020, respectively. In September 2023, she joined Kazakh-British
Technical University, where she is currently a professor in the School of Information
Technology and Engineering. Her research interests include Big Data, cyber security,
machine learning, and comparative studies of deep learning methods.