ISSN: 3048-6815

Deepfake Speech Technology: Trends in Voice Cloning and Audio Generation

Abstract

Deepfake audio refers to the use of artificial intelligence techniques to synthesize realistic human speech. Leveraging deep learning models such as generative adversarial networks (GANs) and autoencoders, modern voice cloning systems can generate synthetic voices that are nearly indistinguishable from real human speech. This paper surveys recent advances in deepfake audio technology, focusing on the underlying methodologies, practical applications, and ethical concerns. We also examine existing detection methods and the regulatory challenges these advances pose.
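To make the autoencoder principle mentioned above concrete, the sketch below trains a minimal linear autoencoder on frames of a toy waveform: each frame is compressed through a narrow bottleneck and reconstructed, which is the core mechanism voice-conversion systems build on. All shapes, hyperparameters, and the toy data are illustrative assumptions, not taken from any specific system surveyed here.

```python
# Minimal sketch of the autoencoder idea behind many voice-conversion
# systems: waveform frames are encoded to a low-dimensional bottleneck
# and decoded back; training minimizes reconstruction error.
import numpy as np

rng = np.random.default_rng(0)

# Toy "speech": frames of sinusoids with slightly varying frequency.
frame_len, n_frames, bottleneck = 32, 256, 4
t = np.arange(frame_len)
frames = np.stack([
    np.sin(2 * np.pi * (3 + rng.random()) * t / frame_len)
    for _ in range(n_frames)
])

# Linear encoder (32 -> 4) and decoder (4 -> 32), small random init.
W_e = rng.normal(scale=0.1, size=(frame_len, bottleneck))
W_d = rng.normal(scale=0.1, size=(bottleneck, frame_len))
lr = 0.005

def loss(X):
    """Mean squared reconstruction error of the autoencoder."""
    return np.mean((X @ W_e @ W_d - X) ** 2)

initial = loss(frames)
for _ in range(1500):
    code = frames @ W_e              # encode frames to the bottleneck
    recon = code @ W_d               # decode back to frame length
    err = recon - frames             # reconstruction error
    # Gradients of the mean squared error w.r.t. both weight matrices.
    g_d = code.T @ err / n_frames
    g_e = frames.T @ (err @ W_d.T) / n_frames
    W_d -= lr * g_d
    W_e -= lr * g_e

final = loss(frames)
print(f"reconstruction MSE: {initial:.4f} -> {final:.4f}")
```

Real voice-cloning systems replace the linear maps with deep networks, operate on mel-spectrogram features rather than raw frames, and condition the decoder on a speaker embedding; the compress-and-reconstruct objective, however, is the same.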


How to Cite

Pritee Nilesh Fuldeore (2025, August 25). Deepfake Speech Technology: Trends in Voice Cloning and Audio Generation. JANOLI International Journal of Artificial Intelligence and its Applications, Issue 4.