Only 73% Accuracy: The Chilling Reality of Speech Deepfake Detection
A study from University College London has found that, for native speakers of both English and Mandarin, the accuracy rate for distinguishing artificially synthesized voices from genuine ones is only 73%. Published in the journal “PLOS ONE,” the research is the first to assess the ability of speakers of a language other than English to recognize deepfake voices.
Deepfakes, which aim to replicate a real person’s voice or appearance, fall within the realm of generative artificial intelligence. They are an application of machine learning (ML): algorithms are trained on datasets, such as recordings of real people, to recognize their patterns and characteristics and then recreate the original sound or image as closely as possible.
Early deepfake voice algorithms required large amounts of an individual’s voice samples to generate closely matching audio. The latest pre-trained algorithms, by contrast, need only a three-second snippet of a person’s speech to reproduce their voice. Moreover, these open-source algorithms and tools are freely available online and require minimal expertise, so an ordinary person can learn to use the technology within days.
However, weighed against the benefits this technology provides, the potential hazards it poses merit greater scrutiny, above all the risk of fraud. As early as 2019, the CEO of a British energy company was deceived by a forged voice, leading him to transfer hundreds of thousands of pounds to a fraudulent supplier.
Current research on discerning deepfakes focuses mainly on automated machine-learning detection systems, with little attention given to human discernment. Researchers at University College London recognized this gap and set out to fill it. They used a text-to-speech (TTS) algorithm, trained on publicly available datasets in both English and Mandarin, to generate 50 deepfake voice samples for each language. These artificial samples, along with their genuine counterparts, were then played to 529 participants, who were asked to judge whether each voice was real or fabricated.
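The evaluation described above reduces to a simple binary-classification measure: each listener labels each clip as real or fake, and accuracy is the fraction of correct judgments. A minimal sketch of that computation follows; the `detection_accuracy` function and the sample judgments are illustrative assumptions, not data or code from the actual study.

```python
def detection_accuracy(judgments):
    """Fraction of correct real/fake judgments.

    judgments: list of (labeled_fake, is_fake) boolean pairs,
    one per clip a listener heard.
    """
    correct = sum(1 for labeled, actual in judgments if labeled == actual)
    return correct / len(judgments)

# One hypothetical listener hearing four clips (two genuine, two deepfakes):
# they catch one deepfake, miss the other, and judge both real clips correctly.
listener = [
    (True, True),    # deepfake, correctly flagged
    (False, True),   # deepfake, mistaken for real
    (False, False),  # real, correctly judged real
    (False, False),  # real, correctly judged real
]
print(f"accuracy: {detection_accuracy(listener):.0%}")  # prints "accuracy: 75%"
```

Averaging this per-listener accuracy across all participants and clips yields the study's headline figure of roughly 73%.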
The results showed that English and Mandarin speakers performed comparably, correctly identifying deepfakes only 73% of the time. Training participants beforehand to recognize deepfakes did not significantly improve their accuracy. Furthermore, because participants knew that some of the clips were deepfakes, and because the researchers did not use the most advanced voice synthesis technology available, people in real-world settings would likely perform even worse.
Professor Lewis Griffin of University College London’s Department of Computer Science remarked that generative AI brings myriad benefits alongside serious risks, and urged governments and organizations to implement policies to prevent its misuse. Deepfakes are likely to become even harder to detect in the future. Based on their findings, the researchers concluded that expecting people to reliably discern deepfakes is unrealistic, and that effort should instead focus on improving machine-learning detection systems.