Last year saw the emergence of artificial intelligence (AI) tools that can create images, artwork, or even video from a text prompt.
Now, just a few days into 2023, another powerful use case for AI has stepped into the limelight - a text-to-voice tool that can impeccably mimic a person’s voice.
Developed by Microsoft, VALL-E can take a three-second recording of someone’s voice, and replicate that voice, turning written words into speech, with realistic intonation and emotion depending on the context of the text.
Trained on 60,000 hours of English speech recordings, it can generate speech in a "zero-shot situation," meaning it can mimic a voice it never encountered during training, without any further training on that speaker.
Introducing VALL-E in a paper posted on arXiv, the preprint server hosted by Cornell University, the developers explained that the recordings came from more than 7,000 unique speakers.
The team say their text-to-speech (TTS) system used hundreds of times more data than existing TTS systems, helping them to overcome the zero-shot issue.
The tool is not currently available for public use - but it does raise questions about safety, given it could feasibly be used to make any text sound as though it were spoken in anybody’s voice.
Its creators have, however, provided a demo, showcasing a number of three-second speaker prompts alongside the text-to-speech output that mimics each voice.
Alongside the speaker prompt and VALL-E’s output, you can compare the results with the "ground truth" - the actual speaker reading the prompt text - and the "baseline" result from current TTS technology.
Microsoft betting big on AI
Microsoft has invested heavily in AI and is one of the backers of OpenAI, the company behind ChatGPT and the text-to-image art tool DALL-E.
The software giant invested $1 billion (€930 million) in OpenAI in 2019, and a report this week on semafor.com said Microsoft was looking at investing another $10 billion (€9.3 billion) in the company.