Microsoft's newest AI can "recreate any voice from a three-second sample clip"
Jan 12, 2023 · NottheBee.com

We all knew this was going to happen eventually. But darn it, aren't we supposed to have like 15 more years before it gets to this point? It's happening too fast!

Microsoft's latest foray into the world of artificial intelligence comes in the form of VALL-E, a transformer-based text-to-speech model that can "recreate any voice from a three-second sample clip". Cybersecurity experts say that without proper protections, it could be used for more realistic phishing attacks and to spread misinformation.

"Phishing attacks?" "Misinformation?" No way this thing ever gets used for nefarious purposes, right?

So how does this crazy new technology work? Here's a diagram from Microsoft:

Well, uh, I doubt many of us know what that means. So here's what the system's designers wrote about it:

[T]o synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder. The discrete acoustic tokens derived from an audio codec model enable us to treat TTS as conditional codec language modeling, and advanced prompting-based large-model techniques (as in GPTs) can be leveraged for the TTS tasks. The acoustic tokens also allow us to generate diverse synthesized results in TTS by using different sampling strategies during inference.

Me reading that:

Well, anyway, it's all very interesting and terrifying.
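
For the tinkerers in the audience, here is a toy Python sketch of the loop the designers describe above: start from phoneme tokens (what to say) plus the acoustic tokens of the short enrolled clip (who is saying it), then generate new acoustic tokens one at a time. To be loud about it: this is not Microsoft's code. `toy_logits` is a made-up stand-in for the real transformer, the token values and `VOCAB_SIZE` are invented, and the neural codec decoder that turns tokens back into audio is left out entirely.

```python
import math
import random

VOCAB_SIZE = 8  # hypothetical toy codebook size, chosen only for illustration


def toy_logits(context):
    """Made-up stand-in for the transformer: scores every possible next token.

    The scores are pseudo-random but depend only on the context, so the same
    context always yields the same scores.
    """
    seed = sum((i + 1) * tok for i, tok in enumerate(context))
    rng = random.Random(seed)
    return [rng.uniform(-2.0, 2.0) for _ in range(VOCAB_SIZE)]


def sample_token(logits, temperature, rng):
    """Pick the next token: greedy at temperature 0, softmax sampling otherwise."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(phoneme_tokens, prompt_acoustic_tokens, n_new, temperature=1.0, seed=0):
    """Autoregressively generate n_new acoustic tokens conditioned on both prompts."""
    rng = random.Random(seed)
    # The phonemes constrain the content; the enrolled clip constrains the speaker.
    context = list(phoneme_tokens) + list(prompt_acoustic_tokens)
    generated = []
    for _ in range(n_new):
        token = sample_token(toy_logits(context), temperature, rng)
        generated.append(token)
        context.append(token)  # each new token conditions the next step
    return generated


phonemes = [1, 4, 2]          # invented phoneme IDs for the text to speak
enrolled_clip = [5, 5, 0, 3]  # invented acoustic tokens from the short sample
print(generate(phonemes, enrolled_clip, n_new=6, temperature=0))
print(generate(phonemes, enrolled_clip, n_new=6, temperature=1.0, seed=42))
```

Setting `temperature` to 0 always picks the top-scoring token, while higher values sample more freely. That is the general idea behind getting "diverse synthesized results" out of different sampling strategies, minus the part where it actually sounds like your uncle.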
