Futurology

Incredible Demo of AI voice to voice model: Crossing the uncanny valley of conversational voice

March 1, 2025

TFenrir on March 1, 2025 7:13 pm

“Even though recent models produce highly human-like speech, they struggle with the one-to-many problem: there are countless valid ways to speak a sentence, but only some fit a given setting. Without additional context—including tone, rhythm, and history of the conversation—models lack the information to choose the best option. Capturing these nuances requires reasoning across multiple aspects of language and prosody.

To address this, we introduce the Conversational Speech Model (CSM), which frames the problem as an end-to-end multimodal learning task using transformers. It leverages the history of the conversation to produce more natural and coherent speech. There are two key takeaways from our work. The first is that CSM operates as a single-stage model, thereby improving efficiency and expressivity. The second is our evaluation suite, which is necessary for evaluating progress on contextual capabilities and addresses the fact that common public evaluations are saturated.”