This will be open sourced in a few weeks under the Apache 2.0 license. Apparently it's built off of Llama. I had some people try it. One person blushed while chatting with the voices. I think the male voice in particular will appeal to a surprising number of people, not just the stereotypical female "Her" voice we're all expecting.

https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#demo

1 Comment

  1. “Even though recent models produce highly human-like speech, they struggle with the one-to-many problem: there are countless valid ways to speak a sentence, but only some fit a given setting. Without additional context—including tone, rhythm, and history of the conversation—models lack the information to choose the best option. Capturing these nuances requires reasoning across multiple aspects of language and prosody.

    To address this, we introduce the Conversational Speech Model (CSM), which frames the problem as an end-to-end multimodal learning task using transformers. It leverages the history of the conversation to produce more natural and coherent speech. There are two key takeaways from our work. The first is that CSM operates as a single-stage model, thereby improving efficiency and expressivity. The second is our evaluation suite, which is necessary for evaluating progress on contextual capabilities and addresses the fact that common public evaluations are saturated.”