- Nvidia released PersonaPlex, an AI model that listens and speaks simultaneously
- PersonaPlex uses hybrid training with real and synthetic human conversation data
- It outperformed rivals in dialogue naturalness but audio quality is phone-like
Nvidia, a US-based tech giant, recently released PersonaPlex, a conversational Artificial Intelligence (AI) model that can listen and speak at the same time, instead of waiting for a user to finish speaking before responding. The system combines voice and text prompts, and also handles interruptions, backchannels and emotional responses.
Voice assistants are nothing new, but most of them respond poorly, sounding more like a machine than a conversation partner.
Nvidia's PersonaPlex, by contrast, allows seamless interaction with human-like timing and tone, using simple conversational cues such as "uh-huh", "got it", "oh okay", "okay", "yeah" and "yeah, I think they do" to signal active listening.
"Full-duplex models like Moshi finally made AI conversations feel natural with real-time listening and speaking, but locked you into a single fixed voice and role," the company said, highlighting the rapid evolution of AI, which will be a point of discussion at the NDTV Ind.AI Summit, hosted by NDTV on February 18.
How was PersonaPlex trained?
According to the official statement, PersonaPlex uses a hybrid training mix that combines real human conversation recordings from the Fisher English Corpus with synthetic dialogues.
The synthetic transcripts were created using Qwen3-32B and GPT-OSS-120B, large language models specialising in advanced reasoning and coding. The speech itself was generated with Chatterbox TTS, a high-speed open-source text-to-speech model.
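Nvidia has not published the exact data-mixing recipe, but a hybrid real-plus-synthetic mix of this kind can be sketched roughly as follows. Everything here is a hypothetical illustration, not PersonaPlex's actual pipeline: the function name `build_hybrid_mix`, the sample dictionaries standing in for Fisher recordings and LLM-generated dialogues, and the 50/50 ratio are all assumptions.

```python
import random

def build_hybrid_mix(real_dialogues, synthetic_dialogues,
                     synthetic_fraction=0.5, seed=0):
    """Combine real and synthetic dialogue samples into one shuffled list.

    `synthetic_fraction` is the share of the final mix that should be
    synthetic; the synthetic count is scaled relative to the real data.
    """
    rng = random.Random(seed)
    n_real = len(real_dialogues)
    # Solve n_synth / (n_real + n_synth) = synthetic_fraction for n_synth.
    n_synth = int(n_real * synthetic_fraction / (1 - synthetic_fraction))
    n_synth = min(n_synth, len(synthetic_dialogues))
    mix = list(real_dialogues) + rng.sample(synthetic_dialogues, n_synth)
    rng.shuffle(mix)
    return mix

# Stand-ins for Fisher recordings and LLM-generated transcripts.
real = [{"source": "fisher", "id": i} for i in range(100)]
synth = [{"source": "synthetic", "id": i} for i in range(200)]

mix = build_hybrid_mix(real, synth, synthetic_fraction=0.5)
print(len(mix))  # 200: 100 real + 100 synthetic
```

The shuffle matters: interleaving the two sources prevents the model from seeing long runs of only-real or only-synthetic conversations during training.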
The hype around PersonaPlex
Nvidia claimed that PersonaPlex outperformed commercial and open-source rivals in naturalness and interruption handling. It achieved a Dialogue Naturalness Mean Opinion Score (MOS) of 3.90, surpassing Gemini Live (3.72) and Moshi (3.11).
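A Mean Opinion Score is simply the arithmetic mean of human listener ratings, conventionally on a 1 (bad) to 5 (excellent) scale. A minimal illustration follows; the panel ratings below are made up for the example and have nothing to do with the published 3.90, 3.72 and 3.11 figures, which come from Nvidia's own evaluation.

```python
def mean_opinion_score(ratings):
    """Mean Opinion Score: arithmetic mean of listener ratings,
    conventionally on a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)

# Hypothetical eight-listener panel rating one dialogue sample.
panel = [4, 4, 5, 3, 4, 4, 3, 4]
print(round(mean_opinion_score(panel), 2))  # 3.88
```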
But the audio is 24 kHz and can sound somewhat "phone-like" compared with studio-quality voices.
Though impressive, PersonaPlex is not a finished consumer product, with some online users reporting that it can sound synthetic or less intelligent than high-end text LLMs. So far, it only supports English, though future expansions may add more languages.
"English only, doesn't work in production. Need a full AI / ML team to make it work in production, would take years to actually make inference work without massive hallucinations," one user wrote on X.
"The model, built on the open Moshi architecture, excels in benchmarks like FullDuplexBench for low-latency responses (0.17s for turn-taking) and interruption handling, but supports only English and requires NVIDIA GPUs for optimal performance," said another.
"Wow this is a game-changer for sales calls," a third user wrote, highliting it strengths.