Voice is a super interesting modality right now - maybe the first modality we're seeing move to open-source models across a number of scale-ups and enterprises. Reliability concerns, high costs, and the improving performance of open-source models are pushing engineers to do their own fine-tuning rather than relying on third-party vendors of proprietary models. Many of these orgs have already been collecting their own first-party data, and with data vendors like Extrian, David AI, etc., they can now train really high-quality models. RL has been insanely hyped, but it's been unclear how long it would take scale-ups and enterprises to actually lean in. Voice AI might be hitting that inflection point faster than expected.
The harness conversation is the interesting one right now, in my opinion. Models are converging; the differentiator is what sits around them. Fine-tuning optimizes the model. The harness optimizes the relationship between a specific person and a specific model. Both matter, but only one compounds with use. I'm really curious to see how that extends to voice.
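To make the distinction concrete, here's a toy sketch (every name in it is invented, not from any real library): the model stays frozen, and the harness accumulates user-specific context around it, which is why the harness compounds with use while a one-off fine-tune doesn't:

```python
# Toy sketch: the model is a frozen black box; the harness accumulates
# user-specific context that gets injected on every call.
# All names (Harness, call_model, learn) are invented for illustration.

class Harness:
    def __init__(self, call_model):
        self.call_model = call_model  # any fixed chat model, treated as a black box
        self.memory = []              # durable per-user preferences and corrections

    def learn(self, note: str) -> None:
        # The compounding step: feedback persists across sessions,
        # so the same frozen model gets more useful over time.
        self.memory.append(note)

    def run(self, prompt: str) -> str:
        # Fine-tuning would change call_model itself; the harness instead
        # changes what the model sees, based on what it knows about this user.
        context = "\n".join(self.memory)
        return self.call_model(f"{context}\n\nUser: {prompt}")

# Stub model so the sketch runs end to end.
h = Harness(call_model=lambda p: f"model saw:\n{p}")
h.learn("Prefers terse answers; codebase is in Go.")
print(h.run("How do I parse JSON?"))
```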
The whole RL fine-tuning space is really fascinating, I think. At OpsCompanion, we're slowly stumbling onto something that could be quite powerful: a full eval system for the entire SDLC.
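Roughly the shape of it, heavily simplified and with hypothetical names: the same scored cases gate deploys in CI and can double as reward signals for RL fine-tuning.

```python
# Heavily simplified sketch (all names hypothetical): one set of scored
# eval cases serves both CI gating and RL fine-tuning rewards.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    score: Callable[[str], float]  # 1.0 = pass, 0.0 = fail

def run_evals(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    # In CI this average gates a deploy; during training, the per-case
    # scores can be fed back as rewards for RL fine-tuning.
    return sum(case.score(model(case.prompt)) for case in cases) / len(cases)

cases = [
    EvalCase(
        prompt="Summarize: the deploy failed at step 3",
        score=lambda out: float("step 3" in out),
    ),
]

# Stub model so the sketch runs end to end.
print(run_evals(lambda p: f"Noted: {p}", cases))
```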
Check out Attention Labs. They're helping voice models and agents hear and understand reliably.
I don't think open-source voice is a thing in any meaningful way. Current open-source models don't scale, sound awful, and fine-tuning isn't straightforward due to voice consent issues. Open source is a much bigger deal for LLMs, with Mistral, DeepSeek, or Llama.