actually the hardest part of a locally hosted voice assistant isn't the llm. it's making the tts tolerable to actually talk to every day.
the core issue is prosody: kokoro and piper are trained on read speech, but conversational responses have shorter breath groups and different stress patterns on function words. that's why numbers, addresses, and hedged phrases sound off even when everything else works.
the fix is training data composition. conversational and read speech have different prosody distributions and models don't generalize across them. for self-hosted, coqui xtts-v2 [1] is worth trying if you want more natural english output than kokoro.
btw i'm lily, cofounder of rime [2]. we're solving this for business voice agents at scale, not really the personal home assistant use case, but the underlying problem is the same.
80% of my home voice assistant requests really need no response other than an affirmative sound effect.
100% agree. I dont want a Yes, Got it, Will do or even worse, I have turned on the Bedroom Light. I want soft success ding or a low failure boop.
why would you want an audio notification for a light? it either turns on and it worked or it doesnt turn on. i see no value in having a ding or anything of the kind
if i imagine constant dinging whenever i enter a room and the motion sensor toggles the light innit i'd go mad
Talk back is how you make sure what you asked for is what happens.
An affirmative beep but the light does not turn on means you have to guess what did.
Star Trek got it right. two beeps, "Low High" = yup, "High Low" = nope
> actually the hardest part of a locally hosted voice assistant isn't the llm. it's making the tts tolerable to actually talk to every day.
I would argue that the hardest part is correctly recognizing that it's being addressed. 98% of my frustration with voice assistants is them not responding when spoken to. The other 2% is realizing I want them to stop talking.