There are literally two new dictation apps on Show HN every week: https://hn.algolia.com/?dateRange=pastWeek&page=0&prefix=fal...
This one is unique in that it supports iPhone. I haven't seen mobile support very often.
Despite all these apps, there are two things holding me back from using a dictation app on a regular basis:
- streaming transcription: see words in realtime
- multimodal input: mix voice with keyboard
So I started prototyping this type of realtime multimodal dictation UX: https://rift-transcription.vercel.app
This HN comment captures why streaming is important for transcription: https://hw.leftium.com/#/item/47149479
Streaming transcription is something I’m working on. The main challenge so far has been accuracy: streaming models, especially cloud ones, often lose enough quality that the tradeoff stops being worth it. Local models look more promising, so streaming will likely land there first.
On multimodal input, the UX you’re prototyping where you switch between dictating and typing while composing is interesting. I haven’t really seen that approach before.
The direction I took is a bit different. Instead of mixing modalities mid-composition, dictation becomes context-aware during post-processing. Selected/copied text or the surrounding field content can be inserted into the post-processing prompt, so the spoken input is interpreted relative to what’s already on screen.
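A minimal sketch of what that prompt assembly could look like (all names here are mine for illustration, not from the actual implementation):

```typescript
// Hypothetical sketch: fold on-screen context into the post-processing
// prompt so the model interprets dictation relative to what's already there.
interface ScreenContext {
  selectedText?: string; // text the user highlighted before dictating
  fieldContent?: string; // content of the focused input field
}

function buildPostProcessPrompt(dictated: string, ctx: ScreenContext): string {
  const parts = [
    "Clean up the dictated text so it fits the surrounding context.",
  ];
  if (ctx.selectedText) parts.push(`Selected text:\n${ctx.selectedText}`);
  if (ctx.fieldContent) parts.push(`Field content:\n${ctx.fieldContent}`);
  parts.push(`Dictated:\n${dictated}`);
  return parts.join("\n\n");
}
```

The nice property is that "replace that with X" style commands become interpretable, because the prompt carries what "that" refers to.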
Yeah, I will add post-processing to my prototype, too. I already prepared a detailed spec (I'm prototyping new approaches to this, as well): https://github.com/Leftium/rift-transcription/blob/main/spec...
One idea I was tossing around was streaming transcription + batch re-transcription:
- Use streaming transcription, which works most of the time (for example, I've found the Web Speech API pretty good, as well as moonshine)
- If the streaming transcription was poor, select the bad part and re-transcribe with a more accurate batch transcription model.
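One way to wire that up, assuming the streaming model emits word-level timestamps (names are hypothetical, and the batch API call itself is elided): use the selection to find the covering audio span, send that slice to the batch model, and splice the result back in.

```typescript
// Hedged sketch of the select-and-re-transcribe flow. Assumes the streaming
// transcriber yields word-level timestamps; all names are illustrative.
interface TimedWord { word: string; start: number; end: number } // seconds

// Audio span covering the words the user selected for re-transcription.
function spanForSelection(words: TimedWord[], from: number, to: number) {
  return { start: words[from].start, end: words[to].end };
}

// Replace the selected words with the batch model's output.
function spliceCorrection(
  words: TimedWord[], from: number, to: number, corrected: TimedWord[],
): TimedWord[] {
  return [...words.slice(0, from), ...corrected, ...words.slice(to + 1)];
}
```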
I tested something similar, and continuous re-transcription was the only way I could get close to batch-level accuracy.
In my current implementation I’m fairly aggressive with it. I don’t rely much on streaming word confidence. Instead I continuously reprocess audio using a sliding window: as new audio comes in, it’s re-transcribed together with the previous segment, so the model always sees a longer context.
That recovers a lot of the accuracy lost with streaming, but the amount of re-transcription makes it hard to justify economically with cloud APIs. That’s why I’m focusing on a local-first approach for now.
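For anyone curious what the merge step can look like: one simple approach (my sketch under stated assumptions, not necessarily the parent's implementation) is to treat each window's output as overlapping the committed transcript, drop the longest matching overlap, and append only the genuinely new words.

```typescript
// Sketch of merging a sliding-window re-transcription into committed text:
// find the longest suffix of `committed` that matches a prefix of the new
// window's words, then append the rest. A real system would also let the
// window revise the overlapping tail; this version keeps it append-only.
function mergeWindow(committed: string[], windowWords: string[]): string[] {
  const max = Math.min(committed.length, windowWords.length);
  for (let k = max; k > 0; k--) {
    const suffix = committed.slice(committed.length - k);
    if (suffix.every((w, i) => w === windowWords[i])) {
      return [...committed, ...windowWords.slice(k)];
    }
  }
  return [...committed, ...windowWords]; // no overlap found
}
```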