There are literally two new dictation apps on Show HN every week: https://hn.algolia.com/?dateRange=pastWeek&page=0&prefix=fal...
This one is unique in that it supports iPhone. I haven't seen mobile support very often.
Despite all these apps, there are two things holding me back from using a dictation app on a regular basis:
- streaming transcription: see words in realtime
- multimodal input: mix voice with keyboard
So I started prototyping this type of realtime multimodal dictation UX: https://rift-transcription.vercel.app
This HN comment captures why streaming is important for transcription: https://hw.leftium.com/#/item/47149479
Streaming transcription is something I’m working on. The main challenge so far has been accuracy: streaming models, especially cloud ones, often lose enough quality that the tradeoff stops being worth it. Local models look more promising, so streaming will likely land there first.
On multimodal input, the UX you’re prototyping where you switch between dictating and typing while composing is interesting. I haven’t really seen that approach before.
The direction I took is a bit different. Instead of mixing modalities mid-composition, dictation becomes context-aware during post-processing. Selected/copied text or the surrounding field content can be inserted into the post-processing prompt, so the spoken input is interpreted relative to what’s already on screen.
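A minimal sketch of what that prompt assembly could look like (all names here are mine for illustration, not from the actual implementation):

```typescript
// Hypothetical sketch: fold on-screen context into the post-processing
// prompt so the model interprets dictation relative to what's already there.
interface ScreenContext {
  selectedText?: string; // text the user highlighted before dictating
  fieldContent?: string; // content of the focused input field
}

function buildPostProcessPrompt(dictated: string, ctx: ScreenContext): string {
  const parts = [
    "Clean up the dictated text so it fits the surrounding context.",
  ];
  if (ctx.selectedText) parts.push(`Selected text:\n${ctx.selectedText}`);
  if (ctx.fieldContent) parts.push(`Field content:\n${ctx.fieldContent}`);
  parts.push(`Dictated:\n${dictated}`);
  return parts.join("\n\n");
}
```

The nice property is that "replace that with X" style commands become interpretable, because the prompt carries what "that" refers to.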
Yeah, I will add post-processing to my prototype, too. I already prepared a detailed spec (I'm prototyping new approaches to this, as well): https://github.com/Leftium/rift-transcription/blob/main/spec...
One idea I was tossing around was streaming transcription + batch re-transcription:
- Use streaming transcription, which works most of the time (for example, I've found the Web Speech API pretty good, as well as moonshine)
- If the streaming transcription was poor, select the bad part and re-transcribe with a more accurate batch transcription model.
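One way to wire that up, assuming the streaming model emits word-level timestamps (names are hypothetical, and the batch API call itself is elided): use the selection to find the covering audio span, send that slice to the batch model, and splice the result back in.

```typescript
// Hedged sketch of the select-and-re-transcribe flow. Assumes the streaming
// transcriber yields word-level timestamps; all names are illustrative.
interface TimedWord { word: string; start: number; end: number } // seconds

// Audio span covering the words the user selected for re-transcription.
function spanForSelection(words: TimedWord[], from: number, to: number) {
  return { start: words[from].start, end: words[to].end };
}

// Replace the selected words with the batch model's output.
function spliceCorrection(
  words: TimedWord[], from: number, to: number, corrected: TimedWord[],
): TimedWord[] {
  return [...words.slice(0, from), ...corrected, ...words.slice(to + 1)];
}
```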
I tested something similar, and continuous re-transcription was the only way I could get close to batch-level accuracy.
In my current implementation I’m fairly aggressive with it. I don’t rely much on streaming word confidence. Instead I continuously reprocess audio using a sliding window: as new audio comes in, it’s re-transcribed together with the previous segment, so the model always sees a longer context.
That recovers a lot of the accuracy lost with streaming, but the amount of re-transcription makes it hard to justify economically with cloud APIs. That’s why I’m focusing on a local-first approach for now.
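For anyone curious what the merge step can look like: one simple approach (my sketch under stated assumptions, not necessarily the parent's implementation) is to treat each window's output as overlapping the committed transcript, drop the longest matching overlap, and append only the genuinely new words.

```typescript
// Sketch of merging a sliding-window re-transcription into committed text:
// find the longest suffix of `committed` that matches a prefix of the new
// window's words, then append the rest. A real system would also let the
// window revise the overlapping tail; this version keeps it append-only.
function mergeWindow(committed: string[], windowWords: string[]): string[] {
  const max = Math.min(committed.length, windowWords.length);
  for (let k = max; k > 0; k--) {
    const suffix = committed.slice(committed.length - k);
    if (suffix.every((w, i) => w === windowWords[i])) {
      return [...committed, ...windowWords.slice(k)];
    }
  }
  return [...committed, ...windowWords]; // no overlap found
}
```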