This is not a new model. Also, it hallucinates a lot. Also, it's very heavy and slow at inference. It's also bad at multilingual input.
Edit: I'm talking purely about speech to text (STT). Not sure about the other things this can do.
It has some perks, is a bit more expressive in some cases, but overall is trained on really noisy data, uses more memory, and isn't that fast - I'm talking about the (7b?) version that they released then removed quickly (vibevoice-community on github) - I still use chatterbox turbo and sometimes qwen TTS.
Yeah, I don't get why it is suddenly getting so much attention today, it is all over twitter too
Simonw (who has a bit of a Midas touch for posts here) just posted about it https://simonwillison.net/2026/Apr/27/vibevoice/
To be fair, his Midas touch is a result of consistency and a lot of hard work.
It's like the gardener at one of the Oxford colleges said - it's really easy to create these perfect lawns, just turn up every day and trim and water it - for a couple hundred years.
I thought they rolled it as well?
As always with people: listen to what they say, not to what they do...
After all, they rarely do what they say themselves, so it's surely not entirely made up nonsense!
there is so much more subversive marketing out there than any of us can really fathom. i try not to be too paranoid but it's getting a lot harder every day.
i know someone who worked in what we might call the 'astroturfing' space within the entertainment industry. after having a few discussions with him and with things like this[0] becoming more known, it's really difficult to afford any assumption of organic intent when money is on the line - especially at the scale that microsoft works at compared to something as comparatively quaint as the music industry.
[0] https://www.wired.com/story/geese-chaotic-good-marketing-ind...
well duh, they updated the news section
https://github.com/microsoft/VibeVoice/commit/e73d1e17c3754f...
which is microsoft for "we removed two dead links". AI innovation knows no limits!
Interestingly that seems to be in response to [1], which might indeed be the trigger for this.
[1] https://doublepulsar.com/microsoft-vibing-capturing-screensh...
It is not good for text to speech (TTS) either. I have been trying it for a few days. First of all, there is no documentation for the 1.5B model. The 0.5B realtime model is a shit model. I was converting text line by line, and it was randomly adding music and couldn't handle special characters like "…".
I am really disappointed with this model, to say the least.
The 7B parameter VibeVoice TTS model is still the most impressive local TTS model I've tried. It was pulled by Microsoft a few days after its release due to "abuse potential" but it can be found in various community-maintained Hugging Face repos.
yep, it seems this was trained on a large amount of podcasts with ad jingles or phone call queues with elevator music. I was also pretty disappointed when I ran the TTS last week.
Yes, the SOTA is currently much more advanced.
What do you consider to be SOTA?
you saved us a lot of time here.... i unstarred the repo
moving on....
I don't really pay attention to stars. Do people use them as bookmarks? Why would you star a repo if you knew so little about it?
Stars for me are basically "this might be interesting but I don't have time to look at it now, hopefully I'll think about it later and give it a second look".
I exclusively use stars as bookmarks which is why I always found it strange when people talked about lots of stars meaning high quality or trustworthy…I’ve learned since then that I’m probably in the minority (both in using stars as bookmarks and not caring about how many stars a repo has).
Judging by how many people are apparently paying bots to give their lazily vibe-coded repos thousands of stars, it seems like people simultaneously take stars seriously and don't take them seriously at all. It breaks my brain.
Saved a lot of my time thanks!
I'm shocked, shocked to find that Microsoft takes credit for a slow, unoriginal product that doesn't actually do what it advertises.
Imagine the balls it took to willingly attach the Microsoft label to the front of the product that is Teams.
I mean the same can be said about most versions of Windows as well. People act like Windows 11 is where it all went sour, but I've personally kind of hated it since Windows XP.
I feel like a recurring pattern with Microsoft is to create something quickly, market it aggressively and push for everyone to use it immediately, and only once it is installed everywhere do people suddenly realize how terrible it is, but it's too late to change.
I'm surprised you picked XP as the falling point. I didn't enjoy the days of reinstalling 95/98/ME every 6 months to avoid driver weirdness and seemingly random failures. XP was built on the foundation of 2000, which tended to make it more robust vs. its predecessors.
Vista on the other hand...
I mean, part of it is that I really hated the Fisher Price look to it, but it was also the first time I ever felt like I had to "hack" things to make stuff work. I had to muck with registry keys. Oh, and it was the first time that I noticed that Windows repair tools do not work.
I suspect I might have hated 9x more, but I was pretty young when those versions came out and I didn't really "get into" computers until XP, and I disliked it enough to dual-boot Linux as a twelve-year-old.
You just saved me an afternoon.
[flagged]
The nuance is lost on LLM agentic dominant partakers.