This is a tough one. It's systemic: MS provides a "best fit" code mapping from wide Unicode to ASCII, which is a known, published, "vibes-based" mapper. This best-fit mapper is used in a lot of places, and I'm sure it's required for ongoing inclusion given how MS views backward compatibility. It's linked in by default everywhere, whether or not you know you included it.
The exploits largely revolve around speccing an unusual code point that "vibes" into, say, a slash or a hyphen or quotes. These code points are typically evaluated one way (correct full Unicode evaluation) inside a modern programming language, but when passed to shell commands or other Win32 API calls they are vibes-downed. Crucially, this happens after you've checked them, at the point where you've already passed control.
To quote the curl maintainer, "curl is a victim" here; but who is the culprit? It seems certain that curl will be used by servers to retrieve user-supplied data automatically in the future. When such a server sees user input one way during validation and the system libraries see it another way, you're going to have a problem.
It seems to me like maybe the solution is to provide an opt-out of "best fit" munging in the Win32 space, but I'm not a Windows guy, so I'm speculating. At least then open-source providers could add the opt-out to best practices, and deal with the many terrible problems that things like a Unicode wide variant of " or \ deliver to them.
And of course even if you do that, you’ll interact with officially shipped APIs and software that has not opted out.
The opt-out is to use the Unicode Windows APIs (the functions ending in "W" instead of "A"). This also magically fixes all issues with paths longer than 260 characters (if you add a "\\?\" prefix or set your manifest correctly), and it has been available and recommended since Windows XP.
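A minimal sketch of what that looks like (the path and file name are purely illustrative):

    #include <windows.h>

    int main(void)
    {
        /* The "W" API takes UTF-16 directly, so no best-fit mapping runs,
           and the "\\?\" prefix lifts the 260-character MAX_PATH limit. */
        HANDLE h = CreateFileW(L"\\\\?\\C:\\some\\very\\long\\path\\file.txt",
                               GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;
        CloseHandle(h);
        return 0;
    }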
I'm not sure why the non-unicode APIs are still so commonly used. I can't imagine it's out of a desire to support Windows 98 or Windows 2000.
_Or_ set your application to use UTF-8 for the "A" APIs. Apparently this is supported as of a Windows 10 update from 2019. [1]
[1] https://learn.microsoft.com/en-us/windows/apps/design/global...
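For reference, the opt-in described in [1] is an application manifest fragment along these lines (from memory, so double-check it against the docs before relying on it):

    <?xml version="1.0" encoding="UTF-8"?>
    <assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
      <application>
        <windowsSettings>
          <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
        </windowsSettings>
      </application>
    </assembly>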
It should have been supported approximately 20 years earlier than that. I was coding against Win32 looong before 2019 and wondering for years why they wouldn't let you.
An explanation I heard ~10 years prior is that doing so exposed bugs in CRT and nobody wanted to fix them.
> An explanation I heard ~10 years prior is that doing so exposed bugs in CRT and nobody wanted to fix them.
What I've heard is that the issue is not with the CRT, but with applications using fixed-size byte buffers. IIRC, converting from UTF-16 to any of the traditional Windows code pages requires at most two bytes for each UTF-16 code unit, while the UTF-8 "code page" can need three bytes. That would lead to buffer overflows in these legacy applications if the "ANSI" code page was changed to UTF-8.
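A quick way to see the mismatch (a sketch; the euro sign is one byte in CP-1252 and one UTF-16 code unit, but three UTF-8 bytes):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        wchar_t euro[] = L"\u20ac";  /* one UTF-16 code unit */
        /* Passing a NULL buffer asks for the required size in bytes. */
        int utf8 = WideCharToMultiByte(CP_UTF8, 0, euro, -1, NULL, 0, NULL, NULL);
        int ansi = WideCharToMultiByte(1252, 0, euro, -1, NULL, 0, NULL, NULL);
        printf("CP-1252: %d bytes, UTF-8: %d bytes (both incl. NUL)\n", ansi, utf8);
        /* prints "CP-1252: 2 bytes, UTF-8: 4 bytes" */
        return 0;
    }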
Not sure what that has to do with CRT, given that it isn't part of Win32.
The CRT, in the form of the msvcrt.dll file, has had a de facto presence in Windows since the end of the 1990s. Later, since 2018 or so, CRT availability was formalized in the Windows API in the form of the ucrtbase.dll module.
msvcrt was never for applications to use: https://devblogs.microsoft.com/oldnewthing/20140411-00/?p=12...
The one bundled with Windows wasn't. However, the same "feature" exists in redistributed versions of msvcrt.
Which doesn't change the fact that Win32 doesn't depend on it.
It is extremely hard to create an application that doesn't depend on the CRT on Windows. The CRT provides the tables for SEH exception handlers as well as the default event handlers, and Win32 headers have hard dependencies on the handler tables the CRT provides. So you need to go quite a bit out of your way and hack deep Win32 headers. Loading DLLs etc. may also call CRT functions.
You can read the Mingw64 source to see how many hacks they had to do to make it work.
That's the "vcruntime" not the "ucrt". There has been a distinction since the ucrt was made an official part of the OS.
It's very easy to make a win32 program without the ucrt filesystem APIs so long as you don't mind being platform-specific (or making your own cross-platform wrappers).
It's still an important piece of the app compatibility story.
Does that mean that in this UTF-8 mode, GetCommandLineA would, when the full-width double quote occurs in the command line, return the UTF-8 bytes for that double quote, rather than steamrolling it to an ASCII double quote with the WorstFit mapping?
Yes, I wanted to suggest the same. A while ago I modified some old tools I wrote 15 years ago to do that. Not because I was aware of any vulnerability, but because a few places still used char* and I figured this would basically make them never fail with any weird filenames, regardless of the code page.
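If you want to see it for yourself, a throwaway check might look like this (a sketch; EF BC 82 is the UTF-8 encoding of the full-width quote U+FF02 from the article):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Dump the raw bytes of the "A" command line: with the UTF-8 code
           page active you should see EF BC 82 where U+FF02 was typed,
           instead of a best-fit plain 0x22 quote. */
        const unsigned char *p = (const unsigned char *)GetCommandLineA();
        for (; *p; p++)
            printf("%02X ", *p);
        printf("\n");
        return 0;
    }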
So now it seems even if you think your app is fully Unicode, still do this just in case? :)
It sounds like something Cygwin ought to do across their ecosystem.
As mentioned elsewhere in this discussion, 99% of the time the cause is likely the use of standard C functions (or C++ `std::string`…) instead of MS's nonstandard wide versions. Which of course is a ubiquitous practice in portable command-line software like curl.
A lot of detail is in the linked curl HackerOne report: https://hackerone.com/reports/2550951
So the culprit is still the software writer: they should have wrapped the C++ library to handle OS-specific behavior on Windows. As it stands, they are publishing buggy software and calling it cross-platform.
curl was first released in 1996, shortly after Windows 95 was born, and it runs on numerous Windows versions even today. So, how many different versions shall be maintained? Are you going to help maintain one of these versions?
On top of that, how many new gotchas do these "modern" Windows functions hide, and how many fix cycles are required to polish them to the required level?
If we're talking about curl specifically, I absolutely think they would (NOT "should") fix or work around it if there are actually common problems caused by it.
Yes it would have required numerous fix cycles, but curl in my mind is such a polished product and they would have bit the bullet.
You're right, if the problems created by this are big enough, the team will fix them without any fanfare and whining.
However, in neither case is this a shortcoming of curl. They'd be responding to a complicated problem caused by the platform they're running on.
Why would/should they? I've never paid for curl. Who even develops it? Sounds like a thankless job to fix obscure worstfit bugs.
> Why would/should they?
Because they care. That's it.
> I've never paid for curl.
I'm sure the people who develop it don't want money and fame; they're just doing what they like. However, curl has commercial support contracts if you need one.
> Who even develops it?
Daniel Stenberg et al. Daniel can be found at https://daniel.haxx.se.
> Sounds like a thankless job to fix obscure worstfit bugs.
It may look thankless, but it's not. curl is critical infrastructure at this point. While https://xkcd.com/2347/ applies squarely to cURL, it's actually nice that the lead developer is making some money out of his endeavor.
Why would they develop curl at all by your logic?
They fix bugs because they simply want their product to be better, if I were to take a guess. Like, I'm sure curl's contributors have worked on OS-specific problems before, and this wouldn't be the last.
> to fix obscure worstfit bugs.
Again, my premise is "if there are actually common problems caused by it". This specific bug doesn't sound like that, at least not for now.
You also have to use wmain instead of main, with a wchar_t argv; otherwise the compiled-in arg parser will be calling the ANSI version. In other words: anyone using MSVC and the cross-platform, standardised, normal C entry point is hit by this.
Oh, and wmain is a Visual C++ thing; it isn't found on other platforms, and it's not standardised.
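For reference, the MSVC-specific entry point looks like this (a sketch):

    #include <wchar.h>

    /* MSVC-only: arguments arrive pre-parsed as UTF-16, so the ANSI
       best-fit conversion never runs on them. */
    int wmain(int argc, wchar_t *argv[])
    {
        for (int i = 0; i < argc; i++) {
            /* pass argv[i] straight to the "W" APIs */
        }
        return 0;
    }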
Writing cross-platform code which consistently uses UCS-2 wchar_t* on Windows and UTF-8 char* on UNIX-like systems sounds like absolute hell.
A wchar_t "native" libc implementation would be an interesting thing.
> I'm not sure why the non-unicode APIs are still so commonly used.
Even argv is affected on Windows. That's part of the C and C++ standard, not really a Windows API. Telling all C/C++ devs they need to stop using argv is kind of a tough ask.
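One workaround that keeps a standard main() is to ignore the char* argv and re-parse the raw command line yourself; a sketch (CommandLineToArgvW lives in Shell32):

    #include <windows.h>
    #include <shellapi.h>  /* CommandLineToArgvW; link with Shell32 */

    int main(void)
    {
        int wargc;
        /* GetCommandLineW returns the unconverted UTF-16 command line. */
        wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
        if (!wargv)
            return 1;
        /* ... use wargv[0..wargc-1] instead of the mangled argv ... */
        LocalFree(wargv);
        return 0;
    }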
I think the issue is that native OS things, like the Windows command line, don't always do this. Check the results of 'cd' commands with Japanese yen characters introduced: you can see the path in the prompt update to a directory name containing the yen (or a wide backslash), while the file system underneath has munged them and created an actual directory. It's precisely that you can't force the rest of the API surface to use the W functions that is the source of the difficulties.
Using \\?\ has a downside: since it bypasses Win32's path processing, it also prevents relative paths like d:test.txt from working. Kind of annoying on the command line with tools like 7z.exe.
I share your recommendation of always using PWSTR when calling Windows APIs.
> I'm not sure why the non-unicode APIs are still so commonly used
I think it's because the rest of the C world uses char* with UTF-8, so that's what people are habituated to. Setting the ACP to CP_UTF8 would have solved a lot of problems, but I believe that has only been supported recently, bafflingly.
> I'm not sure why the non-unicode APIs are still so commonly used. I can't imagine it's out of a desire to support Windows 98 or Windows 2000.
Nowadays, it's either for historical reasons (code written back when supporting Windows 9x was important, or even code migrated from Windows 3.x), or out of a desire to support non-Windows systems. Most operating systems use a byte-based multi-byte encoding (nowadays usually UTF-8) as their native encoding, instead of UTF-16.
> I'm not sure why the non-unicode APIs are still so commonly used.
Simple: portable code meant to run on Unix (where UTF-8 is king) and Windows -> want to use UTF-8 codepage on Windows and the "A" APIs.
The other opt-out might be to opt into UTF-8 support for the "A" functions.
> paths longer than 260 characters (if you add a "\\?\" prefix or set you manifest correctly)
A build of Windows 10 released long ago did this automatically, so there's no need for adjustments anymore; 32k is the max...
...except for Office! It can't handle long paths. But Office has always been hacky (the title bar, for example).
Windows has had a way of opting out of legacy behavior since Windows XP: manifest files. If you don't include a manifest, even GetVersionEx will not return the current version, IIRC. It should not be too hard to add an opt-out in there (and at some point make it the default in Visual Studio).
I think some kind of linting is also needed: there is usually no need to call ANSI WinAPI functions in a modern application (unless you set the locale to UTF-8 and only use the 8-bit functions, but I don't know how well that works). I think there are also a couple of settings and headers to include to make everything "just work", meaning argv, printf and std::cout work with UTF-8, you get no strange conversions, and you just have functions to convert between UTF-8 and UTF-16 for WinAPI calls. I'm pretty sure I have a Visual Studio project lying around somewhere where this works. But all the necessary steps need to be documented and put in one place by MS.
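A sketch of the usual starting point, assuming the UTF-8 manifest opt-in mentioned upthread is also in place (not a complete recipe):

    #include <windows.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        /* Make console byte I/O be interpreted as UTF-8. */
        SetConsoleOutputCP(CP_UTF8);
        SetConsoleCP(CP_UTF8);
        /* With the UTF-8 ACP, argv[] arrives as UTF-8 bytes as well. */
        if (argc > 1)
            printf("first arg: %s\n", argv[1]);
        return 0;
    }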
Using UTF-8 internally and converting strings for W API calls is a way to gain some performance.
More like it's a way to keep your Windows port code to a minimum so that the rest can run on Unix. I.e., you want to use UTF-8 because that's the standard on Unix, and you don't want to have completely different versions of your code for Windows and Unix because now you have twice the maintenance trouble.
The loosey-goosey mapping of code points to characters has always bothered me about Unicode.
This isn't about Unicode having "loosey-goosey" anything. It's about a mapping that Microsoft came up with to map Unicode to non-Unicode.
Yeah, they could have mapped code points to their textual descriptions. That'd require reallocations, but converting ＂ to "UNICODE_FULLWIDTH_QUOTATION_MARK_U+FF02" would be unambiguous. Ugly, but obvious what happened. Better than � IMO!
Since there's two possible antecedents for "they" (the Unicode Consortium, and Microsoft) here you'll have to clarify. Also, my question really was for u/UltraSane.
Microsoft should just never have created Best-Fit -- it's a disaster. If you have to lose information, use an ASCII character to denote loss of information and be done. (I hesitate to offer `?` as that character.) Or fail to spawn the process with an error indicating the impossibility of transcoding. Failure is better actually.