The GoL (Game of Life) example it loads with seemed to run far slower than I expected. It turns out there's a `usleep(1000 * 100)` call (a 100 ms pause) in the code, inserted to make the output easier to see; the actual kernels execute quickly and take up very little GPU time.
When I looked at the profiler, I was confused to see that one worker thread was at 100% usage the whole time it was running. At first, I thought that maybe it was actually running the code via Wasm on the CPU rather than on the GPU like it said.
Instead, it turns out that the worker was just running `emscripten_futex_wait`, which as far as I can tell is implemented by busy-waiting in a loop. That probably doesn't matter for performance, since I imagine it's only there for the sleep call anyway.
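
To show why a sleep implemented as a busy-wait would look like a pegged thread in a profiler, here's a minimal native C sketch. It is not the Emscripten implementation (I haven't verified how `emscripten_futex_wait` actually works), and every name in it (`run_frames`, `delay_busy`, `delay_blocking`, `frame_usage`) is hypothetical. The "kernel" is a trivial stand-in, and `nanosleep` is used in place of `usleep` for the blocking case.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Read the given POSIX clock as seconds. */
static double clock_seconds(clockid_t id) {
    struct timespec ts;
    clock_gettime(id, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Delay by spinning until a deadline: the thread stays busy the whole time. */
static void delay_busy(long delay_us) {
    double deadline = clock_seconds(CLOCK_MONOTONIC) + (double)delay_us / 1e6;
    while (clock_seconds(CLOCK_MONOTONIC) < deadline) {
        /* spin */
    }
}

/* Delay by blocking in the OS: the thread is descheduled while it waits. */
static void delay_blocking(long delay_us) {
    struct timespec req = {
        .tv_sec = delay_us / 1000000L,
        .tv_nsec = (delay_us % 1000000L) * 1000L,
    };
    nanosleep(&req, NULL);
}

typedef struct {
    double wall_s; /* elapsed wall-clock time */
    double cpu_s;  /* CPU time consumed by this process */
} frame_usage;

/* Run `frames` iterations of "tiny step, then delay" and measure both clocks. */
static frame_usage run_frames(int frames, long delay_us, void (*delay)(long)) {
    assert(frames >= 0 && delay_us >= 0);
    double wall0 = clock_seconds(CLOCK_MONOTONIC);
    double cpu0 = clock_seconds(CLOCK_PROCESS_CPUTIME_ID);
    volatile unsigned state = 1u;
    for (int i = 0; i < frames; i++) {
        /* Hypothetical stand-in for launching one Game of Life kernel. */
        state = state * 1664525u + 1013904223u;
        delay(delay_us);
    }
    frame_usage u;
    u.wall_s = clock_seconds(CLOCK_MONOTONIC) - wall0;
    u.cpu_s = clock_seconds(CLOCK_PROCESS_CPUTIME_ID) - cpu0;
    return u;
}
```

Calling `run_frames(10, 1000L * 100L, delay_busy)` should take about a second of wall time and burn roughly that much CPU time, while the same call with `delay_blocking` should take about as long but use almost no CPU. In both cases the per-frame work is negligible, which matches what the profiler showed: a "busy" thread that is really just waiting out the delay.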
----
Altogether this is an incredibly cool tool. I'm sure there is some performance gap compared to native, but even so this is extremely impressive and likely has a ton of potential use cases.
Thank you so much for this! I was a bit concerned that the performance on my Mac was nearly identical to my new 3090 on PC and thought I might have messed up the setup there!