Here is a cute variant that doesn't need lzcnt nor tables, but only works up to 99,999:
(((x + 393206) & (x + 524188)) ^ ((x + 916504) & (x + 514288))) >> 17
This computes integer log 10, but could be adapted to give the number of digits. It needs a wrapper for 64 bit to invoke it multiple times, but most numbers in JSON are small, so it might even be competitive; it needs only 4 cycles with enough instruction-level parallelism.
I gathered this idea from the output of a superoptimizer; it was fun to figure out how it works. For spoilers, see [1].
[1] https://github.com/rust-lang/rust/blob/master/library/core/s...
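For anyone who wants to poke at it, here is a minimal C sketch of the expression above, checked exhaustively against a naive division loop over 1..99,999. The function names are made up for illustration, and the 64-bit wrapper (peel off 10^10, then 10^5, then apply the trick) is just one plausible way to "invoke it multiple times", not necessarily what any particular library does.

```c
#include <assert.h>
#include <stdint.h>

/* The bit trick from the comment above; only valid for x <= 99999. */
static uint32_t ilog10_small(uint32_t x) {
    return (((x + 393206u) & (x + 524188u)) ^
            ((x + 916504u) & (x + 514288u))) >> 17;
}

/* Naive reference: count how many times x can be divided by 10. */
static uint32_t ilog10_ref(uint64_t x) {
    uint32_t n = 0;
    while (x >= 10) { x /= 10; n++; }
    return n;
}

/* Hypothetical 64-bit wrapper: reduce x below 100000 in at most two
   divisions, then finish with the bit trick. */
static uint32_t ilog10_u64(uint64_t x) {
    uint32_t log = 0;
    if (x >= 10000000000ull) { x /= 10000000000ull; log += 10; }
    if (x >= 100000ull)      { x /= 100000ull;      log += 5; }
    return log + ilog10_small((uint32_t)x);
}

/* Exhaustive check of the small range against the reference. */
static void check_small_range(void) {
    for (uint32_t x = 1; x <= 99999; x++)
        assert(ilog10_small(x) == ilog10_ref(x));
}
```

Adding 1 to the result gives the digit count for nonzero inputs; zero needs its own decision, since the trick returns 0 there.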
> that doesn't need lzcnt
Is it a big advantage to not need it, or can we safely assume CPUs have fast instructions for this today?
Sometimes you have a scalar instruction but not a vectorized one, or it doesn't match the 'lanes' you want to operate (ISTR weird holes in AVX where I could find a specific instruction for 8, 32, 64-bits lanes but not for 16). Always good to have an escape hatch there, especially a highly pipelined (or pipeline-able) one.
It's not obvious to me that this is worth much over
(x > 9) + (x > 99) + (x > 999) + (x > 9999)
(x > CONST) does tend to need a pair of instructions to get the status flag into a register, but those can also fuse with the adds. Critical path latency is cmp-setae-sbb-add, or also four cycles.
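To make the comparison concrete, here is a small C sketch that puts the sum-of-comparisons form next to the bit trick and asserts they agree on every input from 0 to 99,999. The function names are hypothetical, and this only checks equivalence; what a compiler emits for either form (and the resulting latency) depends on the target and flags, so measure before trusting the cycle counts.

```c
#include <assert.h>
#include <stdint.h>

/* Sum-of-comparisons form: each comparison yields 0 or 1 in C. */
static uint32_t ilog10_cmp(uint32_t x) {
    return (x > 9) + (x > 99) + (x > 999) + (x > 9999);
}

/* Bit-trick form from earlier in the thread; valid for x <= 99999. */
static uint32_t ilog10_bits(uint32_t x) {
    return (((x + 393206u) & (x + 524188u)) ^
            ((x + 916504u) & (x + 514288u))) >> 17;
}

/* Both forms should agree on the whole supported range, including 0. */
static void check_equivalence(void) {
    for (uint32_t x = 0; x <= 99999; x++)
        assert(ilog10_cmp(x) == ilog10_bits(x));
}
```

One practical difference: the comparison form stays correct for any 32-bit input if you keep adding terms, while the bit trick is wrong above 99,999.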