If we had a uint128_t and could use it just like we did with
uint64_t, we'd make life easier for the branch predictor since,
simply put, the inner loop would end fewer times.
You can do it branch-free by means of conditional moves and such
(e.g., do two bit scans and select between them based on whether
the lowest word is zero or not; similarly for the other operations),
and there is some support from the compiler (__uint128_t
on GCC-like platforms), but in the end, going to...
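
To make the "two bit scans" idea concrete, here is a minimal sketch in C, assuming GCC or Clang (it relies on __builtin_ctzll and unsigned __int128). The names ctz128 and ctz128_wide are hypothetical helpers made up for illustration, not anything from the code discussed above, and whether the compiler really emits branch-free machine code for this is not guaranteed.

```c
#include <stdint.h>

/* Hypothetical helper: index of the lowest set bit of a 128-bit value
 * held as two 64-bit words. Precondition: (lo | hi) != 0. */
static inline unsigned ctz128(uint64_t lo, uint64_t hi)
{
    /* __builtin_ctzll(0) is undefined, so OR in a 1 when the word is
     * zero; that scan result is discarded by the select below. */
    unsigned scan_lo = (unsigned)__builtin_ctzll(lo | (uint64_t)(lo == 0));
    unsigned scan_hi = (unsigned)__builtin_ctzll(hi | (uint64_t)(hi == 0)) + 64u;

    /* All ones if the low word is zero, all zeros otherwise. */
    unsigned use_hi = -(unsigned)(lo == 0);

    /* Pick one of the two scans with a mask instead of an if/else. */
    return (scan_lo & ~use_hi) | (scan_hi & use_hi);
}

/* Same scan on the compiler-provided 128-bit type (GCC/Clang only). */
static inline unsigned ctz128_wide(unsigned __int128 x)
{
    return ctz128((uint64_t)x, (uint64_t)(x >> 64));
}
```

Both scans always run and the mask chooses between them, so there is no data-dependent branch in the source; the price is one extra bit scan per call.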