Use a faster safe bitwise rotate implementation.#13
Use a faster safe bitwise rotate implementation.#13nmathewson wants to merge 3 commits intowahern:masterfrom
Conversation
This is in preparation for using faster versions.
On modern compilers, under my tests, this turns all of the rotate functions into a single rotl/rotr instruction. (For some links, see BLAKE2/BLAKE2#14 . The cryptopp library uses similar tricks.) The previous definitions, on the other hand, include a conditional branch instruction, which was probably slowing things down a bit. Also add a comment to explain why we don't just do "(v>>c)|(v<<(bits-c))" (because c can be 0). Also add a comment to assert that we don't need to actually handle c >= bits. Removing all of these branch instructions ought to improve performance a bit.
|
Hm, we'd better be careful here. I just ran the benchmarks over master and all 3 of these branches, and to my surprise they don't show a very large improvement. (I wasn't very careful in my setup, though, so maybe better benchmark run would show results better.) |
|
Yikes. That's alot of code for something simple. I'd rather hold off on this. I was thinking it was a small diff that reduced the overall SLoC. Regarding branching, I recently stumbled upon the very enlightening paper, Branch Prediction and the Performance of Interpreters - Don't Trust Folklore. The takeaway is that predictors have been getting much better across the board. Haswell is so good at it that often times it's wasted effort to even think about branch optimization. Regardless of performance, I do prefer removing conditionals as removing them them tends to make code easier to follow and manage overall, even if the details of why a condition is unneeded isn't obvious. But this patch adds more (preprocessor) conditionals and other code. Is it okay if I cherry pick (e.g. moving rotr and rotl to timeout-bitops) so I can merge the Makefile and regression test patches? |
|
Of course it's okay to cherry-pick. :) |
This branch replaces the rotation operation with one that compiles down to a single branchless rotate instruction (on my gcc and clang compilers). The magic trick here is the fact that :
works even when c==0, is conformant c, doesn't require a branch, and many compilers can recognize it and replace it with a rotate instruction.
I have tested this, but I haven't benchmarked it yet or tested with super-old or super-weird compilers. I would assume that eliminating these branch instructions is always a good idea, though, and it doesn't make the timeout.c code uglier.