> A method that improves on Horner's method in this aspect > is Estrin's method. It rewrites the polynomial as a (mostly > balanced) binary tree using x^2 and A*x+B terms (where A > and B are different subexpressions).
I googled for the technique and applied it, simple and straightforward. Instead of 6 deep multiplies using Horner's method, it ended being 8 multiplies only 3 deep. Unfortunately profiling my code tells me that it spends the lion's share in writing it out still, and I also don't do enough work to fill the pipeline. So in practice, there was no speed difference between the two methods.
I also got the compiler to unroll the loops automatically. With this option, Horner's method comes out slightly ahead (3% ?), since the pipelines can be filled in parallel from each unrolled iteration.