BF706 fastest possible FIR code - mixed C and assembly

Here is just about the fastest FIR filter for the BF706. It uses mixed C and assembly to run a 32-bit filter with 8060 taps at a 48 kHz sample rate, which is 97% of the theoretical maximum. Right now I cannot get anywhere near this in C alone (see the other Q&A), and I would very much like to know whether anyone else has had this problem. The filter must run sample by sample, not in block-buffer mode.

Attachment: filter.pdf
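For reference, here is a rough plain-C sketch of the arithmetic behind the 97% figure and of what sample-by-sample processing looks like. It assumes a 400 MHz core clock and one 32-bit multiply-accumulate per cycle, neither of which is stated above, plus Q31 fixed-point data. The names `fir_t`, `fir_init`, `fir_step`, `cycles_per_sample` and `FIR_MAX_TAPS` are made up for illustration. This is a portable baseline, not the mixed C and assembly code in the attachment, and it will not come close to that speed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical upper bound, chosen only so that 8060 taps fit. */
#define FIR_MAX_TAPS 8192u

typedef struct {
    int32_t  coeff[FIR_MAX_TAPS];  /* Q31 coefficients, coeff[0] = newest tap */
    int32_t  delay[FIR_MAX_TAPS];  /* circular delay line of past inputs      */
    uint32_t ntaps;                /* number of taps actually in use          */
    uint32_t head;                 /* index of the most recent input sample   */
} fir_t;

/* Whole core cycles available between two input samples.
 * With the assumed 400 MHz clock and 48 kHz rate this is 8333,
 * so 8060 taps at one MAC per cycle is about 97% of the budget. */
static uint32_t cycles_per_sample(uint32_t core_hz, uint32_t sample_hz)
{
    return core_hz / sample_hz;
}

static void fir_init(fir_t *f, const int32_t *coeff, uint32_t ntaps)
{
    assert(ntaps > 0 && ntaps <= FIR_MAX_TAPS);
    memcpy(f->coeff, coeff, ntaps * sizeof coeff[0]);
    memset(f->delay, 0, sizeof f->delay);
    f->ntaps = ntaps;
    f->head  = 0;
}

/* Process exactly one input sample and return one output sample
 * (sample-by-sample, no block buffering). */
static int32_t fir_step(fir_t *f, int32_t x)
{
    /* Step the write index backwards so that walking forwards from
     * head visits samples from newest to oldest. */
    f->head = (f->head == 0) ? f->ntaps - 1 : f->head - 1;
    f->delay[f->head] = x;

    int64_t  acc = 0;
    uint32_t idx = f->head;
    for (uint32_t k = 0; k < f->ntaps; k++) {
        /* Q31 * Q31 gives Q62; shift each product back to Q31 so the
         * 64-bit accumulator cannot overflow. Assumes the compiler uses
         * an arithmetic right shift for negative values. */
        acc += ((int64_t)f->coeff[k] * f->delay[idx]) >> 31;
        if (++idx == f->ntaps) {
            idx = 0;
        }
    }

    /* Saturate to the 32-bit output range. */
    if (acc > INT32_MAX) acc = INT32_MAX;
    if (acc < INT32_MIN) acc = INT32_MIN;
    return (int32_t)acc;
}
```

The inner loop does one multiply-accumulate per tap plus a wrap test on the index. On the real part the wrap would normally be handled by hardware circular addressing and the loop by a zero-overhead hardware loop, which is presumably where a C-only build loses most of its cycles.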