Code efficiency: C and assembly help!

Using a 32-bit word length, the BF706 is capable of 400 MMACs, so at a sample rate of 48 kHz the maximum theoretical FIR filter length is 4e8/48e3 = 8333 taps. Using assembly, I can get very close to this: 8050 taps, or about 97%. Here is the kernel of the convolution:
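For anyone who wants to check the budget, the arithmetic above can be sketched in a few lines of C. This is only an idealised bound: it assumes one MAC per cycle and ignores loop, I/O and interrupt overhead, and the function name max_taps is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on FIR taps per output sample.
   Assumption: every available MAC goes to the filter, with no
   loop, I/O or interrupt overhead. */
static uint32_t max_taps(uint32_t macs_per_second, uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0u);              /* avoid division by zero */
    return macs_per_second / sample_rate_hz;  /* integer division rounds down */
}
```

Calling max_taps(400000000u, 48000u) gives the 8333 figure quoted above.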

A1:0+= R0 * R1 || R0 = [I0++] || R1 = [I1++]; // Filter left

This issues the multiply-accumulate in parallel with two loads, each with a post-increment. However, in C, I can only manage 1024 coefficients, no matter what I do. Here is the code:

   int k;
   long fract temp;
   for (k=0;k<=modtype;k++)
   {
      temp=h[k]*x[(n++)&modtype];     // Multiply; & wraps the circular buffer index
      y+=temp;                        // Accumulate
   }

I am using the & operator to implement a circular buffer; modtype is a constant with a value of 1023. If this is increased to 2047, the audio output is degraded because the MACs cannot complete before the next sample arrives. The use of temp is advised in the optimisation data sheet, and it does improve the speed. However, it is still well short of 8050.
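To make the circular buffer scheme concrete, here is a minimal portable sketch of the same sample-by-sample idea. It is not my actual project code: it uses int32_t Q1.31 values and a 64-bit accumulator in place of the fract types, and the names fir_state, fir_step, NTAPS and MASK are made up for illustration. It also assumes the coefficients are scaled so that the 64-bit sum cannot overflow, and that right-shifting a negative value is an arithmetic shift (true on the usual compilers).

```c
#include <assert.h>
#include <stdint.h>

#define NTAPS 1024u          /* hypothetical length; must be a power of two */
#define MASK  (NTAPS - 1u)   /* plays the role of modtype (1023) */

typedef struct {
    int32_t  x[NTAPS];       /* circular delay line, Q1.31 samples */
    uint32_t head;           /* index of the newest sample */
} fir_state;

/* Push one input sample and return one output sample. */
static int32_t fir_step(fir_state *s, const int32_t *h, int32_t in)
{
    assert((NTAPS & MASK) == 0u);          /* the & trick needs a power of two */

    /* Step the write index backwards so that reading forwards
       from head visits newest to oldest samples. */
    s->head = (s->head - 1u) & MASK;
    s->x[s->head] = in;

    int64_t  acc = 0;                      /* assumes the sum fits in 64 bits */
    uint32_t n   = s->head;
    for (uint32_t k = 0; k < NTAPS; k++) {
        acc += (int64_t)h[k] * s->x[(n++) & MASK];   /* one MAC per tap */
    }
    return (int32_t)(acc >> 31);           /* Q2.62 product sum back to Q1.31 */
}
```

The state should be zero-initialised before the first call, for example with a static fir_state. The mask only works because the length is a power of two, which is why the step from 1023 goes straight to 2047.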

I would really appreciate some help on this - I have tried all the pragmas, and all of the optimisation tricks. Nothing works. I note that the disassembly looks really cluttered.

Both programs use sample by sample filtering.