
Need help optimizing a C function

Question asked by Rai on Feb 15, 2012
Latest reply on Feb 21, 2012 by MikeP

I'm using FIR filtering and decimation in a critical portion of my program. Normally I would use the provided function fir_decima_fr16, but I find it too slow when both the decimation index and the filter length are large. I know the Blackfin can do two 16-bit fractional multiplications in one cycle, so it should be possible to do this faster.

Here's my C code:

/** General FIR-filtering and downsampling, identical to "fir_decima_fr16" but
*   faster for long filters.
* @param in Input buffer, of length n
* @param out Output buffer, of length n/state->l
* @param state Filter state, same as used for fir_decima_fr16. See Blackfin/VDSP documentation for details.
* @param n Number of input samples
*/
void filter_and_decimate(fract16 *in,fract16* out,fir_state_fr16* state,int n)
{
      int i,j,k; 
      int n_outsamples=n/state->l;

      #pragma no_alias
      #pragma loop_count(1,256,1)
       for(i=0;i<n_outsamples;i++) 
      { 
           int accum=0; 
           k=state->l*i;

           /* Need heavy optimization here, so give the compiler some clues. */
           #pragma no_alias
           #pragma loop_count(4,1024,4) 
           for (j=k;j<k+state->l;j++)
           {
                *(state->p)=in[j];
                state->p=circptr(state->p,2,state->d,2*state->k);
           }

           /* Need even heavier optimization here.*/
           #pragma no_alias
           #pragma loop_count(64,2048,64) 
           for (j=0;j<state->k;j++)
           { 
                accum+=state->p[0] *state->h[j]; 
                state->p=circptr(state->p,2,state->d,2*state->k); //Enforce circular buffer indexing in compiler optimization
           } 
           out[i]=accum/(1<<15);//Use division instead of a bit shift: division rounds toward zero, while an arithmetic right shift rounds toward negative infinity
      } 
}
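
For anyone who wants to check results without the Blackfin toolchain, the routine above can be modeled in portable C. This is only a minimal sketch under assumptions: q15_t and ref_state are hypothetical stand-ins for fract16 and fir_state_fr16, array indices replace circptr, and the delay line is assumed to hold exactly k samples. It is meant as a bit-exact reference for small inputs, not as a fast implementation.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t q15_t;   /* assumption: stand-in for fract16 (1.15 fixed point) */

/* Hypothetical index-based stand-in for fir_state_fr16. */
typedef struct {
    const q15_t *h;      /* k filter coefficients */
    q15_t *d;            /* circular delay line of k samples */
    int pos;             /* next write position in d, 0..k-1 */
    int k;               /* number of coefficients */
    int l;               /* decimation index */
} ref_state;

void ref_filter_and_decimate(const q15_t *in, q15_t *out, ref_state *s, int n)
{
    assert(s->k > 0 && s->l > 0);
    int n_out = n / s->l;

    for (int i = 0; i < n_out; i++) {
        /* Push l new input samples into the circular delay line. */
        for (int j = 0; j < s->l; j++) {
            s->d[s->pos] = in[i * s->l + j];
            s->pos = (s->pos + 1) % s->k;
        }

        /* pos now points at the oldest sample; walk the whole delay line
           once, pairing the oldest sample with h[0] as the original does. */
        int32_t accum = 0;
        int p = s->pos;
        for (int j = 0; j < s->k; j++) {
            accum += (int32_t)s->d[p] * s->h[j];
            p = (p + 1) % s->k;
        }

        /* Same scaling as the original: divide by 2^15, no saturation. */
        out[i] = (q15_t)(accum / (1 << 15));
    }
}
```

Running both versions on the same input and comparing the output buffers is one way to confirm that an optimized variant still computes the same filter.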

 

Now the inner loop takes two cycles per iteration. However, I found that by declaring n_outsamples volatile, which forces the compiler not to use the hardware loop facility for the outer loop, the inner loop executes two iterations per cycle. That's a fourfold improvement! But it is kind of a hack, and it also adds a lot of overhead to the outer loop, which is a problem for shorter filters with less decimation.
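
To make the hack concrete, here is a minimal portable sketch of where the volatile goes. It is not the Blackfin code itself: q15_t is a hypothetical stand-in for fract16, the state fields are passed as plain arguments, and array indices replace circptr. The effect on hardware loops is specific to the Blackfin compiler and cannot be reproduced with this sketch; it only shows the source change.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t q15_t;   /* assumption: stand-in for fract16 (1.15 fixed point) */

/* Hypothetical portable variant. d is a k-sample circular delay line and
   *pos is the next write index into it. */
void filter_and_decimate_volatile(const q15_t *in, q15_t *out,
                                  const q15_t *h, q15_t *d, int *pos,
                                  int k, int l, int n)
{
    assert(k > 0 && l > 0);

    /* The hack: a volatile bound has to be re-read on every outer
       iteration, so the compiler cannot treat the trip count as fixed. */
    volatile int n_outsamples = n / l;
    int p = *pos;

    for (int i = 0; i < n_outsamples; i++) {
        /* Push l new samples into the delay line. */
        for (int j = 0; j < l; j++) {
            d[p] = in[i * l + j];
            if (++p == k) p = 0;
        }

        /* Multiply-accumulate over all k taps, starting at the oldest sample. */
        int32_t accum = 0;
        for (int j = 0; j < k; j++) {
            accum += (int32_t)d[p] * h[j];
            if (++p == k) p = 0;
        }
        out[i] = (q15_t)(accum / (1 << 15));
    }
    *pos = p;
}
```

The only difference from a plain version is the volatile qualifier on n_outsamples; the arithmetic and the results are unchanged.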

So, any clues? How can I get the compiler to use maximum optimization for the inner loop without resorting to this kind of hack?

 

By the way, the state and coefficient vectors are in different memory banks, and I've tried the pragmas vector_for, all_aligned and different_banks without seeing any difference.
