Hi!

I have C code that implements an LMS algorithm, and I want to optimize its most time-critical part.

The routine below is not fast enough. It runs once for every sample, with a tapSize >= 256, so I need to speed it up.

In C (taps is a pointer to a short array):

int factor;
int exp;
int n;
int i;

n = NofTaps;
for (i = 0; i < n; i++)
{
    exp = *pHist++ * factor;
    taps[i] += ((exp + (1 << 14)) >> 15);
}

My asm code looks like this:

[FFA0BF5A] CC = R2 <= 0 ;
[FFA0BF5E] IF CC JUMP __END__.P37L5L ;
{
exp = *phist++ * factor;
[FFA0BF60] NOP ;
[FFA0BF62] I0 = P1 ;
[FFA0BF64] LSETUP ( __BEGIN__.P37L5L , 22 /*0xFFA0BF7A*/ ) LC0 = P4 ;
[FFA0BF68] MNOP || R0 = W [ P0 ++ ] ( X ) || R2.L = W [ I0 ] ;
[FFA0BF70] R0 *= R1 ;
taps[i] += ((exp+(1<<14)) >> 15);
[FFA0BF72] R0 <<= 0x1 ;
[FFA0BF74] R0.L = R0 ( RND ) ;
[FFA0BF78] R0 = R0 + R2 ;
[FFA0BF7A] W [ I0 ++ ] = R0.L ;
}

Is there any potential to speed up the hardware loop?

Best regards,
Chris

Hi,

It seems that if you unroll the first and last iterations out of the loop (so that the loop body overlaps work from neighbouring iterations), you could do something like:

LSETUP ( __BEGIN__.P37L5L , 22 ) LC0 = P4 ;
{
R0 = R0 * R1 ;
R0 = R0 << 0x1 || W [ I0 ++ ] = R3.L ;
R0.L = R0 ( RND ) ;
R3 = R0 + R2 || R0 = W [ P0 ++ ] ( X ) || R2.L = W [ I0 ] ;
}

Note that I moved the result writeback so that it runs in parallel with the next iteration's shift (changing the result register to R3 along the way), and moved the next iteration's data loads so that they run in parallel with the addition of exp to the tap value.
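To make the reordering easier to follow, here is a minimal C sketch of the same idea. It is only an illustration: the function names update_plain and update_pipelined and the explicit array arguments are my own assumptions, not taken from your code, and plain C neither saturates like the RND instruction nor issues anything in parallel. It just shows how the first loads and the last computation move out of the loop while the results stay the same.

```c
#include <stdint.h>
#include <stddef.h>

/* Straightforward update, equivalent to the original C loop.
   Assumes >> on a negative int is an arithmetic shift (true for the
   usual compilers, but implementation-defined in C). */
void update_plain(int16_t *taps, const int16_t *hist, int32_t factor, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t e = (int32_t)hist[i] * factor;
        taps[i] = (int16_t)(taps[i] + ((e + (1 << 14)) >> 15));
    }
}

/* Software-pipelined variant (hypothetical name): the loads for
   iteration i+1 are issued while iteration i is being finished. */
void update_pipelined(int16_t *taps, const int16_t *hist, int32_t factor, size_t n)
{
    if (n == 0)
        return;

    /* Prologue: loads for the first iteration, peeled out of the loop. */
    int16_t h = hist[0];
    int16_t t = taps[0];

    for (size_t i = 0; i + 1 < n; i++) {
        /* Compute for iteration i. */
        int32_t e = (int32_t)h * factor;
        int16_t r = (int16_t)(t + ((e + (1 << 14)) >> 15));

        /* Loads for iteration i+1, next to the store for iteration i. */
        h = hist[i + 1];
        t = taps[i + 1];
        taps[i] = r;
    }

    /* Epilogue: compute and store for the last iteration. */
    int32_t e = (int32_t)h * factor;
    taps[n - 1] = (int16_t)(t + ((e + (1 << 14)) >> 15));
}
```

Calling both functions on copies of the same taps and history arrays should give identical output, which is a quick way to check a hand-scheduled loop against the reference.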

Note also that since R1 holds the factor, you may be able, depending on your specific numbers, to double the factor up front and thus eliminate the left shift before the rounding. If you do not absolutely need the rounding, the 32-bit to 16-bit register move can be done through memory transactions, and might thus be issued in parallel with some compute instructions.
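The arithmetic behind both suggestions can be checked in C. This is a sketch under assumptions: the helper names are hypothetical, the doubled factor must still keep the product inside 32 bits, and >> on negative values is assumed to be an arithmetic shift. Since 2*exp + (1<<15) shifted right by 16 is the same as exp + (1<<14) shifted right by 15, doubling the factor gives the same result as the original expression, and dropping the rounding changes the result by at most one LSB.

```c
#include <stdint.h>

/* Original form: product scaled by 2^-15 with round-to-nearest. */
int32_t step_round15(int16_t h, int32_t factor)
{
    int32_t e = (int32_t)h * factor;
    return (e + (1 << 14)) >> 15;
}

/* Factor doubled beforehand (factor2 = 2 * factor): the rounded upper
   16-bit half is taken directly, with no extra left shift. */
int32_t step_round16(int16_t h, int32_t factor2)
{
    int32_t e = (int32_t)h * factor2;
    return (e + (1 << 15)) >> 16;
}

/* No rounding: simply take the upper 16 bits of the product. */
int32_t step_trunc16(int16_t h, int32_t factor2)
{
    int32_t e = (int32_t)h * factor2;
    return e >> 16;
}
```

Whether the one-LSB difference of the truncating form matters depends on your step size and on how sensitive the adaptation is to the small downward bias it introduces.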