FIR Accelerator calculation time variation

Category: Hardware
Product Number: ADSP-21569
Software Version: CCES 2.9.4


I am testing the FIR accelerator on the ADSP-21569.
I'm using 1024 taps, 8 samples per block and 8 channels, controlled in legacy mode with 8 TCBs.
The chain is started at the first TCB every 8th sample and stops after the 8th TCB (TCB chaining).
All 8 buffers are in L1 memory, my sample rate is 96 kHz, and the core clock is 1 GHz.
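
For reference, here is a rough sketch of the time budget this setup implies. It is plain host-side C, not accelerator code, and the function names are only illustrative. It assumes one chain run has to finish within one block of 8 samples.

```c
#include <stdio.h>

/* Time available for one block, in microseconds. */
static double block_period_us(double sample_rate_hz, unsigned samples_per_block)
{
    return 1e6 * (double)samples_per_block / sample_rate_hz;
}

/* Multiply-accumulates needed for one block across all channels. */
static unsigned long macs_per_block(unsigned taps, unsigned samples_per_block,
                                    unsigned channels)
{
    return (unsigned long)taps * samples_per_block * channels;
}

/* Print the budget for the configuration described above. */
static void print_budget(void)
{
    printf("block period: %.2f us\n", block_period_us(96000.0, 8));
    printf("MACs per block: %lu\n", macs_per_block(1024, 8, 8));
}
```

With 8 samples at 96 kHz the block period comes out at roughly 83.3 µs, and 1024 taps x 8 samples x 8 channels gives 65536 MACs per block.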

I'm measuring the calculation time with the FLG2 output pin and see variations in the timing.
I think I can exclude access conflicts, since I also disabled all other operations for the test.
What I found is that the calculation time is higher when the input samples wrap around the end of the modulo buffer.
If all of the input samples lie within the modulo buffer range, the time is lowest and matches the simulation tool from ADI (EE408).
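
To make clear what I mean by "wrapped around", here is a minimal sketch of the check, written as generic C. The names `span_wraps`, `start`, `span` and `buf_len` are hypothetical and are not accelerator register names. It assumes a contiguous read of `span` samples beginning at index `start` in a circular buffer of `buf_len` samples.

```c
#include <stddef.h>

/*
 * Returns 1 if reading `span` consecutive samples beginning at index
 * `start` runs past the end of a circular buffer of `buf_len` samples
 * (i.e. the read has to wrap), 0 if the whole span fits without wrapping.
 */
static int span_wraps(size_t start, size_t span, size_t buf_len)
{
    size_t offset = start % buf_len;   /* position inside the buffer */
    return (offset + span) > buf_len;
}
```

As an illustration only: for an N-tap FIR producing W outputs, the input span is N + W - 1 samples, so 1024 + 8 - 1 = 1031 here. With an assumed buffer length of 1032, `span_wraps(0, 1031, 1032)` is 0 while `span_wraps(8, 1031, 1032)` is 1.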

I think the burst size is 8 samples, and my buffer size is a multiple of 8, so no load should cross the buffer end.
The minimum time is 58 µs, the maximum is 63 µs, and the simulation says 57.99 µs (so it matches the minimum).
I think that is quite a large increase (5000 CCLK) for crossing the modulo buffer boundary 64 times.
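
The arithmetic behind those numbers, as a small sketch (again plain C with illustrative function names, using only the values measured above):

```c
/* Extra core-clock cycles between the slowest and fastest measurement. */
static double extra_cclk(double t_max_us, double t_min_us, double cclk_hz)
{
    /* cclk_hz / 1e6 is the number of core cycles per microsecond */
    return (t_max_us - t_min_us) * (cclk_hz / 1e6);
}

/* Average extra cycles attributed to each boundary crossing. */
static double cclk_per_crossing(double extra_cycles, unsigned crossings)
{
    return extra_cycles / (double)crossings;
}
```

With 63 µs and 58 µs at 1 GHz this gives 5000 extra cycles, or about 78 CCLK per crossing if all of the extra time is attributed to the 64 crossings.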

I'm wondering why this happens. Am I overlooking something?