Hello,
We are working on FIR filtering on the SHARC-FX core. After benchmarking the different FIR functions available in the DSP library, we found that none of them seems to reach the theoretical maximum of 8 MACs per cycle.
We prepared a simplified project (attached) to verify under which conditions the processor can reach this efficiency:
- hardware: ADSP-SC835W-EV-SOM + breakout board, running at 1 GHz, CCES 3.0.2. Performance is measured with GPIO toggles (their delay already accounted for) and CYC counters
- project built in Release configuration, all optimizations enabled
- firMacTest() executes 72 iterations of PDX_MULA_MXF32 (not a FIR filter, just MACs), with all signals located in L1 memory and aligned
- we expect 72 iterations of PDX_MULA_MXF32 to take approximately 72 cycles, achieving 8 MACs of 32-bit floats per cycle. In reality, the 72 iterations take 90 to 95 cycles on average, which translates to approximately 6 MACs of 32-bit floats per cycle
- in our complete scheme we are using a modified version of adi_s1fir_fastf, and overall we are getting around 5 MACs per cycle, which is understandable given the extra work around the MACs. That is why we decided to test specifically the simplest possible case of looped MACs with PDX_MULA_MXF32
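
For reference, the arithmetic behind the figures above can be sketched in plain C as follows. This is only an illustration of how we convert cycle counts into MACs per cycle, not the code from the attached project. The helper names (macs_per_cycle, peak_fraction) and the LANES_PER_OP constant are hypothetical, and the value 8 is an assumption taken from the theoretical maximum of 8 MACs per cycle quoted above (one PDX_MULA_MXF32 per cycle).

```c
#include <stdio.h>

/* Assumption: one PDX_MULA_MXF32 performs 8 float32 MACs, matching the
 * theoretical peak of 8 MACs per cycle quoted in this post. */
#define LANES_PER_OP 8u

/* Hypothetical helper: MACs per cycle achieved by `iterations` vector MAC
 * operations that took `cycles` cycles in total. */
double macs_per_cycle(unsigned iterations, unsigned cycles)
{
    return (double)(iterations * LANES_PER_OP) / (double)cycles;
}

/* Hypothetical helper: achieved throughput as a fraction of the assumed peak. */
double peak_fraction(unsigned iterations, unsigned cycles)
{
    return macs_per_cycle(iterations, cycles) / (double)LANES_PER_OP;
}

/* Print one line of the comparison for a measured cycle count. */
void print_report(unsigned iterations, unsigned cycles)
{
    printf("%u iterations in %u cycles: %.2f MACs/cycle (%.0f%% of peak)\n",
           iterations, cycles,
           macs_per_cycle(iterations, cycles),
           100.0 * peak_fraction(iterations, cycles));
}
```

Calling print_report(72, 72), print_report(72, 90) and print_report(72, 95) covers the ideal case and both ends of our measured range: 72 iterations are 576 MACs, so 90 cycles correspond to 6.4 MACs per cycle and 95 cycles to roughly 6.06, that is, between 76% and 80% of the expected peak.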
Could you give us any hint as to why this could be happening, or whether this is in fact the expected behaviour? Is there a limitation that we are not considering here?
Thanks,
Leopoldo