SHARC+ FFT time

Hi,
I have an ADSP-21587 that I'm running some FFTs on. I'm confused because the rfft() function seems to perform better than the accel_rfft_large() function.
To time each call, I read the pREG_CGU0_TSCOUNT0/1 registers immediately before the call and again after it returns, and take the difference.
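
For reference, the timing is done roughly like the sketch below. This is a simplified illustration, not my exact code: the helper names read_count64() and elapsed_counts() are made up for this post, and it assumes TSCOUNT0 holds the low 32 bits of the counter and TSCOUNT1 the high 32 bits, which should be checked against the hardware reference.

```c
#include <stdint.h>

/* Hypothetical helper: combine two 32-bit counter halves into one 64-bit
 * value. It reads high, low, then high again and retries if the high half
 * changed, so a carry from the low half into the high half between the
 * reads cannot produce a torn value. */
static uint64_t read_count64(volatile const uint32_t *lo,
                             volatile const uint32_t *hi)
{
    uint32_t h1, l, h2;
    do {
        h1 = *hi;
        l  = *lo;
        h2 = *hi;
    } while (h1 != h2);
    return ((uint64_t)h1 << 32) | l;
}

/* Elapsed counts between two snapshots. Unsigned subtraction is modulo
 * 2^64, so the result is still correct if the counter wrapped once. */
static uint64_t elapsed_counts(uint64_t start, uint64_t stop)
{
    return stop - start;
}

/* Intended use on the target (assumption: the pREG_ macros expand to
 * pointers to the two 32-bit halves, low half first):
 *
 *   uint64_t t0 = read_count64(pREG_CGU0_TSCOUNT0, pREG_CGU0_TSCOUNT1);
 *   ... call the FFT function being measured ...
 *   uint64_t t1 = read_count64(pREG_CGU0_TSCOUNT0, pREG_CGU0_TSCOUNT1);
 *   uint64_t counts = elapsed_counts(t0, t1);
 */
```

The same measurement wraps each of the two FFT calls, so both numbers include the full call from entry to return.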

The functions are:

result = accel_rfft_large( in, out, accel_twiddles_4096, 1, 1.0, 4096 );
result = rfft( in, NULL, out, twiddles, stride, 4096 );   // twiddles is a pre-made 8K table, so stride = 2

Both functions work, but I would expect the accelerator function to be faster. Is it normal for the accelerator function to take longer to run? Does the FFT size need to be larger before the accelerator function gives any benefit?

Thanks,
Matt