Large FFT (16K and 32K) benchmark on a 2146x SHARC?

Typical SHARC code examples for floating-point complex FFTs cover 8K points or fewer. I need optimized 16K-point and 32K-point floating-point FFT code, along with benchmarks, specifically for the 2146x. Ideally, it would run entirely from the large 5 Mbit internal memory rather than from external memory.