Hello!

I have recently started evaluating the ADSP-BF506F EZ-KIT Lite for an application where I need to compute an FFT over a spectral width of 1 MHz (ADC at 2 MHz) such that the per-bin frequency width is at most around 488 Hz. That calls for 2048 bins across the 1 MHz spectral width, or basically a 4096-point FFT computation.

I have successfully been able to run a 2048-point FFT with the following code:

#include <filter.h>
#include <complex.h>
#include <math.h>
#include <math_const.h>
#include <stdio.h>

#define FFT_SIZE 2048

complex_fract16 twiddle_table[FFT_SIZE/2];
complex_fract16 fft_vector[FFT_SIZE];
fract16 input_vector[FFT_SIZE];
fract16 mag_fft_vector[FFT_SIZE];

void create_sine_wave(float);

int main(void)
{
    int block_exponent, i;

    // Generate twiddle table
    twidfftrad2_fr16(twiddle_table, FFT_SIZE);

    // Generate input
    create_sine_wave(1000);

    // Compute FFT
    rfft_fr16(input_vector, fft_vector, twiddle_table, 1, FFT_SIZE, &block_exponent, 1);

    for (i = 0; i < FFT_SIZE; i++)
        mag_fft_vector[i] = cabs_fr16(fft_vector[i]);

    printf("\nDone");
    fflush(stdout);
    return 0;
}

void create_sine_wave(float f)
{
    int i;
    float fs = 48000.0;

    // Generate an input tone at f Hz
    for (i = 0; i < FFT_SIZE; i++) {
        input_vector[i] = (fract16)(sin(2*PI*f*((float)i/fs))*32760.0);
    }
}

However, when I try to compute a 4096-point FFT, I get an error: "**Out of memory in output section 'bsz_L1_data_a' in processor 'p0'**". Now if I understand correctly, a 4K-point FFT should take up a total of 28K (8K input, 8K output, 4K twiddle, 8K mag) out of 68K of total RAM.

What am I missing or misunderstanding here? Also, is there a way to see how much RAM is being consumed by various sections of my code?

Please advise.

Thank you for your help.

Rgds,

SG

Hi,

Another way to do the same is to break down the 4096-point-FFT into smaller FFTs, as suggested by Engineer-to-Engineer Note EE-263 at

http://www.analog.com/static/imported-files/application_notes/250937934EE263v01.pdf

Summary --

The BF50x can do a 4096-point-FFT in about 103,000 = 103 k cyc, assuming about 20 kB of L1 data memory and 16 kB of L1 instruction memory are used to store data (if L1 memory could hold everything, the 4096-point-FFT would cost 56 k cyc).

The procedure explained below also suggests that the BF50x can do an 8192-point-FFT at a reasonable cycle cost (say 3x the cycles of the case where L1 can hold everything), assuming all 32 kB of L1 data memory and 16 kB of L1 instruction memory are used to store data.

Let me explain in a few steps. In Step 1, we re-derive the formula suggested by EE-263; this will allow you to check if I made any mistakes early on. In Step 2, we implement the algorithm in a way that is good for any processor. In Step 3, we take the L1 size constraints imposed by the BF50x and see how the implementation of Step 2 is accomplished on the BF50x. Along the way, we also estimate the cost in cycles and in L1 memory for the BF50x.

Step 1. Re-derive the formula for the case of 4096-point-FFT

Let N=4096, h=exp(-j*2*pi),

X(n) = 4096-point-FFT of x(k), k=0,…,N-1

= sum_k x(k)*h^(nk/N), summation of k in [0,N-1], (1.1)

Let k=16*l+m, with l in [0,255], m in [0,15], (1.2)

X(n) = sum_k x(k)*h^(nk/N),

= sum_m sum_l x(16*l+m)*h^(n*(16*l+m)/N),

where summation of l in [0,255], summation of m in [0,15].

Moving the term with m to the outer summation, we have

X(n) = sum_m { h^(n*m/N) sum_l x(16*l+m)*h^(n*16*l/N) }

= sum_m { h^(n*m/N) sum_l x(16*l+m)*h^(n*l/256) }, (1.3)

The above is basically decimation in time, and what follows is decimation in frequency.

Let n=256*s+t, with s in [0,15], t in [0,255], (1.3a)

the above (1.3) can be written as

X(256*s+t)

= sum_m { h^((256*s+t)*m/N) sum_l x(16*l+m)*h^((256*s+t)*l/256) }

= sum_m { h^(256*s*m/N)*h^(t*m/N) sum_l x(16*l+m)*h^(t*l/256) }

= sum_m { h^(s*m/16)*h^(t*m/N) sum_l x(16*l+m)*h^(t*l/256) } (1.4)

To arrive at the above, we used the fact that h^(256*s*l/256) = h^(s*l) = exp(-j*2*pi*s*l) = 1

since s and l are integers.

We recognize that the inner summation is 256-point-FFT. We denote it as

Y(16*t+m) = sum_l x(16*l+m)*h^(t*l/256), (1.5)

Further, if we denote

Z(16*t+m) = h^(t*m/N)*Y(16*t+m) (1.6)

= h^(t*m/N)*sum_l x(16*l+m)*h^(t*l/256),

the outer summation is a 16-point-FFT

X(256*s+t) = sum_m { h^(s*m/16)*Z(16*t+m) } (1.7)

The above (1.5)-(1.7) suggest that 4096-point-FFT X(n),

n=256*s+t, with s in [0,15], t in [0,255], can be obtained by 3 steps:

(1a)

For each m=0,…,15, calculate the 256-point-FFT

Y(16*t+m) = sum_l x(16*l+m)*h^(t*l/256), (1.5)

For each fixed m, the input is a 256-point data sequence, so there are 16 such FFTs in total. Later, we will see that this is equivalent to doing an FFT on each column of a 256x16 matrix.

(1b)

Scale the resultant 16 256-point-FFTs to obtain

Z(16*t+m) = h^(t*m/N)*Y(16*t+m) (1.6)

(1c)

For each t in [0, 255], do a 16-point-FFT on Z

X(256*s+t) = sum_m { h^(s*m/16)*Z(16*t+m) } (1.7)

There are 256 such 16-point-FFTs in total.

Step 2. Implement the 4096-point-FFT by doing 16 256-point-FFTs followed by 256 16-point-FFTs

This implementation is valid for all processors, i.e., no memory constraint is taken into consideration.

(2a)

Imagine writing the 4096 inputs into a 256x16 matrix like this (call it matrix_x):

x(0), x(1), x(2), ..., x(15)

x(16), x(17), x(18), ..., x(31) (matrix_x)

......

x(255*16), x(255*16+1), x(255*16+2), ..., x(255*16+15)

This basically decimates the big input sequence indexed by k=0, …, 4095 into 16 sub-sequences, each with 256 points, i.e.,

k=16*l+m, with l in [0,255], m in [0,15], (1.2)

(2b)

Do a 256-point-FFT along each column of the above matrix,

Y(16*t+m) = sum_l x(16*l+m)*h^(t*l/256), (1.5)

i.e.,

do a 256-point-FFT on each of the 16 sub-sequences

x(0), x(16),x(32),...,x(255*16)

x(1), x(17),x(33),...,x(255*16+1)

x(2), x(18),x(34),...,x(255*16+2)

......

x(15), x(31),x(47),...,x(255*16+15)

This is equivalent to decimating the input sequence in time. The resultant 16 sequences are in the frequency domain:

Y(0), Y(16),Y(32),...,Y(255*16)

Y(1), Y(17),Y(33),...,Y(255*16+1)

Y(2), Y(18),Y(34),...,Y(255*16+2)

......

Y(15),Y(31),Y(47),...,Y(255*16+15)

They are not the 4096-point-FFT yet, since each of them is only a 256-point-FFT (they still need to be refined into a total of 4096 frequency bins). We may still write the frequencies Y in matrix form as in (2a); call it matrixY:

Y(0), Y(1), Y(2), ..., Y(15)

Y(16), Y(17), Y(18), ..., Y(31) (matrixY)

......

Y(255*16), Y(255*16+1), Y(255*16+2), ..., Y(255*16+15)

This is a matrix with rows indexed by t in [0,255] and columns indexed by m in [0,15].

(2c)

Find

Z(16*t+m) = h^(t*m/N)*Y(16*t+m) (1.6)

i.e., call it matrixZ

Z(0), Z(1), Z(2), ..., Z(15)

Z(16), Z(17), Z(18), ..., Z(31) (matrixZ)

......

Z(255*16), Z(255*16+1), Z(255*16+2), ..., Z(255*16+15)

=

g^0*Y(0), g^0*Y(1), g^0*Y(2), ..., g^0*Y(15)

g^0*Y(16), g^1*Y(17), g^2*Y(18), ..., g^15*Y(31)

......

g^0*Y(255*16), g^(255*1)*Y(255*16+1), g^(255*2)*Y(255*16+2), ..., g^(255*15)*Y(255*16+15)

where g=h^(1/N) = exp(-j*2*pi/4096)

(2d)

Do a 16-point-FFT on each row of matrixZ; there are 256 such FFTs to do in total:

X(256*s+t) = sum_m { h^(s*m/16)*Z(16*t+m) } (1.7)

The result contains the 4096-point-FFT values; call it matrixX:

X(256*0), X(256*1), X(256*2), ..., X(256*15)

X(256*0+1), X(256*1+1), X(256*2+1), ..., X(256*15+1) (matrixX)

......

X(256*0+255), X(256*1+255), X(256*2+255), ..., X(256*15+255)

This is the 4096-point-FFT, except that the order of the data needs to be adjusted; call the adjusted result finalX (this adjustment simply transposes the matrix into finalX, which is then read out row by row):

X(256*0), X(256*0+1), ..., X(256*0+255),

X(256*1), X(256*1+1), ..., X(256*1+255), (finalX)

X(256*2), X(256*2+1), ..., X(256*2+255),

......

X(256*15), X(256*15+1), ..., X(256*15+255)

Step 3. Implement the 4096-point-FFT on BF50x

The procedure described in Step 2 is a general approach, applicable to any processor, assuming there is enough L1 memory to work with. All Step 2 says is that a 4096-point-FFT can be broken down into 16 256-point-FFTs, followed by 256 16-point-FFTs, followed by a matrix transpose. Now we consider the BF50x, which is constrained by the following hardware:

32 kB L1 data memory;

16 kB instruction L1 memory (which can be used to store data)

+ 16 kB instruction L1 memory/cache

DMA channel between L1 data memory and L1 instruction memory

The existing 32 kB of L1 data memory is almost enough to implement the algorithm in Step 2. The reason we say almost enough: the input data takes 4k*4 B = 16 kB, where the factor 4 is due to the complex data, real/imag taking 2 bytes each. The computations up to matrixX can all be done “kind of in place”. The term “kind of in place” may be explained by the computation of the 0th column of matrixY: we read out the 0th column of matrix_x --

x(0), x(16), x(32), ..., x(255*16)

and do a 256-point-FFT on it. We get

Y(0), Y(16), Y(32), ..., Y(255*16)

We then write this array {Y(0), Y(16), Y(32), ..., Y(255*16)} back to the original location to override {x(0), x(16), x(32), ..., x(255*16)}. It is “in place” since matrixY takes over the memory location of matrix_x. It is only “kind of” in place, since the 256-point-FFT itself can use any kind of FFT implementation: in place or out of place. The choice of 256-point-FFT has only limited impact on the memory usage (we may even use a 256-point-FFT with 32-bit data and 16-bit twiddles, and only round the result back to 16 bits at the end of the 256-point-FFT).

Similarly, we can implement the 256 16-point-FFTs “kind of in place” to get matrixX. You may ask: where do you store h^(t*m/N) in

Z(16*t+m) = h^(t*m/N)*Y(16*t+m) (1.6)

and where do you store matrixZ? matrixZ may simply override matrixY. There are quite a few options to hold h^(t*m/N), where N=4096, t in [0:255], m in [0:15]; there are 4k elements in total (16 kB).

Option 1. Store them in flash and DMA them in row by row, costing 0 cyc.

Option 2. Store only the 256 elements h^(t/N) (i.e., the m=1 column), costing 1 kB in L1 data, and generate h^(t*m/N) on the fly for m in [2:15], costing about 4k*2 = 8k cyc; this will produce a worst-case 3-bit LSB error, but the 16-bit 256-point-FFT may have a similar level of error anyway (worst case, the 256-point-FFT will need to shift down by about 3 or 4 bits).

In the following we will assume Option 1.

As we said above, there is a total of 32 kB of L1 data memory. The input data matrix_x takes 16 kB. We may allocate 4 kB to do the 256-point-FFT/16-point-FFT. So we have used up 16+4 = 20 kB, leaving 12 kB. This 20 kB allows us to do all the operations needed to obtain matrixX. At this point, we need some help from the L1 instruction memory, since we are 4 kB short for the matrix transpose (we assume that the twiddle factors for the 256-point-FFT/16-point-FFT are kept in L1 data memory all the time; if we wipe them out after each 4096-point-FFT, we then have just enough memory to do the matrix transpose, with no need to use L1 instruction memory).

Assume we use the 16 kB of instruction memory to store matrixX (DMA matrixX from L1 data memory, row by row, to L1 instruction memory). After matrixX is completely stored in L1 instruction memory, we DMA the matrix from L1 instruction memory back to L1 data memory. The DMA can be set up such that matrixX is sent column by column while the L1 data memory stores the data row by row. That is, the DMA does the matrix transpose from matrixX to finalX, costing 0 cyc for the core.

To summarize the above: we need 20 kB of L1 data memory and 16 kB of L1 instruction memory (36 kB total).

We can estimate the cycles (all FFTs assumed to use 16-bit complex data and twiddles, with no scaling):

2337*16 + 2*4096 + 145*256 + (2+3)*4096 = 103184 cyc ~= 103 k cyc

(the 4096-point-FFT costs 56033 cyc = 56 k cyc if there is enough L1 memory)

where

2337 = cyc for one 256-point-FFT

16 = number of 256-point-FFTs to do

2 = cyc to multiply by h^(t*m/N), going from matrixY to matrixZ

4096 = number of data points

145 = cyc for one 16-point-FFT

256 = number of 16-point-FFTs to do

2 = 1 + 1, to read matrix_x column by column and matrixZ row by row before doing the FFTs

3 = some other overhead