2008-08-12 13:02:52     Optimization Help



Kyle Schlansker (UNITED STATES)

Message: 60339   


Anyone up for an optimization problem?


I need to apply, in real time, a Chebyshev filter to two channels of data that are interleaved in memory.  Right now, my code is taking a little more than 11 seconds to run on 10 seconds' worth of data on the BF548 EZ-KIT.  I'm looking for some suggestions on how to optimize this to get it down to under 10 seconds (and preferably much lower...)


So far, I've tried:


1) putting the function in L1 SRAM (no speedup)


2) threading the algorithm by separating the interleaved data into individual channels and then decimating each channel in its own thread (no speedup)


3) Enabling write-back memory cache rather than write through (small speedup)




The for loop is what's taking all the time.  Perhaps there is a better memory layout that I could use?




Here's the function in question:




void apply_chebychev_filter_fast(float *data, int *sz) {
    int i;

    double *arr = malloc((*sz) * sizeof(double));
    if (!arr) {
        fprintf(stderr, "ERROR:%s:%d: error allocating memory for chebychev filter\n",
                __FUNCTION__, __LINE__);
        return;
    }
    memset(arr, 0, (*sz) * sizeof(double));

    /* The two channels are interleaved, so each channel's previous sample
       is two array elements back: x(i-1) lives at data[i-2], and so on. */
    for (i = 16; i < *sz; i++) {
        arr[i] = num_coeffs[0] * data[i-0]  +  /* b(0) * x(i-0) */
                 num_coeffs[1] * data[i-2]  +  /* b(1) * x(i-1) */
                 num_coeffs[2] * data[i-4]  +
                 num_coeffs[3] * data[i-6]  +
                 num_coeffs[4] * data[i-8]  +
                 num_coeffs[5] * data[i-10] +
                 num_coeffs[6] * data[i-12] +
                 num_coeffs[7] * data[i-14] +
                 num_coeffs[8] * data[i-16] -

                 den_coeffs[1] * arr[i-2]   -  /* a(1) * y(i-1) */
                 den_coeffs[2] * arr[i-4]   -
                 den_coeffs[3] * arr[i-6]   -
                 den_coeffs[4] * arr[i-8]   -
                 den_coeffs[5] * arr[i-10]  -
                 den_coeffs[6] * arr[i-12]  -
                 den_coeffs[7] * arr[i-14]  -
                 den_coeffs[8] * arr[i-16];
    }

    /* copy new data to data array */
    for (i = 16; i < *sz; i++)
        data[i-16] = (float)arr[i];

    /* update size of new array */
    *sz = *sz - 16;

    /* throw temp memory back into the pool */
    free(arr);
}





2008-08-12 13:20:55     Re: Optimization Help

Mike Frysinger (UNITED STATES)

Message: 60342   


since you're using float data types, most of the time is probably spent in gcc math functions emulating the floating point which is why putting that small function into L1 results in negligible performance change


you could try placing the buffers into L1 data


otherwise, the best bet is to convert things from using float to using fixed point math ... a lot of the Blackfin docs on analog.com go over this




2008-08-12 13:44:21     Re: Optimization Help


Message: 60343   




As Mike stated - you might get better performance if you can move things to fixed point fractional, and use some of the library functions:








2008-08-12 13:46:10     Re: Optimization Help

Kyle Schlansker (UNITED STATES)

Message: 60344   


Thanks Mike.  I will convert my code to use fixed point.


I'm relatively new to the Blackfin.  Are there any documents in particular that you think would be beneficial for me to read?  I've really only been reading the hardware and peripheral references, but I'm interested in learning more about optimizing code for the Blackfin.






2008-08-13 09:18:03     Re: Optimization Help

Kyle Schlansker (UNITED STATES)

Message: 60396   


Mike & Robin,


Unfortunately, converting to 32-bit fixed point isn't as easy as I originally thought.  The range of my filter coefficients is prohibiting me from getting the accuracy I need.  The numerator coefficients are so small that they require a large number of fractional bits, but the denominator coefficients need at least 5 bits for the integer part plus a sign bit.


I started to turn my fixed point interface into using a 64-bit (long long) representation, but soon realized that the hardware doesn't directly support this.


Any other suggestions?  Is there some tricky assembly I could do or am I SOL?




2008-08-13 09:29:13     Re: Optimization Help

Kyle Schlansker (UNITED STATES)

Message: 60397   


I should also note that the range of my input data that passes through the filter is [-1,1] so, if it wasn't for those denominator coefficients, I could get away with only a sign bit and a single integer bit and use the rest for the fractional part...




2008-08-14 12:31:38     Re: Optimization Help

Mike Frysinger (UNITED STATES)

Message: 60483   


you might be able to create a custom linker script so as to place the gcc math functions that are used to emulate the floating point into L1 ... but i'm not sure how much that'll really help.  you may be able to get just under 10 seconds, but i doubt it'll suddenly make the Blackfin crunch that floating point in fractions of a second ...
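As a rough illustration only: GNU ld can pull individual members out of libgcc.a by name with its `archive:member(section)` input-section syntax.  The section and memory-region names below follow common bfin-elf linker scripts, and the object-file names are illustrative guesses; check the actual names in your toolchain (e.g. with `bfin-elf-nm libgcc.a`) before relying on this:

```
SECTIONS
{
  /* Place libgcc's soft-float helpers in L1 instruction SRAM.
     Member names vary by toolchain version -- verify with nm first. */
  .l1.text :
  {
    *libgcc.a:_mul_sf.o(.text)
    *libgcc.a:_addsub_sf.o(.text)
    *libgcc.a:_div_sf.o(.text)
  } > MEM_L1_CODE
}
```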




2008-08-14 17:42:40     Re: Optimization Help

Frank Van Hooft (CANADA)

Message: 60491   


You don't need to scale the numerator & denominator coefficients by the same amount. Simplistically, you can think of your math as:


arr = scale1(num_coeff*data) - scale2(den_coeff*arr)


scale1 can be a larger number to give you the resolution you need on the numerator coefficients.


scale2 can be a smaller number to allow 5 bits for the denominator integer portion.


If you de-scale the result of each half of the above equation before you do the subtraction (either descale all the way, or just partially descale them to some middle ground), you should be fine.




2008-08-21 10:11:21     Re: Optimization Help

Kyle Schlansker (UNITED STATES)

Message: 60839   


For the sake of thread-closure:


I was able to get things working using a new assembly version of the iirdf1_fr32 function in libdsp.  Runtime is down to 0.6 seconds (0.06 x real time) from the original 11 seconds (1.1 x real time)!




Frank: Thanks for your suggestion.  I did try 3 different sets of quantization, but in the end I still ran into overflow problems.  I believe these were solved by the new libdsp function, which makes use of all 40 bits in the accumulators.  Or perhaps I just had a bug.




2008-08-21 10:40:56     Re: Optimization Help


Message: 60841   




Thanks for the update - glad things are working.