2008-08-12 13:02:52 Optimization Help
Kyle Schlansker (UNITED STATES)
Message: 60339
Anyone up for an optimization problem?
I need to apply, in real-time, a chebychev filter on two channels of data that are interleaved in memory. Right now, my code is taking a little more than 11 seconds to run on 10 seconds worth of data, on the BF548 EZ-KIT. I'm looking for some suggestions on how to optimize this to get it down to under 10 seconds (and prefereably much lower...)
So far, I've tried:
1) putting the function in L1 SRAM (no speedup)
2) threading the algorithm by separating the interleaved data into individual channels and then decimating each channel in its own thread (no speedup)
3) Enabling write-back memory cache rather than write through (small speedup)
The for loop is what's taking all the time. Perhaps there is a better memory layout that I could use?
Here's the function in question:
void apply_chebychev_filter_fast(float *data, int *sz) {
int i;
double *arr = malloc( (*sz) * sizeof(double));
if(!arr) {
fprintf(stderr, "ERROR:%s:%d: error allocating memory for chebychev filter\n", __FUNCTION__, __LINE__);
exit(1);
}
memset(arr, 0, *sz*sizeof(double));
for(i = 16; i < *sz; i++) {
arr[i] =
num_coeffs[0] * data[i-0] + // b(0) * x(i-0)
num_coeffs[1] * data[i-2] + // b(1) * x(i-1)
num_coeffs[2] * data[i-4] +
num_coeffs[3] * data[i-6] +
num_coeffs[4] * data[i-8] +
num_coeffs[5] * data[i-10] +
num_coeffs[6] * data[i-12] +
num_coeffs[7] * data[i-14] +
num_coeffs[8] * data[i-16] -
den_coeffs[1] * arr[i-2] - // a(1) * y(i-1)
den_coeffs[2] * arr[i-4] -
den_coeffs[3] * arr[i-6] -
den_coeffs[4] * arr[i-8] -
den_coeffs[5] * arr[i-10] -
den_coeffs[6] * arr[i-12] -
den_coeffs[7] * arr[i-14] -
den_coeffs[8] * arr[i-16];
}
// copy new data to data array
for(i = 16; i < *sz; i++)
data[i-16] = (float)arr[i];
// update size of new array
*sz = *sz - 16;
// throw temp memory back into the pool
free(arr);
}
QuoteReplyEditDelete
2008-08-12 13:20:55 Re: Optimization Help
Mike Frysinger (UNITED STATES)
Message: 60342
since you're using float data types, most of the time is probably spent in gcc math functions emulating the floating point which is why putting that small function into L1 results in negligible performance change
you could try placing the buffers into L1 data
otherwise, the best bet is to convert things from using float to using fixed point math ... a lot of Blackfin docs on analog.com goes over this
QuoteReplyEditDelete
2008-08-12 13:44:21 Re: Optimization Help
Robin Getz (UNITED STATES)
Message: 60343
Kyle:
As Mike stated - you might get better performance if you can move things to fixed point fractional, and use some of the library functions:
https://docs.blackfin.uclinux.org/doku.php?id=toolchain:libbfdsp#filters
-Robin
QuoteReplyEditDelete
2008-08-12 13:46:10 Re: Optimization Help
Kyle Schlansker (UNITED STATES)
Message: 60344
Thanks Mike. I will convert my code to use fixed point.
I'm relatively new to the Blackfin. Are there any documents in particlular that you think would be beneficial for me to read? I've really only been reading the hardware and peripheral references, but I'm interested in learning more about optimizing code for the Blackfin.
QuoteReplyEditDelete
2008-08-13 09:18:03 Re: Optimization Help
Kyle Schlansker (UNITED STATES)
Message: 60396
Mike & Robin,
Unfortunately, converting to 32-bit fixed point isn't as easy as I originally thought. The range of my filter coefficients is prohibiting me from getting the accuracy I need. The numerator coefficients are so small that they require a large number of fractional bits, but the the denominator coefficients need at least 5 bits for the integer part plus a sign bit.
I started to turn my fixed point interface into using a 64-bit (long long) representation, but soon realized that the hardware doesn't directly support this.
Any other suggestions? Is there some tricky assembly I could do or am I SOL?
QuoteReplyEditDelete
2008-08-13 09:29:13 Re: Optimization Help
Kyle Schlansker (UNITED STATES)
Message: 60397
I should also note that the range of my input data that passes through the filter is [-1,1] so, if it wasn't for those denominator coefficients, I could get away with only a sign bit and a single integer bit and use the rest for the fractional part...
QuoteReplyEditDelete
2008-08-14 12:31:38 Re: Optimization Help
Mike Frysinger (UNITED STATES)
Message: 60483
you might be able to create a custom linker script so as to place the gcc math functions that are used to emulate the floating point into L1 ... but i'm not sure how much that'll really help. you may be able to get just under 10 seconds, but i doubt it'll suddenly make the Blackfin crunch that floating point in fractions of a seconds ...
QuoteReplyEditDelete
2008-08-14 17:42:40 Re: Optimization Help
Frank Van Hooft (CANADA)
Message: 60491
You don't need to scale the numerator & denominator coefficients by the same amount. Simplistically, you can think of your math as:
arr = scale1(num_coeff*data) - scale2(den_coeff*arr)
scale1 can be a larger number to give you the resolution you need on the numerator coefficients.
scale2 can be a smaller number to allow 5 bits for the denominator integer portion.
If you de-scale the result of each half of the above equation before you do the subtraction (either descale all the way, or just partially descale them to some middle ground), you should be fine.
QuoteReplyEditDelete
2008-08-21 10:11:21 Re: Optimization Help
Kyle Schlansker (UNITED STATES)
Message: 60839
For the sake of thread-closure:
I was able to get things working using a new assmbly version of the iirdf1_fr32 function in libdsp. Runtime is down to 0.6 seconds (0.06 x real time) from the original 11 seconds (1.1 x real time)!
Frank: Thanks for your suggestion. I did try 3 different sets of quantization, but in the end I still ran into overflow problems. I believe these were solved by the new libdsp function which makes use of all 40-bits in the accumulators. Or perhaps I just had a bug
QuoteReplyEditDelete
2008-08-21 10:40:56 Re: Optimization Help
Robin Getz (UNITED STATES)
Message: 60841
Kyle:
Thanks for the update - glad things are working.
-Robin