I need to perform a 32x32-bit multiply on two data buffers within a filter routine for a BF518/12. To make it as efficient as possible, I wrote it in assembly. The code within the primary loop is as follows:

A1 = R2.L * R1.H;

A1 += R2.H * R1.L;

A1 = A1 >>> 16; /* arithmetic right shift */

A1 += R2.H * R1.H;

A0 += A1;

R1 = [I1++];

R2 = [I0++];

Does anyone have any ideas for how I might execute some of these instructions in parallel? I'm still a novice with the Blackfin's assembly instructions.

Hello David,

One suggestion would be to unroll the loop by one step and issue the data loads in parallel with the multiplies (if you can afford to use two more data registers). Something like the following:

before loop:

R1 = [I1++];

R2 = [I0++];

loop:

A1 = R2.L * R1.H || R3 = [I1++];

A1 += R2.H * R1.L || R4 = [I0++];

A1 = A1 >>> 16; /* arithmetic right shift */

A1 += R2.H * R1.H;

A0 += A1;

A1 = R4.L * R3.H || R1 = [I1++];

A1 += R4.H * R3.L || R2 = [I0++];

A1 = A1 >>> 16; /* arithmetic right shift */

A1 += R4.H * R3.H;

A0 += A1;

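and run this loop half as many times (this assumes an even number of taps). Note that the last pass loads one element past the end of each buffer into R1 and R2; those values are never used. Also, you probably know this, but make sure both buffers are located in L1 data memory.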

Unfortunately, all of your other instructions are 32-bit, and a multi-issue instruction can contain only one 32-bit instruction (plus up to two 16-bit ones), so they cannot be issued in parallel with each other. Without changing the functionality of your assembly code, or changing the way you are implementing your filter, I'm not sure how you could optimize further, although someone else may have a better suggestion.
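If it helps to check the unrolled loop against the original on a PC, here is a rough C model of the arithmetic. It is only a sketch under some assumptions: the function names (asr16, mul32_hi_approx, filter_simple, filter_unrolled2) are made up for illustration, the low halves are treated as unsigned, and it does not try to reproduce the MAC unit's operating mode or the 40-bit accumulator saturation, so it is not a bit-exact model of the assembly above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* floor(x / 2^16); stands in for the arithmetic shift "A1 = A1 >>> 16". */
static int64_t asr16(int64_t x)
{
    return (x >= 0) ? x / 65536 : -((-x + 65535) / 65536);
}

/* One tap: approximate (a * b) >> 32 from three 16x16 partial products.
   The low * low product is dropped, as in the assembly.
   ASSUMPTION: low halves are unsigned, high halves are signed. */
static int64_t mul32_hi_approx(int32_t a, int32_t b)
{
    int32_t al = (int32_t)((uint32_t)a & 0xFFFFu); /* low half, 0..65535 */
    int32_t bl = (int32_t)((uint32_t)b & 0xFFFFu);
    int32_t ah = (a - al) / 65536;                 /* high half, signed  */
    int32_t bh = (b - bl) / 65536;

    int64_t cross = (int64_t)al * bh + (int64_t)ah * bl; /* the two cross terms */
    return asr16(cross) + (int64_t)ah * bh;              /* shift, add high*high */
}

/* Original loop: one tap per iteration, accumulated like A0 += A1.
   int64_t is used for the accumulator; saturation is not modeled. */
static int64_t filter_simple(const int32_t *x, const int32_t *h, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += mul32_hi_approx(x[i], h[i]);
    return acc;
}

/* Unrolled by two: two taps per iteration, half as many iterations. */
static int64_t filter_unrolled2(const int32_t *x, const int32_t *h, size_t n)
{
    assert(n % 2 == 0); /* the unrolled loop needs an even tap count */
    int64_t acc = 0;
    for (size_t i = 0; i < n; i += 2) {
        acc += mul32_hi_approx(x[i], h[i]);
        acc += mul32_hi_approx(x[i + 1], h[i + 1]);
    }
    return acc;
}
```

Because the low * low term is dropped, each tap's result is either equal to the exact (a * b) >> 32 or one less than it. Both loop forms add the same per-tap values in the same order, so they should agree exactly for any even number of taps.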

For more information, refer to the section on "issuing parallel instructions" in the processor programming reference, located at:

http://www.analog.com/static/imported-files/processor_manuals/blackfin_pgr.ref.man.rev1.3.pdf

Hope this helps,

Tom