Hello!

I need a fast multiplier routine for the BF536 that multiplies two 4.60 fractionals to produce a 4.60 result. Saturation is not needed, but allowed.

Does anybody know a fast fixed point algorithm?

Thank you in advance for an answer

Hello,

you are right, I do not necessarily need a fixed point solution, but I need the fastest possible way. Could you please help me find the right function for this job? I wrote a pretty fast one in assembler, but it only works with unsigned values. Hence the signs have to be handled separately, and I do not think this is optimal. If there is a standard function for this purpose, I would like to test whether it is faster.

>I need a fast multiplier routine for the BF536 for multiplication of two 4.60 fractionals to produce a 4.60 result.

Hi,

with pure 4.60 fractional numbers, multiplication would be a bit complicated to implement,

so I propose, if you can afford losing 3 bits,

to compute it with 1.63 arithmetic and shift afterwards.

We want 4.60 * 4.60 -> 4.60.

If we apply 1.63 * 1.63 -> 1.63 to the same numbers,

we actually get 4.60 * 4.60 -> 7.57; then you saturate and shift 3 bits left to get back to 4.60.
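
To make the 1.63 detour concrete, here is a minimal reference sketch in plain C (not Blackfin code). It assumes a compiler with the `__int128` extension (GCC/Clang) and an arithmetic right shift for negative values; the names `mult_q63_ref` and `mult_q4_60_via_q63` are made up for this illustration.

```c
#include <stdint.h>

/* Reference 1.63 x 1.63 -> 1.63 multiply (hypothetical helper).
   Assumes the GCC/Clang __int128 extension and an arithmetic
   right shift for negative values. */
static int64_t mult_q63_ref(int64_t a, int64_t b)
{
    __int128 p = (__int128)a * b;   /* exact product, 126 fractional bits */
    p >>= 63;                       /* back to 63 fractional bits         */
    if (p > INT64_MAX)              /* only happens for -1.0 * -1.0       */
        p = INT64_MAX;
    return (int64_t)p;
}

/* 4.60 x 4.60 -> 4.60 through the 1.63 multiply.
   Feeding 4.60 inputs into a 1.63 multiply yields a 7.57 value,
   so saturate to the 4.60 range and shift left by 3.
   The 3 lowest result bits are always zero (the 3 lost bits). */
int64_t mult_q4_60_via_q63(int64_t a, int64_t b)
{
    const int64_t lim = (int64_t)1 << 60;  /* 8.0 in 7.57 format */
    int64_t r = mult_q63_ref(a, b);        /* 7.57 intermediate  */
    if (r > lim - 1) r = lim - 1;          /* saturate high      */
    if (r < -lim)    r = -lim;             /* saturate low       */
    return r * 8;                          /* << 3, written as a multiply
                                              so negative r is well defined */
}
```

For example, 1.0 * 1.0 (both `1 << 60` in 4.60) comes back as `1 << 60`, and 4.0 * 4.0 saturates to the largest value with the low 3 bits cleared.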

now, how to compute 1.63 * 1.63 -> 1.63 ?

as the processor can only do 16*16->32 bits multiplication,

just break down your numbers into 4 words,

x = x1 * 2^-15 + x2 * 2^-31 + x3 * 2^-47 + x4 * 2^-63

y = y1 * 2^-15 + y2 * 2^-31 + y3 * 2^-47 + y4 * 2^-63

and do the math on words like you did on digits in school, with the carry.

We actually keep the most significant bits:

z = x*y = x1 * y1 * 2 ^-30

+ x1 * y2 * 2 ^-46

+ x2 * y1 * 2 ^-46

+ x1 * y3 * 2 ^-62

+ x2 * y2 * 2 ^-62

+ x3 * y1 * 2 ^-62

x1 and y1 are in 1.15 format (S); the others should be treated as 0.16 format (U)
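
The word breakdown can be sketched in C as follows. This is only an illustration: the type `q63_words` and the functions `q63_split` and `q63_join` are hypothetical names, and the cast to `int16_t` assumes the usual two's-complement wrap-around.

```c
#include <stdint.h>

/* Hypothetical container for the four 16-bit words of a 1.63 value. */
typedef struct {
    int16_t  w1;          /* most significant word, signed (1.15) */
    uint16_t w2, w3, w4;  /* lower words, unsigned (0.16)         */
} q63_words;

/* Split a 1.63 value into x1..x4 (w1 is x1, w4 is x4). */
q63_words q63_split(int64_t x)
{
    uint64_t u = (uint64_t)x;
    q63_words w;
    w.w1 = (int16_t)(u >> 48);   /* two's-complement wrap assumed */
    w.w2 = (uint16_t)(u >> 32);
    w.w3 = (uint16_t)(u >> 16);
    w.w4 = (uint16_t)u;
    return w;
}

/* Rebuild the raw 64-bit value:
   x * 2^63 = x1 * 2^48 + x2 * 2^32 + x3 * 2^16 + x4 */
int64_t q63_join(q63_words w)
{
    return (int64_t)w.w1 * 281474976710656LL   /* 2^48 */
         + (int64_t)w.w2 * 4294967296LL        /* 2^32 */
         + (int64_t)w.w3 * 65536LL             /* 2^16 */
         + (int64_t)w.w4;
}
```

Only the top word carries the sign; splitting -1 gives w1 = -1 and 0xFFFF in the three unsigned words, and joining them gives -1 again.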

pseudo code:

Acc = x1 * y3 (SU)

Acc += x2 * y2 (UU)

Acc += x3 * y1 (US)

store Acc.L to z4

shift Acc 16 bits right

Acc += x1 * y2 (SU)

Acc += x2 * y1 (US)

store Acc.L to z3

shift Acc 16 bits right

Acc += x1 * y1 (SS)

store Acc.L to z2

store Acc.H to z1

in blackfin assembly,

the processor uses little endian, so x1,x2,x3,x4 are stored in reversed order in memory

SS is the default option for MAC instruction

SU is actually option (M) and is available only on MAC1 unit, so use accumulator A1

Hi

here's an implementation of 1.63 * 1.63 multiplication.

it's C-callable assembly.

Note the 1-bit left shift of the intermediate operations and the propagation of the carry word and acc.X.

I think there isn't much room for optimization.

If you want to implement 4.60*4.60 directly,

you need to shift 4 bits instead of 1,

and shift the last multiplication 3 bits before adding.

Edit:

the routine in my first post was wrong.

I realized that you also need to compute x1*y4 + x2*y3 + x3*y2 + x4*y1,

because the 16 MSBs of this 32-bit sum contribute to the 16 LSBs of your 1.63 result.
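
As a cross-check of the word arithmetic, including the extra x1*y4 + x2*y3 + x3*y2 + x4*y1 term, here is a small C model. It is a simplified sketch and not the attached Blackfin routine: the name `mult_q63_words` is hypothetical, it assumes two's complement with an arithmetic right shift, and it applies the 1-bit left shift once at the end, so the result can be a few LSBs below the exact product.

```c
#include <stdint.h>

/* Model of a 1.63 x 1.63 -> 1.63 multiply built only from
   16x16 -> 32-bit products (hypothetical name).
   Assumes two's complement and an arithmetic right shift
   for negative values. */
int64_t mult_q63_words(int64_t x, int64_t y)
{
    /* x1/y1 are signed (1.15), the other words unsigned (0.16) */
    int64_t x1 = x >> 48;
    int64_t x2 = (x >> 32) & 0xFFFF;
    int64_t x3 = (x >> 16) & 0xFFFF;
    int64_t x4 = x & 0xFFFF;
    int64_t y1 = y >> 48;
    int64_t y2 = (y >> 32) & 0xFFFF;
    int64_t y3 = (y >> 16) & 0xFFFF;
    int64_t y4 = y & 0xFFFF;

    /* Columns of partial products, least significant first */
    int64_t d3 = x1*y4 + x2*y3 + x3*y2 + x4*y1;  /* weight 2^-78 */
    int64_t d4 = x1*y3 + x2*y2 + x3*y1;          /* weight 2^-62 */
    int64_t d5 = x1*y2 + x2*y1;                  /* weight 2^-46 */
    int64_t d6 = x1*y1;                          /* weight 2^-30 */

    /* Only the upper bits of d3 reach the result; r is in 2.62 format.
       Columns below d3 are dropped, as in the description above. */
    int64_t r = (d3 >> 16) + d4 + d5 * 65536 + d6 * 4294967296LL;

    /* 2.62 -> 1.63 is a 1-bit left shift; -1.0 * -1.0 saturates */
    if (r >= ((int64_t)1 << 62))
        return INT64_MAX;
    return r * 2;
}
```

With 0.5 * 0.5 (both `1 << 62`) the model returns `1 << 61`, i.e. 0.25, and the same magnitude with a negative sign for -0.5 * 0.5, so it can serve as a sign-aware comparison point for an assembly routine.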

Best regards.

Hello,

thank you, this is a nice and fast program. It always gives precise results with positive input values.

But I am afraid it does not work very well with negative operands. Please try this main:

int main( void )
{
    long long a,b,c,d,e,f,g,h;

    a = 0x57D8B76A5C5EF129;
    b = 0x7963d632C21B38DE;
    c = mult_fr1x64x64(a,b);
    d = mult_fr1x64x64(-a,-b);

    e = 0x0275C39D680DB460;
    f = 0x05AC3AA443A5A100;
    g = mult_fr1x64x64(e,f);
    h = mult_fr1x64x64(-e,-f);

    return 0;
}

You will see that there is only a very small difference between c and d, but a big one between g and h. The problem (or one problem) is a saturation in the 4th MAC operation of the assembler routine. Unfortunately, I am not familiar enough with the MAC to find the solution for this problem; maybe you are?

Thanks in advance for an answer.

Kind regards

Hi,

I was able to get better results after I slightly changed the code for the function as follows:

Instead of the first 6 lines in the original code:

//

A1 = R1.H * R2.L (M); // x1*y4

A1 += R3.H * R0.L (M);// + y1*x4

a0 = a1;

A0 += R1.L * R2.H (FU);// + x2*y3

A0 += R3.L * R0.H (FU);// + y2*x3

a0 = a0 >>> 15;

//

calculate this intermediate result in a different order:

// new code

A1 = R1.L * R2.H (FU);// + x2*y3

A1 += R3.L * R0.H (FU);// + y2*x3

A1 += R1.H * R2.L (M); // x1*y4

A1 += R3.H * R0.L (M);// + y1*x4

A1 = A1 >>> 15;

a0 = a1;

//

Somehow accumulator a0 overflows if you do the signed operations first and then the unsigned ones.

I tested some numbers in the same manner as you, including your example, and got close results for negative arguments.

Boris
