Hi.

I found this assembly code for the dot product in the documentation:

---

/* dot(int n, dm float *x, pm float *y);

Computes the dot product of two floating-point vectors of length n. One is stored in dm and the other in pm. Length n must be greater than 2.*/

#include <asm_sprt.h>

.section/pm seg_pmco;

.GLOBAL _dotASM;

_dotASM:

leaf_entry;

r0=r4-1,i4=r8; /* Load first vector address into I register, and load r0 with length -1 */

r0=r0-1,i12=r12; /* Load second vector address into I register and load r0 with length-2 (because the 2 iterations outside the loop feed and drain the pipe) */

f12=f12-f12,f2=dm(i4,m6),f4=pm(i12,m14); /* Zero the register that will hold the result and start feeding pipe */

f8=f2*f4, f2=dm(i4,m6),f4=pm(i12,m14); /* Second data set into pipeline, also do first multiply */

lcntr=r0, do dot_loop until lce; /* Loop length-2 times, three-stage pipeline: read, mult, add */

dot_loop:

f8=f2*f4, f12=f8+f12,f2=dm(i4,m6),f4=pm(i12,m14);

f8=f2*f4, f12=f8+f12;

f0=f8+f12;

/* drain the pipe and end with the result in r0, where it’ll be returned */

leaf_exit; /* restore the old frame pointer and return */

_dotASM.end:

---
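To make the feed/loop/drain structure described in the comments easier to follow, here is a rough plain-C model of the same three-stage pipeline (read, multiply, add). It is only a sketch: the name `dot_pipelined` is hypothetical, the `dm`/`pm` qualifiers are dropped so it builds with a host compiler, and the variable-to-register mapping in the comments is my reading of the listing above.

```c
/* Plain-C model of the software pipeline in the assembly listing.
 * Hypothetical helper, not part of the documentation.
 * Requires n >= 2; the assembly itself asks for n > 2. */
float dot_pipelined(int n, const float *x, const float *y)
{
    float acc = 0.0f;      /* f12: running sum, zeroed first */
    float a = x[0];        /* f2: first read from the dm vector */
    float b = y[0];        /* f4: first read from the pm vector */
    float prod = a * b;    /* f8: first multiply ... */
    int k = 2;             /* index of the next element to read */
    int i;

    a = x[1];              /* ... issued together with the second read */
    b = y[1];

    /* Steady state, n-2 iterations: add the previous product, multiply the
     * previously read pair, and read the next pair. The assembly does all
     * three in one instruction; here they are ordered so that each step
     * still sees the old values. */
    for (i = 0; i < n - 2; i++) {
        acc += prod;
        prod = a * b;
        a = x[k];
        b = y[k];
        k++;
    }

    /* Drain: one more add and multiply, then the final add. */
    acc += prod;
    prod = a * b;
    return prod + acc;     /* f0: returned result */
}
```

For x = {1, 2, 3} and y = {2, 3, 4}, the loop runs once and the two drain steps add the remaining products, giving 2 + 6 + 12 = 20.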

I wanted to compare this code with a hand-written C version of the dot product:

---

float dotC(int n, dm float *x, pm float *y) {

int i; float z = 0.;

for(i = 0; i < n; i++) {

z += x[i]*y[i];

}

return z;

}

---

The results are the same, but dotC is about twice as fast as dotASM. I don't understand why: the ASM function seems to have far fewer instructions than the C one. Can you explain why the C version is faster?

You can have a look at my attached project.

Hello pereira,

I have modified your code to enable SIMD for the assembly function as well. With this change, the ASM function takes 548 cycles, whereas the C function takes 567 cycles. I hope this is what you wanted to achieve.
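As a rough illustration of why SIMD matters here, the sketch below emulates a two-lane dot product in portable C. It assumes the C build used SIMD on a part with two processing elements, so that each loop iteration handles two element pairs; that would be consistent with the roughly 2x gap you measured, but it is an assumption, not something checked against the compiler output. The function names are hypothetical and the `dm`/`pm` qualifiers are dropped so the code builds with any C compiler.

```c
/* Scalar reference: one multiply-accumulate per iteration. */
float dot_scalar(int n, const float *x, const float *y)
{
    float z = 0.0f;
    int i;
    for (i = 0; i < n; i++) {
        z += x[i] * y[i];
    }
    return z;
}

/* Two-lane emulation (hypothetical): each iteration processes an even
 * and an odd element, the way a 2-wide SIMD unit would, so the loop
 * runs about n/2 times instead of n. */
float dot_two_lane(int n, const float *x, const float *y)
{
    float even = 0.0f;   /* partial sum kept by the first lane */
    float odd = 0.0f;    /* partial sum kept by the second lane */
    int i;

    for (i = 0; i + 1 < n; i += 2) {
        even += x[i] * y[i];
        odd += x[i + 1] * y[i + 1];
    }
    if (i < n) {         /* leftover element when n is odd */
        even += x[i] * y[i];
    }
    return even + odd;   /* combine the two partial sums at the end */
}
```

Because the two partial sums are combined only at the end, the result can differ from the scalar loop in the last bits for general floating-point data, even though both are valid dot products.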

Best Regards,

Jithul