Thanks, Yosi -
Sure, base address is 32 bit aligned and with add with shift you saved 1 cycle.
Boris
//////////
Boris,
Do you agree that the two LSB of R7 are "0" (32bit align!)
You can save a cycle by:
Mine R7 == yours >>2!!!
r3 = (r7+r2) << 2 || r6 = p1; // r7 = CRC table base , r6 is 32-bit CRC corresponding to input byte^MSbyte of CRC
p1 = r3; // ptr into CRC update table, size 256*4 bytes
r5 = r0 >> 24 || || r1 = b[p0++]; // r1 - input byte, 8 MSB of CRC interacting with input
r2 = r1 ^ r5;
r5 = r0 << 8; // 24 LSB of old CRC interacting with CRC update
CRC32End:
r0 = r5 ^ r6;
and now you have 6 cyc/8bit == > 0.75cyc/bit
do you agree?
Thanks,
Yosi
Have to correct the code - add+ shift can't be done in parallel:
LOOP_BEGIN ...
r2 = (r7 + r2) << 2 ;
r3 = r0 >> 24 || r6 = [p1]; // 8 MSB of CRC interacting with input, r6 is 32-bit CRC corresponding to input byte
p1 = r2; // ptr into CRC update table, size 256*4 bytes
r5 = r0 << 8 || r1 = b[p0++]; // 24 LSB of old CRC interacting with CRC update
r2 = r3 ^ r1;
r0 = r5 ^ r6;
LOOP_END ...
Sorry, guys - I mislead you.
The code doesn't work because it has to have feedback - should "bite it's tail" and use LUT value in the same loop iteration (not in the next as it was).
Don't know how to avoid nop's:
CRC32Start:
r5 = r0 >> 24 || r1 = b[p0++]; // r1 - input byte, 8 MSB of CRC interacting with input
r2 = r1 ^ r5;
r3 = (r7+r2) << 2 ; // r7 = CRC table base
p1 = r3; // ptr into CRC update table, size 256*4 bytes
nop;
nop;
nop; // can't use p1 for 4 cycles
r5 = r0 << 8; // 24 LSB of old CRC interacting with CRC update
r6 = [p1]; // r6 is 32-bit CRC corresponding to input byte^MSbyte of CRC
CRC32End:
r0 = r5 ^ r6;
So it is 10 cycles /8bits, 1.25 c/bit as for now.
Boris
Hi,
Yosi pointed out how to do it faster.
Usually we work on messages that are not too small - at least, say, 64 bytes.
So if we break our original 64-byte packet into 2 parts - say, the first 32 bytes and the last 32 bytes - can work on 2 32-byte messages independently, recombining the results once in the end.
Initial conditions for the packets will be r0 = 0 (CRC32 of 0 message) and some precalculated const CRC32_32.
LOOP CRC32 LC0 = p4; // p4 = (32 - 1) in this case
p3 = p0;
r0 = 0;
r1 = CRC32_32; // CRC32 initial value corresponging to x^(8*32) message polynomial
p3 += 32; //
r5 = r0 >> 24 || r2 = b[p0++]; // r1 - input byte, 8 MSB of CRC interacting with input
r2 = r2 ^ r5;
r3 = (r7+r2) << 2 ; // r7 = CRC table base
i1 = r3; // ptr into CRC update table, size 256*4 bytes
r5 = r1 >> 24 || r2 = b[p3++];
r2 = r2 ^ r5;
r3 = (r7+r2) << 2 ; // r7 = CRC table base
i2 = r3;
r5 = r0 << 8; // 24 LSB of old CRC interacting with CRC update
r4 = r1 << 8 || r3 = [i1]; // r6 is 32-bit CRC corresponding to input byte^MSbyte of CRC
LOOP_BEGIN CRC32;
r0 = r5 ^ r3; // update r0
nop; // still 1 stall here...
r5 = r0 >> 24 || r2 = b[p0++] || r6 = [i2]; // r1 - input byte, 8 MSB of CRC interacting with input
r2 = r2 ^ r5;
r2 = (r7+r2) << 2 ; // r7 = CRC table base
i1 = r2; // ptr into CRC update table, size 256*4 bytes
r1 = r4 ^ r6; // update r1
r5 = r1 >> 24 || r2 = b[p3++];
r2 = r2 ^ r5;
r2 = (r7+r2) << 2 ; // r7 = CRC table base
i2 = r2;
r5 = r0 << 8; // 24 LSB of old CRC interacting with CRC update
r4 = r1 << 8 || r3 = [i1]; // r6 is 32-bit CRC corresponding to input byte^MSbyte of CRC
LOOP_END CRC32;
r0 = r0 ^ r1; //final r0 update
// back at ~ 13 cycles/2 bytes = 6.5 c/byte = 0.8 c/bit
Boris,
I think there is a way to remove the in-loop stall ( see the changes in red) and get it down to 12/2 bytes or 0.75 c/bit.
LOOP CRC32 LC0 = p4; // p4 = (32 - 1) in this case
p3 = p0;
r0 = 0;
r1 = r7<<2;
m1 = r1; //m1 = CRC table base.
r1 = CRC32_32; // CRC32 initial value corresponding to x^(8*32) message polynomial
p3 += 32; //
r5 = r0 >> 24 || r2 = b[p0++]; // r1 - input byte, 8 MSB of CRC interacting with input
r2 = r2 ^ r5;
r3 = (r7+r2) << 2 ; // r7 = CRC table base
i1 = r3; // ptr into CRC update table, size 256*4 bytes
r5 = r1 >> 24 || r2 = b[p3++];
r2 = r2 ^ r5;
r3 = (r7+r2) << 2 ; // r7 = CRC table base
i2 = r3;
r5 = r0 << 8; // 24 LSB of old CRC interacting with CRC update
r4 = r1 << 8 || r3 = [i1]; // r6 is 32-bit CRC corresponding to input byte^MSbyte of CRC
LOOP_BEGIN CRC32;
r0 = r5 ^ r3; // update r0
// nop; // still 1 stall here...
r5 = r0 >> 24 || r2 = b[p0++]; // r1 - input byte
r2 = r2 ^ r5;
r2 = r2 << 2 || r6 = [i2]; // r7 = CRC table base, 8 MSB of CRC interacting with input
i1 = r2; // ptr into CRC update table, size 256*4 bytes
r1 = r4 ^ r6; // update r1
r5 = r1 >> 24 || r2 = b[p3++];
r2 = r2 ^ r5;
r2 = (r7+r2) << 2; // r7 = CRC table base
i2 = r2;
r5 = r0 << 8 || i1+=m1 // 24 LSB of old CRC interacting with CRC update , update I1CRC tbl ptr
r4 = r1 << 8 || r3 = [i1]; // r6 is 32-bit CRC corresponding to input byte^MSbyte of CRC
LOOP_END CRC32;
r0 = r0 ^ r1; //final r0 update
Thanks, Yosi - it sure works.
Boris