This Question is Possibly Answered

1 "correct" answer available (4 pts) 1 "helpful" answer available (2 pts)
20 Replies Last post: Sep 9, 2009 10:01 AM by boris liberol   Go to original post 1 2 Previous Next
boris liberol Analog Employee   13 posts since
Aug 19, 2009
Currently Being Moderated
15. Sep 3, 2009 3:01 PM in response to: ystein
Re: very fast CRC32 code wanted

Thanks, Yosi -

 

Sure, base address is 32 bit aligned and with add with shift you saved 1 cycle.

 

Boris

 

//////////

Boris,

Do you agree that the two LSB of R7 are "0" (32bit align!)

You can save a cycle by:

 

Mine  R7 == yours >>2!!!

 

 

   r3 = (r7+r2) << 2      || r6 = p1;   // r7 = CRC table base , r6 is 32-bit CRC corresponding to input byte^MSbyte of CRC

   p1 = r3;                           // ptr into CRC update table, size 256*4 bytes

 

   r5 = r0 >> 24 || || r1 = b[p0++];  // r1 - input byte,  8 MSB of CRC interacting with input

   r2 = r1 ^ r5;

 

   r5 = r0 << 8;  // 24 LSB of old CRC interacting with CRC update

CRC32End:

   r0 = r5 ^ r6;

 

and now you have 6 cyc/8bit == > 0.75cyc/bit

 

do you agree?

 

Thanks,

Yosi

 

boris liberol Analog Employee   13 posts since
Aug 19, 2009
Currently Being Moderated
16. Sep 3, 2009 3:29 PM in response to: ystein
correction to the code

Have to correct the code - add+ shift can't be done in parallel:

 

LOOP_BEGIN ...

 

  r2 = (r7 + r2) << 2 ;

  r3 = r0 >> 24 || r6 = [p1];  // 8 MSB of CRC interacting with input,  r6 is 32-bit CRC corresponding to input byte
  p1 = r2;                          // ptr into CRC update table, size 256*4 bytes

 

  r5 = r0 << 8   || r1 = b[p0++];  // 24 LSB of old CRC interacting with CRC update         
 
  r2 = r3 ^ r1;
 
  r0 = r5 ^ r6;  

 

LOOP_END ...

boris liberol Analog Employee   13 posts since
Aug 19, 2009
Currently Being Moderated
17. Sep 8, 2009 9:56 AM in response to: boris liberol
fast CRC32 code correction

Sorry, guys - I mislead you.

 

The code doesn't work because it has to have feedback - should "bite it's tail" and use LUT value in the same loop iteration (not in the next as it was).

 

Don't know how to avoid nop's:

 

 

CRC32Start:

     

     r5 = r0 >> 24 || r1 = b[p0++];  // r1 - input byte,  8 MSB of CRC interacting with input

     r2 = r1 ^ r5;

     r3 = (r7+r2) << 2 ;   // r7 = CRC table base

     p1 = r3;                 // ptr into CRC update table, size 256*4 bytes

 

     nop;

     nop;

     nop; // can't use p1 for 4 cycles

 

     r5 = r0 << 8;  // 24 LSB of old CRC interacting with CRC update   

     r6 = [p1]; // r6 is 32-bit CRC corresponding to input byte^MSbyte of CRC 

 

CRC32End:

     r0 = r5 ^ r6;

 

So it is 10 cycles /8bits, 1.25 c/bit as for now.

 

Boris

boris liberol Analog Employee   13 posts since
Aug 19, 2009
Currently Being Moderated
18. Sep 8, 2009 5:01 PM in response to: boris liberol
fast CRC32 code - back to ~0.8 c/bit

Hi,

 

Yosi pointed out how to do it faster.

 

Usually we work on messages  that are not too small - at least, say, 64 bytes.

 

So if we break our original 64-byte packet into 2 parts - say, the first 32 bytes and the last 32 bytes - can work on 2 32-byte messages independently, recombining the results once in the end.

 

Initial conditions for the packets will be r0 = 0 (CRC32 of 0 message) and some precalculated const CRC32_32.


    LOOP CRC32 LC0 = p4; // p4 = (32 - 1) in this case
   
    p3 = p0;
   
    r0 = 0;
   
    r1 = CRC32_32;  // CRC32 initial value corresponging to x^(8*32) message polynomial
   
    p3 += 32; //

 

    r5 = r0 >> 24 || r2 = b[p0++];  // r1 - input byte,  8 MSB of CRC interacting with input

    r2 = r2 ^ r5;

    r3 = (r7+r2) << 2 ;  // r7 = CRC table base

    i1 = r3;                // ptr into CRC update table, size 256*4 bytes

 

    r5 = r1 >> 24 || r2 = b[p3++];                

    r2 = r2 ^ r5;

    r3 = (r7+r2) << 2 ;  // r7 = CRC table base    
    i2 = r3;
  

    r5 = r0 << 8;  // 24 LSB of old CRC interacting with CRC update  

    r4 = r1 << 8 || r3 = [i1]; // r6 is 32-bit CRC corresponding to input byte^MSbyte of CRC 

 
LOOP_BEGIN CRC32; 

    r0  = r5 ^ r3; // update r0

 

    nop; // still 1 stall here...


    r5 = r0 >> 24 || r2 = b[p0++] || r6 = [i2];  // r1 - input byte,  8 MSB of CRC interacting with input

    r2 = r2 ^ r5;

    r2 = (r7+r2) << 2 ;  // r7 = CRC table base

    i1 = r2;                // ptr into CRC update table, size 256*4 bytes

 

    r1 = r4 ^ r6;    // update r1

 

    r5 = r1 >> 24 || r2 = b[p3++];            
    r2 = r2 ^ r5;

    r2 = (r7+r2) << 2 ;  // r7 = CRC table base    
    i2 = r2;

 

    r5 = r0 << 8;  // 24 LSB of old CRC interacting with CRC update  

    r4 = r1 << 8 || r3 = [i1]; // r6 is 32-bit CRC corresponding to input byte^MSbyte of CRC 
        
LOOP_END CRC32;

     
    r0 = r0 ^ r1; //final r0 update
  

// back at ~ 13 cycles/2 bytes = 6.5 c/byte = 0.8 c/bit

ystein Analog Employee   8 posts since
Aug 19, 2009
Currently Being Moderated
19. Sep 8, 2009 7:56 PM in response to: boris liberol
Re: fast CRC32 code - back to ~0.8 c/bit

Boris,

I think there is a way to remove the in-loop stall ( see the changes in red) and get it down to 12/2 bytes  or 0.75 c/bit.

 

LOOP CRC32 LC0 = p4; // p4 = (32 - 1) in this case
    
     p3 = p0;
    
     r0 = 0;

r1 = r7<<2;                                       

  m1 = r1;                             //m1 = CRC table base.
   
     r1 = CRC32_32;  // CRC32 initial value corresponding to x^(8*32) message polynomial
    
     p3 += 32; //

 

    r5 = r0 >> 24 || r2 = b[p0++];  // r1 - input byte,  8 MSB of CRC interacting with input

    r2 = r2 ^ r5;

    r3 = (r7+r2) << 2 ;  // r7 = CRC table base

    i1 = r3;                // ptr into CRC update table, size 256*4 bytes

 

    r5 = r1 >> 24 || r2 = b[p3++];               

    r2 = r2 ^ r5;

    r3 = (r7+r2) << 2 ;  // r7 = CRC table base    
     i2 = r3;

    r5 = r0 << 8;  // 24 LSB of old CRC interacting with CRC update 

    r4 = r1 << 8 || r3 = [i1]; // r6 is 32-bit CRC corresponding to input byte^MSbyte of CRC

 
LOOP_BEGIN CRC32;

    r0  = r5 ^ r3; // update r0

 

//    nop; // still 1 stall here...


     r5 = r0 >> 24 || r2 = b[p0++];  // r1 - input byte

    r2 = r2 ^ r5;

    r2 = r2 << 2 || r6 = [i2];           // r7 = CRC table base, 8 MSB of CRC interacting with input

 

    i1 = r2;                                        // ptr into CRC update table, size 256*4 bytes

 

    r1 = r4 ^ r6;                             // update r1

 

    r5 = r1 >> 24 || r2 = b[p3++];            
     r2 = r2 ^ r5;

    r2 = (r7+r2) << 2;                   // r7 = CRC table base    
     i2 = r2;

 

    r5 = r0 << 8 || i1+=m1  // 24 LSB of old CRC interacting with CRC update , update I1CRC tbl ptr

    r4 = r1 << 8 || r3 = [i1]; // r6 is 32-bit CRC corresponding to input byte^MSbyte of CRC 
         
LOOP_END CRC32;

     
     r0 = r0 ^ r1; //final r0 update

boris liberol Analog Employee   13 posts since
Aug 19, 2009
Currently Being Moderated
20. Sep 9, 2009 10:01 AM in response to: ystein
Re: fast CRC32 code - back to ~0.8 c/bit

Thanks, Yosi - it sure works.

 

Boris

More Like This

  • Retrieving data ...