2008-10-02 22:02:48     An assembly-language optimisation tip

Document created by Aaronwu Employee on Aug 8, 2013

2008-10-02 22:02:48     An assembly-language optimisation tip

Frank Van Hooft (CANADA)

Message: 63061   


Lately I've been doing some assembly-language optimisations of portions of our image-processing C code. I had the ironic experience where I identified one section of C code that was consuming a large number of cycles. So I rewrote that C code into assembler. That assembler code was a thing of beauty, delightfully written, making extensive use of parallel instructions. I thought it would fly like the wind. When I ran it, I discovered it actually consumed almost exactly the same number of cycles as the original C code. Darn!




Why? Because of external SDRAM accesses. Due to their size (typically around 1 MB in our case), our image files can't fit into internal RAM. So they have to live in external memory, and hence our image-processing routines are constantly reading from & writing to external memory. The cycle count for a routine can therefore sometimes be limited not by the actual "math" of the processing routine, but rather by CPU stalls while the core waits on reads & writes to & from external memory.


You can find app notes on the ADI website talking about the mechanics of this. But the bottom line is it's really hard to estimate, because it's dependent upon many different factors. What I've found really helpful is that before I go & re-code a whole C routine into assembler, first I write a "memory access test" assembler function. The test function simply performs the same type & number of external memory accesses as the final function would. I benchmark the test function, and if it's about the same number of cycles as the C code, that tells me I shouldn't be blindly rewriting the C code into assembler, because I'm just going to waste my time.  (Again!  :-)


For example, if the C routine is going to:


- read byte 1, read byte 2

- perform some math

- save result as a 16-bit word

- repeat (752 x 240) times


then the memory access test function can look like this:






/**********************************************************************
*    byte_read_write_bfin
*
*    void byte_read_write_bfin (char *bytebuff, short *wordbuff)
*
*    This function is for testing purposes only. It reads two bytes from
*    bytebuff, and writes a word to wordbuff. It repeats this
*    752 x 240 = 180,480 times.
*    We do this purely to count how many cycles it takes.
*
*    The data is meaningless - data read is ignored, and junk is written.
*    This routine is simply for cycle counting, to get an idea of SDRAM
*    read & write speeds.
*
*    Result:  6,327,585 cycles on the BF537_STAMP
**********************************************************************/

.global _byte_read_write_bfin;

_byte_read_write_bfin:

    P0 = R0;                            // P0 points to bytebuff
    P1 = R1;                            // P1 points to wordbuff
    P2.H = 0x2;
    P2.L = 0xC100;                      // P2 now contains decimal 180,480

    /* Loop - we'll loop P2 number of times */
    lsetup (.brwt, .brwb) LC0 = P2;

.brwt:  R0 = B [P0] (Z);                // read a byte
        P0 += 3;
        R1 = B [P0] (Z);                // read another byte, a little further along
        P0 += 3;
.brwb:  W [P1++] = R2;                  // write a word & move the write pointer along

    RTS;




You want to make sure that the external memory addressing in your test function is the same as what it'll be in the final function. Changing from byte accesses to word accesses, changing how the memory pointers increment or decrement, etc, can all significantly change the cycle count.


If you find that the cycle count of your test function is the same as the C function you're hoping to optimise, you've just gained a very valuable piece of information. Now you know any optimisations will first need to be algorithmic or memory-access improvements - and you know, before you even start, that simply rewriting the C into assembler isn't going to buy you anything.


This simple technique has proved very helpful for me - hopefully it'll also benefit others.




2008-10-02 22:21:30     Re: An assembly-language optimisation tip

Mike Frysinger (UNITED STATES)

Message: 63063   


you can do immediate offsets as well:

R0 = B[P0] (Z);

R1 = B[P0 + 0x3] (Z);

P0 += 6;




2008-10-03 11:09:48     Re: An assembly-language optimisation tip

Frank Van Hooft (CANADA)

Message: 63094   


Very true - the little test program can certainly be written more efficiently.


Of course it wouldn't actually help, and that fact illustrates the point I'm trying to convey very nicely. Look at the numbers above. 6.3 million cycles for 180,480 iterations of the loop equals 35 cycles per loop iteration.  35 cycles for a loop that's only 5 instructions long.  Why is it so slow?  Because the CPU is stalling, waiting for the external SDRAM accesses to complete.


That's the benefit of writing a little test program like this. If the code is going to do a bunch of external memory accesses, sometimes recoding it into "efficient" assembly is just a waste of time, because the CPU may be stalling, waiting around on external memory accesses, anyway. It sure is nice to know that *before* going through the effort of converting a bunch of C code into assembler.




2008-10-03 11:16:07     Re: An assembly-language optimisation tip

Mike Frysinger (UNITED STATES)

Message: 63095   


this paper was written to cover the topic:





2008-10-17 16:18:58     Re: An assembly-language optimisation tip


Message: 63867   


Frank, this is the exact issue I have spotted in many bfin assembly programs I wrote for image processing.


Reducing the instruction count alone is not always useful, especially in image-processing applications. It's equally important to exploit DMA: move chunks of memory from SDRAM to L1 SRAM and process them there. Load/store instructions to L1 SRAM execute in a single cycle, and DMA does not block the processor.




2008-10-17 17:56:06     Re: An assembly-language optimisation tip

Frank Van Hooft (CANADA)

Message: 63871   


Very true. Just keep in mind that DMA'ing between external SDRAM and internal SRAM is not free - it's consuming cycles on the external SDRAM bus, which in these cases is already a bottleneck.


It's tough to make the tradeoff regarding when to DMA to & from external SDRAM, versus just leaving the data in external SDRAM & accessing it from there. I think the complexity of the algorithm & the data sizes are significant factors. I'd certainly advocate writing a simple little test program to get some rough performance numbers first, before taking the trouble to port over an entire algorithm.