
Migrate Assembler code from SHARC to SHARC+

Category: Software
Product Number: ADSP-SC598
Software Version: CCES 2.11.1

Hi all,

We have assembly code that we wrote and optimized for the ADSP-21489.

Do you have any advice on migrating it to the SC598 in terms of performance?

We have managed to make it run correctly, but it seems to use twice as many MIPS. I suspect that DM and PM are mapped differently on the SC598. Can you recommend papers or manual sections that explain the conceptual differences between SHARC and SHARC+?

I thank you in advance,

Best regards,

Yohann.

Thread Notes

  • Hi Yohann,

    We recommend referring to the FAQ linked below for more details on migrating assembler code from SHARC to SHARC+.

    FAQ: Can the existing SHARC assembly codes be re-used as it is with SHARC+ Core?
    ez.analog.com/.../can-the-existing-sharc-assembly-codes-be-re-used-as-it-is-with-sharc-core

    Hope this helps.

    Best Regards,
    Santhakumari.V

  • Hi,

    OK, this helps.

    Do you have a thread or manual that explains how to define PM and DM on the SC598?

    To be more specific, there are two things I am not able to do:

    • Modify a memory section from mem_DMCO_SDRAM_A1 { TYPE(BW RAM) START(0x80000000) END(0x802fffff) WIDTH(8)} to mem_DMCO_SDRAM_A1 { TYPE(PM RAM) START(???) END(???) WIDTH(32) }. On SHARC, the memory address just needs to be divided by the ratio of the WIDTH values, but it seems a different approach is needed now. I see some sections about memory addresses in the processor manuals, but I am a bit lost in such a big document.
    • Place a specific block of code in a PM section, like this:
            dxe_block_pm PM
            {
               INPUT_SECTION_ALIGN(4)
               INPUT_SECTIONS( $OBJS_LIBS(Seg_StormAudioPP_PM) )
            } > mem_DMCO_SDRAM_A1

      This never works: the linker says that Seg_StormAudioPP_PM is not placed in memory, as it expects to place it in SW. In the end it works with:
            dxe_block2_pm SW
            {
               INPUT_SECTION_ALIGN(4)
               INPUT_SECTIONS( $OBJS_LIBS(Seg_StormAudioPP_PM) )
            } > mem_DMCO_SDRAM_A4
      because mem_DMCO_SDRAM_A4 is SHARC0 VISA code, 3 MB. But if I place my PM code in a SW section, I lose the optimization of dual-bus loading.

    I see the same behavior with DM: it only works when placed in BW, instead of really being placed in a DM memory section.

    I thank you in advance,

    Best regards,

    Yohann.

  • Hi Yogann,

    The way you converted it was correct. The corresponding 32-bit word-addressed definition for mem_DMCO_SDRAM_A1 { TYPE(BW RAM) START(0x80000000) END(0x802fffff) WIDTH(8)} is below.

    mem_DMC0_SDRAM_A1 { TYPE(DM RAM) START(0x10000000) END(0x102FFFFF) WIDTH(32) }

    Because this normal-word (NW) address range is for data access only, the type qualifier you have to specify is DM.

    This is why you were not able to place the code section in this memory. You have to place the code section in mem_DMC0_SDRAM_A4, which is for SW code access.
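
The relation between the two address views can be sketched in C. This is only an illustration under stated assumptions: the base addresses 0x80000000 (byte view) and 0x10000000 (normal-word view) are taken from the two LDF lines above, the mapping inside the region is assumed to be linear with four bytes per 32-bit word, and the names `DMC0_BW_BASE`, `DMC0_NW_BASE` and `bw_to_nw` are hypothetical. The datasheet memory map remains the reference.

```c
#include <stdint.h>

/* Assumed base addresses, taken from the LDF lines quoted in this thread. */
#define DMC0_BW_BASE 0x80000000u  /* byte-addressed (BW) view of DMC0 */
#define DMC0_NW_BASE 0x10000000u  /* normal-word (NW, 32-bit) view of DMC0 */

/* Hypothetical helper: convert a DMC0 byte address into the normal-word
 * address of the 32-bit word that contains it, assuming a linear 4:1
 * mapping of the offset inside the region. */
static uint32_t bw_to_nw(uint32_t byte_addr)
{
    uint32_t byte_offset = byte_addr - DMC0_BW_BASE;
    return DMC0_NW_BASE + (byte_offset / 4u);
}
```

Under these assumptions `bw_to_nw(0x80000000)` gives 0x10000000 and `bw_to_nw(0x802fffff)` gives 0x100BFFFF, so the END value of a converted segment is worth checking against the datasheet before relying on it.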

    You can refer to Table 6 in the datasheet linked below for more information.
    www.analog.com/.../adsp-sc596-adsp-sc598.pdf

    Hope this helps.

    Regards,
    Santhakumari.K

  • Hi,

    There is something that I can't understand.


    I defined the sections as you mentioned, like this:
       mem_DMCO_SDRAM_PM48{ TYPE(PM RAM) START(0x00400000) END(0x004FFFFF) WIDTH(48) } // Extended Precision/ ISA Code (48 Bits)
       mem_DMCO_SDRAM_DM  { TYPE(DM RAM) START(0x100C0000) END(0x1013FFFF) WIDTH(32) }
       mem_DMCO_SDRAM_SW  { TYPE(SW RAM) START(0x00800000) END(0x00AFFFFF) WIDTH(16) }
       mem_DMCO_SDRAM_PM  { TYPE(DM RAM) START(0x10200000) END(0x13FFFFFF) WIDTH(32) }
       mem_DMCO_SDRAM_BW1 { TYPE(BW RAM) START(0x81000000) END(0x82dfffff) WIDTH(8) }
       mem_DMCO_SDRAM_BW2 { TYPE(BW RAM) START(0x82e00000) END(0x84ffffff) WIDTH(8) }
       mem_DMCO_SDRAM_BW3 { TYPE(BW RAM) START(0x85000000) END(0x87ffffff) WIDTH(8) }


    It follows the table you shared and seems to link.

    But when I move these variables from BW to DM:

    from

    #pragma align ALIGN
    section("Seg_CustomerPP_Data")    PeqCtrl_t                    sPeqCtrl;
    // State
    #pragma align ALIGN
    section("Seg_CustomerPP_Data")    PeqState_t                    sPeqState;
    // Params
    #pragma align ALIGN
    section("Seg_CustomerPP_Data")    PeqParams_t                    sPeqParams[TEST_N_RUNS][2];        // Note: will be filled in code

    // Processing buffer
    #pragma align ALIGN
    section("Seg_CustomerPP_Data")    float                        fProcBuf[PROCESS_BLOCK_LENGTH * N_TX_CHANNELS];

    // Channel maps
    #pragma align ALIGN
    section("Seg_CustomerPP_Data")    unsigned int                uiTxChannelMap[N_TX_CHANNELS] =         // Output channels logical to physical map
    { [...]

    to

    #pragma align ALIGN
    section("Seg_CustomerPP_DM")    dm PeqCtrl_t                    sPeqCtrl;
    // State
    #pragma align ALIGN
    section("Seg_CustomerPP_DM")    dm PeqState_t                    sPeqState;
    // Params
    #pragma align ALIGN
    section("Seg_CustomerPP_DM")    dm PeqParams_t                    sPeqParams[TEST_N_RUNS][2];        // Note: will be filled in code

    // Processing buffer
    #pragma align ALIGN
    section("Seg_CustomerPP_DM")    dm float                        fProcBuf[PROCESS_BLOCK_LENGTH * N_TX_CHANNELS];

    // Channel maps
    #pragma align ALIGN
    section("Seg_CustomerPP_DM")    dm unsigned int                uiTxChannelMap[N_TX_CHANNELS] =         // Output channels logical to physical map
    { [...]

    with this change in my LDF:

    from:
          SDRAM_EXT_DM BW
          {
             INPUT_SECTION_ALIGN(4)
             INPUT_SECTIONS( $OBJS_LIBS(Seg_CustomerPP_DM) )
          } > mem_DMCO_SDRAM_BW1

    to:
          SDRAM_EXT_DM DM
          {
             INPUT_SECTION_ALIGN(4)
             INPUT_SECTIONS( $OBJS_LIBS(Seg_CustomerPP_DM) )
          } > mem_DMCO_SDRAM_DM

    The linker shows an error:
    [Error li1060]  The following symbols are referenced, but not mapped:
            'sPeqState.' referenced from src\peq_test_SC598_Core1.doj(Seg_Test_Framework_Code)
            'sPeqParams.' referenced from src\peq_test_SC598_Core1.doj(Seg_Test_Framework_Code)
            'uiTxChannelMap.' referenced from src\peq_test_SC598_Core1.doj(Seg_Test_Framework_Code)
            'fProcBuf.' referenced from src\peq_test_SC598_Core1.doj(Seg_Test_Framework_Code)
            'sPeqCtrl.' referenced from src\peq_test_SC598_Core1.doj(Seg_Test_Framework_Code)

    Why does the linker say these variables are not mapped when I have mapped them to a DM memory section?
    I need to do this because my code was faster on the SHARC 21489, and I think it is slower on the SC598 because the data is in BW and not in DM/PM.

    Could you explain how a section defined as DM and mapped to a BW WIDTH(8) segment can work? By default, the LDF contains PM and DM sections that are mapped to SW and BW memory segments respectively. I never see any variable in linker_log.xml mapped to these segments. How does it work?

    Best regards,

    Yohann.

  • Hi Yogann,

    Could you please share the project, along with the custom LDF file, that reproduces this issue? This will help us to assist you further.

    Regards,
    Nishanthi.V

  • Hi Nishanthi,

    It takes me some time to reduce a project to the issue without giving you our whole code. In the zip file 0488.IssueLDF.zip you will find a complete project that, when compiled, shows what I don't understand.

    I am not asking for a fixed project; as I said, if I put the variables in BW it works. What I want is to understand how to manage DM and PM so that we can optimize our code. So far we use twice as many MIPS on the SC598 as on the ADSP-21489 for the same code, and this is not acceptable for us. My feeling is that the only issue is that BW addresses are 8 bits wide while DM on the 21489 is 32 bits wide, so that four address accesses are done instead of one. Am I correct? Is the DM bus used in parallel with the PM bus, as on the 21489, if the variables are in a BW segment? Beyond the align pragma, how can I force variables to be treated as 32-bit words?

    I thank you in advance for your helps,

    Best regards,

    Yohann.

  • Hi Yogann,

    On SHARC+ parts, attempting to define normal-word/word-addressed variables in byte-addressed code by placing them in the 32-bit section using the section keyword or #pragma section is unsafe.

    Word-addressed variables and functions need to be defined in word-addressed source files, i.e. files compiled with -char-size-32. No section qualifier is required when doing that.

    The correct way to define normal-word/word-addressed variables and arrays is by defining them in a separate source file which is built with the char size set to 32-bit (-char-size-32 compiler switch) via Project > Properties > Crosscore SHARC C/C++ Build > Settings > Tool Settings > Compiler > Processor > Char size

    These variables should be declared in byte-addressed code using #pragma word_addressed on the line immediately before the extern specifier.
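
As a rough sketch of the split described above (not taken from the project): the buffer is defined in a file built with -char-size-32 and only declared in the byte-addressed file. The file names and the macro values are hypothetical; `fProcBuf`, `PROCESS_BLOCK_LENGTH` and `N_TX_CHANNELS` come from the code earlier in the thread. Both parts are shown in one listing for brevity; on the target they would be two separate source files with different char-size settings.

```c
#include <stddef.h>

/* Placeholder sizes for illustration only; the real values live in the project. */
#define PROCESS_BLOCK_LENGTH 128
#define N_TX_CHANNELS        16

/* ---- peq_data_nw.c (hypothetical name), built with -char-size-32 ----
 * Plain definition: no section() qualifier is used here. */
float fProcBuf[PROCESS_BLOCK_LENGTH * N_TX_CHANNELS];

/* ---- peq_test.c (hypothetical name), built as byte-addressed code ----
 * The pragma sits on the line immediately before the extern declaration.
 * Compilers other than the CCES one are expected to ignore it. */
#pragma word_addressed
extern float fProcBuf[PROCESS_BLOCK_LENGTH * N_TX_CHANNELS];

/* Example user of the buffer from the byte-addressed side. */
void clear_proc_buf(void)
{
    for (size_t i = 0; i < (size_t)PROCESS_BLOCK_LENGTH * N_TX_CHANNELS; ++i) {
        fProcBuf[i] = 0.0f;
    }
}
```

No section qualifier appears in either part, which matches the advice above.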

    You can refer to the detailed response from "AndyK2" in the linked ezone thread below for additional information.
    https://ez.analog.com/dsp/software-and-development-tools/cces/f/q-a/550466/having-a-word-adressable-data-buffer-in-pm-memory-via-seg_pmda_nw-produces-linker-error/435159#435159

    We tried with the char-size-32 setting in your project, which resulted in a successful build and appropriate mapping of the variables to the DM section.

    Regards,
    Nishanthi.V

  • Hi Nishanthi,

    Sorry for the late reply; there was a lot to test on this project.

    It seems I succeeded in placing data in DM by following your advice. But my issue remains, if we can call it an issue. On a 21489 the same code uses around 39 MIPS; here on an SC598 it uses around 63 MIPS. I can't figure out where the difference comes from.

    I think two paths can be followed:

    • I am not able to place my ASM code in PM, so I feel this may degrade MIPS consumption compared with the 21489 and its 32-bit word instruction and data management. Could you help me do that, or look into it?
    • Maybe the ASM instructions are valid (I compared the audio output of the ASM processing on the 21489 and the SC598 and they are the same), but the ASM isn't optimized as expected for the SC598.

    As I am quite lost on this and can't find what to search for or study, here is my ASM code (drive.google.com/.../view), developed for biquad processing. I hope it will be possible for you to have a look at the instructions and tell me if something isn't well defined for the SC598 and causes this MIPS overconsumption.

    I know that ADI provides a dedicated biquad function in CCES, but this is simply to learn how to optimize code on the SC598; we have other ASM code that also needs to be reworked and that I can't share here.

    I thank you in advance for your time,

    Best regards,

    Yohann.

  • Hi Yogann,

    The new 11-stage pipeline has been defined in such a way that it remains backwards-compatible at the assembly code level with previous generations of SHARC devices. However, due to phenomena such as pipeline stage splitting, stack dependencies, data hazards, and stall conditions, some corner-case combinations of code flow are handled differently from the previous design, which may result in performance degradation if the code is not modified to work well with the new pipeline.

    We suggest referring to the application note below, which is useful for migration from legacy SHARC.

    Migrating Legacy SHARC to ADSP-SC58x/2158x SHARC+ Processors (EE-375)
    https://www.analog.com/media/en/technical-documentation/application-notes/EE375v01.pdf


    Regards,
    Nishanthi.V

  • Hi Nishanthi,

    Your answers help. We are now able to create DM/PM sections, place them in memory and manage the LDF properly.
    What we found is that, between SHARC and SHARC+, we will lose MIPS most of the time unless we rewrite the whole processing. In our case the code is already SIMD, with DM/PM blocks spread, aligned and interleaved, so not much more optimization can be expected. As the pipeline is longer, many situations produce more stalls.

    So we are trying to understand the cache behavior and its effect on this MIPS loss. We followed the FAQ "Cache Performance Analysis Audit".

    Then a question arose. When we use your test project we observe many cache misses and hits. When we test our functions (most of the data is in L2/L3, buffers in L1) with the maximum cache size selected (128k), we observe very few cache misses and hits:

    Test: 1, round: 0, samples: 128, Number of cycles: 1420739, MIPS: 532.777100
    Test: 1, round: 1, samples: 128, Number of cycles: 1403805, MIPS: 526.426900
    Test: 1, round: 2, samples: 128, Number of cycles: 1403697, MIPS: 526.386350
    Test: 1, round: 3, samples: 128, Number of cycles: 1403481, MIPS: 526.305350
    DM-Cache is enabled
    PM-Cache is enabled
    I-Cache is enabled

    Testing DM-Cache performance monitor
    ------------------------------------
           0 occurrences of DM-Cache hits
           0 occurrences of Crosscheck hits
           1 occurrences of DM-Cache miss without writeback
           0 occurrences of DM-Cache miss with writeback

    Testing PM-Cache performance monitor
    ------------------------------------
           0 occurrences of PM-Cache hits
           0 occurrences of Crosscheck hits
           0 occurrences of PM-Cache miss without writeback
           0 occurrences of PM-Cache miss with writeback

    Testing I-Cache performance monitor
    ------------------------------------
          62 occurrences of I-Cache hits
           9 occurrences of I-Cache misses
    Test FAILED
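
To read these counters as a ratio, a small helper like the following can be used. It is only a sketch: the function name `cache_hit_ratio` is hypothetical, and it works on the printed numbers, not on the performance-monitor registers themselves.

```c
/* Hypothetical helper: fraction of counted accesses that hit the cache.
 * Returns -1.0 when no access was counted, because 0 hits / 0 misses
 * says nothing about how well the cache is working. */
double cache_hit_ratio(unsigned long hits, unsigned long misses)
{
    unsigned long total = hits + misses;
    if (total == 0ul) {
        return -1.0;
    }
    return (double)hits / (double)total;
}
```

For the I-Cache figures above, `cache_hit_ratio(62, 9)` is about 0.87. For the PM-Cache figures, `cache_hit_ratio(0, 0)` returns -1.0, which only says that no PM-cache access was counted during the measured call.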


    This processing costs 323 MIPS on a SHARC 21489, which means we have a big loss in MIPS but very few cache misses.
    We suspect that our understanding of the cache isn't right.

    Is monitoring the cache with the example code from the thread "Cache Performance Analysis Audit" the right way to understand the MIPS loss?

    Is the cache cleared somewhere else? Is there any reason why we observe such a performance degradation without any cache misses on DM/PM?

    I tried putting all my data in DDR (L3) and all the instructions in L1. What we observe is no more I-Cache hits and misses and no hits or misses in DM or PM, but the MIPS figure increases to 1000 MIPS.

    Here is how we count MIPS:
    cycle_t start_count;
    cycle_t final_count;
    for (int ii = 0; ii < 4; ++ii)
    {
        START_CYCLE_COUNT(start_count);
        doStuff();
        STOP_CYCLE_COUNT(final_count, start_count);
        printf("Test: %i, round: %i, samples: %i, Number of cycles: %i, MIPS: %f\n", 1, ii, N_SAMPLES, final_count, (float)final_count * (48000.0 / 1000000.0) / (float)128);
        cacheProfil(doStuff);
    }
    With -DDO_CYCLE_COUNTS added to the compiler options.
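
The conversion inside the printf can be isolated as a small function. This is a sketch: `mips_from_cycles` is a hypothetical name, the 48 kHz sample rate and the 128-sample block are the values hard-coded in the snippet, and "MIPS" here means millions of core cycles per second, as in the measurement above.

```c
/* Hypothetical helper mirroring the expression in the printf:
 * cycles per block * blocks per second = cycles per second,
 * divided by 1e6 to express the load in millions of cycles per second. */
double mips_from_cycles(unsigned long cycles_per_block,
                        double sample_rate_hz,
                        unsigned block_len)
{
    double blocks_per_second = sample_rate_hz / (double)block_len;
    return (double)cycles_per_block * blocks_per_second / 1.0e6;
}
```

With the first log line, `mips_from_cycles(1420739, 48000.0, 128)` gives about 532.78, which matches the printed value.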

    I thank you in advance for your help,
    Best regards,

    Yohann.