
DSP library matrix performance (matmmltf)

Hello, 

I have been benchmarking performance of the matmmltf function call from the DSP library and have noted a decrease in performance when the matrices reach certain sizes. Naturally I would like to maintain the optimal cycle count. 

Would anyone have an explanation for what is being observed? 

Are there any documented limitations of the matmmltf function call?

Is there an optimal way to prepare the data being multiplied?

Does this function use SIMD optimisation? 

My current findings are below. Performance was measured by raising and lowering a GPIO pin at the start and end of the function call and measuring the time T on a scope. As you can see, at column sizes of around 130 and then 200 the performance degrades dramatically.

Any thoughts would be appreciated.

Thanks, 

SHARC-21593 1GHz
cMat = aMat * bMat;
A_COL A_ROW B_ROW B_COL T(ms)
10 30 10 96 0.053
11 30 11 96 0.056
20 30 20 96 0.096
21 30 21 96 0.099
40 30 40 96 0.183
41 30 41 96 0.186
80 30 80 96 0.327
81 30 81 96 0.339
100 30 100 96 0.552
101 30 101 96 0.564
120 30 120 96 0.715
121 30 121 96 0.723
125 30 125 96 0.754
126 30 126 96 0.762
127 30 127 96 0.77
128 30 128 96 0.775
129 30 129 96 0.96
130 30 130 96 1.146
131 30 131 96 1.332
132 30 132 96 1.521
133 30 133 96 1.706
134 30 134 96 1.896
135 30 135 96 1.897
136 30 136 96 2.272
137 30 137 96 2.467
138 30 138 96 2.671
139 30 139 96 2.868
140 30 140 96 3.076
141 30 141 96 3.279
150 30 150 96 5.06
160 30 160 96 6.99
161 30 161 96 7.195
200 30 200 96 13.77
300 30 300 96 20.57
400 30 400 96 27.33
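
To compare the rows above on an equal footing, the measured times can be converted to core cycles per multiply-accumulate. The following is a minimal sketch, assuming the 1 GHz clock stated above and one multiply-accumulate per (row of A, inner index, column of B) triple; `cycles_per_mac` is a hypothetical helper, not part of the DSP library.

```c
/* Convert a measured time into core cycles per multiply-accumulate (MAC).
 * Assumption: the call performs a_rows * a_cols * b_cols MACs in total. */
double cycles_per_mac(double t_ms, double clock_hz,
                      int a_rows, int a_cols, int b_cols)
{
    double cycles = t_ms * 1.0e-3 * clock_hz;
    double macs = (double)a_rows * (double)a_cols * (double)b_cols;
    return cycles / macs;
}
```

With the figures from the table, `cycles_per_mac(0.775, 1.0e9, 30, 128, 96)` is about 2.1, `cycles_per_mac(0.96, 1.0e9, 30, 129, 96)` is about 2.6, and `cycles_per_mac(13.77, 1.0e9, 30, 200, 96)` is about 23.9, so the cost per operation rises, not just the total time.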

  • Hello again, 

    Further investigation appears to show that the function call is causing the data pointers to escape bounds and cause a fatal error or hard crash.

    I hadn't noticed this previously because the matrix data was inside a larger struct; stripping the struct back to just the required data started to highlight this unwanted behaviour.

    So now I need to solve this first before heading back to the cycle count issue.

    Below is an example piece of code from a clean CrossCore project, based on the example code in the SHARC library documentation.

    When A_COL_SIZE is small the function appears to work correctly.

    Increasing A_COL_SIZE starts to cause different behaviour - e.g. 350 causes PC to jump to __fatal_error.

    Again, any thoughts would be appreciated. 

    Thanks, 

    M

    /*****************************************************************************
     * 21593_Frame_Core1.c
     *****************************************************************************/
    
    #include <sys/platform.h>
    #include <sys/adi_core.h>
    #include "adi_initialize.h"
    #include "21593_Frame_Core1.h"
    #include <matrix.h>   /* matmmltf */
    #include <stdio.h>    /* printf */
    #include <stdint.h>   /* uint16_t */
    /** 
     * If you want to use command program arguments, then place them in the following string. 
     */
    char __argv_string[] = "";
    
    int main(int argc, char *argv[])
    {
    	/**
    	 * Initialize managed drivers and/or services that have been added to 
    	 * the project.
    	 * @return zero on success 
    	 */
    	adi_initComponents();
    	
    	/**
    	 * The default startup code does not include any functionality to allow
    	 * core 1 to enable core 2. A convenient way to enable
    	 * core 2 is to use the adi_core_enable function. 
    	 */
    	adi_core_enable(ADI_CORE_SHARC1);
    
    	/* Begin adding your custom code here */
    
    	// A_COL_SIZE causes a 'fatal error'
    	#define A_COL_SIZE			350
    	#define A_ROW_SIZE			30
    	#define B_COL_SIZE			96
    
    	float 	aMat[A_ROW_SIZE][A_COL_SIZE];
    	float 	bMat[A_COL_SIZE][B_COL_SIZE];
    	float	cMat[A_ROW_SIZE][B_COL_SIZE];
    
    	float *a_p = (float*)(&aMat);
    	float *b_p = (float*)(&bMat);
    	float *c_p = (float*)(&cMat);
    
    	float 	index;
    
    	// Fill matrix A & B with data
    	index = 1;
    	for (uint16_t row = 0; row < A_ROW_SIZE; ++row)
    	{
    		for (uint16_t col = 0; col < A_COL_SIZE; ++col)
    		{
    			aMat[row][col] = index;
    			index += 1.0f;
    		}
    	}
    
    	index = 1;
    	for (uint16_t row = 0; row < A_COL_SIZE; ++row)
    	{
    		for (uint16_t col = 0; col < B_COL_SIZE; ++col)
    		{
    			bMat[row][col] = index;
    			index += 1.0f;
    		}
    	}
    
    	// Clear matrix C
    	for (uint16_t row = 0; row < A_ROW_SIZE; ++row)
    	{
    		for (uint16_t col = 0; col < B_COL_SIZE; ++col)
    		{
    			cMat[row][col] = 0.0f;
    		}
    	}
    
    
    
    	matmmltf(c_p, a_p, b_p, A_ROW_SIZE, A_COL_SIZE, B_COL_SIZE);
    	printf("\rFinished");
    
    	return 0;
    }
    
    

  • Hello again. 

    Some extra information... and just to add to all the questions. 

    I've had the opportunity to run the same code on an ADSP-21488 based design in a clean default CrossCore project, and the function returns normally.

    I'm running CrossCore 2.10.0.0, which I believe is the latest version. Does this version provide library support for the ADSP-21593?

    Thanks, 

    M

  • Hello, 

    Additional information. I returned to the ADSP-21488 processor to assess performance and hopefully work on a core that worked well with the DSP library.

    Unfortunately I have found that the above project also exhibits unwanted behaviour (it fails to return from the matrix function) when compiled with optimisation enabled (100) and A_COL_SIZE raised above 305.

    Thanks, 

    M

  • Hi,

    #1: Fatal error and library support for the ADSP-21593:
    >> From your project, we understand that you are using large arrays and that all of the variables are declared as locals.

    Local variables are stored in stack memory. By default, the stack resides in L1 memory, so variables on the stack are placed in L1 memory.

    The compiler uses the run-time stack as the storage area for local variables and return addresses. During a function call, the calling function pushes the return address onto the stack.

    Please note that CCES 2.10.0 provides support for the new family of ADSP-2159x/ADSP-SC59x processors. This fatal error can be seen when the application calls a NULL function pointer or jumps to address 0x0. It indicates an _adi_bad_reset_detected error.

    Could you please try increasing the stack size? The compiler also provides support for detecting stack overflows, which can be particularly troublesome bugs in the limited environment of an embedded system. Since you are declaring large local arrays, stack overflow detection can be used to check whether the stack is large enough for the application. The effects of a stack overflow are undefined; they can vary from data corruption to a catastrophic software crash.

    Once it has been identified that a stack overflow is the cause of your application failure, correcting the problem can be as simple as increasing the amount of memory reserved for your stack.

    If, due to hardware memory restrictions, you are unable to increase the amount of memory used for the stack, then conduct a review of your application, examining your use of local arrays, function calling and other program code that leads to a stack overflow.
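
    As a rough illustration of why the local arrays matter, their footprint can be estimated before choosing a stack size. This is a minimal sketch, assuming a 32-bit (4-byte) float and ignoring return addresses and other locals; both function names are hypothetical helpers.

    ```c
    #include <stddef.h>

    /* Number of 32-bit float elements held by the three matrices:
     * A is a_rows x a_cols, B is a_cols x b_cols, C is a_rows x b_cols. */
    size_t local_matrix_words(size_t a_rows, size_t a_cols, size_t b_cols)
    {
        return a_rows * a_cols + a_cols * b_cols + a_rows * b_cols;
    }

    /* Same footprint in 8-bit bytes. Assumption: one float occupies 4 bytes. */
    size_t local_matrix_bytes(size_t a_rows, size_t a_cols, size_t b_cols)
    {
        return 4u * local_matrix_words(a_rows, a_cols, b_cols);
    }
    ```

    For the posted example (30, 350, 96) this gives 46,980 words, or 187,920 bytes, which is the amount to compare against the stack size configured for the project.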

    You can increase stack memory by following the below steps:
    1. Double-click the system.svc file of the project.
    2. Click the Startup Code/LDF tab.
    3. In the navigation pane on the left side of the page, click LDF.
    4. Enable the Customize the system stack option.
       a. In Custom system stack size, enter the size, and choose the units and memory type.
       b. In Custom system stack memory type, choose a processor-specific internal memory option, or external memory if Use external memory (SDRAM) is enabled on the LDF Configuration page.

    You can enable stack overflow detection via Project > Properties > C/C++ Build > Settings > Tool Settings > Compiler > Run-time Checks > Generate code to catch a Stack Overflow, or with the -rtcheck-stack compiler switch.

    Please refer to CCES help for more information about stack overflows:
    CrossCore® Embedded Studio <version> > SHARC® Development Tools Documentation > C/C++ Compiler Manual for SHARC® Processors > Optimal Performance from C/C++ Source Code > Analyzing Your Application > Stack Overflow Detection > Stack Overflow Detection Facility

    We recommend referring to the sections "Managing the Stack", "Run-Time Stack Storage", and "About Stack Overflows" in the Compiler Manual for SHARC Processors linked below:
    www.analog.com/.../cces-SharcCompiler-manual.pdf

    After increasing the stack size, could you please let us know whether the fatal error is resolved?

    Also, could you please try declaring the variables globally? If you define the variables outside main(), at file scope, they become global variables and are stored at a fixed location decided by the compiler rather than on the stack.
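
    A minimal sketch of that change, reusing the sizes and fill pattern from the posted example; `fill_matrices` is a hypothetical helper introduced only for illustration, and the matmmltf call itself is left out so the fragment builds without the DSP library.

    ```c
    #define A_COL_SIZE 350
    #define A_ROW_SIZE 30
    #define B_COL_SIZE 96

    /* File-scope (global) arrays are given fixed addresses at build time,
     * so they no longer consume run-time stack. */
    float aMat[A_ROW_SIZE][A_COL_SIZE];
    float bMat[A_COL_SIZE][B_COL_SIZE];
    float cMat[A_ROW_SIZE][B_COL_SIZE];

    /* Fill A and B with 1, 2, 3, ... and clear C, as in the posted example. */
    void fill_matrices(void)
    {
        float index = 1.0f;
        for (int row = 0; row < A_ROW_SIZE; ++row) {
            for (int col = 0; col < A_COL_SIZE; ++col) {
                aMat[row][col] = index;
                index += 1.0f;
            }
        }

        index = 1.0f;
        for (int row = 0; row < A_COL_SIZE; ++row) {
            for (int col = 0; col < B_COL_SIZE; ++col) {
                bMat[row][col] = index;
                index += 1.0f;
            }
        }

        for (int row = 0; row < A_ROW_SIZE; ++row) {
            for (int col = 0; col < B_COL_SIZE; ++col) {
                cMat[row][col] = 0.0f;
            }
        }
    }
    ```

    After `fill_matrices()`, the call from the posted example would become `matmmltf(&cMat[0][0], &aMat[0][0], &bMat[0][0], A_ROW_SIZE, A_COL_SIZE, B_COL_SIZE);`, with the same argument order as before.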

    #2: Memory usage:
    With global declarations, if a single source file contains several large buffers, the linker will try to place them all in the same section. If no memory section has enough space to hold them, the link will fail.

    While you're in the Linker project options, enable the 'Generate symbol map' option under the 'General' options for the Linker. This will produce a "project_name.map.xml" file in the 'debug' folder of the project that can be opened in Internet Explorer. It will show all your memory sections and how much free/unused space there is.

    If your project exhausts the available internal memory, you would need to look at making use of SDRAM, if your target has external memory. Select Use external memory (SDRAM) to enable the size and partitioning controls under system.svc > Startup Code/LDF > LDF > External Memory.

    For the memory map address range of each block, refer to page 8 of 111 of the ADSP-21593 datasheet linked below.
    www.analog.com/.../adsp-21591-adsp-21593.pdf

    For the ADSP-21488, the memory map address range of each block is on page 6 of 71 of the datasheet linked below.
    www.analog.com/.../adsp-21483_21486_21487_21488_21489.pdf

    SHARC+ processors that support byte-addressing can access individual 8-bit bytes in memory. Each individual 8-bit byte has a unique address, in contrast to earlier SHARC families which, in the C runtime environment, permitted access to a minimum of 32 bits at a time (known as word-addressing). Byte-addressing permits compatibility with code written assuming that the C/C++ char type is 8 bits in length and the C/C++ short type is 16 bits in length.

    For more details about the byte-addressing and word-addressing modes, please refer to the Compiler Manual for SHARC Processors at the link below.

    www.analog.com/.../cces-SharcCompiler-manual.pdf
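
    One practical consequence is that buffer sizes computed with sizeof are in addressable units, not necessarily 8-bit bytes. The sketch below, with hypothetical helper names, expresses a float buffer size in bits so the result does not depend on the addressing mode; it assumes only that CHAR_BIT gives the width of the smallest addressable unit, as the C standard defines it.

    ```c
    #include <limits.h>
    #include <stddef.h>

    /* Bits in one addressable unit: 8 on a byte-addressed target,
     * 32 in the word-addressed environment described above. */
    unsigned bits_per_address_unit(void)
    {
        return (unsigned)CHAR_BIT;
    }

    /* Storage needed for n floats, in bits, independent of addressing mode. */
    unsigned long float_storage_bits(size_t n)
    {
        return (unsigned long)(n * sizeof(float)) * (unsigned long)CHAR_BIT;
    }
    ```

    For a 32-bit float, `float_storage_bits(n)` is 32 times n in either mode, whereas `n * sizeof(float)` alone differs by a factor of four between the two.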

    #3. matmmltf: Would anyone have an explanation for what is being observed? Are there any documented limitations of the matmmltf function call? Is there an optimal way to prepare the data being multiplied?

    The function multiplies successive elements in each row of X by successive elements in each column of Y, sums the products, and stores each sum in the output array.

    This implementation relies on X, Y, and the output each being handled as a circular buffer. Each full iteration through X generates one column of the output array, and each full iteration through X uses the same column of Y.

    After a column of the output array has been computed, the computation for the next column is initiated by incrementing the current pointer of both the output array and of Y by 1 - this has the effect of moving the pointer from the end of one column to the beginning of the next one.
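
    The loop order described above can be sketched in plain C. This is only a reference model of the column-by-column ordering, assuming row-major storage as in the posted example; it is not the library source and does not model the circular-buffer addressing. `ref_matmul_by_column` is a hypothetical name.

    ```c
    /* Reference model: out (a_rows x b_cols) = x (a_rows x a_cols) * y (a_cols x b_cols).
     * All matrices are row-major. The outer loop walks output columns, so one
     * full pass over x produces one output column using a single column of y. */
    void ref_matmul_by_column(float *out, const float *x, const float *y,
                              int a_rows, int a_cols, int b_cols)
    {
        for (int j = 0; j < b_cols; ++j) {          /* output column / column of y */
            for (int i = 0; i < a_rows; ++i) {      /* row of x */
                float sum = 0.0f;
                for (int k = 0; k < a_cols; ++k) {  /* inner product */
                    sum += x[i * a_cols + k] * y[k * b_cols + j];
                }
                out[i * b_cols + j] = sum;
            }
        }
    }
    ```

    In this layout, consecutive reads down a column of y are b_cols elements apart, while reads along a row of x are contiguous.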

    The matmmltf library function computes the product of the input matrices matrix_x and matrix_y, and stores the result in the matrix product. The sources for these functions are available in the CCES installation path.

    We suggest referring to the source file (matmmltf_21XI) in the path below to learn more about the matmmltf algorithm and implementation.
    <CCES installation directory>\Analog Devices\CrossCore Embedded Studio <version>\Sharc\lib\src\libdsp

    Also, this source is designed for SHARC-XI; it does not contain support for silicon anomalies that are known to be present in earlier versions of the SHARC architecture.

    You can also refer to the CCES help path below, which may be helpful.
    CrossCore® Embedded Studio 2.x.x > SHARC® Development Tools Documentation > C/C++ Library Manual for SHARC® Processors > DSP Run-Time Library > DSP Run-Time Library Reference

    #4. Does this function use SIMD optimization?
    >> This function does not use the SIMD feature.

    Please refer to "Table 3-6: Functions Not Using the SIMD Feature" in the Library Manual for SHARC Processors at the link below.

    www.analog.com/.../cces-sharclibrary-manual.pdf

    #5: Benchmarking performance:
    We provide Cycle Count Macros for profiling code execution.

    You can use the Cycle Count Macros to calculate the number of cycles spent in a specific function, and the time.h header file to measure the time spent in a program, as documented on the CCES help pages below.

    CrossCore® Embedded Studio > SHARC® Development Tools Documentation > C/C++ Library Manual for SHARC® Processors > C/C++ Run-Time Library > C and C++ Run-Time Libraries Guide > Measuring Cycle Counts > Using time.h to Measure Cycle Count

    CrossCore® Embedded Studio > SHARC® Development Tools Documentation > C/C++ Library Manual for SHARC® Processors > C/C++ Run-Time Library > C and C++ Run-Time Libraries Guide > Measuring Cycle Counts > Basic Cycle Counting Facility

    When doing so, ensure that "DO_CYCLE_COUNTS" macro is specified in the Properties. This can either be added as "DO_CYCLE_COUNTS" to the 'Preprocessor Definitions' under "Properties > CrossCore SHARC C/C++ Compiler > Preprocessor", or as "-DDO_CYCLE_COUNTS" to the 'Additional Options' under "Properties > CrossCore SHARC C/C++ Compiler"
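
    As a portable illustration of the time.h approach only, the sketch below wraps a call between two clock() readings. It assumes clock() and CLOCKS_PER_SEC behave as the C standard describes; how they map to core cycles on a given SHARC target is covered by the help pages above. `ticks_for`, `ticks_to_ms`, and `sample_workload` are hypothetical names, and the Cycle Count Macros themselves are not shown here.

    ```c
    #include <time.h>

    /* Hypothetical stand-in for the code under test, e.g. the matmmltf call. */
    void sample_workload(void)
    {
        volatile float acc = 0.0f;
        for (int i = 0; i < 100000; ++i) {
            acc += (float)i;
        }
    }

    /* Run fn once and return the elapsed clock() ticks. */
    clock_t ticks_for(void (*fn)(void))
    {
        clock_t start = clock();
        fn();
        return clock() - start;
    }

    /* Convert ticks to milliseconds using CLOCKS_PER_SEC. */
    double ticks_to_ms(clock_t ticks)
    {
        return 1000.0 * (double)ticks / (double)CLOCKS_PER_SEC;
    }
    ```

    A measurement would then read `ticks_to_ms(ticks_for(sample_workload))`, with the workload replaced by a function that performs the matrix multiply.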

    Regards,
    Santhakumari.K

  • Hello, 

    Thanks for the detailed reply, it is very helpful and very much appreciated. I will go over it in detail and report my findings.

    Regards, 

    M