
Use SIMD with odd addresses

Category: Software
Product Number: sharc21569


I'm trying to implement a function that computes a simple dot product. Similar to the ADI fir() function, I have an array coeffs[] that holds my filter coefficients and another array delay[] that holds my input samples. delay[] works as a circular buffer that stores the input samples and is updated continuously. One thing special in my case is that the input buffer size is 1 (one sample per call), so every now and then the delay[] read will start from an odd address. According to your documentation, SIMD wouldn't work for odd addresses. But we need to leverage SIMD in our application to make it more efficient.

I searched through EngineerZone, and the trick people suggest for leveraging SIMD here is to create delay[] with a size of the filter length + 1. For example, with coeffs[LEN], delay[] has LEN+1 elements, and the value in delay[0] is copied to delay[LEN]. I've run experiments like this and added #pragma SIMD_for, and SIMD works as expected. But this relies on undefined behavior, and it might change with different compilers. My code is attached below. I'm interested to know: is there a way to write C code that gets the same results without relying on this undefined behavior?

Thanks in advance!

inline float processFir(float input, const pm float * coefs, float dm * delay, int len)
{
		// Read delay line index from the last element of delay line array
		int index = (int)delay[len];
		int index_prev = index;
		delay[index] = input;		// Feed the latest input value to where the index points to
		// coefs[0] stores the filter coefficient for the oldest value in delay line.
		// Increase index by 1 for d[index] to get the oldest value in delay line.
		index = circindex(index, 1, len);
		delay[len] = delay[0];		// Update the last element of delay line with the first element of delay line for SIMD to work with odd address
		float dm * d = &delay[0];

		float sum = 0.0f;
#pragma SIMD_for
		for(int n=0;n<len;n++)
		{
			sum += coefs[n] * d[index];
			index = circindex(index, 1, len);
		}

		delay[len] = circindex(index_prev, 1, len);		// Increase index_prev by 1 to be ready for the next input sample
		return sum;
}

Thread Notes

  • Hi!

    Thanks for the interesting question.

    The problem here isn't so much alignment - SHARC+ parts can perform misaligned SIMD accesses without problems - but rather the fact that a circular buffer is being used and the odd index causes the access at the end of the buffer to straddle the wrap-around point; the first word needs to come from the end of the buffer and the second from the start, and the hardware can't do that.

    The idea of copying the first element d[0] to d[len] is a good one. But as you correctly point out, the dangerous step is using #pragma SIMD_for to tell the compiler that the wraparound mentioned above doesn't happen. Wraparound does happen, but you've fixed it up so it works anyway. I assume that len is always even.

    It's never a good idea to fib to the compiler in this way, for the reasons you mention. But you can use the same technique and still get everything back above board by adjusting the base address of the buffer.

    if (index % 2 == 0) {
      d = &delay[0];
    } else {
      d = &delay[1];
    }

    You still need to copy the first element to the last, but now the assertions made via the SIMD_for pragma are strictly true, because the access to the buffer no longer straddles the end point.

    Hope this helps,
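
For readers who want to try the base-address idea off target, here is a minimal sketch in portable C. It is not the code from the thread: circ_next() is a hypothetical stand-in for the circindex() built-in, the pm/dm qualifiers and #pragma SIMD_for are left out so it compiles with any C compiler, and the oldest-sample index is passed in as a parameter instead of being stored in delay[len]. The step of subtracting 1 from the index when the base moves to &delay[1] is an assumption about how the shifted base has to be paired with the index; the reply does not spell it out. The loop reads the samples in pairs to show that, with an even len, no pair crosses the end of the buffer.

```c
#include <assert.h>

/* Hypothetical portable stand-in for the circindex() built-in.
   Assumes 0 <= index < nitems and 0 <= incr <= nitems. */
static int circ_next(int index, int incr, int nitems)
{
    index += incr;
    if (index >= nitems) {
        index -= nitems;
    }
    return index;
}

/* Reference: scalar circular dot product, starting at the oldest sample. */
static float dot_reference(const float *coefs, const float *delay,
                           int oldest, int len)
{
    float sum = 0.0f;
    int idx = oldest;
    for (int n = 0; n < len; n++) {
        sum += coefs[n] * delay[idx];
        idx = circ_next(idx, 1, len);
    }
    return sum;
}

/* Shifted-base variant. delay[] must have len + 1 elements and len must
   be even. The slot delay[len] is overwritten with a copy of delay[0]. */
static float dot_shifted_base(const float *coefs, float *delay,
                              int oldest, int len)
{
    assert(len > 0 && len % 2 == 0);
    assert(oldest >= 0 && oldest < len);

    delay[len] = delay[0];          /* mirror the first sample */

    const float *d = &delay[0];
    int idx = oldest;
    if (oldest % 2 != 0) {
        d = &delay[1];              /* shift the base by one element */
        idx = oldest - 1;           /* assumption: keep d[idx] == delay[oldest] */
    }

    float sum = 0.0f;
    for (int n = 0; n < len; n += 2) {
        /* idx is even and len is even, so idx + 1 <= len - 1:
           the pair (d[idx], d[idx + 1]) never straddles the wrap point. */
        sum += coefs[n] * d[idx];
        sum += coefs[n + 1] * d[idx + 1];
        idx = circ_next(idx, 2, len);
    }
    return sum;
}
```

With len = 4 and the oldest sample at index 1, the shifted base reads delay[1], delay[2], delay[3] and then delay[4], which holds the copy of delay[0], so the samples arrive in the same order as in the scalar reference loop.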


  • Hey Mike,

    Thanks for your reply! Your answers resolved our concerns very well!