2010-10-13 08:10:06 Can't get SRC algorithm to fly
Daniel Persson (SWEDEN)
Message: 94465
Hello
I am using uClinux with a BF526 on a custom board, with the core clock at 226 MHz. I need a sample rate converter (SRC) on my system, so I picked the VDSP++ SRC implementation by Jeff Sondermeyer described in EE183. On the osdir.com forum I found a port of the code that compiles with the gcc toolchain (osdir.com/ml/linux.hardware.blackfin.kernel.devel/2008-05/msg00016.html). The SRC works, but it uses far too many MIPS (73 to 102 MIPS when converting from 44100 to 48000 Hz at 16 bit), so I guess something must be wrong.
The SRC was initially developed for the BF53x. Is it a problem that I am using a BF526? Are there any considerations that have to be made because of that?
I have enabled the caches in the uClinux setup, and the conversion buffers are located in L1 RAM.
Blackfin Processor Options --->
[*] Enable ICACHE
[*] Enable DCACHE
Is there anything special that needs to be taken into consideration when trying to get an algorithm to perform well under uClinux on Blackfin?
Are there any alternate open source src implementations for uClinux on Blackfin?
The attached archive contains the source files that are also listed below, plus some other files that just contain filter constants and input data constants.
I am using the following versions (checked out from blackfin.uclinux.org svn server replica):
TOOLCHAIN_VERSION=tags/toolchain_09r1.1_rc2
LINUX_KERNEL_VERSION=tags/2009R1.1-RC4
UCLINUX_DIST_VERSION=tags/2009R1.1-RC4
Kind Regards, Daniel Persson
Here is the source I'm using:
Makefile:
KERNELDIR := ../../external/linux-kernel/modified/linux-2.6.x
TFTPBOOT := /var/lib/tftpboot
PWD := $(shell pwd)
ARCH := blackfin
CROSS_COMPILE := bfin-linux-uclibc-
EXTRA_CFLAGS := -O3 -I$(KERNELDIR)/arch/blackfin/include -I$(KERNELDIR)/include
TARGETS := SRC
all: install
install: $(TARGETS)
	cp -pf $^ $(TFTPBOOT)
SRC: SRC.c src_init.S src_flt.S
	$(CROSS_COMPILE)gcc $(EXTRA_CFLAGS) $^ -o $@
clean:
	rm -f *.o a.out $(TARGETS)
SRC.c:
/****************************************************************************
* File: SRC.c
* Date: Sept 26 2002
* Created: Jeff Sondermeyer
****************************************************************************/
/*
(C) Copyright 2002 - Analog Devices, Inc. All rights reserved.
File Name: SRC.c
Date Modified: 12/28/2005 Jeff Sondermeyer Rev 0.5
Purpose: The Sample Rate Converter (SRC) and Main Program Shell was developed using the ADSP-21535 EZ-KIT Lite
Evaluation Platform. However, with Rev 0.5, I have verified the code works on the BF533 EZKIT and on ALL
Blackfins - BF53x. I removed the LDF from the project so this code will work "out of box" with just about any
VDSP version. Note that if the user would like to use the other "precanned" filters and include files please
apply changes to src_xxxtoxxx.h per "IPDC comment" in the active project directory. I leave this as
an exercise to the user :-)
This C shell contains function calls and routines to initialize the state of the BF53x as well as the SRC.
This program assumes input data comes from a 16-bit buffer (initialized as 'x' in this shell). This data is
copied into a 32-bit buffer 'in1' within src_flt.asm. At the end of src_flt.asm, the last 32-bit
buffer 'inx' (where 'x' is the last stage) is copied into a 16-bit buffer ('y' in this shell). These
16-bit input/output buffers can be eliminated to conserve data space. In this case, you will need to
undefine 'BUFIN' and preload 'in1' with 32-bit data and then use the 32-bit output data from 'inx'.
The converter was designed to convert between any of the following rates:
48000, 44100, 32000, 22050, 16000, 11025, and 8000. If you have the SRC program from Momentum Data
Systems you can generate coefficients for any SRC. Follow #3 below.
One might use workspaces within VDSP to verify all necessary plots of the input/output stages as well in the
intermediate buffers. You can look at the data in the time domain or apply the built-in FFT plotting
function to analyze the frequency domain. Load "*.vdw" for the example SRC.
I have generated a "SINE_xxxxx_16bit_1024.dat" input file to test every SRC. This is a 16-bit, 1024-sample, 1KHz
sine wave at the input sample rate. These were generated using MATLAB (see 'gen_sine_wave_comma_16.m').
It's easy to verify proper conversion by counting samples in one period at both the input rate (in the 'x'
plot) and the output rate (in the 'y' plot) in workspace #2.
Notes: 1. You can modify the size of NINPS and NOUTS in each 'src_xxxxtoxxxx.h' file. However, it MUST be the
same multiple of the GCD.
2. Buffer sizes, NINPS and NOUTS must be at least half of the filter coefficient sizes times the INTPx
value to ensure valid output data.
3. Do the following to convert the decimal filter coefficients from Momentum Data Systems SRC *.dsp
file to properly format this data as 32-bit hexadecimal values. This is then read into the
corresponding variable at initialization:
a. Use Excel to import the *.dsp file (space delimited). Select the "D" column and erase
everything else. Save the file as a "Formatted Text (Space Delimited)(*.prn)" file.
b. Use the MATLAB program "dec_file_to_hex_file_converter.m". This MATLAB program
will read in decimal (exponential) data from a file (*.prn) and convert to a 32-bit
hexadecimal format (*.dat file) suitable to be read by VisualDSP within a data
initialization section.
4. When 'BUFIN' is undefined, the program assumes that 'in1' is preloaded with 32-bit input data AFTER the
src_init is accomplished (buffer zeroing). This requires that the shell program preload 'in1' from a 32-bit
source. Define 'BUFIN' to include the 16-bit buffer transfer code within src_flt.asm. x and y 16-bit
buffers are nice for debug and prototyping but it is just another chunk of memory that is necessary.
5. To "zero" out filter delays, use the following equations as offsets to first valid data:
1st Offset = (LENG1-1)/(2*DOWN1)
2nd Offset = INTP2/DOWN2*1st Offset + (LENG2-1)/(2*DOWN2)
3rd Offset = INTP3/DOWN3*2nd Offset + (LENG3-1)/(2*DOWN3)
See the constants generated in the 'src_xxxxtoxxxx.h' files.
6. DOFSx (in src_xxxxtoxxxx.h) is the offset and also is the number of valid output data samples. This will allow
you to figure how often this routine needs to be executed in a block-processed system. Be careful with this number.
The preprocessor in VDSP will not generate fractional constants. Therefore, depending on the math here, DOFSx
could have an error of 1. For a particular SRC, check the first sample in 'y' and adjust the DOFSx accordingly.
7. One idea of reducing the number of intermediate buffers is to call a 'zero_buf' function that would rezero
the buffers between filter sections. This would reduce the number of intermediate buffers to two at the expense
of more MIPs. However, the MIPs increase would be negligible and is on the order of the size of the buffer. These
two intermediate buffers should be sized to the maximum needed for any SRC.
8. If there is a big interpolation constant, this severely reduces the number of valid data samples in the
final output buffer. For example, in the 44.1K to 48K case, there is an interpolation constant of 16 in the 3rd stage.
If we only use L1 data sections (max = 4096 bytes) we only get 111 valid data samples in the final output buffer.
However, if I use L2 and make this intermediate buffer as large as 4096 words (16K bytes), I can get a relatively
large number of valid output data samples. The point here is that.. depending on interpolation constants, the limiting
factor appears to be the L1 section size. I can maximize all my filters based on this L1 section size (4096 bytes or 1024
words) ...OR.. assume someone can use L2 and make the intermediate buffers larger. In the latter case, the number of VALID
output data samples greatly increases.
9. The half band code was not implemented. Therefore, the HALFB define is not used.
10. 11025to16, 16to2204, and 8to11025 produced corrupted data with 3-stage filters. Had to use 2-stages. MDS filter
generator produces corrupted 3rd stage output for close sample rate conversions that required up sampling??? Not sure why.
11. The latest revision of the code was debugged on a Momentum Systems Hawk PCI board. All FileIO was done over the PCI bus.
Several things need to change in this code to work with the Hawk board:
a. Define 'HAWK'
b. Add idle.c and the basiccrt.s file for the Hawk board to the project.
*/
/* ------------------------------------------------------------------------ */
#include "fract_math.h"
//#include <defBF533.h>
#include "SRC_inc.h"
#include "src_441to48.h"
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
/* ------------------------------------------------------------------------ */
// 16-bit input/output buffers
short __attribute__((l1_data_B)) x[NINPS];
short __attribute__((l1_data_B)) y[NOUTS];
FILE *inFile, *outFile;
// Filter Coefficients
fract32 __attribute__((l1_data_B)) filter_h1[MLEN1] =
{
#include "441to48_32bit_flt1.dat"
};
#if STAGE>=2
fract32 __attribute__((l1_data_B)) filter_h2[MLEN2] =
{
#include "441to48_32bit_flt2.dat"
};
#endif
#if STAGE==3
fract32 __attribute__((l1_data_B)) filter_h3[MLEN3] =
{
#include "441to48_32bit_flt3.dat"
};
#endif
/* ------------------------------------------------------------------------ */
short sin_data[] =
{
#include "sin_data.dat"
};
static void *init_first_stage( STAGE_ENTRY *V, void *buffers )
{
V->in_s = buffers;
V->in_z = SZIN1;
buffers += SZIN1 * sizeof(int);
V->out_s = buffers;
V->out_z = SZIN2;
V->h = filter_h1;
V->plen = PLEN1 - 1;
V->up = INTP1;
V->dn = DOWN1;
V->nis = NINP1;
V->nos = NINP2;
V->nshft = SHFT1;
V->in_c = V->in_s;
V->out_c = V->out_s;
V->space = V->nis;
V->available = 0;
return buffers;
}
#if STAGE >= 2
static void *init_sec_stage(STAGE_ENTRY *M, void *buffers)
{
M->in_s = buffers;
M->in_z = SZIN2;
buffers += SZIN2 * sizeof (int);
M->out_s = buffers;
M->out_z = SZIN3;
M->h = filter_h2;
M->plen = PLEN2 - 1;
M->up = INTP2;
M->dn = DOWN2;
M->nis = NINP2;
M->nos = NINP3;
M->nshft = SHFT2;
M->in_c = M->in_s;
M->out_c = M->out_s;
M->space = M->nis;
M->available = 0;
return buffers;
}
#endif
#if STAGE == 3
static void *init_last_stage(STAGE_ENTRY *L, void *buffers)
{
L->in_s = buffers;
L->in_z = SZIN3;
buffers += SZIN3 * sizeof (int);
L->out_s = buffers;
L->out_z = SZIN4;
L->h = filter_h3;
L->plen = PLEN3 - 1;
L->up = INTP3;
L->dn = DOWN3;
L->nis = NINP3;
L->nos = NINP4;
L->nshft = SHFT3;
L->in_c = L->in_s;
L->out_c = L->out_s;
L->space = L->nis;
L->available = 0;
return buffers;
}
#endif
/* ------------------------------------------------------------------------ */
void init_src( FUNDAMENT_DATA_ENTRY *F, STAGE_HANDLE *S )
{
F->S = S;
F->half_band = HALFB;
F->up_stage = NUPST;
F->pivot_stage = PVTFL;
F->down_stage = NDWNS;
F->nstages = STAGE;
F->ninputs = NINPS;
F->noutputs = NOUTS;
src_init( F );
}
#include <bfin_sram.h>
FUNDAMENT_DATA_ENTRY *alloc_441to48( void )
{
void *p = malloc( sizeof(FUNDAMENT_DATA_ENTRY) + sizeof(STAGE_HANDLE)
+ STAGE * sizeof(STAGE_ENTRY)
/*, L1_DATA_B_SRAM */);
FUNDAMENT_DATA_ENTRY *F = p;
STAGE_HANDLE *S = p + sizeof(FUNDAMENT_DATA_ENTRY);
STAGE_ENTRY *E1 = (void *) S + sizeof(STAGE_HANDLE);
#if STAGE > 1
STAGE_ENTRY *E2 = (void *)E1 + sizeof (STAGE_ENTRY);
#if STAGE > 2
STAGE_ENTRY *E3 = (void *)E2 + sizeof (STAGE_ENTRY);
#endif
#endif
void *buffers = malloc( (SZIN1 + SZIN2
#if STAGE > 1
+ SZIN3
#if STAGE > 2
+ SZIN4
#endif
#endif
) * sizeof(int) /*, L1_DATA_A_SRAM */);
F->input = buffers;
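S->V = E1;
buffers = init_first_stage( S->V, buffers );
#if STAGE > 1
/* E2 and init_sec_stage only exist when there are at least two stages */
S->M = E2;
buffers = init_sec_stage( S->M, buffers );
#if STAGE > 2
/* E3 and init_last_stage only exist when there are three stages */
S->L = E3;
buffers = init_last_stage( S->L, buffers );
#endif
#endif
F->output = buffers;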
init_src( F, S );
return F;
}
void free_src( FUNDAMENT_DATA_ENTRY *F )
{
free( F->input );
free( F );
}
int cycles()
{
int ret;
__asm__ __volatile__
(
"%0 = CYCLES;\n\t"
: "=&d" (ret)
:
: "R1"
);
return ret;
}
int main()
{
FUNDAMENT_DATA_ENTRY* vfd;
long unsigned int old_cycles, new_cycles, diff;
long int count;
STAGE_ENTRY *last_stage_entry;
vfd = alloc_441to48();
old_cycles = cycles();
for ( count = 0; count < 44100; count += NINPS )
{
memcpy( x, &sin_data[count], NINPS * 2 );
src_flt_bufin( x, vfd, 2, NINPS );
last_stage_entry = src_flt( vfd );
src_flt_bufout( y, last_stage_entry, 2, NOUTS );
}
new_cycles = cycles();
if ( new_cycles > old_cycles )
diff = new_cycles - old_cycles;
else
diff = 0xFFFFFFFF - old_cycles + new_cycles + 1; /* counter wrapped once */
printf( "cycles: %lu\n", diff ); /* diff is unsigned long */
return 0;
}
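The wrap-around branch in main() can be avoided altogether. A minimal sketch, assuming the 32-bit CYCLES snapshot is held in a uint32_t (the helper name cycle_diff is made up for illustration and is not part of the posted sources): unsigned subtraction is performed modulo 2^32, so a single counter wrap between the two snapshots is absorbed without a comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not in the posted sources): elapsed cycles between
   two 32-bit CYCLES snapshots. Unsigned subtraction wraps modulo 2^32, so
   one counter overflow between the snapshots is handled automatically. */
static uint32_t cycle_diff(uint32_t start, uint32_t end)
{
    return end - start;
}
```

At a 226 MHz core clock the 32-bit counter wraps roughly every 19 seconds, so this only covers measurements shorter than that.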
src_441to48.h:
//Include file for 44.1KHz to 48KHz. 44100/48000 reduces to 147/160 (greatest common divisor of the two rates = 300).
#define HALFB 0 // Half band flag
#define NUPST 2 // Number of up stages
#define PVTFL 1 // Pivot flag
#define NDWNS 0 // Number of down stages
#define STAGE 3 // Number of total stages
#define NINPS (147*1) // Number of input samples (should be an exact multiple of 147)
#define NOUTS (160*1) // Number of output samples (should be the same multiple of 160)
#define INTP1 2
#define DOWN1 1
#define LENG1 509 // LENG1 = length of stage filter
#define PLEN1 255 // PLEN1 = MLEN1/INTP1 (polyphase length)
#define MLEN1 510 // MLEN1 = LENG1 + enough to make even length for polyphase
#define SHFT1 0
#define NINP1 NINPS // NINPS = 147
#define SZIN1 (NINP1 + ((LENG1-1)/INTP1) + 1) // 147 + 508/2 + 1 = 402
#define INTP2 5
#define DOWN2 1
#define LENG2 61 // LENG2 = length of stage filter
#define PLEN2 13 // PLEN2 = MLEN2/INTP2 (polyphase length)
#define MLEN2 65 // MLEN2 = LENG2 + enough to make even length for polyphase
#define SHFT2 1
#define NINP2 ((NINP1*INTP1)/DOWN1) // (NINPx*INTPx)/DOWNx = 147*2/1 = 294
#define SZIN2 (NINP2 + ((LENG2-1)/INTP2) + 1) // 294 + 60/5 + 1 = 307
#define INTP3 16
#define DOWN3 147
#define LENG3 113 // LENG3 = length of stage filter
#define PLEN3 8 // PLEN3 = MLEN3/INTP3 (polyphase length)
#define MLEN3 128 // MLEN3 = LENG3 + enough to make even length for polyphase
#define SHFT3 0
#define NINP3 ((NINP2*INTP2)/DOWN2) // (NINPx*INTPx)/DOWNx = 294*5/1 = 1470
#define SZIN3 (NINP3 + ((LENG3-1)/INTP3) + 1) // 1470 + 112/16 + 1 = 1478
#define NINP4 ((NINP3*INTP3)/DOWN3) // (NINPx*INTPx)/DOWNx = 1470*16/147 = 160
#define SZIN4 (NINP4 + 1) // for last decimation stage only = 161
#define OFFS1 ((LENG1-1)/(2*DOWN1)) //
#define OFFS2 ((LENG2-1)/(2*DOWN2)) //
#define OFFS3 ((LENG3-1)/(2*DOWN3)) //
#if OFFS3 < 1
#define OF2S3 1
#else
#define OF2S3 OFFS3
#endif
#define TOFS1 OFFS1 //
#define TOFS2 ((INTP2*TOFS1)/DOWN2 + OFFS2) //
#define TOFS3 ((INTP3*TOFS2)/DOWN3 + OF2S3) //
/*********IPDC comment *******/
//#define DOFS3 (NOUTS-TOFS3) // Used to strip filter delays off buffers
/******************************/
/*********IPDC addition*******/
#define DOFS3 NOUTS
/******************************/
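The derived sizes in this header follow mechanically from the stage constants, so they can be checked on a host machine. The sketch below copies the constants from src_441to48.h into enums and evaluates the same formulas; the function name stage_buffer_bytes and the 4-byte word size are assumptions made for the example.

```c
#include <assert.h>

/* Stage constants copied from src_441to48.h. */
enum {
    NINPS = 147,
    INTP1 = 2,  DOWN1 = 1,   LENG1 = 509,
    INTP2 = 5,  DOWN2 = 1,   LENG2 = 61,
    INTP3 = 16, DOWN3 = 147, LENG3 = 113
};

/* Same formulas as the header macros, evaluated at compile time. */
enum {
    NINP1 = NINPS,
    SZIN1 = NINP1 + (LENG1 - 1) / INTP1 + 1,
    NINP2 = NINP1 * INTP1 / DOWN1,
    SZIN2 = NINP2 + (LENG2 - 1) / INTP2 + 1,
    NINP3 = NINP2 * INTP2 / DOWN2,
    SZIN3 = NINP3 + (LENG3 - 1) / INTP3 + 1,
    NINP4 = NINP3 * INTP3 / DOWN3,
    SZIN4 = NINP4 + 1
};

/* Bytes requested for the stage buffers in alloc_441to48() when STAGE is 3,
   assuming 4-byte words as on Blackfin. */
static int stage_buffer_bytes(void)
{
    return (SZIN1 + SZIN2 + SZIN3 + SZIN4) * 4;
}
```

With NINPS = 147 this gives 402, 307, 1478 and 161 words for the four buffers, or 9392 bytes in total. The third buffer alone (1478 words, 5912 bytes) is larger than the 4096-byte L1 section size mentioned in note 8 of SRC.c.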
SRC_inc.h:
/****************************************************************************
* File: SRC_inc.h
* Date: Sept 26 2002
* Created: Jeff Sondermeyer
****************************************************************************/
#include <fract.h>
/* data structure for each stage */
typedef struct {
int *in_s; // input signal buffer
int in_z; // input signal buffer size
int *out_s; // output signal buffer
int out_z; // output signal buffer size
fract32 *h; // filter
int plen; // filter polyphase length
int up; // interpolation factor
int dn; // decimation factor
int nis; // number of inputs
int nos; // number of outputs
int nshft; // number of shifts
int *in_c; /* base address of input signal */
int *out_c; /* base address of output signal*/
/* Members used for copying into input and out of output buffers. */
int space; // Space left for copy into buffer.
int available; /* Number of valid samples in output buffer. */
} STAGE_ENTRY; // (nos * dn) = (nis * up) is required
/* fundamental structure for sample rate conversion */
typedef struct {
STAGE_ENTRY *V; // first stage
STAGE_ENTRY *M; // middle stage
STAGE_ENTRY *L; // last stage
} STAGE_HANDLE;
typedef struct {
STAGE_HANDLE *S;
int half_band; // half band flag
int up_stage; // number of pure up stages
int pivot_stage; // pivot stage flag
int down_stage; // number of pure down stages
int nstages; // total number of stages
int ninputs; // number of input samples per block
int noutputs; // number of output samples per block
int *input;
int *output;
} FUNDAMENT_DATA_ENTRY;
/* Copies N samples from INPUTS into the input buffer of the sample rate
converter described by F. N should be less than or equal to the
"space" field in the first stage of F.
When reading INPUTS, STRIDE is used as an increment (in bytes) after
each sample. For normal mono data, STRIDE should be 2, while it should
be 4 for stereo data which has the samples interleaved in the same
buffer. */
void src_flt_bufin (fract16 *inputs, FUNDAMENT_DATA_ENTRY *F, int stride, int n);
/* Perform sample rate conversion described by F. This assumes that the
input buffer has been filled, either using src_flt_bufin or manually.
This function returns a pointer to the last processed stage, which can
be passed to src_flt_bufout to retrieve the data. */
STAGE_ENTRY *src_flt (FUNDAMENT_DATA_ENTRY *F);
/* Copies N samples from stage E (which should be obtained from the return
value of src_flt) into the buffer OUTPUTS. N should be less or equal to
the "available" field in E.
When writing OUTPUTS, STRIDE is used as an increment (in bytes) after
each sample. For normal mono data, STRIDE should be 2, while it should
be 4 for stereo data which has the samples interleaved in the same
buffer. */
void src_flt_bufout (fract16 *outputs, STAGE_ENTRY *E, int stride, int n);
/* Allocate a sample rate converter for 44100Hz to 48000Hz conversion. */
extern FUNDAMENT_DATA_ENTRY *alloc_441to48 (void);
/* Free a sample rate converter. */
void free_src (FUNDAMENT_DATA_ENTRY *);
/* Initialize all buffers for sample rate converter F. */
void src_init (FUNDAMENT_DATA_ENTRY *F);
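The assembly in src_flt.S reaches into STAGE_ENTRY through hard-coded byte offsets (STAGE_in_s = 0 up to STAGE_available = 56), so the struct layout and those offsets have to stay in step. The following is a host-side sketch of such a check. It uses a mirror struct (stage_entry_mirror, a name invented here) in which the pointer members are replaced by 32-bit integers, on the assumption that Blackfin pointers and ints are both 4 bytes wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side mirror of STAGE_ENTRY from SRC_inc.h. Pointer members are
   replaced by uint32_t (assumption: 32-bit pointers on Blackfin); on a
   64-bit host the real struct would have different offsets. */
typedef struct {
    uint32_t in_s;
    int32_t  in_z;
    uint32_t out_s;
    int32_t  out_z;
    uint32_t h;
    int32_t  plen, up, dn, nis, nos, nshft;
    uint32_t in_c, out_c;
    int32_t  space, available;
} stage_entry_mirror;

/* Byte offsets as hard-coded in src_flt.S. */
enum {
    STAGE_in_s = 0,   STAGE_in_z = 4,   STAGE_out_s = 8,  STAGE_out_z = 12,
    STAGE_h = 16,     STAGE_plen = 20,  STAGE_up = 24,    STAGE_dn = 28,
    STAGE_nis = 32,   STAGE_nos = 36,   STAGE_nshft = 40, STAGE_in_c = 44,
    STAGE_out_c = 48, STAGE_space = 52, STAGE_available = 56
};

#define OFF(member) ((int)offsetof(stage_entry_mirror, member))

/* Returns 1 when every assembly offset matches the mirror layout. */
static int offsets_match(void)
{
    return OFF(in_s) == STAGE_in_s   && OFF(in_z) == STAGE_in_z
        && OFF(out_s) == STAGE_out_s && OFF(out_z) == STAGE_out_z
        && OFF(h) == STAGE_h         && OFF(plen) == STAGE_plen
        && OFF(up) == STAGE_up       && OFF(dn) == STAGE_dn
        && OFF(nis) == STAGE_nis     && OFF(nos) == STAGE_nos
        && OFF(nshft) == STAGE_nshft && OFF(in_c) == STAGE_in_c
        && OFF(out_c) == STAGE_out_c && OFF(space) == STAGE_space
        && OFF(available) == STAGE_available;
}
```

The same offsetof comparisons could be applied to the real STAGE_ENTRY when building with the Blackfin toolchain, which would catch a reordered or resized member before the assembly silently reads the wrong field.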
src_flt.S:
/* File: src_flt.asm Version 0.1
fundamental structure order:
1. stage data handle
2. half band flag (0,1, or 2)
3. number of up stages
4. pivot flag (0 or 1)
5. number of down stages
6. number of stages (total)
7. number of input samples per block
8. number of output samples per block
P0 -> fundamental structure
P1 -> input samples
P2 -> output samples
P3 -> memory storage and retrieval
P4 = temporary pointer
P5 = loop counter
R0 = Loop counters
R1 = temporary storage
R2 = Loop counters
R3 = Shift count
R4 = inner loop calculations
R5 = inner loop calculations
R6 = temporary storage
R7 = temporary storage
I0 = dedicated to input buffer 'inx'
I1 = general use...reading 'inputData' plus others
I2 = general use...reading 'inx' for output data
I3 = general use...
Input Data Structure (VAR_SIZE words)
AIS: address of input signal (circular), updated after return,
SIS: circular size of AIS,
AOS: address of output signal (circular), updated after return,
SOS: circular size of AOS,
AFA: address of filter array,
LEN: poly-phase filter length,
UPR: up sample rate >= 2,
DNR: down sample rate = 1 is assumed
NIS: number of input signals
NOS: number of output signals
SHF: number of shift counter, 0 or 1
*/
#define STAGE_in_s 0
#define STAGE_in_z 4
#define STAGE_out_s 8
#define STAGE_out_z 12
#define STAGE_h 16
#define STAGE_plen 20
#define STAGE_up 24
#define STAGE_dn 28
#define STAGE_nis 32
#define STAGE_nos 36
#define STAGE_nshft 40
#define STAGE_in_c 44
#define STAGE_out_c 48
#define STAGE_space 52
#define STAGE_available 56
.text
.globl _src_flt_bufin;
_src_flt_bufin:
P1 = R0; // Address of input data
P0 = R1; // Address of fundamental structure
M0 = R2;
R0 = [sp + 12];
p0 = [p0]; // stage handle
[--SP]=(R7:4);
p0 = [p0]; // p0 -> first data structure
r6 = [p0 + STAGE_in_s]; // r6 -> first input buffer 'inx'
i0 = r6; // i0 -> first input buffer 'inx'
r6 = [p0 + STAGE_in_c];
b0 = r6; // b0 -> base of first input circular buffer
r6 = [p0 + STAGE_in_z];
r6 = r6 << 2; // convert words to bytes (4 bytes per word)
l0 = r6; // l0 = first input circular buffer size 'SZINx'
r7 = [p0 + STAGE_space];
r5 = [p0 + STAGE_nis];
/* Update space left after this copy. */
r1 = r7 - r0;
[p0 + STAGE_space] = r1;
/* Compute destination pointer from buffer pointer and space
left. */
r7 = r5 - r7;
r7 = r7 << 2;
m2 = r7;
i0 += m2;
p2 = r0; // p2 = number of input samples per block
i1 = p1; // load i1 with address of 'inputData'
l1 = 0;
r6.l = 0;
LSETUP(READ_INPUTS_BEGIN, READ_INPUTS_END) LC0 = p2;
READ_INPUTS_BEGIN:
r6.h = w[i1]; // read the input buffer 'inputData'
i1 += m0;
READ_INPUTS_END:
[i0++] = r6; // write input into input buffer 'inx'
(R7:4)=[SP++]; // Pop R7:4
L0=0;
L1=0;
L2=0;
L3=0;
RTS;
/* STACK LAYOUT
Local variables: [0..16[
Saved registers: SP + [16..44[
Input args: SP + [44 .. 48[
+44 FUNDAMENTAL STRUCTURE */
#define OFF_PT_FUNDST 0
#define OFF_PT2_FUNDST 4
#define OFF_ST_HANDLE 8
#define OFF_FUNDST 44
.globl _src_flt;
_src_flt:
P0 = R0; // Address of fundamental structure
[--SP]=(R7:4,P5:3);
SP += -16;
r7 = [p0++];
[SP + OFF_ST_HANDLE] = r7; // store stage data handle
r6 = [p0++]; // r6 = half band flag (move past this for now)
r2 = [p0++]; // r2 = # of up stages
[SP + OFF_PT2_FUNDST] = p0; // save pointer to current fundamental structure
CC = r2 <= 0;
IF CC JUMP over_upstage; // if upstage = 0, jump over
UPSTAGE_BEGIN:
p4 = [SP + OFF_ST_HANDLE]; // p4 -> current stage data handle
r7 = [p4++];
p0 = r7; // p0 -> stage data
[SP + OFF_ST_HANDLE] = p4; // save pointer to stage data handle
up_src:
r7 = [p0 + STAGE_in_s]; // r7 -> input signal 'inx'
r5 = [p0 + STAGE_in_c];
b3 = r5; // b3 set for circular buffering
r5 = [p0 + STAGE_in_z];
r5 = r5 << 2; // convert words to bytes (4 bytes per word)
l3 = r5; // l3 = Size of Input Stage (SIS)
r6 = [p0 + STAGE_out_s];
i2 = r6; // i2 -> output signal 'inx'+1 buffer (output buffer)
r6 = [p0 + STAGE_out_c];
b2 = r6; // b2 set for circular buffering
r6 = [p0 + STAGE_out_z];
r6 = r6 << 2; // convert words to bytes (4 bytes per word)
l2 = r6; // l2 = Size of Output Stage (SOS)
r3 = [p0 + STAGE_h]; // r3 -> the filter coefficients
p3 = [p0 + STAGE_plen]; // p3 = poly-phase filter size
p4 = 8; // always skip over DNR (2*4bytes) in the up SRC
p5 = [p0 + STAGE_up]; // p5 = Up Sample Rate (UPR)
r0 = [p0 + STAGE_nis]; // r0 = NIS
[p0 + STAGE_space] = r0; // free up input space
r6 = [p0 + STAGE_nshft]; // r6 = number of shifts (always an arithmetic left shift..upshift)
m2 = r6; // Save in m2
UP_SRC_OUTER_BEGIN:
i1 = r3; // i1 -> filter coefficients
l1 = 0; // linear addressing???
LSETUP(UP_SAMPLE_BEGIN, UP_SAMPLE_END) LC0 = p5;
UP_SAMPLE_BEGIN:
i3 = r7; // i3 - > 'in' buffer
A1=A0=0 || R6=[I1++] || R5=[I3--]; // r6=filter coef, r5='inx' buffer
LSETUP(POLY_PHASE_BEGIN, POLY_PHASE_END) LC1 = p3;
POLY_PHASE_BEGIN: R4=(A0+=R6.H*R5.H), A1+=R6.H*R5.L (M);
POLY_PHASE_END: R1=(A1+=R5.H*R6.L) (M) || R6=[I1++] || R5=[I3--];
// R1=R1>>16;
// R4=R4+R1 (S);
r5=m2; // load r5 with number of shifts
/******** IPDC comment *******/
// A1 = A1>>16;
/******************************/
/*********IPDC addition*******/
A1=A1>>>15;
/******************************/
A0+=A1;
A0 = ASHIFT A0 BY r5.l;
r4 = A0; // high half-word extraction with 16-bit saturation. Rounding cntrl by
// RND_MOD. 0 = unbiased rounding = default
// A0 = A0 >>> 1;
// R4 = A0;
UP_SAMPLE_END:
[i2++] = R4; // save output into 'inx'+1
i3 = r7; // get input back at beginning of 'inx'
m3 = 4;
i3 += m3; // increment by 1 word (4 bytes)
r7 = i3; // update r7 -> 'inx' buffer
UP_SRC_OUTER_END:
r0 += -1; // Check number of input samples (NIS)
CC = r0 <= 0;
IF !CC JUMP UP_SRC_OUTER_BEGIN; // if input samples remain (NIS > 0), loop back to UP_SRC_OUTER_BEGIN
[p0 + STAGE_in_s] = r7; // save the input signal address
r6 = i2;
[p0 + STAGE_out_s] = r6; // save the output signal address
UPSTAGE_END:
r2 += -1; // Check number of stages
CC = r2 <= 0;
IF !CC JUMP UPSTAGE_BEGIN; // if upstage not equal to 0, jump to UPSTAGE_BEGIN
over_upstage:
p0 = [SP + OFF_PT2_FUNDST]; // p0 -> fundamental structure
r6 = [p0++]; // r6 = pivot flag
[SP + OFF_PT2_FUNDST] = p0; // save fundamental structure
CC = r6 <= 0;
IF CC JUMP over_pivotstage; // if pivotstage = 0, jump over
p4 = [SP + OFF_ST_HANDLE]; // p4 -> current stage data handle
r7 = [p4++];
p0 = r7; // p0 -> stage data
[SP + OFF_ST_HANDLE] = p4; // save pointer to stage data handle
pvt_src:
r7 = [p0 + STAGE_in_s]; // r7 -> input signal 'inx'
i3 = r7;
r5 = [p0 + STAGE_in_c];
b3 = r5;
r5 = [p0 + STAGE_in_z];
r5 = r5 << 2; // convert words to bytes (4 bytes per word)
l3 = r5; // l3 = Size of Input Stage (SIS)
r6 = [p0 + STAGE_out_s];
i2 = r6; // i2 -> output signal 'inx'+1 buffer (output buffer)
r6 = [p0 + STAGE_out_c];
b2 = r6; // b2 set for circular buffering
r6 = [p0 + STAGE_out_z];
r6 = r6 << 2; // convert words to bytes (4 bytes per word)
l2 = r6; // l2 = Size of Output Stage (SOS)
r3 = [p0 + STAGE_h]; // r3 -> the filter coefficients
r6 = [p0 + STAGE_plen]; // r6 = poly-phase filter size
p3 = r6; // save poly-phase into p3
r6 = [p0 + STAGE_up]; // r6 = UPR (filter step)
r6 = r6 << 2; // post increment must be in bytes (4 bytes per word)
m1 = r6; // post increment set to UPR
// always skip over UPR (2*4bytes) in the up SRC
r0 = [p0 + STAGE_nis]; // r0 = NIS
[p0 + STAGE_space] = r0; // free up input space
r0 = [p0 + STAGE_dn]; // r0 = DNR
r0 = r0 << 2; // four bytes per word
p5 = [p0 + STAGE_nos]; // p5 = Number of Outputs (NOS)
[p0 + STAGE_available] = p5;
// r1.l = w[p0 + STAGE_nshft]; // r1.l = Number of shifts (can be left shift=upshift or right shift=downshift)
r6 = [p0 + STAGE_nshft];
m2 = r6; // m2 = Number of shifts
r2 = 0; // set poly index value to 0
i1 = r3; // i1 -> filter coefficients
// CC = r6 <= 0;
// IF !CC JUMP pvt_positive; // if # of shifts > 0, jump over
// CC = r6 < 0;
// IF CC JUMP pvt_negative; // if # of shifts < 0, jump over
LSETUP(PVT_OUT_BEGIN, PVT_OUT_END) LC0 = p5;
PVT_OUT_BEGIN:
m3 = i3; // save i3 into m3;
A1=A0=0 || R6=[I1] || R5=[I3--]; // r6=filter coef, r5='inx' buffer
// i1 += m1;
LSETUP(PVT_FILTER_BEGIN, PVT_FILTER_END) LC1 = p3;
PVT_FILTER_BEGIN:
R4=(A0+=R6.H*R5.H), A1+=R6.H*R5.L (M)||i1 += m1;
PVT_FILTER_END:
R1=(A1+=R5.H*R6.L) (M) || R6=[I1] || R5=[I3--];
i1 += m1;
// R1=R1>>16;
// R4=R4+R1 (S);
r5 = m2;
/******** IPDC comment *******/
// A1 = A1>>16;
/******************************/
/*********IPDC addition*******/
A1=A1>>>15;
/******************************/
A0+=A1;
A0 = ASHIFT A0 BY r5.l;
r6 = A0; // high half-word extraction with 16-bit saturation. Rounding cntrl by
// RND_MOD. 0 = unbiased rounding = default
[i2++] = r6; // save output into 'inx'+1
// R6.H=(A1+=R6.L*R5.H) || NOP || NOP;
// [i2++] = r4; // save output into 'inx'+1
//new_poly:
r7 = r2; // r7 = poly_index
r7 = r7 + r0; // r7 = poly_index + DNR
i3 = m3; // restore i3
r6 = m1; // r6 = UPR
test_address:
r5 = r7 - r6; // (poly_index + DNR)-UPR
CC = r5 < 0;
IF CC JUMP next_address; // if true, jump over
r7 = r5; // r7 = new poly_index
m0 = 4;
i3 += m0; // increment by 1 word (4 bytes)
JUMP test_address; // test the new poly_index
next_address:
r2 = r7; // save the new address
r7 = r7 + r3; // r7 -> adjusted filter address
PVT_OUT_END:
i1 = r7; // i1 -> poly-phase filter
pvt_return:
r7 = i3; // update r7 -> 'inx' buffer
p4 = 8; // 2 words (2*4bytes per word)
[p0++p4] = r7; // save the input signal
r6 = i2;
[p0] = r6; // save the output signal address
over_pivotstage:
p0 = [SP + OFF_PT2_FUNDST]; // p0 -> fundamental structure
r0 = [p0++]; // r0 = number of down stages
CC = r0 <= 0;
IF CC JUMP return_src_core; // if number of down stages = 0, RTS
DOWNSTAGE_BEGIN:
p4 = [SP + OFF_ST_HANDLE]; // p4 -> current stage data handle
p0 = [p4++]; // p0 -> stage data
[SP + OFF_ST_HANDLE] = p4; // save pointer to stage data handle
dn_src:
r7 = [p0 + STAGE_in_s]; // r7 -> input signal 'inx'
i3 = r7;
r5 = [p0 + STAGE_in_c];
b3 = r5;
r5 = [p0 + STAGE_in_z];
r5 = r5 << 2; // convert words to bytes (4 bytes per word)
l3 = r5; // l3 = Size of Input Stage (SIS)
r6 = [p0 + STAGE_out_s];
i2 = r6; // i2 -> output signal 'inx'+1 buffer (output buffer)
r6 = [p0 + STAGE_out_c];
b2 = r6; // b2 set for circular buffering
r6 = [p0 + STAGE_out_z];
r6 = r6 << 2; // convert words to bytes (4 bytes per word)
l2 = r6; // l2 = Size of Output Stage (SOS)
p4 = 8; // always skip over DNR (2*4bytes) in the up SRC
r3 = [p0 + STAGE_h]; // r3 -> the filter coefficients
r6 = [p0 + STAGE_plen]; // r6 = filter length
p3 = r6;
r4 = [p0 + STAGE_dn]; // r4 = DNR
r4 = r4 << 2; // Four bytes per word
m3 = r4;
p5 = [p0 + STAGE_nos]; // p5 = number of outputs
[p0 + STAGE_available] = p5;
r2 = [p0 + STAGE_nshft]; // r2 = number of shifts
LSETUP(DN_OUT_BEGIN, DN_OUT_END) LC0 = p5;
DN_OUT_BEGIN:
i1 = r3; // i1 -> filter coefficients
m1 = i3; // save i3 into m1
A1=A0=0 || R6=[I1++] || R5=[I3--]; // r6=filter coef, r5='inx' buffer
LSETUP(DOWN_FILTER_BEGIN, DOWN_FILTER_END) LC1 = p3;
DOWN_FILTER_BEGIN:
R4=(A0+=R6.H*R5.H), A1+=R6.H*R5.L (M);
DOWN_FILTER_END:
R1=(A1+=R5.H*R6.L) (M) || R6=[I1++] || R5=[I3--];
// R1=R1>>16;
// R4=R4+R1 (S);
/******** IPDC comment *******/
// A1 = A1>>16;
/******************************/
/*********IPDC addition*******/
A1=A1>>>15;
/******************************/
A0+=A1;
A0 = ASHIFT A0 BY r2.l;
r6 = A0; // high half-word extraction with 16-bit saturation. Rounding cntrl by
// RND_MOD. 0 = unbiased rounding = default
[i2++] = r6; // save output into 'inx'+1
// JUMP shiftDone;
//shiftPos: // Left Shift = Up shift = positive number
// A1 = ASHIFT A1 BY r2.l;
// r6.h = A1; // high half-word extraction with 16-bit saturation. Rounding cntrl by
// RND_MOD. 0 = unbiased rounding = default
// w[i2++] = r6.h; // save output into 'inx'+1
//shiftDone:
i3 = m1; // restore i3
DN_OUT_END:
i3 += m3; // increment by 4 bytes per word
r7 = i3;
p4 = 8; // 2 words (2*4bytes per word)
[p0 + STAGE_in_s] = r7; // save the input signal address
r6 = i2;
[p0 + STAGE_out_s] = r6; // save the output signal address
DOWNSTAGE_END:
r0 += -1; // Check number of downstages
CC = r0 <= 0;
IF !CC JUMP DOWNSTAGE_BEGIN; // if down stages remain, loop back to DOWNSTAGE_BEGIN
return_src_core:
/* Return a pointer to the last stage we processed. */
p0 = [SP + OFF_ST_HANDLE];
p0 += -4;
r0 = [p0];
SP += 16;
(R7:4,P5:3)=[SP++]; // Pop R7 ...P5
L0=0;
L1=0;
L2=0;
L3=0;
RTS;
_src_flt.end:
.global _src_flt_bufout
_src_flt_bufout:
p1 = r0; // destination buffer
p0 = r1; // stage pointer
[--sp] = (p5:5);
r0 = [p0 + STAGE_out_c];
b2 = r0;
r0 = [p0 + STAGE_out_z];
r0 = r0 << 2; // convert words to bytes (4 bytes per word)
l2 = r0; // l2 = Size of Output Stage (SOS)
r0 = [p0 + STAGE_out_s];
i2 = r0;
p5 = [p0 + STAGE_available];
p2 = p5 << 2;
m0 = p2;
i2 -= m0;
p2 = [sp + 16];
p5 -= p2;
[p0 + STAGE_available] = p5;
p0 = r2; // stride
LSETUP(READ_OUTS_BEGIN, READ_OUTS_END) LC0 = p2;
READ_OUTS_BEGIN:
r0 = [i2++]; // get 32-bit output from buffer
READ_OUTS_END:
w[p1 ++ p0] = r0.h; // write 16-bit output to 'outputData'
(P5:5)=[SP++]; // Pop P5
L2=0;
RTS;
src_init.S:
/*initial.asm: Sample Rate Conversion Version 0.1
P0 -> a fundamental structure
Registers used: P0, P1, P2, P5, R2, R3, R4, R5, R6, R7
*/
.text;
.align 4;
.global _src_init;
/*
initialize all the buffers (inputs and delay)
P0 -> a fundamental structure
*/
_src_init:
[--SP]=(R7:4,P5:3); // Push R7:4 and P5:3
P0 = R0; // Address of fundamental structure
p5 = 20; // 5*4 bytes = 20 byte-wide increment
r6 = [p0++p5]; // Pointer to fundamental structure 'fs_x' post increment of 5 32-bit words
//jws p1 = r2; // p1 = 32-bit pointer 'st_handle'
r7 = [p0++]; // load number of stages
p5 = r7;
p1 = r6; // p1 = 32-bit pointer 'st_handle'
LSETUP(0f, 1f) LC0 = p5;
0:
r2 = [p1++];
p2 = r2; // p2 -> 'datax'
r3 = [p2++]; // r3 -> first element 'inx'
r4 = [p2++]; // r4 = length 'SZINx'
i0 = r3; // i0 -> 'inx' buffer
p5 = r4;
r5 = 0;
l0 = 0; // l0 = 0 selects linear (non-circular) addressing for i0
LSETUP(2f, 3f) LC1 = p5;
2:
3:
[i0++] = r5; // zero out a 32-bit word
1:
nop;
(R7:4,P5:3)=[SP++]; // Pop R7:4 and P5:3
RTS;
_src_init.end:
SRC_test.tar.bz2
2010-10-13 09:42:46 Re: Can't get SRC algorithm to fly
Mike Frysinger (UNITED STATES)
Message: 94471
all Blackfin cores are the same so any core behavior is going to be exactly the same across variants. the only real difference you'd see would be due to diff in bus sizes or memory types, but the BF52x and BF53x are the same there.
you could try moving the funcs to l1 too with the l1_text attribute.
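As a minimal sketch of the suggestion above: the Blackfin GCC port documents an `l1_text` function attribute that places a function in the `.l1.text` section. The macro name `SRC_L1_TEXT` and the routine `src_copy_high16` are hypothetical and only illustrate where the attribute goes; the `__bfin__` guard is an assumption that lets the same file build on a host compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the Blackfin toolchain defines __bfin__. On any other
   compiler the attribute is dropped so the sketch still builds. */
#if defined(__bfin__)
#define SRC_L1_TEXT __attribute__((l1_text))
#else
#define SRC_L1_TEXT
#endif

/* Hypothetical hot routine: store the high 16 bits of each 32-bit result
   into a 16-bit output buffer, stepping 'stride' elements per sample. */
SRC_L1_TEXT
void src_copy_high16(int16_t *dst, const int32_t *src,
                     unsigned count, unsigned stride)
{
    assert(dst != NULL && src != NULL && stride > 0);
    for (unsigned i = 0; i < count; i++)
        dst[i * stride] = (int16_t)(src[i] >> 16);
}
```

Whether the function really ended up in L1 can be checked in the linker map file.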
2010-10-13 12:29:22 Re: Can't get SRC algorithm to fly
Daniel Persson (SWEDEN)
Message: 94486
Thanks for the reply.
I have tried putting all code in L1 instruction RAM. Now I get a solid 73 MIPS (before, it alternated seemingly at random between 73 and 102 MIPS), which is still far more than I can afford.
Jeff Sondermeyer wrote in EE183 that the VDSP++ implementation consumed 2 MIPS. So I assume there has to be something really wrong with my build. Do you have any more ideas of what might be wrong?
2010-10-13 13:56:18 Re: Can't get SRC algorithm to fly
Mike Frysinger (UNITED STATES)
Message: 94488
did you only change the C files ? or did you also change the assembly from ".text" to ".section .l1.text" ?
otherwise, i really know nothing of signal processing. someone else will have to field that aspect.
2010-10-13 20:14:29 Re: Can't get SRC algorithm to fly
Simon Brewer (AUSTRALIA)
Message: 94493
I had a look at the code, and can make another suggestion. You need to ensure that the filter coefficients and the data are in separate L1 data banks (i.e. bank A and bank B), or else you will get extra stalls in the filter implementation.
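A rough sketch of the bank split, assuming the `l1_data_A` and `l1_data_B` variable attributes of the Blackfin GCC port (sections `.l1.data.A` and `.l1.data.B`). The buffer names, `NTAPS`, and the helper functions are hypothetical and not taken from the SRC sources; the `__bfin__` guard is an assumption so the file also builds on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: __bfin__ is defined by the Blackfin toolchain only. */
#if defined(__bfin__)
#define SRC_BANK_A __attribute__((l1_data_A))
#define SRC_BANK_B __attribute__((l1_data_B))
#else
#define SRC_BANK_A
#define SRC_BANK_B
#endif

#define NTAPS 8 /* hypothetical filter length */

/* Coefficients in one bank, delay line in the other, so a MAC can fetch
   one operand from each bank in the same cycle. */
static int32_t fir_coeffs[NTAPS] SRC_BANK_A;
static int32_t fir_delay[NTAPS] SRC_BANK_B;

void fir_set_coeff(size_t i, int32_t c)
{
    assert(i < NTAPS);
    fir_coeffs[i] = c;
}

/* Push one sample into the delay line and return the filter output. */
int64_t fir_step(int32_t x)
{
    int64_t acc = 0;
    for (size_t i = NTAPS - 1; i > 0; i--)
        fir_delay[i] = fir_delay[i - 1];
    fir_delay[0] = x;
    for (size_t i = 0; i < NTAPS; i++)
        acc += (int64_t)fir_coeffs[i] * fir_delay[i];
    return acc;
}
```

The placement only helps if the inner loop actually issues both fetches in parallel; the map file shows where each array landed.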
2010-10-13 20:36:22 Re: Can't get SRC algorithm to fly
Simon Brewer (AUSTRALIA)
Message: 94494
One other thing to keep in mind. This code runs for about 0.14 seconds on my test board. During that period there will be ~35 timer interrupts (assuming a 250 Hz timer interrupt rate). The cycle count will measure the total cycles for everything that occurs in that 0.14 second period.
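The interrupt estimate above is simply run time multiplied by tick rate. A small sketch of that arithmetic (the function name is made up for illustration):

```c
#include <assert.h>

/* Number of timer ticks that fire while a task runs for run_ms
   milliseconds at a tick rate of tick_hz. */
unsigned long timer_ticks_during(unsigned long run_ms, unsigned long tick_hz)
{
    assert(tick_hz > 0);
    return (run_ms * tick_hz) / 1000UL;
}
```

With the figures from the post, `timer_ticks_during(140, 250)` gives 35.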
2010-10-14 05:02:52 Re: Can't get SRC algorithm to fly
Daniel Persson (SWEDEN)
Message: 94506
Mike, I have put both assembler and C code in instruction cache and verified it by looking in the map file.
Simon, thanks for the suggestion. I have tried putting the data and the filter in separate L1 banks, but unfortunately it didn't make any difference for my application. I verified that the data and the filter ended up in separate L1 banks by looking at the map file.
You say that the code runs for 0.14 s on your test board, which is much faster than the time it takes to run through the application on my board:
real 0m 0.48s
user 0m 0.46s
sys 0m 0.02s
What core clock are you using? What cycle count do you get as output from the application?
As you say, there will be a Linux overhead (the timer interrupts, I guess?), but I was hoping that the overhead would be small enough not to have any major impact on system performance, and that I would be able to perform realtime signal processing (like SRC for several parallel audio channels).
KR, Daniel
2010-10-15 02:03:13 Re: Can't get SRC algorithm to fly
Simon Brewer (AUSTRALIA)
Message: 94540
Hi Daniel,
I had some time today to look at this in more detail. Running on my board I get about 70 MIPS as well. I am running a core clock of 525 MHz.
I guess the point I was making about the cycle counter was that it can be useful, but could be extremely misleading in some cases (if the process is pre-empted for example).
Anyway, I dug into the algorithm a little more (by stepping through it with GDB). The inner loop is 2 * 0xfe * 2 + 5 * 0xc * 2 instructions per sample. With a few other loops added in, the overall is 0x470 cycles per sample, which is about 55 MIPS (at 48 kHz). So we are in the ballpark...
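The arithmetic can be checked with a few lines of C. The expression quoted above evaluates to 0x470 (1136), and 1136 cycles at 48 kHz is roughly 54.5 million cycles per second. The function names are only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Loop counts as quoted from the GDB session. */
uint32_t src_cycles_per_sample(void)
{
    return 2u * 0xfeu * 2u + 5u * 0xcu * 2u;
}

/* Millions of cycles per second needed to sustain the given
   per-sample cost at the given output sample rate. */
double mips_at(uint32_t cycles_per_sample, uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return (double)cycles_per_sample * (double)sample_rate_hz / 1e6;
}
```

`mips_at(src_cycles_per_sample(), 48000)` comes out at about 54.5, which matches the "about 55 MIPS" estimate.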
Working from the other direction, if the MIPS budget were 2, then at 48 kHz there would be about 41 cycles per sample available. Given that this is 32-bit arithmetic (2 cycles per MAC), that leaves ~20 MACs per sample, in other words a 20-tap filter; this is not going to give very good performance... Using 32-bit arithmetic with a 20-tap FIR is, err, a little bit of a waste ;-)
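The same budget argument as a sketch in C, using integer division so the results round down as in the post. The function names are hypothetical, and the 2 cycles per MAC figure is the one stated above.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles available per output sample for a given MIPS budget. */
uint32_t cycles_per_sample_budget(uint32_t mips, uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return (uint32_t)(((uint64_t)mips * 1000000u) / sample_rate_hz);
}

/* Longest FIR that fits if every tap costs cycles_per_mac cycles. */
uint32_t max_fir_taps(uint32_t cycles_available, uint32_t cycles_per_mac)
{
    assert(cycles_per_mac > 0);
    return cycles_available / cycles_per_mac;
}
```

For a 2 MIPS budget at 48 kHz this gives 41 cycles and, at 2 cycles per MAC, 20 taps.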
So in summary the SRC is probably working as expected. I think the document is probably misleading when it mentions 2 MIPS.
Simon
2010-10-15 04:09:06 Re: Can't get SRC algorithm to fly
Daniel Persson (SWEDEN)
Message: 94553
Hi Simon
Thanks for the time and expertise you have put in.
As you might have guessed I am new to digital signal processing, hence the following question.
The DSP requirement of my project is to be able to handle, in the worst case, two SRCs simultaneously, with an EQ filter on top of that (preferably several EQ filters). Would you say that it is possible to achieve this with a 150 MIPS budget? If you think it is possible, what kind of filter would you use for the SRCs and the EQ, and how many taps would they have? The audio quality I am currently aiming at is 16 bit at 44100 Hz. If you don't think it is possible, what kind of MIPS would be required to make it possible?
Daniel
2010-10-17 20:09:28 Re: Can't get SRC algorithm to fly
Simon Brewer (AUSTRALIA)
Message: 94639
Hi Daniel,
how many frequency bands do you want for your EQ?
I think 150 MIPS is OK, i.e. around 1700 cycles per sample, although quality compromises may need to be made.
What sort of delay can you handle through the system? What sort of audio quality are you expecting? If the quality requirements are very high, then the filter lengths will need to be longer.
Simon
2010-10-25 09:27:18 Re: Can't get SRC algorithm to fly
Daniel Persson (SWEDEN)
Message: 95137
Hi Simon
I will need a two-band EQ. The use case for the EQ is that the sound is received through a microphone and the environment may be noisy. The quality I aim for (if it is possible) is a sample rate of 44100 Hz and a word length of 16 bit, and since the user will hear himself via the microphone and headset, the latency has to be really short, like 10 ms (conflicting requirements, I know).
Daniel
2010-10-26 02:08:45 Re: Can't get SRC algorithm to fly
Simon Brewer (AUSTRALIA)
Message: 95157
Hi
A couple of comments. If your audio environment is noisy, it might be possible to compromise on filter quality. A two-band EQ is not that expensive. For example, you could implement an FIR-based QMF filter.
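To illustrate the idea only, here is a minimal sketch of a two-band EQ built on the simplest possible QMF pair, the 2-tap Haar filters (low = average, high = difference). This is an assumption chosen for brevity, not the filter proposed in the thread; a usable EQ would need longer filters with a chosen crossover frequency. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* State for a two-band EQ based on the 2-tap Haar QMF pair. */
typedef struct {
    float prev;      /* previous input sample */
    float gain_low;  /* gain applied to the low band */
    float gain_high; /* gain applied to the high band */
} two_band_eq;

void eq_init(two_band_eq *eq, float gain_low, float gain_high)
{
    assert(eq != NULL);
    eq->prev = 0.0f;
    eq->gain_low = gain_low;
    eq->gain_high = gain_high;
}

/* Split x into low and high bands, scale each, and recombine.
   With both gains at 1 the output equals the input. */
float eq_process(two_band_eq *eq, float x)
{
    float low = 0.5f * (x + eq->prev);
    float high = 0.5f * (x - eq->prev);
    eq->prev = x;
    return eq->gain_low * low + eq->gain_high * high;
}
```

Each sample costs two multiplies by 0.5, two gain multiplies and three additions, which is negligible next to the SRC cost discussed earlier in the thread.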
Where in the system do you need the 44.1 kHz -> 48 kHz conversion?
A latency of 10 ms at 44.1 kHz corresponds to a delay of 441 samples. This is pretty aggressive, and probably not attainable. I would do some measurements on your Linux system and figure out the delay through the audio paths.
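The conversion between latency and sample count, in both directions, as a small sketch (function names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Samples that fit into a latency budget of latency_ms milliseconds. */
uint32_t latency_to_samples(uint32_t latency_ms, uint32_t sample_rate_hz)
{
    return (uint32_t)(((uint64_t)latency_ms * sample_rate_hz) / 1000u);
}

/* Delay in whole milliseconds caused by buffering 'samples' samples. */
uint32_t samples_to_latency_ms(uint32_t samples, uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return (uint32_t)(((uint64_t)samples * 1000u) / sample_rate_hz);
}
```

At 44100 Hz, 10 ms is 441 samples, so every buffer in the audio path (capture, SRC, EQ, playback) has to share that total.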
Simon