# Cyclecount and assembly

Question asked by Tausen on Apr 28, 2013
Latest reply on May 30, 2013 by HakanCallido

Hey everyone,

I'm very new to Blackfin processors, so please bear with me.

I'm trying to make a simple assembly function and measure how many cycles this function takes. The purpose of the function is to calculate x^2+y^2, and here is my code:

    # first parameter: length, placed in R0
    # second parameter: destination, placed in R1
    .global _envelope
    _envelope:
        # save frame & return pointer on stack, and allocate 16 bytes for local vars
        LINK 16;
        [--SP]  = (R7:4);                       # save non-scratch regs on stack
        SP     += -16;                          # allocate space for outgoing args
        [FP+8]  = P3;                           # preserve P3

        # P0, P1 and P2 are scratch registers
        P1.H = 0x0160;                          # I from filter in 0x1600000
        P1.L = 0x0000;
        P2.H = 0x0168;                          # Q from filter in 0x1680000
        P2.L = 0x0000;
        P0 = R1;                                # result pointer
        P3 = R0;                                # place length parameter in P3

        LSETUP (.loopstart, .loopend) LC0 = P3; # setup HW loop, iterate P3 times
    .loopstart:
        R0 = W [P1++] (Z);                      # load 16 bits indirect from P1 and P2, zero extend
        R1 = W [P2++] (Z);                      # and increment pointers
        A0 = R0.L * R0.L (IS);                  # x^2 -> A0
        R2 = (A0 += R1.L * R1.L ) (IS);         # A0 += y^2 -> R2
    .loopend:
        W [P0++] = R2;                          # write 16 bits to where P0 points, increment pointer

        P3 = [FP+8];                            # restore P3
        SP += 16;                               # reclaim space on stack used for outgoing args
        (R7:4) = [SP++];                        # restore regs from save area
        UNLINK;                                 # restore frame and stack pointer
        RTS;                                    # return to caller

As you can probably see, I'm not quite sure about the function prologue and epilogue. Since I'm not actually returning anything, I guess I don't need to do SP += -16 and SP += 16. On the other hand, http://docs.blackfin.uclinux.org/doku.php?id=toolchain:application_binary_interface advises always allocating 12 bytes.

I then went ahead and tried to call my function along with the START_CYCLE_COUNT and STOP_CYCLE_COUNT macros:

    /* Includes */
    #include "main.h"
    #include <cycle_count.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fract.h>               // Fract data type
    #include <string.h>

    #define SIZE_16M (0x1000000)
    #define SIZE_8M ((SIZE_16M)/2)
    #define SIZE_4M ((SIZE_8M)/2)
    #define SIZE_2M ((SIZE_4M)/2)
    #define SIZE_1M ((SIZE_2M)/2)

    #define SMEM 0x1000000

    /* Sample buffer */
    #define SAMBUFSTART  SMEM
    #define SAMBLOCK     SIZE_2M
    #define SAMBLOCKS    3
    #define SAMBUFSIZE   (SAMBLOCK*SAMBLOCKS)
    #define SAMBUFEND    (SAMBUFSTART+SAMBUFSIZE)

    /* Memory for processing */
    #define PROCMEMSTART SAMBUFEND
    #define PROCMEMEND   (PROCMEMSTART+SAMBLOCK)

    extern void envelope(uint32_t length, uint32_t dest);

    int main(int argc, char*argv[]) {
        fract16 *tmp1, *tmp2;
        fract16 *envLP, *envHP;
        uint32_t n;

        memset((int *)SMEM,0x0000,(size_t)SIZE_16M);

        tmp1  = (fract16*)(PROCMEMSTART);
        tmp2  = (fract16*)(PROCMEMSTART+SAMBLOCK/4);
        envLP = (fract16*)(PROCMEMSTART+SAMBLOCK/4*2);
        envHP = (fract16*)(PROCMEMSTART+SAMBLOCK/4*3);

        for (n = 0; n < SAMBLOCK/2/4; n++) {
            tmp1[n] = 15;
            tmp2[n] = 25;
        }

        cycle_t start_count;
        cycle_t final_count;

        START_CYCLE_COUNT(start_count);
        envelope(SAMBLOCK/2/4, (uint32_t)envLP);
        STOP_CYCLE_COUNT(final_count,start_count);

        PRINT_CYCLES("Number of  cycles: ",final_count);
        printf("Done.\n");

        return 0;
    }

The output of running the above is:

Number of  cycles: 96398509

Done.

When I store the data, all values are 850 (15^2 + 25^2 = 225 + 625), so the code seems to work correctly.

My question is: why is the cycle count so high? Does each instruction not take only one cycle? The main loop of the envelope function has only 5 instructions and runs 2*1024^2/2/4 = 262144 times, so I figured the cycle count should not be much above 5*2*1024^2/2/4 = 1310720 cycles.