Logomatic Code Optimization

So the point of this exercise is speed. At one point, when I had the maximum sample rate at about 10,000 SPS, I decided to look into the ADC conversion. The stock firmware set the ADC clock to be the main clock (58.98 MHz) divided by 256. This has the advantage of providing a longer sampling time (one ADC clock, I assume, although I could find nothing in the documentation to support that), but it means that conversions take quite a while.

Doing the math showed that this was limiting the data rate a lot. At about 20,000 SPS total the CPU would be spending all of its time doing conversions, mostly polling the conversion-done bit, leaving no time to handle the flash memory card. Increasing the ADC clock was an obvious step here. In the extreme, the ADC's burst conversion mode could be used, which would eliminate the polling loop at the cost of interrupt overhead.
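As an illustration, here is a minimal sketch of that change, assuming PCLK is the 58.98 MHz main clock and using the LPC213x-style register name AD0CR. The divider lives in bits 15:8 of AD0CR, the ADC clock is PCLK divided by (CLKDIV + 1), and the user manual caps the ADC clock at 4.5 MHz, so dividing by 14 is about as fast as it can legally go:

// Sketch only: raise the ADC clock by shrinking the CLKDIV field.
// Assumes PCLK = 58.98 MHz, so dividing by 14 gives about 4.2 MHz,
// just under the 4.5 MHz limit, versus the stock divide-by-256.
#define ADC_CLKDIV  13                        // ADC clock = PCLK / (13 + 1)

void ADCSpeedup(void)
{
  AD0CR = (AD0CR & ~(0xFFul << 8))            // clear the old divider
        | ((unsigned long)ADC_CLKDIV << 8);   // install the faster one
  // Setting bit 16 (BURST) instead would let the hardware convert
  // continuously, eliminating the polling loop at the cost of
  // interrupt overhead.
}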

But of course the SPI code has the same problem. After dumping a byte into the SPI transmit register, it polls the SPI port to see when it is done. Unfortunately there is no hardware help here. Making this interrupt driven might improve things a bit but what is really needed is DMA and that isn't available on the LPC2138. So I decided it was time to look at the code generated by the C compiler to see if there was anything that could be done there to speed things up.

The code for transmitting a byte of data is:

unsigned char TxRxbyte(unsigned char data)
{
  S0SPDR  = data;                   // send next SPI channel 0 data
  while (!(S0SPSR & 0x80))      // wait for transfer completed
    ;
  return S0SPDR;
}

The compiler produced the following assembler source:

        .align  2
        .global TxRxbyte
        .type   TxRxbyte, %function
TxRxbyte:
.LFB3:
        .loc 1 25 0
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 4
        @ frame_needed = 1, uses_anonymous_args = 0
        mov     ip, sp
.LCFI3:
        stmfd   sp!, {fp, ip, lr, pc}
.LCFI4:
        sub     fp, ip, #4
.LCFI5:
        sub     sp, sp, #4
.LCFI6:
        mov     r3, r0
        strb    r3, [fp, #-16]
        .loc 1 26 0
        mov     r3, #-536870904
        add     r3, r3, #131072
        ldrb    r2, [fp, #-16]
        strb    r2, [r3, #0]
.L4:
        .loc 1 27 0
        mov     r3, #-536870908
        add     r3, r3, #131072
        ldrb    r3, [r3, #0]
        and     r3, r3, #255
        mov     r3, r3, asl #24
        mov     r3, r3, asr #24
        cmp     r3, #0
        bge     .L4
        .loc 1 29 0
        mov     r3, #-536870904
        add     r3, r3, #131072
        ldrb    r3, [r3, #0]
        and     r3, r3, #255
        .loc 1 31 0
        mov     r0, r3
        sub     sp, fp, #12
        ldmfd   sp, {fp, sp, lr}
        bx      lr
.LFE3:

Ignoring for the moment that I know nothing about ARM assembly language, this still seems very inefficient. So I checked the compiler options and found that no optimization was selected. After setting the optimization level to 2 (-O2) the results were:

        .align  2
        .global TxRxbyte
        .type   TxRxbyte, %function
TxRxbyte:
.LFB3:
        .loc 1 25 0
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
.LVL0:
        .loc 1 26 0
        mov     r3, #-536870912
        add     r3, r3, #131072
        .loc 1 25 0
        and     r0, r0, #255
        .loc 1 26 0
        strb    r0, [r3, #8]
        .loc 1 25 0
        @ lr needed for prologue
        mov     r2, r3
.L4:
        .loc 1 27 0
        ldrb    r3, [r2, #4]    @ zero_extendqisi2
        tst     r3, #128
        beq     .L4
        .loc 1 29 0
        ldrb    r0, [r2, #8]    @ zero_extendqisi2
.LVL1:
        .loc 1 31 0
        bx      lr
.LFE3:

That looks a lot better. Focusing on the polling loop, the two versions are:

.L4:
        .loc 1 27 0
        mov     r3, #-536870908
        add     r3, r3, #131072
        ldrb    r3, [r3, #0]
        and     r3, r3, #255
        mov     r3, r3, asl #24
        mov     r3, r3, asr #24
        cmp     r3, #0
        bge     .L4

.L4:
        .loc 1 27 0
        ldrb    r3, [r2, #4]    @ zero_extendqisi2
        tst     r3, #128
        beq     .L4

Eight instructions in the first loop and only three in the second. But because it takes 64 CPU clocks to shift out a byte (presumably eight bits at the SPI clock's PCLK/8 maximum), the shorter loop alone will not significantly improve transfer speeds. It does show, however, that a routine optimized to transmit one sector of data might be useful.


void TxSector(unsigned char * data, int num)
{
  do
    {
      S0SPDR  = *data++;            // send the next byte
      while (!(S0SPSR & 0x80))      // wait for transfer completed
        ;
    } while(--num >= 0);            // note: the loop runs num + 1 times
}

The compiler turns this into:

        .align  2
        .global TxSector
        .type   TxSector, %function
TxSector:
.LFB4:
        .loc 1 38 0
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
.LVL2:
        mov     r2, #-536870912
        @ lr needed for prologue
        add     r2, r2, #131072
.L12:
        .loc 1 41 0
        ldrb    r3, [r0], #1    @ zero_extendqisi2
        strb    r3, [r2, #8]
.L13:
        .loc 1 42 0
        ldrb    r3, [r2, #4]    @ zero_extendqisi2
        tst     r3, #128
        beq     .L13
        .loc 1 44 0
        subs    r1, r1, #1
        bpl     .L12
        .loc 1 45 0
        bx      lr
.LFE4:

That could run very fast indeed. It still will not run at the full SPI clock rate, because there is no buffer register in the SPI hardware: there is always some dead time between noticing that one transfer has completed and writing the next byte to transmit. But with this code that dead time is a lot smaller.
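As a rough usage sketch (the buffer name below is hypothetical, not from the stock firmware), a sector write that used to call TxRxbyte once per byte can now hand the whole buffer to TxSector. Since the do/while runs num + 1 times, the caller passes 511 to transmit all 512 bytes:

unsigned char sd_buffer[512];     // hypothetical name for the sector buffer

void WriteSectorData(void)
{
  // One call per sector instead of 512 calls to TxRxbyte; passing 511
  // makes TxSector's do/while loop execute 512 times.
  TxSector(sd_buffer, 511);
}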
