Doing the math showed that this was limiting the data rate a lot. At about 20,000 SPS total, the CPU would spend all of its time doing conversions, mostly polling the conversion-done bit, leaving no time to handle the flash memory card. Increasing the ADC clock was an obvious first step. In the extreme, the ADC's burst conversion mode could be used, which would eliminate the polling loop at the cost of interrupt overhead.
But of course the SPI code has the same problem. After writing a byte to the SPI transmit register, it polls the SPI port until the transfer is done. Unfortunately there is no hardware help here. Making this interrupt-driven might improve things a bit, but what is really needed is DMA, and that isn't available on the LPC2138. So I decided it was time to look at the code generated by the C compiler to see if anything could be done there to speed things up.
The code for transmitting a byte of data is:
    unsigned char TxRxbyte(unsigned char data)
    {
        S0SPDR = data;              // send next SPI channel 0 data
        while (!(S0SPSR & 0x80))    // wait for transfer completed
            ;
        return S0SPDR;
    }
The compiler produced the following assembler source:
            .align  2
            .global TxRxbyte
            .type   TxRxbyte, %function
    TxRxbyte:
    .LFB3:
            .loc 1 25 0
            @ Function supports interworking.
            @ args = 0, pretend = 0, frame = 4
            @ frame_needed = 1, uses_anonymous_args = 0
            mov     ip, sp
    .LCFI3:
            stmfd   sp!, {fp, ip, lr, pc}
    .LCFI4:
            sub     fp, ip, #4
    .LCFI5:
            sub     sp, sp, #4
    .LCFI6:
            mov     r3, r0
            strb    r3, [fp, #-16]
            .loc 1 26 0
            mov     r3, #-536870904
            add     r3, r3, #131072
            ldrb    r2, [fp, #-16]
            strb    r2, [r3, #0]
    .L4:
            .loc 1 27 0
            mov     r3, #-536870908
            add     r3, r3, #131072
            ldrb    r3, [r3, #0]
            and     r3, r3, #255
            mov     r3, r3, asl #24
            mov     r3, r3, asr #24
            cmp     r3, #0
            bge     .L4
            .loc 1 29 0
            mov     r3, #-536870904
            add     r3, r3, #131072
            ldrb    r3, [r3, #0]
            and     r3, r3, #255
            .loc 1 31 0
            mov     r0, r3
            sub     sp, fp, #12
            ldmfd   sp, {fp, sp, lr}
            bx      lr
    .LFE3:
Ignoring for the moment that I know nothing about ARM assembly language, this still seems very inefficient. So I checked the compiler options and no optimization was selected. After setting the optimization level to 2 the results were:
            .align  2
            .global TxRxbyte
            .type   TxRxbyte, %function
    TxRxbyte:
    .LFB3:
            .loc 1 25 0
            @ Function supports interworking.
            @ args = 0, pretend = 0, frame = 0
            @ frame_needed = 0, uses_anonymous_args = 0
            @ link register save eliminated.
    .LVL0:
            .loc 1 26 0
            mov     r3, #-536870912
            add     r3, r3, #131072
            .loc 1 25 0
            and     r0, r0, #255
            .loc 1 26 0
            strb    r0, [r3, #8]
            .loc 1 25 0
            @ lr needed for prologue
            mov     r2, r3
    .L4:
            .loc 1 27 0
            ldrb    r3, [r2, #4]    @ zero_extendqisi2
            tst     r3, #128
            beq     .L4
            .loc 1 29 0
            ldrb    r0, [r2, #8]    @ zero_extendqisi2
    .LVL1:
            .loc 1 31 0
            bx      lr
    .LFE3:
That looks a lot better. Focusing in on the polling loop, the two versions are:
    .L4:
            .loc 1 27 0
            mov     r3, #-536870908
            add     r3, r3, #131072
            ldrb    r3, [r3, #0]
            and     r3, r3, #255
            mov     r3, r3, asl #24
            mov     r3, r3, asr #24
            cmp     r3, #0
            bge     .L4

    .L4:
            .loc 1 27 0
            ldrb    r3, [r2, #4]    @ zero_extendqisi2
            tst     r3, #128
            beq     .L4
Eight instructions in the first loop and only three in the second. Because it takes 64 CPU clocks to shift out a byte, the shorter loop will not significantly improve transfer speeds. But it does suggest that a routine optimized to transmit one sector of data might be useful.
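As a rough check on why the loop length matters so little, the arithmetic can be sketched in C. The 64 clocks per byte comes from the text above; the 60 MHz clock used in the usage note, the overhead values, and the name `spi_bytes_per_second` are assumptions chosen for illustration, not measurements from this project.

```c
#include <stdint.h>

/* Clocks needed to shift one byte out: 64, per the figure quoted above. */
#define SHIFT_CLOCKS_PER_BYTE 64u

/* Hypothetical helper: upper bound on SPI bytes per second.
   clock_hz        - CPU clock in Hz (an assumed value, e.g. 60 MHz)
   overhead_clocks - extra clocks spent per byte outside the shift itself,
                     e.g. the tail of the polling loop plus the next write */
uint32_t spi_bytes_per_second(uint32_t clock_hz, uint32_t overhead_clocks)
{
    return clock_hz / (SHIFT_CLOCKS_PER_BYTE + overhead_clocks);
}
```

With an assumed 60 MHz clock, zero overhead gives 937,500 bytes per second, 8 clocks of overhead gives 833,333, and 16 clocks gives 750,000. The shift time dominates, so trimming a few instructions from the polling loop moves the total only modestly.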
    /* Send a block of data over SPI channel 0.
       Note: as written, the loop transmits num + 1 bytes. */
    void TxSector(unsigned char * data, int num)
    {
        do {
            S0SPDR = *data++;
            while (!(S0SPSR & 0x80))
                ;
        } while(--num >= 0);
    }
            .align  2
            .global TxSector
            .type   TxSector, %function
    TxSector:
    .LFB4:
            .loc 1 38 0
            @ Function supports interworking.
            @ args = 0, pretend = 0, frame = 0
            @ frame_needed = 0, uses_anonymous_args = 0
            @ link register save eliminated.
    .LVL2:
            mov     r2, #-536870912
            @ lr needed for prologue
            add     r2, r2, #131072
    .L12:
            .loc 1 41 0
            ldrb    r3, [r0], #1    @ zero_extendqisi2
            strb    r3, [r2, #8]
    .L13:
            .loc 1 42 0
            ldrb    r3, [r2, #4]    @ zero_extendqisi2
            tst     r3, #128
            beq     .L13
            .loc 1 44 0
            subs    r1, r1, #1
            bpl     .L12
            .loc 1 45 0
            bx      lr
    .LFE4:
That could run very fast indeed. It will still not run at the full SPI clock rate, because the SPI hardware has no buffer register: there is always a delay between noticing that the transfer is complete and writing the next byte to transmit. But with this code, that gap is a lot smaller.
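One possible way to trim that gap a little further is to fetch the next byte from memory while the current one is still shifting out, so that only a single store remains after the transfer-complete bit is seen. The sketch below is an untested idea, not the code used in this project. The names `TxSectorPrefetch`, `mock_spdr`, and `mock_spsr` are hypothetical; the mock registers exist only so the sketch compiles on a host PC, and on the target `S0SPDR` and `S0SPSR` would come from the LPC2138 register header. Whether the compiler actually schedules the load ahead of the poll would need to be confirmed in the generated assembly.

```c
#include <stddef.h>

/* Hypothetical host-side stand-ins for the SPI0 registers. On the target
   these macros are already defined by the LPC2138 header and this block
   is skipped. */
#ifndef S0SPDR
static volatile unsigned char mock_spdr;
static volatile unsigned char mock_spsr = 0x80;  /* transfer always "done" */
#define S0SPDR mock_spdr
#define S0SPSR mock_spsr
#endif

/* Send exactly len bytes, loading the following byte while the current
   one is still being shifted out. */
void TxSectorPrefetch(const unsigned char *data, size_t len)
{
    unsigned char next;

    if (len == 0)
        return;

    next = *data++;
    while (len-- > 0) {
        S0SPDR = next;              /* start shifting this byte out */
        if (len > 0)
            next = *data++;         /* fetch the next byte during the transfer */
        while (!(S0SPSR & 0x80))    /* wait for transfer completed */
            ;
    }
}
```

Unlike the `--num >= 0` loop, this version sends exactly `len` bytes and never reads past the end of the buffer, so a 512-byte sector would be sent with `TxSectorPrefetch(buffer, 512)`.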