Das BlinkenBoard code, Room for Improvement

I am thinking about designing a variant of the Das BlinkenBoard. While thinking about the hardware I thought that a good place to start the software would be to duplicate the functionality of the Blinkenboard. So I grabbed the BlinkenBoard code for a look. I have never used an AVR before so I had to dig into the data sheets to answer some questions and I quickly realized there was trouble in the interrupt code. Either I do not understand the ATTINY84 at all or the code is messed up. Here is the portion of the code I am having trouble with:

volatile uint8_t led; //volatile uint8_t pwm[8];
volatile int8_t pwm[8]; //volatile int seed;
volatile int8_t direction[8];

void ledOff(unsigned char pos)
	//PORTA = 0b00000000;
	//PORTA |= _BV(0);

	if (pos>4)
		PORTB |= _BV(7-pos);
		//PORTB = 0b00000001;
		if (pos==4) { pos=7; }
		PORTA |= _BV(pos);

	if (led>100) { led=0; }
	// very important here to do PORTA as 0b01110000 it will
	// turn off all LEDS and Keep those internal pullups on.
	// if you turn off the internal pullups the jumpers
	// start reading noise.
	if (led==0) { PORTA&=0b01110000; PORTB=0b00000000; }
  	if (led>pwm[0]) ledOff(0);
  	if (led>pwm[1]) ledOff(1);
  	if (led>pwm[2]) ledOff(2);
  	if (led>pwm[3]) ledOff(3);
  	if (led>pwm[4]) ledOff(4);
  	if (led>pwm[5]) ledOff(5);
  	if (led>pwm[6]) ledOff(6);
  	if (led>pwm[7]) ledOff(7);

The problems with this code are not immediately apparent unless you look at the code generated by the compiler. You can do this in a couple of ways. The most obvious is to use the C compiler with the -S option to produce a listing in assembly. The other is to use the objdump program (or avr-objdump in this case) to disassemble the object file.

Here is the code generated by the line if (led>pwm[0]) ledOff(0); in the interrupt routine:

        lds r18,led
        lds r24,pwm
        ldi r19,lo8(0)
        clr r25
        sbrc r24,7
        com r25
        cp r24,r18
        cpc r25,r19
        brge .L92
        ldi r24,lo8(0)
        rcall ledOff

The first four lines load the values for the variables "led" and "pwm" into the registers. They also promote them from type char to type int. Not exactly what one would expect since they are both byte sized variables. Except that one is signed and the other is unsigned. Making them both the same type eliminates this extra code.

A more subtle problem is that the value of led gets reloaded into r18 before each comparison. This is a result of it being declared as "volatile". Which means that the value can change when you aren't looking. Except that it can't since it is only changed in this interrupt routine. If the main program ever referenced this variable it would have to be considered volatile but it doesn't. The result is 2 cycles wasted on every comparison.

Another problem is that because the conditional is greater than, the function ledOff() will be called every time the interrupt routine executes once the led counter is large enough. Using the equality conditional would fix this but since led is incremented by 4 rather than 1 there is a good chance you could miss it if the value in pwm[] isn't a multiple of 4. (which happens frequently in the code) This doesn't seem like too big a deal until you look at the code generated for ledOff():

void ledOff(unsigned char pos)
        //PORTA = 0b00000000;
        //PORTA |= _BV(0);

        if (pos>4)
  9c:   85 30           cpi     r24, 0x05       ; 5
  9e:   78 f0           brcs    .+30            ; 0xbe (ledOff+0x22>
                PORTB |= _BV(7-pos);
  a0:   48 b3           in      r20, 0x18       ; 24
  a2:   27 e0           ldi     r18, 0x07       ; 7
  a4:   30 e0           ldi     r19, 0x00       ; 0
  a6:   28 1b           sub     r18, r24
  a8:   31 09           sbc     r19, r1
  aa:   81 e0           ldi     r24, 0x01       ; 1
  ac:   90 e0           ldi     r25, 0x00       ; 0
  ae:   02 c0           rjmp    .+4             ; 0xb4 (ledOff+0x18>
  b0:   88 0f           add     r24, r24
  b2:   99 1f           adc     r25, r25
  b4:   2a 95           dec     r18
  b6:   e2 f7           brpl    .-8             ; 0xb0 (ledOff+0x14>
  b8:   48 2b           or      r20, r24
  ba:   48 bb           out     0x18, r20       ; 24
  bc:   08 95           ret
                //PORTB = 0b00000001;
                if (pos==4) { pos=7; }
  be:   84 30           cpi     r24, 0x04       ; 4
  c0:   09 f4           brne    .+2             ; 0xc4 (ledOff+0x28>
  c2:   87 e0           ldi     r24, 0x07       ; 7
                PORTA |= _BV(pos);
  c4:   9b b3           in      r25, 0x1b       ; 27
  c6:   21 e0           ldi     r18, 0x01       ; 1
  c8:   30 e0           ldi     r19, 0x00       ; 0
  ca:   02 c0           rjmp    .+4             ; 0xd0 (ledOff+0x34>
  cc:   22 0f           add     r18, r18
  ce:   33 1f           adc     r19, r19
  d0:   8a 95           dec     r24
  d2:   e2 f7           brpl    .-8             ; 0xcc (ledOff+0x30>
  d4:   92 2b           or      r25, r18
  d6:   9b bb           out     0x1b, r25       ; 27
  d8:   08 95           ret

For something that only has to set one bit this seems a bit excessive. And slow. The biggest problem is that it uses a loop to shift a bit into position. Slowly.

The last line of the interrupt routine is another thing that troubles me. Looking at the initialization code it appears as though the timer has been configured to run from the CPU clock in the continuous mode. When the counter is equal to 0x1f it triggers an interrupt but keeps on counting. The interrupt code executes (resetting the interrupt) and at the end sets the timer counter to zero. There are many potential problems here.

The first is that the timer is of course still running. In only 256 clock cycles the interrupt will be generated again and since the timer is running from the same clock as the CPU that means 256 instruction cycles. The code is so poorly written that it is nearly certain (especially if ledOff() is called) that it will not complete the interrupt routine before another interrupt is triggered. Writing to TCNT0 appears to reset any pending timer interrupt which is good.

But the compare register is set such that only 32 instruction cycles will go by before the next interrupt. There are enough pop instructions at the end of the interrupt routine to consume all of those cycles. Which means that as soon as the reti instruction executes, there is another interrupt waiting. So how does the main routine execute anything?

The data sheet says that if there is a pending interrupt when the reti is executed, exactly one instruction is executed before the interrupt is serviced. This explains how perceptible delays can be generated by calls to _delay_ms(1). Which generates a simple two instruction loop that executes 250 times. (_delay_ms() is called frequently in the pattern generation code.) It executes 500 instructions, one per interrupt.

While this does appear to work it has the additional problem (or perhaps it is a feature) that timing depends on how many outputs are on.


I have probably spent too much time pissing and moaning about how bad the code is. Do I have a fix?

Step one is to tighten up the interrupt code so it doesn't take forever to execute. Once that is done the timer can be reconfigured to provide reasonable, repeatable interrupts. The idea is to have an interrupt occurring at a set interval. The interval is chosen so that after the interrupt code is finished there are still some clock cycles left over for the main routine. With that done the delay functions have to be fixed. Once completed this will result in a better performing system that is easier to change and maintain. Plus it will not give me SIWOTI syndrome.

Fixing ledOff()

The first order of business is to fix the subroutine ledOff(). This is easy to do with a look-up table. The slow bit shift loops are replaced with a fast index into an array. This array is what maps the index to the individual port bits. Not only is this faster than the bit shifting but it is more flexible since arbitrary mappings are possible. The new source code:

  Lookup table to map bits to Ports A and B. Port A is the low byte and port B
  is the high.
const unsigned int bit_tab[] = {
  0x100 };

void ledOff(unsigned char pos)
  unsigned int temp;

  temp = bit_tab[pos];
  PORTA |= temp;
  PORTB |= temp >> 8;

And the generated code:

void ledOff(unsigned char pos)
  unsigned int temp;

  temp = bit_tab[pos];
  5a:   e8 2f           mov     r30, r24
  5c:   f0 e0           ldi     r31, 0x00       ; 0
  5e:   ee 0f           add     r30, r30
  60:   ff 1f           adc     r31, r31
  62:   e8 59           subi    r30, 0x98       ; 152
  64:   ff 4f           sbci    r31, 0xFF       ; 255
  66:   20 81           ld      r18, Z
  68:   91 81           ldd     r25, Z+1        ; 0x01
  PORTA |= temp;
  6a:   8b b3           in      r24, 0x1b       ; 27
  6c:   82 2b           or      r24, r18
  6e:   8b bb           out     0x1b, r24       ; 27
  PORTB |= temp >> 8;
  70:   88 b3           in      r24, 0x18       ; 24
  72:   89 2b           or      r24, r25
  74:   88 bb           out     0x18, r24       ; 24
  76:   08 95           ret

Shorter and a lot faster since it lacks loops. But it isn't as fast as possible. The problem is with indexing into the look-up table. While this routine is general purpose that generality isn't used anywhere. The interrupt routine just steps through the outputs which means that it is incrementing the loop index. That time consuming array index operation could be replaced with a simple pointer which is incremented.

New Interrupt Routine

Combining the new look-up table with a few other changes...

  static unsigned int portbits = 0x0070;
  static char phase;

  unsigned int temp;

  // First write data to output ports computed last time through
  PORTA = portbits;
  PORTB = portbits >> 8;

    First update phase counter. Limited to a maximum of 100. When
    reset to zero also turn off the outputs.
  phase += PHASE_INC;
  if( phase > 100)
      phase = 0;
      portbits = 0x0070;

  /*  Scan through the output bits turning them on as required.  */
  temp = 0;

  if( phase > pwm[0] )    // bit 0
    temp |= bit_tab[0];

  if( phase > pwm[1] )    // bit 1
    temp |= bit_tab[1];

  if( phase > pwm[2] )    // bit 2
    temp |= bit_tab[2];

  if( phase > pwm[3] )    // bit 3
    temp |= bit_tab[3];

  if( phase > pwm[4] )    // bit 4
    temp |= bit_tab[4];

  if( phase > pwm[5] )    // bit 5
    temp |= bit_tab[5];

  if( phase > pwm[6] )    // bit 6
    temp |= bit_tab[6];

  if( phase > pwm[7] )    // bit 7
    temp |= bit_tab[6];

  portbits |= temp;

  int_count++;    // increment interrupt counter (used for timing)


Note that this rolls the ledOff() function into it. Not a problem since this is the only place it was called. (ledOn and ledToggle are never called) There are quite a few changes here which require some explanation.

The mapping between the channel number and the bits in PORTA and PORTB are encoded in the look-up table bit_tab[]. Another change is that after figuring out the new state of the output bits, that state is not written to the output ports. Instead it is saved and written at the start of the next interrupt. This reduces jitter in the timing. Not that jitter is a big problem in this application.

Rather than include all of the generated code I thought I would focus on a representative sample. This is the code to check one output bit:

        lds r21,pwm+1
        cp r21,r20
        brge .L332
        ori r24,lo8(2)

It is pretty interesting because the complier has optimized the look up table array references into constants and reduced it from a 16 bit to 8 bit operation. It is actually very good and as fast as possible but it took a while to get here. I had to add the temp variable because the compiler insisted on updating the value of portbits in memory after each bit.

Each bit check requires 5 instruction cycles. For a total of 46 cycles for all 8 bits. (6 extra cycles to update portbits). The instructions leading up to this (including pushing registers to the stack) require 36 instruction cycles or 40 if the phase is reset to 0. The last few instructions require 33 instruction cycles. The total (using the longest option) is then 119 cycles. Add in the 7 cycles required to service the interrupt and the grand total worst case is 126 cycles. (Give or take. I counted by hand and may have missed something. Compiled with -O3 rather than -Os. )

So if the timer was setup to generate an interrupt every 200 cycles there would be about 70 cycles left for the main routine. (35,000 instruction cycles/second) Interrupts would happen at a 5,000 Hz rate and the PWM period would be 5ms. The 200Hz PWM rate is fast enough to prevent flicker from being visible.

New Delay Routine

With the interrupt routine improved a new way to generate delays is required. One based on the timer interrupt and not delay loops dependent on only executing one instruction per interrupt. This is pretty simple because the interrupt routine also increments a counter.

#define TIMER_RATE 200     // micro-seconds per timer interrupt
#define DELAY_MS(D_TIME_MS) (delay_counts(D_TIME_MS*1000L/TIMER_RATE))
volatile unsigned int int_count = 0;

  Basic timer based delay routine. Generates delays of up to 65,536 counts
  of the timer. This of course depends on the timer configuration. If the timer
  generates an interrupt every 200 us then delays of up to ~16 seconds are
void delay_counts(unsigned int counts)
  unsigned int last;

  last = int_count + counts;

  while(int_count != last)

A macro is also defined which takes an argument in milli-seconds. This is a little handier to use.

Configuring the timer to generate interrupts at the desired rate and correcting the delay calls in the pattern generator code is left as an exercise for the student.

While this code does compile without error I cannot verify that it actually works since I don't own a dasBlinkenBoard or an AVR programmer. Nor do I have much interest in acquiring them.

Even Better

While that is better there is a method that has even lower overhead because it generates interrupts only for output state changes. Because the PWM table changes infrequently (at least compared to the PWM cycle time), the timing also changes infrequently. By computing the times of the changes and generating interrupts only at those times, the total number of interrupts falls. The downside is that the main routine has to do a bit more work to generate the tables required by the interrupt routine.

Three tables are required. One to hold output port values, one for the delays, and the last is a state table. The state table is required because the number of changes can be anything from 0 to 8. If all the outputs are off then the tables have only one valid entry and the next state is the same as the current state (zero). As an alternative, you could include dummy entries and replace the state table with a simple counter.

unsigned char state;

  static unsigned int portbits = 0x0070;

  PORTA = portbits;
  PORTB = portbits >> 8;

  OCR0A  = TCNT0 + delay_table[state];
  portbits = port_table[state];
  state = state_table[state];

  if(state == 0)

This requires ~90 instruction cycles to execute. One tricky item is updating the tables. The main routine should wait until the state is zero, disable interrupts, update the tables, and finally enable interrupts. Otherwise odd things could happen.

Operation is constrained by the 8 bit timer. It must be able to count through a full PWM cycle without overflowing and the resolution of the PWM duty cycle control is at best one timer tick. On the ATTINY84, using the divide by 64 prescaler output appears to be the best choice. Using a PWM cycle of 100 ticks results in a cycle time of 6.4ms which should keep flicker under control. Care must be taken to make sure that there are at least two ticks between interrupts so that the interrupt routine can finish. This limits you to 2% steps on the duty cycle or 1% if you insert an extra cycle as required.

Overhead to run the interrupt routine now varies from 1.4% to 12.6% which is much better than the original near 100% and leaves lots of cycles left over to do other things.