Friday 27 January 2012

Gave a second thought and got an improved code

Stabilizing firmware/software has its own challenges which keep the troubleshooters biting their nails all the time. We decided to instrument the code and add debug logs in the firmware, which will give us some history of execution and make our life easy while troubleshooting.

The  debug traces,as I developed, were spread in frequently used system calls and each trace contained a timestamp( as traces usually do) .At first instance, it was not accepted with a reason that it may affect system stabilty and performance.

I felt challenged and de-motivated as I was unable to understand that how a read only data will affect this way; but when I gave it a second  thought, I really found that it had side-effects. Following were my mistakes:-
  • Timestamping: The routine to get timestamp, used to disable interrupts and thus increase the interrupt latency. As it became a part of frequent system calls, you can imagine the number of times it will disable interrupt.
  • Logging redundant information: Logged some data which could have been derived from other logs. The logging overhead should be minimum.
  • Using many global variables and operations on them: On a load-store architecture (specially without cache) this will surely add extra processing cycles.
  • Use of modulo operator and a table size not a power of two: This also introduced extra instructions.
  • Debug log code scattered in different files.
I improved this by:-
  • using a reference counter and a transaction based logging instead of timestamping.
  • logging minimum and critical data only.
  • using register  local variables ( in C) and using bitwise operations.
  • keeping log table size a power of two and using & operator instead of % operator.
  • centralizing the logging code in a single function in order to get cache benefit.

Will be back with a blog on a fresh and less talked topic, next weekend .........

 

Wednesday 25 January 2012

When I again met the evil known as STACK OVERFLOW

Around 3 years back, I worked in a firmware/operating system development team for highly memory constrained devices. The OS was supposed to be developed for two different targets. One of them had a RISC processor and the other had CISC.

One fine day we were asked to reproduce and fix a critical bug on the CISC target . Me, my colleague and my manager sat together with an emulator to see whats going wrong. We were able to reproduce the problem but we had no clue about what is happening. Everytime we started execution from a fixed point in code, setting some breakpoints, ran the firmware for sometime and observed that it always entered into an invalid code path.

My manager poured-in all his experience and I also squeezed out all my fresh brain on it, except to closely watch changes in registers, specially the stack pointer, at the point when the software entered invalid code path.

We were running short of time as the bug was to be fixed and the firmware was to be released the other day.However, in due course, my colleague (a laconic person, mostly answering OK :) ), was observing the stack pointer and saw that the 16 bit stack pointer was overflowing which was affecting program counter update in next few cycles (after a return statement from the current function).

Tired, we asked the silent figure sitting next to us, "do you have any idea whats going wrong?". He told,"It is STACK OVERFLOW !!!".(we expected OK :)). All of us started laughing, as this guy again uttered when he was asked, even if  he knew the problem at the very beginning. Thanks to him for pointing at my first real world encounter with the much talked software monster.

Well, the root cause was really surprising, it was not a recursive call (and thus not a large context saving). It was due to jumps in code. On every cross-bank jump MMU(memory management unit) used to push context restoration information on stack which was opaque for the programmer.MMU left this data on stack until the next return statement was executed.

This was the instance when I learnt that if your intention is to write great software then develop thorough understanding of your underlying system. The RISC team was made alert for this monster.And to our surprise stackoverflow was detected there also. However, the reason was something else. Management had a clear vision that a generic software bug, if found on one target should not occur on the other and for that they had established a good forum for learning and sharing.

A year back I was working on a multi tasking system which has many complex state machines, frameworks and...blah blah( big terms hammers my mind :)). We observed that task A's stack overflowed into the adjacent tasks stack area.The second encounter. The scope of debugging was vague for me and I was new to that system(developing understanding of the system).This was fixed by an expert.

It hit back again on a different platfrom.Well, this time, no SP overflow.
This software has a service library for processing requests from multiple task. The service library is blind while processing the requests. I mean, it can process task A's (queued) request while task B is running. There was a recursive call in the framework of the service library, for e.g.


This time, I had a self commitment to know what is causing stackoverflow. The large amount of context saving on stack(part of function prologue) as a result of recursive function was the immediate reason. However, the actual reason was deeply rooted at queueing theory, load balancing and framework violation.

As I understand,knowing the system thoroughly and applying it during development, is a good weapon for keeping this monster away from your software.