One fine day we were asked to reproduce and fix a critical bug on the CISC target . Me, my colleague and my manager sat together with an emulator to see whats going wrong. We were able to reproduce the problem but we had no clue about what is happening. Everytime we started execution from a fixed point in code, setting some breakpoints, ran the firmware for sometime and observed that it always entered into an invalid code path.
My manager poured-in all his experience and I also squeezed out all my fresh brain on it, except to closely watch changes in registers, specially the stack pointer, at the point when the software entered invalid code path.
We were running short of time as the bug was to be fixed and the firmware was to be released the other day.However, in due course, my colleague (a laconic person, mostly answering OK :) ), was observing the stack pointer and saw that the 16 bit stack pointer was overflowing which was affecting program counter update in next few cycles (after a return statement from the current function).
Tired, we asked the silent figure sitting next to us, "do you have any idea whats going wrong?". He told,"It is STACK OVERFLOW !!!".(we expected OK :)). All of us started laughing, as this guy again uttered when he was asked, even if he knew the problem at the very beginning. Thanks to him for pointing at my first real world encounter with the much talked software monster.
Well, the root cause was really surprising, it was not a recursive call (and thus not a large context saving). It was due to jumps in code. On every cross-bank jump MMU(memory management unit) used to push context restoration information on stack which was opaque for the programmer.MMU left this data on stack until the next return statement was executed.
This was the instance when I learnt that if your intention is to write great software then develop thorough understanding of your underlying system. The RISC team was made alert for this monster.And to our surprise stackoverflow was detected there also. However, the reason was something else. Management had a clear vision that a generic software bug, if found on one target should not occur on the other and for that they had established a good forum for learning and sharing.
A year back I was working on a multi tasking system which has many complex state machines, frameworks and...blah blah( big terms hammers my mind :)). We observed that task A's stack overflowed into the adjacent tasks stack area.The second encounter. The scope of debugging was vague for me and I was new to that system(developing understanding of the system).This was fixed by an expert.
It hit back again on a different platfrom.Well, this time, no SP overflow.
This software has a service library for processing requests from multiple task. The service library is blind while processing the requests. I mean, it can process task A's (queued) request while task B is running. There was a recursive call in the framework of the service library, for e.g.
This time, I had a self commitment to know what is causing stackoverflow. The large amount of context saving on stack(part of function prologue) as a result of recursive function was the immediate reason. However, the actual reason was deeply rooted at queueing theory, load balancing and framework violation.
As I understand,knowing the system thoroughly and applying it during development, is a good weapon for keeping this monster away from your software.
No comments:
Post a Comment
Note: only a member of this blog may post a comment.