Friday, 23 March 2012

I imagine a software troubleshooting team should look like this.....

Troubleshooting (debugging) at system level plays a major role in embedded firmware/software development. Sometimes there is a dedicated team for it. In my opinion, the players in this team are special in the way they perform their debugging activities (moving from hypothesis to proof, defining the problem and then suggesting a solution), in the dynamism of the problems they face, and in their readiness to explore new areas of the system.

A team like this may not always be in the limelight, and its activities may not always be visible to the other teams involved in product development during the product development lifecycle. Sometimes this team is called in only when a problem becomes a blocking issue.

I spent a lot of time thinking about how a troubleshooting team could surface and establish itself as a front-end role player throughout the product development lifecycle.

Well, a lot has been said and written on the art of firmware/software troubleshooting, but one thing which I have not yet come across is a "TROUBLESHOOTER'S DIARY". Every troubleshooter should maintain a troubleshooting diary which should contain :-
  • A QRC (quick reference card) of all debugging commands.
  • Short notes on tool usage.
  • Basic data flow and control flow models of the modules in the system.
  • Short notes on verification methods.
  • Instrumentation/debugging code. 
  • Parsing scripts.
  • Statistics of issues debugged in a week and in a month (type of issue, frequency of occurrence, preventive/corrective action taken).
  • (If the issue is new) A case study. 
  • Templates/automation scripts for all kinds of reports he/she prepares.
  • there can be more..........
A unit test bug found and fixed by a system level troubleshooter should be highlighted in the statistics corner of his/her diary, as finding a unit-level bug at system level (that too when your system is concurrent) is tedious and should bring him/her an accolade.

The troubleshooting team should call a "DIARY-MEETING" every week or every month. The team should discuss each other's diary entries of the week/month and then present quantified data/statistics on their effort, the issues fixed/analyzed and the actions taken to the entire team. This meeting should also include identification of new/strange issues and preparation of a case study on them.

The prepared case study should be shared with the development team and with newbies in the troubleshooting team, so that they can learn from past mistakes and the issues never recur or, if they do recur, the analysis does not take much time.

  • Every patch released by the integration team should be verified and stamped by this team for release. This can save a lot of painstaking effort.
  • Troubleshooters should also be called for module level design reviews, as over time they become more acquainted with the system.

This may be an existing model in many organizations where the process model supports it.
But it may be a missing link in some organizations where time to market is a critical factor and multiple products are pipelined.

Apart from this there should be a "Troubleshooting Skills Improvement Plan" which should contain schedules for:-
  • Workshops on various components of the system.
  • Brainstorming sessions.
  • Data Analysis sessions.
  • Sessions by troubleshooting Pandits.
  • and more.......
In my imagination, this looks perfect for making the troubleshooting team a front-end role player in software product development.

Having said all this, I would also say that I am not an expert in this area. However, I aspire to be.

Friday, 10 February 2012

Start tracking a bug and you can see more waiting for their turn to ruin your software....

It so happened that I was tracking an important Speed Post item of mine and found that my area postman had delivered it to the wrong address. Imagine what worse could happen: he cannot trace the person to whom he delivered it. A big bug in the postal service.

This post was linked to my bank account. When I went to the bank to update the information that was missing because of it, I got to know that my file/credentials had been migrated to a different branch without any notification to me. A bigger bug in the bank's service.

Why don't they hire an experienced IT guy, who will review their workflow and check that all exceptions are handled?

Anyway, let me tell you what I saw in a code base the other day while I was tracking a bug. A knowledgeable but lazy developer called a function without an extern declaration of its prototype, biased by the idea that the compiler and linker should bear this load (as they really do). For most people, who keep the extern keyword linked in their mind with the linker only, I would say that the compiler also has to mark the externed symbol as unresolved while creating the symbol table in the .o file.

He called a function to allocate memory on the heap, passing an unsigned 64-bit value (unsigned long long int). The allocation function has an unsigned int as its formal parameter. Something like this:
processMessageA(..., ..., ..., length);
where
processMessageA(..., ..., ..., u64 len)
{
    ...
    memAlloc(..., 2*len);
    ...
}

The developer also assumed that 2*len would be implicitly truncated to unsigned int when passed as an argument to memAlloc. Since the extern declaration of memAlloc was missing, the ARM compiler had no parameter type to convert to and passed the argument at the size of the actual argument, i.e. 8 bytes. In the caller's (processMessageA) context, it picked the length value (8 bytes) from the stack frame of processMessageA, multiplied it by two and passed it as an argument to memAlloc without truncating it to unsigned int.

As per the ARM-Thumb Procedure Call Standard, this 8-byte argument was passed to memAlloc in two registers. However, the implementation of memAlloc expected the size in one register and some other data in the other register. As a result, the program crashed while executing memAlloc.
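
One way to guard against this kind of mismatch is to make both the declaration and the narrowing explicit. The following is only a sketch: mem_alloc, doubled_length_fits and alloc_message_buffer are hypothetical stand-ins (the real signatures of memAlloc and processMessageA are not shown above), and the single-argument allocator is a simplification.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* The sketch only makes sense when unsigned int is narrower than 64 bits. */
static_assert(sizeof(unsigned int) < sizeof(uint64_t),
              "example assumes unsigned int is narrower than 64 bits");

/* Prototype visible before any call, so the compiler knows the parameter
 * type instead of guessing. Hypothetical stand-in for memAlloc. */
void *mem_alloc(unsigned int size);

void *mem_alloc(unsigned int size)
{
    return malloc(size);
}

/* Returns 1 when 2 * len fits in an unsigned int, 0 otherwise. */
int doubled_length_fits(uint64_t len)
{
    return len <= (uint64_t)UINT_MAX / 2u;
}

/* Hypothetical caller: check the range first, then narrow explicitly. */
void *alloc_message_buffer(uint64_t len)
{
    if (!doubled_length_fits(len)) {
        return NULL; /* refuse instead of silently truncating */
    }
    return mem_alloc((unsigned int)(2u * len));
}
```

With the prototype in scope the compiler can convert the argument itself, and the range check makes an oversized length fail as a NULL return at the call site instead of surfacing later as a crash.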

While browsing through the source code, I saw a function call which resembled the call mentioned below:-
getMessageRef(poolId)->Body = data;

I found one case where getMessageRef returned NULL, which, if it occurred, would cause a DATA ABORT on ARM (and a segmentation violation on GNU/Linux). A hidden bug waiting for its turn.

I do not see any harm in making the call to getMessageRef separately, like this:-
if ((pMessage = getMessageRef(poolId)) != NULL)
{
    pMessage->Body = data;
}

On the same lines, I feel it is NOT a safe and good practice to call functions in the following ways :-
  • setEmployeeAge(getEmployeeId(COMPANY_A,pName),35) : A function call in the argument list of another function.
  • getTaxPlanner(COMPANY_A, employeeB_id)(FN_YEAR_2012) : Two function calls hidden in the same line of code.
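
A minimal sketch of the safer style for the first case, with each call on its own line. The function bodies, the employee table, the INVALID_EMPLOYEE_ID value and the updateAge wrapper are all made up for illustration, and getEmployeeId is simplified to take only a name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INVALID_EMPLOYEE_ID (-1)   /* hypothetical failure value */

typedef struct {
    int id;
    const char *name;
    int age;
} Employee;

/* Made-up data for the example. */
static Employee employees[] = {
    { 101, "Asha", 30 },
    { 102, "Ravi", 41 },
};

#define EMPLOYEE_COUNT (sizeof employees / sizeof employees[0])

/* Returns the employee id, or INVALID_EMPLOYEE_ID when the name is unknown. */
int getEmployeeId(const char *pName)
{
    size_t i;
    assert(pName != NULL);
    for (i = 0; i < EMPLOYEE_COUNT; i++) {
        if (strcmp(employees[i].name, pName) == 0) {
            return employees[i].id;
        }
    }
    return INVALID_EMPLOYEE_ID;
}

/* Returns 0 on success, -1 when the id does not exist. */
int setEmployeeAge(int id, int age)
{
    size_t i;
    for (i = 0; i < EMPLOYEE_COUNT; i++) {
        if (employees[i].id == id) {
            employees[i].age = age;
            return 0;
        }
    }
    return -1;
}

/* Each call on its own line: the intermediate value can be checked,
 * logged or inspected in a debugger before it is used. */
int updateAge(const char *pName, int age)
{
    int id = getEmployeeId(pName);
    if (id == INVALID_EMPLOYEE_ID) {
        return -1;
    }
    return setEmployeeAge(id, age);
}
```

Here updateAge("Ravi", 35) updates the record, while an unknown name is rejected before setEmployeeAge is ever called.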
Aah, finally I am out to publish this particular blog, which has been in draft for the last 2 weeks. An IT guy working in India can surely understand what a weekend means to him :) .

 

Friday, 27 January 2012

Gave a second thought and got an improved code

Stabilizing firmware/software has its own challenges, which keep the troubleshooters biting their nails all the time. We decided to instrument the code and add debug logs in the firmware, which would give us some history of execution and make our life easier while troubleshooting.

The debug traces, as I developed them, were spread across frequently used system calls, and each trace contained a timestamp (as traces usually do). At first, they were not accepted, the reason given being that they might affect system stability and performance.

I felt challenged and demotivated, as I was unable to understand how read-only data could have such an effect; but when I gave it a second thought, I found that it really had side effects. Following were my mistakes:-
  • Timestamping: The routine to get the timestamp disabled interrupts and thus increased the interrupt latency. As it became a part of frequent system calls, you can imagine the number of times it would disable interrupts.
  • Logging redundant information: I logged some data which could have been derived from other logs. The logging overhead should be minimal.
  • Using many global variables and operations on them: On a load-store architecture (especially without cache) this will surely add extra processing cycles.
  • Use of the modulo operator and a table size that is not a power of two: This also introduced extra instructions.
  • Debug log code scattered across different files.
I improved this by:-
  • using a reference counter and transaction-based logging instead of timestamping.
  • logging minimal, critical data only.
  • using register local variables (in C) and bitwise operations.
  • keeping the log table size a power of two and using the & operator instead of the % operator.
  • centralizing the logging code in a single function in order to benefit from the cache.
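
The improvements above can be sketched roughly as follows. This is not the actual firmware code: the entry layout, the names and the table size are assumptions, the running sequence counter is only one reading of "a reference counter instead of timestamping", and the sketch does not deal with calls from interrupt context.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_SIZE 8u                 /* assumed size; must be a power of two */
#define LOG_MASK (LOG_SIZE - 1u)

static_assert((LOG_SIZE & LOG_MASK) == 0u, "LOG_SIZE must be a power of two");

typedef struct {
    uint32_t seq;    /* running counter used instead of a timestamp */
    uint16_t event;  /* hypothetical event identifier */
    uint16_t data;   /* one critical value only */
} LogEntry;

static LogEntry log_table[LOG_SIZE];
static uint32_t log_count;          /* total entries written so far */

/* Single entry point for all traces. The slot index is computed with a
 * bitwise AND, which equals log_count % LOG_SIZE for a power-of-two size. */
void log_event(uint16_t event, uint16_t data)
{
    LogEntry *slot = &log_table[log_count & LOG_MASK];
    slot->seq = log_count;
    slot->event = event;
    slot->data = data;
    log_count++;
}

/* Returns the most recent entry; valid only after at least one log_event(). */
const LogEntry *log_latest(void)
{
    assert(log_count > 0u);
    return &log_table[(log_count - 1u) & LOG_MASK];
}
```

Once the table is full the oldest entries are overwritten, so it always holds the last LOG_SIZE events, and the seq field gives their order without any call that disables interrupts.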

Will be back with a blog on a fresh and less talked about topic next weekend .........

 

Wednesday, 25 January 2012

When I again met the evil known as STACK OVERFLOW

Around 3 years back, I worked in a firmware/operating system development team for highly memory-constrained devices. The OS was supposed to be developed for two different targets. One of them had a RISC processor and the other had a CISC one.

One fine day we were asked to reproduce and fix a critical bug on the CISC target. My colleague, my manager and I sat together with an emulator to see what was going wrong. We were able to reproduce the problem, but we had no clue about what was happening. Every time, we started execution from a fixed point in the code, set some breakpoints, ran the firmware for some time and observed that it always entered an invalid code path.

My manager poured in all his experience and I squeezed my fresh brain on it too, but neither of us thought to closely watch the changes in registers, especially the stack pointer, at the point when the software entered the invalid code path.

We were running short of time, as the bug was to be fixed and the firmware released the next day. Meanwhile, my colleague (a laconic person, mostly answering OK :) ) was observing the stack pointer and saw that the 16-bit stack pointer was overflowing, which was affecting the program counter update in the next few cycles (after a return statement from the current function).

Tired, we asked the silent figure sitting next to us, "Do you have any idea what's going wrong?". He said, "It is STACK OVERFLOW !!!" (we expected OK :)). All of us started laughing, as this guy had once again spoken only when asked, even though he had known the problem from the very beginning. Thanks to him for pointing out my first real-world encounter with the much talked about software monster.

Well, the root cause was really surprising: it was not a recursive call (and thus not a large amount of context saving). It was due to jumps in the code. On every cross-bank jump, the MMU (memory management unit) pushed context restoration information on the stack, which was opaque to the programmer. The MMU left this data on the stack until the next return statement was executed.

This was when I learnt that if your intention is to write great software, you must develop a thorough understanding of your underlying system. The RISC team was alerted to this monster. And to our surprise, stack overflow was detected there also. However, the reason was something else. Management had a clear vision that a generic software bug found on one target should not occur on the other, and for that they had established a good forum for learning and sharing.

A year back I was working on a multitasking system which had many complex state machines, frameworks and... blah blah (big terms hammer my mind :)). We observed that task A's stack overflowed into the adjacent task's stack area. The second encounter. The scope of debugging was vague for me and I was new to that system (still developing an understanding of it). This was fixed by an expert.
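
One common way to catch a task stack creeping towards its neighbour is to paint the stack area with a known pattern and later count how much of the pattern survives. The sketch below only illustrates that idea; it is not how the expert fixed the issue, the fill value and function names are made up, and it assumes a stack that grows towards lower addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5u   /* arbitrary fill pattern */

/* Fill a task's stack area with a known pattern before the task starts. */
void stack_paint(uint8_t *stack, size_t size)
{
    size_t i;
    assert(stack != NULL);
    for (i = 0; i < size; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Count the bytes that still hold the pattern, scanning from the lowest
 * address. With a downward-growing stack the untouched bytes sit at the
 * low end, so a result near zero means the task is close to overflowing. */
size_t stack_unused(const uint8_t *stack, size_t size)
{
    size_t n = 0;
    assert(stack != NULL);
    while (n < size && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

Checking stack_unused for each task periodically, or from a debug command, gives a warning before one task's stack runs into the next one.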

It hit back again on a different platform. Well, this time, no SP overflow.
This software has a service library for processing requests from multiple tasks. The service library is blind while processing the requests; I mean, it can process task A's (queued) request while task B is running. There was a recursive call in the framework of the service library.


This time, I had committed myself to find out what was causing the stack overflow. The large amount of context saving on the stack (part of the function prologue) as a result of the recursive function was the immediate reason. However, the actual reason was deeply rooted in queueing theory, load balancing and framework violation.
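
To show why recursion and a queue of pending requests are a risky mix, here is a small sketch of draining a request queue in a loop, so that stack usage does not grow with the number of queued requests. The queue layout, the names and the handler are hypothetical; the real service library's code is not shown in this post.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_SIZE 16u   /* assumed capacity, a power of two */

typedef struct {
    int requests[QUEUE_SIZE];
    size_t head;   /* count of requests taken out so far */
    size_t tail;   /* count of requests put in so far */
} RequestQueue;

/* Returns 0 on success, -1 when the queue is full. */
int queue_push(RequestQueue *q, int request)
{
    assert(q != NULL);
    if (q->tail - q->head == QUEUE_SIZE) {
        return -1;
    }
    q->requests[q->tail & (QUEUE_SIZE - 1u)] = request;
    q->tail++;
    return 0;
}

/* Example handler: just adds up the request values. */
static int handled_sum;

void sum_handler(int request)
{
    handled_sum += request;
}

/* Processes every pending request in a loop and returns how many were
 * handled. Only one queue_drain frame is on the stack however many
 * requests are queued; a recursive version would add a frame per request. */
size_t queue_drain(RequestQueue *q, void (*handler)(int))
{
    size_t processed = 0;
    assert(q != NULL && handler != NULL);
    while (q->head != q->tail) {
        int request = q->requests[q->head & (QUEUE_SIZE - 1u)];
        q->head++;
        handler(request);
        processed++;
    }
    return processed;
}
```

The bounded queue also puts a visible limit on the load: when queue_push returns -1 the caller knows the service is saturated, instead of finding out through a corrupted stack.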

As I understand it, knowing the system thoroughly and applying that knowledge during development is a good weapon for keeping this monster away from your software.