Troubleshooting Examples
In Effective Software Troubleshooting, I described the high-level steps for troubleshooting software systems. In this article, I describe how I applied those steps to problems that surfaced in an engineering development environment. Don’t let the brevity of each section fool you. Each example consumed several hours of research. This is why I mentioned earlier that perseverance is your most potent weapon against these types of issues.
All of these, except the last one, are from long ago. However, they stick out in my mind because each taught me something new and made me a better engineer.
For brevity, I cut out two other examples where rubberducking really saved me. Maybe I will write about them later. One important takeaway is that rubberducking is especially useful when you use it to focus on different layers of your software. For example, I was once tracking down a race condition in our commit process that was causing one in five commits to fail. At first, while rubberducking, I narrated what our tools were doing, glossing over our calls to git commit and git push. I was treating Git as a black box. I did that over and over again until I felt I had ruled out all causes for the error. Then I told myself that in my next rubberducking session I should think a little more deeply about what happens when I call git commit and git push. When I did that, I remembered that git push was going to trigger a synchronization job that replayed the commit into read-only replicas. That’s when I remembered that our Git URL actually pointed at a load balancer which redirected the caller to one of those read-only replicas. Finally, I remembered that because of the replication delay, the states of the read-only replicas were not completely synchronized. As you can imagine, that broke some assumptions our software was making and caused the aforementioned intermittent failures.
The first example below highlights a race condition that manifested on Linux machines. The key takeaway from that example was the importance of reproducing the problem. In this case, that was done by invoking a tool repeatedly until the problem manifested.
The second example highlights a case where instead of focusing on what was broken, I focused on what would make the software work. That led to an insight into the root cause of the problem.
Finally, the third example highlights some of the OS diagnostic tools used in troubleshooting software. More importantly, the rubberducking step was crucial to the resolution.
Intermittent Failures On Linux Machines
Once a user reported a problem where a tool we maintained spat out a slew of errors from Perl’s Cwd library and then crashed. The Cwd::getcwd() function was called from a lot of different places in the code, making it very hard to identify the source of the error. Also, the amount of text printed to the screen was very hard to decipher, and it was something we couldn’t control because it was emanating from a third-party library. However, from the first few lines of output, it did appear that the crash was triggered when the current working directory didn’t exist. This made no sense. How could the current working directory not exist? The script doesn’t remove the directory in which it is invoked. The second piece of evidence that jumped out at me was that the tool was being run on a Linux box, and Linux support in our tools had only recently been introduced. The development environment was just beginning a transition away from Solaris servers to Linux ones. I should reiterate that this happened a long time ago, in case you’re wondering why we would still be using Solaris. :) This problem sticks out in my mind because it was my first time diagnosing a race condition.
The first step was to try and reproduce the problem. I had no luck reproducing the issue on my own Linux box or the user’s Linux box. I also couldn’t reproduce the issue while using sudo to run the command as the user himself. Next, I got confirmation from the user that he also couldn’t reproduce the issue.
Then I searched past cases for the problem. Nothing surfaced. If this was indeed a Linux problem, that would make sense, since Linux use was new in our environment. Based on the nature of our tools, though, this didn’t smell like an OS-specific issue. In the process of adding Linux support we hit issues like certain binaries and/or dependencies not being available or residing in different locations than they did on Solaris boxes. There was no reason why Cwd::getcwd() would behave differently on Linux than it does on Solaris.
I had a hypothesis that Cwd::getcwd() was being called in a non-existent directory. The next step was to test that hypothesis. I wrote a script that invoked Cwd::getcwd(). Next, I changed my directory to a temp directory, removed that directory from another shell, and then returned to the previous shell and invoked my script. Sure enough, the script filled the screen with errors before it crashed. I tried the same steps on Solaris and reproduced the issue. That told me that Cwd::getcwd() behaved the same on both platforms; the difference had to be in how the function was being called.
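Here is a minimal sketch of that test (not the original script; for brevity it removes its own working directory instead of using a second shell, which Linux permits):

#!/usr/bin/perl
use strict;
use warnings;
use Cwd;
use File::Temp qw(tempdir);

my $dir = tempdir();                        # throwaway directory
chdir($dir) or die "chdir failed: $!";
rmdir($dir) or die "rmdir failed: $!";      # pull the rug out from under ourselves
my $cwd = Cwd::getcwd();                    # spews Cwd errors/warnings, just like the tool did
print "getcwd returned: ", (defined $cwd ? $cwd : "<undef>"), "\n";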
Now I had confidence that Cwd::getcwd() was being called in a directory that was removed. I wondered if this was just a fluke. I didn’t have to wonder very long. A day or two later a second user reported the same issue and sure enough, the error was encountered on a Linux box.
The next step was to try and reproduce the error. Since it seemed to be an intermittent race condition, I wrote a small script to invoke the failing command on Linux multiple times in a row. When the number of iterations in the script was large enough, I was able to reproduce the problem for the first time. I ran the same script on Solaris and no matter how many iterations I tried, I was not able to reproduce the error. I was glad we had a reproducible test case. If this issue had only been reproducible “in the wild” then we may have needed to enhance this particular command to log as much information as it could so that when reported again our team would have the necessary breadcrumbs to track down the root cause.
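The stress script itself was nothing fancy. A rough sketch (the command and iteration count below are placeholders, not the real tool):

use strict;
use warnings;

my @failing_command = ("some_tool", "--with", "--its-usual-args");   # hypothetical stand-in
for my $i (1 .. 500) {
    my $status = system(@failing_command);
    die "Reproduced the failure on iteration $i (exit status $status)\n" if $status != 0;
}
print "No failure after 500 iterations\n";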
Cwd::getcwd() was called from too many places in the code for it to be worth investigating each one. I thought about overriding Cwd::getcwd() with my own implementation that could tell me who its caller was. Then it dawned on me that the ‘current directory’ being removed was probably a temporary directory created at runtime. I searched the code for calls related to creating and changing into directories. I started seeing a pattern of chdir() calls right before a call to fork(), which creates a background process. The presence of those fork() calls produced the kind of code smell we recognize as a race condition.
Next, I zeroed in on the calls to fork() and realized that before they were made, the code created some temporary directories, which were then stored in a data structure that was later passed to a clean-up routine near the end of the program. Once I saw the cleanup routine, I had a vague idea of what was happening. The code must have been changing the current working directory to one of the newly created temporary directories and leaving it there. After completing the processing in the temporary directories, the code forked off a background process to clean up the previously created temporary directories. Thus, the cleanup process was in effect removing the current working directory. Since the cleanup routine was happening in the background, we had a race between the parent process’s remaining work, including its calls to Cwd::getcwd(), and the background process removing the current working directory.
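A condensed sketch of that racy pattern (illustrative only, not the real tool’s code):

use strict;
use warnings;
use Cwd;
use File::Temp qw(tempdir);

my $tmp = tempdir();
chdir($tmp) or die "chdir failed: $!";
# ... processing happens inside $tmp, and the code never chdir()s back ...

my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    # Background cleanup: on a fast machine, this wins the race.
    chdir("/");
    rmdir($tmp);
    exit 0;
}

# Parent: still "in" $tmp, which may already be gone.
my $cwd = Cwd::getcwd();    # on a fast box, this is where the errors erupt
print defined $cwd ? "cwd: $cwd\n" : "current working directory has vanished\n";
waitpid($pid, 0);           # reap the cleanup child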
Why did this surface on Linux? Linux boxes were faster and cheaper than the Solaris boxes we normally had in production. On Solaris, the background clean-up routine was slow enough that the parent process could make its call to Cwd::getcwd() and do any other processing before the background process removed the current working directory. On Linux, that background process was removing the current working directory far sooner than the code expected.
The solution was to not change the current working directory to a temporary directory and leave it there. The code should have taken the precaution of changing back to the original working directory when it was done. Another solution would be to perform the chdir() in a background process, leaving the parent process unaffected. That way, the cleanup routine would be free to remove the temporary directories it knows about without affecting any other parts of the program.
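A minimal sketch of the first approach (illustrative only):

use strict;
use warnings;
use Cwd;
use File::Temp qw(tempdir);

# Remember where we started and always return there, so a background cleanup
# never removes the directory the parent is sitting in.
my $original_cwd = Cwd::getcwd();
my $tmp = tempdir();

chdir($tmp) or die "chdir failed: $!";
# ... do the work that needs to happen inside the temporary directory ...
chdir($original_cwd) or die "chdir back failed: $!";

# The cleanup routine, backgrounded or not, can now safely remove $tmp.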
For this issue, it was vitally important that we had a reproducible test case. Through code inspection I was pretty confident that my fix solved the issue. However, my confidence level increased when I used my test script to call the fixed tool a few hundred times in a row without encountering the Cwd::getcwd() error.
The Time Stamp Bug
In another case, our team was responsible for version control tools that used ClearCase as a back-end version control system. A developer reported an issue to our team whereby his committed changes were not visible in the branch to which the code was committed. Nothing like this had ever been reported before. Additionally, the code regarding file version selection had been stable for over 10 years and was used by thousands of engineers in a globally distributed environment.
After reproducing the issue, the first step was to identify the component that was failing. Our tools cached recently accessed file versions outside the version control system to speed up subsequent checkout operations. My first thought was that the cache was corrupted. I cleared the tool’s cache and regenerated it from ClearCase. The issue persisted, which pointed to ClearCase as the root cause, or so I thought.
To understand the root cause, I’ll briefly explain how these particular tools worked. Unlike version control tools like Git, base ClearCase doesn’t store commit records. Instead, it tracks a loosely coupled set of file versions. To compensate for this limitation, our tools stored their own commit records. For each commit, the tools would extract the time stamp of the last committed file version and store it with some other data associated with the commit. When checking out the source code, our tools would then use that time stamp and other data to query ClearCase for all file versions on a given branch whose time stamps were less than or equal to the time stamp in the commit record.
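To make the selection rule concrete, here is a purely illustrative sketch; the branch name, file names, and time stamps below are invented, and the real tools of course queried ClearCase rather than an in-memory list:

use strict;
use warnings;

# Invented data: the versions that exist on the branch, with the time stamp
# at which each version was checked in.
my @versions = (
    { element => "foo.c", version => 7, stamp => "2018-10-01 09:00:00" },
    { element => "foo.c", version => 8, stamp => "2018-10-01 09:05:00" },  # part of the latest commit
);

# The commit record stores the time stamp of the last checked-in file version.
my $commit_record = { branch => "dev_branch", stamp => "2018-10-01 09:05:00" };

# Checkout selects every version created at or before the commit record's time.
# (String comparison works here only because the stamps are zero-padded; the
# real tools converted to UTC first, which is where the trouble crept in.)
my @selected = grep { $_->{stamp} le $commit_record->{stamp} } @versions;
printf "%s version %d\n", $_->{element}, $_->{version} for @selected;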
Let’s suppose the time stamp for the commit in question was 11/4/2018 01:15:00. When our tools fed that time stamp to ClearCase, ClearCase would return the file versions that existed prior to the user’s commit, but not the ones that were part of the commit itself. Next, I confirmed that the file versions were actually committed to ClearCase. I queried ClearCase for ALL the latest versions on the given branch, with no time stamp qualifier. That query did in fact return the newly created file versions that the user was looking for.
At this point it appeared that we had isolated the root cause to ClearCase itself; i.e., ClearCase was reporting a file version’s time stamp as 01:15:00, yet if I queried ClearCase for versions with time stamps less than or equal to 01:15:00, it returned only that version’s predecessor. Even if I queried for a time stamp a few seconds or even minutes later than that file version’s creation time, I couldn’t get ClearCase to return it.
Despite the evidence that ClearCase was at fault, I still felt I was missing something. So I started considering other possible root causes. The first thing I considered was time zone issues. Our tool set was a distributed system, and I wondered what would happen if a client and server were in different time zones.
As a result, I studied the logs and the source code carefully. It turned out all the machines involved in the commit were in the Pacific time zone. Additionally, there was code to account for a client and server being in different time zones. Also, before storing the time stamp in a commit record, we converted it to UTC, and when the tools fed that time stamp back to ClearCase, we converted it to a fully qualified time stamp with a time zone offset, leaving no ambiguity. The code that performed these time stamp conversions looked solid and, as mentioned above, had been stable and error-free for over a decade.
At this point, it was important to unblock the user. A new commit record with a more recent time stamp selected the user’s missing changes. Thus, the workaround was either to update the time stamp of the user’s commit record to a later time or to create a new commit record. After creating a new commit record whose time stamp did “see” the user’s latest commit, I had unblocked the user.
With the user unblocked, it was still important to identify the root cause of this issue. Even though it looked like an issue with ClearCase, and I was getting close to opening a case with Rational Software, I still felt I was missing something. My next strategy, knowing that more recent time stamps did in fact select the file versions we wanted, was to try different time stamps after 01:15:00 on November 4, 2018 until I arrived at one that worked as expected. I knew that time stamps a few seconds or minutes after 01:15 didn’t work. However, I knew that time stamps several hours after that time were fine. After only a couple of iterations, I discovered that a time stamp exactly one hour later than 01:15:00 selected the file versions we were interested in.
1 hour! I knew exactly what this was connected to, and I was frustrated with myself for not seeing the connection earlier. The commit in question was created on a day when daylight savings ended, and clocks rolled back by 1 hour. I still didn’t understand the root cause. As discussed, we were storing a UTC time and converting it to a fully qualified time stamp when querying ClearCase. However, I had a lead.
After poring over the documentation for things like gmtime(), timegm(), and localtime(), and adding ‘daylight savings’ to my internet searches, I finally came across documentation for Perl’s Time::Local module that discusses ambiguous local times. Because the clock gets reset to 01:00:00 at 02:00:00 on a day when daylight savings ends, every local time between 01:00:00 and 01:59:59 occurs twice, and all those local times are ambiguous. The documentation for Time::Local says that when given an ambiguous local time, it returns the earlier of the two possible times.
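A small sketch of that ambiguity (it assumes a machine set to the US Pacific time zone; the TZ handling below is purely illustrative):

use strict;
use warnings;
use Time::Local qw(timelocal);
use POSIX qw(strftime tzset);

# 01:15:00 on 2018-11-04 occurs twice in US Pacific time because clocks roll
# back at 02:00. timelocal() resolves the ambiguity by choosing the earlier
# (PDT) instant.
$ENV{TZ} = "America/Los_Angeles";   # illustrative; tzset() may not exist on every platform
tzset();

my $epoch = timelocal(0, 15, 1, 4, 10, 2018);   # sec, min, hour, mday, month (0-based), year
print strftime("%Y-%m-%d %H:%M:%S UTC\n", gmtime($epoch));
# Prints 2018-11-04 08:15:00 UTC (PDT, UTC-7). A commit made during the second
# pass through 01:15 actually happened at 09:15:00 UTC, one hour later.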
Now I had the root cause. When our tools requested the time stamp of the last checked-in file version, ClearCase returned November 4, 2018 01:15:00. However, this was an ambiguous local time. Perl’s Time::Local module had no way of determining whether this was the first 01:15:00 or the second, so it chose the earlier of the two. When that was converted to a UTC time, it actually represented a time one hour earlier than when the commit happened. That’s why the time in the ClearCase query had to be moved exactly one hour later in order to “see” the latest commit. Now I understood why we hadn’t seen this before: it could only happen when a commit landed during the repeated hour on a day when daylight savings ends.
This led to a fairly simple fix. When querying ClearCase for the time stamp of the last checked-in file version, we had to make sure it returned a fully qualified time stamp with a time zone offset. Such a time stamp can be converted to UTC with no ambiguity.
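As a sketch of why the explicit offset removes the ambiguity (the parsing below is illustrative, not the production code):

use strict;
use warnings;
use Time::Local qw(timegm);

# A fully qualified time stamp carries its own UTC offset, so there is exactly
# one instant it can refer to, regardless of daylight savings.
my $stamp = "2018-11-04 01:15:00 -0800";    # the second 01:15, now expressed in PST
if ($stamp =~ /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2})$/) {
    my ($y, $mon, $d, $h, $min, $s, $sign, $oh, $om) = ($1, $2, $3, $4, $5, $6, $7, $8, $9);
    my $offset    = ($oh * 3600 + $om * 60) * ($sign eq "-" ? -1 : 1);
    my $utc_epoch = timegm($s, $min, $h, $d, $mon - 1, $y) - $offset;
    print scalar gmtime($utc_epoch), "\n";  # Sun Nov  4 09:15:00 2018, unambiguous
}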
This bug taught me a valuable lesson about time stamps. A system should always store and process fully qualified time stamps by default. The only time it should return a local time stamp is when a user asks for one explicitly. However, if there’s any chance that local time stamp will be processed by a machine, a bug like this one might surface.
The Mini Hang
In a third case, one of our tools was pausing anywhere from a few minutes to 20 minutes before proceeding with the remainder of the command. This was unacceptable. The pause would happen when our tool invoked another tool, captured its output in a variable, and printed the output to stdout. For those who are curious, the code which implemented this functionality looked something like:
# @some_command holds the external command and its arguments
my ($output, $output_length) = ("", 0);
my $pid = open(my $read_handle, "-|", @some_command) or die "open failed: $!";
while ( my $read_len = sysread($read_handle, $output, 4096, $output_length) ) {
    $output_length += $read_len;   # append each read after the data already captured
    # Do stuff with the $output
}
close($read_handle);
For those not familiar with Perl, this code runs @some_command in a child process and opens a pipe to its output. It then reads from that pipe 4096 bytes at a time until the pipe reaches end-of-file and sysread() returns 0.
This code was stable and was being used to invoke many different kinds of commands without any hangs or pauses. The pause seemed to manifest only when calling a command named ‘cc_update’, another tool that our organization owned. The hang was more pronounced on users’ machines. We were able to reproduce it locally, but the hang usually lasted under a minute. There was some thinking that users must have been performing builds and tests on their machines, creating a load that worsened the hang. I’ll explain later why that was not the case, at least not all the time.
The next step was to use commands like strace and ps to see what the system was doing. During the hang, the ps command did not list cc_update. So I had evidence that it had completed as expected. The output from cc_update on the screen also convinced me that it was exiting successfully. The strace output showed:
<snip>
write(1, "You are now all set for commit.\n", 32You are now all set for commit.
) = 32
write(1, "Please execute \"commit\" to commi"..., 50Please execute "commit" to commit your changeset.
) = 50
time(NULL) = 1409120125
write(1, "Reading from PID 13447 at Tue Au"..., 51Reading from PID 13447 at Tue Aug 26 23:15:25 2014
) = 51
read(3, 0x8b79853, 4096) = ? ERESTARTSYS (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
read(3, "", 4096) = 0
time(NULL) = 1409120127
write(1, "PID 13447 terminated at Tue Aug "..., 49PID 13447 terminated at Tue Aug 26 23:15:27 2014
) = 49
write(1, "Closing read handle\n", 20Closing read handle
) = 20
close(3)
<snip>
The ‘Child exited’ line also offered proof that cc_update exited successfully. The read() call immediately after that line was where the hang happened. That system call was the result of the sysread() Perl function. The exit of cc_update should have propagated an EOF to the other end of the pipe being read by sysread(). Then sysread() should have returned 0.
Let’s take stock of all the information collected so far: cc_update was exiting successfully (both the ps output and its own console output confirmed that), the hang occurred in the read() system call issued by sysread(), and the EOF that should have followed cc_update’s exit was never arriving at our end of the pipe. Meanwhile, the same code had no trouble capturing the output of other commands.
The next step was making it easier to reproduce this error. It was incredibly expensive to run our tool and get it to the point where it finally executed the code in question. We wrote a simulator with lots of tracing statements that simulated the call to cc_update and captured its output in the same manner as the production code.
Experimentation
At this point I tried various experiments to gather more information, but none of them exposed the root cause.
Rubberducking
Rubberducking saved the day. In my research I had been reading about how the read end of a pipe does not receive an EOF until all references to the write end have been closed. While writing those words in an email to my team, I immediately thought of a parent process and a child process writing to the same file handle and competing over console output. Without fully comprehending what my brain had just discovered, I decided to search the ps output for any processes attached to the same tty as the calling process. As soon as I saw process 12715 in the output below, I knew exactly what was happening. Before it exits, cc_update invokes the msite_sync command in the background. That child process inherited the same end of the pipe that cc_update was writing to before it exited. The other end of the pipe doesn’t receive an EOF because, as far as it knows, there could be more content to read.
$ ps -ef | grep pts/0
<snip>
ryan 11367 8146 0 01:03 pts/0 00:00:00 strace ./sim.pl
ryan 11368 11367 0 01:03 pts/0 00:00:00 /usr/bin/perl5.8.8 -w ./sim.pl
ryan 11370 11368 0 01:03 pts/0 00:00:00 [sh] <defunct>
ryan 12715 1 0 01:07 pts/0 00:00:00 sh -c msite_sync -waittime 30 -w -t replica-1 -s hub1 /vob/ios > msync_out_file.ios 2>&1
<snip>
This explains why the hang was so long. msite_sync can be a long-lived process, especially in large, active code bases. On our testing machines, where this large code base didn’t see a lot of activity, that process wouldn’t take more than a few minutes to complete, which is why we never experienced long hangs there. Out in the wild, it could take 5, 10, or even 20 minutes, even on a lightly loaded machine.
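Here is a self-contained sketch of that mechanism (hypothetical; a backgrounded sleep stands in for msite_sync):

#!/usr/bin/perl
use strict;
use warnings;

# The child prints a line and exits immediately, but first launches a
# background process that inherits the write end of the pipe. sysread()
# will not see EOF until that background process also goes away.
my @some_command = ("sh", "-c", "echo 'You are now all set for commit.'; sleep 60 &");

my $pid = open(my $read_handle, "-|", @some_command) or die "open failed: $!";
print "Reading from PID $pid at " . localtime() . "\n";

my ($output, $output_length) = ("", 0);
while (my $read_len = sysread($read_handle, $output, 4096, $output_length)) {
    $output_length += $read_len;
}
print "PID $pid terminated at " . localtime() . "\n";
print "Closing read handle\n";
close($read_handle);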
One solution to this problem would be to enhance the CCTools update command to properly detach any background child processes from its parent’s file handles. However, it was also important for the caller of cc_update to protect itself from commands that didn’t take such precautions. Thus, I believe we resolved this issue by forking a child process, sending its output to a temporary file, and using the waitpid() function to detect when that child process had exited.
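A hedged sketch of that approach (simplified; the command, error handling, and file names are illustrative):

use strict;
use warnings;
use File::Temp qw(tempfile);

my @some_command = ("cc_update");           # placeholder for the real invocation

# Run the command in a child whose output goes to a temporary file, then use
# waitpid() to detect the child's exit. A lingering grandchild may keep the
# temp file open, but it can no longer stall us: we wait on the child's PID,
# not on an EOF that may never arrive.
my ($tmp_fh, $tmp_file) = tempfile();
my $pid = fork();
die "fork failed: $!" unless defined $pid;
if ($pid == 0) {
    open(STDOUT, ">&", $tmp_fh) or die "cannot redirect stdout: $!";
    open(STDERR, ">&", $tmp_fh) or die "cannot redirect stderr: $!";
    exec(@some_command) or die "exec failed: $!";
}

waitpid($pid, 0);                           # returns as soon as cc_update itself exits
open(my $out_fh, "<", $tmp_file) or die "cannot read $tmp_file: $!";
my $output = do { local $/; <$out_fh> };
close($out_fh);
# Do stuff with $output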
Summary
Hopefully, I’ve conveyed how I have employed the troubleshooting steps mentioned in the previous post. These examples represent the types of problems that manifest in an enterprise engineering environment.