System Independent Debugging
Slanted Towards Endevor Support
With Apologies to Dave Alcalay, who originally wrote a document of the same title and from whom I learned some of these principles.
Pay Attention to the Rules:
Rule Number One: The User is Always Wrong. If the User is right, see Rule One.
OK, so that isn’t always entirely true, but it is ordinarily a reasonable starting point.
Rule Number Two: It is Difficult to see What is Really There.
Don’t assume that what is there is what is supposed to be there. Look carefully at what is there, trying to see everything and not assuming that what you think SHOULD be there is what is there.
One of my first debugging victories was a user’s JCL that was perfect, except it all started in column 3, not column 1. Every time the job was submitted, it promptly disappeared because JES2 thought it was just data and flushed it.
Common cases of this kind of issue are things that do not work because of missing periods or continuations.
Another recent case was a processor that kept ABENDING with an S806 (module not found or could not be loaded). Yet the STEPLIB was there, and the module was in the STEPLIB, and the REGION should have been sufficient. Under JES3 as locally configured, if there is a DUPLICATE DD statement, one of them gets ignored. As it happened, there was a duplicate STEPLIB, so the one that HAD the needed program was not the one that was used, hence the S806.
A single extra space or missing space or a comma where a period should be can make something fail.
As the famous Isaac Asimov said, “Look, look, look.” This is a foundational principle of science.
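The column 3 JCL case above is the kind of thing a small script can see more reliably than a tired pair of eyes. What follows is only a minimal Python sketch, not a JES or Endevor facility; the function name find_misplaced_jcl and the sample statements are made up for illustration. It reports every line whose // is preceded by blanks instead of starting in column 1.

```python
def find_misplaced_jcl(lines):
    """Return (line_number, column) pairs for lines whose '//' is indented.

    A JCL statement must begin with '//' in column 1. A statement that is
    shifted right still looks fine to the eye but is treated as data.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        if indent > 0 and stripped.startswith("//"):
            problems.append((number, indent + 1))  # 1-based column of '//'
    return problems


# Hypothetical sample: the second statement starts in column 3.
sample = [
    "//MYJOB    JOB (ACCT),'TEST'",
    "  //STEP1    EXEC PGM=IEFBR14",
    "//DD1      DD DUMMY",
]
print(find_misplaced_jcl(sample))
```

The same idea extends to other "look at what is really there" checks, such as a missing period or a stray comma, as long as you can state the rule precisely.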
Rule Number Three: Don’t Assume the Data is Valid.
A couple of binary zeros is not the same as a couple of blanks. Bad data that is not detected early enough by the code can crash pretty much anything. Sometimes, “HEX ON” is your best friend. And lowercase letters in JCL, ordinarily, are a very bad sign, except for UNIX stuff. Endevor processors also choke if there are lowercase letters in a dataset name, except for UNIX stuff.
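To show why a hex view matters, here is a small Python sketch. It assumes EBCDIC code page 037 (Python's "cp037" codec) purely for illustration; your shop's code page may differ, and hex_view is a made-up helper name. On a normal display both fields look empty, but the hex shows they are not the same data.

```python
def hex_view(data: bytes) -> str:
    """Render each byte as two hex digits, separated by blanks."""
    return " ".join(f"{byte:02X}" for byte in data)


blanks = "  ".encode("cp037")  # two EBCDIC blanks (assumed code page 037)
zeros = bytes(2)               # two binary zeros

print(hex_view(blanks))  # 40 40
print(hex_view(zeros))   # 00 00
print(blanks == zeros)   # False
```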
Rule Number Four: It is harder to find out why something did NOT happen than why it did.
In cases where something did not happen, you sometimes have to trace the process through from the beginning, looking for where it might have broken down. Don’t always start with the failing job. Take a look at the predecessors, if appropriate, and at the input data. Ditto, the failing step.
A classic example of this is a BIND did not happen. But what makes a BIND happen (in one Endevor instance in a brokerage shop)? A valid, approved Bind inventory record that is correct, inclusion of a certain control element, switches appropriately set in the processor symbolics, etc. And some of the jobs in this process, sadly, flagged an invalid BIND inventory record, put out an error message, and gave a good return code. Not cool. The software is not properly designed if a good return code is given when the job actually fails.
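The "error message plus good return code" trap is easy to avoid if the return code is derived from the errors themselves. The following Python sketch only illustrates that idea; the record layout, the "approved" field, and the choice of 8 as the failing return code are all assumptions, not how any real BIND job works.

```python
import sys


def validate_records(records):
    """Return a list of error messages; an empty list means all is well."""
    errors = []
    for index, record in enumerate(records, start=1):
        if not record.get("approved"):
            errors.append(f"record {index}: not approved")
    return errors


def run(records):
    """Report every error and return a code that reflects what happened."""
    errors = validate_records(records)
    for message in errors:
        print(message, file=sys.stderr)
    return 8 if errors else 0  # never report success after flagging an error


# Hypothetical input: the second record is invalid.
rc = run([{"approved": True}, {"approved": False}])
print("return code:", rc)
```

A real job would pass that value back to the system as its return code, so that the next step, and the next person, can trust it.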
Rule Number Five: There is no such thing as an uncaused event. We may not be able to find the cause, but there always is one.
So we try to find the root cause. If we cannot find the root cause, we watch for a recurrence and carefully note all the perceivable factors that might be involved.
Rule Number Six: The last thing to change is likely the cause of the error, so when the problem began to occur is significant.
There is, unfortunately, such a thing as a latent bug, but most of the time, a problem begins to happen because of a catalyst, which is usually a change. And the user often does not observe that he picked up a different copybook or DCLGEN or used a different compile parm. With Endevor, the Component list is your best friend in that case as it records such things, but careful observation is always needed.
When something that was working suddenly begins to fail “for no reason,” that means that you have not yet identified what has changed. SOMETHING has to have changed, or else it would continue to work. The trick is to identify what has changed that could cause the symptoms you are seeing.
Sometimes, the change was not the actual CAUSE of the problem, but contributed to making it more likely to occur.
Rule Number Seven: Truth is binary valued. A thing is true or it is not. There is no middle ground. ALMOST does not count “except in horseshoes and hand grenades.”
If a thing appears to be true SOMETIMES and the software seems to fail randomly, then you haven’t identified all the factors, which can be many. So you look for when it started, what was happening at the time, what the failures have in common, etc. If you do the same thing in the same way under the same conditions, you should get the same result. If you do not get the same result, not EVERYTHING is the same. Even the time of day can be an issue, as in the old midnight problem with PDS files in the 1980s. They would sometimes get corrupted if updated right at midnight.
And I heard of a server that always failed, inexplicably, on a certain day of the week at a certain time. That turned out to be because the cleaning lady would unplug it so she could plug in her vacuum cleaner.
Rule Number Eight: The error message is not always where the problem is. Sometimes, it comes long after the problem.
If checking a SYSLOG, for example, look for the FIRST SIGN of abnormal behavior and start from there. Ditto a job log.
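Finding the FIRST sign of trouble can be sketched in a few lines of Python. The patterns and the log lines below are invented for illustration and are not real message formats; what counts as "abnormal" has to come from knowing your own logs.

```python
import re

# Hypothetical patterns; adjust to what "abnormal" means in your logs.
ABNORMAL = re.compile(r"ABEND|ERROR|NOT CATLGD|FAILED")


def first_abnormal_line(log_lines):
    """Return (line_number, text) of the earliest suspicious line, or None."""
    for number, line in enumerate(log_lines, start=1):
        if ABNORMAL.search(line):
            return number, line
    return None


# Made-up log: the ABEND in step 3 is loud, but the trouble began earlier.
log = [
    "STEP1 ENDED RC=0000",
    "DATASET A.B.C NOT CATLGD 2",
    "STEP2 ENDED RC=0000",
    "STEP3 ABEND S0C7",
]
print(first_abnormal_line(log))
```

Here the search stops at line 2, well before the ABEND that everyone would have been staring at.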
Rule Number Nine: If all else fails, raise the REGION.
OK, so this doesn’t fix everything, but like rebooting a PC, it fixes a whole bunch of problems.
Rule Number Ten: Never let the user diagnose the problem.
When a user starts asking me questions that seem to be fishing for information about how something works, I often “cut to the chase” by asking why the user is asking.
The clue that the user is trying to debug his own problem is often that the user asks a question that is TOO specific or TOO general, such as “Why does COBOL now default to reentrant code?” The wisest course of action is to respond by asking why the user asks. Generally speaking, users do not have sufficient knowledge of how the operating system, the compilers, or the linkage editor (aka binder) work to diagnose their own problems entirely correctly.
You can save an awful lot of time by getting the user to articulate the symptoms. And you, as the Doctor, can then determine what is wrong and how to treat it.
Rule Number Eleven: Remember Occam’s Razor.
Occam’s Razor is a principle that states that if there are multiple potential explanations for why something happened, the simplest is the most likely to be true.
Therefore, when the user changes his code and finds a problem with it, rather than speculating that power fluctuations might have caused the mainframe hardware to malfunction, the first place to look is at the user’s changes.
Rule Number Twelve: Try to get ALL the relevant data.
“Endevor is broken” does not help you fix the problem. Even “it ABENDED” doesn’t help. With what did it ABEND? Where? What is the jobname? What is the job number? What did the user do before the failure? What else was the user doing at the same time? On what LPAR did it run? Etc.
When files are supposed to be “linked,” that is, to have each a record for every record in the related file, you can get all manner of strange behavior if that is not, IN FACT, the case. So if your record counts for a certain type of record in one file don’t match the counts in the other, don’t begin by assuming the files are correct. The Endevor PCF and some local control files in a former shop were “linked,” but sometimes because of failures, a record got deleted from one or the other, so results were not what was expected.
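A count mismatch tells you that the "linked" files disagree, but not where. A rough Python sketch of the next step is to compare the record keys from each side. The key values and the function name unmatched_keys are hypothetical, and extracting the keys from the real files is left out.

```python
def unmatched_keys(keys_a, keys_b):
    """Return (only_in_a, only_in_b) as sorted lists of record keys."""
    set_a, set_b = set(keys_a), set(keys_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Hypothetical keys pulled from two files that are supposed to be linked.
pcf_keys = ["PKG001", "PKG002", "PKG003"]
local_keys = ["PKG001", "PKG003", "PKG004"]

only_in_pcf, only_in_local = unmatched_keys(pcf_keys, local_keys)
print("only in first file: ", only_in_pcf)
print("only in second file:", only_in_local)
```

Note that in this made-up case the record counts match (three and three) even though the files do not, so matching counts alone do not prove the link is intact.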
Rule Number Thirteen: Beware of too little data or too much data.
Often, you have to zero in on one problem at a time in order to fix a complex issue. “My compile failed.” OK, but you have 12 syntax errors. Let’s fix those first before we conclude that Endevor is broken.
Rule Number Fourteen: Even if the user says he did not change anything, he did.
Never blindly believe the user when he claims he hasn’t changed anything. Copybooks can change, even without being noticed. Statically included modules can change. System software can change. Data can change. But the most frequent cause is that the user DID change something but does not think it was significant.
So always verify, point by point, whether anything did or did not change.
Endevor has the component list, which is frightfully handy for this type of work.
Rule Number Fifteen: The easiest way to tell if something is wrong is to know what normal behavior looks like.
It takes time and careful observation to learn what is normal, and, sadly, that changes over time, so you have to refresh your knowledge all the time.
Rule Number Sixteen: If you do not know FOR SURE what the message means (and even the return/reason code), look it up.
We encountered S213-74 ABENDs in the middle of the night in a process which had been working. This was a case in point. We all know what an S213-04 is (format one DSCB not on volume or cannot be accessed), but a -74 was new to me. It turned out that it was caused by a vendor coding error in a processor which did not come to light until there was an operating system upgrade.
Rule Number Seventeen: Sometimes you have to follow a process of ruling out possible causes in order to find the actual root cause.
Always start with the easiest/most likely cause. Entertain outlandish possibilities only when everything “normal” and “usual” has been ruled out.
Rule Number Eighteen: Learn the rules, for example, of JCL.
Refresh your knowledge, as they change them over time. It is impossible to support JCLCHECK, for example, without a thorough knowledge of JCL. Ditto, Endevor.
Rule Number Nineteen: Do not take the user’s word for it.
When the user reports that after your attempted fix the same error occurred, look at the job to see if that is really the case. Users often do not even look beyond the return code. They do not often read messages. If last time, you got a 1708 DARC and this time you got an S0C1 ABEND, the cases are definitely NOT the same, even if the overall return code was 16. Always look at the errors yourself.
Rule Number Twenty: Pay Attention in Meetings.
As boring and seemingly irrelevant as meetings can be, we once debugged a seemingly unrelated failure to CAST a package because someone made a chance comment in another meeting about a change that had been made to UNIX security. This was the change that required an “OMVS segment” when certain UNIX services were invoked which involved TCP/IP. If we had not been paying attention in the unrelated meeting, we could have gone days or weeks without being able to figure out what was wrong.
Techniques:
The Alaskan Wolf Hunt Method of Debugging:
There is a wolf out there in Alaska. Divide Alaska in two. He is either in this half of Alaska or the other half. If he is in this half, we are done. If not, we divide the other half of Alaska into halves and check one of them. If he is in the quarter we check, we are done. If not, divide the other quarter in this half of Alaska into halves. And continue. Eventually, we catch the wolf.
In programming, this is often done by putting messages or displays into the code. If the error happens before the message, that narrows down where it might be occurring, etc.
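For those who like to see the wolf hunt written down, here is a Python sketch of the halving idea. It assumes you can run just the first n steps of a process and tell whether the failure has shown up yet, and that once the failure appears it stays. The function names and the list of steps are hypothetical.

```python
def find_first_failure(steps, fails_after):
    """Return the index of the step that introduces the failure.

    fails_after(n) must return True if the failure is visible after running
    the first n steps. Assumes fails_after(0) is False, fails_after(len(steps))
    is True, and that the failure never disappears once it has appeared.
    """
    low, high = 0, len(steps)  # clean after `low` steps, broken after `high`
    while high - low > 1:
        mid = (low + high) // 2
        if fails_after(mid):
            high = mid  # the wolf is in the first half
        else:
            low = mid   # the wolf is in the second half
    return high - 1


# Hypothetical process in which the fourth step ("compute") is the culprit.
steps = ["read", "parse", "validate", "compute", "write"]
culprit = find_first_failure(steps, lambda n: n > 3)
print(steps[culprit])  # compute
```

Each check cuts the remaining territory in half, so even a very long process needs only a handful of checks.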
The Social Security Pensioner Issue:
A program in the Social Security Administration worked perfectly until the first pensioner turned 100. Needless to say, the age field had only two digits.
Always look for overflow or underflow. Don’t let the number of digits give you the fidgets. And watch out for truncation. Differing COBOL options (and the same option in differing versions of COBOL) can cause different truncation behavior.
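The two-digit age problem can be imitated in Python. This is only a model of simple decimal truncation, where the high-order digit is dropped; as noted above, real COBOL behavior depends on the compiler options and version. Both function names are made up.

```python
def store_in_two_digits(age: int) -> int:
    """Imitate storing a number in a two-digit unsigned field.

    Only the two low-order digits survive; anything higher is lost.
    """
    return age % 100


def fits(value: int, digits: int) -> bool:
    """Return True if value can be held in an unsigned field of that size."""
    return 0 <= value < 10 ** digits


print(store_in_two_digits(99))   # 99
print(store_in_two_digits(100))  # 0
print(fits(100, 2))              # False
```

A check like fits() before the store is the cheap way to find out that a pensioner has turned 100 before the program quietly decides he is a newborn.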
Boundary Testing:
If the results are incorrect, check to see if the fields are of appropriate size and type for the data that is expected. Then look at the actual data and see if they still seem appropriate.
Watch for zero counters and adding zeroes to counters. I spent 8 solid hours trying to figure out why a program was never executing the “end of job” logic, only to discover I had been adding zero instead of one to the counter. OOPS. Did I ever feel stupid!
Watch for exceeded limits. We had a job that ran successfully for years, only to go down with an S0C7 as soon as the number of records hit 10,001. Your guess? Yes! The table size was 10,000 records. The next record caused an overflow which in turn caused the S0C7.
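The defensive version of the table story is to check the limit before storing the record, and to fail with a message that says what actually happened. This Python sketch uses the 10,000 figure from the story; the names TABLE_SIZE, TableFullError, and add_record are made up for illustration.

```python
TABLE_SIZE = 10_000  # limit taken from the story above


class TableFullError(Exception):
    """Raised when a record would not fit in the table."""


def add_record(table, record, limit=TABLE_SIZE):
    """Append record to table, refusing to go past the limit."""
    if len(table) >= limit:
        raise TableFullError(f"table limit of {limit} records exceeded")
    table.append(record)


# Boundary test with a small limit: the third add must be refused.
table = []
try:
    for number in range(3):
        add_record(table, number, limit=2)
except TableFullError as error:
    print(error)
print(len(table))  # 2
```

Testing right at the boundary (the last record that fits and the first one that does not) is what would have caught the problem years earlier.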
Don’t Assume it Worked:
Responsible programmers check the return codes and file statuses from opens, reads, writes, closes, etc. If you don’t, you can expect strange results.
Ditto for dynamic allocations and program calls.
There is an S0C4 ABEND with your name on it just waiting for you if you do not heed this advice.
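Here is the habit in miniature, sketched in Python. It assumes a hypothetical read_record routine that hands back a two-character status along with the data, in the style of COBOL file status codes ("00" for success, "10" for end of file). Anything else stops the program on the spot instead of letting it run on with garbage.

```python
def read_all(read_record):
    """Read until end of file, checking the status after every read.

    read_record() is a hypothetical routine returning (status, data).
    """
    records = []
    while True:
        status, data = read_record()
        if status == "10":  # end of file: the normal way out
            return records
        if status != "00":  # anything else is a failure, so say so now
            raise OSError(f"read failed with file status {status}")
        records.append(data)


# Simulated file: two good reads followed by end of file.
responses = iter([("00", "A"), ("00", "B"), ("10", None)])
print(read_all(lambda: next(responses)))  # ['A', 'B']
```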
Never Code Insulting Messages:
If you code “Turkey! This should not have happened,” you can be positive you will receive that error message, probably hundreds of times. Computers will always insult you, given the opportunity.
Never Completely Trust the Comments:
Comments and documentation are usually not updated each and every time something changes. Do NOT assume they are accurate.
It Works in Production but Fails in Test:
I remember being called by the Vice President on a case where the job worked in production but failed in test and was “exactly the same.” I brought up the working job and the failing job side by side and started looking. The failing job had tape mounts. The working job did not. I pointed that out to the user, who then admitted that the jobs were ALMOST the same, but not quite. ALMOST THE SAME is not the same as THE SAME.
Start noting the differences, then triage them, concentrating on the differences most likely to create the effects noted.
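Putting the two jobs side by side is exactly what a diff does. This sketch uses Python's standard difflib module; the JCL statements are invented, and the helper name differences is hypothetical. It lists only the lines that differ, ready to be triaged.

```python
import difflib


def differences(working, failing):
    """Return only the changed lines, prefixed '-' (working) or '+' (failing)."""
    diff = list(difflib.unified_diff(
        working, failing,
        fromfile="working", tofile="failing",
        lineterm="", n=0,
    ))
    body = diff[2:]  # skip the two '---' / '+++' file header lines
    return [line for line in body if line.startswith(("+", "-"))]


# Invented example: "exactly the same" except for one DD statement.
working = [
    "//STEP1 EXEC PGM=MYPROG",
    "//IN DD DSN=PROD.INPUT,DISP=SHR",
]
failing = [
    "//STEP1 EXEC PGM=MYPROG",
    "//IN DD DSN=TEST.INPUT,UNIT=TAPE,DISP=OLD",
]
for line in differences(working, failing):
    print(line)
```

The tool finds the differences; deciding which of them could produce the symptoms is still your job.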
Try to Get All the Relevant Data:
A user reported that JCLCHECK was malfunctioning when run a certain way but produced correct results when run differently. I spent 8 hours beating my head against the wall, attempting to narrow down PRECISELY what was causing the product to behave differently under different circumstances, only to discover that the “failing” case had additional local code that suppressed the messages the user was wanting to see. The product was not malfunctioning AT ALL. We caused the problem ourselves. Sigh.
The quote remains pertinent: “We have met the enemy, and he is us.” -- Walt Kelly