Adventures in Hell

Last modified: Jan 29, 2008

This page documents (in short form) some cases where software misbehavior caused massive time loss, especially cases where there is no obvious way to fix the underlying problem, and where someone less persistent than I would have simply given up.  I'm documenting them here in the hope that you got here because of a similar problem, and that this will help you.

Adventure 2:   (Unix) Diagnosing System Misbehavior

Symptom:  A Unix virtual private server is occasionally killing random processes and issuing "memory alerts" in the control panel.

This is more serious than it sounds: once in a while, a completely randomly selected process which needs to allocate a little memory is killed instead.  The system is otherwise running fine, but the overall effect is something like standing inside a building and periodically knocking another hole in a wall.  Eventually, something really bad will happen.   This story illustrates how difficult and dangerous the Unix environment can be.

First, to start at the end, I finally found the problem by reading through all the obscure log files I could find scattered around on the server.   I could not have done this without root access.   What I found was a "failed, ran out of memory" note in the logs for qmail.  This message was associated with a mail filter I had written and debugged a month before.

The intended output of this filter is to note bouncing email addresses in a database.   This filter is run as user 'vpopmail' as part of the mail delivery process, but it's an ordinary user script, requiring no particular privileges to run.   Most Unix systems permit such scripts to be authored by ordinary users.   As part of its "good citizen" contract, this script logs what it does, and any problems it encounters, in a text file.

In the process of developing and debugging this script, I had discovered that user 'vpopmail' needed write access to the log file, which I granted on an ad-hoc basis by changing the write permissions on the file.  I knew at the time that this wasn't a permanent solution, but I forgot about it before I wrapped up the project.

Some days later, my log file rotator created a new log file, without the ad-hoc write permission.   This caused the mail filter to fail because it couldn't write the log file.  But wait: being a "good citizen", the program tried to log the fact that it couldn't open the log file. You can guess what happened next.

... so the NEXT thing that happened is that the mail filter went into a recursive dive, used up all the virtual memory available to it, and eventually was killed by the system.   There was no entry in my log file though, and because I wasn't actively expecting a flood of bounced email messages, I didn't notice the absence of a few new bounced email notations in my database.   What I did notice was that once in a while, some-process-or-other had mysteriously died and needed to be restarted...   I also noticed that this mysterious unreliability was a new problem, but it took a long time, chasing red herrings and pointing fingers at innocent parties, before I found the smoking gun that allowed it to be fixed.
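
The guard that was missing can be sketched in a few lines.  This is only an illustration in Python, not the original filter's code: the name log_line and its arguments are hypothetical.  The point is that the failure path reports to a different destination and never calls the logger again, so a log file that can't be opened costs one lost message instead of a recursive dive.

```python
import sys


def log_line(path, message, fallback=sys.stderr):
    """Append message to the log file at path without ever recursing.

    Returns True if the line reached the log file, False if it had to be
    sent to the fallback stream instead.  (Hypothetical helper.)
    """
    try:
        with open(path, "a") as handle:
            handle.write(message + "\n")
        return True
    except OSError as err:
        # Report the logging failure somewhere that cannot fail the same
        # way.  Crucially, this branch does NOT call log_line() again.
        try:
            fallback.write(f"cannot write {path}: {err}; message was: {message}\n")
        except Exception:
            pass  # Last resort: drop the message rather than kill the process.
        return False
```

Under qmail, whatever the filter writes to stderr would normally end up somewhere in the delivery logs, but that is an assumption about the setup, not something I have verified for every configuration.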

Chapter 2:  (yes, there's more!)

The other end of this same procedure is sending email.  I have a script that sends these emails, and I've always used it sparingly: I don't want to be considered to be spamming my players.   Since I'm such a good citizen, and have been running a web site for a number of years, I've accumulated quite a few email addresses which do NOT bounce.   On the most recent occasion when I sent a mass email to all (~3000 now) addresses, instead of the usual trickle of new bounce messages, I received a flood, and my server's email send/receive functions seemed to be severely wedged, which I eventually fixed by rebooting.

Another protracted investigation ensued, which eventually led to the observation that when I sent my 3000 messages (one at a time, but as fast as possible), Unix apparently did its best to fire up 3000 concurrent "send" processes.  This opened up all sorts of resource exhaustion possibilities; in this case the thing that hit the wall first was the number of open files.  Who knew there even was a limit?
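
For what it's worth, the limit is easy to inspect, and the sending side can be paced so it never asks for 3000 things at once.  The following is a rough Python sketch under stated assumptions: send_throttled, send_one, batch_size and pause_seconds are hypothetical names, and pausing between batches only spreads out the load; it does not by itself cap how many deliveries the mail server runs concurrently.

```python
import resource
import time


def open_file_limit():
    """Return this process's soft limit on open file descriptors (Unix only)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def send_throttled(addresses, send_one, batch_size=20, pause_seconds=1.0,
                   sleep=time.sleep):
    """Call send_one(address) for each address, pausing between batches.

    send_one is whatever actually hands a message to the mail system;
    it is supplied by the caller.  Returns the number of messages sent.
    """
    sent = 0
    for index, address in enumerate(addresses):
        # Pause before starting each new batch (but not before the first).
        if index and index % batch_size == 0:
            sleep(pause_seconds)
        send_one(address)
        sent += 1
    return sent
```

With 3000 addresses, a batch size of 20 and a one second pause, the whole run takes about two and a half minutes longer than sending flat out, which seems a small price.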

.. and, after more investigation, I found that qmail is configurable with a "concurrency limit" which can be set to a suitable value, and which was actually set to 255.
At least there was already a configuration variable I could tweak.
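
To check what the limit is currently set to, something like the following Python sketch would do.  It assumes a stock qmail layout, where the remote limit is a single number in the control file /var/qmail/control/concurrencyremote and a missing file means the built-in default (20, per the qmail-send man page) applies; read_concurrency is a hypothetical helper, and a vpopmail-style install may keep its control files elsewhere.

```python
from pathlib import Path


def read_concurrency(control_dir="/var/qmail/control",
                     name="concurrencyremote", default=20):
    """Return the integer on the first line of a qmail control file.

    Falls back to default when the file is missing, unreadable, or does
    not start with a plain non-negative number.  (Hypothetical helper;
    the path and the default are assumptions about a stock install.)
    """
    try:
        text = (Path(control_dir) / name).read_text()
    except OSError:
        return default
    lines = text.splitlines()
    first = lines[0].strip() if lines else ""
    return int(first) if first.isdigit() else default
```

As far as I know, qmail-send only reads this file when it starts, so a changed value takes effect after qmail is restarted.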

Adventure 1:   (Windows) Network Activity Graphs

Symptom: DuMeter stopped working. DuMeter is a Windows application that graphs network activity, which I find entertaining, and occasionally really useful.  One day I noticed it was reporting no activity, when I knew for sure there was a lot.  DuMeter wasn't giving any clues why it didn't see any network traffic, and there was no way to tell when it had stopped working. I tried reinstalling DuMeter and installing several other network metering programs, with results as follows:
Some email help from the NAD developers led me to iphlpapi.dll, Microsoft's recommended network helper API.  I wrote a test program and found that the GetIfTable function was not reporting the existence of my network adapter.  So I concluded that NAD for sure, and probably the other non-functioning programs, were being misled by this.  Some poking around in the Windows registry revealed that my network card had migrated into an unusual position, as adapter number 4, and there were no adapters number 2 and 3.   This odd state of affairs probably arose as a result of adding and removing PGPnet and VMware, which create virtual adapters. I'm guessing that something of this kind is what is confusing iphlpapi.dll.   Theoretically this problem could be fixed by some small rearrangement of the registry, but it's all completely opaque and undocumented, so I used the "big hammer" approach:  I removed all my network adapter drivers (both real and virtual) and reinstalled them.  Problem fixed.

No thanks to Microsoft, which produced the buggy API through some complex interaction of its slopware manipulations of the Windows registry.  Microsoft would make me pay them for the privilege of reporting this bug to them, and I've spent too much time already.

June 2003,  Windows 2000
