Wed, 13 Oct 2004

Watchdog

I had a server in our colocation facility crash a couple of days ago due to an out-of-memory condition. Sadly, Linux’s behavior when it runs out of memory appears to consist of putting its fingers in its ears and screaming “kernel: __alloc_pages: 0-order allocation failed” to the console over and over, as opposed to doing something useful like letting me log in via ssh.

Enter watchdog. All the watchdog daemon does is open /dev/watchdog and write a single byte to it every 10 seconds. So long as it does this, nothing happens. However, should the daemon stop writing that byte for over a minute, the system is reset, hard. In other words, it's just the thing to hack around stupid memory behavior, and hopefully it removes the need for me to go to the colo.
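For the curious, the loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual watchdog daemon's source: the function name keep_alive and its beats and sleep parameters are my own inventions (they exist so the loop can be tried against an ordinary file), and the 10 second default simply mirrors the interval mentioned above.

```python
import time


def keep_alive(device_path, interval=10.0, beats=None, sleep=time.sleep):
    """Write one byte to device_path every `interval` seconds.

    beats=None means loop forever, which is what a real daemon would do.
    Passing a number stops after that many writes (hypothetical knob,
    only here so the loop can be exercised safely).
    Returns the number of bytes written.
    """
    written = 0
    # buffering=0 so every write goes straight to the device
    # instead of sitting in a userspace buffer.
    with open(device_path, "wb", buffering=0) as dev:
        while beats is None or written < beats:
            dev.write(b"\0")   # the content of the byte does not matter here
            written += 1
            sleep(interval)    # wait before the next heartbeat
    return written
```

Pointing this at an ordinary scratch file with a small beats value is harmless. Running keep_alive("/dev/watchdog") as root is another matter: once the device is open the timer is armed, so killing the script should eventually get you the hard reset, which is the whole point.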

Unfortunately for me, I don’t have the money for server hardware with a built-in watchdog timer, so I’m stuck with the Software Watchdog (supplied by the softdog kernel module). The softdog can help in some situations, but because it depends on the kernel’s own timers it won’t do any good in the event of a true hardware lockup when interrupts are scrambled. Still, it’s better than nothing.

From my testing, the watchdog appears to do no harm, and consumes minimal system resources. I’m setting it up on all of my systems, Just In Case (tm).
