Downtime and Lessons Learned

Well, barnson.org, and its sister site, outlanders-outfit.org, were down over the weekend. It is a brief object lesson on interdependencies in the Internet world; click “read more” for details.

You see, I had a script on the Outlanders Outfit page that monitored our outfit’s Teamspeak server. The script had worked pretty well for months, with occasional problems that generally worked themselves out in short order. Here’s the code. It ran within a “block” entry on the sidebar of my Drupal front page for Outlanders. Drupal drives barnson.org as well. Pat yourself on the back if you can figure out the problem:

// Display formatting to match our block layout...
echo ("<div class=\"box\">
<p class=\"boxtitle\">Who's On Teamspeak</p>
<div class=\"item-list\">");

// opens a connection to the teamspeak server
function getSocket($host, $port, $errno, $errstr, $timeout) {
    global $errno, $errstr;
    $socket = fsockopen($host, $port, $errno, $errstr, $timeout);
    if(!$socket or fread($socket, 4) != "[TS]") {
        echo("Server error: Teamspeak server not running!");
        return false;
    }// end if
    return $socket;
}// end function getSocket(...)

// sends a query to the teamspeak server
function sendQuery($socket, $query) {
    fputs($socket, $query."\n");
}// end function sendQuery(...)

// returns the result of the last query
function getResults($socket) {
    return fgets($socket);
}// end function getResults(...)

// closes the connection to the teamspeak server
function closeSocket($socket) {
    fputs($socket, "quit");
    fclose($socket);
}// end function closeSocket(...)

// ---=== main program ===---

// establish connection to teamspeak server
$socket = getSocket("teamspeak.outlanders-outfit.org", 51234, $errno, $errstr, 1);
if($socket == false) exit;

// select the one and only running server on port 8767
sendQuery($socket, "sel 8767");
if(getResults($socket) != "OK\r\n") {
    echo("Internal server error! Cause unknown.");
    exit;
}// end if

// retrieve player list
sendQuery($socket,"pl");

echo("<table><thead><th align=\"CENTER\">ID</th><th>&nbsp;</th><th align=\"CENTER\">Name</th></thead>\n");
$counter = 0;
do {
    $playerinfo = fscanf ($socket, "%s %d %d %d %d %d %d %d %d %d %d %d %d %s %s");
    list($number, $d, $d, $d, $d, $d, $d, $d, $d, $d, $d, $d, $d, $s, $name) = $playerinfo;
    if($number != "OK") echo("<tr><td align=\"CENTER\">$number</td><td>&nbsp;</td><td align=\"CENTER\">$name</td></tr>\n");
    $counter++;
} while($number != "OK");
if($counter == 1) echo("<tr><td colspan=\"3\" align=\"CENTER\">No players on server</td></tr>\n");
echo("</table>\n");

// close connection to teamspeak server
closeSocket($socket);

echo ("</div><p><a href=\"node/view/55\">Click here to use Teamspeak yourself!</a></div>");

If you look closely, nowhere do I handle timeouts. That means that PHP is waiting on the operating system’s timeout value to tell it that it was unable to connect to a remote host.
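For what it's worth, here is a minimal sketch of what handling a timeout on the read side could look like. It is not the code I ran; `readLineWithTimeout` is a hypothetical helper name, and the one-second value is just an assumption. The idea is that the timeout argument to `fsockopen()` only bounds the connect, while every later `fgets()` or `fread()` is governed by the stream's own timeout, which `stream_set_timeout()` sets and `stream_get_meta_data()` reports on.

```php
<?php
// Hypothetical helper (not part of the original script): read one line
// from an open socket, but give up after $seconds instead of hanging.
function readLineWithTimeout($socket, int $seconds): ?string {
    // Bound every subsequent read on this stream.
    stream_set_timeout($socket, $seconds);

    $line = fgets($socket);

    // 'timed_out' is true if the last read gave up waiting for data.
    $meta = stream_get_meta_data($socket);
    if ($line === false || $meta['timed_out']) {
        return null; // caller should treat this as "server unavailable"
    }
    return $line;
}

// Assumed usage against the server from the script above:
//   $socket = fsockopen("teamspeak.outlanders-outfit.org", 51234, $errno, $errstr, 1);
//   if ($socket && readLineWithTimeout($socket, 1) !== null) { ... }
```

With something like this in place, a silent server costs each page view about a second rather than tying up an Apache process for minutes.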

This is bad on a web site with even a moderate number of hits. Outlanders is small, but several sections (notably, some of my Humorous Pictures) get quite a few reads and are occasionally linked from high-traffic sites. Kind of funny, that, since I pretty much just grabbed someone else’s images and put my own captions on them.

To compound the problem, as a temporary workaround I disabled the .htaccess file controls for the site, and replaced the index page with a plain HTML page that said “Outlanders is down for the moment, it should be back Monday morning”. I had thrown that up there when I hadn’t figured out the problem yet, and figured that would help it hold until I returned from vacation late Sunday night.

Well, apparently the Teamspeak server operator decided my site was querying his TS server too much (and I must agree, it hit the TS server for pretty much every page read on my site due to the way I had this configured), so he set the Teamspeak server to drop packets from my virtual host. Not a problem for him, but I eventually ended up with my Apache server maxed out on processes as it waited for the return from Teamspeak. It doesn’t take many people clicking “refresh” while waiting on a 3-minute operating system timeout on a TCP connect() to max out your Apache processes.

  1. Apache processes on my FreeBSD server began spiralling out of control, because each one was waiting on a response from the Teamspeak server that never came.
  2. This continued for an entire day.
  3. The automated monitoring scripts caught on that my virtual server was sucking up too much RAM.
  4. Rather than just killing the offending processes, the scripts shut down my entire VDS (a jailed FreeBSD server).
  5. I was offline until the 65535.net admin intervened.

So, the lesson learned is this: when I have a dependency on another system, I need to be a bit more careful in how I handle communication with that other system.

I cranked down my StartServers for this Apache to 2, the MinSpareServers to 2, MaxSpareServers to 3, and MaxClients to 20.  MaxClients had been set to 150, and each Apache process takes up 7.1 MB.  Since I’d maxed out Apache with all these processes just spinning their wheels, that sucks up roughly an entire gigabyte of RAM.  I don’t think these little VDS servers even have that much RAM!
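The arithmetic behind that is simple enough to sketch. This assumes every Apache process really does sit at the 7.1 MB I measured, which is a simplification; `worstCaseApacheMemoryMb` is just a name I made up for illustration.

```php
<?php
// Rough worst-case estimate: every allowed Apache child is alive at once,
// each using the same amount of memory (a simplifying assumption).
function worstCaseApacheMemoryMb(int $maxClients, float $mbPerProcess): float {
    return $maxClients * $mbPerProcess;
}

printf("%.0f MB\n", worstCaseApacheMemoryMb(150, 7.1)); // old MaxClients: prints "1065 MB"
printf("%.0f MB\n", worstCaseApacheMemoryMb(20, 7.1));  // new MaxClients: prints "142 MB"
```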

With MaxClients set to 20, it means that I won’t survive a Slashdotting very well, but I also won’t suck up all the RAM on this poor box.  It makes me wonder how well putting a Squid proxy-cache in front of Apache as a reverse HTTP accelerator might work, since there seems to be plenty of CPU horsepower — just not very much memory.

Anyway, that’s the nutshell of why barnson.org and outlanders-outfit.org were unavailable for the weekend.  I learned a little bit as I sorted out what was dying.  The one thing I need to figure out, though, is where Drupal stores the Blocks configuration and Modules configuration settings in MySQL.  If I hadn’t had a browser window left up since Friday that had been logged into the Outlanders administration menu (and thus had the right cookie to admin without logging in), I might never have been able to use the web interface to make changes.  That was pretty hairy.  So I’ll be plunging into the code, figuring out first how to emergency-admin Drupal, then writing some error checking, shorter timeout values, and caching into my Teamspeak monitoring script, and then negotiating with the TS2 server admin to allow my server to query his again.
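As a rough idea of what the caching piece might look like, here is a sketch under a few assumptions: the status HTML lives in a file, something outside the page request (a cron job, say) refreshes it, and a failed query leaves the old copy alone. `refreshStatusCache` and the fetch callback are hypothetical names, not anything from Drupal or Teamspeak.

```php
<?php
// Hypothetical helper: refresh the cached status HTML, keeping the
// previous copy if the Teamspeak query failed.
function refreshStatusCache(callable $fetchStatusHtml, string $cacheFile): bool {
    // The callback is assumed to return an HTML string, or null on failure.
    $html = $fetchStatusHtml();
    if ($html === null) {
        return false; // stale cache is better than a hung page
    }

    // Write to a temporary file first, then swap it into place, so a
    // page view never reads a half-written cache.
    $tmp = $cacheFile . '.tmp';
    if (file_put_contents($tmp, $html) === false) {
        return false;
    }
    return rename($tmp, $cacheFile);
}
```

The page itself would then only ever read the file, so a slow or unreachable Teamspeak server can no longer hold an Apache process hostage.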

Whew, what fun!

4 thoughts on “Downtime and Lessons Learned”

  1. Oh my, i hope you’re not using

    Oh my, I hope you’re not using this piece of code anymore. This looks like a slightly modified version of what I posted on the TS forum a looong time ago. It’s buggy and doesn’t work properly with the latest official release. Take a look at http://www.brainspace.info. I’ve pretty much rewritten everything: it queries a lot more info, puts it in an easy-to-access data structure, and comes with a customizable display script too.

    Kind regards, Brain

    1. I’ll have to check it out…

      I’ve done a good deal of custom coding on Outlanders Outfit’s TS monitor since then. The queries themselves are largely unchanged, but I’ve modified the script heavily to handle various error conditions. Most notably, it now works from a file cache if the server is unavailable or times out. It always reads from the cache to display the page, and only checks the Teamspeak server every 30 minutes on a cron hook.

      The script is working OK for the time being, but with over a hundred people frequently online on the TS server, it definitely overloads page information. I need to narrow down the channel used, truncate names that are too long for my page layout, etc. I’ll check out the latest version for sure — I’d love to have the data available in an easy-to-use data structure rather than printed directly to stdout.

      Thanks for the link.


      Matthew P. Barnson

  2. Good article, but where is the finished code?

    Your article is very thorough and informative. How do I find the finished code?

    1. Unfortunately…

      Unfortunately, the Outlanders died as an outfit, and it wasn’t important enough to me to refine it further. As it stands, with newer versions of Teamspeak, my code no longer works well.

      That said, it should be fairly trivial for someone to implement it. Teamspeak has a well-documented socket API. The rest of my solution was trivial. Rather than launch it dynamically via PHP, I called a PHP script to get the Teamspeak stats, and wrote the output of that script to a file. The output is valid HTML.

      This meant the Drupal code for my block was ridiculously simple:

      readfile('/path/to/teamspeak-status.html');

      I apologize that there is no current working code. And I don’t plan on releasing any more, unless by some strange chance I get back into Planetside and Teamspeak again!


      Matthew P. Barnson
