With the help of New Relic, I recently discovered one of my more taxed Ubuntu servers was having Rails downtime almost nightly, averaging about six minutes each time. This was a bit puzzling because, if you’ve read my other blog posts, you’ll know I’m big on monitoring processes and making sure nothing dies without a plan in place to bring it back up; I also test these plans to make sure they work, and set up email alerts for certain failures. In the end, I found a cron job was the culprit, and made a change to the process it was kicking off (update-apt-xapian-index) which fixed my issues with downtime.
- Rackspace Server (256MB RAM), Ubuntu 12.04
- New Relic
Finding Out There Was a Problem
I’m a big believer in using the minimal amount of hardware necessary for the applications I build, which is why I was a big fan of the Rackspace 256MB RAM Ubuntu servers…they discontinued them earlier this year, otherwise I’d still be spinning up new 256MB servers. The downside to a small image, however, is that by the time you host an application or two, there aren’t many resources left over for other processes. This, in my opinion, is also a good thing, because it means the server is sized for its applications and there aren’t a lot of wasted resources being paid for. In this case, though, the limited hardware bit me: a cron job ran greedily, starving my Rails applications for resources and making them unresponsive.
I probably wouldn’t have realized there was a problem if I hadn’t been running New Relic for my site monitoring. Thankfully, their free level of service provides uptime monitoring and sends emails both when applications don’t respond, and then when they begin responding again. Here’s an excerpt from one of the “alert ended” emails I received:
Identifying the Cause
Fortunately, this was only a staging server where I hosted applications with features I hadn’t yet pushed to production (also part of why this server was under more load than I usually aim for). But this meant it was also a lower priority for me to fix…and that turned out to be a good thing, because the problem was elusive.
Over the course of a few weeks, I had built up quite the collection of New Relic downtime emails. Looking through them for clues, I found the average downtime was six minutes, with a few outages lasting 20 or 30 minutes, and all of them happening between 10pm and 1am. At this point I figured I was dealing with a scheduled job not playing nicely with Rails, and assumed (wrongly) that I could remote into the server when the next downtime email arrived and see what was running. That assumption fell apart over the next two nights: when the alert emails came in, I couldn’t even SSH to my box. I knew the machine was online; it was just completely unresponsive.
I had seen somewhere that New Relic offered server monitoring in addition to application monitoring, again with a free tier, although it limited log history to 30 minutes. It turned out to be a simple install, and it was my key to finding the problem. The next night I received an email alert, I logged in to New Relic and could view the running processes and see what was consuming resources. The server dashboard also revealed that in addition to the CPU at 100%, disk and memory were pegged and my swap file was growing.
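When the box is responsive, you can get a similar top-consumers view straight from the shell — this is a generic `ps` invocation (GNU/procps on Linux), not what the New Relic dashboard shows:

```shell
# List the five biggest CPU consumers, highest first.
# --sort=-%cpu is a GNU procps option; the leading "-" means descending.
ps aux --sort=-%cpu | head -n 6
```

Swap `-%cpu` for `-%mem` to sort by memory instead.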
There was one process that seemed to be consuming all the resources:
I was unfamiliar with this process, so a quick trip to Google told me its full name is actually update-apt-xapian-index, that it’s used to maintain a search index of packages, and that a lot of people have posted on forums about it consuming 100% CPU. I found a nice blog post by Naveen with some more detailed information and a few solutions to the problem.
The solutions seemed to be fairly standard across different sites:
- Run the process as a lower priority
- Make the job non-executable
- Purge the package
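Concretely, the three options map to commands like these (paths assumed from a stock Ubuntu install; note the purge removes the package’s search-index functionality entirely):

```shell
# Option 1: lower the priority -- edit the cron job to prefix the command, e.g.:
#   nice -n19 ionice -c3 update-apt-xapian-index --quiet

# Option 2: make the weekly job non-executable, so run-parts skips it:
sudo chmod -x /etc/cron.weekly/apt-xapian-index

# Option 3: purge the package entirely:
sudo apt-get purge apt-xapian-index
```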
I went with lowering the priority. The one twist was that most of the posts suggest setting the priority in /etc/cron.weekly/apt-xapian-index; however, my weekly job already had the priority set to the lowest (19):
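For reference, here is roughly what a stock /etc/cron.weekly/apt-xapian-index looked like at the time — a reconstruction from a typical Ubuntu install of that era, not my exact file:

```shell
#!/bin/sh
# /etc/cron.weekly/apt-xapian-index (typical stock version; reconstructed, not verbatim)
CMD=/usr/sbin/update-apt-xapian-index
if [ -x "$CMD" ]; then
    # already runs at the lowest CPU priority (19) and idle I/O class
    nice -n 19 ionice -c 3 "$CMD" --quiet
fi
```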
This also wasn’t the source of my problem, since my downtime occurred more often than once a week. Looking in /etc/cron.daily, there is an apt job which runs at 6:25 UTC, just prior to when I would see the alert emails. Opening up that job, I found this section toward the bottom:
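The file itself is distribution-managed, so here is a reconstruction of the relevant stanza from /etc/cron.daily/apt on Ubuntu of that era — not the verbatim file, so check your own copy:

```shell
# /etc/cron.daily/apt (excerpt; reconstructed, not verbatim)
# rebuild the package search index if apt-xapian-index is installed
if [ -x /usr/sbin/update-apt-xapian-index ]; then
    # note: bare `nice`, with no explicit -n priority
    nice ionice -c3 update-apt-xapian-index --quiet
fi
```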
You can see no explicit priority is passed to nice for update-apt-xapian-index, which means it falls back to nice’s default adjustment of 10. There is no need for the process to run at that high a priority, so I set it to 19 (the lowest):
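With those four characters added, the stanza becomes (again a reconstruction, not the verbatim file; apply the same change wherever the indexer is invoked in your copy):

```shell
# /etc/cron.daily/apt (excerpt; reconstructed, not verbatim) -- after the fix
if [ -x /usr/sbin/update-apt-xapian-index ]; then
    # -n19 pins the indexer to the lowest CPU priority
    nice -n19 ionice -c3 update-apt-xapian-index --quiet
fi
```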
So far, I haven’t had any issues since making this change!
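If you want to see the difference the flag makes, you can print the niceness a bare `nice` assigns versus `nice -n 19` (this assumes the parent shell is running at the usual niceness of 0):

```shell
# bare `nice` adds the default adjustment of 10 to the parent's niceness
nice sh -c 'ps -o ni= -p $$'
# `nice -n 19` requests the minimum scheduling priority
nice -n 19 sh -c 'ps -o ni= -p $$'
```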
Looking back, it was a lot of work for what ended up being a four-character fix, but that’s part of the risk/fun of going light on hardware and optimizing to maximize the load the servers can handle.
New Relic’s free services also aided me here, and it was a good lesson: I can’t solely focus on managing ‘my’ processes (Rails, Unicorn, Nginx, etc.); it’s important to also have some system monitoring in place.