The Libre.fm blog

Libre.fm June 2016 downtime, what happened?


In June 2016, Libre.fm was periodically unavailable for several hours at a time, and then eventually offline for six days. Here’s what happened.

This post is written by Matt Lee, founder of Libre.fm

So we’re back — thankfully. The last few days have been tricky and worrying for Libre.fm.

To give you an idea of how Libre.fm runs day to day, let me talk a little about our infrastructure. Libre.fm runs on the Bytemark Cloud (aka BigV). We have four servers: two web servers, a load balancer and a database server. The problem was our database server.

As of right now (June 30th, 1:20pm Central Time), Libre.fm has 153330128 of your scrobbles, which represents about 120 GB of PostgreSQL data. Put simply, we ran out of disk space and PostgreSQL shut down.
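Running out of disk is the kind of failure a small periodic check can warn about in advance. Here is a minimal sketch of such a check in Python, using only the standard library. The function names and the 10% threshold are assumptions made for illustration, not anything Libre.fm actually runs.

```python
import shutil


def free_fraction(path="."):
    """Return the fraction (0..1) of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path=".", min_free=0.10):
    """True when less than `min_free` of the disk is free.

    `min_free` is a hypothetical threshold; pick one that leaves room
    for the database to keep writing while someone responds.
    """
    return free_fraction(path) < min_free
```

Pointed at the database's data directory from a scheduled job, a check like this could send a warning well before the disk is actually full.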

The first thing we did was shut the server down, add more space and bring it back up. PostgreSQL wouldn’t start.

Next, I made a complete disk backup of all of the PostgreSQL files, and went to look at our backups on rsync.net. They were broken: because of our disk space issues, we’d been failing to make a backup file, and our backup scripts had been backing up a 0-byte file. Not good.
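A backup script can guard against this by refusing to upload a dump that is missing or empty. The sketch below shows one way to do that in Python. The names `check_dump` and `BackupError` and the `min_bytes` parameter are hypothetical, and a size check only catches the empty-file case, not a corrupt dump.

```python
import os


class BackupError(Exception):
    """Raised when a dump file is not fit to be uploaded."""


def check_dump(path, min_bytes=1):
    """Return the dump's size, or raise BackupError if it is missing or too small."""
    if not os.path.isfile(path):
        raise BackupError(f"dump file missing: {path}")
    size = os.path.getsize(path)
    if size < min_bytes:
        raise BackupError(f"dump file too small ({size} bytes): {path}")
    return size
```

Calling something like this between the dump step and the upload step turns a silent failure into a loud one. A stricter version could also compare the size against the previous day's dump.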

Clint Adams, one of the Libre.fm developers, had a look at the server, but as he was in Cape Town for DebConf, he was unable to figure out the problem. I proceeded to follow some online tutorials for dealing with this, and eventually got PostgreSQL to start, but it was throwing errors about being unable to find things.

It turned out that the problem was transaction ID wraparound: we’d done more than 2 billion transactions on the database (logins, scrobbles, etc.) and the database had lost track of where we were. My reindexing efforts weren’t helping.
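To see why "more than 2 billion" is the magic number: PostgreSQL transaction IDs are 32-bit counters compared on a circle, so each ID has roughly 2 billion IDs behind it and 2 billion ahead. The sketch below is a simplified Python model of that comparison. It ignores PostgreSQL's reserved special IDs and its freezing machinery, and the names are illustrative only.

```python
XID_BITS = 32
XID_MOD = 1 << XID_BITS      # 4294967296 possible counter values
HALF = 1 << (XID_BITS - 1)   # 2147483648, the "about 2 billion" horizon


def precedes(a, b):
    """True if transaction ID `a` counts as older than `b` on the circle.

    Equivalent to checking that the signed 32-bit difference a - b is negative.
    """
    diff = (a - b) % XID_MOD
    return diff >= HALF


old_row = 100
print(precedes(old_row, old_row + 1000))      # True: the row is in the past
print(precedes(old_row, old_row + HALF + 1))  # False: the row now looks like the future
```

Once the counter has moved more than about 2 billion past an old row that was never frozen, that row appears to come from the future, and old data can seem to vanish. Regular vacuuming is what normally keeps old rows safe from this.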

So, I restored the files a second time, and with help from RhodiumToad on IRC, used the built-in diagnostic tools that ship with PostgreSQL to understand the issue better, and to reset the transaction logs. Huge thanks to RhodiumToad! I found this tweet:

I was then able to upgrade PostgreSQL from version 9.1 to 9.4, put in a better backup script and tell our backup service rsync.net to alert us if no backups happen for 2 days.
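The "alert us if no backups happen for 2 days" rule is simple to express in code. This is a minimal Python sketch of the idea, not a description of how rsync.net implements it. The function name and the way timestamps are passed in are assumptions.

```python
from datetime import datetime, timedelta


def backup_is_stale(last_backup, now, max_age=timedelta(days=2)):
    """True when the newest backup is older than `max_age`."""
    return now - last_backup > max_age


last = datetime(2016, 6, 27, 3, 0)
print(backup_is_stale(last, datetime(2016, 6, 28, 3, 0)))  # False: one day old
print(backup_is_stale(last, datetime(2016, 6, 30, 3, 0)))  # True: three days old
```

Passing `now` in explicitly keeps the check easy to test. The important design point is that the alert fires on the absence of a good backup, so a script that quietly stops working still gets noticed.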

I was finally ready to bring the site back up, but first… I had a private look at the stats of the site: 153184145 scrobbles. That number is now 153330314 — an increase of 146169 — so while you’ve been busy listening to music, we’ve been busy learning about databases and the importance of proper backups.

It’s clear to me that we need to do a lot more on Libre.fm — I’m sorry we let things get this bad. I’m going to spend a few hours later this week looking at the most obvious bugs and improvements we can make to the site, and start working on those. If you have your own suggestions, please take a moment to register for our GitLab site and report the issue for us.

Follow Libre.fm on Twitter at @librefm or follow @mattl and @robmyers and see how Joan of Arc felt.
