Libre.fm June 2016 downtime, what happened?

In June 2016, Libre.fm was periodically unavailable for several hours at a time, and then eventually offline for six days. Here’s what happened.

This post is written by Matt Lee, founder of Libre.fm.

So we’re back, thankfully. The last few days have been tricky and worrying for Libre.fm.

To give you an idea of how Libre.fm runs, day to day, let me talk a little about our infrastructure. Libre.fm runs on the Bytemark Cloud (aka BigV), where we have four servers: two web servers, a load balancer and a database server. The problem was our database server.

As of right now (June 30th, 13:20 Central Time), Libre.fm has 153,330,128 of your scrobbles, which represents about 120GB of PostgreSQL data. Put simply, we ran out of disk space and PostgreSQL shut down.
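In hindsight, the kind of check that would have caught this early is trivial to script. Here's a minimal sketch; the `/var/lib/postgresql` path and the 90% alarm threshold are my assumptions for illustration, not our actual configuration:

```python
import shutil

def disk_usage_pct(path="/var/lib/postgresql"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def check_disk(path="/var/lib/postgresql", limit=90.0):
    """True while usage is under `limit` percent; False means it's time to alert."""
    return disk_usage_pct(path) < limit
```

Run from cron every few minutes, a `False` result here gives you days of warning instead of an abrupt database shutdown.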

The first thing we did was shut the server down, add more disk space and bring it back up. But PostgreSQL wouldn’t start.

Next, I made a complete disk backup of all of the PostgreSQL files, and went to look at our backups. And they were broken: because of our disk space issues, we’d been failing to create a dump, and our backup scripts had been dutifully backing up a 0-byte file. Not good.

Clint Adams, one of the Libre.fm developers, had a look at the server, but as he was in Cape Town for DebConf, he was unable to figure out the problem. I proceeded to follow some online tutorials for dealing with this, and eventually got PostgreSQL to start, but it was throwing errors about being unable to find things.

It turned out that the problem was transaction ID wraparound: we’d done more than 2 billion transactions on the database (logins, scrobbles, etc.) and the database had lost track of where we were. My reindexing efforts weren’t helping.
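That ~2 billion figure isn’t arbitrary: PostgreSQL transaction IDs are 32-bit counters, and roughly half of that space separates the “past” from the “future” at any moment, so the wraparound horizon sits near 2^31 transactions. A quick check of the arithmetic:

```python
XID_SPACE = 2 ** 32       # PostgreSQL transaction IDs are 32-bit
HORIZON = XID_SPACE // 2  # about half the space can be "in the future"
print(HORIZON)            # 2147483648, i.e. the ~2 billion transactions above
```

Normally autovacuum “freezes” old rows long before this horizon is reached; our disk-full shutdown meant that housekeeping couldn’t run.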

So, I restored the files a second time, and with help from RhodiumToad on IRC, used PostgreSQL’s built-in diagnostic tools to understand the issue better, and to reset the transaction logs. Huge thanks to RhodiumToad!

I was then able to upgrade PostgreSQL from version 9.1 to 9.4, put in a better backup script, and tell our backup service to alert us if no backups happen for two days.

I was finally ready to bring the site back up, but first… I had a private look at the stats of the site: 153,184,145 scrobbles. That number is now 153,330,314, an increase of 146,169. So while you’ve been busy listening to music, we’ve been busy learning about databases and the importance of proper backups.

It’s clear to me that we need to do a lot more on Libre.fm, and I’m sorry we let things get this bad. I’m going to spend a few hours later this week looking at the most obvious bugs and improvements we can make to the site, and start working on those. If you have suggestions of your own, please take a moment to register on our GitLab site and report them there.

Follow Libre.fm on Twitter at @librefm, or follow @mattl and @robmyers and see how Joan of Arc felt.




5 responses to “Libre.fm June 2016 downtime, what happened?”

  1. J

    One thing you should prioritize for the user end of things is the ability to delete data vs. having to delete an entire account to do so. You mentioned looking into this in a reply to a reddit thread a while ago, but apparently it never materialized.

  2. I would suggest some form of monitoring as well, so you can have alerts for when the PostgreSQL server is having problems with disk space. Not only that, but also monitoring of your RAM and CPU usage.

    • Yeah, we’re looking at proper monitoring stuff right now. Is Nagios still the way people do it?

      • I am really sorry. I just read your comment now (I don’t know why WordPress didn’t send me an e-mail about this). Yeah, Nagios is a good option, as is Sensu. There are other options like ELK (Elasticsearch, Kibana, Logstash). But I would suggest starting as simply as you can and improving as you go.

        Good luck!
