Friday's Unscheduled Maintenance and Data Loss

Was I affected?

Any data entered into Sifter between Friday, March 18, 2011 at ~~4:00~~ 3:59 UTC and ~~12:30~~ 16:52 UTC has been lost. Note: We updated the times to be more accurate. 16:52 UTC is when Sifter came back online, so this window is larger than the actual window of lost data: web access was disabled for roughly the last one and a half to two hours of it.

What if I was affected?

While some of the data in the database has been lost, teams should collectively have a record of that data in their email, thanks to Sifter’s email notifications. It isn’t a perfect solution, but the emails should help minimize the chance that anything slips through the cracks.

What should I do now?

If you were affected by the data loss or downtime, please contact us via our support site. There’s no way for us to put a value on the inconvenience to you and your team, but if you’ve been affected, let us know, and we will credit your account for the month. If you have any questions, or would like clarification, don’t hesitate to ask.

What happened?

Around 1:30 AM Friday, March 18, 2011 UTC, we began a slice resize to boost performance in the short term while we made long-term plans for significant improvements to our production environment. Everything checked out and ran smoothly into the evening. This morning, after reviewing and double-checking everything, we concluded that the resize wasn’t making a difference and decided against keeping it.

At that point, we had a decision to make. We could either roll back the slice with about a minute of downtime, or confirm the resize and then resize downwards at a later time with 20-30 minutes more downtime. Unfortunately, we decided to try to minimize the downtime and just roll back the slice. As a result, all data that had been entered on the new slice was lost, because the new slice was deleted in the rollback.

We realized this almost immediately after the resize and turned Sifter off right away while we researched our options. We verified with Slicehost that the data was indeed lost. We run offsite backups every night at midnight our time, so we knew that we could recover all of the data prior to that time.

With Sifter disabled, we restored from our most current backup. Unfortunately, the backup didn’t include any data created after 4:00 AM Friday, March 18, 2011 UTC. So anything created after that time has been lost. We know that apologies don’t go very far, but it should go without saying that we are truly sorry for the problems that this has inevitably caused.

What steps are you taking to prevent this in the future?

First and foremost, this is one of those lessons that you learn from and absolutely never forget. So, we won’t make a mistake like this again. Of course, that’s not enough. Prior to this incident, we had already begun exploring our options for improving our architecture. Our main priority was improving performance, but we also plan on adding more layers of redundancy and backups. We’ll be better than ever as a result of this, but we know that doesn’t bring back the lost data.

We don’t have full technical details yet because we’re still evaluating our next steps, but rest assured we’ll be making significant updates to our architecture and backup system.

A Sincere Apology

Words always seem kind of empty when something like this happens, but anyone who’s ever contacted us should know how passionate we are about taking incredible care of our customers. We’re taking this hard, and we’ll be working even harder to make amends. We sincerely apologize for our mistake and look forward to making this up to you.