Despite soft-launching email replies and testing the feature extensively, we’ve discovered a single point of failure that caused every email reply sent since late on 7/8 to be lost. We understand the severity of this mistake, and we apologize deeply for the communication problems it has likely caused for our customers and their teams.
Instead of setting up and maintaining an inbound mail server ourselves, we chose SendGrid to handle it for us. With an impressive list of customers, we felt like they’d do a much better job handling inbound mail than we ever could. And, I’m still confident that they can. We placed a lot of faith in them, but with 20/20 hindsight, we should have put some checks in place on our end just to be safe.
On July 7th, we received a support request from someone who was having problems getting their email replies into the system. We immediately looked into the matter and discovered that SendGrid was no longer notifying us of any new inbound emails. After working with SendGrid via their very responsive live chat, we were able to resolve the problem.
Unfortunately, we received another support request today from a customer having problems with their email replies not showing up. This time, some research showed that SendGrid hadn’t been passing along emails since about 6:00pm Central Time on July 8th. So, the original fix on SendGrid’s end was no longer working. As a result, for a week, we were not receiving any replies to notification emails as SendGrid was failing to handle them and pass along the data to us. To make matters worse, SendGrid has assured us that all of the email replies received during that time have been lost.
To SendGrid’s credit, both times, they were incredibly responsive and helpful in resolving the problem. Unfortunately, that doesn’t bring your email replies back. The system is working again, and if you reply to old “replyable” emails, your comments will make it into Sifter. However, for the time being, we’re disabling this feature for new notifications until we can be confident that we won’t see any additional problems.
What are we doing about it?
There are two layers to this. The first is what SendGrid is doing. They’ve fixed the problem and it’s working right now, but we’re hesitant to flip the switch right away and turn it back on for everyone. They’ve assured me that the bug is indeed fixed now, and they are in the process of adding the necessary tests to ensure that wildcard subdomains on email addresses will be handled properly. For the technically minded, they said that their existing SMTP monitoring server will send messages to parse API addresses set up with several levels of wildcards. It will then be able to detect when a message doesn’t get through, and that check will be hooked up to Nagios so that they’re notified.
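For the curious, here is a rough sketch of what that kind of canary check might look like. This is our own illustration in Python, not SendGrid’s actual code: the `canary_addresses` helper, the `CanaryMonitor` class, the `probeN` subdomain labels, and the `parse.example.test` domain are all hypothetical, and the actual sending of mail over SMTP is left out. The only borrowed convention is the Nagios plugin exit codes (0 for OK, 2 for CRITICAL).

```python
import uuid
from dataclasses import dataclass, field

# Nagios plugin exit codes (standard plugin convention).
OK, CRITICAL = 0, 2


def canary_addresses(base_domain, levels, local_part="canary"):
    """Build one test address per wildcard depth (hypothetical naming scheme)."""
    addresses = []
    labels = []
    for depth in range(1, levels + 1):
        # Each pass adds one more subdomain level in front of the previous ones.
        labels.insert(0, f"probe{depth}")
        addresses.append(f"{local_part}@{'.'.join(labels)}.{base_domain}")
    return addresses


@dataclass
class CanaryMonitor:
    """Tracks canary messages that were sent but not yet seen on the inbound side."""
    timeout_seconds: float
    pending: dict = field(default_factory=dict)  # token -> (address, sent_at)

    def record_sent(self, address, now):
        # The token would travel in the message so the receiver can report it back.
        token = uuid.uuid4().hex
        self.pending[token] = (address, now)
        return token

    def record_received(self, token):
        # Called when the inbound webhook hands us a canary message.
        self.pending.pop(token, None)

    def check(self, now):
        # Any canary still pending past the timeout counts as a lost message.
        overdue = sorted(
            address
            for address, sent_at in self.pending.values()
            if now - sent_at > self.timeout_seconds
        )
        if overdue:
            return CRITICAL, "CRITICAL: no delivery for " + ", ".join(overdue)
        return OK, "OK: all canary messages delivered"
```

A monitoring script would call `check()` on a schedule and exit with the returned status code so that Nagios can raise an alert. Tracking each wildcard depth separately means a failure that only affects deeper subdomains would still be caught.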
The second layer is what we’re doing to make sure this doesn’t happen again. In the short term, we’re going to dial the feature back until we’re confident in its reliability. We’re also going to put some checks in place on our end so that we can keep a much closer eye on it. We’ll still be using the feature internally, and we have a couple of customers using it as well. When we feel like it’s ready, we’ll flip the switch and have everyone up and running again.
Despite our best efforts and internal testing, our email reply functionality hasn’t lived up to our standards, and we sincerely apologize for that. SendGrid is stepping up their game, and I find it hard to be mad at people or companies for mistakes when they’re as honest and responsive as SendGrid has been. With both us and SendGrid keeping a closer eye on things, it’s unlikely that this will ever happen again. However, just to be on the safe side, we’ll be limiting the availability of the feature to ourselves and a handful of beta customers until we’re confident that things are running smoothly. If you have any questions or concerns, please don’t hesitate to contact us.