Apr 9, 2019

Topics:

HubSpot

A Breakdown of HubSpot’s Outage Retrospective for the Non-Technical User

If you’re a HubSpot user, you recently experienced one of the most catastrophic outages HubSpot has ever had (unless you were on vacation from March 29 until now, in which case I’m jealous!).

Honestly, it was a mess. Most product outages affect one or two tools that stop working or start acting buggy, and they’re resolved within minutes or hours.

This time it was just about everything: emails, form submissions, workflows, lists, sales tools, CRM, imports, analytics. It’s hard to find any pieces of the tool that WEREN’T affected during that time.

What’s even more troubling is that this outage took about 36 hours to be mostly resolved, and the processing of the backlog of data from those 36 hours was still going on at the time this article was published.

Thankfully, during all of this, HubSpot’s crisis communication was solid. Updates on status.hubspot.com were regular and timely (although are they ever frequent enough??) given the scale of the situation.

JD Sherman (HubSpot’s COO) released an article on March 29 with an apology and an outline of next steps for the team -- namely, doing an in-depth retrospective on the cause of the issue and how they’ll make sure it won’t happen again.

That retrospective was delivered on April 4. You can read the full article here. There’s a lot of detail in there about how their systems are structured and what exactly happened. If you’re not into all of the “geek-speak,” we’ve got you covered.

A Quick Review of How HubSpot’s Infrastructure Works

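HubSpot uses a combination of two software systems -- Kafka and ZooKeeper -- that allow all of the HubSpot tools to talk to each other and all of the data to be processed effectively. Roughly speaking, Kafka carries messages between the tools, and ZooKeeper is the coordination service that keeps Kafka’s servers in sync.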

Both of these systems have redundancies and safeguards built in, so that if some servers crash, other servers can pick up the slack and end users don’t experience any issues.
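If you’re curious what “other servers can pick up the slack” looks like in practice, here is a minimal sketch in Python. It is a toy model only: the `Server` class, the `send_with_failover` function, and the server names are made up for illustration and are not how HubSpot, Kafka, or ZooKeeper actually implement redundancy.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Server:
    """A hypothetical server that can be healthy or crashed."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def send_with_failover(servers: list[Server], request: str) -> str:
    """Try each server in turn and return the first successful response."""
    for server in servers:
        try:
            return server.handle(request)
        except ConnectionError:
            continue  # this server is down, so try the next one
    # Only when every server is down does the user notice a problem.
    raise RuntimeError("all servers are down: this is an outage")


cluster = [Server("server-a"), Server("server-b"), Server("server-c")]
cluster[0].healthy = False  # one server crashes
result = send_with_failover(cluster, "form submission")
print(result)  # server-b picks up the slack
```

The takeaway: one crashed server goes unnoticed because another one answers. An outage only becomes visible when the whole group fails at once, which is the situation described in the next section.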

So What Broke?

It’s a bit difficult to explain without getting super technical, but think about it like a series of unfortunate events.

High strain was put onto ZooKeeper, causing parts of it to crash. Typically, ZooKeeper recovers quickly, but in this case, it took several minutes. The delay in recovery then broke the communication between ZooKeeper and Kafka, causing Kafka to crash.

Even though the team was able to restore ZooKeeper, the damage was done in Kafka, and it wasn’t able to recover. Things got worse when a second ZooKeeper outage coincided with attempts to restart Kafka, which started to cause data corruption.

Why Did It Take So Long to Fix?

Corrupted data? That sounds bad. And well, it is. This is actually why some things took so long to come back online.

When the HubSpot team realized that the server recovery was starting to corrupt data, they had a decision to make: either focus on recovering data (and safeguarding against corrupted data) or focus on restoring the tools.

They decided to focus on recovering data to ensure that there would be no gaps in historical data for customers (which in the long run they believe to be the right decision, and I’d personally agree!). This is the reason that the affected tools took almost 36 hours to be restored.

So, in the name of protecting customer data, HubSpot manually recovered a whoooole bunch of our data, and then was able to restore the affected tools.

This is also why you’re still seeing (at the time this article was published) the “continuing to process data from March 28 & 29” status message from HubSpot.

What Now?

Now that we know exactly what happened, HubSpot’s got a plan to make sure this never happens again. An interesting note in all of this is that HubSpot’s own teams use many of their tools across different parts of the business, so this affected not only their customers but also their own business (even more motivation to make sure it never happens again!).

They’re making changes in a few different areas to protect against another outage: technical/infrastructure, reliability, testing, and communication.

Technical / Infrastructure

As is to be expected, HubSpot will be doing some restructuring of their server clusters to make sure it’s not even possible to have an outage this large again. By doing this, any outage that does happen should be restricted to a small piece of the platform, and the recovery time for issues should be significantly quicker.
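To picture why splitting things up limits the damage, here is a small Python sketch. It is an illustration under assumptions: the tool names, the cluster names, and the `affected_tools` function are hypothetical, and HubSpot’s real layout is certainly more involved.

```python
def affected_tools(tool_to_cluster: dict, failed_cluster: str) -> list:
    """Return the tools that depend on the failed cluster, sorted by name."""
    return sorted(
        tool for tool, cluster in tool_to_cluster.items()
        if cluster == failed_cluster
    )


tools = ["analytics", "crm", "email", "forms", "workflows"]

# Before: every tool depends on one big shared cluster (assumed layout).
shared = {tool: "cluster-1" for tool in tools}

# After: each tool gets its own smaller cluster (assumed layout).
isolated = {tool: f"cluster-{i}" for i, tool in enumerate(tools, start=1)}

shared_impact = affected_tools(shared, "cluster-1")
isolated_impact = affected_tools(isolated, "cluster-1")

print(len(shared_impact), "tools down with one shared cluster")
print(len(isolated_impact), "tool down with separate clusters")
```

With one shared cluster, a single failure takes down all five tools in this toy example. With separate clusters, the same failure takes down only one.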

Reliability

HubSpot does have a team of people who test and upgrade their systems, but it hasn’t been as high a priority as it should have been. Now, they’ll have a dedicated team of people who will “oversee new standards, frequencies, and resources to ensure that we're consistently evaluating our key infrastructure systems for code fixes and critical patches without gaps.”

Testing

Along with investing much more heavily in the reliability of their platform, HubSpot is also increasing how often and how deeply they test their systems. Again, it’s not that these processes didn’t exist before, but this outage uncovered some gaps in how often they test for massive failures, as well as how comprehensively they test these systems.

Communication

Lastly, HubSpot is committing to making their communication during any major incident more frequent and helpful, specifically in the minutes and hours immediately following an issue.

Their status updates will now include more detailed explanations of what is going on, as well as when the next update can be expected.

In Conclusion

No one here is pretending that this outage wasn’t bad. Not even HubSpot. But one of the things I appreciate the most about HubSpot as an organization is their transparency and willingness to admit when they’ve messed up.

They know the impact this had on their customers, and on their own business, and they’re actively seeking to make sure it never happens again.

So, even if you’re a little rattled by this outage, know that improvements are being made, fixes are being implemented, and HubSpot will continue to make their product the best it can be. Okay -- HubSpot lovefest over!
