Hmm. Here's what I read from this article: RedPanda didn't happen to use any of the stuff in GCP that went down, so they were unaffected. They use a 3rd party for alerting and dashboarding, and that 3rd party went down, but RedPanda still had their own monitoring.
When I read "major outage for a large part of the internet was just another normal day for Redpanda Cloud customers", I expected a brave tale of RedPanda SREs valiantly fixing things, or some cool automatic failover tech. What I got instead was: Google told RedPanda there was an issue, RedPanda had a look and their service was unaffected, nothing needed failing over, then someone at RedPanda wrote an article bragging about their triple-nine uptime & fault tolerance.
I get it, an SRE is doing well if you don't notice them, but the only real preventative measure I saw here that directly helped with this issue is that they over-provision disk space. Which I'd be alarmed if they didn't do.
Yeah, I thought they were going to show something cool like multi-tenant architecture. Odd to write this article when it was clear they expected to be impacted, given that they were reaching out to customers.
I think you're missing the point. What I took away was: "Because we design for zero dependencies for full operation, we didn't go down." Their extra features like tiered storage and monitoring going down didn't affect normal operations, whereas it seems it did for similar solutions with similar features.
> I expected a brave tale of RedPanda SREs valiantly fixing things, or some cool automatic failover tech.
It's a tale of how they set things up so they wouldn't need to valiantly fix things, and I think the subtext is probably that Redpanda doesn't pass responsibility on to a third party.
There are plenty of domains and, more importantly, people who need uptime guarantees to mean "fix estimate from a real human working on the problem" and not eventual store credit. Payroll is a classic example.
Nothing about the way they architected their system even mattered in this incident. Their service just wasn't using any of the infrastructure that failed - there was no event here that actually put their system design to the test. There just isn't a story here.
It's like if the power went out in the building next door, and you wrote a blog post about how amazing the reliability of your office computers is compared to your neighbor's. If your power had gone out too but you had provisioned a bunch of UPSs and been fine, then there's something to talk about.
To extend the analogy, if the neighborhood had a reputation for brown-outs and you deliberately chose not to build an office there, then maybe you have something. But here, RedPanda's GCP offering is inside GCP, this failure in GCP has never happened before, they just got lucky.
> triple-nine uptime & fault tolerance.
Haha, we used to joke that's how many nines our customer-facing Ruby on Rails services had compared against our resilient five nines payments systems. Our heavy infra handled billions in daily payment volume and couldn't go down.
With the Ruby teams, we often playfully quipped, "which nines are those?", implying the leading digit itself wasn't a nine.
AKA: "We're closing in on our third 8 of uptime..."