About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.

Transcript

Corey: Knock knock. Who's there? A DDoS attack. A DDoS a... Knock. Knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock, knock.
Welcome to what we're calling Networking in the Cloud, episode six: How Things Break in the Cloud, sponsored by ThousandEyes. ThousandEyes recently launched their State of the Cloud performance benchmark report, which effectively lets you compare and contrast performance and other aspects across the five large cloud providers: AWS, Azure, GCP, Alibaba Cloud, and IBM Cloud. Oracle Cloud was not invited because we are talking about real clouds here. You can get your copy of this report at snark.cloud/realclouds, and they compare and contrast an awful lot of interesting things. One thing that we're not going to compare and contrast, though, because of my own personal beliefs, is the outages of different cloud providers.
Making people in companies feel crappy about their downtime is mean, first off; companies, by the way, are composed of people. Secondly, if companies are shamed for outages, it in turn makes it far likelier that they won't disclose having suffered an outage. And when companies talk about their outages in constructive, blameless ways, there are incredibly valuable lessons that we all can learn from them. So let's dive into this a bit.
If there's one thing that computers do well, better than almost anything else, it's break. And this is, and I'm not being sarcastic when I say this, a significant edge that Microsoft has when it comes to cloud. They have 40-some-odd years of experience in apologizing for software failures. That's not meant to be insulting to Microsoft; it's what computers do, they break. And being able to explain that intelligently to business stakeholders is incredibly important. They're masters at that. They also have a 20-year head start on everyone else in the space. What makes this interesting and useful is that in the cloud, computers break differently than people would expect them to in a non-cloud environment.
Once upon a time, when you were running servers in data centers, if you saw everything suddenly go offline, you had some options. You could call the data center directly to see if someone cut the fiber; in case you were unaware, the fiber optic cable's sole natural predator in the food chain is the mighty backhoe. So maybe something backhoed out some fiber lines, maybe the power is dead to the data center, maybe the entire thing exploded, burst into flames, and burned to the ground, but you can call people. In the cloud, it doesn't work that way. Here in the cloud, instead you check Twitter, because it's 3:00 AM and Nagios, the original call of duty, or PagerDuty calls you, because you didn't need that sleep anyway, telling you there is something amiss with your site. So when a large cloud provider takes an outage and you're hanging out on Twitter at two in the morning, you can see DevOps Twitter come to life in the middle of the night as they chatter back and forth.
And incidentally, if that's you, understand a nuance of AWS availability zone naming. When someone says "us-east-1a is having a problem" and someone else says, "No, I just see us-east-1c having a problem," you're probably talking about the same availability zone. Those letters are mapped differently, non-deterministically, between accounts. You can pull zone IDs instead, and those are consistent across accounts. By and large, the shuffled letters were originally there to avoid problems like everyone picking A, as humans tend to do, or C getting a reputation as the crappy one.
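If you want to see how your account's zone letters map to the underlying physical zones, here's a minimal sketch using boto3's describe_availability_zones call; the region is just an example choice.

# Minimal sketch: map this account's AZ letters (ZoneName) to the
# account-independent zone IDs (ZoneId). Region is an example choice.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.describe_availability_zones()

for az in response["AvailabilityZones"]:
    # ZoneName (e.g. us-east-1a) differs between accounts;
    # ZoneId (e.g. use1-az1) refers to the same physical zone everywhere.
    print(f"{az['ZoneName']} -> {az['ZoneId']} ({az['State']})")

Run that in two different accounts and you can confirm whether your "us-east-1a" and a coworker's "us-east-1c" are actually the same place.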
So why would you check Twitter to figure out if your cloud provider's having a massive outage? Well, because honestly, the AWS status page is completely full of lies and gaslights you. It is as green as the healthiest Christmas tree you can imagine, even when things have been exploding for a disturbingly long period of time. If you visit the website stop.lying.cloud, you'll find a Lambda@Edge function that I've put there that cuts out some of the cruft, but it's not perfect. And the reason behind this, after I gave them a bit too much crap one day and got a phone call that started with, "Now you listen here," is that there are humans in the loop. They need to validate that there is in fact a systemic issue at AWS, figure out what that issue might be, come up with a way to report it that ideally doesn't get people sued, and then manually update the status page. Meanwhile, your site's on fire. So that is a trailing function, not a leading function.
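If you'd rather not keep refreshing that page by hand, you can at least poll its feeds programmatically and surface new items into chat or your alerting. A minimal sketch follows; the RSS URL is an assumption based on the feeds AWS has historically published and has moved around over the years, so treat it as a placeholder for whatever status feed your provider currently exposes.

# Minimal sketch: poll a status RSS feed and print recent items.
# The feed URL is an assumption; swap in the feed your provider publishes.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://status.aws.amazon.com/rss/all.rss"  # assumed URL

with urllib.request.urlopen(FEED_URL, timeout=10) as resp:
    tree = ET.parse(resp)

for item in tree.getroot().iter("item"):
    title = item.findtext("title", default="(no title)")
    published = item.findtext("pubDate", default="(no date)")
    print(f"{published}: {title}")

Just remember the humans-in-the-loop caveat above: anything you scrape from the status page is still a trailing signal.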
Alternately, you could always check ThousandEyes. That's right, this episode is sponsored by ThousandEyes. In addition to the report we mentioned earlier, you can think of them as the Google Maps of the internet, without the creepy privacy overreach issues. Just like you wouldn't necessarily want to commute during rush hour without checking where traffic is going to be and which route is faster, businesses rely on ThousandEyes to see the end-to-end paths their applications and services are taking in real time, to identify where the slowdowns are, where the outages are, and what's causing problems. They use ThousandEyes to see what's breaking where, and then, importantly, ThousandEyes shares that data directly with the offending service providers. Not just to hold them accountable, but also to get them to fix the issue fast, ideally before it impacts users. But on this episode, it already has.
So let's say that you don't have the good sense to pay for ThousandEyes, or you're not on Twitter for whatever reason, watching people flail around helplessly trying to figure out what's going on. Instead, you're now trying desperately to figure out whether this issue is the last deploy your team did or a global problem. The first thing people try to do in the event of an issue is, "Oh crap, what did we just change? Undo it." And often that is a knee-jerk response that can make things worse if it's not actually your code that caused the problem. Worse, it can eat up precious time at the beginning of an outage. If you knew that it was a single availability zone or an entire AWS region that was having a problem, you could instead be working to fail over to a different location, instead of wasting valuable incident response time checking Twitter or looking over your last 200 commits.
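One cheap way to make that call faster is an outside-in triage check: probe your own endpoint alongside something you don't operate in the same region, and your own stack in another region. If everything in one region fails while the other region answers, the problem probably isn't your last deploy. Here's a minimal sketch of that idea; all the hostnames are hypothetical placeholders, not real services.

# Minimal triage sketch: probe a few endpoints to separate "our deploy broke"
# from "the region is having a bad day." All hostnames are hypothetical.
import urllib.request

CHECKS = {
    "our app, us-east-1": "https://app.example.com/healthz",
    "our app, us-west-2": "https://app-usw2.example.com/healthz",
    "unrelated service, us-east-1": "https://other-service.example.net/ping",
}

def probe(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except Exception as exc:
        return f"FAILED ({type(exc).__name__})"

for name, url in CHECKS.items():
    print(f"{name:30} -> {probe(url)}")

It won't tell you what broke, but it can tell you in a few seconds whether rolling back your deploy is even the right instinct.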
Part of the problem, and the reason this is the way that it is, is that unlike the rusting computers in your data center currently being savaged by raccoons, things in the cloud break differently. You don't have the same diagnostic tools, you don't have the same level of visibility into what the hardware is doing, and the behaviors themselves are radically different. I have a half-dozen tips and tricks on how to monitor whether or not your data center's experiencing a problem r...