This episode currently has no reviews.
Submit ReviewAbout Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.
TranscriptCorey: Hello, and welcome back to our Networking in the Cloud mini series sponsored by ThousandEyes. That's right. ThousandEyes is state-of-the-cloud Performance Benchmark Report is now available for your perusal. It's really providing a lot of baseline that we're taking all of the miniseries information from. It pointed us in a bunch of interesting directions and helps us tell stories that are actually, for a change, backed by data rather than pure sarcasm. To get your copy, visit snark.cloud/realclouds because it only covers real cloud providers. Thanks again to ThousandEyes for their ridiculous support of this shockingly informative podcast mini series.
It's a basic fact of cloud that things break all the time. I've been joking for a while that a big competitive advantage that Microsoft brings to this space is that they have 40 years of experience apologizing for software failures, except that's not really a joke. It's true. There's something to be said for the idea of apologizing to both technical and business people about real or perceived failures being its own skillset, and they have a lot more experience than anyone in this space.
There are two schools of thought around how to avoid having to apologize for service or component failures to your customers. The first is to build super expensive but super durable things, and you can kind of get away with this in typical data center environments right up until you can't, and then it turns out that your SAN just exploded. You're really not diversifying with most SANs. You're just putting all of your eggs in a really expensive basket, and of course, if you're still with power or networking outage, nothing can talk to the SAN, and you're back to square one.
The other approach is to come at it with a perspective of building in redundancy to everything and eliminating single points of failure. That's usually the better path in the cloud. You don't ever want to have a single point of failure if you can reasonably avoid it, so going with multiple everythings starts to make sense to a point. Going with a full on multi-cloud story is a whole separate kettle of nonsense we'll get to another time. But you realize at some point you will have single points of failure and you're not going to be able to solve for that. We still only have one planet going around one sun for example. If either of those things explode, well, computers aren't really anyone's concern anymore. However, betting the entire farm on one EC2 instance is generally something you'll want to avoid if at all possible.
In the world of AWS, there aren't data centers in the way that you or I would contextualize them. Instead, they have constructs known as availability zones and those composed to form a different construct called regions. Presumably, other cloud providers have similar constructs over in non-AWS land, but we're focusing on AWS as implementation in this series, again, because they have a giant multi-year head start over every other cloud provider, and even that manifests in those other cloud providers comparing what they've built and how they operate to AWS. If that upsets you and you work at one of those other cloud providers, well, you should have tried harder. Let's dive in to a discussion of data centers, availability zones, and regions today.
Take an empty warehouse and shove it full of server racks. Congratulations. You have built the bare minimum requirement for a data center at its most basic layer. Your primary constraint and why it's a lot harder than it sounds is power, and to a lesser extent, cooling. Computers aren't just crunching numbers, they're also throwing off waste heat. You've got to think an awful lot about how to keep that heat out of the data center.
At some point, you can't shove more capacity into that warehouse-style building just because you can't cool it if it's all running at the same time. If your data center's particularly robust, meaning you didn't cheap out on it, you're going to have different power distribution substations that feed the building from different lines that enter the building at different corners. You're going to see similar things with cooling as well, multiply redundant cooling systems.
One of the big challenges, of course, when dealing with this physical infrastructure is validating that what it says on the diagram is what's actually there in the physical environment. That can be a trickier thing to explore than you would hope. Also, if you have a whole bunch of systems sitting in that warehouse and you take a power outage, well, you have to plan for this thing known as inrush current.
Normally, it's steady state. Computers generally draw a known quantity of power. But when you first turn them on, if you've ever dealt with data center servers, the first thing they do is they power up everything to self-test. They sound like a jet fighter taking off as all the fans spin up. If you're not careful, and all these things turn on at once, you'll see a giant power spike that winds up causing issues, blowing breakers, maxing out consumption, so having a staggered start becomes a concern as well. Having spent too much time in data centers, I am painfully familiar with this problem of how you safely and sanely recover from site-wide events, but that's a bit out of scope, thankfully, because in the cloud, this is less of a problem.
Let's talk about the internet and getting connectivity to these things. This is the Networking in the Cloud podcast after all. You're ideally going to have multiple providers running fiber lines to that data center hoping to avoid fiber's natural predator, the noble backhoe. Now, ideally, all those fiber lines are going over different paths, but again, hard thing to prove, so doing your homework's important, but here's something folks don't always consider: If you have a hundred gigabit ethernet links to each computer, which is not cheap, but doable, and then you have 20 servers in a rack, each rack theoretically needs to be able to speak at least two terabit at all times to each other server in each other rack, and most of them can't do that. They wind up having bottle-necking issues.
As a result, when you have high-traffic applications speaking between systems, you need to make sure that they're aware of something known as rack affinity. In other words, are there bottlenecks between these systems, and how do you minimize those to make sure the crosstalk works responsibly? There are a lot of dragons in here, but let's hand-wave past all of it because we're talking about cloud here. The point of this is that there's an awful lot of nuance to running data centers, and AWS and other large cloud providers do a better job of it than you do. That's not me insulting your data center staff. That's just a fact. They have the scal...
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.
TranscriptCorey: Hello, and welcome back to our Networking in the Cloud mini series sponsored by ThousandEyes. That's right. ThousandEyes is state-of-the-cloud Performance Benchmark Report is now available for your perusal. It's really providing a lot of baseline that we're taking all of the miniseries information from. It pointed us in a bunch of interesting directions and helps us tell stories that are actually, for a change, backed by data rather than pure sarcasm. To get your copy, visit snark.cloud/realclouds because it only covers real cloud providers. Thanks again to ThousandEyes for their ridiculous support of this shockingly informative podcast mini series.
It's a basic fact of cloud that things break all the time. I've been joking for a while that a big competitive advantage that Microsoft brings to this space is that they have 40 years of experience apologizing for software failures, except that's not really a joke. It's true. There's something to be said for the idea of apologizing to both technical and business people about real or perceived failures being its own skillset, and they have a lot more experience than anyone in this space.
There are two schools of thought around how to avoid having to apologize for service or component failures to your customers. The first is to build super expensive but super durable things, and you can kind of get away with this in typical data center environments right up until you can't, and then it turns out that your SAN just exploded. You're really not diversifying with most SANs. You're just putting all of your eggs in a really expensive basket, and of course, if you're still with power or networking outage, nothing can talk to the SAN, and you're back to square one.
The other approach is to come at it with a perspective of building in redundancy to everything and eliminating single points of failure. That's usually the better path in the cloud. You don't ever want to have a single point of failure if you can reasonably avoid it, so going with multiple everythings starts to make sense to a point. Going with a full on multi-cloud story is a whole separate kettle of nonsense we'll get to another time. But you realize at some point you will have single points of failure and you're not going to be able to solve for that. We still only have one planet going around one sun for example. If either of those things explode, well, computers aren't really anyone's concern anymore. However, betting the entire farm on one EC2 instance is generally something you'll want to avoid if at all possible.
In the world of AWS, there aren't data centers in the way that you or I would contextualize them. Instead, they have constructs known as availability zones and those composed to form a different construct called regions. Presumably, other cloud providers have similar constructs over in non-AWS land, but we're focusing on AWS as implementation in this series, again, because they have a giant multi-year head start over every other cloud provider, and even that manifests in those other cloud providers comparing what they've built and how they operate to AWS. If that upsets you and you work at one of those other cloud providers, well, you should have tried harder. Let's dive in to a discussion of data centers, availability zones, and regions today.
Take an empty warehouse and shove it full of server racks. Congratulations. You have built the bare minimum requirement for a data center at its most basic layer. Your primary constraint and why it's a lot harder than it sounds is power, and to a lesser extent, cooling. Computers aren't just crunching numbers, they're also throwing off waste heat. You've got to think an awful lot about how to keep that heat out of the data center.
At some point, you can't shove more capacity into that warehouse-style building just because you can't cool it if it's all running at the same time. If your data center's particularly robust, meaning you didn't cheap out on it, you're going to have different power distribution substations that feed the building from different lines that enter the building at different corners. You're going to see similar things with cooling as well, multiply redundant cooling systems.
One of the big challenges, of course, when dealing with this physical infrastructure is validating that what it says on the diagram is what's actually there in the physical environment. That can be a trickier thing to explore than you would hope. Also, if you have a whole bunch of systems sitting in that warehouse and you take a power outage, well, you have to plan for this thing known as inrush current.
Normally, it's steady state. Computers generally draw a known quantity of power. But when you first turn them on, if you've ever dealt with data center servers, the first thing they do is they power up everything to self-test. They sound like a jet fighter taking off as all the fans spin up. If you're not careful, and all these things turn on at once, you'll see a giant power spike that winds up causing issues, blowing breakers, maxing out consumption, so having a staggered start becomes a concern as well. Having spent too much time in data centers, I am painfully familiar with this problem of how you safely and sanely recover from site-wide events, but that's a bit out of scope, thankfully, because in the cloud, this is less of a problem.
Let's talk about the internet and getting connectivity to these things. This is the Networking in the Cloud podcast after all. You're ideally going to have multiple providers running fiber lines to that data center hoping to avoid fiber's natural predator, the noble backhoe. Now, ideally, all those fiber lines are going over different paths, but again, hard thing to prove, so doing your homework's important, but here's something folks don't always consider: If you have a hundred gigabit ethernet links to each computer, which is not cheap, but doable, and then you have 20 servers in a rack, each rack theoretically needs to be able to speak at least two terabit at all times to each other server in each other rack, and most of them can't do that. They wind up having bottle-necking issues.
As a result, when you have high-traffic applications speaking between systems, you need to make sure that they're aware of something known as rack affinity. In other words, are there bottlenecks between these systems, and how do you minimize those to make sure the crosstalk works responsibly? There are a lot of dragons in here, but let's hand-wave past all of it because we're talking about cloud here. The point of this is that there's an awful lot of nuance to running data centers, and AWS and other large cloud providers do a better job of it than you do. That's not me insulting your data center staff. That's just a fact. They have the scal...
This episode currently has no reviews.
Submit ReviewThis episode could use a review! Have anything to say about it? Share your thoughts using the button below.
Submit Review