Listener Questions 1
Publisher |
Corey Quinn
Media Type |
audio
Categories Via RSS |
Business News
News
Tech News
Publication Date |
Feb 12, 2021
Episode Duration |
00:22:13

Links:

Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I'm Pete Cheslock.

Jesse: I'm Jesse DeRose.

Pete: We're back again. Hashtag Triple F.

Jesse: It's going to be a thing.

Pete: We're still trying to make it a thing. Desperately trying to make it a thing. Otherwise, we're just going to look like fools, Jesse, if it's not a thing.

Jesse: Oh now, I wouldn't want to look like a fool, you know, next to anybody else in my company.

Pete: [laugh]. It definitely seems to be the one that trait you need to have to work at Duckbill is, to be okay looking like a fool. So, we are midway through the Unconventional Guide to AWS Cost Optimizations, cost savings. And we have been sharing a link on pretty much if not all of these recordings where you can send us feedback. And you can send us questions. And someone finally sent us a question. I think people are listening out there, Jesse. Isn't that great? 

Jesse: We have one follower. Yay.

Pete: It's amazing. So, we are really happy that someone asked us a question. You can be the next person to ask us a question by going into lastweekinaws.com/QA. That's not our quality assurance site for testing, new branding things, and new products. QA is for ‘question and answer.’ 

So, go there, check it out, drop in a message, you can put your name there or not, it's totally fine. But this first question—well, first, I need to actually, I need to admit something. I'm lying right now. This question actually came in months ago. We saw it and thought that was a great question, we should answer it at some point. And then we forgot about it. So we're bringing it back up again, and I think it's relevant so I don't feel too bad about it.

Jesse: Yeah, we saw this question around the time that we started recording the entire Unconventional Guide series. And apologies to this listener. This is a very good question. We want to talk about it, so we are talking about it today. But it took a little bit of a time for us to get to this. 

Pete: But you know what? We made it. We're here.

Jesse: We’re here.

Pete: We're here. So, Nick Moore asked this great question. He said, “Hey, Pete and Jesse. Very much enjoying your Friday segment on the Morning Brief.” Thank you very much for that. “If possible, I'd like to hear you talk about your experiences with cost optimization for quote, ‘big data’ projects in the cloud, i.e. Using platforms like Hadoop to process large and complex data, either using pass—like, EMR or [IS 00:03:03]. Is this something that your customers ask about often/at all? And how do or would you approach it? Thanks, again.” 

Well, hey, this is a truly awesome question. And at a high level, many of our clients actually are pretty heavy users of various Amazon services for their, kind of, big data needs. And big data, it's all relative, right? I mean, to some companies, big data is in the hundreds of terabytes, to other companies it's in the hundreds of petabytes. It's totally relative, but at the end of the day, it's going to be a challenge, no matter how big of a company you are. Your big data challenges are always a challenge.

Jesse: You've got some kind of data science or data analytics work that you want to do with large data sets. That may be large datasets comparatively to the work that you're doing; that may be large data sets comparatively to the industry. Doesn't matter. Either way, it is big data projects, and there are many, many, many, many solutions out there.

Pete: What's interesting, too, is I think the reason that this has grown in prevalence over the last year, more of our clients have been using more of these services is simply because the barrier to entry on these projects, on these engagements, is so low. You can get started on Amazon with some Athena and Glue, maybe some EMR, for just an incredibly low cost. And also, from a technical standpoint, it's not that challenging. I mean, as a good example, most reasonably technical people could take their cost and usage report, get it integrated into Athena using AWS Glue in minutes. I mean, without using CloudFormation. I mean just clicking through to set it up. And honestly, for some clients, their cost and usage reports, and that's a big data problem. That could be—if you're not storing it in Parquet, if you're actually storing it in CSV because you're a mad person, those could be hundreds of gigabytes a day in volume.

Jesse: Yeah. So, when we talk about big data tasks, there's a couple different services that we generally see folks using within AWS. We generally see S3, Kinesis, and most obviously, EMR.

Pete: Yeah, exactly. And we're seeing new services like Kinesis, expanding on Kinesis: Kinesis Firehose, when that came out; people are using that for some of their big data needs, especially when trying to stream data into S3. That's a really powerful feature that Firehose can do. And then, once it's in S3, the question that our clients often ask is, kind of, “What do I do with it now?” And if we dive into just S3, and you've got your data in S3, where are the kinds of places that we see unnecessary charges for data warehouse tasks? 

Jesse: Honestly, it's unfortunately kind of both of the major places that you're going to be charged for S3 which is, for your storage costs, and for your requests. 

Pete: So, what you're saying is that all S3 charges are unnecessary. [laugh].

Jesse: Just get rid of it. Just put all that on EBS volume somewhere. Turn off your S3, you're solid. 

Pete: Exactly. It is kind of funny, but it's true. I mean, there's ways to abuse both of those pricing models, whether it's storage or requests. The first place that we honestly see a lot of this is just people are data pack rats. And let's be honest; I'm one of them as well, I have a NAS setup at home with, like, 30 terabytes of hard drives on it. 

I don't throw anything away digitally. Turns out most of our other clients are the exact same way, and sadly, a lot of them use standard storage for S3, which we talk about often. It's common: you get started with the standard storage, that's a great place. But for big data tasks, it's often the wrong storage solution, especially for data that maybe has already been transformed and is stored in a more efficient format; maybe it's queried infrequently. There's two ways to so...

Join Pete and Jesse as they take a question from the field and talk about their experiences optimizing big data projects in the cloud. They touch upon how big data challenges are challenging whether you’re talking about terabytes or petabytes, the most popular services for big data projects in AWS, how people are essentially digital hoarders today and never throw any data out, why Pete believes more people should take advantage of Glacier Deep Archive, tricks for optimizing Parquet files, what the Kinesis outage meant for many Duckbill Group clients, why you may need to rethink your approach to compression, how Jesse thinks not enough clients use spot instances, and more.

Links:

Transcript

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I’m going to just guess that it’s awful because it’s always awful. No one loves their deployment process. What if launching new features didn’t require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren’t what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Pete: Hello, and welcome to the AWS Morning Brief: Fridays From the Field. I'm Pete Cheslock.

Jesse: I'm Jesse DeRose.

Pete: We're back again. Hashtag Triple F.

Jesse: It's going to be a thing.

Pete: We're still trying to make it a thing. Desperately trying to make it a thing. Otherwise, we're just going to look like fools, Jesse, if it's not a thing.

Jesse: Oh now, I wouldn't want to look like a fool, you know, next to anybody else in my company.

Pete: [laugh]. It definitely seems to be the one that trait you need to have to work at Duckbill is, to be okay looking like a fool. So, we are midway through the Unconventional Guide to AWS Cost Optimizations, cost savings. And we have been sharing a link on pretty much if not all of these recordings where you can send us feedback. And you can send us questions. And someone finally sent us a question. I think people are listening out there, Jesse. Isn't that great? 

Jesse: We have one follower. Yay.

Pete: It's amazing. So, we are really happy that someone asked us a question. You can be the next person to ask us a question by going into lastweekinaws.com/QA. That's not our quality assurance site for testing, new branding things, and new products. QA is for ‘question and answer.’ 

So, go there, check it out, drop in a message, you can put your name there or not, it's totally fine. But this first question—well, first, I need to actually, I need to admit something. I'm lying right now. This question actually came in months ago. We saw it and thought that was a great question, we should answer it at some point. And then we forgot about it. So we're bringing it back up again, and I think it's relevant so I don't feel too bad about it.

Jesse: Yeah, we saw this question around the time that we started recording the entire Unconventional Guide series. And apologies to this listener. This is a very good question. We want to talk about it, so we are talking about it today. But it took a little bit of a time for us to get to this. 

Pete: But you know what? We made it. We're here.

Jesse: We’re here.

Pete: We're here. So, Nick Moore asked this great question. He said, “Hey, Pete and Jesse. Very much enjoying your Friday segment on the Morning Brief.” Thank you very much for that. “If possible, I'd like to hear you talk about your experiences with cost optimization for quote, ‘big data’ projects in the cloud, i.e. Using platforms like Hadoop to process large and complex data, either using pass—like, EMR or [IS 00:03:03]. Is this something that your customers ask about often/at all? And how do or would you approach it? Thanks, again.” 

Well, hey, this is a truly awesome question. And at a high level, many of our clients actually are pretty heavy users of various Amazon services for their, kind of, big data needs. And big data, it's all relative, right? I mean, to some companies, big data is in the hundreds of terabytes, to other companies it's in the hundreds of petabytes. It's totally relative, but at the end of the day, it's going to be a challenge, no matter how big of a company you are. Your big data challenges are always a challenge.

Jesse: You've got some kind of data science or data analytics work that you want to do with large data sets. That may be large datasets comparatively to the work that you're doing; that may be large data sets comparatively to the industry. Doesn't matter. Either way, it is big data projects, and there are many, many, many, many solutions out there.

Pete: What's interesting, too, is I think the reason that this has grown in prevalence over the last year, more of our clients have been using more of these services is simply because the barrier to entry on these projects, on these engagements, is so low. You can get started on Amazon with some Athena and Glue, maybe some EMR, for just an incredibly low cost. And also, from a technical standpoint, it's not that challenging. I mean, as a good example, most reasonably technical people could take their cost and usage report, get it integrated into Athena using AWS Glue in minutes. I mean, without using CloudFormation. I mean just clicking through to set it up. And honestly, for some clients, their cost and usage reports, and that's a big data problem. That could be—if you're not storing it in Parquet, if you're actually storing it in CSV because you're a mad person, those could be hundreds of gigabytes a day in volume.

Jesse: Yeah. So, when we talk about big data tasks, there's a couple different services that we generally see folks using within AWS. We generally see S3, Kinesis, and most obviously, EMR.

Pete: Yeah, exactly. And we're seeing new services like Kinesis, expanding on Kinesis: Kinesis Firehose, when that came out; people are using that for some of their big data needs, especially when trying to stream data into S3. That's a really powerful feature that Firehose can do. And then, once it's in S3, the question that our clients often ask is, kind of, “What do I do with it now?” And if we dive into just S3, and you've got your data in S3, where are the kinds of places that we see unnecessary charges for data warehouse tasks? 

Jesse: Honestly, it's unfortunately kind of both of the major places that you're going to be charged for S3 which is, for your storage costs, and for your requests. 

Pete: So, what you're saying is that all S3 charges are unnecessary. [laugh].

Jesse: Just get rid of it. Just put all that on EBS volume somewhere. Turn off your S3, you're solid. 

Pete: Exactly. It is kind of funny, but it's true. I mean, there's ways to abuse both of those pricing models, whether it's storage or requests. The first place that we honestly see a lot of this is just people are data pack rats. And let's be honest; I'm one of them as well, I have a NAS setup at home with, like, 30 terabytes of hard drives on it. 

I don't throw anything away digitally. Turns out most of our other clients are the exact same way, and sadly, a lot of them use standard storage for S3, which we talk about often. It's common: you get started with the standard storage, that's a great place. But for big data tasks, it's often the wrong storage solution, especially for data that maybe has already been transformed and is stored in a more efficient format; maybe it's queried infrequently. There's two ways to so...

This episode currently has no reviews.

Submit Review
This episode could use a review!

This episode could use a review! Have anything to say about it? Share your thoughts using the button below.

Submit Review