Already, a good dozen companies that purport to “right size” or “adjust your instances’ sizes and families to fit your workload” are shrieking and reaching for the torches and pitchforks, but hear me out.
We’re all on the same side here: Nobody wants to see money wasted on cloud services.
The theory behind right sizing goes something like this: You walk into a new environment, one you’ve never seen before. You start poking around. Ha! You see a bunch of m3.2xlarge instances running. Upgrading them to m5.2xlarge instances instantly saves 28%.
“What ancient moron set this up?!” you confidently ask—invariably to said “ancient moron,” who is now actively incentivized not only to fight any recommendation you might possibly make, but also to see you run over in a tragic parking lot incident.
“If you upgrade to m5.2xlarge instances instead, you’ll save 3.4 cents per hour per instance! Plus, your systems are idle most of the time; making them m5.larges instead saves an additional x per instance per hour! It’s a slam dunk! Now, here’s the part where you pay me.”
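To be fair, the consultant’s math is trivially easy to reproduce. Here’s a minimal sketch in Python, using illustrative us-east-1 Linux on-demand rates (actual prices drift over time; check the current price list before quoting anyone a percentage):

```python
# Back-of-the-envelope "right sizing" savings math.
# Prices are illustrative on-demand, us-east-1, Linux rates;
# confirm against the current AWS price list before relying on them.
HOURLY_PRICE = {
    "m3.2xlarge": 0.532,
    "m5.2xlarge": 0.384,
    "m5.large": 0.096,
}

def savings(current: str, proposed: str, instance_count: int = 1) -> dict:
    """Return the per-hour, per-month, and percentage savings of a proposed swap."""
    old, new = HOURLY_PRICE[current], HOURLY_PRICE[proposed]
    delta = (old - new) * instance_count
    return {
        "per_hour": round(delta, 3),
        "per_month": round(delta * 730, 2),   # ~730 hours in a month
        "percent": round((old - new) / old * 100, 1),
    }

if __name__ == "__main__":
    print(savings("m3.2xlarge", "m5.2xlarge", instance_count=10))
    print(savings("m3.2xlarge", "m5.large", instance_count=10))
```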
Why Right Sizing Is Wrong
On paper, right sizing makes an awful lot of sense.
Your existing nodes are largely bored (the average CPU utilization of EC2 instances often hovers in the single digits), newer instance families are a lot more efficient (and cost effective—i3 instances are roughly a third the cost of i2 instances), and nobody wants your workloads to sit idle.
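If you’d rather verify the “largely bored” claim against your own fleet than take anyone’s word for it, CloudWatch already has the data. A minimal sketch with boto3 (the instance ID and the two-week lookback are placeholders):

```python
# Pull average CPU utilization for one instance over the last two weeks.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=now - timedelta(days=14),
    EndTime=now,
    Period=3600,          # one datapoint per hour
    Statistics=["Average"],
)

datapoints = resp["Datapoints"]
if datapoints:
    avg = sum(dp["Average"] for dp in datapoints) / len(datapoints)
    print(f"Average CPU over 14 days: {avg:.1f}%")
```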
That said, virtually nobody actually right sizes their instances, and suggesting that someone do it as a low-effort change is almost always a mistake.
Why is that?
It turns out that despite what the modern best practice evangelists preach at disturbingly high volume, an awful lot of workloads are “legacy,” which is condescending-engineer-speak for “actually makes money.”
They generally aren’t terrific at handling cluster members joining or leaving, they’re monolithic, and it’s a near-certainty that they’ve got system dependencies that will bust themselves to chunks if deployed on a more modern OS.
Newer instance families use different hypervisors (Xen for the old, Nitro for the new), which means two things.
First, older versions of operating systems don’t support the newer hypervisor. So you’re not just migrating instances; you’re upgrading the entire OS as well—which drags along a hideous number of version dependencies. If you’ve containerized your workload to the point where you don’t care about this, great; you probably want to look at spot fleets instead. (A quick way to check which hypervisor a given instance type uses is sketched below.)
Second, a lot of these workloads are “certified” by either external vendors or internal divisions to run on certain versions of various bundled libraries. Suddenly, upgrading them to the newest version doesn’t work nearly as well as you’d hope.
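If you’re not sure which side of the Xen/Nitro divide a given instance type sits on, you don’t have to guess; EC2 will tell you. A minimal sketch with boto3 (the instance types here are just examples, and very old families may no longer be describable in every region):

```python
# Ask EC2 which hypervisor backs a handful of instance types.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_instance_types(
    InstanceTypes=["m3.2xlarge", "m4.2xlarge", "m5.2xlarge"]  # example types
)
for it in resp["InstanceTypes"]:
    # Bare-metal types report no hypervisor at all, hence the .get()
    print(it["InstanceType"], "->", it.get("Hypervisor", "none"))
# Typically the m3/m4 generations report "xen" and m5 reports "nitro".
```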
Also, unless you had the foresight to buy convertible reserved instances to begin with, being forced into this migration will likely strand a lot of money you’ve already committed. Within a family and generation (for example, m3, c4, t2…), size doesn’t matter: a single 4xlarge decomposes into four xlarge instances, and vice versa. But standard reserved instances don’t carry over between generations or families.
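That “size doesn’t matter” rule comes from AWS’s normalization factors, which assign each size a fixed number of units (an xlarge is 8, a 4xlarge is 32, and so on); note that this size flexibility only applies to regional Linux reservations with default tenancy. A quick sketch of the arithmetic:

```python
# Size flexibility within a family/generation, expressed in AWS's
# published normalization factors (units per instance size).
NORMALIZATION = {
    "large": 4,
    "xlarge": 8,
    "2xlarge": 16,
    "4xlarge": 32,
    "8xlarge": 64,
}

def equivalent_count(owned_size: str, target_size: str) -> float:
    """How many target-size instances one owned reservation covers."""
    return NORMALIZATION[owned_size] / NORMALIZATION[target_size]

print(equivalent_count("4xlarge", "xlarge"))  # 4.0  -- one 4xlarge RI covers four xlarges
print(equivalent_count("xlarge", "4xlarge"))  # 0.25 -- or a quarter of a 4xlarge
```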
Lastly, you’re almost certainly using the wrong instances. With over 180 distinct SKUs in us-east-1 alone, you’re statistically almost never going to be using the proper instance type for your workload unless you spend significant time benchmarking your application.
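If the “over 180 distinct SKUs” figure sounds like an exaggeration, count them yourself; the number has only grown since it was written. A minimal sketch with boto3:

```python
# Count every instance type EC2 offers in a single region.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

offerings = set()
paginator = ec2.get_paginator("describe_instance_type_offerings")
for page in paginator.paginate(LocationType="region"):
    for offering in page["InstanceTypeOfferings"]:
        offerings.add(offering["InstanceType"])

print(f"{len(offerings)} instance types offered in us-east-1")
```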
In conclusion, if a tool, person, or tool of a person comes in to take a look at your AWS environment and casually suggests migrating to newer instances as a “quick win,” show them the door.
It’s a win. But it’s certainly not an easy one for most environments.