When Corey Quinn and I first started The Duckbill Group in 2019, I was expecting we’d be advising organizations on complicated Reserved Instance purchases and the like.
As we worked with more and more organizations on their horrifying AWS bills, I came to find that we spent almost all of our time (and still do!) advising Engineering teams on architectural improvements in pursuit of cost optimization. In fact, Reserved Instances and Savings Plans don’t even factor into our analysis until the very end of our AWS Cost Optimization engagements.
That realization has since become a core thesis for our consulting at The Duckbill Group: Cost management is primarily an engineering problem, not a financial problem. Treating it as a financial problem leads organizations to build ineffective cloud finance teams. By understanding how this goes wrong, you can build a much more effective cloud finance team at your company.
The unexpected reality: Architectural choices drive cost
Having seen oh-so-many AWS bills and consulted with dozens upon dozens of organizations, both staggeringly large and super tiny, it’s become clear to us that architectural choices are the primary driver of cloud costs.
Not those Elastic IPs you don’t use anymore.
Not those instances you forgot to turn off.
Not your developer environments.
It’s your architecture.
Your production environments should generally account for the vast majority of your cost, but within production, it’s the architectural choices you make that drive that cost, not unused resources.
Through that lens, the real path to optimizing cost begins to appear: Figure out the specific technical drivers of cost inside your architecture, pinpoint them, and then come up with ways to improve the cost of operating them.
What I mean by figuring out drivers of cost is more than just saying, “Oh, we use a lot of RDS.” I mean intimately understanding how data flows through the environment. What, exactly, causes costs to fluctuate? If you were to increase the workload of a given subset of your environment, what would happen to costs? If user traffic dropped by 20% tomorrow, what specifically would happen to your costs? Understanding these cost drivers allows you to pinpoint where the cost optimizations are.
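To make the "what if traffic dropped by 20%" question concrete, here is a minimal sketch in Python. It assumes a deliberately simple, hypothetical model in which each cost driver has a fixed portion and a portion that scales linearly with traffic. The driver names and dollar amounts are invented for illustration, and real cost drivers are rarely this linear.

```python
from dataclasses import dataclass


@dataclass
class CostDriver:
    name: str
    fixed_monthly: float     # cost that stays the same regardless of traffic
    variable_monthly: float  # cost at today's traffic that scales with it


def projected_cost(drivers, traffic_multiplier):
    """Total monthly cost if traffic becomes traffic_multiplier times today's."""
    return sum(
        d.fixed_monthly + d.variable_monthly * traffic_multiplier
        for d in drivers
    )


# Hypothetical numbers, purely for illustration.
DRIVERS = [
    CostDriver("database", fixed_monthly=8000.0, variable_monthly=2000.0),
    CostDriver("data transfer", fixed_monthly=0.0, variable_monthly=5000.0),
    CostDriver("compute", fixed_monthly=3000.0, variable_monthly=7000.0),
]

current = projected_cost(DRIVERS, 1.0)
after_drop = projected_cost(DRIVERS, 0.8)  # traffic falls by 20%
change_pct = (current - after_drop) / current * 100

print(f"current: ${current:,.0f}  after drop: ${after_drop:,.0f}")
print(f"bill falls by {change_pct:.1f}% when traffic falls by 20%")
```

With these made-up inputs, a 20% drop in traffic cuts the bill by only about 11%, because a large share of the spend does not move with traffic at all. Knowing which portion is which, per driver, is the kind of understanding I'm talking about.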
Any system that charges based on metered usage naturally leads to this situation of needing to intimately understand cost drivers and behaviors. To optimize the costs of that metered-usage system, you have to change how the system is used. And thus, architecture and costs are the same problem.
Most of your cost management efforts are ineffective
The misunderstanding of the nature of the cost management problem has led, in our experience, to massively ineffective cost management efforts within many organizations.
Many of the self-reported “cost-mature” organizations that have come to us tend to have a dedicated person or team for managing their AWS costs. That’s great! We recommend that.
But when we dig into those day-to-day role responsibilities, we usually find that they’re centered on largely ineffective tasks: managing coverage and utilization of Reserved Instances and Savings Plans (RIs/SPs), and chasing down idle resources.
Both of those things are obvious tasks to do in all cost management efforts. They’re where almost all of the cost optimization SaaS tooling vendors focus their products, because the tasks can largely be automated, which makes them a great opportunity for software-driven solutions. These tasks are useful and valuable in some cases, but they’re not the most important work to be done.
RIs/SPs should be the last consideration in cost management efforts. When you purchase RIs/SPs, you’re fundamentally saying that you agree that what is currently running in the environment is the right stuff to be running, both in terms of number of resources and configuration of them.
You can’t do that at the start of your efforts.
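A rough sketch of why the ordering matters, in Python. It assumes a simplified commitment model: you pay a fixed hourly commitment whether or not you use it, usage covered by the commitment gets a flat discount, and anything beyond it is billed at on-demand rates. The 30% discount and the usage figures are hypothetical, not actual AWS pricing.

```python
def hourly_cost(on_demand_usage: float, commitment: float, discount: float) -> float:
    """Hourly bill under a simplified commitment model (an assumption for illustration).

    on_demand_usage: what the workload would cost per hour at on-demand rates.
    commitment: committed spend per hour, paid whether or not it is used.
    discount: fractional discount the commitment provides (0 <= discount < 1).
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    # On-demand-equivalent usage that the commitment covers.
    covered = commitment / (1.0 - discount)
    # Usage beyond the commitment is billed at on-demand rates.
    overflow = max(0.0, on_demand_usage - covered)
    return commitment + overflow


DISCOUNT = 0.30       # hypothetical
USAGE_BEFORE = 100.0  # $/hour at on-demand rates, before architectural work
USAGE_AFTER = 60.0    # $/hour at on-demand rates, after architectural work

# Path 1: commit to cover today's usage, then improve the architecture.
commit_first = hourly_cost(
    USAGE_AFTER, commitment=USAGE_BEFORE * (1 - DISCOUNT), discount=DISCOUNT
)

# Path 2: improve the architecture, then commit to cover what remains.
optimize_first = hourly_cost(
    USAGE_AFTER, commitment=USAGE_AFTER * (1 - DISCOUNT), discount=DISCOUNT
)

print(f"commit first:   ${commit_first:.2f}/hour")
print(f"optimize first: ${optimize_first:.2f}/hour")
```

Under these assumptions, committing first leaves you paying for capacity the improved architecture no longer needs for the rest of the term, which is why the commitment belongs at the end.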
And as for idle resources, yeah, sure, that’s a thing but … it’s really not that big of a problem.
We had a client many moons ago that was adamant they needed a robust solution to handle the runaway costs in their developer environments due to lots of constantly idle resources. We started discussing various models of solutions, such as automatic deprovisioning after-hours, before thinking to check their total spend on development environments. It was less than 1% of their bill! We advised them to just let the idle resources run and stop worrying about it.
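The check that settled that conversation is simple arithmetic: break spend down by environment and look at each share before designing anything. A minimal sketch in Python, assuming you can already group monthly spend by some environment tag; the environment names and amounts below are invented.

```python
def spend_share(spend_by_env):
    """Return each environment's share of total spend, as a percentage."""
    total = sum(spend_by_env.values())
    if total == 0:
        return {env: 0.0 for env in spend_by_env}
    return {env: amount / total * 100 for env, amount in spend_by_env.items()}


# Hypothetical monthly spend grouped by an environment tag.
monthly_spend = {
    "production": 940_000.0,
    "staging": 52_000.0,
    "development": 8_000.0,
}

shares = spend_share(monthly_spend)
for env, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{env:<12} {pct:5.1f}%")
```

If development comes out under 1% of the bill, as it does with these made-up numbers, even a perfect idle-resource cleanup there barely moves the total.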
The concern of idle resources driving costs is usually more fear than reality. And before you come at me: Yes, there will always be an exception where someone saved a ton of money by turning stuff off. That’s great, but it’s not the common case.
Managing RIs/SPs and idle resources are both worthwhile activities in the right context, but that context isn’t generally day-to-day operations.
Engineering isn’t paying attention the way you think
The main trouble I have with how most cost management efforts are structured is that they assume Engineering has made all the right choices to build cost-aware and cost-effective systems.
As any capable engineer will attest, this is a very bad assumption.
It’s not that engineers are willfully wasting money. Your organization employs engineers to build functionality for your customers and create value for them. The quicker they can do that, the sooner the customer gets value. That’s exactly what your business wants them doing.
Optimizing costs is a later-stage task (read: afterthought) for engineers, because creating customer value always comes first. Sometimes that later stage never arrives simply because your engineers are focused on the never-ending backlog of work that creates customer value.
This is a good problem to have. An expensive one, perhaps — but a good one.
That leads me to my next point.
You’re improperly staffing cost management roles
Many organizations hire people with a finance background to staff cost management roles. Please stop doing this — you’re making the problem worse.
On one hand, you probably have better reports now. But on the other hand, as established earlier, Finance can’t solve an Engineering problem. You need an engineer in that role. While you’ve got better reports, you’re no better off in terms of actual understanding and insight into your cloud spend.
Finance, given a lack of Engineering knowledge and no context provided by Engineering, is going to base decisions on the assumption that Engineering is doing the right things when it comes to AWS architecture and costs. That’s, uhh, not a bet I’d take.
Don’t get me wrong here: Finance has a crucial role in managing your AWS spend. It’s just not in cutting costs, because that’s not the best use of their expertise. (Let Finance focus on contract negotiation, unit economics, and forecasting.)
If you’re building an AWS cost management function, a Finance person is not the first hire I’d make.
Cost management SaaS tools are only a partial solution
One of the more common behaviors we’ve run across is organizations’ propensity to just rub some software on it. They roll out a product such as Cloudability or CloudHealth and believe that their cloud costs are all taken care of.
As it turns out, beating people over the head with pretty dashboards is a lot easier than changing the company culture so they care about costs.
One of my favorite things to do is ask how many active users the products have — it’s never more than four or five people, no matter the size of the organization. Software is only part of the solution to cloud finance, and it’s not even the most impactful part.
Going down this path risks putting the cloud finance function in a precarious position: They have all of the responsibility for managing costs but none of the leverage needed. It results in a constant battle against engineering teams that don’t understand why they have to care about cost management so much, particularly if it’s not part of their compensation or promotion structure.
Fixing cloud finance means rethinking cost management
Cloud finance is still a new area of work for the industry, so it’s understandable that many companies struggle to build an effective practice of it. The key to fixing broken, ineffective cloud finance teams is realizing that cost management is primarily an engineering problem.
To build an effective cloud finance team, your organization should:
- Rethink what you think you know about cloud spending. Operate from a new assumption: Architecture and costs are the same thing.
- Work to better understand how your architecture impacts your costs and how your specific cost drivers behave. The majority of your efforts should be on understanding cost drivers instead of RI/SP management and identifying idle resources.
- Build processes into your engineering release cycles for ongoing cost optimization efforts. A little bit of time spent every week will have much better results than a lot of effort every quarter.
- Staff your cloud finance efforts with engineering, supported by finance — not the other way around.
- Acknowledge that your tools aren’t the complete solution. Tools are not a replacement for people; tools augment people.
Being more aware of the failure modes I’ve laid out and what to do instead will, hopefully, allow your company to improve how it manages cloud costs.