Maximize your engineering efficiency with strategies focused on cost-effectiveness. Learn tips for optimizing resources and reducing operational costs.
By Chris Hurst and Chris Childress
We are engineers, and as engineers, we are constantly seeking to improve the function, performance, security, and other aspects of the solutions we create. One of those “other aspects” is improving the cost efficiency of the solution, or the cost per amount of value added.
Value could be a file sent or received, a notification to a subscriber, a webpage served to a customer, or many other programmatic functions of an application. The focus of this post is to home in on cost: what it really is, and how to better understand and optimize it.
We can probably all agree the most objective measure of cost, especially for a business, is the amount of money spent. Where it may become confusing is how to measure the money expended.
Some of that difficulty has traditionally been hardware-related: how do you measure cost over time while accounting for capital expenditures for hardware, capturing electricity costs, accounting for the cost of space in the building for the data center, and more?
Fortunately, in 2021 a cloud bill gives us very objective costs over time for this underlying hardware layer. Add software licensing to those hardware costs and you have arrived at the point where many seem to stop measuring cost.
So what’s wrong with that? If we have all of the hardware and software costs accounted for, we know exactly what to optimize to make our systems more cost-efficient, right? Not so fast!
The uncomfortable truth is the most significant cost of these systems is often the cost of engineering time. This might not surprise you, especially if you’ve worked on the management or consulting side, where you see the amount of money being spent on engineers, consultants, and contractors.
When these solutions are examined, you will find it’s not unusual for the cost to implement and operate a solution to be multiple times higher than the cost of the infrastructure it runs on.
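To make that comparison concrete, here is a minimal Python sketch of cost per unit of value that counts engineering time alongside the infrastructure bill. Every number in it is a made-up illustration, and the names `MonthlyCost` and `cost_per_unit` are hypothetical helpers for this post, not part of any tool.

```python
from dataclasses import dataclass


@dataclass
class MonthlyCost:
    """Hypothetical monthly cost inputs for one service."""
    infrastructure_usd: float  # cloud bill plus software licensing
    engineering_hours: float   # hours spent implementing and operating
    hourly_rate_usd: float     # assumed blended cost of one engineering hour

    @property
    def engineering_usd(self) -> float:
        return self.engineering_hours * self.hourly_rate_usd

    @property
    def total_usd(self) -> float:
        return self.infrastructure_usd + self.engineering_usd


def cost_per_unit(cost: MonthlyCost, units_of_value: int) -> float:
    """Total cost divided by units of value delivered (files sent, pages served, ...)."""
    if units_of_value <= 0:
        raise ValueError("units_of_value must be positive")
    return cost.total_usd / units_of_value


# Illustrative numbers only, not data from any real system.
example = MonthlyCost(infrastructure_usd=2_000, engineering_hours=40, hourly_rate_usd=150)
print(example.engineering_usd)          # 6000
print(example.total_usd)                # 8000
print(cost_per_unit(example, 100_000))  # 0.08
```

With these assumed inputs, a single week of one engineer's time costs three times the monthly infrastructure bill, so stopping the measurement at hardware and licensing would miss most of the cost per unit.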
If this is true, then why doesn’t everyone optimize for this much more significant cost? In short, it’s complicated to calculate the cost of engineering time with any accuracy, especially at a project level.
We believe there are many reasons for this, but here are just a few:
Some engineering time brings measurably more value per unit than other engineering time, some costs more than others, and cost does not track the value provided as closely as you might think.
The engineering time you already have is expensive to allocate, and it is also a very limited resource that costs a significant amount to acquire more of.
Many organizations have little to no understanding of the engineering time that is spent on specific initiatives, and if you think that’s an easy fix, try to implement cost tracking software and get a bunch of humans to use it to provide accurate and useful data…
So now that we've identified a problem, let’s summarize the situation in a more concise statement: we are engineers and we want to continually improve our systems' cost efficiency, but we can’t get accounting to provide accurate engineering cost information for a specific project.
A logical line of thinking might be “OK, so that’s a numbers problem and not something an engineer can fix. Why should this matter to me as an engineer, and what can I do about it?” This is where we believe we can provide some insight!
We believe engineers already know which services are costing the organization more engineering time and likely have a good idea of what to do to optimize it through the entire lifecycle.
They might even have tried to optimize it in the past without realizing it, but didn’t structure their requests in a way that won management support. We’ve experienced this, so to draw that information out of ourselves we like to ask a simple question: “What does it feel like to work with this solution, and why?”
If you apply this question to a few of your services, you will probably find a common thread. The services taking a lot of time and costing an organization the most money are also the ones that aren’t enjoyable to work with.
Here are a couple of ways a service that doesn’t feel good to work with leads to increased cost:
The services that don’t feel good to work with often cause the most unplanned work. This is the work that happens at the worst times, preventing deliverables you wanted to work on from getting the attention they need, and pulling you out of your productivity power band.
Often, the most effective engineers are the ones who have to spend time on these systems. This is because they are the ones who can fix the issues quickly and unblock the team, who understand the complexity, or who are trusted by the organization to resolve an outage quickly.
There are different reasons why services reach this state, but here are a few we would recommend looking out for. You may even want to add some of these as checks in architecture reviews or regularly scheduled operational reviews.
Unnecessary Optimization - It can be really fun to build a service for your company’s internal staff that will be able to scale up to 10 million users, even though your company only has about 3,600 employees. It’s “future-proof” now though, right? We would argue a more accurate term might be “future-resisting”, because the additional complexity that was added to provide that level of scalability effectively calcified the solution, making it more rigid and more complicated for every kind of change, feature enhancement, engineer onboarding, or outage resolution. Simplicity is a great ally when attempting to make systems that are enjoyable to work with and don’t consume a lot of engineering time.
Recurring Issues and Runbooks - There are exceptions, but as a general rule, if something is happening frequently enough (more than once) to have a runbook, you should probably find the root cause and fix it instead. Recurring issues burn engineering time and emotional/mental stamina, often from your highest performers. Also, we’re not sure we’ve ever met anyone who really liked working on systems with a lot of recurring issues.
Aggressive Infrastructure Cost Reduction - What is a more traditional approach to cost optimization than trying to find that perfect size for your VM? That Windows server only needs 2GB of RAM, right? It might need to be rebooted every few weeks when it becomes unresponsive and starts getting user complaints, and it does take a long time to install updates or software, but it’s costing us $20 per month less so it’s worth it! Now go back to those labor costs we were talking about (most external resources cost between $150 and $200/hr), and it should become clear that halving the instance size may not have been worth the $20/mo saved.
Avoiding Vendor Lock-In - There are definitely some companies that need to be very aware of this; what about yours? If you have to move that product to another vendor later, it will definitely cost more money to move it than if you had made it vendor-agnostic. How much more though? Would it cost you so much more that it is worth paying upfront for a design that allows for it? Is it worth the complexity and all the abstraction that the agnostic design introduces into the solution?
Custom Frameworks - Whether it’s for security reasons, performance reasons, or some extra capability, sometimes it makes sense to make a custom framework. However, you definitely want to consider the cost for every engineer working with the solution to understand it, add to it, and operate services made with the framework, plus the cost of generating and maintaining enough documentation to deliver quality outcomes and the time spent maintaining the framework itself. A good rule of thumb: if it provides a real business differentiator for one of your business’s core domains, it might be worth doing. If it doesn’t, you should probably grab that existing framework, warts and all, and keep on moving. The open-source maintainer would probably love having another contributor as well!
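Returning to the Aggressive Infrastructure Cost Reduction item above, the trade-off can be sketched as a simple break-even calculation in Python. The $20/mo saving and the $150 to $200/hr labor rates come from that example. The one hour per month of extra work is purely an assumption for illustration, and both function names are hypothetical.

```python
def breakeven_hours_per_month(monthly_savings_usd: float, hourly_rate_usd: float) -> float:
    """Hours of extra labor per month that cancel out the infrastructure savings."""
    if hourly_rate_usd <= 0:
        raise ValueError("hourly_rate_usd must be positive")
    return monthly_savings_usd / hourly_rate_usd


def net_monthly_savings(monthly_savings_usd: float,
                        extra_hours_per_month: float,
                        hourly_rate_usd: float) -> float:
    """Infrastructure savings minus the cost of the extra labor they cause."""
    return monthly_savings_usd - extra_hours_per_month * hourly_rate_usd


for rate in (150, 200):
    minutes = breakeven_hours_per_month(20, rate) * 60
    print(f"${rate}/hr: savings are gone after {minutes:.0f} minutes of extra work per month")

# Assumption: one hour per month goes to reboots, slow updates, and complaints.
print(net_monthly_savings(20, 1, 150))  # -130
```

Under these assumptions, the smaller instance pays for itself only if it causes less than roughly six to eight minutes of extra engineering work per month.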
With this in mind, the next time you look at your service to figure out how to make it portable across any provider, make the compute bill just a bit smaller, or architect it to run on Kubernetes instead of a Dyno or Fargate runtime, ask yourself this question: “How will this affect the team’s experience adding to or operating this solution?”
Hopefully, this can lead your engineering organization to make more cost-efficient solutions that are also a joy to work with!
As always, we are not all-knowing and probably don’t know much about your solution. These are only observations from our internal work and client solutions.
We would love to hear your thoughts on whether you think this does or does not apply to your organization so we can all learn more together. Contact us today!