Rather than focusing on accountability as the starting point, I would suggest building the tooling first, so that cloud costs are visible across all layers, application and infra included. Once you have that, accountability becomes easier.
Each application team should be able to view the total cost of running their service, and thus be held accountable for reducing that cost when necessary.
Without data you are running blind. Cost optimization cannot be solved by a standalone team - it has to be owned by everyone.
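For the per-team view, cost-allocation tags plus the billing API are the usual building block. A minimal sketch, assuming AWS with a "team" tag activated as a cost-allocation tag (the tag name and dates are placeholders, not a recommendation):

```python
import boto3

# Pull last month's cost grouped by a "team" cost-allocation tag.
# Assumes resources are consistently tagged and the tag is activated in Billing.
ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for period in response["ResultsByTime"]:
    for group in period["Groups"]:
        tag_value = group["Keys"][0]  # e.g. "team$payments"
        amount = group["Metrics"]["UnblendedCost"]["Amount"]
        print(f"{tag_value}: ${float(amount):,.2f}")
```

The same grouping works per service if you tag at that granularity, which is what makes the per-team (or per-service) cost report cheap to produce once the tagging discipline is in place.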
Source: Personal experience reducing cloud costs in a slightly smaller team.
I agree that building visibility makes accountability easier. It's relatively trivial to build observability for individual services, and we have achieved some version of it.
The problem starts when you have 30-odd microservices (each team owning 5-10 of them) talking to each other. In a pre-production setup, changes to a few of these services might not have a noticeable impact on cost, but the impact becomes quite apparent in production. When this happens, we definitely notice the increase in both the cost and the unit metric, but we don't know where to start fixing the problem. Right now it turns into a war-room situation depending on severity, and I don't think that is sustainable.
In comparison, if we take API latency as a metric, accountability and ownership are clearly defined: if an API slows down, the team that owns it fixes it. They can work with anyone they need to, but it's their job to fix it.
Did you face similar concerns/issues? I'm not sure if this is a problem other engineering teams are struggling with, or even consider a real problem worth investing in.
I'm also not sure if there's a "standard" way of doing this that we should be thinking about, so I'm looking for ideas and thoughts here.
I see what you are describing. The pre-production/staging setup may not surface cost increases caused by application changes, and by the time the change has been running in production for a while, it has already caused a cost explosion.
We did face similar situations, but we fixed them after the cost went up in prod. I guess this has more to do with how much and how fast an "undetected" cost from pre-prod can explode in production. We used to keep an eye on the prod cost numbers after a deployment and then tackle each increase, because the growth was not that quick.
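If it helps, the "keep an eye on prod costs after a deployment" part can be semi-automated with a simple baseline comparison. A rough sketch (the window and threshold are made up for illustration, not what we actually ran):

```python
from statistics import mean

def post_deploy_cost_check(daily_costs, deploy_index, window=7, threshold_pct=10.0):
    """Compare average daily cost after a deploy against the window before it.

    daily_costs: list of daily cost numbers (e.g. from your billing export).
    deploy_index: index of the deploy day in that list.
    Returns the percentage increase if it exceeds the threshold, else None.
    """
    before = daily_costs[max(0, deploy_index - window):deploy_index]
    after = daily_costs[deploy_index + 1:deploy_index + 1 + window]
    if not before or not after:
        return None  # not enough data yet
    increase_pct = (mean(after) - mean(before)) / mean(before) * 100
    return increase_pct if increase_pct > threshold_pct else None

# Example: ~$400/day before the deploy, ~$460/day after -> flags a ~15% jump
costs = [400, 405, 398, 402, 399, 401, 400, 460, 455, 462, 458, 461, 459, 463]
print(post_deploy_cost_check(costs, deploy_index=6))
```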
I'm not sure about a "standard" way either, so I'm just thinking aloud here, and I've not tried this myself:
For application changes, measure the cost difference in pre-prod, as a percentage increase between the previous deployment and the current one, and use that to estimate the possible increase in prod. I suspect this will get messy very fast, since you'd also have to factor in request volume, CPU/memory usage, and so on.
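A naive version of that comparison, using cost per request as the unit, might look something like this (purely illustrative numbers; this is exactly the part I suspect gets messy once CPU/memory and traffic shape enter the picture):

```python
def preprod_cost_delta(prev_cost, prev_requests, curr_cost, curr_requests):
    """Percentage change in cost per request between two pre-prod deployments.

    Normalizing by request count is a crude attempt to control for the fact
    that pre-prod traffic differs between runs; CPU/memory usage would need
    the same treatment.
    """
    prev_unit = prev_cost / prev_requests
    curr_unit = curr_cost / curr_requests
    return (curr_unit - prev_unit) / prev_unit * 100

# Example: cost per request goes from $0.00020 to $0.00024 -> ~20% increase
print(preprod_cost_delta(prev_cost=12.0, prev_requests=60_000,
                         curr_cost=15.0, curr_requests=62_500))
```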