Rather than focusing on accountability as the starting point, I would suggest building the tooling first, so that cloud costs are visible across all layers, application and infra included. Once you have that, accountability becomes easier.
Each application team should be able to view the total cost of running their service, and thus be held accountable for reducing that cost when necessary.
Without data you are running blind. Cost optimization cannot be solved by a standalone team - it has to be owned by everyone.
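For the per-team view, cost-allocation tags plus the billing API are the usual building block. A minimal sketch, assuming AWS with a "team" tag activated as a cost-allocation tag (the tag name and dates are placeholders, not a recommendation):

```python
import boto3

# Pull last month's cost grouped by a "team" cost-allocation tag.
# Assumes resources are consistently tagged and the tag is activated in Billing.
ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for period in response["ResultsByTime"]:
    for group in period["Groups"]:
        tag_value = group["Keys"][0]  # e.g. "team$payments"
        amount = group["Metrics"]["UnblendedCost"]["Amount"]
        print(f"{tag_value}: ${float(amount):,.2f}")
```

The same grouping works per service if you tag at that granularity, which is what makes the per-team (or per-service) cost report cheap to produce once the tagging discipline is in place.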
Source: Personal experience reducing cloud costs in a slightly smaller team.
I agree that building visibility makes accountability easier. It's relatively trivial to build observability for individual services, and we have achieved some version of it.
The problem starts when you have 30-odd microservices (each team owning 5-10 of them) talking to each other. In a pre-production setup, changes to a few of these services might not have a noticeable impact on cost, but the impact becomes quite apparent in production. When this happens, we definitely notice the increase in both the cost and the unit metric, but we don't know where to start fixing the problem. Right now it turns into a war-room situation depending on severity, and I don't think that is sustainable.
In comparison, if we take API latency as a metric, accountability and ownership are clearly defined: if an API slows down, the team that owns it fixes it. They can work with anyone they need to, but it's their job to fix it.
Did you face similar concerns/issues? I'm not sure if this is a problem other engineering teams are struggling with, or even consider a real problem worth investing in.
I'm also not sure if there's a "standard" way of doing this that we should be thinking about, so I'm looking for ideas and thoughts here.
I see what you are describing. The pre-production/staging setup may not surface cost increases caused by application changes, and by the time the change has been running in production for a while, it has already caused a cost explosion.
We did face similar situations, but we fixed them after the cost went up in prod. I guess this has more to do with how much and how fast an "undetected" cost from pre-prod can explode in production. We used to keep an eye on the prod cost numbers after a deployment and then tackle each increase, because the growth was not that quick.
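If it helps, the "keep an eye on prod costs after a deployment" part can be semi-automated with a simple baseline comparison. A rough sketch (the window and threshold are made up for illustration, not what we actually ran):

```python
from statistics import mean

def post_deploy_cost_check(daily_costs, deploy_index, window=7, threshold_pct=10.0):
    """Compare average daily cost after a deploy against the window before it.

    daily_costs: list of daily cost numbers (e.g. from your billing export).
    deploy_index: index of the deploy day in that list.
    Returns the percentage increase if it exceeds the threshold, else None.
    """
    before = daily_costs[max(0, deploy_index - window):deploy_index]
    after = daily_costs[deploy_index + 1:deploy_index + 1 + window]
    if not before or not after:
        return None  # not enough data yet
    increase_pct = (mean(after) - mean(before)) / mean(before) * 100
    return increase_pct if increase_pct > threshold_pct else None

# Example: ~$400/day before the deploy, ~$460/day after -> flags a ~15% jump
costs = [400, 405, 398, 402, 399, 401, 400, 460, 455, 462, 458, 461, 459, 463]
print(post_deploy_cost_check(costs, deploy_index=6))
```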
I'm not sure about a "standard" way either, so I'm just thinking aloud here, and I've not tried this myself:
For application changes, measure the cost difference in pre-prod, as a percentage increase between the previous deployment and the current one, and use that to estimate the possible increase in prod. I suspect this will get messy very fast, since you'd also have to factor in request volume, CPU/memory usage, and so on.
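A naive version of that comparison, using cost per request as the unit, might look something like this (purely illustrative numbers; this is exactly the part I suspect gets messy once CPU/memory and traffic shape enter the picture):

```python
def preprod_cost_delta(prev_cost, prev_requests, curr_cost, curr_requests):
    """Percentage change in cost per request between two pre-prod deployments.

    Normalizing by request count is a crude attempt to control for the fact
    that pre-prod traffic differs between runs; CPU/memory usage would need
    the same treatment.
    """
    prev_unit = prev_cost / prev_requests
    curr_unit = curr_cost / curr_requests
    return (curr_unit - prev_unit) / prev_unit * 100

# Example: cost per request goes from $0.00020 to $0.00024 -> ~20% increase
print(preprod_cost_delta(prev_cost=12.0, prev_requests=60_000,
                         curr_cost=15.0, curr_requests=62_500))
```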