Cloud Run GPUs, now GA, makes running AI workloads easier for everyone

cloud.google.com

312 points

mariuz

2 days ago


184 comments

ashishb 2 days ago

I love Google Cloud Run and highly recommend it as the best option[1]. The Cloud Run GPU, however, is not something I can recommend. It is not cost-effective (instance-based billing is expensive compared to request-based billing), GPU choices are limited, and the general loading/unloading of models (gigabytes) from GPU memory makes it slow to use as serverless.

Once you compare the numbers, it is better to use a VM + GPU if your service is utilized for even 30% of the day.

1 - https://ashishb.net/programming/free-deployment-of-side-proj...

  • gabe_monroy 2 days ago

    google vp here: we appreciate the feedback! i generally agree that if you have a strong understanding of your static capacity needs, pre-provisioning VMs is likely to be more cost efficient with today's pricing. cloud run GPUs are ideal for more bursty workloads -- maybe a new AI app that doesn't yet have PMF, where you really need that scale-to-zero + fast start for more sparse traffic patterns.

    • jakecodes 2 days ago

      Appreciate the thoughtful response! I’m actually right in the ICP you described — I’ve run my own VMs in the past and recently switched to Cloud Run to simplify ops and take advantage of scale-to-zero. In my case, I was running a few inference jobs and expected a ~$100 bill. But due to the instance-based behavior, it stayed up the whole time, and I ended up with a $1,000 charge for relatively little usage.

      I’m fairly experienced with GCP, but even then, the billing model here caught me off guard. When you’re dealing with machines that can run up to $64K/month, small missteps get expensive quickly. Predictability is key, and I’d love to see more safeguards or clearer cost modeling tooling around these types of workloads.

      • gabe_monroy 2 days ago

        Apologies for the surprise charge there. It sounds like your workload pattern might be sitting in the middle of the VM vs. Serverless spectrum. Feel free to email me at (first)(last)@google.com and I can get you some better answers.

      • ashishb 2 days ago

        > But due to the instance-based behavior, it stayed up the whole time, and I ended up with a $1,000 charge for relatively little usage.

        Indeed. IIRC, if you get a single request every 15 mins (~100 requests a day), you will pay for Cloud Run GPU for the full day.

    • Sn0wCoder a day ago

      Has this changed? When I looked pre-GA, the requirement was that you pay for the CPU 24x7 to attach a GPU, so that is not really scaling to zero unless that has changed...

      • ashishb a day ago

        Speaking from my experience, it does scale to zero, except you pay for 15 mins after the last request.

        So if you get all your requests in a 2-hour window, that's great. It will scale to zero for the rest of the 22 hours.

        However, if you get at least one request every 15 mins, then you will pay for 24 hours, and it is ~3X more expensive than an equivalent VM on Google Cloud.
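
        A rough sketch of that break-even math (the rates are placeholders: the L4 figure matches numbers quoted elsewhere in this thread, the VM rate is an assumption):

          CLOUD_RUN_L4_PER_HOUR = 0.71  # placeholder serverless GPU rate
          VM_WITH_GPU_PER_HOUR = 0.60   # placeholder always-on VM+GPU rate
          IDLE_TAIL_HOURS = 0.25        # ~15 min billed after the last request

          def daily_cost(active_hours):
              # Cloud Run bills the active window plus the idle tail;
              # an always-on VM bills all 24 hours regardless of traffic.
              cloud_run = min(24, active_hours + IDLE_TAIL_HOURS) * CLOUD_RUN_L4_PER_HOUR
              vm = 24 * VM_WITH_GPU_PER_HOUR
              return round(cloud_run, 2), round(vm, 2)

          print(daily_cost(2))   # bursty 2h window: (1.6, 14.4)  -> serverless wins
          print(daily_cost(24))  # a request every 15 min: (17.04, 14.4) -> VM wins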

    • krembo 2 days ago

      How does that compare to spinning up some EC2s with Amazon Trainium accelerators?

      • mgraczyk a day ago

        Depending on your model, you may spend a lot of time trying to get it to work with Trainium

  • icedchai 2 days ago

    Cloud Run is a great service. I find it much easier to work with than AWS's equivalent (ECS/Fargate).

    • psanford 2 days ago

      AWS AppRunner is the closest equivalent to Cloud Run. It's really not close though; AppRunner is an unloved service at AWS and is missing a lot of the features that make Cloud Run nice.

      • vrosas 2 days ago

        AppRunner was Amazon's answer to AppEngine a full decade+ later. Cloud Run is miles ahead.

      • romanhn 2 days ago

        I agree with the unloved part. It was a great middle ground between Lambda and Fargate (zero cold start, reasonable pricing), but has seemingly been in maintenance mode for quite a while now. Really sad to see.

    • gabe_monroy 2 days ago

      i am biased, but i agree :)

      • icedchai 2 days ago

        hah. I looked at your comments and saw you were a google VP! I've migrated some small systems from AWS to GCP for various POCs and prototypes, mostly Lambda and ECS to Cloud Run, and find GCP provides a better developer experience overall.

        • gabe_monroy 2 days ago

          love that you're enjoying the devex. we put a lot of sweat into it, especially in services like cloud run.

    • AChampaign 2 days ago

      I think Lambda is more or less the AWS equivalent.

      • icedchai 2 days ago

        It's not. Cloud Run can be longer running: you can have batch and services. Lambda is closer to Cloud Functions.

      • ZeroCool2u 2 days ago

        I think Cloud Run Functions would be the direct equivalent to Lambda.

        • hn_throwaway_99 2 days ago

          I agree, but in the GCP world, a lot of these things are merging. My understanding is that Cloud Run, Cloud Run Functions (previously known as Cloud Functions Gen2) and even App Engine Flexible all run in the same underlying cloud run infrastructure, so it's essentially just some interface differences that to me now seem more like historical legacy/backwards compatibility reasons than meaningful functionality differences (e.g. Functions can now handle multiple concurrent requests).

          • yegle 2 days ago

            FWIW, App Engine Flexible is a different product that runs on GCE VM.

            Other products (App Engine standard, Cloud Functions gen1, Cloud Run, Cloud Run Functions) share many underlying infrastructures.

            • hn_throwaway_99 2 days ago

              Oh, thanks! I guess I had it backwards - I thought App Engine standard was the one on a different infrastructure.

        • AChampaign 2 days ago

          Oh, you’re probably right.

      • shiftyck 2 days ago

        Eh idk Cloud Run is much better suited to long running instances than Lambda. You would use Cloud Functions for those types of workloads in GCP.

        • weberer 2 days ago

          For those who don't know, AWS Lambda functions have a hard limit of 15 minutes.

  • mountainriver 2 days ago

    The problem is you can't reliably get VMs on GCP.

    All the major clouds are suffering from this. On AWS you can't ever get an 80GB GPU without a long-term reserve, and even then it's wildly expensive. On GCP you sometimes can, but it's also insanely expensive.

    These companies claim to be "startup friendly"; they are anything but. All the neo-clouds somehow manage to do this well (RunPod, Nebius, Lambda), but the big clouds are just milking enterprise customers who won't leave and in the process screwing over the startups.

    This is a massive mistake they are making, which will hurt their long term growth significantly.

    • covi 2 days ago

      To massively increase the odds of getting GPUs, you can use something like SkyPilot (https://github.com/skypilot-org/skypilot) to fall back across regions, clouds, or GPU choices. E.g.,

      $ sky launch --gpus H100

      will fall back across GCP regions, AWS, your clusters, etc. There are options to say try either H100 or H200 or A100 or <insert>.

      Essentially the way you deal with it is to increase the infra search space.

    • rendaw 2 days ago

      We've run into this a lot lately too, even on AWS. "Elastic" compute, but all the elasticity's gone. It's especially bitter since splitting the costs of spare capacity is the major benefit of scale here...

      • mountainriver 2 days ago

        Enterprises are just gobbling up all the supply on reserves so they see no need to lower the price.

        All the while saying they are "startup friendly".

    • dconden 2 days ago

      Agreed. Pricing is insane and availability generally sucks.

      If anyone is curious about these neo-clouds, a YC startup called Shadeform has their availability and pricing in a live database here: https://www.shadeform.ai/instances

      They have a platform where you can deploy VMs and bare metal from 20 or so popular ones like Lambda, Nebius, Scaleway, etc.

  • bodantogat 2 days ago

    I had the opposite experience with Cloud Run. Mysterious scale-outs/restarts - I had to buy a paid subscription to cloud support to get answers and found none. Moved to self-managed VMs. Maybe things have changed now.

    • PaulMest 2 days ago

      Sadly this is still the case. Cloud Run helped us get off the ground. But we've had two outages where Google Enhanced Support could give us no suggestion other than "increase the maximum instances" (not minimum instances). We were doing something like 13 requests/min on this instance at the time. The resource utilization looked just fine, but somehow we hit a blip where no containers were available. It even dropped below our min containers. The fix was to manually redeploy the latest revision.

      We're now investigating moving to Kubernetes where we will have more control over our destiny. Thankfully a couple people on the team have experience with this.

      Something like this never happened with Fargate in the years my previous team had used that.

    • ajayvk 2 days ago

      https://github.com/claceio/clace is a project I am building that gives a Cloud Run type deployment experience on your own VMs. For each app, it supports scaling down to zero containers (scaling up beyond one is being built).

      The authorization and auditing features are designed for internal tools, but any app can be deployed otherwise.

      • holografix a day ago

        Have a look at Knative

  • Bombthecat 18 hours ago

    You don't go to cloud services because they are cheaper.

    You go there because you are already there or have contracts etc etc

  • JoshTriplett 2 days ago

    Does Cloud Run still use a fake Linux kernel emulated by Go, rather than a real VM?

    Does Cloud Run give you root?

    • seabrookmx 2 days ago

      You're thinking of gVisor. But no, the "gen2" runtime is a microVM à la Firecracker and performs a lot better as a result.

      • JoshTriplett 2 days ago

        Ah, that's great.

        And it looks like Cloud Run can do something Lambda can't: https://cloud.google.com/run/docs/create-jobs . "Unlike a Cloud Run service, which listens for and serves requests, a Cloud Run job only runs its tasks and exits when finished. A job does not listen for or serve requests."

      • pryz 2 days ago

        • seabrookmx 15 hours ago

          Possibly? I haven't found any public documentation that says specifically what hypervisor is used.

          Google built crosvm, which was the initial inspiration for Firecracker, but Cloud Run runs on top of Borg (this fact is publicly documented). Borg is closed source, so it's possible the specific hypervisor they're using is as well.

    • rpei a day ago

      We (I work on Cloud Run) are working on root access. If you'd like to know more, you can reach me at rpei@google.com

      • JoshTriplett a day ago

        Awesome! I'll reach out to you, thank you.

  • dig1 2 days ago

    > I love Google Cloud Run and highly recommend it as the best option

    I'd love to see the numbers for Cloud Run. It's nice for toy projects, but it's a money sink for anything serious, at least from my experience. On one project, we had a long-standing issue with G regarding autoscaling - scaling to zero sounds nice on paper, but they don't tell you about the warmup phases, where CR can spin up multiple containers for a single request and keep them around for a while. And good luck hunting for inexplicably running containers when there is no apparent CPU or network use (G will happily charge you for this).

    Additionally, startup is often abysmal with Java and Python projects (although it might perform better with Go/C++/Rust projects, but I don't have experience running those on CR).

    • tylertreat 2 days ago

      > It's nice for toy projects, but it's a money sink for anything serious, at least from my experience.

      This is really not my experience with Cloud Run at all. We've found it to actually be quite cost effective for a lot of different types of systems. For example, we ended up helping a customer migrate a ~$5B/year ecommerce platform onto it (mostly Java/Spring and Typescript services). We originally told them they should target GKE but they were adamant about serverless and it ended up being a perfect fit. They were paying like $5k/mo which is absurdly cheap for a platform generating that kind of revenue.

      I guess it depends on the nature of each workload, but for businesses that tend to "follow the sun" I've found it to be a great solution, especially when you consider how little operations overhead there is with it.

  • ivape 2 days ago

    Maybe I just don't know, but I really don't think most people here can even point to a cloud GPU setup serving 1000 concurrent users that doesn't end up with a million-dollar bill.

isoprophlex 2 days ago

All the cruft of a big cloud provider, AND the joy of uncapped yolo billing that has the potential to drain your credit card overnight. No thanks, I'll personally stick with Modal and vast.ai

  • montebicyclelo 2 days ago

    Not providing a cap on spending is a major flaw of GCP for individuals / small projects.

    With Cloud Run, AFAIK, spending can effectively be capped by: limiting concurrency, plus limiting the max number of instances it can scale to. (But this is not as good as GCP having a proper cap.)
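
    For example, something along these lines with gcloud (service name is a placeholder; double-check the current flag names):

      $ gcloud run services update MY_SERVICE --max-instances=5 --concurrency=80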

    • brutus1213 2 days ago

      Amazon is the same I think? I live in constant fear we will have a runaway job one day. I get daily emails to myself (as a manager) and to my finance person. We had one instance where a team member forgot to turn off a machine for a few months :(

      I get why it is a business strategy to not have limits... but I wonder if providers would get more usage if people had more trust in costs/predictability.

      • anonymousab 2 days ago

        I remember going out to dinner, years ago, with a fairly senior AWS billing engineer. An acquaintance of a coworker.

        He looked completely surprised when I asked about runaway billing and why there weren't any simple options to cap a given resource to prevent those cases.

        His response was that they didn't build that because none of their customers wanted anything like that, as far as he was aware.

        • mwest217 2 days ago

          Disclaimer: I work at Google but not on cloud. Opinions my own.

          I think the reason this doesn’t get prioritized is that large customers don’t actually want a “stop serving if I pass this limit” amount. If there’s a spike in traffic, they probably would rather pay the money to serve it. The customers that would want this feature are small-dollar customers, and from an economic perspective it makes less sense to prioritize this feature, since they’re not spending very much relative to customers who wouldn’t want this feature.

          Maybe if there weren’t more feature requests to get prioritized this might happen, but the reality is that there are always more feature requests than time to implement them, and a feature request used almost exclusively by the smallest dollar customers will always lose to a feature for big-dollar customers.

          • montebicyclelo 2 days ago

            I guess where it could potentially bring value is by:

            Removing a major concern that prevents individuals / small customers from using GCP in the first place; so more of them do use it

            That could then lead to value in two ways:

            - They make small projects that go on to be large projects later, (e.g. a small app that grows / becomes successful, becomes a moneymaker)

            - Or, they might then be more inclined to get their big corp to use GCP later on, if they've already been using it as an individual

            But that's long term, and hard to measure / put a number on

          • coredog64 2 days ago

            As noted above, there is enough value here such that AWS implemented this several years ago. Said implementation is appropriate for both personal AWS accounts and large scale multi-account organizations.

            Having implemented this on behalf of others several times, I'll share the common pain points:

            - There's a long lead time. You need to enable Cost Explorer (24-48 hours). If you're trying for fine distinctions, activating tags as cost allocation tags is another 24 hours.

            - AWS cost data is a lagging indicator, so you need to be able to absorb a day of charges.

            - Automation support is poor, especially for organizations.

            - Organization budgets configured at the account level are misleading if you don't understand how they're configured.

            What's really wanted here is that AWS needs to commit to more timely cost data delivery such that you can create an hourly budget with an associated action.

            • happyopossum a day ago

              > Said implementation is appropriate for both personal AWS accounts and large scale multi-account organizations.

              Followed by a list of caveats that make it wholly irrelevant for an individual who is afraid of a surprise charge covering less than several days.

          • jiggawatts a day ago

            Every large enterprise has insurmountable difficulty even imagining why customers would want something as bizarre as a "stop loss" on their spending...

            ... right up until it's their own bottom line that is at risk, and then like magic spending limits become a critical feature.

            For example, Azure has no stop-loss feature for paid customers, but it does for the "free" Visual Studio subscriber credits. Because if some random dev with a VS subscription blows through $100K of GPU time due to a missing spending constraint, that's Microsoft's problem, not their own.

            It's as simple as that.

        • dragandj 2 days ago

          Yeah, right. Capping a resource, such a wild idea. Of course they won't implement it for the same reason bar owners don't put a cap on drinks.

          • ElevenLathe 2 days ago

            Aren't bars actually required to cap drinks? It's usually phrased as having to refuse serving if you're visibly drunk, but still effectively a cap. That said, a big cloud bill doesn't make you intoxicated. The more I examine this analogy, the less it makes sense.

          • mtrovo 2 days ago

            I don't know if the analogy works that well; the assumption is that the more traffic you get, the more money you make relative to what you put in. For a bar owner, the choice is between closing the bar for the month when you run out of beer or running to the supplier to bring more kegs.

        • sidibe 2 days ago

          I'm sure a lot of people at Amazon and Google are aware small customers want this and it's a feature they'd like to brag about, but it is much harder to implement a real-time quota on spend than a daily batched job for the money part plus real-time resource-scoped quotas.

        • 152132124 2 days ago

          They meant none of their big customers wanted it; the small ones who worry about this don't matter.

      • coredog64 2 days ago

        There's a coarse option: Set up a budget and then a budget action. While ECS doesn't have GPU capabilities, the equivalent here would be "IAM action of budget sets deny on expensive service IAM action" (SCP is also available, but that requires an AWS Org, at which point you've probably got a team that already knows this)

        It's coarse because it's daily and not hourly. However, you could also self-service do some of this with CloudWatch metrics to map to a cost and then have an alarm action.

        https://aws.amazon.com/blogs/mt/manage-cost-overruns-part-1/

      • tmoertel 2 days ago

        > I get why it is a business strategy to not have limits...

        What is the strategy? Is it purely market segmentation? (As in: "If you need to worry about spending too much, you're not the big-money kind of enterprise customer we want"?)

        • nprateem 2 days ago

          It's not a strategy. It's technically difficult, opens them to liability if runaway happens so fast their system can't stop it, and is only wanted by bottom of the barrel customers.

          • physix 2 days ago

            Just a thought: Maybe if they had some kind of opt-in insurance against overuse until the circuit breaker can kick in?

            But, looking from the outside, the lack of protection is effectively a win for them. They don't need to invest in building that out, and their revenue is increased by not having it (if you ignore the effect of throttling adoption). So I have always assumed that there is simply no business case for that, so why bother?

    • advisedwang a day ago

      It's a rock and a hard place for the cloud providers.

      Cap billing, and you have created an outage waiting to happen, one that will be triggered if they ever have sudden success growth.

      Don't cap billing, and you have created a bankruptcy waiting to happen.

    • delfinom 2 days ago

      Flaw? Nah

      Feature for Google's profits.

  • kamranjon 2 days ago

    I dunno, the scale to zero and pay per second features seemed super useful to me after forgetting to shut down some training instances with AWS. Also the fast startup ability, if it actually works as well as they say, would be amazing for a lot of the type of workloads that I have.

    • isoprophlex 2 days ago

      Agreed, but runpod or modal offer the same. Happy to use big cloud for a client if they pay the bills, but for personal quests... too scary.

  • decimalenough 2 days ago

    You can set max instances in Cloud Run, which is an effective limit on how much you'll spend.

    Also, hard dollar caps are rarely if ever the right choice. App Engine used to have these, and the practical effect was that your website would completely stop working exactly when you least want it to (posted on HN etc).

    It's better to set billing alerts and make the call yourself if they go off.

    • rustc 2 days ago

      > Also, hard dollar caps are rarely if ever the right choice.

      Depends on if you're a big business or an individual. There is absolutely no reason I would ever pay $100k for a traffic burst on my personal site or side project (like the $100k Netlify case a few months ago).

      > It's better to set billing alerts and make the call yourself if they go off.

      Billing alerts are not instant and neither is anyone online 24x7 monitoring the alerts.

      • brutus1213 2 days ago

        100% agreed. This can be solved with technology... let users set a soft and a hard threshold, for example. Runaway costs are the problem here.

    • ipaddr 2 days ago

      One bad actor / misconfiguration / attack can put you out of business. It's not the safest strategy to allow unlimited liability, in business or for personal projects.

  • petesergeant 2 days ago

    I've abandoned DataDog in production for just this reason. Is the amount of money they make on dinging people who screw up really worth the ill-will and people who decide they're just not going to start projects on these platforms?

    • geodel a day ago

      > Is the amount of money they make on dinging people who screw up really worth the ill-will

      I think it is.

      1) They make money for services they provided instead of looking into what the customer actually wanted.

      2) Small-time customers move away, so they can concentrate their energy on big enterprise sales.

      Not justifying anything here, but it just kind of makes business sense for them.

      • petesergeant a day ago

        Definitely possible. I wonder over what time period you miss out on small customers who become big customers and go on that journey with you; perhaps that would be minimal anyway.

  • weinzierl 2 days ago

    I never used Modal or vast.ai, and from their pages it was not obvious how they solve the yolo billing issue. Are they prepaid or do they support caps?

    • thundergolfer 2 days ago

      Engineer from Modal here: we support caps. They kick in within ~2s if your usage exceeds the configured limit.

    • sharifhsn 2 days ago

      I know vast.ai uses a prepaid credits system.

      • geodel a day ago

        Doesn't seem vast. Seems tight-budget.ai to me :-)

  • oldandboring 2 days ago

    > uncapped yolo billing

    This made me laugh out loud, thank you for this!

  • rikafurude21 2 days ago

    that's what billing limits are for

    • isoprophlex 2 days ago

      Unless something changed, GCP only does billing alerts, not billing limits

    • aiiizzz 2 days ago

      Those, on gcp, are just alerts, not hard limits, no?

      • jsheard 2 days ago

        Yeah. I think you can hack together a function which pulls the plug automatically if a billing alert fires, but IIRC the alerts can take a few hours to respond, so extreme runaway usage could still result in a bad time.
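
        A rough sketch of that hack (close to the pattern in GCP's cost-control docs): a budget is wired to a Pub/Sub topic, and this function detaches billing when the alert says you're over budget. Names are hypothetical, and detaching billing hard-stops every paid service in the project.

          import base64, json
          from googleapiclient import discovery

          PROJECT_ID = "my-project"  # hypothetical project

          def stop_billing(event, context):
              # Budget notifications arrive as base64-encoded JSON on Pub/Sub.
              msg = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
              if msg.get("costAmount", 0) <= msg.get("budgetAmount", 0):
                  return  # still under budget, do nothing
              billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
              billing.projects().updateBillingInfo(
                  name=f"projects/{PROJECT_ID}",
                  body={"billingAccountName": ""},  # empty string detaches billing
              ).execute()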

        • worldsayshi 2 days ago

          I would rather not realize I had left a bug in that hack after the fact.

  • spacecadet 2 days ago

    Runpod is pretty great. I wrote some generic endpoint script that I can deploy in seconds, download the models to the pod, and I'm ready to go. Plus, I forgot and left a pod running, but down, for a week and it was like $0.60, and they emailed me like 3 times reminding me of the pod.

  • nprateem 2 days ago

    Cloud Run is great, but no billing limits is too scary. No idea why they don't address this. They must know that if they support individuals, we'll eventually leave our SaaSes there.

    • randlet 2 days ago

      Setting max instances effectively caps your spend right?

      • vrosas 2 days ago

        Yes. CR has had this feature since day 1; people just don't bother to read the docs and would rather write long blog posts blaming their cloud provider for manufacturing the gun they shot themselves in the foot with.

mythz 2 days ago

The pricing doesn't look that compelling, here are the hourly rate comparisons vs runpod.io vs vast.ai:

    1x L4 24GB:    google:  $0.71; runpod.io:  $0.43, spot: $0.22
    4x L4 24GB:    google:  $4.00; runpod.io:  $1.72, spot: $0.88
    1x A100 80GB:  google:  $5.07; runpod.io:  $1.64, spot: $0.82; vast.ai  $0.880, spot:  $0.501
    1x H100 80GB:  google: $11.06; runpod.io:  $2.79, spot: $1.65; vast.ai  $1.535, spot:  $0.473
    8x H200 141GB: google: $88.08; runpod.io: $31.92;              vast.ai $15.470, spot: $14.563

Google's pricing also assumes you're running it 24/7 for an entire month, whereas this is just the hourly price for runpod.io or vast.ai, which both bill per second. I wasn't able to find Google's spot pricing for GPUs.

  • otherjason 2 days ago

    Where did you get the pricing for vast.ai here? Looking at their pricing page, I don't see any 8xH200 options for less than $21.65 an hour (and most are more than that).

    • zackangelo 2 days ago

      I think it’s a typo, looks pretty close to their 8xH100 prices.

  • progbits a day ago

    You can just go to "create compute instance" to see the spot pricing.

    E.g. the GCP price for a spot 1xH100 is $2.55/hr, lower with sustained use discounts. But only hobbyists pay these prices; any company is going to ask for a discount and will get it.

  • steren 2 days ago

    > Google's pricing also assumes you're running it 24/7 for an entire month

    What makes you think that?

    Cloud Run's [pricing page](https://cloud.google.com/run/pricing) explicitly says: "charge you only for the resources you use, rounded up to the nearest 100 millisecond".

    Also, Cloud Run's [autoscaling](https://cloud.google.com/run/docs/about-instance-autoscaling) is in effect, scaling down idle instances after a maximum of 15 minutes.

    (Cloud Run PM)

    • mythz 5 hours ago

      Because the pricing when creating an instance shows me the cost for the entire month, then works out the average hourly price based on that. This is just creating a GPU VM instance; I don't see how to see the cost of different Nvidia GPUs without it.

      If you wanted to show hourly pricing, you would show that first, then calculate the monthly price from the hourly rate. I've no idea if the monthly cost includes the sustained usage discount, or what the hourly cost is for just running it for an hour.

      • steren 4 hours ago

        > Because the pricing when creating an instance shows me the cost for the entire month

        Are you referring to the GCP pricing calculator?

        > This is just creating a GPU VM instance

        Maybe you are referring to the Compute Engine VM creation page? Cloud Run is a different GCP service.

        The Cloud Run Service creation UI doesn't show the cost.

  • counters 2 days ago

    Nothing but the 1x L4 is even offered on Cloud Run GPUs, is it?

  • ZiiS 2 days ago

    I think the Google prices are billed per second, so under 20 min you are better off on Google?

    • mythz 2 days ago

      RunPod also charges per second [1]. Also, this is Google's expected average cost per hour after running it 24/7 for an entire month; I couldn't find an hourly cost for each GPU.

      When you need under 1 hr, you can go with RunPod's spot pricing, which is ~4-7x cheaper than Google; even 20 min on Google would cost more than 1 hr on RunPod.

      [1] https://docs.runpod.io/serverless/pricing

    • thousand_nights 2 days ago

      runpod is billed by the minute

      • bts4 11 hours ago

        Technically we bill Pods by the millisecond. Pennies matter :)

jbarrow 2 days ago

I’m personally a huge fan of Modal, and have been using their serverless scale-to-zero GPUs for a while. We’ve seen some nice cost reductions from using them, while also being able to scale WAY UP when needed. All with minimal development effort.

Interesting to see a big provider entering this space. Originally swapped to Modal because big providers weren’t offering this (e.g. AWS lambdas can’t run on GPU instances). Assuming all providers are going to start moving towards offering this?

  • scj13 2 days ago

    Modal is great, they even released a deep dive into their LP solver for how they're able to get GPUs so quickly (and cheaply).

    Coiled is another option worth looking at if you're a Python developer. Not nearly as fast on cold start as Modal, but similarly easy to use and great for spinning up GPU-backed VMs for bursty workloads. Everything runs in your cloud account. The built-in package sync is also pretty nice, it auto-installs CUDA drivers and Python dependencies from your local dev context.

    (Disclaimer: I work with Coiled, but genuinely think it's a good option for GPU serverless-ish workflows.)

  • AndresSRG 2 days ago

    I’m also a big fan.

    Modal has the fastest cold-start I’ve seen for 10GB+ models.

  • dr_kiszonka 2 days ago

    Thanks for sharing! They even support running HIPAA-compliant workloads, which I didn't anticipate.

  • chrishare 2 days ago

    Modal documentation is also very good.

montebicyclelo 2 days ago

The reason Cloud Run is so nice compared to other providers is that it has autoscaling, with scaling to 0. Meaning it can cost basically 0 if it's not being used. You can also set a cap on the scaling, e.g. 5 instances max, which caps the max cost of the service too. - Note, I only have experience with the CPU version of Cloud Run (which is very reliable / easy).

  • rvnx 2 days ago

    Even regular Cloud Run can take a lot of time to boot (~3 to 30 seconds), so this can be a problem when scaling to 0

    • gizzlon 2 days ago

      That's not my experience, using Go. I never measured, but it goes to 0 all the time, so I would definitely have noticed more than a couple of seconds.

      • 827a 2 days ago

        It depends on whether you're on gen1 or gen2 Cloud Run; the default execution environment is `default` which means "you have no idea because GCP selects for you" (not joking).

        Counterintuitively (again, not joking): gen2 suffers from really bad startup speeds, because it's more like a full-on Linux VM/container than whatever weird shim environment gen1 runs. My gen2 containers basically never start up faster than 3 seconds. Gen1 is much faster.

        Note that gen1 and gen2 Cloud Run execution environments are an entirely different concept than first generation and second generation Cloud Functions. First gen Cloud Functions are their own thing. Second generation Cloud Functions can be either first generation or second generation Cloud Run workloads, because they default to the default execution environment. Believe it or not, humans made this.

    • lexandstuff 2 days ago

      Not to mention, if it's an ML workload, you'll also have to factor in downloading the weights and loading them into memory, which can double that time or more.

      • rvnx 2 days ago

        According to the press release, "we achieved an impressive Time-to-First-Token of approximately 19 seconds for a gemma3:4b model"

        Imagine, you have a very small weak model, and you have to wait 20 seconds for your request.

        • happyopossum a day ago

          > Imagine, you have a very small weak model, and you have to wait 20 seconds for your request.

          For your first request, after having scaled to 0 while it wasn’t in use. For a lot of use cases, that sounds great.

          • steren a day ago

            Also, a GPU instance needs 5s to start. The rest depends on how large the model is. So a "very small weak model" can load much faster than 20s.

        • infecto 2 days ago

          Imagine running a production client-facing API and not overprovisioning it.

    • mdhb 2 days ago

      I'm looking at logs for a service I run on Cloud Run right now which scales to zero. Boot times are approximately 200ms for a Dart backend.

huksley 2 days ago

A small and independent EU GPU cloud provider, DataCrunch (I am not affiliated), offers VMs with Nvidia GPUs even cheaper than RunPod, etc.

1x A100 80Gb 1.37€/hour

1x H100 80Gb 2.19€/hour

  • sigmoid10 2 days ago

    That's funny. You can get a 1x H100 80Gb VM at lambda.ai for $2.49/hour. At the current exchange rate, that's exactly 2.19€. Coincidence or is this actually some kind of ceiling?

  • diggan 2 days ago

    Or go P2P with Vast.ai; the cheapest A100 right now is a setup with 2x A100 for $0.8/hour (so $0.4 per A100). Not affiliated with them, but mostly a happy user. Be wary of network speeds though; some hosts are clearly on shared bandwidth and the reported numbers don't always line up with reality, which kind of sucks when you're trying to shuffle around 100GB of data.

    • triknomeister 2 days ago

      You really need NVL for some performance.

      • diggan 2 days ago

        Ok, did you check the instance list? There is a bunch of 8x H200 NVL available?

gabe_monroy 2 days ago

i'm the vp/gm responsible for cloud run and GKE. great to see the interest in this! happy to answer questions on this thread.

albeebe1 2 days ago

Oh this is great news. After a $1000 bill running a model on vertex.ai continuously for a little test I forgot to shut down, this will be my go-to now. I've been using Cloud Run for years running production microservices and little hobby projects, and I've found it simple and cost effective.

lemming 2 days ago

If I understand this correctly, I should be able to stand up an API running arbitrary models (e.g. from Hugging Face), and it’s not quite charged by the token but should be very cheap if my usage is sporadic. Is that correct? Seems pretty huge if so, most of the providers I looked at required a monthly fee to run a custom model.

  • lexandstuff 2 days ago

    Yes, that's basically correct. Except be warned that the cold start times can be huge (30-60 seconds). So scaling to 0 doesn't really work in practice, unless your users are happy to wait from time to time. Also, you have to pay a small monthly fee for container storage (and a few other charges, IIRC).
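
    For reference, the deploy is roughly one command once you have a container that serves the model over HTTP (e.g. wrapping vLLM or Ollama). The service name and image below are placeholders, and the flag names are from memory, so double-check the current docs:

      $ gcloud run deploy my-model --image=IMAGE_URL --region=us-central1 \
          --gpu=1 --gpu-type=nvidia-l4 --memory=16Gi --cpu=4 \
          --max-instances=1 --no-cpu-throttling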

  • 42lux 2 days ago

    Runpod, vast, coreweave, replicate... just a bunch of alternatives that let you run serverless GPU inference.

    • _zoltan_ 2 days ago

      you can't just sign up for coreweave, can you?

      • 42lux 2 days ago

        We wrote them an email.

felix_tech 16 hours ago

I've been using this for daily/weekly ETL tasks, which saves quite a lot of money vs having an instance on all the time, but it's been clunky.

The main issue is that despite there being a 60-minute timeout available, the API will just straight up not return a response code if your request takes more than ~5 minutes in most cases, so you have to make sure you can poll wherever the data is being stored and let the client time out.
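
A minimal sketch of that workaround (names and the GCS destination are made up; assumes the job writes its output to a bucket):

    import time
    import requests
    from google.cloud import storage

    def run_etl_and_wait(service_url, bucket_name, result_blob, poll_secs=30):
        try:
            # Kick off the ETL; tolerate the response never coming back.
            requests.post(f"{service_url}/run-etl", timeout=(10, 360))
        except requests.exceptions.RequestException:
            pass  # expected when the response code gets dropped

        # Trust the data, not the HTTP response: poll the output location.
        bucket = storage.Client().bucket(bucket_name)
        deadline = time.time() + 3600
        while time.time() < deadline:
            if bucket.blob(result_blob).exists():
                return f"gs://{bucket_name}/{result_blob}"
            time.sleep(poll_secs)
        raise TimeoutError("ETL output never appeared")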

  • covi 8 hours ago

    Take a look at SkyPilot. Good for running these batch workloads. You can use spot instances to save costs.

jjuliano 2 days ago

I'm the developer of kdeps.com, and I really like Google Cloud Run; I've been using it since the beta. Kdeps outputs Dockerized full-stack AI agent apps that run open-source LLMs locally, and my project works very well with GCR.

m1 a day ago

Love Cloud Run and this looks like a great addition. The only things I wish for from Cloud Run: being able to run self-hosted GitHub runners on it (last time I checked this wasn't possible, as it requires root), and the new worker pool feature seems great in principle, but it looks like you have to write the scaler yourself rather than it being built in.

  • aniruddhc a day ago

    Hi! I'm the Eng Manager responsible for Autoscaling for Serverless and Worker Pools.

    We're actively defining our roadmap, and understanding your use case would be incredibly valuable. If you're open to it, please email me at <my HN username>@google.com. I'd love to learn more about how you'd use worker pools and what kind of workloads you need to scale.

Aeolun 2 days ago

That's 67ct/hour for a GPU-enabled instance. That's pretty good, but I have no idea how L4 GPUs compare against others.

  • wut42 2 days ago

    L4s are pretty limited nowadays. They're usually rented at 40ct/hour on other providers.

holografix 2 days ago

The value in this really is running small custom models or the absolute latest open weight models.

Why bother when you can get pay-as-you-go API access to popular open-weights models like Llama on Vertex AI Model Garden or at the edge on Cloudflare?

  • progbits 2 days ago

    Custom models.

    We use this, pretty convenient and less hassle than managing our autoscaling GPU pools.

gardnr 2 days ago

The Nvidia L4 has 24GB of VRAM and consumes 72 watts, which is relatively low compared to other datacenter cards. It's not a monster GPU, but it should work OK for inference.

pier25 2 days ago

How does this compare to Fly GPUs in terms of pricing?

ringeryless 2 days ago

i wonder what all this hype-driven overcapacity will be used for by future generations.

once this bubble pops we are going to have some serious albeit high-latency hardware

  • zaphar 2 days ago

    Crunching really large amounts of numbers has always been useful, and that's all this really is: running weather simulations, advanced math problems, complicated engineering simulations. The space of possible uses is incredibly wide.

    • otherjason 2 days ago

      The last few generations of GPU architectures have been increasingly optimized for massive throughput of low-precision integer arithmetic operations, though, which are not useful for any of those other applications.

  • happyopossum a day ago

    > overcapacity

    I'm not sure that word means what you think it means. There is a pretty severe shortage of GPU capacity in the industry right now.

  • esafak 2 days ago

    What overcapacity? People are struggling to find affordable GPUs.

treksis 2 days ago

Everything good except the price.

ninetyninenine 2 days ago

I'm tired of using AI in cloud services. I want user-friendly, locally owned AI hardware.

Right now nothing is consumer friendly. I can't get a packaged deal of some locally running ChatGPT-quality UI or voice command system in an all-in-one package. Like what Macs did for PCs, I want the same for AI.

  • Hilift 2 days ago

    Oracle just announced they are spending $40 billion on GPU hardware. All cloud providers have an AI offering, and there are AI-specific cloud providers. I don't think retail is invited.

  • Workaccount2 2 days ago

    From the most unexpected place (but maybe expected if you believed they were paying attention)

    Maxsun is releasing a 48GB dual Intel Arc Pro B60 GPU. It's expected to cost ~$1000.

    So for around $4k you should be able to build an 8 core 192GB local AI system, which would allow you to locally run some decent models.

    This also assumes the community builds an intel workflow, but given how greedy Nvidia is with vram, it seems poised to be a hit.

    • zorgmonkey 2 days ago

      The price of that system is unfortunately going to end up being a lot more than 4k; you'd need a CPU that has at least 64 lanes of PCIe. That's going to be either a Xeon W or a Threadripper CPU, and with the motherboard, RAM, etc. you're probably looking at at least another 2k.

      Also kind of a nitpick, but I'd call that an 8-GPU system; each BMG-G21 die has 20 Xe2 cores. Even though it would be 4 PCIe cards, it is probably best to think of it as 8 GPUs (that's how it will show up in stuff like pytorch), especially because there is no high-speed interconnect between the GPU dies colocated on the card. Also, if you're going to do this, make sure you get a motherboard with good PCIe bifurcation support.

  • Disposal8433 2 days ago

    Your local computer is not powerful enough, and that's why you must welcome those brand new mainframes... I mean, "cloud services."

    • pjmlp 2 days ago

      It is funny how using a Web IDE and a cloud shell is such déjà vu from when I used to do development on a common UNIX server shared by the whole team.

      • edoceo 2 days ago

        Telnet from a Wyse terminal.

        • pjmlp 2 days ago

          My first experience with such a setup was connecting to DG/UX, via the terminal application on Windows for Workgroups, or some thin-client terminals in a mix of green or amber phosphor, spread around the campus.

          That was the only time I used a Pascal compiler in ISO Pascal mode; it had the usual extensions inspired by UCSD, but we weren't allowed to use them on the assignments.

    • ninetyninenine 2 days ago

      My local computer is not powerful enough to run training, but it can certainly run an LLM. How do I know? Many other people and I have already done it. DeepSeek, for example, can be run locally, but it's not a user-friendly setup.

      I want an Amazon Echo agent running my home with a locally running LLM.

  • ata_aman 2 days ago

    I made something[0] last year to be very consumer friendly. Unbox->connect->run. The first iteration is purely to test out the concept and is pretty low-power; I'm currently working on a GPU version for bigger models, launching Q4 this year.

    [0] https://persys.ai

  • petesergeant 2 days ago

    Hoping the DGX Spark will deliver on this

    • Gracana 2 days ago

      It will not. 273GB/s memory bandwidth is not enough.

ivape 2 days ago

Does anyone actually run a modest-sized app and can share numbers on what one GPU gets you? Assuming something like vLLM for concurrent requests, what kind of throughput are you seeing? Serving an LLM just feels like a nightmare.

einpoklum 2 days ago

Why is commercial advertising published as a content article here?

omneity 2 days ago

> Time-to-First-Token of approximately 19 seconds for a gemma3:4b model (this includes startup time, model loading time, and running the inference)

This is my biggest pet-peeve with serverless GPU. 19 seconds is a horrible latency from the user’s perspective and that’s a best case scenario.

If this is the best one of the most experienced teams in the world can do, with a small 4B model, then it feels like serverless is really restricted to non-interactive use cases.

  • diggan 2 days ago

    That has to be cold start, and the next N requests would surely use the already-started instance? It sounds bananas that they'd even mention using something like that with 19 seconds of latency for all requests in any context.

  • happyopossum a day ago

    Sure, but how often is an enterprise deployed LLM application really cold-starting? While you could run this for one-off and personal use, this is probably more geared towards bursty ‘here’s an agent for my company sales reps’ kind of workloads, so you can have an instance warmed, then autoscale up at 8:03am when everyone gets online (or in the office or whatever).

    At that point, 19 seconds looks great, as lower latency startup times allow for much more efficient autoscaling.

  • wut42 2 days ago

    Definitely -- and yet it's kind of a feat compared to other solutions: when I tried RunPod Serverless I could wait up to five minutes for a cold start of an even smaller model than a 4B.

  • infecto 2 days ago

    If you were running a real business with these would the aim not be to overprovision and to setup auto scaling in such a way that you always have excess capacity?

    • omneity 2 days ago

      That seems to be the gist of it. You cannot rely on serverless alone and you need one or many pre-warmed instances at all times. This distinction is rarely mentioned in serverless GPU spaces yet has been my experience in general.

      • nullpointerexp 2 days ago

        When scaling from 0 to 1 instances, yes, you have to wait 19 seconds.

        For scaling N --> N+1: if you configure the correct concurrency value (the number of parallel requests one instance can handle), Cloud Run will scale up additional instances when utilization reaches X% (I think it's 70%). That happens before the instance is fully exhausted, so your users should not experience the 19-second cold start.

  • bravesoul2 2 days ago

    Looks like GPU instances, not "lambda", so presumably you would over-provision to compensate.