Google's real moat isn't the TPU silicon itself—it's not about cooling, individual performance, or hyper-specialization—but rather the massive parallel scale enabled by their OCS interconnects.
To quote The Next Platform: "An Ironwood cluster linked with Google’s absolutely unique optical circuit switch interconnect can bring to bear 9,216 Ironwood TPUs with a combined 1.77 PB of HBM memory... This makes a rackscale Nvidia system based on 144 “Blackwell” GPU chiplets with an aggregate of 20.7 TB of HBM memory look like a joke."
Nvidia may have the superior architecture at the single-chip level, but for large-scale distributed training (and inference) they currently have nothing that rivals Google's optical switching scalability.
Also, Google owns the entire vertical stack, which is what most people need. It can provide an entire spectrum of AI services far cheaper, at scale (and still profitable) via its cloud. Not every company needs to buy the hardware and build models, etc., etc.; what most companies need is an app store of AI offerings they can leverage. Google can offer this with a healthy profit margin, while others will eventually run out of money.
Google's work on JAX, PyTorch, TensorFlow, and the more general XLA underneath is exactly the kind of anti-moat everyone has been clamoring for.
Anti-moat like commoditizing the compliment?
If they get things like PyTorch to work well without caring what hardware it is running on, it erodes Nvidia's CUDA moat. Nvidia's chips are excellent, without doubt, but their real moat is the ecosystem around CUDA.
The problem is that "hardware-agnostic PyTorch" is a myth, much like Java's "write once, run anywhere". At the high level (API), the code looks the same, but as soon as you start optimizing for performance, you inevitably drop down to CUDA. As long as researchers are writing their new algorithms in CUDA because it's the de facto language of science, Google will forever be playing catch-up, having to port these algorithms to XLA. An ecosystem is, after all, people and their habits, not just libraries.
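To make that concrete, here is a minimal PyTorch sketch (arbitrary shapes, assumes a CUDA GPU and a recent PyTorch): at the API level both attention paths below look like ordinary "hardware-agnostic" Python, but only the second dispatches to the hand-written fused CUDA kernels that actually deliver the performance.

```python
import torch
import torch.nn.functional as F

# Arbitrary illustrative shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 16, 1024, 64, device="cuda", dtype=torch.float16)

# "Hardware-agnostic" attention in plain PyTorch ops: readable anywhere,
# but it materializes the full seq x seq score matrix in memory.
scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
out_naive = torch.softmax(scores, dim=-1) @ v

# The fast path: one call that, on NVIDIA hardware, dispatches to fused,
# hand-written CUDA kernels (FlashAttention-style) when they are available.
out_fused = F.scaled_dot_product_attention(q, k, v)

# Same math, very different memory traffic and kernel code underneath.
print((out_naive - out_fused).abs().max())
```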
I'd love for someone to give me an alternative to CUDA but I don't primarily use GPUs for inference, I do 64-bit unsigned integer workloads and the only people who seem to care even a little about this currently are NVidia, if imperfectly.
I _really_ want an alternative but the architecture churn imposed by targeting ROCm for say an MI350X is brutal. The way their wavefronts and everything work is significantly different enough that if you're trying to get last-mile perf (which for GPUs unfortunately yawns back into the 2-5x stretch) you're eating a lot of pain to get the same cost-efficiency out of AMD hardware.
FPGAs aren't really any more cost effective unless the $/kWh goes into the stratosphere, which is a hypothetical I don't care to contemplate.
That's new to me -- what sorts of workloads are centered on 64-bit uints?
PyTorch is only part of it. There is still a huge amount of CUDA that isn’t just wrapped by PyTorch and isn’t easily portable.
... but not in deep learning or am I missing something important here?
Yes, absolutely in deep learning. Custom fused CUDA kernels everywhere.
Yep. MoE, FlashAttention, or sparse retrieval architectures for example.
Yes!
PyTorch, JAX, and TensorFlow are all examples to me of very capable products that compete very well in the ML space.
But more broadly, work like XLA and IREE gives us very interesting toolkits for mapping a huge variety of computation onto many types of hardware. While PyTorch et al. are fine example applications, things you can build, XLA is the Big Tent idea, the toolkit to erode not just specific CUDA use cases but to allow hardware in general to be more broadly useful.
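As a tiny illustration of that "big tent" idea (a sketch, not a benchmark; the function and shapes are made up): the same JAX code is lowered by XLA to whichever backend the process happens to have, CPU, GPU, or TPU, with no backend-specific code in the model itself.

```python
import jax
import jax.numpy as jnp

def mlp_block(x, w1, w2):
    # Plain numpy-style code; XLA decides how to fuse and schedule it for
    # whichever backend this process is running on (CPU, GPU, or TPU).
    return jnp.dot(jax.nn.gelu(jnp.dot(x, w1)), w2)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 512))
w1 = jax.random.normal(key, (512, 2048))
w2 = jax.random.normal(key, (2048, 512))

compiled = jax.jit(mlp_block)   # traced once, compiled by XLA
print(jax.devices())            # whatever XLA backend is present
print(compiled(x, w1, w2).shape)
```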
*complement
When ChatGPT came out, I thought Google didn't have the leadership and team spirit to recover. Seems like I was very wrong.
They just need to actually make and market a good product though, and they seem to really struggle with this. Maybe on a long enough timeline their advantages will make this one inevitable.
With all this vertical integration, no wonder Apple and Google have such a tight relationship.
That is comparing an all-to-all switched NVLink fabric to a 3D torus for the TPUs. Those are completely different network topologies with different tradeoffs.
For example, the currently very popular Mixture of Experts architectures require a lot of all-to-all traffic (for expert parallelism), which works a lot better on the switched NVLink fabric, where it doesn't need to traverse multiple links the way it does in the torus.
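For the curious, here is roughly what that traffic looks like in code, as a hedged sketch using torch.distributed (names and sizes are illustrative, and it assumes an NCCL process group has already been set up, e.g. via torchrun): every rank exchanges a block of tokens with every other rank, which a flat switched domain handles in one hop and a torus has to route over multiple links.

```python
import torch
import torch.distributed as dist

# Assumes an NCCL process group was already initialized, e.g. by torchrun:
#   dist.init_process_group(backend="nccl")
world_size = dist.get_world_size()

# One expert per rank (illustrative); each rank holds tokens routed to
# every expert, so every rank must exchange data with every other rank.
tokens_per_expert, hidden = 1024, 4096
send = torch.randn(world_size * tokens_per_expert, hidden, device="cuda")
recv = torch.empty_like(send)

# Expert-parallel dispatch: chunk i of `send` goes to rank i, and chunk i
# of `recv` arrives from rank i. On a flat switched domain each exchange
# is a single hop; on a torus the same pattern crosses multiple links.
dist.all_to_all_single(recv, send)
```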
This is an underrated point. Comparing just the peak bandwidth is like saying Bulldozer was the far superior CPU of the era because it had a really high frequency ceiling.
Really? Fully-connected hardware is unbuildable (at scale), which we already know from the HPC world. Fat trees and dragonfly networks are pretty scalable, but a 3D torus is a very good tradeoff, and it respects the dimensionality of reality.
Bisection bandwidth is a useful metric, but is hop count? Per-hop cost tends to be pretty small.
Latency (of different types), jitter, and guaranteed bandwidth are the real underlying metrics. Hop count is just one potential driver of those, but different approaches may or may not tackle each of these parts differently.
NVFP4 is the thing no one saw coming. I wasn't watching the MX process really, so I cast no judgements, but it's exactly what it sounds like, a serious compromise in resource constrained settings. And it's in the silicon pipeline.
NVFP4 is to put it mildly a masterpiece, the UTF-8 of its domain and in strikingly similar ways it is 1. general 2. robust to gross misuse 3. not optional if success and cost both matter.
It's not a gap that can be closed by a process node or an architecture tweak: it's an order of magnitude where the polynomials that were killing you on the way up are now working for you.
sm_120 (what NVIDIA's quiet repos call CTA1) consumer gear does softmax attention and projection/MLP blockscaled GEMM at a bit over a petaflop at 300W and close to two (dense) at 600W.
This changes the whole game, and it's not clear anyone outside the lab even knows the new equilibrium points. It's nothing like Flash3 on Hopper: a lot of stuff looks FLOPs-bound, and GDDR7 looks like a better deal than HBM3e. The DGX Spark is in no way deficient; it has ample memory bandwidth.
This has been in the pipe for something like five years and even if everyone else started at the beginning of the year when this was knowable, it would still be 12-18 months until tape out. And they haven't started.
Years Until Anyone Can Compete With NVIDIA is back up to the 2-5 it was 2-5 years ago.
This was supposed to be the year ROCm and the new Intel stuff became viable.
They had a plan.
This reads like a badly done, sponsored hype video on YouTube.
So if we look at what NVIDIA has to say about NVFP4, it sure sounds impressive [1]. But look closely: that initial graph never compares FP8 and FP4 on the same hardware. They jump from H100 to B200 while implying a 5x gain from going to FP4, which it isn't, accompanied by scary words like the warning that with MXFP4 there is a "Risk of noticeable accuracy drop compared to FP8".
Contrast that with what AMD has to say about the open MXFP4 approach, which is quite similar to NVFP4 [2]. Oh, the horror of getting 79.6 instead of 79.9 on GPQA Diamond when using MXFP4 instead of FP8.
[1] https://developer.nvidia.com/blog/introducing-nvfp4-for-effi...
[2] https://rocm.blogs.amd.com/software-tools-optimization/mxfp4...
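For anyone who hasn't dug into these formats: both MXFP4 and NVFP4 are block-scaled 4-bit formats, where a small block of values shares one scale and each element snaps to one of the few representable E2M1 magnitudes. Below is a deliberately simplified Python sketch of that idea (per-block float scale; the block size and scale encoding do not match either actual spec).

```python
import numpy as np

# Representable |values| of an E2M1 element (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4_block(x, block=32):
    """Round-trip a 1-D array through a simplified block-scaled FP4 format.

    Each block of `block` values shares one scale, chosen so the block
    maximum lands on the largest FP4 magnitude (6.0). The real MXFP4 and
    NVFP4 formats differ in block size and in how the scale itself is
    encoded; this only shows the block-scaling idea.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for start in range(0, x.size, block):
        chunk = x[start:start + block]
        scale = np.max(np.abs(chunk)) / 6.0 or 1.0
        # Snap each element to the nearest representable FP4 magnitude.
        idx = np.abs(np.abs(chunk)[:, None] / scale - E2M1_GRID[None, :]).argmin(axis=1)
        out[start:start + block] = np.sign(chunk) * E2M1_GRID[idx] * scale
    return out

x = np.random.randn(4096)
err = np.abs(x - fake_quant_fp4_block(x)).mean()
print(f"mean abs round-trip error: {err:.4f}")
```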
Looking into NVFP4/Nvidia vs MXFP4/AMD, the summary was that they seem to be pretty close when including the MI355X, which leads in VRAM and throughput but trails slightly in accuracy, and mixing in MXFP6 makes up for that.
This comment reads as if it were LLM-generated.
Agree.
It's fun when you then read the latest Nvidia tweet [1] suggesting that their tech is still better, based on pure vibes, like anything in the (Gen)AI era.
Not vibes. TPUs have fallen behind or had to be redesigned from scratch many times as neural architectures and workloads evolved, whereas the more general purpose GPUs kept on trucking and building on their prior investments. There's a good reason so much research is done on Nvidia clusters and not TPU clusters. TPU has often turned out to be over-specialized and Nvidia are pointing that out.
You say that like it's a bad thing. Nvidia architectures keep changing and getting more advanced as well, with specialized tensor operations, different accumulators and caches, etc. I see no issue with progress.
That’s missing the point. Things like tensor cores were added in parallel with improvements to the existing compute, and CUDA kernels from 10 years ago generally run without modification. Hardware architecture may change, but Nvidia has largely avoided changing how you interact with it.
Modern CUDA programs that hit roofline look absolutely nothing like those from 10 or even 5 years ago. Or even 2 if you’re on Blackwell.
They don't have to, CUDA is a high-level API in this respect. The hardware will conform to the demands of the market and the software will support whatever the compute capability defines, Nvidia is clearer than most about this.
But for research you often don't have to max out the hardware right away.
And the question is what do programs that max out Ironwood look like vs TPU programs written 5 years ago?
Sure, but you do have to do it pretty quickly. Let’s pick an H100. You’ve probably heard that just writing scalar code is leaving 90+% of the FLOPs idle. But even past that, if you’re using the tensor cores but using the wrong instructions, you’re basically capped at 300-400 TFLOPS of the 1000 the hardware supports. If you’re using the new instructions but poorly, you’re probably not going to hit even 500 TFLOPS. That’s just barely better than the previous generation you paid a bunch of money to replace.
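If you want to see that gap for yourself, here is a crude, unscientific sketch (numbers vary wildly by GPU, clocks, and library version; the matmul size is arbitrary): compare what the same GPU achieves on a plain FP32 matmul versus a BF16 one that routes through the tensor cores.

```python
import time
import torch

# Keep FP32 matmuls off the TF32 tensor-core path so the contrast is visible.
torch.backends.cuda.matmul.allow_tf32 = False

def measured_tflops(dtype, n=8192, iters=20):
    """Very rough achieved throughput of an n x n matmul at the given dtype."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b                       # warm-up / kernel selection
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return 2 * n**3 * iters / (time.perf_counter() - start) / 1e12

# FP32 here runs on the regular CUDA cores; BF16 routes through the tensor
# cores. Neither number will hit the datasheet peak, which is the point.
print(f"fp32: {measured_tflops(torch.float32):.0f} TFLOPS")
print(f"bf16: {measured_tflops(torch.bfloat16):.0f} TFLOPS")
```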
And yet current versions of Whisper GPU will not run on my not-quite-10-year old Pascal GPU anymore because the hardware CUDA version is too old.
Just because it's still called CUDA doesn't mean it's portable over a not-that-long of a timeframe.
Portable doesn't normally mean that it runs on arbitrarily old hardware. CUDA was never portable, it only runs on Nvidia hardware. The question is whether old versions of Whisper GPU run on newer hardware, that'd be backwards compatibility.
> There's a good reason so much research is done on Nvidia clusters and not TPU clusters.
You are aware that Gemini was trained on TPU, and that most research at Deepmind is done on TPU?
> based on pure vibes
The tweet gives their justification; CUDA isn't ASIC. Nvidia GPUs were popular for crypto mining, protein folding, and now AI inference too. TPUs are tensor ASICs.
FWIW I'm inclined to agree with Nvidia here. Scaling up a systolic array is impressive but nothing new.
Sure, but their company's 4.3 trillion valuation isn't based on how good their GPUs are for general purpose computing, it's based on how good they are at AI.
> NVIDIA is a generation ahead of the industry
a generation is 6 months
For GPUs a generation is 1-2 years.
What in that article makes you think a generation is shorter?
* Turing: September 2018
* Ampere: May 2020
* Hopper: March 2022
* Lovelace (designed to work with Hopper): October 2022
* Blackwell: November 2024
* Next: December 2025 or later
With a single exception for Lovelace (arguably not a generation), there are multiple years between generations.
No, not at all. If this were true Google would be killing it in MLPerf benchmarks, but they are not.
It’s better to have a faster, smaller network for model parallelism and a larger, slower one for data parallelism than a very large, but slower, network for everything. This is why NVIDIA wins.
I mean, Google just isn't participating it seems?
For all the excitement surrounding this, I fail to comprehend how Google can't even meet the current demand for Gemini 3^. Moreover, they are unwilling to invest in expansion directly (apparently they have a mandate to double their compute every 6 months without spending more than their current budget). So pardon me if I can't see how they will scale operations as demand grows while simultaneously selling their chips to competitors?! This situation doesn't make any sense.
^Even now I get capacity related error messages, so many days after the Gemini 3 launch. Also, Jules is basically unusable. Maybe Gemini 3 is a bigger resource hog than anyone outside of Google realizes.
I also suspect Google is launching models it can’t really sustain in volume or that are operating at a loss. Nothing preventing them from like doubling model size compared to the rest or allocating an insane amount of compute just to make the headlines on model performance (clearly it’s good for the stock). These things are opaque anyway, buried deep into the P&L.
OCS is indeed an engineering marvel, but look at NVIDIA's NVL72. They took a different path: instead of flexible optics, they used the brute force of copper, turning an entire rack into one giant GPU with unified memory. Google is solving the scale-out problem, while NVIDIA is solving the scale-up problem. For LLM training tasks, where communication is the bottleneck, NVIDIA's approach with NVLink might actually prove even more efficient than Google's optical routing.
100 times more chips for equivalent memory, sure.
Check the specs again. Per chip, TPU 7x has 192GB of HBM3e, whereas the NVIDIA B200 has 186GB.
While the B200 wins on raw FP8 throughput (~9000 vs 4614 TFLOPs), that makes sense given NVIDIA has optimized for the single-chip game for over 20 years. But the bottleneck here isn't the chip—it's the domain size.
NVIDIA's top-tier NVL72 tops out at an NVLink domain of 72 Blackwell GPUs. Meanwhile, Google is connecting 9216 chips at 9.6Tbps to deliver nearly 43 ExaFlops. NVIDIA has the ecosystem (CUDA, community, etc.), but until they can match that interconnect scale, they simply don't compete in this weight class.
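Back-of-the-envelope check on those pod numbers, taking the per-chip figures quoted above at face value (nominal marketing numbers, decimal units):

```python
chips = 9216
hbm_per_chip_gb = 192       # Ironwood HBM3e per chip, as quoted above
dense_fp8_tflops = 4614     # per-chip FP8 figure quoted above

pod_hbm_pb = chips * hbm_per_chip_gb / 1e6       # GB -> PB (decimal)
pod_exaflops = chips * dense_fp8_tflops / 1e6    # TFLOPS -> EFLOPS

print(f"{pod_hbm_pb:.2f} PB of HBM, {pod_exaflops:.1f} EFLOPS")
# -> roughly 1.77 PB and ~42.5 EFLOPS, consistent with the pod-level claims.
```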
Isn’t the 9000 TFLOP/s number Nvidia’s relatively useless sparse FLOP count that is 2x the actual dense FLOP count?
Correct. I found a remark on Twitter calling this "Jensen Math".
Same logic as when Nvidia quotes the "bidirectional bandwidth" of high speed interconnects to make the numbers look big, instead of the more common bandwidth per direction, forcing everyone else to adopt the same metric in their marketing materials.
Wow, no, not at all. It’s better to have a set of smaller, faster cliques connected by a slow network than a slower-than-clique flat network that connects everything. The cliques connected by a slow DCN can scale to arbitrary size. Even Google has had to resort to that for its biggest clusters.
Is this claim based on observed comm patterns in some particular AI architecture?
I guess “this weight class” is some theoretical class divorced from any application? Almost all players are running Nvidia other than Google. The other players are certainly more than just competing with Google.
> Almost all players are running Nvidia other than Google.
No surprises there, Google is not the greatest company at productizing their tech for external consumption.
> The other players are certainly more than just competing with Google.
TBF, it's easy to stay in the game when you're flush with cash, and for the past N quarters investors have been throwing money at AI companies; Nvidia's margins have greatly benefited from this largesse. There will be blood on the floor once investors start demanding returns on their investments.
Ok? The person I was replying to was saying that Google’s compute offering is substantially superior to Nvidia’s. What do your comments about market positioning have to do with that?
If Google’s TPUs were really substantially superior, don’t you think that would result in at least short term market advantages for Gemini? Where are they?
They are suggesting it is easier for others to just buy more Nvidia chips and feed them more power, while operating costs are covered by investors. If they move on to competing on who can do inference the cheapest, then the TPUs will shine.
The original post made no comments about inference or training or even cost in any way. It said you could hook up more TPUs together with more memory and higher average bandwidth than you could with a datacenter of Nvidia GPUs. From an architectural point of view, it isn’t clear (and is not explained) what that enables. It clearly hasn’t led to a business outcome for Google where they are the clear market leader.
Seemingly fast interconnects benefit training more than inference since training can have more parallel communication between nodes. Inference for users is more embarrassingly parallel (requires less communication) than updating and merging network weights.
My point: cool benchmark, what does it matter? The original post says Nvidia doesn’t have anything to compete with massively interconnected TPUs. It didn’t merely say Google’s TPUs were better. It said that Nvidia can’t compete. That’s clearly bullshit and wishful thinking, right? There is no evidence in the market to support that, and no actual technical points have been presented in this thread either. OpenAI, Anthropic, etc are certainly competing with Google, right?
> My point: cool benchmark, what does it matter?
And then people explained why the effects are smoothed over right now but will matter eventually and you rejected them as if they didn't understand your question. They answered it, take the answer.
> It didn’t merely say Google’s TPUs were better. It said that Nvidia can’t compete.
Can't compete at clusters of a certain size. The argument is that anyone on nVidia simply isn't building clusters that big.
The fact that NVidia are currently winning is undisputed.
Yet everyone uses NVIDIA and Google is in the catch-up position.
Ecosystem is a MASSIVE factor and will be a massive factor for all but the biggest models.
Catch-up in what exactly? Google isn't building hardware to sell, they aren't in the same market.
Also, I feel you completely misunderstand: the problem isn't how fast ONE GPU is vs ONE TPU; what matters is the cost for the same output. If I can fill a datacenter at half the cost for the same output, does it matter that I've used twice as many TPUs and that a single Nvidia Blackwell was faster? No...
And hardware cost isn't even the biggest problem; operational costs, mostly power and cooling, are another huge one.
So if you design a solution that fits your stack (and is designed for it) and optimize for your operational costs, you're light years ahead of a competitor using the more powerful solution that costs 5 times more in hardware and twice as much in operational costs.
All I say is more or less true for inference economics; I have no clue about training.
Also, isn't memory a bit moot? At scale I thought that the ASICs frequently sat idle waiting for memory.
You're doing operations on the memory once it's been transferred to gpu memory. Either shuffling it around various caches or processors or feeding it into tensor cores or other matrix operations. You don't want to be sitting idle.
I think it's not about the cost but the limits of quickly accessible RAM
Ironwood is 192GB, Blackwell is 96GB, right? Or am I missing something?
182GB and B300 is 288GB. IIRC