It’s Google I/O 2022 this week, among other things, and we were hoping for a deep architectural dive into the TPUv4 matrix math engines that Google teased at last year’s I/O event. Alas, no luck. But the search engine and advertising giant, which also happens to be one of the biggest AI innovators on the planet thanks to the gigantic amount of data it has to chew on, did provide additional information about the TPUv4 processors and the systems that use them.
Google also said it is installing eight pods of TPUv4 systems in its data center in Mayes County, Oklahoma, which together deliver 9 exaflops of aggregate computing capacity, for use by its Google Cloud division, so that researchers and companies have access to the same kind and scale of compute that Google uses to develop and deploy its own AI in-house.
Google has operated data centers in Mayes County, northeast of Tulsa, since 2007 and has invested $4.4 billion in the facilities since then. It’s located near the geographic center of the United States – well, a little south and east of it – and that makes it useful because of the relatively low latencies to much of the country. And now, by our reckoning, Mayes County hosts one of the largest concentrations of AI iron on the planet. (If all eight TPUv4 pods were networked together and work could scale across them simultaneously, perhaps we could say “largest” unequivocally. Google certainly did, as you will see in the quote below.)
During his keynote speech, Sundar Pichai, who is CEO of Google and of its parent company, Alphabet, mentioned in passing that TPUv4 pods were in preview on its cloud.
“All of the advances we shared today are only possible through the continued innovation of our infrastructure,” Pichai said of some pretty cool natural language and immersive data search engine improvements that he said Google has made and which feed all kinds of applications. “Recently, we announced our intention to invest $9.5 billion in data centers and offices across the United States. One of our state-of-the-art data centers is in Mayes County, Oklahoma, and I’m thrilled to announce that we’re launching the world’s largest publicly accessible machine learning hub there for all our Google Cloud customers. This machine learning hub features eight Cloud TPU v4 pods, custom-built on the same network infrastructure that powers Google’s largest neural models. They provide nearly 9 exaflops of computing power in total, giving our customers unprecedented ability to run complex models and workloads. We hope this will fuel innovation in everything from medicine to logistics to sustainability and more.”
Pichai added that this TPUv4 pod-based AI hub already has 90% of its power coming from sustainable, carbon-free sources. (He didn’t say how much was wind, solar or hydroelectric.)
Before we get into the speeds and feeds of the TPUv4 chips and pods, it’s probably worth pointing out that, as far as we know, Google could already have TPUv5 pods in its internal data centers, and it might have a significantly larger collection of TPUs driving its own models and augmenting its own applications with AI algorithms and routines. That would be the old way Google did things: talk about generation N of something while selling generation N-1, having already moved to generation N+1 for its internal workloads.
This does not seem to be the case. In a blog post written by Sachin Gupta, vice president and general manager of infrastructure at Google Cloud, and Max Sapozhnikov, product manager for Cloud TPUs, the two revealed that when the TPUv4 systems were built last year, Google gave early access to researchers at Cohere, LG AI Research, Meta AI, and Salesforce Research, and they added that TPUv4 systems were used to create the Pathways Language Model (PaLM) that underpins the natural language processing and speech recognition innovations at the heart of today’s keynote. Specifically, PaLM was developed and tested on two TPUv4 pods, each of which has 4,096 TPUv4 matrix math engines.
If Google’s brightest new models are being developed on TPUv4s, then it probably doesn’t have a fleet of TPUv5s hiding in a data center somewhere. Although, we add, it would be nice if TPUv5 machines were hiding 26.7 miles southwest of our office, in the Lenoir data center, shown here from our window:
The gray strip down the mountain, below the birch leaves, is Google’s data center. If you squint, Apple’s data center in Maiden is off to the left and considerably further down the line.
Enough. Let’s talk about some speeds and feeds. Here is a table comparing the capabilities of TPUv4 to TPUv3:
Last year, when Pichai was hinting at TPUv4, we guessed that Google was moving to 7 nanometer processes for this generation of TPU, but given the very low power consumption, it looks like it is probably etched using a 5 nanometer process. (We assumed Google was trying to keep the power envelope constant, when it clearly wanted to lower it.) We also assumed it was doubling the core count, from two cores on the TPUv3 to four cores on the TPUv4, something Google has neither confirmed nor denied.
Doubling the cores, and therefore the raw performance, would bring the TPUv4 to 246 teraflops per chip, and going from 16 nanometers to 7 nanometers would allow that doubling within roughly the same power envelope at roughly the same clock speed. The move to 5 nanometers allows the chip to be smaller and run a bit faster while reducing power consumption – and to have a smaller chip with potentially better yields as 5 nanometer processes mature. As it turns out, average power consumption decreased by 22.7 percent, and that corresponds to an 11.8 percent increase in clock speed, given the two process node hops from TPUv3 to TPUv4.
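The arithmetic behind those percentages is easy to check. A minimal sketch, assuming a 123 teraflops per-chip peak for TPUv3 and a 275 teraflops peak for TPUv4 (the latter is our reading of the roughly 2.2X per-chip claim, not a number Google spells out in prose):

```python
# Back-of-the-envelope check of the implied TPUv4 clock speed uplift.
tpuv3_tflops = 123.0               # TPUv3 per-chip peak BF16 (known)
tpuv4_tflops = 275.0               # TPUv4 per-chip peak BF16 (our assumption)

# Doubling cores at the same clock would give exactly 2X the throughput:
doubled_cores_tflops = 2 * tpuv3_tflops          # 246 teraflops

# Whatever performance is left over must come from a faster clock:
clock_uplift = tpuv4_tflops / doubled_cores_tflops - 1
print(f"implied clock speed increase: {clock_uplift:.1%}")   # ~11.8%
```

That 275 / 246 ratio lands almost exactly on the 11.8 percent clock bump cited above, which is why we are fairly confident in the core-doubling guess.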
There are some very interesting things in this table and in the statements Google makes in the blog post.
Aside from the 2X cores and the slight boost in clock speed brought about by the chip manufacturing process for the TPUv4, it’s interesting that Google kept the memory capacity at 32GB and didn’t migrate to the HBM3 memory that Nvidia is using with its “Hopper” GH100 GPU accelerators. Nvidia is obsessed with memory bandwidth on its devices and, by extension through its NVLink and NVSwitch interconnects, with memory bandwidth within nodes and now across nodes, with up to 256 devices in a single fabric.
Google isn’t as concerned (as far as we know) about memory atomics across the proprietary TPU interconnect, device memory bandwidth, or device memory capacity. The TPUv4 has the same 32GB capacity as the TPUv3, it uses the same HBM2 memory, and its speed has only increased by 33 percent to just under 1.2TB/sec. What interests Google is the bandwidth of the TPU pod interconnect, which switches to a 3D torus design that tightly couples 64 TPUv4 chips with “wraparound connections,” something that was not possible with the 2D torus interconnect used with TPUv3 pods. The added dimension of the torus interconnect allows more TPUs to be drawn into a tighter subnet for collective operations. (Which raises the question: why not a 4D, 5D, or 6D torus, then?)
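To see why the wraparound links matter, consider how neighbors are found on a torus. Here is a sketch assuming a hypothetical 4x4x4 block of 64 chips, matching the 64-chip groups described above; the actual wiring of the TPUv4 pod is not public:

```python
# Sketch of neighbor lookup on a 3D torus. Hypothetical 4x4x4 arrangement
# (4**3 = 64 chips); Google has not disclosed the real pod geometry.
DIM = 4

def torus_neighbors(x, y, z, dim=DIM):
    """Return the six torus neighbors of the chip at coordinates (x, y, z).

    Taking coordinates modulo the dimension is the "wraparound": a chip on
    one face of the cube links directly back to the opposite face.
    """
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return [((x + dx) % dim, (y + dy) % dim, (z + dz) % dim)
            for dx, dy, dz in steps]

# Even a corner chip has six neighbors thanks to the wraparound links:
print(torus_neighbors(0, 0, 0))
```

With wraparound, the worst-case hop count per axis is dim // 2 rather than the dim - 1 of a plain mesh, which is what tightens up latency for collective operations like all-reduce.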
The TPUv4 pod has four times as many TPU chips, at 4,096, and twice the TPU cores per chip, which we estimate puts it at 16,384 cores per pod; we believe Google has kept the number of MXU matrix math units at two per core, but that’s just a hunch. Google could keep the same number of TPU cores per chip and double up the MXU count and get the same raw performance; the difference would be the amount of front-end scalar/vector processing feeding those MXUs. In any event, in the 16-bit Brain Floating Point (BF16) format created by Google’s Brain unit, the TPUv4 pod delivers 1.1 exaflops, compared to just 126 petaflops of BF16 for the TPUv3 pod. That is a factor of 8.7X more raw compute, matched by a 3.3X increase in all-reduce bandwidth across the pod and a 3.75X increase in bisection bandwidth on the TPUv4 pod interconnect.
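Those pod-level numbers fall out of the per-chip figures. A quick sketch, assuming 1,024 chips in a TPUv3 pod (implied by the 4X claim) and the 123 teraflops and 275 teraflops per-chip peaks discussed earlier:

```python
# Aggregate pod throughput: chips per pod times per-chip peak BF16 rate.
tpuv3_pod_pflops = 1024 * 123 / 1000   # 125.952 PF -> the 126 PF figure
tpuv4_pod_pflops = 4096 * 275 / 1000   # 1,126.4 PF -> roughly 1.1 EF

print(f"TPUv3 pod: {tpuv3_pod_pflops:.0f} PF, TPUv4 pod: {tpuv4_pod_pflops:.0f} PF")
print(f"raw compute ratio: {tpuv4_pod_pflops / tpuv3_pod_pflops:.2f}X")
```

Note that the 8.7X factor comes from dividing the rounded 1.1 exaflops by 126 petaflops; on the unrounded figures the ratio works out closer to 8.9X.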
This line in the blog post intrigued us: “Each Cloud TPU v4 chip has about 2.2x more peak FLOPs than Cloud TPU v3, for about 1.4x more peak FLOPs per dollar.” If you do the math on that statement, the price of renting TPU capacity on Google Cloud has gone up about 60 percent with TPUv4, but it does 2.2X the work. That price and performance jump is consistent with the kind of price/performance improvement Google expects from the switch ASICs it buys for its data centers, which typically offer 2X the bandwidth for 1.3X to 1.5X the cost. The TPUv4 is a bit more expensive than that curve, but it has better networking for running larger models, and that costs money, too.
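The price inference is just a ratio of the two factors Google quotes:

```python
# If per-chip performance is 2.2X and performance per dollar is 1.4X,
# the implied price ratio is their quotient.
perf_ratio = 2.2
perf_per_dollar_ratio = 1.4

price_ratio = perf_ratio / perf_per_dollar_ratio   # about 1.57
print(f"implied rental price increase: {price_ratio - 1:.0%}")
```

Strictly, 2.2 / 1.4 is about 1.57, which is where the roughly 60 percent price increase comes from.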
TPUv4 pods can be carved up into virtual machines on Google Cloud ranging in size from four chips to “thousands of chips,” which we presume means up to an entire pod.