If you work in the data center world as a chip designer, a server manufacturer, or a data center operator, you know we’re in uncharted territory. The leading chip designers and hyperscalers are in the middle of a competition, an arms race that’s affecting all of us.
What’s the arms race? It’s the race to design the fastest, most capable chips on the planet for game-changing AI training and inference. The biggest technology companies in the world, from Meta and Amazon to Google and Oracle to Microsoft and NVIDIA, are investing tens of billions of dollars to build the most powerful and efficient processors the world has ever seen, all to power the explosive growth of generative AI.
In this two-part series, we’ll explore the leading edge of chip design for AI, its implications for power and cooling, and the new technologies we’ll see deployed, at scale, over the next few years. By the end of the series, you’ll be well-versed in the current developments and upcoming trends, and better equipped to make informed decisions about the future of your data center infrastructure.
Setting the Stage
AI is the buzziest catchword in technology, and since ChatGPT launched in November 2022, generative AI has become the buzziest version of AI. In just eighteen months, hundreds of new companies and thousands of new use cases have emerged, because generative AI offers the possibility, as yet unproven, of radically rethinking everything in the world of technology.
These possibilities, initially unveiled by OpenAI, forced everyone in the data center industry to sit up and take notice, sparking competition among chipmakers and hyperscalers alike. Key players such as OpenAI, Microsoft, Meta, Google, and NVIDIA feverishly rethought roadmaps, reallocated budgets, and began a series of strategic moves to position themselves for a radically changing market. As AI emerges as the new path toward billion-dollar revenue growth, investments are surging and, perhaps even more important, expectations are skyrocketing.
However, it’s important to understand that the chip manufacturers and hyperscalers aren’t changing their strategies simply because of demand for AI capabilities. They’re changing because generative AI infrastructure is intrinsically different from cloud infrastructure: it depends heavily on specialized silicon and high-performance interconnects that demand designs and mindsets unlike anything that came before. These differences are forcing these companies to move in new directions, to innovate, and to adopt new technologies to keep pace with the change.
Let’s understand the table stakes.
- Generative AI is projected by Bloomberg to evolve into a $1.3 trillion market by 2032.
- Amazon is betting $150B on data centers to support the AI boom.
- Google spent $11B on new infrastructure in Q4 2023, and will spend substantially more than that in 2024.
- Microsoft and OpenAI are planning a $100B AI supercomputer.
There’s money to be made, and we’re already seeing big moves by market leaders who want to capitalize on the buzz.
But all this spending is gated by access to the specialized chips needed to build AI infrastructure. And today, one chipmaker is arguably calling all the shots.
Today’s Leaders
We’re talking about NVIDIA. Once a conventional GPU manufacturer powering gaming PCs and graphics workstations, NVIDIA has transformed itself remarkably in just eighteen months. The performance of their Hopper architecture GPUs turned out to be the key to unlocking high-performance, massively scalable generative AI, and as a result, NVIDIA has become a $2 trillion company.
NVIDIA has become the essential ingredient for anyone who wants to deploy generative AI. Thanks to their chip designs, their mature and proven Compute Unified Device Architecture (CUDA) software platform that simplifies GPU programming, and the fabrication capacity they’ve secured, they quickly came to dominate the market, becoming the de facto standard chipmaker for AI computing and putting market leaders like Intel and AMD in the shadows.
Of course, NVIDIA wouldn’t be what it is without OpenAI, which transformed the industry by making generative AI a practical, usable platform. The relationship between the two organizations was the making of both. Their symbiosis continues to define the entire AI infrastructure space.
Every hyperscaler knows this, and in the short term they responded by aggressively procuring substantial quantities of NVIDIA’s H100s to power their AI ambitions. For several months in 2023, demand for H100s outstripped supply; NVIDIA’s lead time for its most in-demand processors spiked to as much as eleven months, though the backlog has subsided in recent weeks. NVIDIA, which currently holds nearly 80% of the AI chipset market, seeks not only to solidify its current position but also to fortify its hold on that market.
Tomorrow’s Disruptors
As you’d expect, NVIDIA’s semi-monopoly on GPUs, coupled with OpenAI’s leading position, has driven hyperscalers and other chip manufacturers to invest heavily in alternatives to NVIDIA, with efforts underway to develop custom AI chipsets in-house to power their own generative AI ambitions. NVIDIA and OpenAI’s success is a testament to the benefits of early investment in proprietary AI chip technology, and it also highlights the risk to everyone else if NVIDIA chooses winners and losers through GPU supply allocation, runs into execution problems, or simply raises its prices to intolerable levels.
These market dynamics have spurred an AI chip arms race among major hyperscalers such as Microsoft, Meta, Google, and Amazon, as well as the CPU manufacturers Intel and AMD.
Google: Can They Go It Alone?
Google has a history of building its own chips, beginning with the tensor processing unit (TPU) it launched in 2016. It’s applying that same chip expertise to train its latest AI model, Gemini. According to Forbes, Google is training Gemini on its custom processors to improve its competitive position against OpenAI’s chatbot, ChatGPT.
Microsoft: A Two-Pronged Approach
Microsoft is taking a different approach. On the one hand, they can leverage NVIDIA’s technology through their partnership with, and investment in, OpenAI. On the other hand, they just unveiled their first custom chips, the Azure Maia AI chip and the Azure Cobalt CPU. This move aims to reduce their dependence on NVIDIA and allows them to integrate their processors, servers, and data centers more tightly for better performance, lower energy use, and cost savings.
Amazon: Striking a Balance
Microsoft’s rival Amazon, meanwhile, has decided to strengthen its relationship with NVIDIA, adopting a twofold strategy of developing its own AI chips while also offering customers access to NVIDIA’s latest chipsets, CNBC reports. Rather than partnering with OpenAI, Amazon is building its AI capability in-house while also partnering with Anthropic, another provider of generative AI models.
Meta: Overcoming Struggles
Facing early hurdles in AI development, Meta (Facebook’s parent company) scrapped its planned 2022 launch of custom AI chips and instead prioritized revamping its data centers to accommodate the shift from CPUs to GPUs for AI training. To bridge the gap, Meta relied heavily on NVIDIA’s H100 GPUs to keep its AI projects on track. This year, however, DCD reports that Meta is finally ready to deploy its own AI chips within its data centers.
AMD & Intel: Pushing for Market Share
Both Intel and AMD sell GPUs and would like a larger piece of the overall GPU market, especially as demand for NVIDIA GPUs has outstripped supply. The hyperscalers do purchase Intel and AMD GPUs as alternatives to NVIDIA’s. Server OEMs like Lenovo report that demand for AMD’s latest Instinct MI300 family is higher than ever, while Intel will be trying to drive multibillion-dollar sales of its latest Gaudi 3 family of AI accelerators, which will launch sometime this year.
The NVIDIA Response
Of course, NVIDIA is adapting to others’ moves in an effort to remain THE supplier for everyone. Two weeks ago, NVIDIA extended its dominance by launching the Blackwell GPU architecture, which will, to quote the press release, “enable organizations everywhere to build and run real-time generative AI on trillion-parameter large language models at up to 25x less cost and energy consumption than its predecessor.”
Blackwell is a game-changer. It will be available later this year, and every hyperscaler and major server OEM has signed up to offer it. And unlike its competitors, NVIDIA has an incredibly robust ecosystem of software, reference architectures, and partnerships that will help organizations take full advantage of what Blackwell has to offer.
But that’s not all. In a case of co-opetition that’s probably unprecedented in the history of IT, NVIDIA is reportedly standing up a new business unit to design custom chips for others, chasing a market estimated at $30 billion. For the hyperscalers, at least in the short term, it might be a matter of: if you can’t beat them, join them!
The Hidden Challenge
As you can see, we’re in a phase of unprecedented innovation. Every major player is working feverishly to augment, expand, and accelerate their capacity for AI. Tens of billions of dollars are being spent every month to build AI infrastructure. All of this sounds fantastic, and it certainly is, but there’s a catch: a problem that can’t be solved with yesterday’s technologies and that stands squarely in the way of the hyperscalers’ ambitions. Unless they solve this challenge, they can’t proceed at scale.
In our next post, we’ll take a hard look at this problem, one for which there is no industry consensus, only a collection of competing technologies. We’ll cut through the confusion and offer some clarity on what’s needed to keep AI data centers growing in line with demand.