- In the second half of 2022, as ChatGPT became popular, the famous Silicon Valley venture capital firm a16z visited dozens of AI startups and large technology companies. They found that startups handed over 80%-90% of their early-stage financing to cloud computing platforms to train their models. They estimated that even when these companies' products mature, they would have to give 10%-20% of their annual revenue to cloud computing companies, equivalent to an "AI tax."
- This creates a large market for providing model capabilities and training services in the cloud, renting computing power to other customers and startups. In China alone, there are at least dozens of startups and small and medium-sized companies building complex large language models, which all need to rent GPUs from cloud computing platforms. According to a16z estimates, only when a company's annual AI computing spending exceeds $50 million can it have enough economies of scale to support bulk GPU purchases.
- According to LatePost's understanding, after this year's Spring Festival, Chinese internet companies with cloud computing businesses all placed large orders for Nvidia GPUs. ByteDance ordered over $1 billion worth of GPUs from Nvidia this year, while another large company's order is also worth at least RMB 1 billion.
- ByteDance's orders alone this year may be close to the total sales of commercial GPUs by Nvidia in China last year. In September last year, when the US government imposed export controls on the A100 and H100 (Nvidia's latest two-generation commercial data center GPUs), Nvidia responded that this could affect its potential sales of $400 million (about RMB 2.8 billion) in the Chinese market in the fourth quarter of last year. Based on this calculation, Nvidia's sales of data center GPUs in China in 2022 are estimated to be about RMB 10 billion.
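- A rough cross-check of that extrapolation is sketched below. The assumption that the fourth-quarter figure represents a typical quarter, and the roughly 7 RMB/USD exchange rate, are illustrative assumptions rather than disclosed figures:

```python
# Back-of-envelope extrapolation of Nvidia's 2022 China data center GPU sales.
# Assumptions (not official figures): the ~$400M potential Q4 impact is treated as a
# typical quarter, and the USD/CNY exchange rate is taken to be roughly 7.
quarterly_sales_usd = 400e6   # Nvidia's stated potential Q4 2022 impact in China
usd_to_cny = 7.0              # assumed exchange rate
annual_sales_cny = quarterly_sales_usd * 4 * usd_to_cny
print(f"Estimated 2022 China data center GPU sales: ~{annual_sales_cny / 1e9:.0f} billion RMB")
# Prints roughly 11 billion RMB, i.e. on the order of RMB 10 billion, matching the estimate above.
```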
- Compared to foreign giants, Chinese technology companies are more eager to purchase GPUs. In the past two years of cost reduction and efficiency improvement, some cloud computing platforms have reduced GPU purchases and have insufficient reserves. In addition, no one can guarantee that today's high-performance GPUs will not be subject to new restrictions tomorrow.
- From cutting orders to buying more, and reallocating internal resources
- Before the beginning of this year, the demand for GPUs by large Chinese technology companies was lukewarm.
- GPUs have two main uses at large Chinese internet technology companies: supporting internal businesses and cutting-edge AI research, and renting out computing power externally through their cloud platforms.
- A person from ByteDance told LatePost that after OpenAI released GPT-3 in June 2020, ByteDance trained a generative language model with billions of parameters, using earlier-generation GPUs such as the A100 and V100. Limited by its parameter scale, the model's generation ability was mediocre, ByteDance could not see a path to commercialization at the time, and the return on investment did not add up, so the attempt was abandoned.
- Alibaba also purchased GPUs actively between 2018 and 2019. A person from Alibaba Cloud said the purchases at the time amounted to at least tens of thousands of units, mainly Nvidia's V100 and the earlier T4. However, only about one-tenth of that batch went to the Damo Academy for AI research and development. In 2021, after releasing the trillion-parameter large model M6, Damo Academy disclosed that training M6 used 480 V100 GPUs.
- The GPUs Alibaba purchased at the time went mainly to Alibaba Cloud for external rental. However, Chinese cloud computing companies, including Alibaba Cloud, overestimated AI demand in the Chinese market. A technology investor said that before the large model craze, GPU computing power at the major domestic cloud providers was not in short supply; it was hard to sell, and cloud providers even had to cut prices to move it. Last year, Alibaba Cloud cut prices six times, and GPU rental prices fell by more than 20%.
- Against this backdrop of cost cutting and the pursuit of "quality growth" and profits, Alibaba is understood to have scaled back its GPU procurement since 2020, and Tencent also cut an order for a batch of Nvidia GPUs at the end of last year.
- However, at the beginning of 2023, ChatGPT changed everyone's perception, and a consensus was quickly reached: large models are an opportunity not to be missed.
- Company founders personally followed the progress of large models: Zhang Yiming, founder of ByteDance, began reading AI papers; Zhang Yong, chairman of Alibaba's board of directors, took charge of Alibaba Cloud, and said at the Alibaba Cloud Summit that "all industries, applications, software, and services are worth redoing based on large model capabilities."
- A person from ByteDance said that in the past, when applying for GPU purchases within the company, they had to explain the input-output ratio, business priorities, and importance. But now, large model businesses are strategically important new businesses that must be invested in, even if the ROI is not clear yet.
- Developing a general-purpose large model is just the first step; the bigger goal for each company is to launch their cloud services that provide large model capabilities, which is the real big market that matches their investment.
- For the past decade, Microsoft's Azure had little presence in China's cloud computing market, mainly serving the China operations of multinationals. Now customers are lining up for it, because it is the only cloud channel through which OpenAI's models are commercially available.
- At its Cloud Summit in April, Alibaba reiterated that MaaS (Model as a Service) is the future of cloud computing. In addition to opening testing of its self-developed general-purpose foundation model Tongyi Qianwen ("Universal Thousand Questions"), it released a series of tools to help customers train and use large models in the cloud. Soon after, Tencent and ByteDance's Volcano Engine released new training cluster services of their own. Tencent said that with its new generation of clusters, training a trillion-parameter large model can be compressed to four days; ByteDance said its new clusters support large model training at the ten-thousand-GPU scale, and that most of the dozens of Chinese companies working on large models are already using Volcano Engine.
- All of these platforms use Nvidia's A100 and H100 GPUs, or the downgraded A800 and H800 that Nvidia launched after last year's export ban. The interconnect bandwidth of the two downgraded chips is roughly two-thirds and half that of the originals, respectively, keeping them below the control thresholds for high-performance GPUs.
- A new round of ordering competition has begun around H800 and A800 GPUs for large Chinese technology companies.
- A person from a cloud service company said that companies like ByteDance and Alibaba mainly talk directly with Nvidia for procurement, and agents and second-hand markets cannot meet their huge demands.
- Nvidia sells at catalog prices and offers discounts based on purchase volume. According to Nvidia's official website, the A100 is priced at $10,000 per unit (about RMB 71,000) and the H100 at $36,000 per unit (about RMB 257,000). The A800 and H800 are understood to be priced slightly below the originals.
- Whether a Chinese company can secure GPUs depends more on its business relationship with Nvidia, for example whether it has been a major Nvidia customer in the past. "There is a difference between negotiating with Nvidia in China and flying to the United States to talk directly with Jensen Huang (Nvidia's founder and CEO)," said a person from a cloud service company.
- Some companies also pursue bundled deals with Nvidia, buying other Nvidia products alongside the sought-after data center GPUs in order to compete for priority supply. It works like Hermès distribution: to buy a popular bag, you often have to pair it with tens of thousands of dollars' worth of clothes and shoes.
- Based on the industry information we have gathered, ByteDance's new orders this year have been notably aggressive, exceeding the $1 billion level.
- A person close to Nvidia said that, counting both delivered and undelivered units, ByteDance has ordered a combined total of 100,000 A100 and H800 GPUs. The H800 only entered production in March this year, so those chips must come from this year's new purchases. It is understood that, on the current production schedule, some of the H800s will not be delivered until the end of this year.
- ByteDance began building its own data centers in 2017. They used to rely more heavily on CPUs, which handle all kinds of general-purpose computing; as late as 2020, ByteDance was still spending more on Intel CPUs than on Nvidia GPUs. The shift in its purchases reflects how, for large technology companies, demand for intelligent (AI) computing is overtaking demand for general-purpose computing.
- It is understood that another large internet company has placed an order with Nvidia this year for at least 10,000 GPUs, worth over RMB 1 billion at catalog prices.
- Tencent was the first to announce that it is using the H800. In March this year, Tencent Cloud released a new version of its high-performance computing service built on H800 GPUs, claiming it was the first in China to do so. The service is now open to enterprise customers for trial applications, ahead of most Chinese companies' progress.
- It is understood that in May, Alibaba Cloud also designated an "intelligent computing campaign" as its top priority for this year, setting three major goals: machine scale, customer scale, and revenue scale. Machine scale, an important indicator, is measured by the number of GPUs.
- While waiting for new GPUs to arrive, companies have also been reallocating GPUs internally, giving priority to large model research and development.
- One way to release more resources at once is to cut off some less important directions or projects that do not have a clear prospect in the short term. "Many large companies have a lot of half-dead businesses occupying resources," an AI practitioner from a large internet company said.
- In May this year, Alibaba's Damo Academy shut down its autonomous driving laboratory: of the more than 300 employees affected, about one-third were reassigned to the Cainiao technical team and the rest were laid off, and Damo Academy no longer retains an autonomous driving business. Autonomous driving research also requires high-performance GPUs for training, so while this adjustment may have no direct connection to large models, it did give Alibaba a batch of "free GPUs."
- ByteDance and Meituan, on the other hand, transferred GPUs directly from their business technology teams, which bring advertising revenue to the companies.
- LatePost learned that shortly after this year's Spring Festival, ByteDance allocated a batch of A100 GPUs originally planned for the ByteDance business technology team to Zhu Wenjia, the TikTok product technology leader. Zhu Wenjia is currently leading ByteDance's large model research and development. The business technology team is the core business department that supports Douyin's advertising recommendation algorithm.
- Meituan began developing a large model in the first quarter of this year. It is understood that Meituan recently pulled a batch of top-spec 80 GB A100 GPUs from multiple departments and prioritized them for the large model effort, leaving those departments to make do with lower-spec GPUs.
- Bilibili (B Station), whose financial resources are far thinner than the big platforms', also has plans for large models. It is understood that B Station has accumulated hundreds of GPUs, and this year it has been continuously buying more while coordinating across departments to free up GPUs for large models. "Some departments give 10, some give 20," said a person close to B Station.
- ByteDance, Meituan, B Station, and other internet companies originally had some excess GPU resources to support search and recommendation technologies. Without harming their existing businesses, they are all now "squeezing computing power".
- However, this method of shifting resources from one place to another can only obtain a limited number of GPUs, and the bulk of the GPUs required for training large models still relies on the accumulation of past purchases and waiting for new GPUs to be delivered.
- The global competition for Nvidia's data center GPUs
- The competition for Nvidia's data center GPUs is playing out globally as well. Overseas giants, however, started buying earlier, bought in larger volumes, and have invested more continuously in recent years.
- In 2022, Meta and Oracle already had significant investments in the A100 GPUs. Meta collaborated with Nvidia in January last year to build the RSC supercomputing cluster, which included 16,000 A100 GPUs. In November of the same year, Oracle announced the purchase of tens of thousands of A100 and H100 GPUs to build a new data center. The data center has now deployed over 32,700 A100 GPUs and is gradually launching new H100 GPUs.
- Microsoft has provided OpenAI with tens of thousands of GPUs since it first invested in the company in 2019. In March this year, Microsoft announced that it had helped OpenAI build a new data center containing tens of thousands of A100 GPUs. In May, Google launched Compute Engine A3, a computing cluster with 26,000 H100 GPUs, to serve companies that want to train large models themselves.
- Chinese companies are currently more eager and anxious than overseas giants. For example, Baidu's new GPU orders this year amount to as many as tens of thousands. This is on the same order of magnitude as companies like Google, even though Baidu's scale is much smaller, with last year's revenue being RMB 123.6 billion, only 6% of Google's.
- It is understood that ByteDance, Tencent, Alibaba, and Baidu, the four Chinese technology companies investing the most in AI and cloud computing, have accumulated tens of thousands of A100 GPUs in the past. Among them, ByteDance has the most A100 GPUs. Excluding this year's new orders, ByteDance has close to 100,000 A100 and the previous generation product V100 GPUs combined.
- Among growth-stage companies, SenseTime announced this year that a total of 27,000 GPUs, including 10,000 A100s, have been deployed in its "AI Giant Device" computing cluster. Even Guofang, a quantitative investment firm that would seem to have nothing to do with AI, previously bought 10,000 A100 GPUs.
- Judging by these numbers alone, the GPUs seem more than enough for these companies to train large models. According to Nvidia's official website, OpenAI used 10,000 V100 GPUs to train the 175-billion-parameter GPT-3 (the training duration was not disclosed). Nvidia estimates that training GPT-3 on the A100, which offers a 4.3x performance improvement over the V100, would take 1,024 A100 GPUs about one month. But most of the GPUs Chinese companies bought in the past were committed to supporting existing businesses or to being rented out on cloud computing platforms, and cannot be freely pulled out for large model development or to serve customers' large model needs.
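- These figures can be cross-checked with a common rule of thumb (training FLOPs ≈ 6 × parameters × tokens). The sketch below is a rough estimate, not a published benchmark; the token count, per-GPU throughput, and sustained utilization are assumed values chosen for illustration:

```python
# Rough sanity check of "1,024 A100s for about a month" for GPT-3-scale training.
# Assumptions for illustration (not from the article): ~300B training tokens,
# ~312 TFLOPS per A100 in mixed precision, and ~35% sustained utilization.
params = 175e9                     # GPT-3 parameter count
tokens = 300e9                     # assumed number of training tokens
total_flops = 6 * params * tokens  # common estimate: ~6 FLOPs per parameter per token

num_gpus = 1024
peak_flops_per_gpu = 312e12        # A100 FP16/BF16 tensor-core peak (dense)
utilization = 0.35                 # assumed sustained utilization

seconds = total_flops / (num_gpus * peak_flops_per_gpu * utilization)
print(f"Estimated training time: {seconds / 86400:.0f} days")  # about 33 days, i.e. roughly a month
```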
- This also explains why estimates of available computing power vary so widely among Chinese AI practitioners. In a speech at the Tsinghua Forum at the end of April, Zhang Yaqin, dean of Tsinghua's Institute for AI Industry Research, said, "If China's computing power were added up, it would be equivalent to 500,000 A100s; training five models would be no problem." But in an interview with Caixin, Megvii CEO Yin Qi said that the total number of A100s currently available in China for large model training is only about 40,000.
- The gap is most visible in capital expenditure, the spending on fixed assets such as chips, servers, and data centers, which directly illustrates the order-of-magnitude difference in computing resources between Chinese and foreign companies.
- Baidu, which was the earliest to test a ChatGPT-like product, has had annual capital expenditures of between $800 million and $2 billion since 2020; Alibaba's have ranged between $6 billion and $8 billion, and Tencent's between $7 billion and $11 billion. Over the same period, Amazon, Meta, Google, and Microsoft, the four US technology companies that build their own data centers, all had annual capital expenditures of at least $15 billion.
- In the past three years of the pandemic, foreign companies' capital spending continued to rise. Last year, Amazon's capital spending reached $58 billion, Meta and Google both reached $31.4 billion, and Microsoft came close to $24 billion. Chinese companies' investments began shrinking after 2021, and both Tencent and Baidu saw their capital expenditures fall by more than 25% year-on-year last year.
- The GPUs on hand are already insufficient just for training large models. If Chinese companies really want to invest in large models for the long term, and to earn money by providing large model services to other customers, they will need to keep adding GPU resources.
- OpenAI, which is furthest ahead, has already run into this problem. In mid-May, OpenAI CEO Sam Altman told a small group of developers that, because of the GPU shortage, OpenAI's API services were neither stable nor fast enough; GPT-4's multimodal capabilities cannot be rolled out to every user until more GPUs are available, and the company does not plan to release new consumer products in the near term. The technology consulting firm TrendForce said in a June report that OpenAI needs roughly 30,000 A100s to keep optimizing and commercializing ChatGPT.
- Microsoft, which works closely with OpenAI, faces a similar situation. In May, users complained that New Bing had become slower to respond; Microsoft explained that GPU replenishment could not keep up with user growth. Office 365 Copilot, which embeds large model capabilities, has not been widely rolled out: the latest figure is that just over 600 companies are trialing it, while the number of Office 365 users worldwide is close to 300 million.
- If Chinese companies are not aiming to simply train and release a single large model, but instead truly want to create products that serve more users with large models while also further supporting other clients in training even more large models on the cloud, they will need to build up more GPU resources in advance.
- Why will only these four chips do?
- For AI large model training right now, there is no substitute for the A100 and H100, or for the A800 and H800, the reduced-spec versions Nvidia supplies to China. According to calculations by the quantitative hedge fund Khaveen Investments, Nvidia held 88% of the data center GPU market in 2022, with AMD and Intel splitting the rest.
- The irreplaceability of Nvidia GPUs in large model training stems from how large models are trained, which centers on two steps: pre-training and fine-tuning. Pre-training builds the foundation, akin to a general education through college graduation; fine-tuning then optimizes the model for specific scenarios and tasks, improving its performance on the job.
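- To make the two steps concrete, here is a minimal toy sketch in PyTorch. The model size, random data, and hyperparameters are placeholders chosen for illustration; real large model training differs enormously in scale and method:

```python
# Toy sketch of the two-step recipe described above: pre-train, then fine-tune.
# Everything here (model, data, step counts, learning rates) is an illustrative
# placeholder, not a depiction of how any real large model is trained.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 32))

def train(model, steps, lr, data_fn):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        x, y = data_fn()
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Step 1: pre-training on broad, generic data (the "general education" phase);
# this is where the bulk of the compute is spent.
train(model, steps=200, lr=1e-2, data_fn=lambda: (torch.randn(16, 32), torch.randn(16, 32)))

# Step 2: fine-tuning on a narrow task, starting from the pre-trained weights,
# typically with far less data and a smaller learning rate.
task_x, task_y = torch.randn(64, 32), torch.randn(64, 32)
train(model, steps=50, lr=1e-3, data_fn=lambda: (task_x, task_y))
```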
- The pre-training phase, in particular, consumes a significant amount of computational power, placing extremely high demands on the performance of individual GPUs and the data transfer capabilities between multiple GPUs.
- At present, only the A100 and H100 can deliver the computational efficiency that pre-training requires. They look expensive, but they are actually the most cost-effective option. AI is still in the early stage of commercialization, and cost directly determines whether a service can be offered at all.
- Earlier models, such as VGG16 (which can recognize that a cat is a cat), had only about 130 million parameters, and some companies could run AI models of that size on consumer-grade RTX gaming graphics cards. By contrast, GPT-3, released three years ago, has 175 billion parameters.
- Under the enormous computing demands of large models, piecing together computing power from many low-performance GPUs no longer works. When many GPUs train a model together, data must be transferred and parameters synchronized between the chips, so some GPUs sit idle and cannot work at full capacity continuously. The lower the per-card performance, the more cards are needed and the greater the loss of computing power. When OpenAI trained GPT-3 on 10,000 V100s, its computing power utilization was below 50%.
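- A toy calculation makes the point, as shown below. The efficiency curve is invented purely for illustration, assuming each doubling of the card count costs a fixed slice of utilization to communication; it is not measured data:

```python
# Toy illustration of why many weak cards do not equal a few strong ones.
# The efficiency model is a made-up placeholder, not measured data: it assumes
# each doubling of the card count loses a fixed share of utilization to
# inter-GPU communication and synchronization.
import math

def effective_tflops(per_card_tflops: float, num_cards: int, loss_per_doubling: float = 0.07) -> float:
    """Aggregate throughput after subtracting an assumed synchronization overhead."""
    efficiency = max(0.1, 1.0 - loss_per_doubling * math.log2(num_cards))
    return per_card_tflops * num_cards * efficiency

# Two clusters with the same nominal total compute:
strong = effective_tflops(per_card_tflops=312, num_cards=1024)    # fewer, faster cards
weak = effective_tflops(per_card_tflops=312 / 4, num_cards=4096)  # 4x the cards at 1/4 the speed
print(f"strong cluster: {strong:,.0f} effective TFLOPS")
print(f"weak cluster:   {weak:,.0f} effective TFLOPS")
# Same nominal compute, but the larger cluster of slower cards loses far more to overhead.
```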
- A100 and H100 not only offer high single-card computing power but also feature high bandwidth to improve data-transfer between cards. A100's FP32 (calculations encoded and stored with 4 bytes) computing power reaches 19.5 TFLOPS (1 TFLOPS is equivalent to one trillion floating-point operations per second), while H100's FP32 computing power is even higher at 134 TFLOPS, about four times that of its competitor, AMD's MI250.
- The A100 and H100 also provide efficient data transfer, minimizing idle computing power. Nvidia's secret ingredients here are the NVLink and NVSwitch interconnect technologies it has introduced since 2014. The fourth-generation NVLink used on the H100 raises bidirectional communication bandwidth between GPUs in the same server to 900 GB/s (900 gigabytes of data per second), more than seven times that of the latest generation of PCIe (a high-speed serial, point-to-point transmission standard).
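- A rough illustration of why that bandwidth matters is sketched below. The payload size and the single-transfer model are simplifying assumptions for scale only; real collective-communication patterns are more complicated:

```python
# Rough illustration of how inter-GPU bandwidth affects gradient synchronization.
# Simplifying assumptions: gradients for a 175B-parameter model stored in FP16
# (2 bytes each), and one full copy of the gradients crossing the link per step.
# Real all-reduce traffic patterns are more complex; this is only for scale.
params = 175e9
bytes_per_param = 2                          # FP16 gradients
payload_gb = params * bytes_per_param / 1e9  # ~350 GB per synchronization

links = [("NVLink 4 (H100)", 900), ("A800 NVLink", 400), ("PCIe Gen5 x16", 128)]
for name, bandwidth_gb_s in links:
    print(f"{name:>16}: {payload_gb / bandwidth_gb_s:.1f} s per full gradient transfer")
# The slower the link, the longer GPUs sit idle waiting for data between compute steps.
```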
- Last year, the US Department of Commerce's export controls on GPUs set thresholds of 4,800 TOPS of computing power and 600 GB/s of interconnect bandwidth; chips at or above these lines are restricted.
- The A800 and H800 have computing power similar to the originals but lower bandwidth. The A800's bandwidth has been cut from the A100's 600 GB/s to 400 GB/s; the H800's specifications have not been fully disclosed, but according to Bloomberg its bandwidth is about half of the H100's 900 GB/s, so it takes 10%-30% longer than the H100 to complete the same AI task. An AI engineer speculated that the H800's training performance might even be worse than the A100's, yet it costs more.
- Even so, the A800 and H800 still outperform the comparable products of other large companies and startups. Constrained by performance and by more specialized architectures, the AI chips or GPUs those companies offer are mainly used today for AI inference and struggle with the pre-training of large models. Put simply, AI training makes the model and AI inference uses it, and training places the higher demands on chip performance.
- Besides the performance gap, NVIDIA's more profound moat is its software ecosystem.
- As early as 2006, Nvidia launched CUDA, a parallel computing platform and software engine that lets developers use GPUs far more efficiently for AI training and inference, squeezing out the GPU's computing power. Today CUDA has become part of the foundation of AI infrastructure, and the mainstream AI frameworks, libraries, and tools are built on top of it.
- For GPUs and AI chips from vendors other than Nvidia to plug into CUDA, the vendors must supply their own adaptation software, which captures only part of CUDA's performance and is slower to keep up with CUDA's updates. AI frameworks such as PyTorch are trying to break CUDA's software-ecosystem monopoly by providing more software capabilities that support other vendors' GPUs, but their appeal to developers is limited.
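- As a small illustration of how tightly the mainstream stack is coupled to CUDA, the sketch below shows standard PyTorch usage: code targets a CUDA device when one is present, and other accelerators have to reach feature parity through their own, separately maintained backends (PyTorch's ROCm build for AMD GPUs is one example):

```python
# Minimal PyTorch example: the mainstream framework path targets CUDA devices,
# and on an Nvidia GPU this matrix multiply dispatches to CUDA libraries such as cuBLAS.
# Non-Nvidia accelerators need their own backend (e.g., a ROCm build) to run the same code.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(4096, 4096, device=device)
w = torch.randn(4096, 4096, device=device)

y = x @ w  # runs on the GPU via CUDA kernels when a CUDA device is available
print(device, y.shape)
```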
- An AI practitioner said his company had been in contact with a non-Nvidia GPU maker whose chips and services were quoted below Nvidia's prices and who promised more timely support. But the company judged that the overall cost of training and development on those GPUs would end up higher than with Nvidia, that the results would be uncertain, and that it would take more time.
- "Although the A100 is expensive, it's actually the cheapest to use," he said. For large technology companies and top startups who want to seize the opportunity of large models, money is often not the problem; time is a more precious resource.
- In the short term, the only factor that might affect NVIDIA's data center GPU sales is TSMC's production capacity.
- The H100/H800 are made on a 4 nm process and the A100/A800 on a 7 nm process, and all four chips are manufactured by TSMC. According to Taiwanese media reports, Nvidia has placed orders with TSMC for an additional 10,000 data center GPUs this year, including super-urgent orders that can cut production time by as much as 50%. Under normal circumstances it takes TSMC several months to produce an A100. The current bottleneck is mainly insufficient advanced packaging capacity, a shortfall of 10%-20% that will take three to six months to gradually close.
- Ever since GPUs, well suited to parallel computing, were brought into deep learning, AI development has been driven by hardware and software advancing together: GPU computing power on one side, models and algorithms on the other. As models develop, demand for computing power grows; as computing power grows, training at scales that were once out of reach becomes possible.
- In the previous wave of deep learning enthusiasm, centered on image recognition, China's AI software capabilities were on par with the global cutting edge. Computing power is the harder problem this time: designing and manufacturing chips requires longer accumulation and involves long supply chains and numerous patent barriers.
- Large models mark another major leap in models and algorithms, and there is no time to move slowly. Companies that want to build large models, or to provide large model capabilities through the cloud, must quickly secure enough advanced computing power. The scramble for GPUs will not stop until this wave of enthusiasm has produced its first batch of elated, or disappointed, companies.