Baidu's Shen Dou: Computing platform upgraded to support 100,000-card clusters; Wenxin large model exceeds 700 million daily calls
嫦娥的情人矩
Posted on 2024-9-26 13:34:22
As the parameter counts of large models grow, demand for computing power is rising exponentially. At the 2024 Baidu Cloud Intelligence Conference on September 25, Shen Dou, executive vice president of Baidu Group and president of Baidu AI Cloud Group, said that the well-known scaling law for large models still holds. The law states that model performance improves as parameter count, computing power, and dataset size increase, and "soon, more 100,000-card computing clusters will appear".
Shen Dou observed that customer demand for model training has risen sharply over the past year. "The industrial adoption of large models is accelerating in 2024," he said. "Currently, on the Qianfan large model platform, the Wenxin large model is called more than 700 million times a day, and the platform has helped users fine-tune 30,000 large models and develop more than 700,000 enterprise-level applications."
Rising demand for large model training means that the computing clusters required keep getting larger, while expectations that model inference costs will keep falling are also growing. Shen Dou said that both trends place higher demands on the stability and effectiveness of GPU management. On September 25, Baidu upgraded its AI heterogeneous computing platform, Baige, to version 4.0, which can deploy and manage 100,000-card clusters.
Shen Dou said that GPU computing clusters have three characteristics: extreme scale, extreme density, and extreme interconnection. Building a 10,000-card cluster can cost billions of yuan in GPU procurement alone. He emphasized that building computing resources is not simply a matter of buying GPUs and connecting them, but requires substantial engineering. For example, he said, GPU chip models are more diverse and harder to manage; GPUs must perform large amounts of parallel computation; and data transfer volumes have grown while speed requirements have risen. The Baige computing platform therefore needs to support heterogeneous chips, high-speed interconnection, and efficient storage.
Shen Dou also said that managing a 100,000-card cluster is fundamentally different from managing a 10,000-card cluster. First, at the physical level, a 100,000-card cluster would occupy approximately 100,000 square meters, equivalent to the area of 14 standard football fields. Second, in terms of energy, its servers would consume approximately 3 million kilowatt-hours of electricity per day, equivalent to the daily residential electricity consumption of Beijing's Dongcheng District. These space and energy demands far exceed what traditional data center deployment methods can accommodate, and deploying across data centers in different regions would in turn create major challenges at the network level. In addition, GPU failures in a 100,000-card cluster will be very frequent, which makes it harder to keep the proportion of effective training time high.
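The physical-scale figures quoted above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the 105 m x 68 m pitch size is an assumption (the article does not say which "standard" football field it means), and the per-card figure simply divides the quoted total by the card count, so it includes whatever non-GPU load is counted in that total.

```python
# Back-of-envelope check of the 100,000-card cluster figures.
# The three constants below are taken from the article.
CLUSTER_AREA_M2 = 100_000      # floor space of the cluster
DAILY_ENERGY_KWH = 3_000_000   # server electricity use per day
NUM_CARDS = 100_000            # number of GPU cards

# Assumption (not from the article): a standard pitch is 105 m x 68 m.
PITCH_AREA_M2 = 105 * 68


def football_fields(area_m2: float) -> float:
    """Express an area as a number of football pitches."""
    return area_m2 / PITCH_AREA_M2


def average_power_kw(daily_energy_kwh: float) -> float:
    """Average continuous power draw implied by a daily energy total."""
    return daily_energy_kwh / 24


def power_per_card_kw(daily_energy_kwh: float, num_cards: int) -> float:
    """Average power per card, including all load in the quoted total."""
    return average_power_kw(daily_energy_kwh) / num_cards


if __name__ == "__main__":
    print(f"{football_fields(CLUSTER_AREA_M2):.1f} football fields")   # 14.0
    print(f"{average_power_kw(DAILY_ENERGY_KWH) / 1000:.0f} MW average draw")  # 125
    print(f"{power_per_card_kw(DAILY_ENERGY_KWH, NUM_CARDS):.2f} kW per card")  # 1.25
```

Under these assumptions, the quoted area matches the 14-field comparison, and the quoted energy use corresponds to a continuous draw of about 125 MW, or roughly 1.25 kW per card.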
Shen Dou said that in response to these challenges, Baige 4.0 provides a large-scale, congestion-free HPN high-performance network at the 100,000-card level, ultra-high-precision network monitoring at the 10 ms level, and minute-level fault recovery for 100,000-card clusters. "Baige 4.0 is designed for deploying 100,000-card clusters. Today's Baige 4.0 already has mature capabilities for deploying and managing 100,000-card clusters, aiming to overcome these new challenges and provide a continuously leading computing platform for the entire industry," said Shen Dou.
Baidu is not alone: a growing number of tech giants are responding to demand for large AI models by strengthening their computing infrastructure. In early September, Musk announced that Colossus, an AI training supercluster built by his AI startup xAI, had officially come online with 100,000 Nvidia H100 GPUs, and that the number of GPUs would double in the coming months. On September 19, 2024, at the Yunqi Conference, Alibaba Cloud also stated that GPU-based AI computing will be the dominant computing paradigm in the future. Alibaba Cloud is upgrading its AI infrastructure across the board, from chips, servers, networks, and storage to cooling, power supply, and data centers.
Disclaimer: The views expressed in this article are those of the author only, this article does not represent the position of CandyLake.com, and does not constitute advice, please treat with caution.