ANNOUNCEMENT: Huawei Cloud at KubeCon EU 2024: Unleashing the smart era with continued open-source innovation (1)

(Information sent by the signatory company).

PARIS, March 25, 2024 /PRNewswire/ -- At KubeCon + CloudNativeCon Europe 2024, held in Paris on March 21, Dennis Gu, Chief Architect of Huawei Cloud, said in a keynote speech titled "Cloud Native x AI: Unleashing the intelligent era with continuous open source innovation" that integrating cloud-native and AI technologies is crucial to driving industry transformation. Huawei Cloud plans to keep innovating in open source projects and collaborating with developers to bring about the intelligent era.

AI poses key challenges to the cloud-native paradigm.

In recent years, cloud-native technologies have revolutionized traditional IT systems and accelerated digital advances in areas such as the Internet and government services. Through microservices governance, cloud native has introduced new possibilities, such as lightning-fast sales and agile operations like DevOps. These changes have had a significant impact on people's lives. Meanwhile, the rapid growth and widespread adoption of AI, including large-scale models, have become the core of industrial intelligence.

According to an Epoch survey conducted in 2023, the compute required for foundation models has grown 10-fold every 18 months, which is five times faster than the growth rate predicted by Moore's Law for general-purpose compute. The emergence of this "New Moore's Law" driven by AI, together with the prevalence of large-scale AI models, presents challenges for cloud-native technologies. In his speech, Dennis Gu highlighted the following key points:

Huawei Cloud AI innovation offers developers ideas to address challenges.

The growing size of AI models demands more compute, which challenges cloud-native technologies but also creates opportunities for innovation in the industry. Dennis Gu shared stories about Huawei Cloud's AI innovation, giving developers a reference point for navigating these challenges.

Huawei Cloud used KubeEdge, a cloud-native edge computing platform, to create a multi-robot scheduling and management platform. Users give the platform natural language commands, and the system coordinates multiple robots at the edge to perform complex tasks. The system is designed with a three-part architecture (cloud, edge node, and robot) to address challenges such as natural language understanding, efficient scheduling and management of multiple robots, and access management across robot types. It uses large models to execute natural language commands and to perform traffic prediction, task allocation, and route planning. The three-part architecture greatly improves the flexibility of the robotic platform, improves management efficiency by 25%, reduces the time required for system deployment by 30%, and reduces the time required to deploy new robots from months to days.

A leading content sharing platform in China, with more than 100 million monthly active users, offers home-page recommendations as its main service. This feature is powered by a model with almost 100 billion parameters. To train this model, the platform uses a training cluster with thousands of computing nodes, including hundreds of parameter servers (ps) and workers for a single training task. This creates great demand for better topology-aware scheduling, high performance, and high cost-effectiveness. Volcano, an open source project, improves support for AI and machine learning workloads on Kubernetes and offers a variety of advanced scheduling and job management policies. Volcano incorporates algorithms such as topology-aware scheduling, bin packing, and service level agreement (SLA)-aware scheduling, resulting in a 20% improvement in overall training performance and a significant reduction in operational complexity and platform maintenance.
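
The packing idea mentioned above (commonly called bin packing) can be illustrated with a minimal Python sketch. It uses the textbook first-fit-decreasing heuristic, not Volcano's actual implementation; the function name, the GPU request sizes, and the per-node capacity are all hypothetical and chosen only for illustration.

```python
def first_fit_decreasing(requests, node_capacity):
    """Pack GPU requests onto as few nodes as the heuristic can manage.

    Toy illustration of bin packing; NOT Volcano's scheduler logic.
    `requests` is a list of per-task GPU counts (hypothetical values),
    `node_capacity` is the number of GPUs on each identical node.
    """
    nodes = []  # tasks placed on each node
    free = []   # remaining GPUs on each node
    for req in sorted(requests, reverse=True):  # largest requests first
        if req > node_capacity:
            raise ValueError(f"request {req} exceeds node capacity")
        for i, remaining in enumerate(free):
            if req <= remaining:        # first node with enough room
                nodes[i].append(req)
                free[i] -= req
                break
        else:                           # no existing node fits: open a new one
            nodes.append([req])
            free.append(node_capacity - req)
    return nodes


placement = first_fit_decreasing([4, 2, 2, 1, 1, 6], node_capacity=8)
print(placement)  # [[6, 2], [4, 2, 1, 1]]
```

Packing tasks tightly leaves whole nodes free for large jobs, which is one reason packing policies tend to help utilization in GPU clusters.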

Serverless AI is at the forefront of cloud-native development.

Many businesses and developers face the challenge of running AI applications efficiently and reliably while minimizing operational costs. Huawei Cloud has developed a solution to this problem by identifying the key requirements of cloud-native AI platforms and introducing a new concept called Serverless AI.

During his keynote, Dennis Gu explained that Serverless AI is designed to simplify complex training and inference tasks by intelligently recommending parallel policies, making it easier for developers to use. It also includes an adaptive GPU/NPU auto-scaling feature that dynamically adjusts resource allocation in real time as the workload changes, ensuring efficient task execution. Additionally, Serverless AI provides a fail-safe GPU/NPU cluster, so developers need not worry that hardware failures will disrupt services. Most importantly, Serverless AI is compatible with leading AI frameworks, allowing developers to easily integrate their existing AI tools and models.
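
To make the auto-scaling idea concrete, here is a minimal Python sketch of a proportional scaling rule. It follows the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current x observed / target), and is not a description of how Serverless AI works internally. The function name, the utilization figures, and the replica bounds are assumptions made for illustration.

```python
def desired_replicas(current, observed_util_pct, target_util_pct,
                     min_replicas=1, max_replicas=64):
    """HPA-style proportional rule (assumed here, not Huawei Cloud's algorithm).

    Utilization values are integer percentages, so the ceiling can be
    computed with exact integer arithmetic.
    """
    if current <= 0 or target_util_pct <= 0:
        raise ValueError("current and target_util_pct must be positive")
    # Ceiling division: ceil(current * observed / target)
    raw = -(-(current * observed_util_pct) // target_util_pct)
    # Clamp to the allowed range (hypothetical bounds)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, 90, 60))  # 6: accelerators are overloaded, scale out
print(desired_replicas(6, 30, 60))  # 3: accelerators are underused, scale in
```

A production autoscaler would typically add a tolerance band and a stabilization window to avoid oscillation; both are omitted here for brevity.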

Serverless AI is also a significant advancement for cloud service providers. It offers multiple benefits, such as improved GPU/NPU utilization, more efficient hybrid workloads for training, inference, and development, and green computing through better energy efficiency, which lowers electricity costs. Additionally, Serverless AI enables GPU/NPU sharing among multiple tenants, in space or in time, which improves the resource reuse rate. The most important aspect of Serverless AI is its ability to provide guaranteed Quality of Service (QoS) and SLAs for both training and inference tasks, ensuring a stable and high-quality service.

Serverless AI uses a flexible resource scheduling layer that is based on a virtualized operating system. This layer encapsulates essential functions of application frameworks in the application resource mediation layer. Dennis Gu presented the reference architecture for Serverless AI. He believes this architecture allows Serverless AI to manage large-scale AI resources automatically. This includes accurately analyzing resource usage patterns, sharing resources from heterogeneous hardware pools, and ensuring fault tolerance during AI training tasks through GPU/NPU virtualization and live migration of workloads. Additionally, multidimensional scheduling and adaptive elastic scaling improve resource utilization.

In the sub-forum, Huawei Cloud technical experts pointed out that AI and machine learning workloads running on Kubernetes have been steadily increasing. As a result, numerous companies are building cloud-native AI platforms on multiple Kubernetes clusters that span data centers and a wide range of GPU types. Karmada and Volcano can intelligently schedule GPU workloads across multiple clusters, supporting failover and ensuring consistency and efficiency within and across clusters. They can also balance system-wide resource utilization against the QoS of workloads with different priorities, addressing the challenges of managing large-scale, heterogeneous GPU environments.

Karmada delivers immediate and reliable automated application management in hybrid cloud and multi-cloud scenarios. An increasing number of users rely on Karmada to build adaptable and efficient solutions in production environments. Karmada was officially promoted to a CNCF incubating project in 2023, and the community expects more partners and developers to join.

Volcano Gang Scheduling is a solution for big data and distributed AI training scenarios and addresses the problems of endless waiting and deadlock in distributed training tasks. With task topology and I/O-aware scheduling, the transmission delay of distributed training tasks is minimized, improving training performance by 31%. In addition, minResources resolves resource contention between the Spark driver and executor in high-concurrency scenarios, optimizes the degree of parallelism, and improves performance by 39.9%.
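
The endless-waiting problem that gang scheduling addresses can be shown with a minimal Python sketch. This is a toy model of the all-or-nothing idea, not Volcano's code: the `Job` class, its `min_workers` field, and the slot counts are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    min_workers: int  # workers that must all run before training can start


def greedy_schedule(jobs, free_slots):
    """Naive placement: hand out slots one worker at a time, round-robin."""
    placed = {job.name: 0 for job in jobs}
    progress = True
    while free_slots > 0 and progress:
        progress = False
        for job in jobs:
            if free_slots > 0 and placed[job.name] < job.min_workers:
                placed[job.name] += 1
                free_slots -= 1
                progress = True
    return placed


def gang_schedule(jobs, free_slots):
    """All-or-nothing placement: admit a job only if every worker fits."""
    placed = {}
    for job in jobs:
        if job.min_workers <= free_slots:
            placed[job.name] = job.min_workers
            free_slots -= job.min_workers
        else:
            placed[job.name] = 0  # job waits without holding any slots
    return placed


def runnable(jobs, placed):
    """Jobs whose full set of workers has been placed."""
    return [job.name for job in jobs if placed[job.name] >= job.min_workers]


jobs = [Job("a", 4), Job("b", 4)]
print(runnable(jobs, greedy_schedule(jobs, 6)))  # []: each job holds 3 of 4 slots
print(runnable(jobs, gang_schedule(jobs, 6)))    # ['a']: job b waits, holding nothing
```

With six slots and two four-worker jobs, the naive scheduler leaves both jobs partially placed and neither can start, while the all-or-nothing rule lets one job run and keeps the remaining slots free.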

Dennis Gu believes that the key to improving AI productivity lies in the agility of cloud-native technologies and the innovation of heterogeneous AI computing platforms. Huawei Cloud is dedicated to open source innovation and aims to work with industry peers to usher in a new era of intelligence.
