Cisco is building out data center network infrastructure to handle AI/ML workloads, an effort that includes its acquisition of Splunk and its partnership with Nvidia.
Not so long ago, ideas for redesigning data center networking operations to handle AI workloads would have lived exclusively on a whiteboard. Over the past 12 months, however, things have changed radically.
Kevin Wollenweber, senior vice president and general manager of Cisco’s networking, data center, and provider connectivity organization, said: “AI and ML were on the radar, but the past 18 months or so have seen significant investment and development – especially around generative AI. What we expect in 2024 is more enterprise data-center organizations will use new tools and technologies to drive an AI infrastructure that will let them get more data, faster, and with better insights from the data sources.” Businesses, he added, will also be better positioned to “better handle the workloads that entails.”
Recent activity from Cisco attests to the expansion of AI in the industry.
Cisco’s $28 billion acquisition of Splunk, which closed last week, is expected to bolster AI across Cisco’s security and observability portfolios, among other areas. Cisco and Nvidia have also signed a deal that will yield networking gear and integrated software intended to make it easier for customers to spin up infrastructure for AI applications.
To support AI and data-intensive applications in the data center and at the edge, the companies announced that Nvidia’s latest Tensor Core GPUs will be available in Cisco’s M7 Unified Computing System (UCS) rack and blade servers, including the UCS X-Series and UCS X-Series Direct. The integrated package will include Nvidia AI Enterprise software, which offers pretrained models and development tools for production-ready AI.
“The Nvidia alliance is actually an engineering partnership, and we are building solutions together with Nvidia to make it easier for our customers – enterprises and service providers – to consume AI technology,” Wollenweber said. The joint offerings, he added, will boost AI productivity and include toolsets for building, managing, and debugging the fabrics to ensure optimal performance. “Driving this technology into the enterprise is where this partnership will grow in the future.”
AI Speeds Up Network Investments
Industry watchers note that AI deployments will demand greater network capacity.
Research firm IDC reports that the data center segment of the Ethernet switching market grew 13.6% in 2023, driven by enterprises’ and service providers’ need for faster Ethernet switches to keep up with rapidly evolving AI workloads. Revenues for 200/400 GbE switches rose 68.9% for the full year 2023, IDC analyst Brandon Butler reported in a Network World article.
According to Butler, “the impact of AI dominated the Ethernet switching market in 2023, with the overall market rising 20.1% in 2023 to reach $44.2 billion.”
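As a rough sanity check on those figures, the 2022 baseline implied by Butler’s numbers can be back-computed from the reported 2023 total. The sketch below uses only the percentage and dollar figure quoted above; the 2022 value is derived, not reported:

```python
# Back-of-the-envelope check on the reported Ethernet switching figures.
# Inputs are the article's numbers; the 2022 baseline is derived, not reported.
revenue_2023 = 44.2    # overall Ethernet switching market in 2023, $B (reported)
growth_2023 = 0.201    # 20.1% year-over-year growth (reported)

revenue_2022 = revenue_2023 / (1 + growth_2023)
print(f"Implied 2022 market size: ${revenue_2022:.1f}B")  # ~$36.8B
```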
AI networks will hasten the shift to higher speeds, according to a recent report from the Dell’Oro Group. “For example, 800 Gbps is expected to comprise the majority of the ports in AI back-end networks by 2025, within just two years of the latest 800 Gbps product introduction,” the firm wrote.
“While most of the market demand will come from Tier 1 Cloud Service Providers, Tier 2/3 and large enterprises are forecast to be significant, approaching $10 B over the next five years. The latter group will favor Ethernet,” said Dell’Oro analyst Sameh Boujelbene.
Ethernet attracts heavy investment and advances rapidly, according to Wollenweber. “We’re building 1.6 terabit Ethernet now, and it’s also the predominant networking technology for the rest of the data center. We’ve gone from 100G to 400G to 800G,” he said.
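To put those generational jumps in perspective, the sketch below estimates how long moving a fixed chunk of training data takes at each speed Wollenweber cites. The 100 GB payload and the ideal, overhead-free line rates are assumptions for illustration only:

```python
# Illustrative only: time to move a fixed payload at each Ethernet generation,
# assuming ideal line rate with no protocol overhead or congestion.
PAYLOAD_GB = 100  # assumed size of a checkpoint or gradient exchange

link_speeds_gbps = {"100G": 100, "400G": 400, "800G": 800, "1.6T": 1600}

for name, gbps in link_speeds_gbps.items():
    seconds = (PAYLOAD_GB * 8) / gbps  # gigabytes -> gigabits, then divide by rate
    print(f"{name}: {seconds:.1f} s to move {PAYLOAD_GB} GB")
```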
This week, the 650 Group reported that networking speeds will continue to climb rapidly to keep pace with AI and machine learning (ML) workloads. Demonstrations of 1.6 terabit Ethernet (1.6 TbE) in early 2024 show that Ethernet is keeping up with the demands of AI/ML networking, and the firm projects that 1.6 TbE solutions will dominate port speeds by 2030.
Ethernet and AI Combined
Most enterprise data center networks today are built on Ethernet. It therefore makes sense, according to Wollenweber, for businesses to stick with Ethernet as they add GPU-based systems for AI workloads: IT and engineering staff already know the technology, can readily integrate the AI compute nodes, and can get consistent performance from it.
In a blog post on AI networking, Wollenweber wrote: “An AI/ML workload or job – such as for different types of learning that use large data sets – may need to be distributed across many GPUs as part of an AI/ML cluster to balance the load through parallel processing.”
“To deliver high-quality results quickly – particularly for training models – all AI/ML clusters need to be connected by a high-performance network that supports non-blocking, low-latency, lossless fabric,” Wollenweber wrote. “While less compute-intensive, running AI inferencing in edge data centers will also involve requirements on network performance, scale and latency control to help quickly deliver real-time insights to a large number of end-users.”
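As a generic illustration of the distribution pattern Wollenweber describes (not Cisco-specific code), the minimal PyTorch DistributedDataParallel sketch below spreads a training job across GPUs; the gradient all-reduce it triggers on every backward pass is exactly the east-west traffic a non-blocking, lossless fabric must carry. The model, batch size, and hyperparameters are placeholders:

```python
# Minimal data-parallel training sketch using PyTorch DDP. The NCCL backend
# runs its gradient all-reduce collectives over the cluster network fabric.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])   # set by the torchrun launcher
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda()         # placeholder model
    model = DDP(model, device_ids=[local_rank])  # wraps gradient synchronization
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):                          # placeholder training loop
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()                          # gradient all-reduce over the network
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, for example, torchrun --nproc_per_node=8 train.py on each node, every backward pass generates the synchronized gradient traffic the fabric has to absorb without loss.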
To increase throughput and reduce latency for compute and storage traffic, Wollenweber pointed to the remote direct memory access (RDMA) over Converged Ethernet (RoCE) network protocol. RoCEv2 provides direct memory access to a remote host without involving the CPU.
According to Wollenweber, “Ethernet fabrics with RoCEv2 protocol support are designed with advanced congestion management to help intelligently control latency and loss, and are optimized for AI/ML clusters with widely adopted standards-based technology, easier migration for Ethernet-based data centers, and proven scalability at lower cost-per-bit.”
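To make the congestion-management idea concrete, the toy simulation below models the basic ECN mechanism that RoCEv2 fabrics rely on: the switch marks packets once its queue crosses a threshold, and the sender cuts its rate in response (a DCQCN-style reaction). Every number and the reaction rule here are invented for illustration and do not reflect Cisco defaults:

```python
# Toy model of ECN-based congestion signaling in a RoCEv2-style fabric.
# All thresholds, rates, and reaction factors are invented for illustration.
ECN_THRESHOLD = 50   # queue depth (packets) at which the switch starts marking
DRAIN_RATE = 8       # packets the switch can forward per tick

send_rate = 12.0     # sender's offered load, packets per tick
queue = 0

for tick in range(30):
    queue = max(0, queue + int(send_rate) - DRAIN_RATE)
    marked = queue > ECN_THRESHOLD          # switch marks under congestion
    if marked:
        send_rate *= 0.7                    # multiplicative decrease on marks
    else:
        send_rate += 0.2                    # gentle additive recovery
    print(f"t={tick:2d} queue={queue:3d} rate={send_rate:5.2f} marked={marked}")
```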
Cisco’s AI Infrastructure
Customers will need better operational tools to schedule AI/ML workloads across GPUs more effectively. Among Cisco’s offerings on that front is its Nexus Dashboard.
“How do we actually make it simpler and easier for customers to tune these Ethernet networks and connect this massive amount of compute as efficiently as possible? That’s what we are looking at,” Wollenweber said.
Cisco has shaped its AI data center direction with a series of recent announcements that build on past work. Last summer, for instance, Cisco released a blueprint outlining how businesses can support AI workloads on their existing data center Ethernet networks.
Central to that plan, Cisco’s Data Center Networking Blueprint for AI/ML Applications, are its Nexus 9000 data center switches, which “have the hardware and software capabilities available today to provide the right latency, congestion management mechanisms, and telemetry to meet the requirements of AI/ML applications,” the blueprint states. “Coupled with tools such as Cisco Nexus Dashboard Insights for visibility and Nexus Dashboard Fabric Controller for automation, Cisco Nexus 9000 switches become ideal platforms to build a high-performance AI/ML network fabric.”
High-end programmable Silicon One processors, targeted at large-scale AI/ML infrastructures for companies and hyperscalers, are another component of Cisco’s AI network architecture.