Cisco AI Infrastructure Power Workshop – 3 days (AI-I-PW3)
Overview
Artificial intelligence (AI) is a major focus across every sector of industry and government. It is a rapidly evolving space, and its advanced capabilities deliver greater insight, knowledge, and operational efficiency in many areas of operation.
Many businesses have identified AI as a strategic objective, but few have advanced to implementation and use. The aim of this workshop is to enable attendees to design, size, deploy, and configure a Cisco-centric AI infrastructure solution. Attendees are assumed to be familiar with the fundamentals of AI technology and concepts.
Objectives
- Provide an overview of how to design and size a Cisco AI POD solution
- Provide an overview of extended GPU operations and AI inferencing
- Describe the technologies that form the AI landscape and how Cisco infrastructure and its ecosystem interact with them
- Present a methodology framework for migrating an AI cloud solution to on-premises infrastructure
Target audience:
- IT Architects and Designers
- Presales SEs
- Network Engineers
- Server Administrators
- AI Integrators
Prerequisite skills:
- Network administration skills
- Understanding of programming concepts
- Conceptual understanding of VMs and containers
- Basic knowledge of the Cisco UCS server environment
- Basic Linux skills
- Fundamentals of AI
Duration:
3 days
Module 1 – AI System Overview
- Overview of AI solutions
- Cisco AI POD overview
- Components – hardware and software
Module 2 – Cisco AI POD Networking
- NDFC and NX-OS configuration
- QoS and CLI output (see the sketch after this list)
- Infra networks
  - Front end
  - Back end
- Host networking
- SuperNIC configuration
- DOCA
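Module 2 is heavy on switch configuration, so a small automation sketch helps set expectations. The following is a minimal example of pulling QoS and PFC state from an NX-OS leaf with Netmiko; the hostname, credentials, and interface are placeholders, and the actual policy values in a deployment come from the Cisco AI POD validated design, not from this sketch.

```python
# Minimal sketch: collect QoS/PFC CLI output from an NX-OS leaf with Netmiko.
# Host, credentials, and interface below are placeholders.
from netmiko import ConnectHandler

nxos_leaf = {
    "device_type": "cisco_nxos",
    "host": "leaf-101.example.com",  # placeholder
    "username": "admin",             # placeholder
    "password": "secret",            # placeholder
}

with ConnectHandler(**nxos_leaf) as conn:
    # Confirm priority flow control is active on the GPU-facing back-end port.
    print(conn.send_command("show interface ethernet 1/1 priority-flow-control"))
    # Check per-queue counters: RoCE traffic should land in the no-drop queue.
    print(conn.send_command("show queuing interface ethernet 1/1"))
```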
Module 3 – AI Storage
- Server disks
  - Boot
- OpenShift Data …
  - Ceph
- What is on the disks
  - Images
  - Container volumes
  - Training data
- Shared storage
  - Object storage
  - Repos – local in OpenShift
- Kubernetes (K8s) storage
  - Persistent volumes (see the sketch after this list)
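As a taste of the Kubernetes storage material, here is a minimal sketch that requests a shared persistent volume for training data through the Kubernetes Python client. The namespace and the storage class name are assumptions; on an OpenShift Data Foundation cluster the class names depend on how storage was installed.

```python
# Minimal sketch: create a shared PVC for training data via the Kubernetes API.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in a pod
core = client.CoreV1Api()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="training-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],                  # shared across pods
        storage_class_name="ocs-storagecluster-cephfs",  # assumed ODF class name
        resources=client.V1ResourceRequirements(requests={"storage": "500Gi"}),
    ),
)
core.create_namespaced_persistent_volume_claim(
    namespace="ai-training",  # assumed namespace
    body=pvc,
)
```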
Module 4 – Extended GPU Operations
- GPU clustering
- Back-end network connectivity
  - RoCEv2
  - RDMA
  - IP protocol headers (see the packet sketch after this list)
- Supporting software
- NVLink integration
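The IP protocol headers item is easiest to grasp by building a RoCEv2 frame by hand: RoCEv2 is an InfiniBand Base Transport Header (BTH) carried in UDP with destination port 4791. The sketch below uses Scapy plus manual byte packing; the addresses, DSCP value, and BTH field values are illustrative assumptions, not values from a real fabric.

```python
# Sketch of RoCEv2 encapsulation: Ethernet / IPv4 / UDP(dport=4791) / BTH.
import struct
from scapy.all import Ether, IP, UDP, Raw

# 12-byte BTH: opcode, SE/M/Pad/TVer flags, partition key,
# reserved byte + 24-bit destination QP, AckReq byte + 24-bit PSN.
opcode, flags, pkey = 0x04, 0x00, 0xFFFF   # 0x04 = RC Send Only
dest_qp, psn = 0x000017, 0x000001          # illustrative values
bth = struct.pack("!BBHII", opcode, flags, pkey,
                  dest_qp & 0x00FFFFFF,    # top byte is reserved
                  psn & 0x00FFFFFF)        # top byte carries the AckReq flag

pkt = (
    Ether()
    / IP(src="10.1.1.1", dst="10.1.1.2", tos=26 << 2)  # DSCP 26 is illustrative
    / UDP(sport=49152, dport=4791)                     # 4791 = RoCEv2
    / Raw(load=bth)
)
pkt.show()
```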
Module 5 – Advanced AI Inferencing
- Inference server operations
- Inference frameworks
- vLLM operations (see the sketch after this list)
- NVIDIA NIM
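For the vLLM material, this is roughly what offline batch inference looks like with vLLM's Python API. The model name and tensor_parallel_size are assumptions; match them to the model you are licensed to run and to the GPUs visible on the node.

```python
# Minimal vLLM offline-inference sketch.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tensor_parallel_size=2,                    # assumed: shard across 2 GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=128)

for out in llm.generate(["What is AI inferencing?"], params):
    print(out.outputs[0].text)
```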
Module 6 – How to Manage OpenShift
- Operators for NVIDIA and distributed GPUs (see the sketch after this list)
- Network configuration for the back-end
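Once the NVIDIA GPU Operator is healthy, every GPU worker advertises an nvidia.com/gpu allocatable count to the scheduler. Below is a quick sketch of verifying this with the Kubernetes Python client; node names and counts are cluster-specific.

```python
# Minimal sketch: confirm the GPU Operator has advertised GPUs to the scheduler.
from kubernetes import client, config

config.load_kube_config()
for node in client.CoreV1Api().list_node().items:
    gpus = (node.status.allocatable or {}).get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")
```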
Module 7 – Cloud Migration to On-prem
- Design points
- Reverse-engineering the solution on-prem
- AI workload requirements
- Infrastructure requirements
- Scoping
- Sizing details
- How to gather metrics and quantify requirements (see the sizing sketch after this list)
- Cloud provider scoping and sizing guides
- Migration strategies and framework
  - Quantitative metrics
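To make the sizing discussion concrete, here is a back-of-envelope GPU memory estimate for serving an LLM: weight memory plus KV cache. Every number in the sketch is an illustrative assumption; substitute figures from the actual model card and the measured workload.

```python
# Rule-of-thumb GPU memory estimate for LLM serving (illustrative only).
def serving_memory_gib(params_b, bytes_per_param=2,
                       n_layers=32, n_kv_heads=8, head_dim=128,
                       kv_bytes=2, context_tokens=8192, batch=8):
    weights = params_b * 1e9 * bytes_per_param
    # KV cache per token = 2 (K and V) * layers * KV heads * head dim * bytes
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    kv_cache = kv_per_token * context_tokens * batch
    return (weights + kv_cache) / 2**30

# Example: an 8B-parameter FP16 model serving 8 concurrent 8K-token contexts.
print(f"~{serving_memory_gib(8):.1f} GiB before runtime overhead")
```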