Cisco AI Infrastructure Power Workshop – 3 days (AI-I-PW3)

Artificial intelligence (AI) is a major focus across industry and government. It is a rapidly evolving space whose advances provide greater insight, knowledge, and operational efficiency across many areas of operation.

Many businesses have identified AI as a strategic objective, but few have advanced to implementation and use. The aim of this session is to enable attendees to design, size, deploy, and configure a Cisco-centric AI infrastructure solution. Attendees are assumed to be familiar with the fundamentals of AI technology and concepts.

Objectives

  • Provide an overview of how to design and size a Cisco AI POD solution
  • Provide an overview of extended GPU operations and AI inferencing
  • Describe the technologies that form the AI landscape and how Cisco infrastructure and its ecosystem interact
  • Present a methodology framework for migrating an AI cloud solution to on-premises infrastructure

Target audience:

  • IT Architects and Designers
  • Presales SEs
  • Network Engineers
  • Server Administrators
  • AI Integrators

Prerequisite skills:

  • Network administration skills
  • Understanding of programming concepts
  • Conceptual understanding of VMs and containers
  • Basic knowledge of Cisco UCS server environment
  • Basic Linux overview
  • Fundamentals of AI

Duration:

3 days

Module 1 – AI System Overview

  • Overview of AI solution
    • Cisco AI POD overview
    • Components – HW and SW

Module 2 – Cisco AI POD Networking

  • NDFC and NX-OS configuration
  • QoS and CLI output
  • Infra networks
    • Front End
    • Backend
  • Host Networking
    • SuperNIC configuration
    • DOCA
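
A lossless Ethernet fabric is the foundation for the RoCE-based back-end network covered later in the course. A minimal NX-OS sketch of classifying RDMA traffic and enabling priority flow control (assumption: interface, class, and qos-group values are illustrative; real AI POD fabrics are provisioned through NDFC):

```
class-map type qos match-all ROCE-TRAFFIC
  match cos 3
policy-map type qos ROCE-CLASSIFY
  class ROCE-TRAFFIC
    set qos-group 3

interface Ethernet1/1
  priority-flow-control mode on
  service-policy type qos input ROCE-CLASSIFY
```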

Module 3 – AI Storage

  • Server disks
    • Boot
    • OpenShift Data …
    • Ceph
  • What is stored on the disks
    • Image
    • Container volumes
    • Data for training
  • Shared storage
    • Object storage
  • Repos – local in OpenShift
  • Kubernetes (K8s) storage
    • Persistent volumes
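
In practice, training data on shared storage is consumed through a Kubernetes PersistentVolumeClaim. A minimal sketch (assumption: the storage class name follows the OpenShift Data Foundation CephFS default; the claim name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data                           # illustrative name
spec:
  accessModes:
    - ReadWriteMany                             # shared across training pods
  storageClassName: ocs-storagecluster-cephfs   # assumption: ODF CephFS class
  resources:
    requests:
      storage: 500Gi
```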

Module 4 – Extended GPU Operations

  • GPU clustering
  • Backend network connectivity
  • RoCEv2
    • RDMA
    • IP protocol headers
  • Supporting software
  • NVLink integration
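
RoCEv2 carries RDMA across a routed L3 fabric by wrapping the InfiniBand Base Transport Header (BTH) in UDP/IP, with UDP destination port 4791. A minimal Python sketch of that encapsulation (field values are illustrative; real packets are built in hardware by the NIC):

```python
import struct

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def build_bth(opcode: int, dest_qp: int, psn: int, pkey: int = 0xFFFF) -> bytes:
    """Pack a 12-byte InfiniBand Base Transport Header (BTH).

    DestQP and PSN are 24-bit fields; the high byte of each 32-bit
    word below is the reserved byte that precedes them on the wire.
    """
    return struct.pack(
        "!BBHII",
        opcode,              # e.g. 0x04 = RC SEND-only
        0,                   # SE / MigReq / PadCnt / TVer flags
        pkey,                # partition key
        dest_qp & 0xFFFFFF,  # destination queue pair
        psn & 0xFFFFFF,      # packet sequence number
    )

def build_udp_header(src_port: int, payload_len: int) -> bytes:
    """UDP header addressed to the RoCEv2 well-known port (checksum 0)."""
    return struct.pack("!HHHH", src_port, ROCEV2_UDP_PORT, 8 + payload_len, 0)

bth = build_bth(opcode=0x04, dest_qp=0x12, psn=100)
datagram = build_udp_header(0xC000, len(bth)) + bth
```

Because the transport rides inside ordinary UDP/IP, the fabric only needs lossless Ethernet (PFC/ECN), not InfiniBand switching.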

Module 5 – Advanced AI Inferencing

  • Inference server operations
  • Inference frameworks
  • vLLM operations
  • NVIDIA NIM
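
vLLM exposes an OpenAI-compatible HTTP API, so inference clients only need to build a standard chat-completions request. A minimal sketch of the request body (assumption: the model name is illustrative; the same body shape is served by NVIDIA NIM microservices):

```python
import json

def chat_completion_body(model: str, prompt: str,
                         max_tokens: int = 256,
                         temperature: float = 0.2) -> str:
    """JSON body for an OpenAI-compatible /v1/chat/completions request,
    as served by vLLM's OpenAI-compatible server."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    })

body = chat_completion_body(
    "meta-llama/Llama-3.1-8B-Instruct",   # illustrative model name
    "Summarize RoCEv2 in one sentence.",
)
```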

Module 6 – How to Manage OpenShift

  • Operators for NVIDIA and distributed GPUs
  • Network configuration for the back-end network

Module 7 – Cloud Migration to On-prem

  • Design points
    • Reverse-engineering the solution for on-prem deployment
    • AI workload requirements
    • Infrastructure requirements
  • Scoping
    • Sizing details
    • How to gather metrics and quantify requirements
    • Cloud provider scoping and sizing guides
  • Migration strategies and framework
    • Quantitative
    • Metrics
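
Quantitative sizing typically starts from first-order arithmetic on the model itself. A rough Python sketch (assumption: the 20% overhead factor is illustrative; real KV-cache and activation memory depend on context length and concurrency):

```python
def model_memory_gib(params_billion: float,
                     bytes_per_param: int = 2,   # 2 = FP16/BF16, 1 = FP8/INT8
                     overhead: float = 1.2) -> float:
    """First-order GPU memory estimate for serving a model's weights,
    plus a flat multiplier for runtime buffers (assumed 20% here)."""
    return params_billion * 1e9 * bytes_per_param * overhead / 2**30

# A 70B-parameter model in FP16 comes out well above a single GPU's HBM,
# so it must be sharded across GPUs (e.g. tensor parallelism).
estimate = model_memory_gib(70)
```

Estimates like this, cross-checked against the cloud provider's own scoping and sizing guides, anchor the on-prem GPU count.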