AI Fundamentals and Cisco AI-Ready Infrastructure (AI-PW)

The focus of this Artificial Intelligence (AI) Power Workshop training is to enable Cisco engineers with an infrastructure background to understand the environment upon which AI systems are built.

The aim is not to explore AI programming or model development. However, sufficient detail will be provided to understand the requirements of AI systems and how to provision the relevant infrastructure.

The focus will be on provisioning AI in on-premises environments, aligned with the Cisco AI strategy.

The areas of focus include:

  • AI overview and key concepts
  • Server components such as GPUs
  • Networking and inter-device communication requirements
  • Common AI environments and tools
  • Containerization and bare metal software provisioning
  • Cisco AI offerings and frameworks
  • Associated partner integrations, including Red Hat and NVIDIA

The sessions will be augmented with lab exercises to enhance and reinforce the learning objectives.

Target audience:

  • IT Architects
  • Presales SEs
  • Network Engineers
  • Server Administrators

Pre-requisite skills:

  • Network Admin skills
  • Basic Linux operations
  • Understanding of programming logic
  • Conceptual understanding of VM and containers
  • Basic knowledge of Cisco UCS server environment

Duration:

5 days

Module 1 – AI Overview

Module overview

  • What it covers
    • AI infra engineering
    • Build the environment up to the point where:
      • Data engineers can apply their models
      • An off-the-shelf AI environment can be deployed
  • What is not covered
    • Customer Business AI requirements
    • ML and building AI logic

Overview of AI

  • History and evolution
  • Hype vs reality
    • AI vs intelligence
  • The AI landscape: Narrow AI, General AI, Super AI
    • General and Super AI remain a long way off
  • AI and pattern recognition
  • AI type overview
    • AI as a catch-all term
    • ML
    • Deep learning
  • Supervised and unsupervised learning
  • Learning and knowledge
    • Garbage in, garbage out
    • Bias
    • Hallucinations
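The supervised vs unsupervised distinction above can be sketched in a few lines of plain Python. This is a toy illustration only; the data points and class labels are invented, and a real workload would use an ML library rather than hand-rolled centroids.

```python
def centroid(points):
    """Mean of a list of 1-D points."""
    return sum(points) / len(points)

# Supervised: labels are provided, so we learn one centroid per class.
labelled = {"cat": [1.0, 1.2, 0.8], "dog": [5.0, 5.5, 4.5]}
centroids = {label: centroid(xs) for label, xs in labelled.items()}

def classify(x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Unsupervised: no labels; split points into two clusters using a
# threshold midway between the extremes (a crude stand-in for k-means).
data = [1.0, 1.2, 0.8, 5.0, 5.5, 4.5]
threshold = (min(data) + max(data)) / 2
clusters = ([x for x in data if x < threshold],
            [x for x in data if x >= threshold])
```

The point for infrastructure planning: supervised training needs labelled data pipelines, while unsupervised methods can run on raw data.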

Use case

  • Industry adoption – who is using AI
    • Real-world examples: healthcare, finance, transportation, entertainment, etc.
  • Use cases
    • Network analysis
      • Predictive
      • Root cause
      • Log analysis
    • Security
      • Cisco Hypershield
  • Fraud detection
  • Autonomous vehicles
  • Chatbots
  • etc.

Impact of AI

  • Social and jobs
  • Work and efficiency

Where are we now

  • Why recent boom
    • Hardware
    • Transformers
    • Code / software availability
  • Is the market ready?
    • Percentage of businesses that say they are ready
    • What do they want
  • Next steps and future
  • Where is Cisco now

AI operations

  • Patterns and predictions
  • AI Stages
    • Preprocess and validate data
    • Training
    • Post-training work
      • Validating and adjusting
      • Reinforcing
      • other
    • Inference
  • ML algorithms
    • Numerical calculations using CUDA
  • Deep learning
    • Neural network overview
    • Neurons, layers, activation functions.
    • Tensors
    • Models
      • Convolutional Neural Networks (CNNs)
      • Recurrent Neural Networks (RNNs)
      • Transformers
  • Natural Language Processing (NLP)
    • LLM and text
    • Next-word prediction
    • Text preprocessing: tokenization, stemming, lemmatization.
  • Computer Vision
    • Image processing fundamentals.
    • Feature extraction: edge detection, corner detection.
  • Hybrid models
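The NLP topics above (tokenization and next-word prediction) can be sketched end to end in plain Python: tokenize a toy corpus, count bigrams, and predict the most frequent follower of a word. The corpus is invented for illustration, and real LLMs use learned subword tokenizers and transformer networks rather than bigram counts.

```python
import re
from collections import Counter, defaultdict

corpus = "the model predicts the next word and the next word follows"

# Tokenization: lower-case and split on non-letters.
tokens = re.findall(r"[a-z]+", corpus.lower())

# Count bigrams: how often each word follows another.
followers = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    followers[prev][nxt] += 1

def guess_next(word):
    """Return the most frequent word seen after `word`."""
    return followers[word].most_common(1)[0][0]
```

For example, `guess_next("the")` returns `"next"`, since "next" follows "the" twice in the corpus.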

Model overview

  • What is a model
  • Numbers and naming
    • Parameters
    • Model Size
    • Data set size
    • Layers
    • FLOPs
      • computational cost of a model
    • Granularity and accuracy
  • Tree structure and weightings
  • Model types
    • GPT
    • Llama
    • DALL-E
    • etc.
  • Common Models
    • ChatGPT – OpenAI
    • Llama – Meta
    • Claude – Anthropic
    • Gemini – Google
    • Watson – IBM
    • Qwen – Alibaba
    • DeepSeek
  • Learning resources
  • Public model sources
    • GitHub
    • Ollama
    • Hugging Face
    • NVIDIA NGC
      • {Add URLS}
  • Training data sources
    • Internet
    • Open-source data sets
    • In house

Lab 

  • Connect to NVIDIA site and look over the different types of AI
  • Look at Various Repos and what models they have

Module 2 – AI Deployments

AI Deployment Options

  • Deployment Options overview
    • Cloud vs on-prem
      • hybrid
    • Turnkey/bundle vs DIY
  • Cloud
    • Google Cloud AI
      • Google Colab
    • AWS AI
      • Amazon SageMaker Studio
    • Azure AI
    • IBM
    • Other key players?
  • On-premises design overview
    • Deployment options and big picture
      • How it all hangs together
      • Blueprints
    • Bare metal
    • Virtualization and containers

On-prem infrastructure components

  • Compute nodes
    • Generic
    • Cisco models (more later)
  • OS
    • Drivers
  • PCIe
    • Operation
    • Speeds
  • CPU vs GPU vs TPU vs DPU vs NPU
  • CPU
    • Intel Xeon
      • 5th Gen with 4th Gen
      • AMX and AVX-512
      • PyTorch (IPEX)
      • DeepSpeed
      • multi-CPU inferencing
  • AMD
  • Embedded Inference capabilities
  • GPU operations
    • Architecture
      • vRAM and cache
      • cores
        • CUDA
        • Tensor
      • Parallelisation
      • CPU to GPU comms
      • Control channels
  • GPU cards and models
    • Nvidia
    • AMD
    • Intel
  • Performance metrics and requirements
    • Stats and GPU requirements
      • GPU to work task expectations
    • CPU
    • RAM and vRAM
    • IO
  • Sharing Single GPU
    • MIG
    • vGPU
    • time slicing
  • Networking
    • NIC and offload
      • DPU
    • Speeds
      • 400-800G
      • throughput/bandwidth
      • Stats
    • MTU
    • Switches and fabrics
    • Latency, QoS and loss/delay
      • Converged Ethernet – PFC
      • ToS
  • InfiniBand positioning
  • Storage and databases
    • Size requirements
    • Performance considerations
    • Connectivity options
      • Local
      • IP/NAS
      • FC
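The link speeds listed above (400-800G) matter because the time to move model weights or checkpoints between nodes scales with size divided by bandwidth. A back-of-the-envelope sketch, with an assumed 90% link efficiency (real transfers add protocol and congestion overhead):

```python
def transfer_seconds(size_gb, link_gbps, efficiency=0.9):
    """Seconds to move size_gb gigabytes over a link_gbps link."""
    bits = size_gb * 8e9                     # gigabytes -> bits
    return bits / (link_gbps * 1e9 * efficiency)

# Moving a 140 GB checkpoint over 400G vs 10G:
t_400g = transfer_seconds(140, 400)          # roughly 3 seconds
t_10g = transfer_seconds(140, 10)            # roughly 2 minutes
```

The same arithmetic applies to storage IO, which is why fabric and storage throughput are sized together for training clusters.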

Containerization and Virtualization

  • Ready-made containers vs bare metal
  • Container primer
    • Container fundamentals
      • Runtimes
      • Supporting software
  • Podman and Docker
    • Desktop and CLI
    • Options and how to set them up
  • Images vs containers
  • Image sources/repos
  • Config files
  • Kubernetes overview
    • {Base as very deep – enough for openshift}
    • {Flow TBD}
  • AI and Containers
    • GPUs and Containers
  • VMs and AI
    • Direct I/O
    • VMware

Building an AI environment

  • Host setup
    • Tools, drivers and software frameworks
      • nvidia-smi
    • Other
  • How to run the AI
    • Python
  • Opensource tools
    • Ollama
    • Podman AI
  • Human interaction
    • Open WebUI
    • Other
      • n8n – https://n8n.io/
  • API interaction
    • Access to and usability
    • Keys
    • Tools and orchestrators
      • Ansible
      • etc
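The API interaction topic above can be sketched against a locally hosted model, using Ollama's REST endpoint as the example. Building the request is shown in full; the actual call is commented out because it needs a running Ollama server, and the model name and URL are assumptions for illustration.

```python
import json
import urllib.request

def build_request(prompt, model="llama3",
                  url="http://localhost:11434/api/generate"):
    """Prepare an HTTP POST for Ollama's generate endpoint."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request("Summarise RDMA in one sentence.")
# with urllib.request.urlopen(req) as resp:   # requires a live server
#     print(json.load(resp)["response"])
```

Orchestrators such as Ansible interact with AI services through the same kind of authenticated HTTP calls, typically adding an API key header.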

Lab

  • Connect to the Housley labs
  • Connect to VMs/hosts with GPUs
  • Build container AI environment
  • Explore build environments in a podman environment
    • External and internal to container
  • Run and explore various options with different models, etc.

 

Module 3 – AI Enhancements and Tuning

RAG

  • How it works
  • Vector DBs
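The retrieval step behind RAG and vector DBs can be sketched in plain Python: embed documents as vectors, then return the document most similar to the query vector by cosine similarity. The embeddings here are tiny hand-made vectors for illustration; a real system generates them with an embedding model and stores them in a vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm

# Hypothetical document embeddings (a real store holds thousands).
docs = {
    "gpu_sizing.txt": [0.9, 0.1, 0.0],
    "holiday_policy.txt": [0.0, 0.2, 0.9],
}

def retrieve(query_vec):
    """Most similar document; its text is then prepended to the prompt."""
    return max(docs, key=lambda name: cosine(query_vec, docs[name]))
```

The retrieved text is injected into the model's prompt, which is how RAG grounds answers in private data without retraining the model.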

Retraining

  • How it works
  • Pros and cons

AI Customization

  • Prompt engineering
  • Pipelines/workflows
  • Filtering and privacy
  • Search, watch and update
    • Scan the web
  • Agents
  • Tools
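Prompt engineering, the first customization topic above, mostly amounts to wrapping the user's question in a controlled template. A minimal sketch (the system instruction and context text are invented for illustration):

```python
PROMPT_TEMPLATE = """You are a network operations assistant.
Answer only from the context below; say "I don't know" otherwise.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question, context):
    """Wrap a raw question in a system instruction plus context."""
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    question="What MTU should the fabric use?",
    context="The AI fabric is configured with an MTU of 9216 bytes.",
)
```

Pipelines chain steps like this together, and filtering rules can inspect both the assembled prompt and the model's answer for privacy enforcement.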

Scaling AI

  • AI GPU clusters
    • What it is used for
      • Distribute workload over multiple GPUs
    • How it works
      • RDMA
      • SR-IOV
    • Transports
      • RoCE
      • InfiniBand
      • NVLink
    • Controls
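Distributing a workload over multiple GPUs, as described above, commonly follows the data-parallel pattern: shard the batch across workers, compute gradients locally, then all-reduce (average) them before updating. A pure-Python stand-in; real clusters do the reduce with NCCL over NVLink, RoCE or InfiniBand.

```python
def shard(batch, n_workers):
    """Split a batch into n_workers roughly equal shards."""
    return [batch[i::n_workers] for i in range(n_workers)]

def local_gradient(shard_data):
    """Stand-in for a per-GPU backward pass (here: the shard mean)."""
    return sum(shard_data) / len(shard_data)

def all_reduce(grads):
    """Average gradients across workers, as an NCCL all-reduce would."""
    return sum(grads) / len(grads)

batch = [1.0, 2.0, 3.0, 4.0]
grads = [local_gradient(s) for s in shard(batch, 2)]
global_grad = all_reduce(grads)   # matches the single-GPU result
```

Because every worker must exchange gradients at each step, the all-reduce phase is what makes low-latency, lossless fabrics (RDMA over RoCE or InfiniBand) a hard requirement for GPU clusters.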

Lab 

  • Add documents to the base model to create a RAG-enhanced environment
  • Explore Prompt engineering and filtering
  • Look at workflows and enhanced AI tools
    • Possibly on the web
  • Explore preconfigured Clustered GPU environment

 

Module 4 – Cisco AI

Cisco AI approach

  • Roadmap and vision
  • CVDs – options and overview
    • AI stack – Cisco and partners
  • Hyperfabric AI pod

Cisco Equipment

  • Networking
    • Hyperfabric
    • Nexus/NDFC
    • ACI
  • UCS
    • C885 and C845
      • Architecture
    • UCS X
    • Intersight
    • Other servers

Partners

  • Ecosystem overview
  • Nvidia overview
  • Red Hat overview
  • Storage
    • Pure
  • Databases

NVIDIA AI

  • NVAIE overview
  • Local apps
  • Data Center GPU Manager (DCGM)

Red Hat, OpenShift and AI

  • Overview
  • RHEL and AI
  • OpenShift overview and architecture
  • AI platform
    • extensions
    • Operators

Ongoing ownership

  • Monitoring and Troubleshooting
    • Tools

Lab 

  • Explore
    • lab backend
    • Hyperfabric
    • Intersight
    • NVAIE
    • OpenShift
    • Monitoring tools

Module 5 – Customer Home-Grown AI

Development environments overview

  • Development environment overview
  • Python Libraries overview
    • TensorFlow, PyTorch, NumPy, Pandas, Scikit-learn
  • Jupyter Notebook
  • NVAIE workbench
    • overview
    • Tools
  • OpenShift workbench
    • overview
    • Tools
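The kind of "sample model" built in these development environments can be reduced to a few lines of plain Python: fit y = w·x to toy data by gradient descent, which is exactly the loop that libraries such as PyTorch automate. The data and learning rate below are illustrative.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]        # generated from the true rule y = 2x

w = 0.0                          # single trainable parameter (the "weight")
lr = 0.01                        # learning rate
for _ in range(200):
    # Gradient of mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad               # gradient descent step
# w converges towards 2.0, recovering the rule the data came from
```

Deep learning frameworks run this same train-step loop over millions of parameters, with the backward pass and optimizer handled for you.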

Lab 

  • Build a sample model in the NVAIE workbench and Jupyter Notebook