How to Pick a GPU and Inference Engine for Large Language Models
YouTube videos curated by Class Central.

Classroom Contents

1. How to pick a GPU and software for inference
2. Video Overview
3. Effect of Quantization on Quality
4. Effect of Quantization on Speed
5. Effect of GPU bandwidth relative to model size
6. Effect of de-quantization on inference speed
7. Marlin Kernels, AWQ and Neural Magic
8. Inference Software - vLLM, TGI, SGLang, NIM
9. Deploying one-click templates for inference
10. Testing inference speed for a batch size of 1 and 64
11. SGLang inference speed
12. vLLM inference speed
13. Text Generation Inference Speed
14. Nvidia NIM Inference Speed
15. Comparing vLLM, SGLang, TGI and NIM Inference Speed
16. Comparing inference costs for A40, A6000, A100 and H100
17. Inference Setup for Llama 3.1 70B and 405B
18. Running inference on Llama 8B on A40, A6000, A100 and H100
19. Inference cost comparison for Llama 8B
20. Running inference on Llama 70B and 405B on A40, A6000, A100 and H100
21. Inference cost comparison for Llama 70B and 405B
22. OpenAI GPT-4o Inference Costs versus Llama 3.1 8B, 70B, 405B
23. Final Inference Tips
24. Resources