How to Pick a GPU and Inference Engine for Large Language Models
Overview
Syllabus
How to pick a GPU and software for inference
Video Overview
Effect of Quantization on Quality
Effect of Quantization on Speed
Effect of GPU bandwidth relative to model size
Effect of de-quantization on inference speed
Marlin Kernels, AWQ and Neural Magic
Inference Software - vLLM, TGI, SGLang, NIM
Deploying one-click templates for inference
Testing inference speed for a batch size of 1 and 64
SGLang inference speed
vLLM inference speed
Text Generation Inference (TGI) inference speed
NVIDIA NIM inference speed
Comparing vLLM, SGLang, TGI and NIM inference speed
Comparing inference costs for A40, A6000, A100 and H100
Inference Setup for Llama 3.1 70B and 405B
Running inference on Llama 3.1 8B on A40, A6000, A100 and H100
Inference cost comparison for Llama 3.1 8B
Running inference on Llama 70B and 405B on A40, A6000, A100 and H100
Inference cost comparison for Llama 70B and 405B
OpenAI GPT-4o inference costs versus Llama 3.1 8B, 70B and 405B
Final Inference Tips
Resources
Taught by
Trelis Research