How to Pick a GPU and Inference Engine for Large Language Models
Overview
Syllabus
How to pick a GPU and software for inference
Video Overview
Effect of Quantization on Quality
Effect of Quantization on Speed
Effect of GPU bandwidth relative to model size
Effect of de-quantization on inference speed
Marlin Kernels, AWQ and Neural Magic
Inference Software - vLLM, TGI, SGLang, NIM
Deploying one-click templates for inference
Testing inference speed for a batch size of 1 and 64
SGLang inference speed
vLLM inference speed
Text Generation Inference (TGI) inference speed
NVIDIA NIM inference speed
Comparing vLLM, SGLang, TGI and NIM inference speed
Comparing inference costs for A40, A6000, A100 and H100
Inference Setup for Llama 3.1 70B and 405B
Running inference on Llama 3.1 8B on A40, A6000, A100 and H100
Inference cost comparison for Llama 3.1 8B
Running inference on Llama 70B and 405B on A40, A6000, A100 and H100
Inference cost comparison for Llama 70B and 405B
OpenAI GPT-4o inference costs versus Llama 3.1 8B, 70B and 405B
Final Inference Tips
Resources
Taught by
Trelis Research