How to Pick a GPU and Inference Engine for Large Language Models

Trelis Research via YouTube


Classroom Contents


  1. How to pick a GPU and software for inference
  2. Video Overview
  3. Effect of Quantization on Quality
  4. Effect of Quantization on Speed
  5. Effect of GPU bandwidth relative to model size (a worked estimate follows this list)
  6. Effect of de-quantization on inference speed
  7. Marlin Kernels, AWQ and Neural Magic
  8. Inference Software - vLLM, TGI, SGLang, NIM
  9. Deploying one-click templates for inference
  10. Testing inference speed for a batch size of 1 and 64 (a benchmark sketch follows this list)
  11. SGLang inference speed
  12. vLLM inference speed
  13. Text Generation Inference Speed
  14. Nvidia NIM Inference Speed
  15. Comparing vLLM, SGLang, TGI and NIM Inference Speed
  16. Comparing inference costs for A40, A6000, A100 and H100
  17. Inference Setup for Llama 3.1 70B and 405B
  18. Running inference on Llama 8B on A40, A6000, A100 and H100
  19. Inference cost comparison for Llama 8B
  20. Running inference on Llama 70B and 405B on A40, A6000, A100 and H100
  21. Inference cost comparison for Llama 70B and 405B
  22. OpenAI GPT4o Inference Costs versus Llama 3.1 8B, 70B, 405B
  23. Final Inference Tips
  24. Resources
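Chapter 5's point about GPU bandwidth relative to model size comes down to simple arithmetic: at batch size 1, decoding is memory-bandwidth bound, because generating each token requires streaming all of the model weights from GPU memory. A rough upper bound is therefore tokens per second ≈ memory bandwidth divided by model size in bytes. Below is a minimal sketch of that estimate; the bandwidth figures and model sizes are approximate assumptions for illustration, not measurements from the video.

```python
# Rough upper bound on single-stream decode speed: every generated token
# requires reading all model weights over the GPU memory bus, so
# tokens/s <= memory bandwidth / model size in bytes.
# All figures below are approximate, illustrative assumptions.

GPUS_GBPS = {"A40": 696, "A6000": 768, "A100": 2039, "H100": 3350}  # bandwidth, GB/s (approx.)

def max_decode_tps(bandwidth_gbps: float, params_b: float, bytes_per_param: float) -> float:
    """Upper bound on tokens/s for one sequence, ignoring KV cache and activations."""
    model_gb = params_b * bytes_per_param  # bytes of weights read per generated token
    return bandwidth_gbps / model_gb

for gpu, bw in GPUS_GBPS.items():
    fp16 = max_decode_tps(bw, 8, 2.0)  # Llama 3.1 8B in FP16 (~16 GB of weights)
    int4 = max_decode_tps(bw, 8, 0.5)  # same model 4-bit quantized (~4 GB of weights)
    print(f"{gpu}: <= {fp16:.0f} tok/s FP16, <= {int4:.0f} tok/s 4-bit")
```

The same arithmetic underlies chapters 3 through 7: 4-bit quantization cuts the bytes read per token to roughly a quarter of FP16, so batch-1 decode speed can rise accordingly, provided the de-quantization kernels (for example Marlin) keep up.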
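For the speed tests in chapters 10 through 15, each engine is measured by timing a fixed generation workload at batch sizes 1 and 64. A minimal sketch using vLLM's offline Python API is below; the model id, prompt, and token counts are assumptions for illustration, and TGI, SGLang, and NIM would typically be measured over their HTTP endpoints instead.

```python
# Minimal batch-size benchmark sketch using vLLM's offline API.
# Model id, prompt, and generation length are illustrative assumptions.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model id
params = SamplingParams(temperature=0, max_tokens=128)

def benchmark(batch_size: int) -> float:
    prompts = ["Write a short story about a GPU."] * batch_size
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / elapsed  # aggregate tokens per second

for bs in (1, 64):
    print(f"batch size {bs}: {benchmark(bs):.0f} tok/s total")
```

Aggregate tokens per second typically grows substantially from batch 1 to batch 64, because larger batches amortize the per-token weight reads; that difference is the main lever behind the per-GPU cost comparisons in chapters 16 through 21.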
