Deploy LLM to Production on Single GPU - REST API for Falcon 7B with QLoRA on Inference Endpoints

Deploy LLM to Production on Single GPU - REST API for Falcon 7B with QLoRA on Inference Endpoints

Venelin Valkov via YouTube Direct link

- Inference with the Merged Model

6 of 10

6 of 10

- Inference with the Merged Model

Class Central Classrooms beta

YouTube playlists curated by Class Central.

Classroom Contents

Deploy LLM to Production on Single GPU - REST API for Falcon 7B with QLoRA on Inference Endpoints

Automatically move to the next video in the Classroom when playback concludes

  1. 1 - Introduction
  2. 2 - Text Tutorial on MLExpert.io
  3. 3 - Google Colab Setup
  4. 4 - Merge QLoRA adapter with Falcon 7B
  5. 5 - Push Model to HuggingFace Hub
  6. 6 - Inference with the Merged Model
  7. 7 - HuggingFace Inference Endpoints with Custom Handler
  8. 8 - Create Endpoint for the Deployment
  9. 9 - Test the Rest API
  10. 10 - Conclusion

Never Stop Learning.

Get personalized course recommendations, track subjects and courses with reminders, and more.

Someone learning on their laptop while sitting on the floor.