Class Central is learner-supported. When you buy through links on our site, we may earn an affiliate commission.

YouTube

Building an LLM Fine-Tuning Dataset - From Reddit Comments to QLoRA Training

sentdex via YouTube

Overview

Learn how to build a QLoRA fine-tuning dataset for language models in this comprehensive video tutorial. Explore various Reddit dataset options, including torrent files, Archive.org, and BigQuery. Follow step-by-step instructions on exporting BigQuery Reddit data, decompressing gzip archives, and recombining archives for target subreddits. Discover the proper data structure, build training samples, and save them to a database. Create customized training JSON files and dive into QLoRA training and results. Gain valuable insights into dataset building for fine-tuning language models through practical demonstrations and explanations.

Syllabus

- Introduction to Dataset building for fine-tuning.
- The Reddit dataset options Torrent, Archive.org, BigQuery
- Exporting BigQuery Reddit and some other data
- Decompressing all of the gzip archives
- Re-combining the archives for target subreddits
- How to structure the data
- Building training samples and saving to database
- Creating customized training json files
- QLoRA training and results

Taught by

sentdex

Reviews

Start your review of Building an LLM Fine-Tuning Dataset - From Reddit Comments to QLoRA Training

Never Stop Learning.

Get personalized course recommendations, track subjects and courses with reminders, and more.

Someone learning on their laptop while sitting on the floor.