LLM Pipelines: Seamless Integration on Embedded Devices - Optimizing Large Language Models for Edge Computing
EDGE AI FOUNDATION via YouTube
Overview
Watch a technical presentation exploring the deployment of Large Language Models (LLMs) on embedded devices through NXP's LLM Pipelines project. Learn about solutions for improving LLM deployment through quantization and fine-tuning, with a focus on NXP's high-end MPUs: the i.MX 8M Plus, i.MX 93, and i.MX 95. Discover how quantization can reduce model size and improve execution time while preserving accuracy, particularly in auto-regressive models. Explore the use of Retrieval Augmented Generation (RAG) for specialized use cases such as an in-car assistant, including methods for handling hardware constraints and managing out-of-topic queries. Gain insight into comparing different quantization approaches and addressing RAG-related challenges in embedded systems development.
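To make the quantization idea concrete, here is a minimal sketch of symmetric int8 post-training quantization, the general class of technique the talk covers for shrinking LLM weights on embedded hardware. The function names and the single-scale-per-tensor scheme are illustrative assumptions, not NXP's actual pipeline API.

```python
def quantize_int8(weights):
    """Map float weights to int8 using one symmetric scale per tensor.

    Storing int8 instead of float32 cuts memory roughly 4x, which is
    the core trade-off discussed for edge deployment.
    """
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the int8 representation."""
    return [v * scale for v in q]


weights = [0.52, -1.27, 0.03, 0.89]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

In practice, per-channel scales and calibration data are used to limit the accuracy loss that the presentation measures against execution time.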
Syllabus
Introduction
LLM Pipelines
Metrics
Fine Tuning
Quantization
Conclusion
Strategic Partners
Taught by
EDGE AI FOUNDATION