L-3 | Building LLM Tokenizers From Scratch (With Code!)

3.0K views · 301 likes · 37:31 · Dec 9, 2025

In the last lecture, we built our own TinyGPT LLM from scratch using manual tokenization. Today, we upgrade that system using real, production-level tokenizers.

GitHub (both links contain the same code):
https://github.com/codewithaarohi/Build_mini_gpt_with_tokenizer
https://github.com/AarohiSingla/Build-a-Mini-GPT-Model-From-Scratch_with_tokenizer

📧 You can also reach me at: aarohisingla1987@gmail.com
📸 Follow me on Instagram (English): @codewithaarohi
🔗 https://www.instagram.com/codewithaarohi/
📸 Follow me on Instagram (Hindi): @codewithaarohihindi
🔗 https://instagram.com/codewithaarohihindi

If you haven’t watched the previous lecture, I highly recommend watching it first: we built the entire TinyGPT model step by step.

In this video, you will learn:
- What tokenizers really do
- How LLMs convert text → tokens → numbers
- How to use SentencePiece
- How to use BPE (Byte Pair Encoding)
- How to use pretrained tokenizers like GPT-2, BERT, LLaMA, T5
- How to train your own tokenizer from your own dataset
- How vocabulary size, domain-specific text, and language mix affect tokens
- How embedding layers convert token IDs into vectors
- How to integrate everything into our TinyGPT model

Libraries Covered
- sentencepiece (train your own tokenizer)
- tokenizers (BPE, ByteLevelBPETokenizer)
- gensim (Word2Vec, FastText embeddings)
- transformers (HuggingFace tokenizers)

👍 Support the Channel
Your support pushes me to create even better videos. Please Like, Comment, Share, and Subscribe ❤️
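
To make the text → tokens → numbers → vectors idea above concrete, here is a minimal pure-Python sketch of toy BPE training, encoding, and an embedding lookup. It is not the code from the video or the linked repositories, and it does not use the sentencepiece or tokenizers APIs. The function names (train_bpe, merge_pair, tokenize, build_vocab), the tiny corpus, and the random embedding table are all assumptions made up for illustration.

```python
import random
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def tokenize(word, merges):
    """Split `word` into subword tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def build_vocab(corpus, merges):
    """Map every base character and every merged symbol to an integer ID."""
    vocab = {}
    for ch in sorted({ch for w in corpus for ch in w}):
        vocab.setdefault(ch, len(vocab))
    for a, b in merges:
        vocab.setdefault(a + b, len(vocab))
    return vocab


# Hypothetical toy corpus (one entry per word occurrence).
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3

merges = train_bpe(corpus, num_merges=10)
vocab = build_vocab(corpus, merges)

# text -> tokens -> numbers
tokens = tokenize("lowest", merges)
ids = [vocab[t] for t in tokens]

# numbers -> vectors: a stand-in embedding table with random values.
# In a real model these rows are learned parameters.
embed_dim = 4
rng = random.Random(0)
embedding_table = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in vocab]
vectors = [embedding_table[i] for i in ids]

print(tokens)
print(ids)
print(len(vectors), "vectors of size", embed_dim)
```

The word "lowest" never appears in the toy corpus, yet it can still be encoded because it is built from known characters and learned subwords. A character that never appears in the corpus would raise a KeyError in this sketch. Real tokenizers avoid that with an unknown token or a byte-level fallback.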
