L-7 | How Transformers Calculate Attention Scores | LLM Series

1.3K views · 74 likes · 21:04 · Jan 4, 2026

In this lecture, we dive deep into Scaled Dot-Product Attention, one of the most important concepts in Transformer models, introduced in the paper “Attention Is All You Need”.

This video continues the previous lecture, where we discussed Query, Key, and Value (Q, K, V) and how they are computed using learned weight matrices. Today, we focus on how attention scores are calculated step by step inside the Transformer encoder.

🔍 What you’ll learn in this video:
- How Transformers prepare input using tokenization, embeddings, and positional encoding
- The role of Query, Key, and Value in self-attention
- How to compute Q, K, and V using weight matrices
- Step-by-step calculation of dot-product attention
- Why we scale attention scores using √dₖ
- How softmax converts scores into attention weights
- How attention weights are multiplied with Value vectors
- Understanding matrix shapes: Q, K, Kᵀ, QKᵀ, and output dimensions
- Intuition behind context-aware representations in the encoder

By the end of this lecture, you will clearly understand how:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) × V

This video is ideal for:
- Beginners learning Transformers
- Students studying Deep Learning / NLP
- Anyone preparing for interviews or research in AI

📸 Follow me on Instagram: @codewithaarohihindi
🔗 https://instagram.com/codewithaarohihindi
📧 You can also reach me at: aarohisingla1987@gmail.com
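For readers who want to try the formula Attention(Q, K, V) = softmax(QKᵀ / √dₖ) × V by hand, here is a minimal sketch in plain Python. It is not code from the lecture: the helper names (`matmul`, `transpose`, `softmax`, `attention`) and the toy Q, K, V values are assumptions chosen only for illustration, and a real model would compute Q, K, and V from embeddings using learned weight matrices.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns, turning K into Kᵀ."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Turn one row of scores into weights that sum to 1 (max subtracted for stability)."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(QKᵀ / √dₖ) × V."""
    d_k = len(k[0])                                   # key dimension
    scores = matmul(q, transpose(k))                  # QKᵀ, shape (tokens x tokens)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]        # row-wise attention weights
    return matmul(weights, v), weights                # output shape (tokens x d_v)

# Hypothetical toy input: 3 tokens, d_k = d_v = 2 (values are made up).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how much one token attends to every token, and each row of `output` is the corresponding weighted mix of Value vectors, which is the context-aware representation the description refers to.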
