Question 1

What is the input to a Transformer decoder?

Accepted Answer

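The decoder's input is not the encoder's output; this is the most common confusion. At inference time, the decoder's input is the sequence of tokens the decoder itself has generated so far. When the decoder first starts, I give it the SOS (Start Of Sequence) token; the softmax over the decoder's output then yields the next token, which is appended to the input for the following step. The encoder's output does reach the decoder, but through cross-attention rather than as this input sequence.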
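A minimal sketch of that generation loop, assuming greedy decoding (always picking the most probable token). The token ids, the vocabulary size and `toy_decoder_step` are hypothetical stand-ins for a real trained decoder; only the loop structure is the point.

```python
import numpy as np

SOS, EOS = 0, 1          # hypothetical special-token ids
VOCAB_SIZE = 6           # toy vocabulary, for illustration only

def toy_decoder_step(tokens):
    """Stand-in for a real decoder: returns logits for the next token.

    Hypothetical rule: favour (last token + 2), and favour EOS once
    that value would fall outside the vocabulary.
    """
    logits = np.zeros(VOCAB_SIZE)
    nxt = tokens[-1] + 2
    logits[nxt if nxt < VOCAB_SIZE else EOS] = 5.0
    return logits

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def greedy_decode(max_len=10):
    tokens = [SOS]                        # decoder input starts with SOS only
    while len(tokens) < max_len:
        probs = softmax(toy_decoder_step(tokens))
        next_token = int(np.argmax(probs))
        tokens.append(next_token)         # the output joins the next step's input
        if next_token == EOS:
            break
    return tokens

print(greedy_decode())   # [0, 2, 4, 1]
```

Each pass through the loop feeds the whole sequence generated so far back into the decoder, which is why the input grows by one token per step.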

Question 2

Why does the decoder start from the SOS token?

Accepted Answer

Because at the initial step the decoder has no generated tokens at all. So we give it a special token, SOS, to get the generation process started. The SOS embedding plus its positional encoding then forms an input of shape 1×512 (one token, with a model dimension of 512), and masked attention runs on top of that.
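To make the 1×512 shape concrete, here is a small sketch. It assumes the sinusoidal positional encoding from the original Transformer paper and uses a random table as a stand-in for learned embeddings; `VOCAB_SIZE` and the SOS id are hypothetical values chosen for illustration.

```python
import numpy as np

D_MODEL = 512        # model width from the answer above
VOCAB_SIZE = 1000    # hypothetical vocabulary size
SOS = 0              # hypothetical id of the SOS token

rng = np.random.default_rng(0)
# Random stand-in for a learned embedding table.
embedding_table = rng.normal(size=(VOCAB_SIZE, D_MODEL))

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]     # (1, d_model / 2)
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

tokens = np.array([SOS])                          # only SOS at the first step
x = embedding_table[tokens] + positional_encoding(len(tokens), D_MODEL)
print(x.shape)   # (1, 512)
```

After the second token is generated, the same code with two token ids would give a 2×512 input.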

Question 3

Why is masking needed in masked self-attention?

Accepted Answer

Transformer decoders, and the LLMs built on them, are autoregressive: they generate one token at a time, left to right. I apply the mask so that the decoder cannot 'see' future tokens and predicts the next token from the previously generated tokens only. This matters most during training, where the full target sequence is fed in at once. The mask blocks the positions in the attention matrix that lie to the right of each query, and after softmax their attention weights become zero.
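The pattern of blocked positions can be sketched as a lower-triangular boolean matrix. The helper name `causal_mask` is my own, not something from the video.

```python
import numpy as np

def causal_mask(n):
    """True where query position i may attend to key position j (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))

print(causal_mask(3).astype(int))
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]
```

Row i is the query for token i: it can look at itself and everything before it, and nothing after.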

Question 4

How is the size of the attention matrix decided in the decoder?

Accepted Answer

The size of the attention matrix depends on the number of tokens: rows are queries and columns are keys. If the decoder input has 2 tokens, the attention matrix is 2×2; with 3 tokens it is 3×3. This is why the attention computation grows as tokens are added: for n tokens the matrix has n×n entries.
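A quick shape check, assuming scaled dot-product scores and a per-head width of 64 (an assumed value; the answer above does not state it). The query and key matrices are random placeholders.

```python
import numpy as np

D_K = 64  # assumed per-head query/key width, for illustration

def attention_scores(q, k):
    """Scaled dot-product scores: one row per query, one column per key."""
    return q @ k.T / np.sqrt(k.shape[-1])

rng = np.random.default_rng(0)
for n in (1, 2, 3):
    q = rng.normal(size=(n, D_K))
    k = rng.normal(size=(n, D_K))
    print(n, attention_scores(q, k).shape)
# 1 (1, 1)
# 2 (2, 2)
# 3 (3, 3)
```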

Question 5

What does the negative-infinity mask do together with softmax?

Accepted Answer

For masking, I put negative infinity (in practice, a very large negative number) at the future-token positions of the score matrix before softmax. In softmax, e^(negative infinity) is zero, so the attention weight at that position becomes zero. That means the decoder cannot attend to that token's key/value at all.
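A small numeric sketch of this effect. The raw scores are all set to zero purely for illustration, so each row splits its weight evenly over the positions it is allowed to see; `masked_softmax` is a name I chose for this example.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Row-wise softmax after setting disallowed positions to -inf."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(masked)                                # exp(-inf) is exactly 0.0
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))                      # equal raw scores, for illustration
mask = np.tril(np.ones((3, 3), dtype=bool))    # key j allowed only when j <= i
weights = masked_softmax(scores, mask)
print(weights)
# row 0 -> [1, 0, 0]
# row 1 -> [0.5, 0.5, 0]
# row 2 -> [1/3, 1/3, 1/3]
```

Every row still sums to 1; the masked positions simply contribute nothing to the weighted sum of values.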

Question 6

Why do we use multi-head attention in the decoder?

Accepted Answer

A single head gives you one perspective of attention, whereas multiple heads (8 in the paper, for example) learn different patterns in parallel. I then concatenate their outputs and pass them through a linear projection (W_o) to produce one final representation. This lets the model capture richer information.
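The split, attend, concatenate and project steps can be sketched as below. This assumes a model width of 512 split across 8 heads (64 dimensions each), uses random matrices in place of learned weights, and leaves out the causal mask to keep the example short.

```python
import numpy as np

D_MODEL, NUM_HEADS = 512, 8
D_HEAD = D_MODEL // NUM_HEADS    # 64 dimensions per head

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    n = x.shape[0]

    def split(t):
        # (n, D_MODEL) -> (NUM_HEADS, n, D_HEAD)
        return t.reshape(n, NUM_HEADS, D_HEAD).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D_HEAD)     # (NUM_HEADS, n, n)
    heads = softmax(scores) @ v                              # (NUM_HEADS, n, D_HEAD)
    concat = heads.transpose(1, 0, 2).reshape(n, D_MODEL)    # concatenate the heads
    return concat @ w_o                                      # final projection W_o

rng = np.random.default_rng(0)
# Random stand-ins for the learned projection matrices.
w_q, w_k, w_v, w_o = (rng.normal(scale=0.02, size=(D_MODEL, D_MODEL)) for _ in range(4))
x = rng.normal(size=(3, D_MODEL))          # 3 tokens
out = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)   # (3, 512)
```

Each head works on its own 64-dimensional slice and has its own 3×3 attention matrix, yet the output keeps the same 3×512 shape as the input.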

Question 7

How does cross-attention (encoder-decoder attention) work in the decoder?

Accepted Answer

Cross-attention gives the decoder two inputs: the encoder's output matrix and the decoder's intermediate output after masked attention. The queries come from the decoder side, while the keys and values come from the encoder output. The name says it: encoder-decoder attention means the decoder uses the encoder's learned representation of the source to drive generation. In the video I highlighted this as the second sub-layer.
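A single-head sketch of this sub-layer, assuming 5 source tokens on the encoder side and 2 generated tokens on the decoder side. The weights and inputs are random placeholders, and a real layer would use multiple heads plus an output projection.

```python
import numpy as np

D_MODEL = 512

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_x, encoder_out, w_q, w_k, w_v):
    q = decoder_x @ w_q        # queries come from the decoder
    k = encoder_out @ w_k      # keys come from the encoder output
    v = encoder_out @ w_v      # values come from the encoder output
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (target_len, source_len)
    return weights @ v, weights

rng = np.random.default_rng(0)
# Random stand-ins for the learned projection matrices.
w_q, w_k, w_v = (rng.normal(scale=0.02, size=(D_MODEL, D_MODEL)) for _ in range(3))
encoder_out = rng.normal(size=(5, D_MODEL))   # 5 source tokens
decoder_x = rng.normal(size=(2, D_MODEL))     # 2 tokens generated so far
out, weights = cross_attention(decoder_x, encoder_out, w_q, w_k, w_v)
print(out.shape, weights.shape)   # (2, 512) (2, 5)
```

The attention matrix here is 2×5 rather than square, and no causal mask is needed because every decoder position is allowed to look at the entire source sentence.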

L-9 Transformer Decoder Explained Step-by-Step | Masked Attention & Cross Attention
