LLMs Learning Path
Fundamentals
From Seq-to-Seq and RNNs to Attention and Transformers
Attention In LLMs
- Self-Attention: Computes attention using queries, keys, and values that all come from the same block (encoder or decoder); see the sketch after this list.
- Cross-Attention: Used in encoder-decoder architectures, where the queries come from the decoder and the key-value pairs come from the encoder outputs.
- Sparse Attention: To speed up self-attention, sparse attention restricts each token to a limited set of positions (e.g., a sliding window) rather than the full sequence, reducing the quadratic cost.
- Flash Attention: To speed up attention computation on GPUs, FlashAttention uses input tiling to minimize memory reads and writes between the GPU's high-bandwidth memory (HBM) and on-chip SRAM.
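A minimal PyTorch sketch of single-head scaled dot-product self-attention, to make the query/key/value roles above concrete. The projection weights here are random placeholders, not a trained model:

```python
# Minimal sketch of single-head scaled dot-product self-attention.
# Queries, keys, and values are all projections of the same input x.
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    q = x @ w_q                                                # (batch, seq_len, d_k)
    k = x @ w_k                                                # (batch, seq_len, d_k)
    v = x @ w_v                                                # (batch, seq_len, d_v)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                        # attention weights per query
    return weights @ v                                         # (batch, seq_len, d_v)

# Example usage with random inputs and weights.
batch, seq_len, d_model = 2, 8, 16
x = torch.randn(batch, seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([2, 8, 16])
```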
NLP Fundamentals
Tokenization
- WordPiece
- Byte Pair Encoding (BPE); see the sketch after this list
- UnigramLM
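To make the BPE item concrete, here is a minimal sketch of the BPE merge loop on the toy corpus from the original BPE paper. Production tokenizers (e.g., HF tokenizers) add byte-level fallback, special tokens, and much faster data structures:

```python
# Minimal sketch of BPE: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def get_pair_counts(vocab):
    # vocab maps a word (tuple of symbols) to its corpus frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    # Replace every occurrence of the chosen symbol pair with a merged symbol.
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Toy corpus: words split into characters, with an end-of-word marker.
vocab = {tuple("low") + ("</w>",): 5, tuple("lower") + ("</w>",): 2,
         tuple("newest") + ("</w>",): 6, tuple("widest") + ("</w>",): 3}

for _ in range(5):  # learn 5 merges
    pair = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(vocab, pair)
    print("merged:", pair)
```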
Encoding Positions
- ALiBi
- RoPE (Rotary Position Embedding); see the sketch after this list
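A minimal sketch of RoPE applied to a single attention head, assuming the "rotate-half" pairing convention used by several open implementations. The rotation angle depends only on position and feature index, so the dot product between rotated queries and keys depends on their relative position:

```python
# Minimal sketch of Rotary Position Embedding (RoPE): each pair of feature
# dimensions of a query/key vector is rotated by an angle that grows with position.
import torch

def rope(x, base=10000.0):
    # x: (seq_len, d) with d even; returns x with rotary position encoding applied.
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)           # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]       # pair feature i with feature i + half
    return torch.cat([x1 * cos - x2 * sin,  # rotate each (x1, x2) pair by its angle
                      x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 16)   # 8 positions, 16-dim head
print(rope(q).shape)     # torch.Size([8, 16])
```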
Language Modeling and LLMs
- Full Language Modeling: Causal next-token prediction over the whole sequence (see the loss sketch after this list).
- Prefix Language Modeling: Conditions on a prefix, attended bidirectionally, and predicts only the continuation.
- Masked Language Modeling: Predicts randomly masked tokens from the full bidirectional context.
- Unified Language Modeling: Combines causal, prefix, and masked objectives by varying the attention mask.
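The objectives above differ mainly in which tokens are predicted and how attention is masked. A minimal sketch of the full (causal) objective, using random stand-in tensors in place of model outputs and input ids:

```python
# Minimal sketch of the full (causal) language-modeling objective:
# predict token t+1 from tokens <= t, i.e., cross-entropy on shifted labels.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 100, 12, 2
logits = torch.randn(batch, seq_len, vocab_size)          # stand-in for model outputs
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for input ids

shift_logits = logits[:, :-1, :]   # predictions at positions 0..T-2
shift_labels = tokens[:, 1:]       # targets are the next tokens 1..T-1
loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1))
print(loss.item())

# Prefix LM differs only in the masking: the prefix attends bidirectionally and is
# excluded from the loss; masked LM instead predicts randomly masked positions.
```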
HF Transformers, PyTorch
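A minimal sketch of the HF Transformers workflow (load tokenizer, load model, generate), using "gpt2" only as a small example checkpoint:

```python
# Minimal sketch: load a causal LM with HF Transformers and generate text.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Attention is all", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```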