Intermediate Course

Transformer Architecture Intermediate

49 lessons across 7 chapters. Every lesson is standalone, so you can start anywhere.

1. Scaling Laws (7 lessons)

What scaling laws discovered

Compute, data, parameters: the three variables

The Chinchilla law: optimal training ratio

Why bigger is not always better

Emergent capabilities at scale

Return on investment of scale

Limits of scaling

2. Modern Architecture Improvements (7 lessons)

Grouped Query Attention: fewer KV heads

Sliding window attention: local context

RoPE: rotary positional embeddings

ALiBi: Attention with Linear Biases

Flash Attention: memory-efficient attention

Why these changes matter for long context

The compute-memory tradeoff

3. Mixture of Experts Architecture (7 lessons)

What MoE is: routing to specialized sub-networks

Expert layers: structure and size

Router mechanism: token routing

Top-K routing: selecting active experts

Total vs active parameters

Why MoE achieves quality at lower compute

MoE models: Mixtral, Llama 4, Qwen3

4. Context Window Architecture (7 lessons)

How the context window is determined

Quadratic attention cost: the scaling wall

Techniques to extend context: RoPE scaling

KV cache: what it stores

KV cache memory requirements

Why 1M+ context is architecturally hard

Practical vs theoretical context limits

5. Training vs Inference Architecture (7 lessons)

Training: all tokens processed in parallel

Autoregressive inference: one token at a time

KV cache: avoiding recomputation

Prefill vs decode phases

Batch effects on inference throughput

Why inference is memory-bandwidth bound

Speculative decoding intuition

6. Architecture Comparisons (7 lessons)

GPT architecture: decoder stack

BERT architecture: encoder stack

T5 architecture: encoder-decoder

Llama architecture: modern decoder

Mistral architectural innovations

What makes architectures converge

Architecture choices that survive scaling

7. Common Misconceptions (7 lessons)

Transformers do not have memory: correction

Bigger context equals better understanding: nuance

Attention equals the model understands: correction

Parameters equal intelligence: nuance

All tokens treated equally: correction

Training equals memorization: nuance

Architecture vs training data: what matters more