AI Terms You Need to Know: The Complete 2026 Glossary (200+ Definitions)

The AI space moves faster than any glossary can keep up. One week it's agents, the next it's MoE, extended thinking, and speculative decoding. If you've been nodding along in meetings without actually knowing what half these words mean, this is the guide you need.

I pulled research from across the entire AI ecosystem (foundational machine learning, large language models, agentic systems, generative AI, safety research, and infrastructure) and compiled every term worth knowing in 2026 into one comprehensive reference. Every definition is written in plain English. No padding, no hedging, no vague summaries.

Bookmark this. Come back whenever someone drops a term you don't recognize.

Key Takeaways

AI is a hierarchy: Artificial Intelligence contains Machine Learning, which contains Deep Learning, which contains specific architectures like Transformers and LLMs.

Large Language Models are trained in two phases: pre-training on massive text corpora, then fine-tuning (often with RLHF) to align behavior with human preferences.

Agentic AI systems go beyond chatbots: they plan, use tools, maintain memory, and execute multi-step tasks autonomously, using protocols and frameworks like MCP and LangGraph.

Generative AI covers image (diffusion models), video (Sora, Kling), audio (ElevenLabs, Whisper), and code (GitHub Copilot), all powered by different but related architectures.

AI safety terms like hallucination, reward hacking, and sycophancy describe specific failure modes that alignment techniques like RLHF, DPO, and Constitutional AI aim to prevent.

The 2026 frontier is defined by three concepts: reasoning models (test-time compute), Mixture of Experts (sparse activation), and speculative decoding (multi-token generation).

Part 1: Foundational AI and Machine Learning

These are the terms that everything else builds on. If you're fuzzy on any of these, the rest of the glossary won't make sense.

Core Concepts

Artificial Intelligence (AI): The broad field of computer science focused on building machines that perform tasks normally requiring human intelligence: understanding language, recognizing images, making decisions, solving problems. AI is the umbrella; everything else in this glossary lives under it.

Machine Learning (ML): A subset of AI where systems learn from data to improve performance on a task without being explicitly programmed for every situation. Instead of rules, you feed examples. The system finds patterns on its own. It powers everything from spam filters to recommendation engines.

Deep Learning (DL): A subset of ML that uses neural networks with many layers to automatically learn increasingly abstract representations of data. The "deep" refers to the number of layers: early ones detect simple patterns like edges; later ones detect complex structures like faces or full sentences. Deep learning drove the AI breakthroughs of the last decade.

Neural Network: A computational system loosely inspired by the human brain, made of interconnected nodes (artificial neurons) organized into layers. Information flows forward through the layers, each learning to detect more complex features. Neural networks are the foundation of virtually all modern AI.

Training: Feeding data through an algorithm so the model iteratively adjusts its internal parameters to minimize prediction errors. Training is computationally intensive and typically requires large datasets and significant hardware.

Inference: Using a trained model to make predictions on new, unseen data. This is the deployment phase. When a chatbot responds to your question, that's inference happening in real time.

Algorithm: A step-by-step set of rules a computer follows to solve a problem. In ML, algorithms are the procedures used to train a model. The choice of algorithm affects how well and how quickly a model learns.

Model: The mathematical structure produced by training: the artifact that captures learned patterns and makes predictions on new inputs. Think of it as the "brain" that results from training. Models range from a simple equation to a massive neural network with hundreds of billions of parameters.

Types of Machine Learning

Supervised Learning: The most common form of ML. The model trains on labeled data, where each input is paired with the correct output. The model minimizes the difference between its predictions and the known labels. Classic examples: image classifiers, spam detectors.

Unsupervised Learning: Training on unlabeled data. The model discovers hidden structure or patterns on its own, for tasks such as clustering, compression, and anomaly detection. No correct answers are provided.

Reinforcement Learning (RL): An agent learns by interacting with an environment and receiving rewards or penalties. The agent learns a policy, a strategy for choosing actions, that maximizes cumulative reward. RL powers game-playing AIs, robotics, and the RLHF fine-tuning behind modern LLMs.

Transfer Learning: Taking a model pre-trained on a large dataset and fine-tuning it on a smaller, task-specific dataset. Rather than training from scratch, you leverage representations the model already learned. Nearly all modern AI uses transfer learning in some form.

Few-shot Learning: The model's ability to learn a new task from only a handful of examples, typically 2–10. Large language models demonstrate strong few-shot capabilities when you give them examples inside the prompt.

Zero-shot Learning: The ability to correctly handle tasks the model has never explicitly seen in training. Modern LLMs exhibit zero-shot capabilities through instruction tuning and broad pre-training.

Self-supervised Learning: The model creates its own training signal from raw data, for example by masking words and predicting them (as BERT does) or by predicting the next token (as GPT does). This is the core technique behind pre-training LLMs.

Neural Network Components

Neuron (Node): The basic unit of a neural network. It receives inputs, multiplies each by a weight, sums them, adds a bias term, and passes the result through an activation function. Billions of connected neurons give neural networks their power.
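The arithmetic in that definition fits in a few lines. Here is a minimal sketch of a single neuron, assuming a sigmoid activation; the function names and example values are mine, chosen only for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs: list[float], weights: list[float], bias: float) -> float:
    """Multiply each input by its weight, sum, add the bias, then activate."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

# Illustrative values: 1.0*0.5 + 2.0*(-0.25) + 0.0 = 0.0, and sigmoid(0) = 0.5
output = neuron([1.0, 2.0], [0.5, -0.25], bias=0.0)
print(output)
```

A real network stacks many of these and computes them as matrix operations, but each unit does the same thing.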

Weights: The learnable numerical parameters that determine how strongly one neuron influences another. After training, weights encode everything the model has learned.

Bias: An additional learnable parameter added to a neuron's calculation. It lets neurons fire even when all inputs are zero, giving the model flexibility to shift activation functions.

Activation Function: A mathematical function that introduces non-linearity into the network, enabling it to learn complex patterns. Without activation functions, a neural network is just a large linear equation. Common functions include ReLU, GELU, and sigmoid.

Backpropagation: The algorithm for training neural networks. It computes how much each weight contributed to the prediction error by passing the error signal backward through the network, so the weights can be adjusted accordingly.

Gradient Descent: The optimization algorithm that adjusts weights by moving in the direction that most steeply reduces the loss. Like finding the lowest point of a hilly landscape by always stepping downhill.
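To make the "stepping downhill" picture concrete, here is a minimal sketch on a toy one-parameter loss, (w - 3)^2, whose lowest point is at w = 3. The loss, starting point, and learning rate are assumptions picked for illustration, not values from any real model.

```python
def loss(w: float) -> float:
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def grad(w: float) -> float:
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w: float, learning_rate: float = 0.1, steps: int = 100) -> float:
    """Repeatedly step against the gradient, i.e. downhill."""
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

w_final = gradient_descent(0.0)
print(w_final, loss(w_final))  # w_final ends up very close to 3
```

Real training does the same thing over millions or billions of weights at once, with gradients supplied by backpropagation.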

Loss Function: Measures how far the model's predictions are from correct answers. Training aims to minimize this number. Common examples: mean squared error for regression, cross-entropy for classification.
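Both of those examples are short formulas. The sketch below implements them for plain Python lists; the function names and sample numbers are illustrative, and the cross-entropy shown is the single-example form (negative log of the probability given to the correct class).

```python
import math

def mean_squared_error(predictions: list[float], targets: list[float]) -> float:
    """Average of squared differences; used for regression."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(predicted_probs: list[float], true_class: int) -> float:
    """Negative log-probability assigned to the correct class."""
    return -math.log(predicted_probs[true_class])

mse = mean_squared_error([1.0, 2.0], [1.0, 4.0])          # (0 + 4) / 2 = 2.0
ce_confident = cross_entropy([0.9, 0.1], true_class=0)    # small loss
ce_wrong = cross_entropy([0.1, 0.9], true_class=0)        # large loss
print(mse, ce_confident, ce_wrong)
```

Notice that cross-entropy punishes a confident wrong answer far more than a confident right one rewards it.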

Epoch: One complete pass through the entire training dataset. Models typically require many epochs to converge.

Batch (Mini-batch): A small subset of training data processed together before weights are updated. Common batch sizes are 32, 64, or 128 samples.

Learning Rate: Controls the size of each weight update step. Too high causes overshooting and divergence; too low makes training painfully slow. Modern training uses schedules that dynamically adjust the rate.

Dropout: A regularization technique that randomly switches off neurons during each training step, preventing the network from becoming too dependent on any particular neuron. At inference, all neurons are active.

Overfitting: The model learns training data too well, including its noise, and performs poorly on new data. It has memorized rather than generalized.

Underfitting: The model is too simple to capture patterns in the data and performs poorly on both training and new data.

Architecture Types

Transformer: The neural network architecture introduced in 2017 in "Attention Is All You Need." It replaced recurrence with self-attention as the core mechanism. Nearly all modern LLMs, including GPT, Claude, Gemini, and Llama, are Transformers.

Self-Attention: Each token in the input computes how much it should "pay attention" to every other token in the same sequence. This captures long-range relationships without the sequential bottleneck of older architectures.
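Here is a stripped-down sketch of that computation in pure Python. It is a simplification: a real Transformer first projects each token into separate query, key, and value vectors with learned weight matrices, whereas this version uses the raw token vectors for all three so the core idea stays visible. The toy two-dimensional "tokens" are made up.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens: list[list[float]]) -> list[list[float]]:
    """Scaled dot-product attention with queries = keys = values = tokens."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        # How strongly does this token match every token in the sequence?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # The output is a weighted average of all token vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return outputs

out = self_attention([[1.0, 0.0], [0.0, 1.0]])
print(out)
```

Each output row is a blend of every input row, weighted by similarity, which is how a token "sees" the whole sequence in one step.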

Multi-head Attention: Runs several self-attention operations in parallel, each learning to focus on different aspects of the input. Multiple heads dramatically enrich what the model learns.

CNN (Convolutional Neural Network): Designed for image and video data. Its convolutional layer slides a small filter across the input to detect local patterns like edges and textures regardless of position in the image.

LSTM (Long Short-Term Memory): A specialized RNN architecture that mitigates the vanishing gradient problem through gated memory cells, enabling networks to learn dependencies across long sequences. LSTMs were state of the art for many NLP tasks before Transformers displaced them.

Part 2: Large Language Model (LLM) Terms

Core LLM Concepts

Large Language Model (LLM): A Transformer-based neural network trained on massive text data to predict the next token and perform language tasks such as summarization, translation, question answering, and code generation. Modern examples: GPT series, Claude, Gemini, Llama.

Foundation Model: A large-scale model pre-trained on general-purpose data that serves as the base for many downstream applications. Fine-tuning a foundation model dramatically reduces data and compute requirements for specific tasks.

Pre-training: The initial large-scale training phase where the model learns general language representations by processing hundreds of billions of tokens from diverse text sources. The most computationally expensive phase. Produces the foundation model.

Fine-tuning: Continuing to train a pre-trained model on a smaller, task-specific dataset to improve performance for a particular use case. It updates model weights to specialize the model without the cost of pre-training from scratch.

RLHF (Reinforcement Learning from Human Feedback): A three-stage alignment technique: supervised fine-tuning on human examples, training a reward model on human preference rankings, and then optimizing the LLM with reinforcement learning to maximize the reward model's score. The foundational alignment method behind most chat assistants.

Constitutional AI (CAI): Anthropic's alignment technique that gives the model a written set of principles, a "constitution", rather than relying exclusively on human labelers. The model critiques and revises its own outputs against those principles. RLAIF (Reinforcement Learning from AI Feedback) extends this by using an AI judge instead of humans to generate preference labels.

Instruction Tuning: Fine-tuning on (instruction, response) pairs to teach a base model to follow natural language directions precisely. Instruction-tuned models (labeled "Instruct" or "Chat" variants) are far more useful for end-users than raw base models.

LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning method that keeps original model weights frozen and inserts small trainable matrices into attention layers. Only these adapter matrices update during training, cutting GPU requirements by up to 90%. LoRA adapters can be merged into the base model for zero-overhead inference.
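A quick back-of-the-envelope calculation shows why the adapter matrices are so cheap. The sketch below counts trainable parameters for one weight matrix; the dimensions (4096 x 4096) and rank (8) are illustrative assumptions, not figures for any particular model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in one full weight matrix W of shape (d_out, d_in)."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)

d_out, d_in, rank = 4096, 4096, 8   # assumed sizes for illustration
full = full_params(d_out, d_in)      # 16,777,216
lora = lora_params(d_out, d_in, rank)  # 65,536
print(full, lora, lora / full)       # the adapter is under 0.4% of the matrix
```

The frozen matrix W is still needed for the forward pass; only the small factors receive gradients and optimizer state, which is where the savings come from.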

QLoRA (Quantized LoRA): Even more memory-efficient than LoRA. It quantizes base model weights to 4-bit precision first, then applies LoRA adapters. QLoRA makes fine-tuning very large models feasible on a single GPU; the original paper fine-tuned a 65B model on one 48 GB GPU.

Architecture Specifics

Token: The basic unit of text an LLM processes. Tokens are subword pieces, not always whole words. Roughly 1 token equals 0.75 English words. Models have a maximum token budget for both input and output combined.

Tokenization: Converting raw text into a sequence of tokens (integers) the model can process. Modern LLMs use Byte-Pair Encoding (BPE) or SentencePiece. The tokenizer must match the model it was trained with.

Context Window / Context Length: The maximum total tokens, input plus output, a model can process in one call. Everything outside this window is invisible to the model. In 2026, Llama 4 Scout supports 10 million tokens; Gemini models reach 1 million.

Embedding: A dense numerical vector representing a piece of text that captures semantic meaning. Similar meanings cluster together in embedding space. Embeddings power semantic search, clustering, and RAG.
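"Similar meanings cluster together" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors; real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 means same direction, 0.0 means unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, invented for this example
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.0]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1
print(cosine_similarity(cat, invoice))  # close to 0
```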

KV Cache (Key-Value Cache): An inference optimization that stores the key and value vectors computed during self-attention for already-processed tokens, so they don't need recomputation for each new token. Without it, generating each new token would mean recomputing keys and values for the entire preceding sequence, so total generation cost would grow much faster with sequence length.

Latent Space: The high-dimensional mathematical space where a model's internal representations exist. Semantically related concepts cluster together. The latent space encodes the model's learned understanding of language and concepts.

Inference Parameters

Temperature: A scalar (typically 0.0–2.0) controlling output randomness. Near 0 makes the model nearly deterministic. Higher values encourage more creative but less predictable responses. Values around 0.7–1.0 are common for general use.
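Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities. The sketch below shows the effect on three made-up logits. A temperature of exactly 0 would divide by zero, so this sketch rejects it; in practice that case is usually handled as "always pick the top token".

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                         # illustrative values
sharp = softmax_with_temperature(logits, 0.5)    # low temperature: top token dominates
flat = softmax_with_temperature(logits, 2.0)     # high temperature: flatter distribution
print(sharp, flat)
```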

Top-p (Nucleus Sampling): Samples from the smallest set of tokens whose cumulative probability reaches p (e.g., 0.9). It is adaptive: when the model is confident, the nucleus is small; when uncertain, it's larger. Generally preferred over top-k.

Top-k: Restricts selection to the k highest-probability tokens and redistributes probability among them before sampling. Simpler but less adaptive than top-p.
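Top-p and top-k are both filters applied to the token probabilities before sampling. The sketch below implements each on a tiny made-up distribution; the token names and probabilities are illustrative only.

```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_k_filter(probs, k=2))    # keeps a and b
print(top_p_filter(probs, p=0.9))  # keeps a, b, and c (0.5 + 0.3 + 0.15 reaches 0.9)
```

The difference shows up when the distribution changes shape: top-k always keeps exactly k tokens, while top-p keeps as few or as many as it takes to reach the threshold.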

Max Tokens: The hard upper limit on tokens the model generates in a single response. Once reached, generation stops regardless of completion.

Stop Sequences: Specific strings that, when generated, trigger an immediate halt to generation. Used to cleanly delimit outputs without relying on the model's sense of completion.

Prompt Engineering

Prompt: The natural-language input to an LLM to elicit a specific response. Ranges from a single question to a complex multi-part instruction. Crafting effective prompts is core to maximizing LLM performance without fine-tuning.

System Prompt: A special prompt, invisible to the end user, prepended by the developer to establish the model's persona, rules, output format, and constraints. The primary mechanism for customizing model behavior at the application layer.

Few-shot Prompting: Including worked examples (input-output pairs) in the prompt before the actual query. The model infers the pattern and applies it to new inputs without any weight updates.

Chain-of-Thought (CoT) Prompting: Instructing the model to reason step by step before answering. The phrase "Let's think step by step" or explicit reasoning examples dramatically improve accuracy on math, logic, and multi-step planning tasks.

Prompt Injection: A security attack where malicious text embedded in user input or retrieved documents overrides the developer's original instructions. The LLM can't distinguish trusted instructions from attacker-supplied content.

Jailbreak: A user-crafted prompt that tries to bypass a model's safety guardrails to elicit content the model would normally refuse. Common techniques include role-playing framings and multi-step manipulation.

Top Model Families in 2026

GPT (OpenAI): The flagship LLM series from OpenAI, beginning with GPT-1 (2018) through GPT-5. As of 2026, GPT-5.5 leads on coding benchmarks. The series popularized large-scale pre-training followed by fine-tuning.

Claude (Anthropic): Anthropic's LLM family designed with safety as a core principle. Claude Opus 4.8 leads SWE-Bench Verified at 87.6% and MRCR 1M long-context retrieval at 92.9%. Available in Haiku (fast), Sonnet (balanced), and Opus (most capable) tiers.

Gemini (Google DeepMind): Google's natively multimodal LLM trained on text, images, audio, and video simultaneously. Gemini 3.1 Pro leads long-context retrieval benchmarks. Gemini 3.5 Flash runs 4x faster than comparable frontier models.

Llama (Meta): Meta's open-weight LLM family, instrumental in building the open-source AI ecosystem. Llama 4 Scout supports a 10 million token context window. Available for download and local deployment.

Mistral (Mistral AI): A Paris-based AI startup producing highly capable models. It popularized Mixture-of-Experts architectures for open-weight models. Mixtral 8x7B activates only a fraction of parameters per token, delivering strong performance at lower inference cost.

DeepSeek (DeepSeek AI): Chinese AI lab whose models achieve frontier performance at dramatically lower cost. DeepSeek reported a final training-run cost of roughly $6 million for DeepSeek-V3, the base model behind DeepSeek-R1, a fraction of the reported cost of comparable US models, and R1 performed comparably to OpenAI's o1 on math reasoning benchmarks. DeepSeek V4 Pro is a 1.6 trillion-parameter MoE model released under MIT license.

Part 3: AI Agent and Agentic AI Terms

Agent Concepts

AI Agent: A software system powered by an LLM that can perceive its environment, reason about a goal, and take actions autonomously to complete multi-step tasks. Unlike a chatbot that responds to one input, an agent operates in a loop (observe, think, act) until the objective is achieved.

Agentic AI: The broader category of AI systems designed to function as independent, goal-directed agents that plan, reason, and act over extended horizons. The system understands your goals, reasons through which actions to take, and executes across multiple steps or sessions.

ReAct (Reasoning + Acting): An agent architecture that interleaves reasoning traces with tool actions in a repeating loop. At each step: generate a thought, take an action (call a tool), observe the result, repeat. Prevents agents from acting impulsively and enables mid-course correction.
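The thought, action, observation cycle can be sketched without any real LLM. In the example below, scripted_model is a hard-coded stand-in for the model and calculator is a hypothetical tool; both are assumptions made so the loop itself is runnable.

```python
def calculator(expression: str) -> str:
    """Hypothetical tool: adds two integers written as 'a+b'."""
    a, b = expression.split("+")
    return str(int(a) + int(b))

TOOLS = {"calculator": calculator}

def scripted_model(history: list[tuple[str, str]]) -> tuple[str, str, str]:
    """Stand-in for an LLM: returns (thought, action, argument)."""
    observations = [text for kind, text in history if kind == "observation"]
    if not observations:
        return ("I need to compute the sum.", "calculator", "2+3")
    return ("I have the result.", "finish", observations[-1])

def react_loop(model, tools, max_steps: int = 5):
    """Alternate thought -> action -> observation until the model finishes."""
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        thought, action, argument = model(history)
        history.append(("thought", thought))
        if action == "finish":
            return argument, history
        observation = tools[action](argument)   # run the chosen tool
        history.append(("action", f"{action}({argument})"))
        history.append(("observation", observation))
    return None, history                         # gave up after max_steps

answer, trace = react_loop(scripted_model, TOOLS)
print(answer)
for kind, text in trace:
    print(kind, "->", text)
```

The max_steps cap is a common safety valve so a confused agent cannot loop forever.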

Multi-Agent System: An architecture where multiple AI agents collaborate to accomplish tasks too large or complex for a single agent. Agents can specialize in different domains, run in parallel, or follow a hierarchy where an orchestrating agent delegates to others.

Agent Orchestration: The coordination layer that manages how multiple agents are invoked, in what order, what data passes between them, and how their outputs combine into a coherent result. Frameworks like LangGraph, CrewAI, and AutoGen implement their own orchestration models.

Supervisor Agent: The top-level agent in a hierarchical multi-agent system that coordinates the overall workflow, decides which sub-agents to invoke, passes them context, and synthesizes their outputs. It reasons about delegation rather than doing domain work directly.

Human-in-the-Loop (HITL): The design pattern where a human is integrated into an agent's workflow at predefined decision points, to validate, correct, or approve before the agent proceeds. The primary mechanism for human oversight in high-stakes agentic deployments.

Guardrails: Safety and quality constraints applied to an agent's inputs and outputs to prevent harmful or incorrect behavior. Input guardrails filter malicious requests; output guardrails check that responses meet format, accuracy, and safety requirements before delivery.

Memory Systems

Working Memory: The agent's active, in-context storage for the current task: everything held in mind right now. Bounded by the model's context window and wiped at the end of a session unless explicitly saved. Think of it as RAM: fast, immediate, and temporary.

Long-term Memory: Persistent storage that survives beyond a single session, allowing the agent to recall facts, preferences, and experiences from past interactions. Typically implemented as a vector database the agent retrieves from at the start of each new session.

Episodic Memory: Captures specific past experiences with temporal details: what happened, when, and in what context. Agents record past interactions, then periodically compress them into summaries stored in memory. This lets the agent remember context from weeks ago without re-reading full conversation logs.

Semantic Memory: Stores factual knowledge independent of specific experiences, such as user profiles, product specifications, and persistent preferences. Unlike episodic memory (event-based), semantic memory holds stable facts that remain true across many interactions.

Tools and Actions

Tool Use: The capability of an LLM-based agent to call external functions, APIs, or systems during reasoning to retrieve information or take actions it can't perform with text alone. Tool use transforms an LLM from a text predictor into an agent that interacts with the real world.

Function Calling: The API mechanism, pioneered by OpenAI, by which a model signals it wants to invoke a developer-defined function, returning a structured JSON object with the function name and arguments. The model produces the call specification; the application layer runs the actual function.
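The split between "model produces the call" and "application runs it" looks roughly like this. The JSON shape, the get_weather function, and the registry are hypothetical, invented for illustration; each vendor's actual schema differs in its details.

```python
import json

def get_weather(city: str) -> str:
    """Hypothetical developer-defined function."""
    return f"Sunny in {city}"

# Functions the application is willing to run on the model's behalf
REGISTRY = {"get_weather": get_weather}

# What a model might return instead of prose (illustrative shape only)
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'

def dispatch(raw_call: str) -> str:
    """Parse the model's call specification and run the matching function."""
    call = json.loads(raw_call)
    func = REGISTRY[call["name"]]      # raises KeyError for unknown functions
    return func(**call["arguments"])

result = dispatch(model_output)
print(result)
```

Looking functions up in an explicit registry, rather than executing whatever name the model emits, keeps the application in control of what can run.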

Code Interpreter: A sandboxed execution environment allowing an agent to write and run code, typically Python, as part of its reasoning, then observe the output. This enables precise calculations, data analysis, and algorithm execution that would be unreliable with pure language reasoning.

Browser Use: The capability for an agent to interact with web pages (visiting URLs, clicking elements, filling forms, extracting content) through a controlled browser environment. Unlike a web search tool that returns text snippets, browser use gives the agent full access to dynamic web applications.

Computer Use: An advanced capability where the model observes a computer screen via screenshots and controls the machine with mouse movements and keyboard input. Anthropic's Claude pioneered this pattern. It enables agents to use native desktop apps and legacy software without a dedicated API integration.

RAG and Knowledge

RAG (Retrieval-Augmented Generation): The foundational pattern for grounding LLM responses in external, up-to-date knowledge. It has three steps: retrieve relevant documents from a knowledge base, inject them into the model's context, and generate a response grounded in that evidence. RAG reduces hallucination and knowledge staleness but does not eliminate them.

Vector Database: Stores documents as high-dimensional numerical vectors and optimizes for nearest-neighbor search, finding semantically similar items to a query. Examples include Pinecone, Weaviate, Qdrant, and pgvector.

Semantic Search: Retrieves documents based on meaning and intent rather than keyword matching. Uses embedding similarity to surface relevant results even when the user phrases a question differently from how the information is written.

Hybrid Search: Combines sparse retrieval (BM25 keyword matching) with dense retrieval (embedding-based semantic search). Can improve recall by a reported 15–30% over either method alone, depending on the corpus. Results are typically fused using Reciprocal Rank Fusion before being passed to a re-ranker.
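Reciprocal Rank Fusion itself is a one-line formula: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The sketch below assumes the commonly used constant k = 60; the document IDs and both rankings are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists into one, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_a", "doc_b", "doc_c"]    # hypothetical keyword ranking
dense_results = ["doc_b", "doc_d", "doc_a"]   # hypothetical embedding ranking
fused = reciprocal_rank_fusion([bm25_results, dense_results])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because it uses only ranks, not raw scores, RRF sidesteps the problem that BM25 scores and cosine similarities live on different scales.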

Chunking: Splitting source documents into smaller segments before embedding them into a vector database. Chunk size and strategy profoundly affect retrieval quality. Common strategies: fixed-size, sentence-level, and recursive splitting that respects document structure.
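The simplest of those strategies, fixed-size chunking, is sketched below with an overlap so that sentences cut at a boundary still appear whole in one chunk. Splitting on whitespace and the tiny sizes are simplifications for illustration; production pipelines usually count tokens and use chunks of hundreds of tokens.

```python
def chunk_words(text: str, chunk_size: int = 5, overlap: int = 2) -> list[str]:
    """Split text into chunks of chunk_size words, sharing `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks

chunks = chunk_words("one two three four five six seven eight")
print(chunks)  # ['one two three four five', 'four five six seven eight']
```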

Re-ranking: A second-stage retrieval step where a more powerful cross-encoder model scores and reorders the top candidate documents retrieved in the first pass. The first pass is fast and approximate; re-ranking is slower and precise, typically applied to the top 20–50 candidates.

HyDE (Hypothetical Document Embeddings): A retrieval enhancement where the model first generates a hypothetical answer to a query, even a fabricated one, then uses that hypothetical document's embedding to query the vector database. Because the hypothetical document resembles actual documents more than the raw query does, retrieval relevance improves.

Frameworks and Protocols

MCP (Model Context Protocol): An open standard introduced by Anthropic in November 2024 that defines how AI systems connect to external tools, data sources, and services in a standardized way. Replaces one-off API integrations with a universal protocol. As of 2026, MCP is supported by Anthropic, OpenAI, and Google DeepMind, with over 500 public MCP servers available.

LangGraph: LangChain's production-grade agent framework that models multi-step agent workflows as directed graphs: nodes represent steps, edges represent transitions. Its typed state machine approach enables precise control over state, branching, loops, and human-in-the-loop interruptions.

CrewAI: A multi-agent framework built around a role-based organizational metaphor. You define agents as crew members with specific roles and goals, assign them tasks, and let the crew coordinate. It grew from 2,800 to 31,200 GitHub stars between January 2024 and April 2026.

AutoGen: Microsoft Research's multi-agent framework focused on conversational collaboration between agents, where agents communicate via natural language messages to coordinate complex tasks. Particularly strong for research scenarios and code generation pipelines.

Part 4: Generative AI Terms

Image Generation

Diffusion Model: A generative AI that starts with random noise and progressively denoises it over many steps until a coherent image emerges. The model learns the reverse of a forward process that gradually adds noise. Diffusion models have largely replaced GANs as the dominant family for high-quality image synthesis.

Stable Diffusion: An open-source text-to-image model built on the Latent Diffusion Model architecture by Stability AI. It compresses images into latent space before running diffusion, making it fast enough for consumer GPUs. The foundation for most custom model ecosystems (LoRAs, ControlNets, fine-tunes).

DALL-E: OpenAI's family of text-to-image models. DALL-E 3 was known for outstanding prompt adherence; in 2025, OpenAI replaced it with GPT Image 2, a natively multimodal model that generates images within the same model handling text.

Flux: A family of text-to-image models built by Black Forest Labs using a Rectified Flow Transformer architecture instead of the traditional U-Net. Significantly improves prompt following and image quality. Available in Schnell (speed) and Dev (quality) variants.

GAN (Generative Adversarial Network): Two neural networks trained in opposition: a Generator creates fake images and a Discriminator tries to detect them. The competitive loop pushes both to improve. GANs were the dominant image synthesis method before diffusion models surpassed them.

ControlNet: An add-on architecture for diffusion models that conditions generation on structured visual inputs such as edge maps, pose skeletons, or depth maps, giving the model precise spatial constraints alongside text prompts.

Inpainting: Filling in a specific masked region of an existing image using an AI model, guided by a text prompt and surrounding context. Used for removing objects, fixing corrupted areas, or replacing elements while keeping the rest intact.

Negative Prompt: Text input specifying what a diffusion model should avoid in the generated image. While the positive prompt describes desired content, the negative prompt lists unwanted elements. A powerful and commonly underutilized tool.

CFG Scale (Classifier-Free Guidance Scale): Controls how strictly the generation follows the text prompt versus how much creative freedom the model takes. Higher values (12–15) produce prompt-adherent but potentially oversaturated images. Values of 7–9 suit most use cases.
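Under the hood, the scale is a multiplier in a simple formula: the model makes one prediction with the prompt and one without, and the final prediction is pushed away from the unconditional one by the scale factor. The sketch below applies that formula to tiny made-up vectors standing in for the model's noise predictions.

```python
def apply_cfg(uncond: list[float], cond: list[float], scale: float) -> list[float]:
    """guided = uncond + scale * (cond - uncond), element by element."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond_pred = [0.0, 1.0]   # illustrative prediction without the prompt
cond_pred = [1.0, 1.0]     # illustrative prediction with the prompt

print(apply_cfg(uncond_pred, cond_pred, 1.0))  # [1.0, 1.0], just the conditional prediction
print(apply_cfg(uncond_pred, cond_pred, 7.5))  # [7.5, 1.0], the prompt's effect amplified
```

This also explains the oversaturation at high values: the difference between the two predictions is amplified well beyond anything the model saw in training.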

Seed: A number determining the starting random noise pattern for generation. Using the same seed with the same prompt, model, and settings reproduces the same image (barring hardware or library nondeterminism), which is essential for reproducibility and iteration.

Video Generation

Text-to-Video: A generative AI capability producing video clips from text descriptions. Models must generate both spatial content (what it looks like) and temporal content (how it moves). Most production systems use diffusion applied across spacetime patches.

Temporal Consistency: The property of a generated video where characters, objects, and environments remain visually stable from frame to frame. One of the hardest problems in video generation: without it, characters' faces or scene elements flicker between frames.

Sora: OpenAI's flagship text-to-video model, first announced in February 2024. Notable for producing long, physically coherent video clips using a diffusion transformer architecture. Sora 2 (2025–2026) competes against Google's Veo and Kling.

Kling: An AI video generation model by Kuaishou known for strong human motion and physical realism. Kling 3.0 introduced a Multi-Shot Storyboard feature: you define 3–12 shots with individual prompts and camera directions while maintaining character consistency across all shots.

Audio AI

Text-to-Speech (TTS): Converts written text into spoken audio. Modern TTS produces highly natural speech with accurate prosody, intonation, and emotional register. Powers AI assistants, video narration tools, and voice agents.

Speech-to-Text (STT): Transcribes spoken audio into written text. Also called Automatic Speech Recognition (ASR). Evaluated on word error rate, speed, and handling of accents and background noise.

Whisper: An open-source speech recognition model from OpenAI trained on massive multilingual audio data. Highly accurate across languages and accents. Whisper Large V3 Turbo is the standard open-source baseline for STT tasks in 2026.

Voice Cloning: Creating a synthetic version of a specific person's voice from a sample of their speech. The model analyzes timbre, rhythm, pitch patterns, and accent, then reproduces them for new text.

ElevenLabs: The leading AI voice company for TTS, voice cloning, and conversational voice agents. Widely considered the category leader for expressive, realistic voice synthesis. Offers a library of thousands of pre-built voices and custom cloning from short audio samples.

Multimodal Concepts

Multimodal AI: An AI system that processes and generates content across more than one data modality (text, images, audio, and video) within a single unified model. Rather than chaining separate models, a multimodal AI learns joint representations. Examples: GPT-4o, Gemini, Claude.

Vision-Language Model (VLM): Combines a computer vision component with a language model, allowing it to take images as input and reason about them in natural language. Powers image captioning, visual question answering, document analysis, and image-guided generation.

Visual Question Answering (VQA): A task where the model receives an image and a natural language question and must generate a correct textual answer. Requires the model to ground visual content to language concepts and reason about them.

Document AI: AI systems that extract, understand, and process information from structured documents such as PDFs, invoices, contracts, and forms. Combines OCR, layout analysis, and language understanding to parse tables, detect fields, and extract named entities.

Part 5: AI Safety and Ethics Terms

Failure Modes

Hallucination: A model output that is factually false, unsupported by evidence, or inconsistent with context, yet stated with apparent confidence. Occurs because language models predict plausible text, not verified facts. Especially dangerous in medicine, law, and journalism.

Confabulation: A more precise term for hallucination, describing when the AI fills gaps in its knowledge with invented but plausible-sounding content, analogous to the neurological phenomenon in memory disorders. Preferred by researchers who note "hallucination" implies the AI has sensory experiences, which it does not.

Sycophancy: The tendency of an AI model to tell users what they want to hear rather than what is accurate. A sycophantic model may change its stated position when a user pushes back, agree with incorrect assertions, or validate harmful ideas. A recognized failure mode of RLHF where raters may reward agreement over accuracy.

Reward Hacking: The model learns to achieve high scores on its reward metric through unintended means rather than by accomplishing the desired goal. It exploits the gap between what the reward metric measures and what designers actually want.

Catastrophic Forgetting: The tendency of a neural network to abruptly lose previously learned knowledge when trained on new data. In safety contexts, alignment properties established in earlier training can be erased when models are fine-tuned for new capabilities.

Distribution Shift: The phenomenon where real-world data at deployment differs statistically from training data. A model may behave safely during training but unsafely in production where conditions are different. One of the primary reasons models fail in the real world.

Alignment Methods

Value Alignment: Ensuring an AI system's goals, values, and behaviors match those of its designers or humanity. The core difficulty: human values are complex, context-dependent, and often contradictory, making formal specification extremely hard.

DPO (Direct Preference Optimization): A streamlined alternative to RLHF that eliminates the separate reward model and RL loop. It directly adjusts model parameters so higher probability is assigned to human-preferred responses. Simpler, more computationally efficient, and less prone to instability than PPO-based RLHF.
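
To make "directly adjusts model parameters" concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed token log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under both the model being trained and a frozen reference model. The function name and the default beta of 0.1 are illustrative choices, not fixed by the method.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed log-probabilities."""
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # Numerically stable form of -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# Policy identical to the reference: loss is log(2), no preference learned yet.
baseline_loss = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has shifted probability toward the chosen response: loss drops.
improved_loss = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

Minimizing this loss by gradient descent pushes the chosen response's probability up relative to the rejected one, with no reward model or RL loop involved.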

Reward Modeling: Training a separate neural network to predict which AI outputs humans would prefer, using ranked human preference data. The reward model acts as a proxy for human judgment and produces a numerical score used to guide reinforcement learning.

RLAIF (Reinforcement Learning from AI Feedback): A scalable variant of RLHF that substitutes a powerful AI model for human annotators in generating preference labels. Significantly cheaper than human-annotated RLHF and has achieved comparable performance on several benchmarks.

Red Teaming: A structured process where a dedicated team actively tries to find vulnerabilities and harmful behaviors in an AI system by attacking it with adversarial inputs, edge cases, and creative misuse scenarios. Standard pre-deployment safety practice at major AI labs.

Bias and Explainability

AI Bias: The systematic tendency of an AI to produce unfair or skewed outputs toward or against certain groups. Bias originates in training data, model architecture, optimization objectives, or deployment context, and often reflects pre-existing societal inequalities.

Explainable AI (XAI): A field of research aimed at making AI decisions understandable to humans. XAI methods generate explanations describing why a model produced a particular output. Increasingly required by regulation (EU AI Act) and central to building trust in high-stakes applications.

SHAP (SHapley Additive exPlanations): A method grounded in game theory that assigns each input feature a value representing its contribution to a specific model prediction. SHAP values are consistent and locally accurate, distributing credit fairly among features.
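
The "fair credit" idea can be shown with an exact Shapley computation on a tiny model. This is a sketch under simplifying assumptions: the toy model, the all-zeros baseline, and the choice to represent a "missing" feature by its baseline value are all illustrative. Production SHAP tooling uses approximations and background datasets rather than enumerating every ordering.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values by averaging marginal gains over all feature orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)           # start with every feature "absent"
        previous = predict(current)
        for i in order:
            current[i] = instance[i]       # reveal feature i
            updated = predict(current)
            totals[i] += updated - previous
            previous = updated
    return [t / len(orderings) for t in totals]

def toy_model(x):
    # Hypothetical model: one additive term plus one interaction term.
    return 2.0 * x[0] + x[1] * x[2]

instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
values = shapley_values(toy_model, instance, baseline)
```

Feature 0 receives its full additive contribution, while features 1 and 2 split the interaction term equally. The values sum to the difference between the prediction for the instance and the prediction for the baseline, which is the local accuracy property.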

Model Cards: Standardized documentation sheets for trained ML models that describe intended use cases, performance across demographic groups, limitations, training data, and evaluation methodology. They promote transparency and accountability for downstream users and regulators.

EU AI Act: The world's first comprehensive legal framework regulating artificial intelligence, enacted by the European Union (Regulation 2024/1689). It classifies AI systems by risk level, from minimal to unacceptable, and imposes requirements accordingly. As of 2026, its transparency rules require disclosure when users interact with AI or consume AI-generated content.

Part 6: AI Infrastructure, Deployment, and Emerging Terms

Compute and Infrastructure

GPU (Graphics Processing Unit): Originally designed for graphics, GPUs are now the dominant chip for AI training and inference. NVIDIA GPUs contain thousands of parallel CUDA cores plus specialized Tensor Cores for matrix math, paired with high-bandwidth memory. NVIDIA's H100 and B200 series are the industry standard.

TPU (Tensor Processing Unit): Google's custom chip designed specifically for tensor (matrix) operations at the core of neural networks. Extremely efficient for large-scale training and inference but only available through Google Cloud.

NPU (Neural Processing Unit): A dedicated chip block integrated into consumer processors (in laptops, smartphones, and tablets) to run lightweight AI inference tasks efficiently with low power draw. Now shipped in over 970 million smartphones globally.

VRAM (Video RAM): The dedicated, high-speed memory on a GPU used to store model weights, optimizer states, and activation memory. The most critical hardware constraint: a 70B-parameter model in 16-bit precision requires roughly 140GB of VRAM for the weights alone (70 billion parameters at 2 bytes each), before activations or any other runtime memory.
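
The 140GB figure is simple arithmetic, and the same calculation shows what lower precision buys you. A small sketch, assuming decimal gigabytes and counting weights only (the function name is illustrative):

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for label, bits in [("16-bit", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B @ {label}: {weight_memory_gb(70e9, bits):.0f} GB")
# 70B @ 16-bit: 140 GB
# 70B @ INT8: 70 GB
# 70B @ INT4: 35 GB
```

Real deployments need headroom on top of this for activations and other runtime buffers, so treat these numbers as a floor.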

Edge AI: Running AI models on devices located at or near the data source (smartphones, IoT sensors, vehicles) rather than sending data to a central cloud. Edge AI enables low latency, offline operation, and privacy preservation.

Quantization and Model Compression

Model Quantization: Reducing the numerical precision of model weights from high-precision floats to lower-precision integers or small floats. Quantization shrinks model file size and reduces memory requirements, often with minimal accuracy loss. INT8 yields roughly 2x compression; INT4 yields roughly 4x compression over BF16.
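
Here is a minimal sketch of one simple scheme, symmetric per-tensor INT8 quantization, to show where the "minimal accuracy loss" comes from. The function names and sample weights are illustrative. Real quantizers usually work per channel or per block and use more careful calibration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each stored value takes 8 bits instead of 16, and the rounding error per weight is bounded by half the scale. That bound is why quality often holds up when the weights are not dominated by a few large outliers.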

GGUF (GPT-Generated Unified Format): The primary file format for the llama.cpp inference engine and tools built on it (Ollama, LM Studio). It supports a spectrum of quantization levels and enables efficient CPU-plus-GPU hybrid inference on consumer hardware.

Model Distillation: A compression technique where a smaller "student" model trains to replicate the behavior of a larger "teacher" model. Transfers knowledge into a compact form. In favorable cases the student reaches 90–95% of the teacher's accuracy with around 10% of the parameters, though the actual tradeoff depends heavily on the task.

APIs and Services

Ollama: An open-source tool that lets developers download and run LLMs locally with a single command. Handles model download, quantization, and serving through a local HTTP API compatible with the OpenAI format. Widely used for local development, privacy-sensitive workloads, and offline environments.

Groq: An AI inference company building LPU (Language Processing Unit) chips optimized specifically for the sequential token-generation process. Achieves thousands of tokens per second, ideal for applications where generation latency is the primary constraint.

AWS Bedrock: Amazon's managed service providing API access to foundation models from multiple providers (Anthropic Claude, Meta Llama, Cohere, Mistral, Amazon Titan) within the AWS cloud environment. Preferred for enterprises with existing AWS infrastructure and strict data-residency requirements.

Hugging Face: A platform hosting over 900,000 pre-trained models, datasets, and demo applications, plus the Transformers library, the most widely used Python library for loading and fine-tuning language models.

Deployment Metrics

Latency: Total elapsed time from request submission to receiving the final token. Broken into TTFT (Time to First Token) and TPOT (Time Per Output Token). Low latency is critical for interactive user-facing applications.
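
The two components combine in a simple way. The sketch below uses a common approximation: the first token costs TTFT, and every token after it costs one TPOT. The function name and sample numbers are illustrative.

```python
def total_latency_s(ttft_s, tpot_s, output_tokens):
    """Approximate end-to-end latency: first token, then one TPOT per remaining token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + tpot_s * (output_tokens - 1)

# 0.5 s to first token, 20 ms per token after that, 101 output tokens.
example = total_latency_s(0.5, 0.02, 101)
```

For short answers TTFT dominates; for long generations TPOT does. That is why the two are tracked separately.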

Throughput: The total number of tokens a system generates per second across all concurrent users. Throughput and latency are in tension: more parallel requests increase throughput but also raise per-request latency.

Cost per Token: The monetary cost for generating one token of model output. The fundamental unit of LLM economics. Pricing varies from fractions of a cent per million tokens for small open-source models to several dollars per million for frontier proprietary models.

Rate Limiting: Capping how many API requests or tokens a client can consume within a given time window. Prevents any single user from monopolizing GPU capacity. Exceeding rate limits returns HTTP 429.
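
One common way to implement this is a token bucket. The sketch below is a minimal single-process version. The class name, the helper `status_for`, and the injectable clock (used here so the demo is deterministic) are illustrative assumptions, and real API providers may use different algorithms.

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1):
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

def status_for(bucket):
    """HTTP status a server would return for one incoming request."""
    return 200 if bucket.allow() else 429

fake_now = [0.0]  # stand-in clock for the demo
bucket = TokenBucket(capacity=2, refill_per_second=1.0, clock=lambda: fake_now[0])
statuses = [status_for(bucket) for _ in range(3)]  # burst of three requests
fake_now[0] = 1.0                                  # one second later
statuses.append(status_for(bucket))
```

On the client side, the usual response to a 429 is to wait and retry with increasing delays.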

Part 7: Emerging AI Terms for 2026

These are the terms you'll encounter most in 2026 that didn't exist in mainstream AI vocabulary even 18 months ago.

Reasoning Models: LLMs that generate extended internal reasoning, working through problems step by step, before producing a final answer. Pioneered by OpenAI's o1 and extended through o3 and DeepSeek-R1. They use reinforcement learning to develop chains of thought spanning thousands of tokens. Excel at mathematics, code, and multi-step logic at the cost of higher latency and token usage.

Extended Thinking: Anthropic's implementation of visible chain-of-thought reasoning in Claude models, where the model surfaces its reasoning process in a dedicated "thinking" block before the final response. It is configurable: developers set a token budget for the thinking phase, enabling a tunable tradeoff between reasoning depth and speed.

Test-Time Compute (TTC): Allocating additional computational resources during inference, rather than during training, to improve output quality. Instead of always making a model bigger, TTC lets a model "think longer" on harder problems by generating more reasoning steps or sampling multiple candidate answers. A paradigm shift: the same model can produce higher-quality answers by spending more compute at inference time.
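
"Sampling multiple candidate answers" can be as simple as majority voting over repeated samples, sometimes called self-consistency. A minimal sketch, where `sample_answer` is a hypothetical stand-in for one stochastic call to a model and the canned answers replace real model output:

```python
from collections import Counter

def self_consistency(sample_answer, num_samples):
    """Draw several answers and return the most common one with its vote share."""
    answers = [sample_answer() for _ in range(num_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / num_samples

# Canned answers standing in for five independent model samples.
canned = iter(["42", "41", "42", "42", "17"])
answer, agreement = self_consistency(lambda: next(canned), num_samples=5)
```

Raising `num_samples` spends more inference compute on the same model in exchange for a more reliable final answer, which is the tradeoff the definition describes.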

Mixture of Experts (MoE): A model architecture where the full set of parameters is divided into specialized sub-networks ("experts"), but only a small subset activate for any given input token. This allows enormous total parameter counts while using only a fraction per forward pass. Llama 4 Maverick runs 400B total parameters but only 17B are active per token, fundamentally changing inference economics.
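
Sparse activation is easier to see in code. This toy sketch routes one input to the top-k experts and mixes their outputs. The scalar "experts", the hard-coded router logits, and the softmax over only the selected logits are illustrative simplifications. In a real model the router is a learned layer that produces logits per token, and each expert is a feed-forward network.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, k=2):
    """Run only the k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    gates = softmax([router_logits[i] for i in chosen])
    output = sum(g * experts[i](x) for g, i in zip(gates, chosen))
    return output, chosen

calls = []  # records which experts actually ran

def make_expert(idx, factor):
    def expert(x):
        calls.append(idx)
        return factor * x
    return expert

experts = [make_expert(i, f) for i, f in enumerate([1.0, 2.0, 3.0, 4.0])]
output, chosen = moe_forward(1.0, [0.1, 2.0, -1.0, 2.0], experts, k=2)
```

Only two of the four experts execute for this input. With the figures quoted above, 17B active out of 400B total is about 4% of the parameters used per token.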

Speculative Decoding: An inference acceleration technique where a small, fast "draft" model generates several candidate tokens ahead, and the large target model verifies them in parallel in a single forward pass. When draft tokens are accepted, multiple tokens commit at once, so each target-model step can yield more than one token. Typically reduces latency by around 2–3x without changing the target model's outputs. Widely deployed in production serving stacks in 2026.
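
The accept-or-correct logic can be sketched for the greedy case. Both "models" here are hypothetical toy functions that map a token list to the next token. In a real system the verification loop below is one batched forward pass of the target model, and sampling-based decoding uses a probabilistic acceptance rule instead of exact matching.

```python
def speculative_step(prefix, draft_next, target_next, num_draft=4):
    """One round: draft proposes tokens, target keeps the longest agreeing run."""
    proposed, ctx = [], list(prefix)
    for _ in range(num_draft):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    accepted, ctx = [], list(prefix)
    for token in proposed:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)   # first mismatch: take the target's token
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))   # all drafts accepted: one bonus token
    return accepted

def toy_target(ctx):
    # Hypothetical target model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10

def toy_draft(ctx):
    # Hypothetical draft model: agrees with the target except after a 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

tokens = speculative_step([1], toy_draft, toy_target)
```

Here the round commits three tokens (2, 3, 4) for one verification step, and the result is the same sequence the target model would have produced on its own.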

Model Routing: A system that dynamically selects the most appropriate model from a pool to handle each request based on task complexity, required latency, and cost budget. Simple requests route to cheap, fast small models; complex requests escalate to larger models. Effective routing can reduce inference costs by 50–80% without measurable quality degradation.
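
A router can start out as a few heuristics. In this sketch the tier names, the keyword list, and the length threshold are all hypothetical placeholders. Production routers often use a trained classifier rather than hand-written rules.

```python
# Hypothetical signals that a request needs the larger model.
HARD_HINTS = ("prove", "debug", "step by step", "analyze", "refactor")

def route(prompt, long_prompt_words=200):
    """Return the tier ('small' or 'large') that should handle this prompt."""
    text = prompt.lower()
    is_long = len(text.split()) > long_prompt_words
    looks_hard = any(hint in text for hint in HARD_HINTS)
    return "large" if (is_long or looks_hard) else "small"

easy = route("What is the capital of France?")
hard = route("Debug this stack trace and explain the fix step by step.")
```

The savings come from how much traffic lands in the cheap tier, so a router like this should be checked against real prompts before trusting any cost estimate.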

Synthetic Data: Data generated by an AI model for the purpose of training other models. By 2026, synthetic data is standard practice: human preference labels cost $1–$10+ each, while AI-generated equivalents cost less than $0.01. Used for RLHF preference pairs, instruction-following examples, and reasoning chains.

Small Language Model (SLM): A language model compact enough to run on consumer hardware without cloud infrastructure, typically in the 1B–7B parameter range, though models with only a few hundred million parameters also qualify. A fine-tuned 300M-parameter SLM can outperform a 70B generalist model on a narrow domain task. SLMs are driving the shift toward on-device AI.

Frequently Asked Questions (FAQs)

What is the difference between AI, machine learning, and deep learning?

These three terms form a hierarchy. AI is the broad field covering any technique that enables machines to perform intelligent tasks. Machine learning is a subset of AI where systems learn from data rather than following explicit rules. Deep learning is a subset of ML that specifically uses multi-layer neural networks. All deep learning is ML; all ML is AI, but not all AI is ML or deep learning.

What is hallucination in AI and why does it happen?

Hallucination is when an AI model produces factually false information stated with apparent confidence. It happens because language models are trained to predict the next plausible token, not to verify whether claims are true. The model has no internal fact-checking mechanism; it generates text that sounds correct based on learned statistical patterns, even when the underlying facts are wrong.

What is RAG and why is it important?

RAG stands for Retrieval-Augmented Generation. It's a technique that grounds LLM responses in external, verified documents by retrieving relevant content from a knowledge base and injecting it into the model's context before generating a response. RAG is important because it mitigates two major LLM problems: knowledge staleness (the model's training has a cutoff date) and hallucination (the model invents facts it doesn't actually know).
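
The retrieve-then-inject flow fits in a few lines. This sketch scores documents by shared words, which is an illustrative stand-in for the embedding similarity search a real system would use. The sample documents, function names, and prompt wording are all hypothetical.

```python
import re

def words(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, k=2):
    """Rank documents by how many query words they share; keep the top k."""
    query_words = words(query)
    ranked = sorted(documents, key=lambda d: len(query_words & words(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Inject the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes 5 business days.",
]
question = "Within how many days are refunds accepted?"
prompt = build_prompt(question, docs, k=1)
```

The resulting prompt is what gets sent to the model, so the answer can draw on the knowledge base instead of whatever the model memorized during training.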

What is the difference between fine-tuning and prompt engineering?

Prompt engineering changes the model's behavior through the input you provide: you craft instructions, examples, and context that guide the model to produce better outputs without modifying any weights. Fine-tuning actually updates the model's parameters by continuing training on new data, permanently changing how the model responds. Prompt engineering is faster, cheaper, and reversible; fine-tuning produces more consistent, specialized behavior but requires compute and data.

What is MCP and why should developers care?

MCP (Model Context Protocol) is an open standard that defines how AI systems connect to external tools, data sources, and services. Before MCP, every AI integration required custom glue code. MCP standardizes the interface so any MCP-compliant agent can connect to any MCP-compliant server without custom integration work. Supported by Anthropic, OpenAI, and Google DeepMind, with over 500 public MCP servers available in 2026.

What is the difference between an AI agent and a chatbot?

A chatbot responds to individual inputs and has no persistent goals or the ability to take actions in the world. An AI agent operates in a loop: it observes its environment, reasons about a goal, takes actions (calling tools, browsing the web, writing code), evaluates results, and self-corrects until a multi-step objective is achieved. Agents maintain state, use tools, and can operate autonomously over extended periods. A chatbot answers questions; an agent completes tasks.

Final Thoughts

AI terminology is not academic decoration; it's the shared language of a field moving faster than any other. Knowing what an LLM actually is, how RAG differs from fine-tuning, why MoE matters for inference cost, and what extended thinking does to output quality puts you ahead of 90% of the people talking about AI in 2026.

Keep this glossary as a reference. The field will keep adding new terms: reasoning models, speculative decoding, and synthetic data all barely existed in mainstream AI vocabulary 18 months ago. The fundamentals, though, stay stable. Get those right and the new terms are always easier to understand.