How Large Language Models Work: What’s Actually Inside ChatGPT

Large language models like ChatGPT are reshaping the world, but most people have no idea how they actually work. This deep-dive explains transformers, training, emergent abilities, and limitations.

Read time: 12 minutes

ChatGPT crossed one million users within five days of launch, at the time one of the fastest adoption rates ever recorded for a consumer technology product. Within two years, it had over 100 million weekly active users. Yet most people who use it daily have only a vague understanding of what is actually happening when they type a message and receive a reply. Understanding how large language models work does not require a computer science degree, but it does fundamentally change how you use and evaluate AI systems.

What Is a Large Language Model?

A large language model is a type of artificial intelligence trained on vast amounts of text to predict and generate human language. The “large” in the name refers primarily to the number of parameters: adjustable numerical weights that the model uses to process and generate text. GPT-4, for example, is estimated to have over one trillion parameters, although OpenAI has not published the figure. For comparison, GPT-2 from 2019 had 1.5 billion parameters. If the estimate is right, that is a roughly thousand-fold scale increase in about four years, and it is largely responsible for the leap in capability that made these models suddenly seem intelligent.

At the most fundamental level, an LLM is a very sophisticated text prediction machine. Given a sequence of words, it predicts what word (or more precisely, what token) should come next. Run this prediction in a loop (predict the next token, add it to the sequence, predict again) and you get generated text. The remarkable thing is that predicting text well enough, at large enough scale, produces a system that appears to understand, reason, translate, code, and create.
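The predict, append, repeat loop can be sketched with a toy model. The lookup table below is a hypothetical stand-in for a real LLM: it looks only at the last token, whereas a real model conditions on the whole sequence and scores every token in its vocabulary with a neural network. Only the shape of the loop carries over.

```python
# Hypothetical next-token probabilities, keyed by the previous token.
# A real LLM computes these with a neural network, not a lookup table.
BIGRAM_PROBS = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "the": 0.1},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<end>": 1.0},
}


def predict_next(tokens):
    """Return the most probable next token (greedy decoding)."""
    candidates = BIGRAM_PROBS.get(tokens[-1], {"<end>": 1.0})
    return max(candidates, key=candidates.get)


def generate(prompt, max_new_tokens=5):
    """Predict a token, append it, and repeat until <end> or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


print(" ".join(generate(["the"])))  # prints "the cat sat on the cat"
```

Because this toy model sees only one token of context, it loops back to “cat” after the second “the”. Conditioning on the full sequence is what lets a real model avoid that kind of repetition.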

This is not a trick or an illusion, at least not entirely. To predict text well across the enormous diversity of human writing, a model must develop internal representations that capture grammar, facts, reasoning patterns, stylistic conventions, and conceptual relationships. Nobody explicitly programs these capabilities. The model learns them as necessary intermediate representations for the prediction task.

The Transformer Architecture

All major LLMs today, including GPT-4, Claude, Gemini, LLaMA, and Mistral, are built on the transformer architecture, introduced in the landmark 2017 paper “Attention Is All You Need” by researchers at Google. Before transformers, language models used recurrent neural networks (RNNs) and long short-term memory (LSTM) networks that processed text sequentially, one word at a time, which made them slow to train and limited in their ability to capture long-range relationships in text.

During training, transformers process all positions of a sequence in parallel rather than one word at a time. This enables training on dramatically larger datasets in dramatically less time, which is the key enabler for scaling to the billions and trillions of parameters that produce modern LLM capabilities. The parallelism also aligns well with the architecture of modern GPU hardware, which is designed for parallel numerical computation. Text generation is still sequential, because each new token depends on the tokens before it.

A transformer model consists of stacked encoder and decoder blocks (or just decoder blocks in autoregressive models like GPT). Each block contains two key components: a multi-head self-attention mechanism and a feed-forward neural network. These two components, repeated dozens to hundreds of times in a deep stack, are what perform the computation that transforms input tokens into output tokens.

How LLMs Are Trained

Pre-Training: Learning From the Internet

The first training phase, pre-training, exposes the model to enormous quantities of text from the internet, books, scientific papers, code repositories, and other sources. The training objective is simple: predict the next token given all previous tokens in a sequence. For each prediction, the model’s output is compared to the actual next token in the training data, and the difference (the “loss”) is used to adjust the model’s parameters. Backpropagation computes how much each parameter contributed to the loss, and gradient descent nudges each parameter in the direction that reduces it.
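A minimal sketch of one such update, assuming a hypothetical three-token vocabulary. To stay short it applies the gradient step directly to the model’s output scores (logits); in a real model, backpropagation carries the same gradient back through every layer to the parameters.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Loss for one prediction: negative log-probability of the actual next token."""
    return -math.log(softmax(logits)[target_index])


def gradient_step(logits, target_index, learning_rate=0.5):
    """One gradient-descent update on the logits.

    For softmax with cross-entropy, d(loss)/d(logit_i) is p_i - 1 for the
    target token and p_i for every other token.
    """
    probs = softmax(logits)
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return [x - learning_rate * g for x, g in zip(logits, grads)]


vocab = ["cat", "dog", "mat"]   # hypothetical vocabulary
logits = [0.0, 0.0, 0.0]        # untrained: every token equally likely
target = vocab.index("mat")     # the token that actually came next in the data

loss_before = cross_entropy(logits, target)
logits = gradient_step(logits, target)
loss_after = cross_entropy(logits, target)
print(f"loss before: {loss_before:.3f}, after one step: {loss_after:.3f}")
```

The loss drops after a single step because the update raises the score of the correct token and lowers the others. Pre-training repeats this across billions of predictions.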

This process is repeated billions of times across the training corpus. GPT-3 was trained on approximately 300 billion tokens of text. GPT-4 and Claude 3 are estimated to have been trained on trillions of tokens. The computational cost is enormous — training a frontier model requires thousands of specialized GPUs running for months, costing tens to hundreds of millions of dollars. This is why only a handful of well-funded organizations can train frontier LLMs from scratch.

Fine-Tuning: Specializing the Model

After pre-training, the base model is a capable text predictor but is not yet optimized for being a useful assistant. It will complete any text it receives in whatever direction the training data suggests — which may produce toxic, incorrect, or unhelpful completions. Fine-tuning on carefully curated datasets of high-quality examples of helpful AI behavior shapes the model’s behavior toward assistant-like patterns. Fine-tuning is far cheaper than pre-training — it adjusts an already capable model rather than training from scratch.

RLHF: Aligning With Human Preferences

Reinforcement Learning from Human Feedback (RLHF) is the process that transformed capable text models into systems that are genuinely helpful, harmless, and honest. Human evaluators rate model outputs on quality, helpfulness, and safety. These ratings are used to train a “reward model” that predicts what humans would prefer. The LLM is then fine-tuned using reinforcement learning to maximize the reward model’s predicted score. This is why ChatGPT and Claude produce conversational, helpful responses rather than raw text completions.

Tokens: The Building Blocks

LLMs do not process words directly. They process tokens, which are chunks of text that can be whole words, parts of words, or individual characters. The tokenization process converts raw text into sequences of token IDs that the model’s numerical architecture can process. GPT-4 uses the cl100k_base encoding, available through OpenAI’s tiktoken library, which has a vocabulary of approximately 100,000 distinct tokens.

Understanding tokens matters practically. Common words like “the”, “is”, and “cat” are typically single tokens. Rare words, technical terms, or non-English words may be split into multiple tokens — “cryptocurrency” might be three tokens, and a Chinese character might also be multiple tokens. This is why LLMs handle some languages less efficiently than others, and why pricing for API usage is based on token counts rather than word counts. One token is roughly 0.75 English words on average.
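The splitting of rare words into several pieces can be illustrated with a toy tokenizer. The vocabulary below is hypothetical, and the greedy longest-match rule is a simplification: production tokenizers use learned byte-pair merges over a far larger vocabulary, so their actual splits will differ.

```python
# Hypothetical subword vocabulary; real tokenizers learn their entries from data.
VOCAB = {"the": 0, "cat": 1, "crypto": 2, "curr": 3, "ency": 4, " ": 5}


def tokenize(text):
    """Greedy longest-match split; unknown characters become single-character tokens."""
    tokens = []
    position = 0
    while position < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in VOCAB:
                tokens.append(piece)
                position = end
                break
        else:
            tokens.append(text[position])  # fallback: one character
            position += 1
    return tokens


print(tokenize("the cat"))         # ['the', ' ', 'cat']
print(tokenize("cryptocurrency"))  # ['crypto', 'curr', 'ency']
```

A common word maps to one token, while the rarer word costs three. The same effect is why text in languages that are poorly covered by the vocabulary uses more tokens per word.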

The context window, the maximum amount of text an LLM can process at once, is measured in tokens. GPT-4 Turbo supports up to 128,000 tokens in a single context window; Claude 3’s context window extends to 200,000 tokens. At roughly 0.75 words per token, that is about 96,000 to 150,000 English words, enough for one or two full-length novels. The context window determines how much text the model can “remember” and reason about in a single conversation or task.

Attention: How Models Understand Context

The attention mechanism is the key innovation that makes transformers powerful. When processing any given word (or token), the attention mechanism allows the model to selectively focus on other relevant words anywhere in the input sequence — regardless of how far apart they are. This is what enables understanding that “it” in the sentence “The bank was steep. John slid down it” refers to “bank” rather than some other candidate.
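A minimal sketch of scaled dot-product attention, the core calculation behind this selective focus. The two-dimensional vectors are made up for illustration. In a real transformer, queries, keys, and values are learned projections of token embeddings with far more dimensions, and decoder models also mask out future positions, which is omitted here.

```python
import math


def softmax(scores):
    """Normalise scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Each query produces a weighted average of all values."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / scale for k in keys]  # similarity to every position
        weights = softmax(scores)                   # attention weights
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical vectors for three tokens of the example sentence.
tokens = ["bank", "John", "it"]
keys = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_it = [[4.0, 0.0]]  # assumed: the query for "it" resembles the key for "bank"

_, weights = attention(query_for_it, keys, values)
for token, weight in zip(tokens, weights[0]):
    print(f"{token}: {weight:.2f}")
```

With these made-up vectors nearly all of the weight lands on “bank”, so the representation of “it” is built almost entirely from information about the bank.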

Multi-head attention means the model runs multiple attention computations in parallel, each potentially focusing on different types of relationships between words — one head might focus on grammatical dependencies, another on semantic similarity, another on coreference (which words refer to the same thing). The outputs of all attention heads are combined and processed by the feed-forward layer.

The attention mechanism’s computational cost scales quadratically with sequence length: doubling the context window quadruples the attention computation required. This is a fundamental constraint on context window size and one of the active areas of research in making LLMs more efficient. Techniques like sparse attention and sliding window attention (used in Mistral 7B) reduce the number of token pairs that are compared, while state-space architectures such as Mamba replace attention altogether to avoid the quadratic cost.
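The scaling difference is easy to check by counting how many token pairs are scored. This is a back-of-the-envelope sketch: the window size of 512 is an arbitrary illustrative value, and the sliding-window count is an upper bound that ignores shorter windows at the start of the sequence.

```python
def full_attention_pairs(sequence_length):
    """Every token attends to every token: n * n scores."""
    return sequence_length * sequence_length


def sliding_window_pairs(sequence_length, window):
    """Each token attends to at most `window` nearby tokens: about n * window scores."""
    return sequence_length * min(window, sequence_length)


for n in (1_000, 2_000, 4_000):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=512))
```

Doubling the sequence length quadruples the full-attention count but only doubles the sliding-window count.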

Emergent Abilities and Scaling Laws

One of the most striking discoveries in LLM research is that certain capabilities emerge discontinuously as model scale increases. Below a certain parameter threshold, a model cannot perform a task at all; above the threshold, it performs it well. These are called “emergent abilities” because they were not specifically trained for and were not predictable from the model’s performance at smaller scales.

Multi-step arithmetic, logical reasoning, code generation, and translation in rare languages have all shown emergent improvement patterns. This is simultaneously exciting and concerning — exciting because it suggests that increasing model scale unlocks new capabilities unpredictably, concerning because it means we cannot fully anticipate what a larger model will be able to do until it is built.

Scaling laws, documented by researchers at OpenAI and DeepMind, describe how model performance improves predictably as compute, data, and parameters increase. These laws have guided the strategic decisions of AI labs by providing a roadmap for how much capability improvement a given investment in scaling is expected to produce. According to the arXiv paper “Scaling Laws for Neural Language Models”, test loss falls smoothly as a power law in compute, dataset size, and parameter count. The emergent jumps described above appear in performance on specific downstream tasks, layered on top of that smooth underlying loss curve.
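A power law has a simple signature: each tenfold increase in compute multiplies the loss by the same constant factor. The sketch below uses made-up constants purely to show that shape. They are not the fitted values from any published scaling-law study.

```python
def predicted_loss(compute, scale_constant=10.0, exponent=0.05):
    """Power-law form L(C) = a * C**(-b). Constants are illustrative, not fitted."""
    return scale_constant * compute ** (-exponent)


for compute in (1e18, 1e20, 1e22):
    print(f"compute={compute:.0e}  predicted loss={predicted_loss(compute):.3f}")

# Under a power law, every 10x increase in compute shrinks the loss by the same factor.
ratio = predicted_loss(1e20) / predicted_loss(1e19)
print(f"loss ratio per 10x compute: {ratio:.3f}")
```

This constant ratio is what lets labs extrapolate from small training runs to much larger ones before committing the budget.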

RLHF: Making Models Helpful and Safe

The gap between a capable base model and a useful, aligned assistant is bridged primarily by Reinforcement Learning from Human Feedback. The process involves three stages. First, supervised fine-tuning on demonstrations of ideal assistant behavior. Second, training a reward model by presenting human evaluators with pairs of model outputs and having them rank which is better. Third, optimizing the LLM’s outputs using the reward model’s preferences through reinforcement learning, specifically an algorithm called Proximal Policy Optimization (PPO).
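The second stage can be made concrete with the pairwise loss commonly used to train reward models from rankings. This is a sketch of that loss only, with hypothetical scores. It is not a description of any particular lab’s pipeline, and it leaves out the reward network and the PPO stage entirely.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    Small when the human-preferred output gets the higher score.
    """
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))


# Hypothetical reward-model scores for two responses to the same prompt.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
disagrees = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)
print(f"model agrees with ranking: {agrees:.3f}")
print(f"model disagrees with ranking: {disagrees:.3f}")
```

Minimizing this loss over many ranked pairs pushes the reward model to score outputs the way the human evaluators would.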

RLHF is why Claude, ChatGPT, and Gemini are genuinely helpful rather than simply capable — they have been shaped by human feedback to be useful, polite, and to avoid producing harmful outputs. The quality and diversity of the human feedback determines much of the assistant’s alignment quality. This is why organizations like Anthropic, OpenAI, and Google invest heavily in the human feedback pipeline — the training data for RLHF is as important as the base model’s capabilities.

Constitutional AI (CAI), developed by Anthropic, is an extension of RLHF that uses a set of explicit principles — a “constitution” — to guide the model’s self-critique and revision during training, reducing reliance on human feedback for safety-related judgments. This allows more efficient safety training and more consistent principled behavior across diverse inputs.

Fundamental Limitations of LLMs

Hallucination

LLMs hallucinate — they generate confident, plausible-sounding falsehoods. This is an inherent consequence of their training objective. Because they are trained to produce fluent, contextually appropriate text, they will generate text that sounds right even when it is not. They have no internal “certainty meter” that gates output based on confidence — a model that does not know something will still produce a response that looks similar to one from a model that does know.
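A small sketch of why the output looks the same either way. Decoding turns scores into probabilities and emits a token whether the distribution is sharply peaked or nearly flat. The vocabulary and scores below are invented for illustration.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_choice(vocab, logits):
    """Return the top token and its probability; nothing here gates on confidence."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]


VOCAB = ["Paris", "Lyon", "Rome", "Oslo"]  # hypothetical candidate tokens
CONFIDENT_LOGITS = [8.0, 1.0, 1.0, 1.0]    # the model strongly favours one answer
UNSURE_LOGITS = [1.1, 1.0, 1.0, 1.0]       # the model is close to guessing

for label, logits in (("confident", CONFIDENT_LOGITS), ("unsure", UNSURE_LOGITS)):
    token, prob = greedy_choice(VOCAB, logits)
    print(f"{label}: emits {token!r} with probability {prob:.2f}")
```

Both cases emit the same word. The difference in probability exists inside the model but never reaches the reader unless the system around the model is built to surface it.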

No True World Model

LLMs have learned statistical patterns in text, not a causal model of how the world works. They can describe physical processes with apparent accuracy without any understanding of causality, mechanism, or physical reality. This produces impressive outputs in many cases but catastrophic failures in others — particularly novel situations where the right answer requires reasoning from first principles rather than pattern matching to training data.

Knowledge Cutoff

Training data has a cutoff date. Without access to tools like web search, an LLM has no knowledge of events after its training cutoff. This is addressed in deployed systems through retrieval-augmented generation (RAG) — connecting the model to external knowledge bases or search engines at inference time — and through tool use APIs that allow models to call external services for current information.

Context Window Limitations

Despite increasingly large context windows, LLMs do not process the full context with uniform attention. Research has shown the “lost in the middle” problem — information at the beginning and end of a long context is used more reliably than information buried in the middle. Very long context windows also increase computational cost significantly, affecting response latency and API pricing.

Frequently Asked Questions

Does ChatGPT actually understand language?

This is one of the most contested questions in AI. LLMs demonstrably process and use semantic, syntactic, and pragmatic information in sophisticated ways. Whether this constitutes “understanding” depends heavily on how you define understanding. Philosophically, the debate remains open. Practically, the distinction matters less than what the model can reliably do and where it fails.

What is the difference between GPT-4 and earlier models?

GPT-4 represents a substantial capability leap over GPT-3.5 across reasoning, code generation, instruction following, and factual accuracy. It was also the first GPT model to be multimodal, accepting image inputs. The scale difference (OpenAI has not disclosed parameter counts, but unconfirmed estimates put GPT-4 at up to roughly ten times the 175 billion parameters of GPT-3) combined with improved training methods accounts for the performance jump. Each generation has shown similar qualitative capability improvements alongside quantitative scale increases.

What is retrieval-augmented generation (RAG)?

RAG connects an LLM to an external knowledge source, such as a document database, a website, or a search engine, at the time of a query. Relevant documents are retrieved based on the user’s question, then provided to the LLM as context. The model generates its response based on both its trained knowledge and the retrieved documents. RAG can substantially reduce hallucination on factual questions, provided the retrieved documents are relevant and accurate, and it enables LLMs to answer questions about content they were not trained on.
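The retrieve-then-prompt flow can be sketched in a few lines. Everything here is a stand-in: the three-document knowledge base is invented, word overlap replaces the embedding-based search a production system would typically use, and the final call to an LLM is left out.

```python
import re

# Hypothetical knowledge base.
DOCUMENTS = [
    "The context window is the maximum number of tokens a model can process at once.",
    "Pre-training teaches a model to predict the next token in a sequence.",
    "Reinforcement learning from human feedback uses a reward model trained on human rankings.",
]


def words(text):
    """Lowercased set of the words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Rank documents by words shared with the question (stand-in for embedding search)."""
    question_words = words(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & words(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, documents):
    """Put the retrieved text ahead of the question so the model can ground its answer."""
    context = "\n".join(retrieve(question, documents))
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"


print(build_prompt("What is a context window?", DOCUMENTS))
```

The prompt that reaches the model now contains the relevant passage, so the answer can draw on the retrieved text instead of relying only on what was memorized during training.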

What comes after LLMs?

Research is active on several fronts. Multimodal models that natively process and generate text, images, audio, and video are the near-term frontier, with GPT-4o and Gemini Ultra already moving in this direction. Models with genuine world models — internal causal representations of how the world works rather than just statistical text patterns — are a longer-term research goal. Neuromorphic architectures, mixture-of-experts models (like Mixtral), and test-time compute scaling (investing more inference compute for harder problems) are near-term architectural directions receiving significant research attention.
