
ChatGPT is built on the Transformer architecture, a deep learning model introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al. The key innovation of the Transformer is the attention mechanism, especially self-attention, which lets the model weigh how relevant every token in a sequence is to every other token, regardless of their positions.
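As a concrete illustration of self-attention (not OpenAI's actual implementation), the sketch below computes scaled dot-product attention in NumPy with a causal mask, matching the decoder-only setting described next; the projection matrices, dimensions, and random inputs are illustrative assumptions.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a sequence x of shape (T, d_model).

    Wq, Wk, Wv are (d_model, d_k) projection matrices; all sizes here are
    illustrative placeholders, not the parameters of any released GPT model.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv          # project tokens to queries/keys/values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # pairwise similarity between positions
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)
    scores[mask] = -np.inf                    # causal mask: a token cannot attend to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                        # weighted sum of value vectors

# Toy usage: 5 tokens, 16-dimensional embeddings, an 8-dimensional attention head
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)    # shape (5, 8)
```

In practice the model runs many such attention heads in parallel (multi-head attention) and learns the projection matrices during training.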
Model Architecture
ChatGPT, based on GPT (Generative Pre-trained Transformer), uses a decoder-only Transformer. Key components include:
- Tokenisation: Inputs are broken into subword tokens using Byte-Pair Encoding (BPE) or similar schemes.
- Embedding: Each token is mapped to a high-dimensional vector. Positional encodings are added to provide sequence order (a positional-encoding sketch follows this list).
- Stacked Decoder Blocks: Each block contains (a minimal block sketch also follows this list):
  - Multi-head self-attention: Allows the model to attend to different positions in the input simultaneously.
  - Feedforward neural network: Processes attended outputs through non-linear transformations.
  - Layer normalisation and residual connections: Stabilise training and improve gradient flow.
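For the positional-encoding step, here is a hedged sketch of the sinusoidal scheme from “Attention Is All You Need”; GPT-family models are reported to use learned position embeddings instead, so treat this purely as an illustration of how order information can be injected.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000**(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))
    """
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy usage: add position information to a sequence of token embeddings
seq_len, d_model = 10, 16
token_embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))
x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```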
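And here is a minimal PyTorch sketch of one decoder block combining the three sub-components above; layer sizes, dropout, and the pre-norm placement are assumptions rather than the configuration of any released GPT model.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block: causal multi-head self-attention plus a feedforward
    network, each wrapped in a residual connection with layer normalisation
    (pre-norm variant). All hyperparameters are illustrative."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # Causal mask: position t may only attend to positions <= t.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out              # residual connection around attention
        x = x + self.ff(self.ln2(x))  # residual connection around the feedforward network
        return x

# Toy usage: a batch of 2 sequences, 10 tokens each, already embedded to 512 dimensions
block = DecoderBlock()
out = block(torch.randn(2, 10, 512))  # shape (2, 10, 512)
```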
Training Process
- Pretraining: The model is trained on large corpora using the objective of next-token prediction (causal language modelling). At timestep t, the model predicts token t+1 given tokens [1, …, t]. This forces the model to build a statistical representation of language patterns (a sketch of this training step follows the list).
- Self-Supervised Learning: During pretraining, the model has no explicit human labels; the next token itself provides the training signal. It learns from vast datasets scraped from the internet, filtered for quality and safety.
- Supervised Fine-Tuning (SFT): Human-labelled data guides the model in following instructions more effectively.
- Reinforcement Learning from Human Feedback (RLHF): Human preferences guide model outputs. Steps include:
  - Generating multiple outputs for a prompt.
  - Having human labellers rank the outputs by quality.
  - Training a reward model to reproduce the rankings (a sketch of this ranking loss also follows the list).
  - Using Proximal Policy Optimisation (PPO) to fine-tune the model against this reward model.
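To make the pretraining objective concrete, the sketch below implements one next-token-prediction step: the inputs are the tokens up to position t, the targets are the same tokens shifted by one, and the loss is cross-entropy over the vocabulary. The tiny stand-in model and vocabulary size are placeholders, not anything resembling a production GPT.

```python
import torch
import torch.nn as nn

# Placeholder language model: any module mapping token ids (B, T) to logits (B, T, vocab)
vocab_size = 1000
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))

tokens = torch.randint(0, vocab_size, (4, 33))   # batch of 4 sequences, 33 tokens each
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # at step t, the target is token t+1

logits = model(inputs)                           # (4, 32, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),              # flatten batch and time dimensions
    targets.reshape(-1),                         # corresponding next-token ids
)
loss.backward()                                  # gradients for one pretraining step
```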
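For the reward-model step of RLHF, a common formulation is a pairwise ranking loss: the human-preferred response should score higher than the rejected one. The sketch below shows only that loss; the PPO update that uses the trained reward model is omitted, and the scores are made-up numbers.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(score_preferred, score_rejected):
    """Pairwise loss for training a reward model from human rankings:
    -log(sigmoid(r_preferred - r_rejected)). The loss is small when the
    model assigns a higher reward to the human-preferred response."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Toy usage: scalar rewards a (hypothetical) reward model assigned to paired responses
preferred = torch.tensor([1.3, 0.4, 2.1])
rejected = torch.tensor([0.2, 0.9, 1.5])
print(reward_ranking_loss(preferred, rejected))
```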
Inference
At runtime, given a prompt, the model:
- Tokenises the input.
- Feeds tokens into the Transformer layers.
- Computes the probability distribution over the vocabulary for the next token.
- Selects a token using sampling (e.g., top-k or nucleus sampling) or greedy decoding (see the sampling sketch after this list).
- Repeats until stopping criteria (e.g., end-of-sequence token or max length).
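The token-selection step can be illustrated with a small, self-contained sampler over a vector of next-token logits; the top-k and top-p values below are common defaults, not the settings of any particular ChatGPT deployment.

```python
import numpy as np

def sample_next_token(logits, strategy="nucleus", k=50, p=0.9, rng=np.random.default_rng()):
    """Pick the next token id from a 1-D array of vocabulary logits."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over the vocabulary

    if strategy == "greedy":
        return int(np.argmax(probs))          # always take the most likely token

    if strategy == "top_k":
        top = np.argsort(probs)[-k:]          # keep only the k most likely tokens
    elif strategy == "nucleus":
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
        top = order[:cutoff]                  # smallest set whose probability mass >= p
    else:
        raise ValueError(strategy)

    renorm = probs[top] / probs[top].sum()    # renormalise over the kept tokens
    return int(rng.choice(top, p=renorm))

# Toy usage: a fake 10-token vocabulary
logits = np.array([2.0, 1.5, 0.3, -1.0, 0.0, 0.8, -0.5, 1.1, 0.2, -2.0])
print(sample_next_token(logits, strategy="top_k", k=3))
```

Greedy decoding is deterministic; top-k and nucleus sampling trade some likelihood for diversity, which is why chat interfaces typically expose sampling controls.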

Limitations
- Fixed context window: Only a finite number of tokens can be processed (e.g., 8K–128K, depending on the model variant).
- Lack of true understanding: Outputs are probabilistic completions, not grounded in factual memory unless connected to retrieval systems.
- Susceptibility to hallucination: Confident but incorrect outputs can occur due to statistical generalisation.
Applications
Used for text generation, summarisation, translation, programming assistance, tutoring, question answering, and conversational agents. Enhanced performance is achieved via system prompts, tool integrations, and retrieval-augmented generation (RAG).
Variants
- GPT-3.5: Based on the 175B-parameter GPT-3, trained on public internet text and further tuned for dialogue.
- GPT-4: Multi-modal, improved reasoning and instruction-following.
- GPT-4-turbo: Optimised variant with lower latency and cost.
Conclusion
ChatGPT is a probabilistic autoregressive model grounded in deep neural networks and attention mechanisms, trained via self-supervision and human feedback, capable of generating coherent, contextually relevant text.
References:
- Vaswani, A., et al. (2017). “Attention Is All You Need.” Advances in Neural Information Processing Systems (NeurIPS 2017). https://arxiv.org/abs/1706.03762