An LLM Primer
This section provides a high-level overview of key concepts related to Large Language Models (LLMs). If you are familiar with LLMs, you may want to proceed to LLM Ops.
Large Language Models (LLMs) are a branch of Machine Learning (ML) that, in their most recent form, are built on the Transformer architecture. They serve as a foundational component in enabling computers to understand and generate human language. Drawing from large datasets, they learn linguistic structures, semantics, and context through a process called training. In their most sophisticated form, these models can produce remarkably human-like text, answer questions, translate languages, and handle tasks such as document analysis or SQL query generation.
Model Foundations
Large language models are built by training on massive datasets of tokenized text to learn the patterns and relationships between words. Through a compute-intensive process, the model ingests sequences of tokens and learns to predict the next token in context. As training progresses, the model's parameters are tuned to generate human-like responses. Model size, measured in parameters, is the primary indicator of its power and performance. State-of-the-art models have hundreds of billions of parameters, while most open-source models range from 7 to 65 billion parameters.
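To make the training objective concrete, here is a minimal sketch of next-token prediction in PyTorch. The vocabulary size, random token batch, and tiny model are illustrative placeholders rather than a real LLM (it even omits the causal attention mask a real decoder-only model uses); the point is that the loss compares each predicted token against the token that actually follows it.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; real LLMs use far larger vocabularies and models.
vocab_size, embed_dim, seq_len, batch_size = 1000, 64, 16, 4

# A toy "language model": embedding -> transformer layer -> vocabulary logits.
# (No causal mask here, unlike a real decoder-only LLM.)
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
    nn.Linear(embed_dim, vocab_size),
)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len))  # stand-in for tokenized text

logits = model(tokens)           # (batch, seq_len, vocab_size)
predictions = logits[:, :-1, :]  # predictions for positions 0 .. n-1
targets = tokens[:, 1:]          # the token that actually comes next at each position

# Next-token prediction: cross-entropy between predicted and actual next tokens.
loss = nn.functional.cross_entropy(
    predictions.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # gradients from this loss are what tune the parameters
```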
- Name: Token
- Type: Numerical representation of language
- Description: Tokens are unique numerical representations of words or partial words. Tokenization is what allows LLMs to handle and process text data. Most LLMs average roughly 1.3 tokens per English word (see the tokenization sketch after this list).
- Name: Parameters
- Type: Internalized knowledge
- Description: Parameters represent the learned patterns and relationships between tokens in the training data. ML engineers convert massive datasets into tokenized training data. Commonly used datasets include The Pile, Common Crawl, OpenAssistant Conversations, and scraped websites (Reddit, Stack Overflow, GitHub).
- Name: Training
- Type: Model computation
- Description: Training is the process of converting tokenized content into model parameters; the result is a reusable model for inference. The model is fed sequences of training tokens and learns to predict the next token in each sequence. The goal is to tune the model's parameters so it produces accurate and contextually appropriate responses.
- Name: Model Size
- Type: Number of parameters
- Description: The number of parameters is the typical measurement of model size. State-of-the-art models (GPT-4, PaLM 2) trend toward hundreds of billions to trillions of parameters, while emerging open-source models (MPT, Falcon) trend between 7B and 65B parameters.
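As a quick illustration of tokenization and the token-to-word ratio mentioned above, this sketch uses OpenAI's open-source tiktoken library with the cl100k_base encoding (the one used by the GPT-3.5/GPT-4 family). The exact ratio depends on the text and the tokenizer, so treat the output as indicative.

```python
# pip install tiktoken
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

text = "Large language models convert text into tokens before processing it."
tokens = encoding.encode(text)  # list of integer token IDs
words = text.split()

print(f"Words:           {len(words)}")
print(f"Tokens:          {len(tokens)}")
print(f"Tokens per word: {len(tokens) / len(words):.2f}")  # typically ~1.3 for English prose
```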
Model Inference
Inference is the process of using a trained LLM to generate predictions from new input. The model is loaded into memory, new data is presented in the form of a prompt, and the model generates a completion. The size of the context window has a significant impact on how much information the LLM can take into account when generating its predictions. A minimal inference sketch follows the definitions below.
- Name: Context Window
- Type: Total tokens at inference
- Description: The context window is the total number of tokens available during inference, covering both the input prompt and the generated output. Early versions of GPT-3 and most open-source models have a context window of 2,048 tokens. GPT-3.5 has a context window of 4,096 tokens, the GPT-3.5-turbo 16k variant has 16,385 tokens, and GPT-4 has 8,192 tokens.
- Name: Prompt
- Type: Initial model input
- Description: A prompt provides the initial input that steers the model's response in a particular direction. Like setting the stage, a prompt focuses the model on a specific topic, style, or genre, which narrows the model's internal search space.
- Name: Completion
- Type: Model-generated response
- Description: The completion is the text generated by the model in response to the prompt. The length and variability of the completion depend on the prompt and on model configuration parameters such as temperature and max tokens.
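To tie these terms together, here is a minimal inference sketch using the Hugging Face transformers library, with the small, openly available GPT-2 model standing in for a larger LLM. The prompt, temperature, and max-token values are illustrative; a production setup would differ, but the prompt-to-completion flow is the same.

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 is a small stand-in here; a real deployment would load a much larger model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Large language models are"
inputs = tokenizer(prompt, return_tensors="pt")

# Prompt tokens plus generated tokens must fit inside the context window
# (1,024 tokens for GPT-2).
outputs = model.generate(
    **inputs,
    max_new_tokens=40,   # cap on the completion length
    do_sample=True,      # sample rather than greedy decoding
    temperature=0.7,     # lower = more deterministic, higher = more varied
)

completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(completion)
```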
Hardware Requirements
Large language models require substantial computational resources for both training and inference. A cloud GPU server can cost between $100 and $1,000 per day, depending on the number of GPUs and the amount of memory.
Training LLMs typically requires multiple high-end GPUs, such as NVIDIA's A100 or H100, which offer large memory capacity (80 GB per card) and substantial processing power. Meta's LLaMA model (65B parameters) was trained on 2,048 A100 80 GB GPUs over 21 days. Renting 2,048 A100 GPUs from AWS (256 P4de instances) for 21 days would cost approximately $3.8M.
Inference breakthroughs now make it possible to run LLMs locally on Apple M1/M2 hardware.
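The cost estimate above can be reproduced with back-of-the-envelope arithmetic. In the sketch below, the hourly instance rate is an assumed placeholder (actual AWS pricing varies by region, commitment, and over time), so the result is an order-of-magnitude estimate rather than a quote.

```python
# Back-of-the-envelope training cost estimate. The hourly rate is an ASSUMPTION;
# check current AWS pricing before relying on the number.
gpus_needed = 2048        # A100 80 GB GPUs used to train LLaMA 65B
gpus_per_instance = 8     # each AWS p4de.24xlarge instance provides 8 x A100 80 GB
training_days = 21
hourly_rate_usd = 31.0    # assumed effective rate per instance-hour

instances = gpus_needed // gpus_per_instance  # 256 instances
hours = training_days * 24                    # 504 hours
total_cost = instances * hours * hourly_rate_usd

print(f"{instances} instances x {hours} h x ${hourly_rate_usd}/h = ${total_cost:,.0f}")
# With these assumptions: roughly $4.0M, in the same ballpark as the $3.8M figure above.
```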
- Name: Training
- Type: Millions of dollars
- Description: Training hardware requirements are a function of parameter count, batch size, and training time. The compute applied to the largest training runs has roughly doubled every 3.4 months, driving significant breakthroughs each year. Training GPT-3 cost roughly $5M in 2020, but would cost around $500K in 2023, primarily due to software improvements.
- Name: Finetuning
- Type: Thousands of dollars
- Description: Finetuning hardware requirements are a function of model size, batch size, and finetuning time. Finetuning takes a pre-trained model and continues training it on a new dataset; it is significantly faster and cheaper than training from scratch. The LoRA (Low-Rank Adaptation) method makes it possible to finetune a 65B-parameter model on a new instruction-following dataset in hours to days. Finetuning requires roughly 12x the GPU memory of the model size.
- Name: Inference
- Type: Hundreds of dollars
- Description: Inference hardware requirements are a function of model size, context window, and the number of concurrent inference requests. Inference typically requires about 2.1x the GPU memory of the model size, because both the model parameters and the intermediate activations of the forward pass must be held in memory. Quantization breakthroughs enable running LLMs on significantly smaller hardware, requiring only about 70% of the GPU memory of the model size (the sketch after this list applies these multipliers).
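These rules of thumb translate into a quick memory estimator. The sketch below reads the multipliers quoted above (12x for finetuning, 2.1x for inference, 0.7x with quantization) as GB of GPU memory per billion parameters, which is one plausible interpretation of "the GPU memory of the model size" and is an assumption here; actual requirements also depend on precision, batch size, and sequence length.

```python
# Rough GPU memory estimates using the rules of thumb quoted above, read as
# "GB of GPU memory per billion parameters" (an assumption; actual usage also
# depends on precision, batch size, and sequence length).

MULTIPLIERS = {
    "finetuning": 12.0,            # weights + gradients + optimizer state
    "inference (fp16)": 2.1,       # weights + activations during the forward pass
    "inference (quantized)": 0.7,  # e.g. 4-bit weights + overhead
}

def estimate_gb(params_billions: float, multiplier: float) -> float:
    """GPU memory estimate in GB for a model with the given parameter count."""
    return params_billions * multiplier

for params in (7, 13, 65):  # typical open-source model sizes, in billions
    line = ", ".join(
        f"{name}: ~{estimate_gb(params, m):.0f} GB" for name, m in MULTIPLIERS.items()
    )
    print(f"{params}B parameters -> {line}")
```

Under these assumptions, a quantized 65B model (~46 GB) fits on a single 80 GB A100, which is consistent with the claim that quantization enables significantly smaller hardware.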