LLM Ops

As organizations adopt LLMs into their product and operations workflows, a new practice called "LLM Ops" has rapidly emerged to address the challenges of scaling these services in production and the new skills doing so requires.

LLM Ops is an extension of MLOps focused on managing the entire lifecycle of large language models. Unlike traditional ML, deploying LLM functionality often does not require a data science or ML engineering team, which introduces unfamiliar challenges for the teams responsible for this work. This guide focuses primarily on the concerns of product, engineering, and operations teams responsible for deploying and maintaining LLM functionality in production.


Model Providers

  • OpenAI: Ada, Davinci, GPT-3.5-turbo, GPT-4. Industry-leading SOTA models across all use cases, with a focus on English.

  • Azure OpenAI: Ada, Davinci, GPT-3.5-turbo, GPT-4. Azure private-cloud deployments of OpenAI models.

  • PPLX Labs: pplx-7b-online, pplx-70b-online. Online models connected to search engines for up-to-date generations. Best for search retrieval.

  • Anthropic: Claude 2, Claude Instant. Innovating on alignment, with fast, large context window (100k-200k) models. Best for chat or creative writing.

  • Google Vertex: PaLM 2 Text Bison, PaLM 2 Chat Bison, Embeddings Gecko. GCP private-cloud deployments of Google models innovating on speed and translation. Best for factual recall and translation.

  • Cloudflare AI: Mistral 7B, Llama 2. Edge-deployed open source models.

  • AWS Bedrock: Amazon Titan, AI21, Anthropic. AWS private-cloud deployments of Amazon FMs, AI21, and Anthropic models, plus additional open source models from Stability and others.

  • Huggingface & Replicate: Open source models. API-first deployment of open source models (Alpaca, Falcon, Meta LLaMA, Vicuna, and Wizard Vicuna).

  • AI21: J2, J2 Instruct, task-specific APIs. Innovating on instruct and writing models with greater support for EU languages.

  • NLP Cloud & GooseAI: EleutherAI models. API-first deployment of open source EleutherAI models (GPT-NeoX, GPT-J).

OpenAI, Anthropic, AI21, Google, AWS, and Huggingface/Replicate provide the current state-of-the-art (SOTA) models. OpenAI and Anthropic are the market leaders in research and innovation, putting outsized effort into model alignment and safety as well as high-speed, high-quality capabilities. These providers represent the majority of the market; however, there is a long tail of smaller providers and open source models.

Klu supports all major providers (OpenAI, Anthropic, Google Vertex, AWS Bedrock) and open source models via Huggingface or Replicate deployments.

Models

Large Language Models are available via API or in private clouds. As of Summer 2023, all major cloud providers enable private-cloud deployment of SOTA models or have this capability on their roadmap, giving organizations choice and flexibility in how they deploy and manage their models. Estimates place the R&D cost of GPT-4, including compute, research, and engineering, at $150M. Open source progress is accelerating via the self-organizing community building on Meta's LLaMA.

After deploying LLMs in production, monitoring their performance becomes crucial to ensure they continue delivering the desired results with real-world data and usage. Models are generally categorized by parameter count and by the number of tokens they can process in a single inference (the context window). The larger the model, the more accurate and capable it is, but also the more expensive and slower its inference. Parameter count and context window size are the two most important factors in determining a model's cost and speed.

2023 Benchmarks

All benchmarks point to OpenAI's GPT-4 as the most capable and comprehensive model; however, other models may be more appropriate for cost, speed, or privacy needs. For example, translating 100 tokens from English to Georgian takes just over 300s with OpenAI GPT-4, whereas Anthropic Claude takes just over 15s and Google Bison just 2s. OpenAI's GPT-3.5-turbo is the benchmark against which all other models are currently compared in terms of cost, speed, accuracy, and user preference.

Benchmarks from H1 2023 show Azure GPT-3.5-turbo is 3x faster per token than OpenAI's GPT-3.5-turbo. Comparable open source models are 2-3 years behind the SOTA models.

Klu supports A/B model benchmarking, enabling selection of the best provider/model for your use case.
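
For teams rolling their own comparison, a minimal latency A/B harness might look like the sketch below. It assumes the 2023-era openai Python SDK (pre-1.0 interface) and an illustrative prompt; it is not Klu's benchmarking implementation.

```python
# Minimal A/B latency benchmark across two models, assuming the
# 2023-era openai Python SDK (pip install "openai<1.0").
import time
import openai

openai.api_key = "sk-..."  # your API key

def time_completion(model: str, prompt: str) -> float:
    """Return wall-clock seconds for one chat completion."""
    start = time.perf_counter()
    openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return time.perf_counter() - start

prompt = "Summarize the benefits of streaming LLM output in two sentences."
for model in ("gpt-3.5-turbo", "gpt-4"):
    runs = [time_completion(model, prompt) for _ in range(3)]
    print(f"{model}: avg {sum(runs) / len(runs):.1f}s over {len(runs)} runs")
```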


Cost

  • API Usage: input and output tokens. $0.0001-$0.12 per 1,000 tokens, variable by model and token length.

  • Fine-tuning: training and usage. Variable by training data size and usage tokens.

  • Infrastructure: GPU or TPU. $100-$1,000 per day per model for self-hosted models in production.

API-first models offer the fastest time to market and the lowest cost of ownership until you reach scale, much like any SaaS service or cloud infrastructure.
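
To see how model choice and prompt length drive spend, here is a back-of-the-envelope estimate in Python. The per-token prices are OpenAI's published 2023 list prices and will drift; treat them as placeholders to verify against current pricing.

```python
# Rough monthly API cost from token volumes. Prices are 2023 list
# prices in USD per 1,000 tokens and are assumptions to verify.
PRICES = {
    "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
    "gpt-4":         {"input": 0.03,   "output": 0.06},
}

def monthly_cost(model: str, prompt_tokens: int, output_tokens: int,
                 requests_per_day: int, days: int = 30) -> float:
    price = PRICES[model]
    per_request = (prompt_tokens / 1000) * price["input"] \
                + (output_tokens / 1000) * price["output"]
    return per_request * requests_per_day * days

# 10,000 requests/day, 500-token prompt, 250-token completion:
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 500, 250, 10_000):,.0f}/month")
# gpt-3.5-turbo: $375/month vs gpt-4: $9,000/month for the same load
```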

Implementing strategies like prompt engineering (smaller or shorter prompts), fine-tuning to minimize prompt length, and caching helps optimize production costs. Choosing the smallest model that meets your requirements offers the most leverage when reducing cost and resource demands.
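
As one illustration of caching, the sketch below memoizes exact-match prompts in memory so repeat requests cost zero tokens. Production systems typically use a shared store like Redis, sometimes with semantic (embedding-based) matching; this is only the idea in miniature.

```python
# Exact-match response cache keyed on model + prompt (in-memory sketch).
import hashlib
import openai

_cache: dict[str, str] = {}

def cached_completion(model: str, prompt: str) -> str:
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key not in _cache:
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        _cache[key] = response.choices[0].message.content
    return _cache[key]  # repeat prompts are served without an API call
```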

Klu recommends starting with SOTA API-first models and then moving to private-cloud deployments for specific use cases as usage scales.


Privacy

Privacy is a key concern for many organizations. The most private, yet most costly and resource-intensive, option is to deploy open source models in your own private cloud. As of Q1 2023, OpenAI does not use API data for model training or improvement. OpenAI monitors the API for abuse and misuse, retaining data logs for up to 30 days. Private-cloud deployments through Azure also retain data, but offer opt-out programs that limit Microsoft's access to your data.

Klu recommends starting with API-first models to validate use cases and demand. The Klu Platform enables customers to obfuscate and redact sensitive data sent to models. As usage scales, Klu recommends moving to hybrid or private-cloud deployments for sensitive use cases.
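
As an illustration of the redaction idea (not Klu's implementation), a minimal regex-based scrubber for emails and phone numbers might look like the following; real systems usually add NER-based PII detection on top.

```python
# Regex-based PII redaction before a prompt leaves your infrastructure.
import re

PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "[PHONE]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(redact("Contact Jane at jane@example.com or +1 (555) 010-9999."))
# -> Contact Jane at [EMAIL] or [PHONE].
```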


Latency

Unlike most APIs, LLM requests may take seconds to minutes to complete. This is due to factors like model size, current workload, token type, network conditions, and the sequential nature of model output. Azure GPT-3.5-turbo completes 1,000 tokens (~700 words) in 10-15s, Anthropic Claude 2 in 25-40s, OpenAI GPT-4 in 60-80s, and Google Bison Chat in 7-9s.

Techniques like streaming can mitigate latency by allowing an interface to start rendering tokens before the full completion finishes. Additional UX techniques like thinking / reading / processing indicators help set expectations with users.
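
With the 2023-era openai Python SDK, for example, streaming is a single flag; tokens print as they arrive instead of after the full completion:

```python
# Stream tokens as they are generated (2023-era openai SDK).
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain LLM streaming briefly."}],
    stream=True,  # yields chunks as the model generates them
)

for chunk in response:
    delta = chunk.choices[0].delta  # incremental piece of the message
    print(delta.get("content", ""), end="", flush=True)
print()
```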

Klu enables both non-streaming and streaming outputs, and we recommend using streaming outputs for most real-time interaction use cases.


Availability

Due to the current demand for Generative AI and LLMs, most models offered through APIs and private-cloud deployments come with limited access. Until July 2023, GPT-4 was not generally available. Most SOTA models are accessible, but with limits on tokens or requests per minute. API requests may be queued or rejected during high-demand periods. OpenAI experiences partial or complete API outages 1-2+ times per month.

Klu automatically retries requests during API availability issues or downtime.
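
If you call provider APIs directly, a minimal retry with exponential backoff (again assuming the 2023-era openai SDK and its documented error classes) might look like:

```python
# Retry transient API failures with exponential backoff.
import time
import openai

def complete_with_retry(model: str, prompt: str, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
        except (openai.error.RateLimitError,
                openai.error.APIError,
                openai.error.Timeout):
            if attempt == max_retries - 1:
                raise  # retries exhausted; surface the error
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, 8s, ...
```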


Alignment

AI alignment exists on a spectrum from instruction following to high-quality outputs. Organizations like OpenAI and Anthropic invest considerable resources in alignment research and development. For most users, alignment means following input prompts, but for many general-public use cases, it means preventing biased or racist generations that carry brand or reputational risk. Most open source models have instruction-following alignment but lack bias or safety alignment. Chat models like Anthropic Claude and OpenAI GPT-4 feature significant fine-tuning based on human feedback for output quality and style.


Hallucination

Hallucination is the phenomenon where the model generates text that is syntactically and semantically correct but disconnected from reality or based on false assumptions. Here is a quick example:

Input: Where did Jeff Bezos go to school?
Output: Jeff Bezos attended Stanford University for his undergraduate studies.

Bezos attended Princeton, not Stanford. The model hallucinated the answer, yet the sentence and grammar are correct; the output matches no training data. This can lead to the generation of incorrect, nonsensical, or fabricated information, and it occurs more frequently in smaller models or with topics that are under-represented in the training data. The issue arises because the model predicts the most likely next word instead of recalling the actual fact. It's like a person retelling a story and getting one or two details wrong without noticing the mistake.

Klu recommends testing new models or experiences with small groups of users to identify and correct potential hallucinations. The Klu Platform enables users to flag hallucinations and provide feedback to improve model accuracy. Additionally, Klu supports knowledge retrieval, providing accurate context to the model to reduce hallucinations.
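
To make the retrieval idea concrete, the sketch below embeds a few reference passages, retrieves the one closest to the question, and grounds the prompt with it. The passages are illustrative and the calls assume the 2023-era openai SDK plus numpy; this is not Klu's retrieval implementation.

```python
# Minimal retrieval-augmented prompt: ground the model in a known fact.
import numpy as np
import openai

def embed(texts: list[str]) -> np.ndarray:
    response = openai.Embedding.create(
        model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in response["data"]])

passages = [
    "Jeff Bezos graduated from Princeton University in 1986.",
    "Amazon was founded in Bellevue, Washington in 1994.",
]
passage_vectors = embed(passages)

question = "Where did Jeff Bezos go to school?"
question_vector = embed([question])[0]

# Rank passages by dot product (ada-002 embeddings are near unit-norm,
# so this approximates cosine similarity well enough for a sketch).
context = passages[int(np.argmax(passage_vectors @ question_vector))]

answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content":
               f"Answer using only this context:\n{context}\n\nQ: {question}"}],
)
print(answer.choices[0].message.content)
```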