LLM Ops

As organizations adopt LLMs into their product and operations workflows, a new practice called "LLM Ops" has rapidly emerged to address the challenges of scaling these services in production and the new skills doing so requires.

LLM Ops is an extension of MLOps focused on managing the entire lifecycle of large language models. Unlike traditional ML, deploying LLM functionality often does not require a data science or ML engineering team, which introduces unfamiliar challenges for the teams responsible for this work. This guide focuses primarily on the concerns of product, engineering, and operations teams responsible for deploying and maintaining LLM functionality in production.


Model Providers

  • OpenAI: Ada, Davinci, GPT-3.5-turbo, GPT-4. Industry-leading SOTA models across all use cases, with a focus on English.

  • Azure OpenAI: Ada, Davinci, GPT-3.5-turbo, GPT-4. Azure private-cloud deployments of OpenAI models.

  • PPLX Labs: pplx-7b-online, pplx-70b-online. Online models connected to search engines for up-to-date generations. Best for search retrieval.

  • Anthropic: Claude 2, Claude Instant. Innovating on alignment, with fast, large context window (100k-200k) models. Best for chat or creative writing.

  • Google Vertex: PaLM 2 Text Bison, PaLM 2 Chat Bison, Embeddings Gecko. GCP private-cloud deployments of Google models innovating on speed and translation. Best for factual recall and translation.

  • Cloudflare AI: Mistral 7B, Llama 2. Edge-deployed open source models.

  • AWS Bedrock: Amazon Titan, AI21, Anthropic. AWS private-cloud deployments of Amazon FMs, AI21, and Anthropic models, plus additional open source models from Stability and others.

  • Huggingface & Replicate: Open source models. API-first deployment of open source models (Alpaca, Falcon, Meta LLaMA, Vicuna, and Wizard Vicuna).

  • AI21: J2, J2 Instruct, task-specific APIs. Innovating on instruct and writing models with greater support for EU languages.

  • NLP Cloud & GooseAI: EleutherAI models. API-first deployment of open source EleutherAI models (GPT-NeoX, GPT-J).

OpenAI, Anthropic, AI21, Google, AWS, and Huggingface/Replicate provide the current state-of-the-art (SOTA) models. OpenAI and Anthropic are the market leaders in research and innovation, putting outsized effort into model alignment and safety as well as high-speed, high-quality capabilities. These providers represent the majority of the market; however, there is a long tail of smaller providers and open source models.

Klu supports all major providers (OpenAI, Anthropic, Google Vertex, AWS Bedrock) and open source models via Huggingface or Replicate deployments.

Models

Large Language Models are available via API or in private clouds. As of Summer 2023, all major cloud providers enable private-cloud deployment of SOTA models or have this capability on their roadmap, giving organizations choice and flexibility in how they deploy and manage their models. Estimates place the R&D cost of GPT-4, including compute, research, and engineering, at $150M. Open source progress is accelerating via the self-organizing community building on Meta's LLaMA.

After deploying LLMs in production, monitoring their performance becomes crucial to ensure they continue delivering the desired results with real-world data and usage. Models are generally categorized by parameter count and by the number of tokens they can process in a single inference (the context window). The larger the model, the more accurate and capable it is, but also the more expensive and slower its inference. Parameter count and context window size are the two most important factors in determining a model's cost and speed.

2023 Benchmarks

All benchmarks point to OpenAI's GPT-4 as the most capable and comprehensive model; however, other models may be more appropriate for cost, speed, or privacy needs. For example, translating 100 tokens from English to Georgian takes just over 300s with OpenAI GPT-4, whereas Anthropic Claude takes just over 15s and Google Bison just 2s. OpenAI's GPT-3.5-turbo is the benchmark against which all other models are currently compared in terms of cost, speed, accuracy, and user preference.

Benchmarks from H1 2023 show Azure GPT-3.5-turbo is 3x faster per token than OpenAI's GPT-3.5-turbo. Comparable open source models are 2-3 years behind the SOTA models.

Klu supports A/B model benchmarking, enabling selection of the best provider/model for your use case.
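
For teams rolling their own comparison, a minimal latency A/B harness might look like the sketch below. It assumes the 2023-era openai Python SDK (pre-1.0 interface) and an illustrative prompt; it is not Klu's benchmarking implementation.

```python
# Minimal A/B latency benchmark across two models, assuming the
# 2023-era openai Python SDK (pip install "openai<1.0").
import time
import openai

openai.api_key = "sk-..."  # your API key

def time_completion(model: str, prompt: str) -> float:
    """Return wall-clock seconds for one chat completion."""
    start = time.perf_counter()
    openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return time.perf_counter() - start

prompt = "Summarize the benefits of streaming LLM output in two sentences."
for model in ("gpt-3.5-turbo", "gpt-4"):
    runs = [time_completion(model, prompt) for _ in range(3)]
    print(f"{model}: avg {sum(runs) / len(runs):.1f}s over {len(runs)} runs")
```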


Cost

  • API Usage: input and output tokens. $0.0001-$0.12 per 1,000 tokens, variable by model and token length.

  • Fine-tuning: training and usage. Variable by training data size and usage tokens.

  • Infrastructure: GPU or TPU. $100-$1,000 per day per model for self-hosted models in production.

API-first models offer the fastest time to market and the lowest cost of ownership until you reach scale, much like any SaaS service or cloud infrastructure.
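
To see how model choice and prompt length drive spend, here is a back-of-the-envelope estimate in Python. The per-token prices are OpenAI's published 2023 list prices and will drift; treat them as placeholders to verify against current pricing.

```python
# Rough monthly API cost from token volumes. Prices are 2023 list
# prices in USD per 1,000 tokens and are assumptions to verify.
PRICES = {
    "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
    "gpt-4":         {"input": 0.03,   "output": 0.06},
}

def monthly_cost(model: str, prompt_tokens: int, output_tokens: int,
                 requests_per_day: int, days: int = 30) -> float:
    price = PRICES[model]
    per_request = (prompt_tokens / 1000) * price["input"] \
                + (output_tokens / 1000) * price["output"]
    return per_request * requests_per_day * days

# 10,000 requests/day, 500-token prompt, 250-token completion:
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 500, 250, 10_000):,.0f}/month")
# gpt-3.5-turbo: $375/month vs gpt-4: $9,000/month for the same load
```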

Implementing strategies like prompt engineering (smaller or shorter prompts), fine-tuning to minimize prompt length, and caching helps optimize production costs. Choosing the smallest model that meets your requirements offers the most leverage when reducing cost and resource demands.
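
As one illustration of caching, the sketch below memoizes exact-match prompts in memory so repeat requests cost zero tokens. Production systems typically use a shared store like Redis, sometimes with semantic (embedding-based) matching; this is only the idea in miniature.

```python
# Exact-match response cache keyed on model + prompt (in-memory sketch).
import hashlib
import openai

_cache: dict[str, str] = {}

def cached_completion(model: str, prompt: str) -> str:
    key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    if key not in _cache:
        response = openai.ChatCompletion.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        _cache[key] = response.choices[0].message.content
    return _cache[key]  # repeat prompts are served without an API call
```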

Klu recommends starting with SOTA API-first models and then moving to private-cloud deployments for specific use cases as usage scales.


Privacy

Privacy is a key concern for many organizations. The most private, yet most costly and resource-intensive, option is to deploy open source models in your own private cloud. As of Q1 2023, OpenAI does not use API data for model training or improvement. OpenAI monitors the API for abuse and misuse, retaining data logs for up to 30 days. Private-cloud deployments through Azure also retain data, but offer opt-out programs that limit Microsoft's access to your data.

Klu recommends starting with API-first models to validate use cases and demand. The Klu Platform enables customers to obfuscate and redact sensitive data sent to models. As usage scales, Klu recommends moving to hybrid or private-cloud deployments for sensitive use cases.
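
As an illustration of the redaction idea (not Klu's implementation), a minimal regex-based scrubber for emails and phone numbers might look like the following; real systems usually add NER-based PII detection on top.

```python
# Regex-based PII redaction before a prompt leaves your infrastructure.
import re

PATTERNS = {
    "[EMAIL]": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "[PHONE]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(redact("Contact Jane at jane@example.com or +1 (555) 010-9999."))
# -> Contact Jane at [EMAIL] or [PHONE].
```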


Latency

Unlike most APIs, LLM requests may take seconds to minutes to complete. This is due to factors like model size, current workload, token type, network conditions, and the sequential nature of model output. Azure GPT-3.5-turbo completes 1,000 tokens (~700 words) in 10-15s, Anthropic Claude 2 in 25-40s, OpenAI GPT-4 in 60-80s, and Google Bison Chat in 7-9s.

Techniques like streaming can mitigate latency by allowing an interface to start rendering tokens before the full completion finishes. Additional UX techniques like thinking / reading / processing indicators help set expectations with users.
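
With the 2023-era openai Python SDK, for example, streaming is a single flag; tokens print as they arrive instead of after the full completion:

```python
# Stream tokens as they are generated (2023-era openai SDK).
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Explain LLM streaming briefly."}],
    stream=True,  # yields chunks as the model generates them
)

for chunk in response:
    delta = chunk.choices[0].delta  # incremental piece of the message
    print(delta.get("content", ""), end="", flush=True)
print()
```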

Klu enables both non-streaming and streaming outputs, and we recommend using streaming outputs for most real-time interaction use cases.


Availability

Due to the current demand for Generative AI and LLMs, most models offered through APIs and private-cloud deployments come with limited access. Until July 2023, GPT-4 was not generally available. Most SOTA models are accessible, but with limits on tokens or requests per minute. API requests may be queued or rejected during high-demand periods. OpenAI experiences partial or complete API outages 1-2+ times per month.

Klu automatically retries requests during API availability issues or downtime.
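
If you call provider APIs directly, a minimal retry with exponential backoff (again assuming the 2023-era openai SDK and its documented error classes) might look like:

```python
# Retry transient API failures with exponential backoff.
import time
import openai

def complete_with_retry(model: str, prompt: str, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
        except (openai.error.RateLimitError,
                openai.error.APIError,
                openai.error.Timeout):
            if attempt == max_retries - 1:
                raise  # retries exhausted; surface the error
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, 8s, ...
```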


Alignment

AI alignment exists on a spectrum from instruction following to high-quality outputs. Organizations like OpenAI and Anthropic invest considerable resources in alignment research and development. For most users, alignment means following input prompts, but for many general-public use cases, it means preventing biased or racist generations that carry brand or reputational risk. Most open source models have instruction-following alignment but lack bias or safety alignment. Chat models like Anthropic Claude and OpenAI GPT-4 feature significant fine-tuning based on human feedback for output quality and style.


Hallucination

Hallucination is the phenomenon where the model generates text that is syntactically and semantically correct but disconnected from reality or based on false assumptions. Here is a quick example:

Input: Where did Jeff Bezos go to school?
Output: Jeff Bezos attended Stanford University for his undergraduate studies.

Bezos attended Princeton, not Stanford. The model hallucinated the answer, yet the sentence and grammar are correct; the output matches no training data. This can lead to the generation of incorrect, nonsensical, or fabricated information, and it occurs more frequently in smaller models or with topics that are under-represented in the training data. The issue arises because the model predicts the most likely next word instead of recalling the actual fact. It's like a person retelling a story and getting one or two details wrong without noticing the mistake.

Klu recommends testing new models or experiences with small groups of users to identify and correct potential hallucinations. The Klu Platform enables users to flag hallucinations and provide feedback to improve model accuracy. Additionally, Klu supports knowledge retrieval, providing accurate context to the model to reduce hallucinations.
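
To make the retrieval idea concrete, the sketch below embeds a few reference passages, retrieves the one closest to the question, and grounds the prompt with it. The passages are illustrative and the calls assume the 2023-era openai SDK plus numpy; this is not Klu's retrieval implementation.

```python
# Minimal retrieval-augmented prompt: ground the model in a known fact.
import numpy as np
import openai

def embed(texts: list[str]) -> np.ndarray:
    response = openai.Embedding.create(
        model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in response["data"]])

passages = [
    "Jeff Bezos graduated from Princeton University in 1986.",
    "Amazon was founded in Bellevue, Washington in 1994.",
]
passage_vectors = embed(passages)

question = "Where did Jeff Bezos go to school?"
question_vector = embed([question])[0]

# Rank passages by dot product (ada-002 embeddings are near unit-norm,
# so this approximates cosine similarity well enough for a sketch).
context = passages[int(np.argmax(passage_vectors @ question_vector))]

answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content":
               f"Answer using only this context:\n{context}\n\nQ: {question}"}],
)
print(answer.choices[0].message.content)
```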