Evaluating Prompt & Model Performance in Klu

As you begin to use LLMs in production, you need an efficient way to manage and leverage the data they generate. Klu allows you to easily import your existing data via the UI or SDK, manage and filter feedback, and export data whenever necessary.


Create Test Cases

Importing your evaluation data into Klu is straightforward, and it is the first step toward improving your model's performance.

Importing via the UI

Go to the Evaluations section in your App and click on Import Evaluation Data in the header menu. For optimal compatibility, we suggest using the JSONL format, for which we provide easy-to-use templates.
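
For reference, here is a minimal sketch of producing a JSONL import file in Python. The field names (prompt, response, score) are assumptions based on the SDK example below rather than the exact template schema; download the provided templates for the authoritative format.

import json

# Hypothetical evaluation records; field names are illustrative, not the official template schema.
records = [
    {"prompt": "What does Klu do?", "response": "Klu is a platform for building LLM apps.", "score": 1},
    {"prompt": "Summarize this ticket", "response": "The customer requests a refund.", "score": 0},
]

# JSONL is one JSON object per line, which can then be uploaded via Import Evaluation Data.
with open("evaluation_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")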

Importing via the SDK

The Klu SDK simplifies the process of importing your evaluation data. Follow these steps:

  • Initialize your Workspace and acquire your API key
  • Generate a new App and Evaluation Action
  • Specify the Evaluation Action GUID upon import

Evaluation Data Import

from klu import Klu

# Authenticate with your Workspace API key.
klu = Klu("YOUR_API_KEY")

# The prompt/response pair you want to store as evaluation data.
prompt = "Your evaluation prompt"
response = "Model's response"

# GUID of the Evaluation Action created for this App.
evaluation_action = "evaluation_action_guid"

# Score assigned to this response based on your criteria.
evaluation_score = "Score based on your criteria"

# Create the evaluation data point and print the result.
eval_data = klu.evaluation.create(
    prompt=prompt,
    response=response,
    evaluation_action=evaluation_action,
    score=evaluation_score,
)

print(eval_data)

This process allows for the efficient integration of your evaluation data into Klu, facilitating model optimization.


Create Evaluators

Access the Evaluations section through the sidebar navigation in your App.

This area displays all the evaluation data points for your application, which you can filter to isolate specific datasets.

Klu currently supports these filters for evaluation data:

  • Date Range: Select the period for the evaluation data.
  • Evaluation Action: Identify the specific action associated with the evaluation.
  • Prompt/Response Search: Find particular prompts or responses.
  • Score: Filter by the evaluation scores, allowing for analysis of model performance.
  • Source: Trace the origin of the evaluation data.
  • Insights: Filter by metrics such as accuracy, relevance, or any custom metric.
  • Klu User: Filter by the user or SDK key that submitted the evaluation.
  • Evaluation Issue: Isolate data based on identified issues like inaccuracies or biases.

For long-term analysis or model retraining, you have the option to save your filtered dataset as a named Evaluation Set.
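
If you prefer to slice evaluation data outside the UI, the same kind of filtering can be applied to an exported JSONL file. A minimal sketch in Python, assuming hypothetical score and created_at fields:

import json
from datetime import datetime

# Load exported evaluation records (one JSON object per line).
with open("evaluation_export.jsonl") as f:
    records = [json.loads(line) for line in f]

# Keep records from 2024 onward with a score of at least 0.8.
# "score" and "created_at" (ISO 8601) are assumed field names.
start = datetime(2024, 1, 1)
filtered = [
    r for r in records
    if float(r.get("score", 0)) >= 0.8
    and datetime.fromisoformat(r["created_at"]) >= start
]

print(f"{len(filtered)} of {len(records)} records match the filters")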


Run Evals

Klu enables you to give feedback on any evaluation data point. This feedback is crucial for refining and enhancing model outputs. Use the thumbs up and thumbs down icons in the evaluation section to rate an evaluation as positive or negative.

For more detailed feedback, such as editing responses, flagging them, or adding context, click on the icon to the right of the evaluation data point.

In the feedback modal, you can specify:

  • Evaluation Context - provide additional context for the evaluation, such as the scenario or parameters used.
  • Model Issue - identify issues with the model's response, including inaccuracies or biases.
  • Response Improvement - suggest improvements or corrections to the model's response.

These feedback mechanisms are invaluable for continuous model improvement and optimization.
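
As a rough illustration, the modal's fields map naturally onto a simple record you could keep alongside your own logs. The structure below is a sketch, not an official Klu schema:

# Hypothetical structure mirroring the feedback modal fields; not an official Klu schema.
feedback = {
    "rating": "thumbs_down",  # thumbs_up or thumbs_down
    "evaluation_context": "Tested against a long, multi-part customer email.",
    "model_issue": "The response skipped the second question in the email.",
    "response_improvement": "Answer both questions and keep the reply under 100 words.",
}

# Tie the feedback to the evaluation data point it describes.
feedback_record = {"evaluation_id": "evaluation_data_point_guid", "feedback": feedback}
print(feedback_record)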


Compare Performance

Klu empowers you to maintain control over your evaluation data, offering flexible export options when needed.

We provide two formats for exporting evaluation data:

  • CSV - for easy analysis in spreadsheet applications
  • JSONL - for use in other systems, such as model retraining or further evaluations.

Exporting evaluation data is also supported through the API or SDK.
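
For example, an exported CSV can be inspected directly in Python before handing it to a spreadsheet or a retraining pipeline. A short sketch, assuming the export has a score column:

import csv

# Read an exported evaluation CSV for a quick quality check.
with open("evaluation_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# "score" is an assumed column name; compute a simple aggregate.
scores = [float(row["score"]) for row in rows if row.get("score")]
average = sum(scores) / len(scores) if scores else 0.0
print(f"{len(rows)} evaluations, average score: {average:.2f}")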

