Introduction

Testing prompts and language models is crucial for ensuring the reliability and consistency of AI-powered applications. While basic tests provide a foundation, advanced testing techniques are necessary for thoroughly evaluating complex scenarios, handling edge cases, and validating multi-step prompts. This guide explores Langtail’s advanced testing features, including different assertion types and configs, best practices, and a step-by-step example.

Familiarity with Basic Tests is assumed.

When to Use Advanced Tests

Advanced tests in Langtail become especially valuable when working with complex language models or prompts that require more sophisticated validation. Consider using advanced tests in the following scenarios:

Evaluating different language models or adjusting their settings: Advanced tests allow you to compare the performance of multiple language models or experiment with different model configurations (e.g., temperature, top-p, etc.) to find the optimal setup for your use case. You can define assertions to validate the outputs and metadata across various model configurations.

Testing multi-step or conditional prompts: If your application relies on prompts that involve multiple steps, branching logic, or conditional responses based on user inputs, advanced tests can help ensure the correct flow and output for each possible path. You can create test cases that simulate different input scenarios and assert the expected behavior at each step.

This guide focuses on Langtail’s advanced testing features. To learn more about the various assertion types available and their respective use cases, refer to the Tests Overview section.

Step-by-Step Guide: Testing a Movie Recommendation Prompt

Here is a prompt that provides movie recommendations based on the user’s preferences for genre, rating, and year. Let’s take a look how we can leverage different configs in testing this prompt.

1

Set up the test suite

  1. Log into Langtail and select your project.
  2. Navigate to ‘Evalution - Tests’ to either open an existing Test or create a new one.
  3. Create a few user messages that will be sent, you can either create them manually or use LLM to create those for you.
  4. Fill out variable values.
2

Define assertions

  1. Click on “New Assertion” and select LLM
  2. Let’s check if the prompt works and it is giving us any recommendations at all.
  3. You can try to create your own LLM assertions. In this case I also created assertion to check if the Movie matched the provided criteria and if the rating of the movie is above 7.0.
  4. Optionally, create custom assertions for more complex validations.
3

Run the tests

  1. Once you’ve set up your test suite and defined the assertions, it’s time to run the tests. Click on the “Run Test” button to initiate the test execution.
  2. Langtail will process each test case, sending the user messages to the prompt and evaluating the output against the defined assertions.
  3. As the tests run, you’ll see the progress and status of each test case displayed in real-time. Langtail will indicate whether a test case passed or failed based on the assertions.
  4. If any test cases fail, you’ll have the opportunity to review the detailed output and compare it against the expected results. This can help you identify potential issues with your prompt or fine-tune the assertions for better accuracy.

Depending on the complexity of your test suite and the number of test cases, the execution time may vary. Langtail provides real-time updates to keep you informed about the test progress.

  1. Once the test execution is complete, you’ll see a summary of the test results, including the number of passed and failed test cases.
  2. You can also export the test results in CSV or JSON format for further analysis or integration with other tools.
4

Iterate and Refine

  1. Based on the test results, you may identify areas for improvement or refinement in your prompt or assertions. Langtail’s advanced testing features allow you to iterate and experiment with different configurations to optimize your prompt’s performance.

  2. Click on the “Config” tab to set up multiple configurations for your test suite. Here, you can create different configurations with varying language models, parameters, or prompt versions.

  1. For example, you could create one configuration using the default GPT-4o model and another with a specialized model fine-tuned for movie recommendations. Additionally, you can adjust parameters like temperature, max tokens, or presence penalty to see how they impact the prompt’s output.

  2. Once you’ve set up your desired configurations, you can run the tests simultaneously for all configurations. Langtail will execute the test suite for each configuration, providing you with a comprehensive view of how your prompt performs under different settings.

Try creating configurations with different versions of your prompt to evaluate how changes in wording or structure affect the output quality. This can be especially helpful when iterating on your prompt to improve its performance.

  1. Review the test results for each configuration and compare their performance. This can help you identify the optimal configuration for your movie recommendation prompt, balancing factors like output quality, diversity, and coherence.

  2. Based on your findings, you can refine your prompt, update the assertions, or adjust the configurations further. Langtail’s iterative testing approach allows you to continuously improve and fine-tune your language model until you achieve the desired results.

Regularly review and update your test suite as your prompt evolves or new requirements emerge. This ensures that your tests remain relevant and effective in validating your language model’s performance.

Tips and Best Practices

Testing Edge Cases and Corner Cases Thoroughly test edge cases (valid but uncommon or extreme inputs) and corner cases (inputs at the boundaries of valid input space) to ensure the robustness of your language models and prompts. Identify potential edge and corner cases specific to your use case and create test cases to validate how your prompts handle them.

Testing Multi-Step or Conditional Prompts For multi-step or conditional prompts, test each possible path and validate the correct flow and output for every scenario. Break down prompts into smaller components, and create test cases that simulate different input combinations and validate the expected behavior at each stage.

Organizing and Maintaining Test Suites As your test suite grows, organize and maintain it effectively. Group related test cases into logical suites. Regularly review and update your test suite to ensure it remains relevant and aligned with your application’s evolving requirements.

For a real-world example of how Langtail’s testing capabilities can save significant time and effort, read our case study with Deepnote on how Langtail saved them days of fine-tuning their AI features.