Published on January 28, 2025 by Esteban Puerta
Building Reliable AI Applications with Deepeval Evaluations
While models get better each week, it’s more important than ever to build a collection of testable artifacts that solidify your AI implementation. I’ll share a high-level overview of the important evaluation concepts and show how simple it is to build your own datasets, synthesized from your own data.
The evaluation testing flow
The flow is fairly simple. It requires some sort of source, whether synthetic or real data, to populate the necessary parameters of a test case. Test cases follow a simple object structure and vary slightly depending on the metric you want to test. Metrics then take the list of test cases you’ve built and use an LLM to score the desired evaluation criteria. In the end, you get a test result object with the final scores for each metric run against each test case.
Let’s work through what an evaluation process looks like.
In my own scenario, my current strategy is to keep a set of datasets on hand, built from files that resemble what my users typically work with.
Sources
Sources, in this context, means gathering the details needed to build an LLMTestCase. The shape of the LLMTestCase looks like the following:
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# ToolCall is defined alongside LLMTestCase in deepeval.test_case
@dataclass
class LLMTestCase:
    input: str
    actual_output: str
    expected_output: Optional[str] = None
    context: Optional[List[str]] = None
    retrieval_context: Optional[List[str]] = None
    additional_metadata: Optional[Dict] = None
    comments: Optional[str] = None
    tools_called: Optional[List[ToolCall]] = None
    expected_tools: Optional[List[ToolCall]] = None
    reasoning: Optional[str] = None
    name: Optional[str] = field(default=None)
    _dataset_rank: Optional[int] = field(default=None, repr=False)
    _dataset_alias: Optional[str] = field(default=None, repr=False)
    _dataset_id: Optional[str] = field(default=None, repr=False)
In our lab setup, we only care about input and actual_output. For more complex use cases like RAG and agent workloads, you’ll want to populate the relevant fields of this test case class: for agents, expected_tools and tools_called are relevant, and for RAG, retrieval_context is relevant.
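As a quick sketch, a test case for this lab only needs the two required fields; the strings below are placeholder values based on the sample document used later:

from deepeval.test_case import LLMTestCase

# Only input and actual_output are required for input/output-only metrics
test_case = LLMTestCase(
    input="What does MCCVB stand for?",
    actual_output="MCCVB is the Monterey County Convention and Visitors Bureau.",
)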
Creating Synthetic Data
We won’t have access to user data for our tests. Luckily, deepeval provides some neat ways to synthesize a dataset from a file. It parses the document and creates mock inputs, mock retrieval context, and mock expected outputs, fulfilling almost everything we need to create a group of test cases. The only missing piece is the actual output from our AI implementation, which we’ll generate in the next section.
Let’s create a dataset from a file we expect our users to use.
dataset.generate_goldens_from_docs(document_paths=sources)
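To show how that one-liner fits together, here is a minimal sketch; the document path is a stand-in taken from the example output below:

from deepeval.dataset import EvaluationDataset

# Any documents that resemble what your users typically work with
sources = ["src/data/advertising-rfp.pdf"]

dataset = EvaluationDataset()
dataset.generate_goldens_from_docs(document_paths=sources)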
What this gives us is a list of synthetic inputs, context, and expected outputs.
{
"input": "Consider if MCCVB suddenly doubled its marketing budget; how might Monterey's tourism evolve?",
"actual_output": None,
"expected_output": "If the Monterey County Convention and Visitors Bureau (MCCVB) were to suddenly double its marketing budget, several positive developments could occur within Monterey's tourism sector:\n\n1. .....",
"context": [
" \n8 | Page \n \n \nMCCVB “ABOUT US” \nMCCVB is the Destination Marketing Organization (DMO) for the County of \nMonterey, made up of nine jurisdictions and a vibrant local industry that includes \nhundreds of hotels and resorts, major attractions, renowned wineries and \nrestaurants, and a variety of additional businesses that fuel the tourism economy. .. \n \n \n \n ",
],
"retrieval_context": None,
"additional_metadata": {
"evolutions": [
"Hypothetical"
],
"synthetic_input_quality": 1.0
},
"comments": None,
"tools_called": None,
"expected_tools": None,
"source_file": "src/data/advertising-rfp.pdf"
}
Great! This gives us a good baseline to save for later. Because the actual output is generated at runtime, we can use this dataset with almost any AI implementation we’re working on.
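To persist the goldens for reuse, the dataset object can write them to disk. This is a sketch assuming the save_as helper available in recent deepeval versions; the directory is an arbitrary example:

# Save the synthesized goldens so any future implementation can reuse them
dataset.save_as(file_type="json", directory="./datasets")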
Test Cases
This is the first step that is AI-implementation specific, mainly because we need to run the inputs through our AI system to gather its outputs. This will differ between models, RAG implementations, agents, etc.
Creating test cases from our dataset is easy. We can just loop through and execute a run based on an input.
For simplicity, this lab runs serially; for a more optimized approach, this work can easily be parallelized or run concurrently. A sketch of the loop is shown below.
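In this sketch, generate_answer is a hypothetical stand-in for whatever call invokes your AI system:

from deepeval.test_case import LLMTestCase

test_cases = []
for golden in dataset.goldens:
    # generate_answer is a placeholder for your model, RAG pipeline, or agent call
    actual_output = generate_answer(golden.input)
    test_cases.append(
        LLMTestCase(
            input=golden.input,
            actual_output=actual_output,
            expected_output=golden.expected_output,
        )
    )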
The result is a list of fully populated test cases, which means we can select the metrics we want to test and run them against our test case list.
Metrics
Deepeval provides various metrics out of the box. While most of them rely on retrieved context, which we skipped in this lab for simplicity, we can use the ones that evaluate only the input and output.
Here is the list of available metrics:
- G-Eval
- Prompt Alignment
- Faithfulness
- Answer Relevancy
- Contextual Relevancy
- Contextual Precision
- Contextual Recall
- Tool Correctness
- Json Correctness
- Ragas
- Hallucination
- Toxicity
- Bias
- Summarization
deepeval also offers conversational metrics, which are metrics used to evaluate conversations instead of individual, granular LLM interactions. These include:
- Conversational G-Eval
- Knowledge Retention
- Role Adherence
- Conversation Completeness
- Conversation Relevancy
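To make this concrete, here is a sketch using two metrics that only need input and actual_output; the threshold and criteria are illustrative choices, not prescribed values:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCaseParams

# Answer relevancy scores how well the actual output addresses the input
relevancy = AnswerRelevancyMetric(threshold=0.7)

# G-Eval lets us describe a custom criterion in plain language
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output correctly answers the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

results = evaluate(test_cases=test_cases, metrics=[relevancy, correctness])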
Results
After running the evaluations with the metrics we want, we can review the results and check each metric’s score, reason, and verdict.
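As a rough sketch of inspecting the result object (attribute names follow recent deepeval versions and may differ in yours):

# Walk through each test case's per-metric score, reason, and pass/fail verdict
for test_result in results.test_results:
    for metric_data in test_result.metrics_data:
        print(metric_data.name, metric_data.score, metric_data.reason, metric_data.success)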
This initial workshop was about understanding how building an evaluation workflow is not only easy but crucial to understanding your AI systems. While this implementation is elementary, in the coming series I’ll dive deeper into testing complex multi-agent systems and various RAG implementations using different models.