Context Engineering: Improving AI Coding Agents Using DSPy GEPA

Context Engineering applied to coding agents (autoanalyst.ai)

Arslan Shahid
Oct 20, 2025


This blog post is a technical walkthrough of how you can improve the AI coding agents used in the auto-analyst, an AI data scientist, using DSPy prompt optimization with GEPA.

The blog covers the following topics:

  1. Preparing data

  2. Explaining GEPA

  3. Applying prompt optimization (GEPA) via DSPy

  4. Results


1. Preparing Data


The dataset is made up of Python code execution runs done through our product. The auto-analyst is an AI system with multiple parts, each designed for a specific coding job. One part, the pre-processing agent, cleans and prepares the data using pandas. Another part, the data visualization agent, creates charts and graphs using plotly.

The system has about 12 unique signatures, each with two versions: one that uses the planner, and one that runs on its own for ‘@agent’ queries. A minimal sketch of one such signature follows the list below.

But for this blog post, we’ll focus on just 4 of those signatures and their two variants. These 4 alone make up around 90% of all code runs, since they’re the default ones used by almost everyone, whether they’re free or paid users.

  1. preprocessing_agent

  2. data_viz_agent

  3. statistical_analytics_agent

  4. sk_learn_agent
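
Each of these is a DSPy signature under the hood. As a rough illustration only (the field names and docstring here are hypothetical, not the product's actual prompt), a stripped-down data_viz_agent might be declared like this:

```python
import dspy

class DataVizSignature(dspy.Signature):
    """Generate Plotly code that visualizes the dataset as the user requests."""

    goal: str = dspy.InputField(desc="the user's visualization request")
    dataset_context: str = dspy.InputField(desc="schema and sample rows of the active dataframe")
    code: str = dspy.OutputField(desc="runnable Python code that builds the chart with plotly")

# One module per signature; the planner and '@agent' variants would wrap
# the same core signature with different routing instructions.
data_viz_agent = dspy.ChainOfThought(DataVizSignature)
```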

We can break the dataset down into two parts: the default dataset provided by the system and the data that users upload themselves.

[Figure: Code execution success rate by category]

Our goal is to make sure any optimization improves performance on both. It should work well not just on the default data but also on the datasets users upload.

To do this, we need to stratify the data. That way, the model doesn’t overfit on the default dataset and can handle a variety of inputs effectively.

Another important factor we need to consider is the model providers. We don’t want to optimize just for one provider and end up hurting performance on the others.

Note: There’s bias in this because our users mostly used OpenAI’s cheaper models like GPT-4o-mini in our system, while for Gemini, our users used only their top models. Since we don’t have enough data to evaluate on a per-model basis, we’re using the provider as a proxy. When comparing top OAI models with the top models of other providers, OpenAI’s success rate is similar.

After preparing the dataset, we created a stratified sample with the following constraints:

  • No more than 20% of the data comes from the default dataset (is_default_dataset == True).

  • Each of the three model providers (openai, anthropic, gemini) accounts for at least 10% of the final sample.

  • Stratification was done across three columns: model_provider, is_successful, and is_default_dataset.

Once the stratified sample was created, we split it into a training set and a test set to be used for evaluation.
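
As a minimal sketch of that sampling step (assuming the runs live in a pandas DataFrame called runs_df; the sample size and split ratio are illustrative, and this version only asserts the 20% cap and 10% floors rather than enforcing them directly):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_sample(df: pd.DataFrame, n: int = 400, seed: int = 42) -> pd.DataFrame:
    # Composite key over the three stratification columns.
    key = (df["model_provider"].astype(str) + "|"
           + df["is_successful"].astype(str) + "|"
           + df["is_default_dataset"].astype(str))
    # Proportional draw within each stratum.
    sample = df.groupby(key).sample(frac=n / len(df), random_state=seed)
    # Sanity-check the constraints described above; rebalance the strata if either fails.
    assert sample["is_default_dataset"].mean() <= 0.20, "default dataset over 20%"
    assert sample["model_provider"].value_counts(normalize=True).min() >= 0.10
    return sample

sample = stratified_sample(runs_df)  # runs_df: hypothetical DataFrame of logged runs
train_df, test_df = train_test_split(
    sample, test_size=0.3, random_state=42, stratify=sample["is_successful"]
)
```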


2. Explaining GEPA


GEPA stands for Genetic-Pareto. It is an evolutionary prompt optimizer designed for DSPy programs that uses reflection to evolve and improve text components such as AI prompts.

GEPA leverages large language models’ (LLMs) ability to reflect on the program’s execution trajectory, diagnosing what worked, what didn’t, and proposing improvements through natural language reflection.

It builds a tree of evolved prompt candidates by iteratively testing and selecting better prompts based on multi-objective (Pareto) optimization.
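
In DSPy, that reflection signal comes through the metric: besides a numeric score, a GEPA metric may return natural-language feedback for the reflection LM to read. A minimal sketch for code-execution runs (run_in_sandbox is a hypothetical helper standing in for our execution harness, not a DSPy or auto-analyst API):

```python
import dspy

def code_run_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Execute the generated code against the example's dataset (sandbox not shown).
    ok, error = run_in_sandbox(pred.code, gold.dataset_context)  # hypothetical helper
    if ok:
        return dspy.Prediction(score=1.0, feedback="Code executed without errors.")
    # A concrete traceback gives the reflection LM something to diagnose.
    return dspy.Prediction(score=0.0, feedback=f"Execution failed:\n{error}")
```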

Step by step, here is what GEPA does as an evolutionary prompt optimizer in DSPy: it keeps a pool of candidate prompts, samples a promising candidate from the Pareto frontier of per-example scores, runs it on a small batch of training examples, lets the reflection LM read the resulting traces and metric feedback to propose an edited prompt, and keeps the edit in the candidate tree if it scores better, repeating until the rollout budget is spent.
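
Wiring it together looks roughly like this (a sketch assuming dspy.GEPA's current interface; the budget preset, reflection model, and example lists are placeholders):

```python
import dspy

optimizer = dspy.GEPA(
    metric=code_run_metric,                  # the feedback metric sketched above
    auto="light",                            # budget preset; "medium"/"heavy" search longer
    reflection_lm=dspy.LM("openai/gpt-4o"),  # a strong model to write the reflections
)

# train_examples/test_examples: dspy.Example objects built from the stratified split.
optimized_agent = optimizer.compile(
    data_viz_agent, trainset=train_examples, valset=test_examples
)
```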

A guest post by Arslan Shahid, who is building the Auto-analyst, an open-source AI data scientist, in public via his Substack, FireBirdTech.