I vibe coded a task-specific eval suite with Cursor to stop vibe checking my AI agents' prompts.

March 4, 2025

Note - This is my edited work log; it is not comprehensive or instructional. If you need help with anything like this, email me!

Dan

Introduction

I'm working on a project called <yes>tag which uses LLMs to generate code for some of the features it provides. As I was nearing the date of a live demonstration, the generative steps were failing to produce consistently impressive results. I was going to have a very bad demo.

This is a blog post describing the thought process I went through to resolve this issue as quickly as possible, with as little waste and as much value created as I could. I used Cursor's Composer to generate task-specific evals for my ailing use-case, and surprised myself both with how effective task-specific evals were at improving my program and with how well "vibe coding" with Cursor could work for me, at least in these circumstances.

Problem

My prototype's LLM-powered functionality was not producing consistently acceptable output. The initial approach to artifact generation did not contain adequate controls to guarantee quality, resulting in unacceptable final output (a success rate of roughly 1 in 6 to 1 in 8) and fragility in the source code and prompts.

Failed form generation example
❌ This form is not an acceptable artifact, but was a common sight, prompting my exploration of evals.
Successful form generation example
✅ This successfully copied our example HTML/style guide, followed our API's requirements, and followed the specifics of our customization prompt text.

Without resolving this problem, this project would not be useful enough for end-users. It wouldn't be reducing the skill floor to achieve production-ready web apps, only shifting the complexity from coding knowledge to "how to prompt <yes>tag" knowledge - not the big win for the problem space I'm hoping for.

Solution

I used Cursor to rapidly create a task-specific eval tool for controlling the quality of <yes>tag's form-building use case, and to improve its outputs significantly. The tool greatly simplifies testing a wide range of cases that can appear in the space of combinations of form fields, HTML structure, and CSS/style specifications. It took about 5 days to create the evals, unit tests, web app, and test data, use it all to optimize the program into production readiness, and confidently demonstrate the application to a prospect.

It works by supplying OpenAPI specs in a jsonl file and an Example prompt file (HTML snippets and plain-language instructions), which are fed into the LLM-powered generation step, with the output stored in a results jsonl file. That results file is browsable in a viewer web application (FastAPI, HTML, vanilla JS), which was also generated with Cursor. The web application's GUI makes it easy for me to browse the generated outputs, add my human evaluations, and then view the accuracy, recall, false positives and false negatives of the eval prompt. The end result is a simplified, structured process to view the results for a range of inputs, understand the performance of the prompts in my pipeline, keep a record of test scores over time, and ultimately test new LLMs and prompt changes rapidly without drowning in complexity.

Evaluation webapp
One of the eval result files being graded by me. The Show Context and Copy HTML buttons made it easy for me to preview the generated items, so that I could grade them.
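
To make the data flow concrete, here is a rough sketch of what one graded record in the results jsonl file might look like. The field names below are my illustrative assumptions, not the actual schema; the point is that each line pairs the generation inputs with the generated HTML, the judge's verdict, and my human grade.

```python
# Hypothetical shape of one graded record in the results jsonl file.
# All field names here are illustrative assumptions, not the real schema.
import json

record = {
    "openapi_source": "datasets/contact-form/openapi.jsonl",  # which API spec the form targets
    "example_prompt": "examples/brand-style-guide.html",      # HTML snippets + plain-language instructions
    "model": "claude-3.5-sonnet",                              # model used for the generation step
    "generated_html": "<form class=\"signup\">...</form>",     # the artifact being evaluated
    "judge_verdict": "pass",                                   # LLM-as-judge result for this attempt
    "human_grade": "fail",                                     # my grade, added later through the viewer GUI
    "notes": "submit button ignores the style guide colours",
}

# Each run appends one such line per generation attempt.
with open("results/run_2025-03-01.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```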

After using the tool to improve the prompts, find a better language model, and develop a new "judge" process for the use-case, I'm seeing a 100% success rate (instead of approx. 1 in 7), and the time required to create a production-ready form has dropped from at least 30 minutes on average (over many attempts and tweaks) to around 3-6 minutes and 1-2 attempts, consistently. My eventual goal is to completely remove the need for a human judge, which may require additional automated testing techniques such as using screenshots (with multi-modal models) and automated browser testing.

Feature Context

My service <yes>tag creates production-ready web forms that match the design and styling of existing websites or style guides/html templates. The forms are fully functional, submitting their data to any REST API endpoint. Our end-users are not expected to be web developers, nor have experience with editing code or using chatbots. My goal is to enable a broad spectrum of end-user skillsets, from designers and developers to marketers or product managers, to rapidly create and deploy production-ready web application components in minutes, with a low skill-barrier to success.

User flow:

  1. Import APIs into <yes>tag using OpenAPI
  2. Configure a new web form's fields using a simple GUI
  3. Modify the design and behaviour of the form using two prompt fields: Example, and Result Example (to be rendered upon submission)
  4. Regenerate the form, observing progress of the multi-agent system
  5. Test the form in the preview frame
  6. Copy/paste the custom HTML element into your test or production application
  7. Profit

Usability Constraints:

Quality Expectations:

Yestag designer page
Yestag's designer page in full.

Evaluating Tasks

In the world of LLMs, evals are currently most often discussed from the base-model perspective, where the many emerging LLMs are compared against each other using various evaluation suites (MMLU, HumanEval, etc.). Within the context of an application using LLMs to enable features, evals are needed at the task level - the level of a specific prompt, or chain of prompts, that achieves a specific result for your application.

I think of task evals as spiritually similar to unit or integration tests, but more necessary: the non-deterministic unit being tested is harder to reason about and less predictable than imperative code, which makes it a less dependable piece of your program by nature. By creating task evals, automating testing with them, and even using the evals within the flow of your application (like in <yes>tag's multi-agent system), you can improve your software's quality for maintainers and end-users alike.
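
To illustrate what "using the evals within the flow of your application" can look like, here is a minimal judge-in-the-loop sketch. It is not <yes>tag's actual multi-agent code; the callables and retry policy are assumptions for illustration.

```python
# Minimal judge-in-the-loop sketch (illustrative, not <yes>tag's real implementation).
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 3  # assumed retry budget

def generate_with_judge(
    generate: Callable[[Optional[str]], str],   # LLM generation step; takes optional judge feedback, returns HTML
    judge: Callable[[str], Tuple[bool, str]],   # LLM-as-judge step; returns (passed, feedback)
) -> str:
    """Run the generation step, retrying with the judge's feedback until it passes."""
    feedback: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        html = generate(feedback)
        passed, feedback = judge(html)
        if passed:
            return html
    raise RuntimeError(f"No acceptable output after {MAX_ATTEMPTS} attempts: {feedback}")
```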

It's useful to adopt the techniques and measures (like accuracy, precision, and recall) that AI practitioners commonly use to judge model performance; evaluation has emerged as one of the most important aspects of building applications with LLMs. It has been said that "evals are surprisingly often all you need" (Greg Brockman, co-founder of OpenAI), a growing sentiment, matched by an equally growing negative sentiment against "vibe checking" - just looking at the results a few times. Vibe checking worked for me during the prototyping phase of building out <yes>tag, but it was ultimately a form of procrastination I don't recommend you repeat. The benefits of a structured evaluation habit are realized very quickly.

Ideally, you can create evals for every prompt/interaction your application has with an LLM. However, given my current resource and time constraints, I've opted for a coarser granularity, evaluating the output of a handful of LLM calls chained together. I will look to cover all prompts individually, as well as all chains, as the project progresses.

There are a number of ways to create an eval. Good writing on this subject is being done by Eugene Yan (https://eugeneyan.com/writing/evals) and others. My problem boils down to a classification/extraction problem: I need to rate output, rank attempts, and ensure some operations are completed without error/hallucination.

Creating the Task Evals and PyTest

Preparation steps:

  1. Context switch to improve outcome with the LLM coder
  2. Create skeleton classes, methods/functions and add documentation.
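
A hypothetical sketch of what that scaffolding might have looked like. Only EvaluationService and XMLService are named in this post; the methods, signatures, and docstrings below are illustrative stand-ins - the kind of skeleton Composer can fill in.

```python
# Illustrative skeleton of the kind handed to Cursor's Composer.
# EvaluationService and XMLService are named in this post; the methods,
# signatures, and docstrings are hypothetical.

class XMLService:
    """Parses and validates the HTML artifacts produced by the generation step."""

    def extract_form(self, generated_html: str) -> str:
        """Return the <form> element from a generated artifact, or raise if it's missing."""
        raise NotImplementedError

class EvaluationService:
    """Scores generated forms against the Example prompt and the OpenAPI spec."""

    def judge(self, generated_html: str, example_prompt: str, openapi_spec: dict) -> dict:
        """Run the LLM-as-judge prompt and return a structured verdict (score, feedback)."""
        raise NotImplementedError

    def rank(self, verdicts: list[dict]) -> list[dict]:
        """Rank multiple generation attempts by judge score, best first."""
        raise NotImplementedError
```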

Composer Interactions:

I performed a series of interactive code editing sessions with the Cursor Composer, set to claude-3.5-sonnet.

  1. After the preparation, the project kicked off with a new Composer session: "can you complete the two functions in EvaluationService, add to XMLService as necessary"
  2. I had about 16 total Composer sessions (plus manual edits) that moved the project into a functional state: a test suite I could invoke with PyTest from the CLI. At this stage I have a test suite providing an LLM-as-judge eval score, and a new application component (several classes) which supports ranking the output of multiple runs of the agent. The output of the agents has improved slightly with fixes and refactors made along the way.
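
As a rough illustration of what "a test suite I could invoke with PyTest" means here, a sketch of one such test follows. The dataset layout, pass threshold, and the generate_form/judge_form imports are all assumptions made for illustration, not the project's actual test code.

```python
# Hypothetical PyTest-style task eval (a sketch; not the project's actual tests).
import json
import pathlib
import pytest

# generate_form and judge_form stand in for the LLM generation step and the
# LLM-as-judge prompt described in this post (module path and names assumed).
from my_project.generation import generate_form, judge_form

DATASET = pathlib.Path("datasets/contact-form")   # assumed folder: openapi.jsonl + example.html
PASS_THRESHOLD = 0.8                              # assumed minimum judge score

def load_cases():
    """Pair each OpenAPI spec line with the dataset's Example prompt."""
    example_prompt = (DATASET / "example.html").read_text()
    with open(DATASET / "openapi.jsonl") as f:
        return [(json.loads(line), example_prompt) for line in f]

@pytest.mark.parametrize("spec,example_prompt", load_cases())
def test_generated_form_meets_threshold(spec, example_prompt):
    html = generate_form(spec, example_prompt)
    verdict = judge_form(html, spec, example_prompt)   # e.g. {"score": 0.92, "feedback": "..."}
    assert verdict["score"] >= PASS_THRESHOLD, verdict.get("feedback")
```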

Outcome:

  1. Easy to add new data sinks for the forms by adding new OpenAPI files, line-by-line, in a jsonl file. This lets you collect different cases together into test datasets organized into a folder.
  2. Easy to add new style guides, by creating an Example prompt (HTML snippets and plain-English instructions) in an HTML file, organized in a folder.
  3. It's still difficult to understand how improvements can be made to my prompts, because I am not yet evaluating the performance of the LLM-as-judge eval prompt.

Creating the Human Evaluation GUI

Preparation Steps

  1. Create a webapp.py file containing roughly these comments:
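
The actual comments aren't reproduced in this work log. As a purely hypothetical reconstruction, the seed file might have read something like this - plain comments describing the viewer for Composer to build against:

```python
# webapp.py - hypothetical reconstruction of the seed comments (the real ones
# aren't shown in this post). The idea: describe the viewer in comments and
# let Composer fill in the FastAPI app.

# Serve a single-page viewer (FastAPI + vanilla JS) over the results .jsonl files.
# - List the available results files in the results/ folder.
# - For a selected file, show each record: inputs, generated HTML, judge verdict.
# - "Show Context" button: display the Example prompt and OpenAPI spec that were used.
# - "Copy HTML" button: copy the generated form HTML to the clipboard.
# - Let me record a human grade (pass/fail + notes) per record, saved back to the file.
# - Summary view: accuracy, precision, recall, false positives/negatives of the
#   judge prompt, computed against my human grades.
```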

Composer Interactions

(There were roughly 12 Composer sessions, each with one or more interactions.)

  1. Kicked off a Composer session with webapp.py, the eval result Python class, and a results jsonl file as context: "can you create the webapp for me"
  2. Created the views.
  3. Still needed an iteration to add my own inputs.

Evaluation Metrics Setup
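
The metrics are computed by comparing the eval (judge) prompt's verdicts against my human grades. Here is a minimal sketch of that calculation, assuming each graded record carries the judge_verdict and human_grade fields used in the earlier record sketch (field names assumed):

```python
# Minimal sketch of the judge-vs-human metrics. Assumes each graded record has
# "judge_verdict" and "human_grade" fields with "pass"/"fail" values (names assumed).
import json

def judge_metrics(results_path: str) -> dict:
    tp = fp = tn = fn = 0
    with open(results_path) as f:
        for line in f:
            record = json.loads(line)
            if "human_grade" not in record:
                continue  # only human-graded records count
            judge_pass = record["judge_verdict"] == "pass"
            human_pass = record["human_grade"] == "pass"
            if judge_pass and human_pass:
                tp += 1
            elif judge_pass and not human_pass:
                fp += 1  # judge accepted a form I rejected
            elif not judge_pass and human_pass:
                fn += 1  # judge rejected a form I accepted
            else:
                tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }
```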

Why Code Generation Worked

I ended up with a very useful piece of testing/development functionality that took only a couple of days of effort to see a large benefit from. The circumstances of my needs were favourable for success with code generation.

My circumstances were:

The effort-to-value ratio here was high, and I was able to keep most of the generated code decoupled from the rest of my systems, making it easier to dispose of if I decide to start again or switch to a good framework/product when I'm ready. I feel this is an example of a changing build-vs-buy equation: this took much less time than it would have taken to choose a SaaS tool or framework and implement it, at a disposable cost. The "generate vs. hand-code" equation here was clearly in favour of Cursor, given the context switch, the various framework and library refreshers I would have needed, and the ~3-4k new lines of code.

Conclusion

Evals are good! They should be called goodvals.

Use Cursor if you're confident in your domains, but check out things like Helicone and Langfuse if they're suitable. There are many emerging methods for this type of work.

There's no way you read all this. Thanks!