Writing Production-Grade Prompts

Josh Pitzalis
3 min read · Dec 24, 2024


You have an idea for an AI App. You convinced a friend to spend a weekend building out a dinky little version of your app…and it works!

One or two key prompts are doing most of the heavy lifting, but who cares, it works. And it’s useful. You share it with a few friends and they start using it. You might have something here. What will it take to go from this proof-of-concept to building a real app?

Traditionally, building a “real” app could mean some combination of adding user accounts, security, compliance, writing tests and robust error handling, polishing up the front end, migrating from an ad-hoc hosting solution to something more scalable, and optimising your build to reduce latency and costs.

But now we’ve added prompts to the mix.

And no one’s entirely sure what a production-grade prompt looks like.

If you’re like everyone else, you’ve tested your prompts with a handful of examples, and everything looked great. So you pushed your little app out into the world and now people are using it. Some of your friends sent you screenshots of it making silly mistakes but nothing major happened. So no problem.

This vibe-check process works fine for a proof of concept but it becomes a nightmare to maintain with a “real” app. More people start using it and some of them begin to send you more concerning issues. To fix the problems you try tweaking your prompts. A tweak here, a bit of rephrasing there, and the issues seem to disappear. But now you start getting complaints about an entirely new problem. Fixing one error creates three new problems and now the app doesn’t reliably do what it was originally meant to do.

What a Good Solution Looks Like…

A good solution would be to take the handful of examples you were initially testing your app with and store them in some kind of system that runs every time you make a change to your prompts.
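Concretely, that system can be as simple as a script: a list of example inputs, the thing each answer must contain, and a loop that runs your prompt over every example. Here’s a minimal sketch of the idea. The summarisation prompt, the example inputs and the `call_model()` stub are all placeholders for whatever your app actually does, not a finished harness:

```python
# A handful of examples your prompt must keep getting right.
# Each one pairs an input with something the output must contain.
EXAMPLES = [
    {"input": "Summarise: The meeting is moved to Friday at 3pm.", "must_contain": "Friday"},
    {"input": "Summarise: Invoice #42 is overdue by two weeks.", "must_contain": "overdue"},
]

PROMPT_TEMPLATE = "Summarise the following message in one sentence:\n{input}"


def call_model(prompt: str) -> str:
    # Placeholder: swap in whatever LLM call your app already makes.
    raise NotImplementedError


def run_evals() -> None:
    failures = []
    for example in EXAMPLES:
        output = call_model(PROMPT_TEMPLATE.format(input=example["input"]))
        if example["must_contain"].lower() not in output.lower():
            failures.append((example["input"], output))

    print(f"{len(EXAMPLES) - len(failures)}/{len(EXAMPLES)} examples passed")
    for inp, out in failures:
        print(f"FAILED on {inp!r}: got {out!r}")


if __name__ == "__main__":
    run_evals()
```

Run it every time you touch the prompt, and you find out immediately whether the change broke anything you already cared about.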

If you had this type of set-up in place when you first started getting concerning reports, you’d have been able to fix the issue without worrying about breaking anything else. You fix the problem, then run your updated prompt against all of your past examples. This lets you make sure the app still works the way it used to despite the change you just made.

You’ve fixed the problem and you can be confident that the fix hasn’t introduced any new problems. That’s the key right there.

To keep the good train chugging along, you add a bunch of new examples to make sure the problem you just fixed doesn't show up again in the future.

Over time, you keep adding examples every time you fix a problem. You gradually build up a robust suite of examples related to every known problem you’ve dealt with so far. Make a change to any prompt and you can test it against all your past examples instantly. If the tests pass then you know that the change hasn’t broken any existing capability.
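If you’d rather lean on a test runner than a script, the same idea drops straight into something like pytest: every bug you fix becomes one more parametrised case. The `evals` module and the cases below are hypothetical, but the shape is the point, a regression suite for prompts looks a lot like any other regression suite:

```python
import pytest

# Hypothetical module holding the prompt, examples and model call from the sketch above.
from evals import call_model, PROMPT_TEMPLATE

# One entry per problem you've fixed: (input, substring the output must contain).
REGRESSION_CASES = [
    ("Summarise: The meeting is moved to Friday at 3pm.", "Friday"),
    ("Summarise: Invoice #42 is overdue by two weeks.", "overdue"),
]


@pytest.mark.parametrize("text,expected", REGRESSION_CASES)
def test_prompt_regressions(text, expected):
    output = call_model(PROMPT_TEMPLATE.format(input=text))
    assert expected.lower() in output.lower()
```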

These examples that you’re using to test your app are called Prompt Evaluations (evals for short). Building an evaluation suite is one way to systematically test and improve the results you get from the AI-powered features in your app.

This has all been very high level so far. Next, I’ll work through an example so that you understand the specifics around writing your first Eval.
