Most people write prompts the way they'd ask a colleague for a favor. Conversational, vague, optimistic that the other party will fill in the gaps.
Engineers write tickets differently. A good ticket specifies what input the system receives, what constraints apply, and what a passing output looks like. You can test against it. You can know when the work is done.
That's the mental model shift that separates prompts that work once from prompts that work at scale.
Why vague prompts fail at scale
A vague prompt works in conversation because you can iterate — "no, more like this" — and the model adjusts. In a production pipeline, there's no iteration. The prompt runs 500 times, the same way, and the vagueness compounds into inconsistency.
"Write a product description for this skincare brand" will generate 500 different interpretations of what a product description is, how long it should be, what tone it should take, and what the call-to-action should look like. Some will be fine. Many will be wrong. None will be reliably consistent.
The output is only as consistent as the spec behind it. Vague spec, variable output. Tight spec, predictable output.
The three-part prompt structure
The structure that works in production borrows directly from engineering ticket conventions.
Input — what the model receives. Be explicit. "You will receive a product name, a list of five product benefits, a target audience description, and a single CTA phrase." Don't assume the model will infer the shape of the data. Name every piece of it.
Constraints — what the output must and must not do. Length in words or characters. Tone register, expressed as rules rather than adjectives: "professional" is an adjective; "second person, present tense, no passive voice, no exclamation marks" is a constraint. What to avoid. What format the output should follow: JSON, prose, or a structured list.
Acceptance criteria — what a passing output looks like. This is the piece most prompts skip entirely. "A good product description mentions the hero benefit in the first sentence, includes one social proof signal, stays under 80 words, and ends with the provided CTA phrase." That's testable. You can look at an output and check it against each criterion.
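Assembled, the three parts read as one spec. Here's a minimal sketch in Python, assuming a plain chat-style prompt; the field names and the product data are invented for illustration:

```python
# A minimal sketch of the three-part structure as a reusable template.
# The field names and product data are hypothetical examples.
PROMPT_TEMPLATE = """\
INPUT
You will receive a product name, a list of five product benefits
(the first is the hero benefit), a target audience description,
and a single CTA phrase.

Product name: {product_name}
Benefits: {benefits}
Audience: {audience}
CTA phrase: {cta}

CONSTRAINTS
- Second person, present tense, no passive voice, no exclamation marks.
- Plain prose, no headings or bullet lists.
- Under 80 words.

ACCEPTANCE CRITERIA
- The hero benefit appears in the first sentence.
- Exactly one social proof signal is included.
- The description ends with the CTA phrase, verbatim.

Write the product description now.
"""

prompt = PROMPT_TEMPLATE.format(
    product_name="HydraCalm Serum",
    benefits="calms redness in 10 minutes; fragrance-free; "
             "dermatologist tested; vegan; recyclable packaging",
    audience="adults aged 25-45 with sensitive skin",
    cta="Try HydraCalm risk-free today.",
)
```

Note that the acceptance criteria live inside the prompt itself, so the model is optimizing toward the same checklist you'll grade it against.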
Diffing output against spec
Once you have explicit acceptance criteria, you can evaluate outputs systematically. This is what LLM evaluation frameworks do at scale — run the output against the spec and flag failures.
In practice, even manual review becomes faster with acceptance criteria. You're not reading copy and asking "does this feel right?" You're checking a list: hero benefit in sentence one — yes or no. Social proof signal present — yes or no. Under 80 words — count them. CTA phrase included — exact match check.
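The mechanical criteria can be diffed in code. A minimal sketch against the product description spec above; note that a fuzzy criterion like "one social proof signal" resists exact matching and usually falls to an LLM judge or a human reviewer:

```python
import re

def check_description(text: str, hero_benefit: str, cta: str) -> dict[str, bool]:
    """Diff one generated description against the mechanical criteria."""
    cleaned = text.strip()
    # Split on sentence-ending punctuation to isolate the first sentence.
    first_sentence = re.split(r"(?<=[.!?])\s+", cleaned)[0]
    return {
        "hero_benefit_in_first_sentence": hero_benefit.lower() in first_sentence.lower(),
        "under_80_words": len(cleaned.split()) < 80,
        "ends_with_cta": cleaned.endswith(cta),
    }

output = "This serum calms redness in 10 minutes. Try HydraCalm risk-free today."
results = check_description(
    output, "calms redness in 10 minutes", "Try HydraCalm risk-free today."
)
failures = [name for name, ok in results.items() if not ok]  # [] means the spec passed
```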
The same principle applies to agent prompts. When you're prompting an agent to perform a multi-step task — classify this record, then look up this reference, then write this output — each step needs its own acceptance criteria. An agent prompt without acceptance criteria for each step is a spec with no way to know if the task succeeded.
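A sketch of what per-step criteria can look like in code; run_step is a stand-in for however your agent framework executes a step, and the step names and checks are illustrative:

```python
# Hypothetical sketch: each step of a multi-step agent task carries its own
# acceptance check, so a failure points at a specific step instead of
# "the agent didn't work".
AGENT_STEPS = [
    ("classify_record", lambda out: out in {"lead", "customer", "partner"}),
    ("look_up_reference", lambda out: out.startswith("REF-")),  # assumed ID format
    ("write_summary", lambda out: len(out.split()) < 80),
]

def run_with_checks(run_step, record: dict) -> dict:
    for name, accept in AGENT_STEPS:
        result = run_step(name, record)  # run_step is an assumed interface
        if not accept(result):
            raise ValueError(f"step {name!r} failed its acceptance check: {result!r}")
        record = {**record, name: result}  # carry each step's output forward
    return record
```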
Agent prompts versus one-shot prompts
One-shot prompts are self-contained. Input goes in, output comes out, done. The spec maps cleanly onto the three-part structure.
Agent prompts are different. The model takes an action, observes the result, decides the next action. The prompt isn't specifying a single output — it's specifying a behavior pattern across an unknown number of steps.
For agent prompts, the spec needs to define: what tools the agent can use and when, what constitutes a terminal state (when is the task complete), what to do when the agent encounters a state it wasn't designed for, and how to communicate failure. An agent without clear terminal conditions will hallucinate completion or loop indefinitely.
The common failure mode: agent prompts that describe what the agent is but not what done looks like. "You are a research assistant that finds relevant information about companies." Okay — but when do you stop? What's the output format? What counts as relevant? Without those answers, the agent will either over-generate or under-deliver.
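Here is one way that prompt might be tightened so every one of those questions is answered in the spec itself; the tool names and limits are invented for illustration:

```
You are a research assistant that finds relevant information about companies.

Tools: web_search for public sources; read_page to fetch a specific URL.

Relevant means: funding, headcount, products, and leadership changes
from the last 24 months. Ignore anything older.

Terminal state: stop after collecting five relevant facts, or after
ten tool calls, whichever comes first.

Output format: JSON with keys "company", "facts" (each fact paired
with its source URL), and "gaps" (what you could not verify).

On failure: if ten tool calls produce no relevant facts, return the
JSON with an empty "facts" list and explain what you tried in "gaps".
```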
System prompts as production artifacts
In a production AI system, prompts are code. They need version control. They need testing before deployment. They need review when the model version changes — because a prompt tuned for one model version may produce different behavior on the next.
Teams that treat prompts as quick text inputs — pasted into a tool, tweaked occasionally, never formally tracked — are the ones who wake up to production failures they can't diagnose because they don't know which prompt version was running.
Version your prompts the way you version your code. Keep the input/constraints/acceptance criteria structure so every prompt is readable and testable. Diff outputs against spec when you update. Write prompts like you'd write a ticket for a senior engineer: specific enough that they can do the work without asking you five clarifying questions.
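That can be as small as a regression check pinned to a fixed set of sample inputs. A sketch reusing the template and checker from the earlier examples, with generate() standing in for whatever calls your model:

```python
# Hypothetical regression check, reusing PROMPT_TEMPLATE and check_description()
# from the sketches above. Run it before deploying a prompt change or
# switching model versions.
SAMPLE_CASES = [
    {
        "product_name": "HydraCalm Serum",
        "hero_benefit": "calms redness in 10 minutes",
        "cta": "Try HydraCalm risk-free today.",
    },
    # ...more fixed, representative cases kept under version control
]

def test_prompt_meets_spec():
    for case in SAMPLE_CASES:
        output = generate(PROMPT_TEMPLATE, case)  # generate() is assumed
        results = check_description(output, case["hero_benefit"], case["cta"])
        failures = [name for name, ok in results.items() if not ok]
        assert not failures, f"{case['product_name']} failed: {failures}"
```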
