Best Prompt Engineering Techniques for Developers
Most developers treat LLM prompts like search queries: type a vague request, get unreliable results, then blame the model. This approach wastes time and produces code you can't trust. Effective prompt engineering is a skill that separates developers who integrate AI successfully from those who abandon it after frustrating experiences. The difference isn't the model you use—it's how precisely you communicate intent and constrain output.
This guide focuses on techniques that matter for building applications, not casual ChatGPT usage. You'll learn how to structure prompts for consistent JSON output, reduce hallucinations in code generation, handle context window limits, and debug when models produce unexpected results. These aren't theoretical principles—they're battle-tested patterns from production systems processing millions of LLM requests monthly.
We'll cover seven core techniques with code examples, failure modes to avoid, and specific scenarios where each technique applies. By the end, you'll understand not just what to do, but why certain patterns work and when to use them.
Why Prompt Engineering Matters for Developers
Unlike end users who can iterate manually with ChatGPT, developers need prompts that work reliably in automated systems. When an LLM is part of your application—generating summaries, extracting data, or writing code—prompt failures become application failures.
Consider a common scenario: extracting structured data from user input. A naive prompt ("Extract the name and email from this text") might work 70% of the time. The other 30% of the time, it returns malformed JSON, hallucinates fields that don't exist, or silently fails. In a manual workflow, you notice and retry. In an application serving users, a failure rate like that is unacceptable.
The core challenge: LLMs are probabilistic, but applications need deterministic behavior. Prompt engineering techniques add constraints, structure, and error handling to bridge this gap. You're not aiming for perfection (impossible with probabilistic systems), but for predictable failure modes you can handle gracefully.
Key Insight: The best prompt is not the one that produces the most impressive output—it's the one that produces the most consistent, parseable, and verifiable output. Optimize for reliability, not creativity.
Technique 1: Structured Output with Format Enforcement
The single most important technique: force the model to output in a specific, parseable format. Never ask for "information about X"—ask for "JSON containing fields A, B, C with these exact keys".
Why this works: LLMs are trained on massive amounts of structured data (JSON, XML, YAML) and excel at pattern matching. By providing an exact template, you leverage this strength. The model is far more likely to produce valid JSON when shown the exact structure you expect.
Here's the wrong way:
// Bad: Unstructured output
const prompt = `Extract the user's name, email, and preferences from this text:
${userInput}`;
const response = await llm.complete(prompt);
// Response might be: "The user's name is John Doe, email is john.doe@example.com..."
// Now you need complex parsing logic
The right way:
// Good: Structured output with exact format
const prompt = `Extract information from the following text and return ONLY valid JSON with this exact structure:
{
"name": "full name or null if not found",
"email": "email address or null if not found",
"preferences": ["array", "of", "preferences"]
}
Text: ${userInput}
JSON:`;
const response = await llm.complete(prompt);
const data = JSON.parse(response); // Far more likely to parse, but still validate before use
Advanced version: Use JSON Schema to specify exact types, constraints, and validation rules. Models like GPT-4 and Claude understand JSON Schema and usually conform to it, but conformance is not guaranteed, so validate the output.
const schema = {
type: "object",
properties: {
name: { type: "string", minLength: 1 },
email: { type: "string", format: "email" },
age: { type: "integer", minimum: 0, maximum: 150 }
},
required: ["name", "email"]
};
const prompt = `Extract user information from the following text.
Return ONLY valid JSON matching this schema:
${JSON.stringify(schema, null, 2)}
Text: ${userInput}
JSON:`;
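Because the schema in the prompt is only a request, it helps to check the parsed response against it in code. The following is a minimal sketch, assuming only the handful of keywords used above (`type`, `required`, `minLength`, `minimum`, `maximum`); `validateAgainstSchema` and `userSchema` are hypothetical names, and `format: "email"` is deliberately not checked. A full validator library such as Ajv is the more usual choice in production.

```javascript
// Minimal validator for the small JSON Schema subset used in this section.
// It is NOT a complete JSON Schema implementation.
function validateAgainstSchema(data, schema) {
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return ["expected a JSON object"];
  }
  const errors = [];
  for (const key of schema.required ?? []) {
    if (!(key in data)) errors.push(`missing required field: ${key}`);
  }
  for (const [key, rule] of Object.entries(schema.properties ?? {})) {
    if (!(key in data)) continue; // optional field that was not returned
    const value = data[key];
    if (rule.type === "string") {
      if (typeof value !== "string") {
        errors.push(`${key}: expected string`);
      } else if (rule.minLength !== undefined && value.length < rule.minLength) {
        errors.push(`${key}: shorter than ${rule.minLength}`);
      }
    } else if (rule.type === "integer") {
      if (!Number.isInteger(value)) {
        errors.push(`${key}: expected integer`);
      } else {
        if (rule.minimum !== undefined && value < rule.minimum) {
          errors.push(`${key}: below minimum ${rule.minimum}`);
        }
        if (rule.maximum !== undefined && value > rule.maximum) {
          errors.push(`${key}: above maximum ${rule.maximum}`);
        }
      }
    }
  }
  return errors;
}

// Hypothetical schema mirroring the one shown above (without "format").
const userSchema = {
  type: "object",
  properties: {
    name: { type: "string", minLength: 1 },
    email: { type: "string" },
    age: { type: "integer", minimum: 0, maximum: 150 }
  },
  required: ["name", "email"]
};

console.log(validateAgainstSchema({ name: "Ada", email: "ada@example.com", age: 36 }, userSchema));
console.log(validateAgainstSchema({ name: "", age: 200 }, userSchema));
```

An empty array means the response passed; any entries can be logged or fed into a retry.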
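OpenAI and some other providers support function calling (also called tool use), which lets you declare the expected schema in the API request instead of the prompt. Use this when available; it's more reliable than prompt-based formatting, though the returned arguments should still be validated.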
// Using OpenAI tool calling to constrain the response to a schema.
// The older `functions` / `function_call` parameters are deprecated in favor of `tools` / `tool_choice`.
const tools = [{
  type: "function",
  function: {
    name: "extract_user_info",
    description: "Extract user information from text",
    parameters: {
      type: "object",
      properties: {
        name: { type: "string" },
        email: { type: "string" },
        age: { type: "integer" }
      },
      required: ["name", "email"]
    }
  }
}];
const response = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [{ role: "user", content: userInput }],
  tools,
  // Force the model to call this specific function
  tool_choice: { type: "function", function: { name: "extract_user_info" } }
});
// Arguments arrive as a JSON string; parse and validate them before use
const data = JSON.parse(response.choices[0].message.tool_calls[0].function.arguments);
Failure mode to avoid: Don't ask for multiple pieces of information without structure. Prompts like "List the pros and cons" produce free-form text that's hell to parse. Instead: "Return JSON with 'pros' array and 'cons' array".
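As a rough illustration of the pros/cons advice, the sketch below pairs a prompt that asks for two named arrays with a parser that rejects anything else. Both function names are hypothetical, and the raw string in the usage line stands in for a model response.

```javascript
// Build a prompt that asks for pros and cons as two JSON arrays.
function buildProsConsPrompt(topic) {
  return `List the pros and cons of ${topic}.
Return ONLY valid JSON with this exact structure:
{"pros": ["..."], "cons": ["..."]}
JSON:`;
}

// Parse the response and confirm both keys are arrays of strings.
function parseProsCons(raw) {
  const data = JSON.parse(raw); // throws SyntaxError on malformed JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("expected a JSON object");
  }
  for (const key of ["pros", "cons"]) {
    const list = data[key];
    if (!Array.isArray(list) || !list.every(item => typeof item === "string")) {
      throw new Error(`"${key}" must be an array of strings`);
    }
  }
  return { pros: data.pros, cons: data.cons };
}

// Stand-in for a model response.
const sampleResponse = '{"pros": ["type safety"], "cons": ["build step", "learning curve"]}';
console.log(buildProsConsPrompt("TypeScript"));
console.log(parseProsCons(sampleResponse).cons.length); // 2
```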
Technique 2: Few-Shot Examples for Complex Tasks
When the task is ambiguous or requires domain knowledge, show the model examples of correct input-output pairs. This is "few-shot learning"—the model infers the pattern from examples.
Why this works: LLMs excel at pattern matching. Examples are far more precise than natural language descriptions of what you want. "Format dates consistently" is vague. Showing three examples of your date format is unambiguous.
Use few-shot for:
- Domain-specific transformations (e.g., converting medical notes to structured records)
- Style consistency (e.g., code that follows your team's conventions)
- Complex multi-step reasoning (e.g., debugging logic)
// Example: Converting natural language to SQL
const prompt = `Convert natural language questions to SQL queries.
Example 1:
Question: Show me all users who signed up last week
SQL: SELECT * FROM users WHERE created_at >= CURRENT_DATE - INTERVAL '7 days'
Example 2:
Question: Count active subscriptions by plan type
SQL: SELECT plan_type, COUNT(*) FROM subscriptions WHERE status = 'active' GROUP BY plan_type
Example 3:
Question: Find users with no orders
SQL: SELECT * FROM users u WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id)
Now convert this question:
Question: ${userQuestion}
SQL:`;
How many examples? Typically 2-5. More examples improve quality but use context window space and increase latency. Start with 2, add more if quality is insufficient.
Example selection strategy: Don't pick examples randomly. Choose diverse examples that cover edge cases:
- One simple example (establishes the baseline pattern)
- One complex example (shows the model can handle complexity)
- One edge case example (demonstrates how to handle unusual inputs)
For production systems, dynamically select examples based on the input. Use semantic similarity search to find the most relevant examples from a larger pool.
// Dynamic example selection using embeddings
async function selectRelevantExamples(query, examplePool, k = 3) {
const queryEmbedding = await getEmbedding(query);
// Find k most similar examples
const similarities = examplePool.map(ex => ({
example: ex,
similarity: cosineSimilarity(queryEmbedding, ex.embedding)
}));
return similarities
.sort((a, b) => b.similarity - a.similarity)
.slice(0, k)
.map(s => s.example);
}
// Use relevant examples in prompt
const examples = await selectRelevantExamples(userQuestion, sqlExamples);
const prompt = buildPromptWithExamples(examples, userQuestion);
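The snippet above relies on two helpers that are not defined in this guide, `cosineSimilarity` and `buildPromptWithExamples`. One possible shape for them is sketched below, assuming each example in the pool is an object with `question`, `sql`, and a precomputed `embedding` array; those field names are assumptions made for illustration.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Render the selected examples in the same layout as the static prompt.
function buildPromptWithExamples(examples, question) {
  const shots = examples
    .map((ex, i) => `Example ${i + 1}:\nQuestion: ${ex.question}\nSQL: ${ex.sql}`)
    .join("\n\n");
  return `Convert natural language questions to SQL queries.\n\n${shots}\n\nNow convert this question:\nQuestion: ${question}\nSQL:`;
}

const demoExamples = [
  { question: "Count all users", sql: "SELECT COUNT(*) FROM users" },
  { question: "List plan types", sql: "SELECT DISTINCT plan_type FROM subscriptions" }
];
console.log(buildPromptWithExamples(demoExamples, "Count active users"));
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```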
Technique 3: Chain-of-Thought for Complex Reasoning
For tasks requiring multi-step logic, explicitly ask the model to show its reasoning before giving the final answer. This "chain-of-thought" (CoT) prompting dramatically improves accuracy on reasoning tasks.
Why this works: Models perform better when they "think out loud". By generating intermediate reasoning steps, the model catches its own errors and produces more logically consistent outputs. This is particularly effective for mathematical calculations, logical deduction, and debugging.
// Without chain-of-thought (more error-prone on multi-step problems)
const directPrompt = `If a service costs $50/month and I pay $300, how many months of service is that?
Answer:`;
// The model answers in one shot: "6", "6 months", or occasionally a wrong number
// With chain-of-thought (more reliable)
const cotPrompt = `If a service costs $50/month and I pay $300, how many months of service is that?
Let's think step by step:`;
// Typical model output:
// 1. Monthly cost: $50
// 2. Total paid: $300
// 3. Calculation: $300 ÷ $50 = 6 months
// Answer: 6 months
For code generation and debugging, chain-of-thought means asking the model to explain its reasoning before generating code:
const prompt = `Debug this function that's supposed to calculate compound interest but returns wrong values:
function compoundInterest(principal, rate, time) {
return principal * Math.pow(1 + rate, time);
}
// Test: compoundInterest(1000, 0.05, 2) returns 1102.5, expected 1102.5 ✓
// Test: compoundInterest(1000, 5, 2) returns 36000, expected 1102.5 ✗
First, explain what the bug is and why it happens.
Then provide the corrected code.
Explanation:`;
The model is far more likely to identify that the rate should be 0.05 (5% as decimal) not 5 when it's forced to explain the bug before fixing it.
Zero-shot CoT: The simplest variant is appending "Let's think step by step:" to your prompt, with no worked examples, and it works surprisingly well. Research shows this single phrase significantly improves reasoning accuracy across many tasks.
Pro Tip: For production systems, you can parse and discard the reasoning steps, keeping only the final answer. The reasoning improves quality even if you don't show it to users. Use a format like "Reasoning: ... \n\nAnswer: ..." so you can extract just the answer programmatically.
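A minimal sketch of that extraction step follows, assuming the model was told to reply in the "Reasoning: ... Answer: ..." layout described above. `splitReasoningAndAnswer` is a hypothetical helper; it returns `null` when the layout is missing so the caller can retry or fall back.

```javascript
// Split a "Reasoning: ... Answer: ..." response into its two parts.
// If the reasoning itself contains a line starting with "Answer:",
// the split happens at the first such line.
function splitReasoningAndAnswer(response) {
  const match = response.match(/Reasoning:\s*([\s\S]*?)\n\s*Answer:\s*([\s\S]*)$/);
  if (!match) return null;
  return { reasoning: match[1].trim(), answer: match[2].trim() };
}

const cotResponse = "Reasoning: $300 / $50 per month = 6\n\nAnswer: 6 months";
console.log(splitReasoningAndAnswer(cotResponse).answer); // "6 months"
console.log(splitReasoningAndAnswer("6 months")); // null
```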
Technique 4: Limiting Hallucination with Context Grounding
Hallucination—the model inventing plausible-sounding but false information—is the biggest reliability problem in production LLM applications. The solution: ground responses in provided context and explicitly forbid invention.
Why hallucination happens: LLMs are trained to produce fluent, coherent text. When they don't know something, the training objective still pushes them to generate something plausible. They're pattern-matching machines, not knowledge databases.
The fix is to provide all necessary information in the prompt and instruct the model to only use that information:
// Bad: Allows hallucination
const prompt = `What are the features of our Pro plan?`;
// Model will invent features based on common SaaS patterns
// Good: Grounded in provided context
const prompt = `Based ONLY on the following documentation, list the features of our Pro plan.
If the information isn't in the documentation, say "Information not available."
Documentation:
${productDocs}
Question: What are the features of our Pro plan?
Answer:`;
For RAG (Retrieval-Augmented Generation) systems, this is critical. You retrieve relevant documents, then explicitly tell the model to answer only from those documents:
// RAG with hallucination prevention
const relevantDocs = await vectorDB.search(userQuestion, { k: 5 }); // top 5 matches
const prompt = `Answer the question using ONLY the information in the provided documents.
If the documents don't contain enough information to answer completely, say so.
DO NOT use external knowledge or make assumptions.
Documents:
${relevantDocs.map((doc, i) => `[${i+1}] ${doc.content}`).join('\n\n')}
Question: ${userQuestion}
Answer (citing document numbers):`;
Additional techniques to reduce hallucination:
- Lower temperature: Temperature controls randomness. Use 0.0-0.3 for factual tasks (higher temperatures increase hallucination)
- Request citations: Ask the model to cite which part of the context it used (makes hallucinations easier to detect)
- Two-step verification: Generate answer, then ask the model to verify it against the source documents
// Two-step verification to catch hallucinations
const answer = await llm.complete(answerPrompt);
const verificationPrompt = `Given this answer and the source documents, identify any claims in the answer that are NOT supported by the documents.
Documents:
${documents}
Answer:
${answer}
Unsupported claims (or "None" if all claims are supported):`;
const verification = await llm.complete(verificationPrompt);
// If verification finds unsupported claims, reject or revise the answer
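Requested citations can also be checked mechanically before any second model call. The sketch below assumes the `[1]`, `[2]` numbering used in the RAG prompt above; `checkCitations` is a hypothetical helper that only confirms the cited numbers exist, not that the cited text supports the claim.

```javascript
// Collect the document numbers an answer cites and flag any that
// fall outside the range of documents actually provided.
function checkCitations(answer, documentCount) {
  const cited = [...answer.matchAll(/\[(\d+)\]/g)].map(m => Number(m[1]));
  const unique = [...new Set(cited)];
  const invalid = unique.filter(n => n < 1 || n > documentCount);
  return { cited: unique, invalid, hasCitations: unique.length > 0 };
}

const ragAnswer = "The Pro plan includes SSO [2] and audit logs [7].";
console.log(checkCitations(ragAnswer, 5));
// { cited: [ 2, 7 ], invalid: [ 7 ], hasCitations: true }
```

An answer with no citations, or with citations to documents that were never supplied, is a cheap signal that it deserves the fuller verification pass.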
Technique 5: Handling Context Window Limits
Context windows (the amount of text the model can process at once) are finite. Depending on the version, GPT-4 supports 8K-128K tokens and Claude supports up to 200K, but every token costs money and adds latency. When your input exceeds the limit, you need strategies beyond "use a bigger model".
Chunking with map-reduce: For long documents, split them into chunks, process each chunk, then aggregate results.
// Map-reduce for summarizing long documents
async function summarizeLongDocument(document, maxChunkSize = 3000) {
const chunks = splitIntoChunks(document, maxChunkSize);
// Map: Summarize each chunk
const chunkSummaries = await Promise.all(
chunks.map(chunk => llm.complete(`Summarize this section:\n\n${chunk}\n\nSummary:`))
);
// Reduce: Combine summaries into final summary
const combinedSummaries = chunkSummaries.join('\n\n');
const finalSummary = await llm.complete(
`These are summaries of different sections of a document.
Combine them into a coherent overall summary:\n\n${combinedSummaries}\n\nFinal summary:`
);
return finalSummary;
}
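The map-reduce function above calls `splitIntoChunks`, which is left undefined. One way it might look is sketched here, under the assumption that `maxChunkSize` is measured in characters and that splitting on blank lines is an acceptable boundary; a token-based splitter would need the model's tokenizer instead.

```javascript
// Split text into chunks of at most maxChunkSize characters,
// preferring paragraph boundaries (blank lines).
function splitIntoChunks(text, maxChunkSize = 3000) {
  const paragraphs = text.split(/\n\s*\n/).map(p => p.trim()).filter(Boolean);
  const chunks = [];
  let current = "";
  for (const paragraph of paragraphs) {
    // Hard-split any single paragraph that is longer than the limit.
    const pieces = [];
    for (let i = 0; i < paragraph.length; i += maxChunkSize) {
      pieces.push(paragraph.slice(i, i + maxChunkSize));
    }
    for (const piece of pieces) {
      const candidate = current ? `${current}\n\n${piece}` : piece;
      if (candidate.length <= maxChunkSize) {
        current = candidate; // still fits in the current chunk
      } else {
        chunks.push(current); // close the current chunk and start a new one
        current = piece;
      }
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

console.log(splitIntoChunks("aaaa\n\nbbbb\n\ncccc", 10));
// [ 'aaaa\n\nbbbb', 'cccc' ]
```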
Selective context: Don't send entire files—send only relevant sections. Use embeddings-based search to find the most relevant paragraphs.
// Send only relevant context
async function answerQuestionAboutCode(question, codebase) {
// Index codebase by functions/classes
const codeIndex = await buildCodeIndex(codebase);
// Find relevant code sections
const relevantSections = await codeIndex.search(question, { k: 5 }); // top 5 matches
// Use only relevant sections (not entire codebase)
const context = relevantSections.map(s => s.code).join('\n\n');
const prompt = `Answer this question about the codebase using only the provided code:
Code:
${context}
Question: ${question}
Answer:`;
return await llm.complete(prompt);
}
Iterative refinement: For tasks like editing, instead of sending the full document, use the LLM to identify which sections need changes, then edit those sections specifically.
| Technique | Use Case | Tradeoff |
|---|---|---|
| Chunking + Map-Reduce | Summarization, analysis | Multiple API calls increase cost/latency |
| Selective Context | Q&A, code search | Retrieval quality affects answer quality |
| Iterative Refinement | Editing, refactoring | Requires multiple round-trips |
| Streaming Processing | Real-time analysis | Can't reference earlier context |
Technique 6: Prompt Optimization Through Testing
Prompt engineering is empirical—what works depends on the model, task, and data. Systematic testing is essential. Don't guess, measure.
Build a test suite: Create a dataset of example inputs with known correct outputs. Test prompt variations against this dataset and measure accuracy.
// Prompt testing framework
const testCases = [
{
input: "Schedule meeting with John next Tuesday at 3pm",
expected: {
action: "schedule_meeting",
participants: ["John"],
datetime: "next Tuesday 3pm"
}
},
// ... more test cases
];
async function testPrompt(promptTemplate) {
  let correct = 0;
  for (const test of testCases) {
    // A function replacement avoids special handling of "$" sequences in the input
    const prompt = promptTemplate.replace('{input}', () => test.input);
    const output = await llm.complete(prompt);
    try {
      if (deepEqual(JSON.parse(output), test.expected)) {
        correct++;
      }
    } catch (parseError) {
      // Unparseable output counts as a failure instead of aborting the whole run
    }
  }
  return {
    accuracy: correct / testCases.length,
    correct,
    total: testCases.length
  };
}
// Test multiple prompt variations
const prompts = [
"Extract structured data from: {input}\nJSON:",
"Parse the following command and return JSON: {input}\nOutput:",
// ... variations
];
for (const prompt of prompts) {
const results = await testPrompt(prompt);
console.log(`Prompt: ${prompt.slice(0, 50)}...`);
console.log(`Accuracy: ${(results.accuracy * 100).toFixed(1)}%\n`);
}
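The test harness compares parsed output to the expected object with a `deepEqual` helper that this guide does not define. A small sketch suitable for JSON-like values (objects, arrays, strings, numbers, booleans, null) is shown below; in Node.js, the built-in `util.isDeepStrictEqual` is an alternative.

```javascript
// Structural equality for JSON-like values.
// Key order is ignored for objects; element order matters for arrays.
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    key => Object.prototype.hasOwnProperty.call(b, key) && deepEqual(a[key], b[key])
  );
}

console.log(deepEqual({ a: 1, b: ["x"] }, { b: ["x"], a: 1 })); // true
console.log(deepEqual(["John", "Ann"], ["Ann", "John"])); // false
```

Exact matching is strict: a correct answer phrased differently ("Tuesday 3pm" vs. "next Tuesday 3pm") counts as a miss, so expected values should be normalized the same way the prompt asks the model to normalize them.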
A/B testing in production: For critical prompts, run multiple versions simultaneously and measure which performs better in real usage.
// A/B test prompts in production
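async function generateWithABTest(input, experimentId = 'default') {
  const variants = {
    A: promptVariantA,
    B: promptVariantB
  };
  // Assign each request to a variant at random (50/50 split)
  const variant = Math.random() < 0.5 ? 'A' : 'B';
  const prompt = variants[variant].replace('{input}', () => input);
  const startTime = Date.now();
  const output = await llm.complete(prompt);
  // Log for analysis. Latency is recorded here; userAccepted is expected to be
  // attached to the record later, when the user keeps or regenerates the output.
  logExperiment({
    experimentId,
    variant,
    input,
    output,
    latency: Date.now() - startTime,
    timestamp: Date.now()
  });
  return output;
}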
// Later: Analyze which variant performs better
function analyzeExperiment(experimentId) {
const results = getExperimentResults(experimentId);
const variantA = results.filter(r => r.variant === 'A');
const variantB = results.filter(r => r.variant === 'B');
return {
A: {
successRate: variantA.filter(r => r.userAccepted).length / variantA.length,
avgLatency: average(variantA.map(r => r.latency))
},
B: {
successRate: variantB.filter(r => r.userAccepted).length / variantB.length,
avgLatency: average(variantB.map(r => r.latency))
}
};
}
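The A/B code above picks a variant per request with `Math.random()`, so the same user can see both variants. If sticky per-user assignment is wanted, one option is to hash a stable identifier. The sketch below assumes a `userId` string is available (it is not part of the original function signature) and uses an FNV-1a-style hash over UTF-16 code units; any stable hash would serve.

```javascript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function hashString(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Map a user to "A" or "B" deterministically. Including the experiment ID
// in the hashed key gives each experiment its own split.
function assignVariant(userId, experimentId = "default") {
  const bucket = hashString(`${experimentId}:${userId}`) / 0x100000000; // in [0, 1)
  return bucket < 0.5 ? "A" : "B";
}

console.log(assignVariant("user-42", "prompt-v2"));
```

The same `userId` and `experimentId` always map to the same variant, which keeps a user's experience consistent and makes per-user acceptance rates meaningful.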
Common metrics to track:
- Output validity (% that parses correctly)
- Accuracy (% that matches expected output)
- Latency (time to generate)
- Cost (tokens used)
- User acceptance (% of outputs users keep vs. regenerate)
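These metrics can be computed from per-request log records. The sketch below assumes each record carries `parsedOk`, `matchedExpected`, `latencyMs`, `tokens`, and an optional `userAccepted` flag; those field names are illustrative, not taken from a specific logging library.

```javascript
// Aggregate per-request records into the metrics listed above.
function summarizeRuns(runs) {
  const total = runs.length;
  if (total === 0) return null;
  const count = predicate => runs.filter(predicate).length;
  const sum = field => runs.reduce((acc, run) => acc + run[field], 0);
  // Only runs where the user actually made a choice count toward acceptance.
  const rated = runs.filter(run => typeof run.userAccepted === "boolean");
  return {
    validityRate: count(run => run.parsedOk) / total,
    accuracy: count(run => run.matchedExpected) / total,
    avgLatencyMs: sum("latencyMs") / total,
    avgTokens: sum("tokens") / total,
    acceptanceRate: rated.length
      ? rated.filter(run => run.userAccepted).length / rated.length
      : null
  };
}

const sampleRuns = [
  { parsedOk: true, matchedExpected: true, latencyMs: 400, tokens: 120, userAccepted: true },
  { parsedOk: true, matchedExpected: false, latencyMs: 600, tokens: 180, userAccepted: false },
  { parsedOk: false, matchedExpected: false, latencyMs: 500, tokens: 100 },
  { parsedOk: true, matchedExpected: true, latencyMs: 500, tokens: 200, userAccepted: true }
];
console.log(summarizeRuns(sampleRuns));
```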
Technique 7: Error Handling and Fallbacks
LLMs fail in unpredictable ways. Production systems need robust error handling, not just try-catch blocks.
Output validation: Always validate LLM output before using it in your application. Never trust that JSON will parse, that generated code will compile, or that extracted data has expected fields.
// Robust output handling with validation
async function extractData(text) {
const prompt = `Extract user data as JSON: ${text}\nJSON:`;
try {
const response = await llm.complete(prompt, { timeout: 5000 });
// Attempt to parse JSON
let data;
try {
data = JSON.parse(response);
} catch (parseError) {
// Try to extract JSON from response (sometimes wrapped in markdown)
const jsonMatch = response.match(/```json\n(.*?)\n```/s);
if (jsonMatch) {
data = JSON.parse(jsonMatch[1]);
} else {
throw new Error('Invalid JSON in LLM response');
}
}
// Validate required fields
const schema = {
name: 'string',
email: 'string'
};
if (!validateSchema(data, schema)) {
throw new Error('Response missing required fields');
}
return data;
} catch (error) {
console.error('LLM extraction failed:', error);
// Fallback: Try simpler regex extraction
return fallbackExtraction(text);
}
}
function fallbackExtraction(text) {
// Regex-based extraction as fallback
const emailMatch = text.match(/[\w.-]+@[\w.-]+\.\w+/);
// Matches "name is John Doe", "called John Doe", "I'm John Doe", "I am John Doe"
const nameMatch = text.match(/(?:name is|called|I'm|I am)\s+(\w+\s+\w+)/i);
return {
name: nameMatch ? nameMatch[1] : null,
email: emailMatch ? emailMatch[0] : null
};
}
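The same validate-before-use rule applies to generated code. As a rough first gate for generated JavaScript, the sketch below checks that a snippet at least parses, using the `Function` constructor to compile it without running it. `hasValidSyntax` is a hypothetical helper with known limits: module syntax (`import`/`export`) is rejected, a stray top-level `return` is accepted, and passing says nothing about whether the code is correct.

```javascript
// Return true if the code parses as a JavaScript function body.
// The code is compiled but never executed.
function hasValidSyntax(code) {
  try {
    new Function(code);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error; // unexpected failure, surface it
  }
}

console.log(hasValidSyntax("function add(a, b) { return a + b; }")); // true
console.log(hasValidSyntax("function add(a, b { return a + b; }")); // false
```

A real pipeline would follow this with linting, type checking, or running the project's tests in a sandbox.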
Retry with modified prompts: If the model fails to produce valid output, retry with a more explicit prompt.
async function generateWithRetry(prompt, maxRetries = 3) {
  let currentPrompt = prompt;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await llm.complete(currentPrompt);
      return JSON.parse(response);
    } catch (error) {
      if (attempt === maxRetries - 1) {
        throw new Error(`Failed after ${maxRetries} attempts: ${error.message}`);
      }
      // Make the prompt more explicit on retry (added once, not once per attempt)
      currentPrompt = `${prompt}\n\nIMPORTANT: Return ONLY valid JSON, no other text.`;
    }
  }
}
Graceful degradation: Design your system so that LLM failures don't break core functionality.
// Example: Smart vs. fallback search
async function searchProducts(query) {
try {
// Try LLM-powered semantic search
const enhancedQuery = await llm.complete(
`Expand this product search query with synonyms and related terms: ${query}\nExpanded query:`
);
return await semanticSearch(enhancedQuery);
} catch (error) {
// Fallback to traditional keyword search
console.warn('LLM search enhancement failed, using keyword search');
return await keywordSearch(query);
}
}
Warning: Don't silently fail and return empty results. Log LLM failures for investigation. Track failure rates—if they exceed 5-10%, your prompts need improvement or you're hitting model limits.
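One way to track that failure rate is a rolling window over recent requests. The class below is a minimal in-memory sketch; the name `FailureRateTracker`, the default window of 100 requests, and the 5% threshold are illustrative choices, and a production system would more likely emit these counts to its existing metrics stack.

```javascript
// Track the failure rate over the most recent N LLM calls.
class FailureRateTracker {
  constructor(windowSize = 100, threshold = 0.05) {
    this.windowSize = windowSize;
    this.threshold = threshold;
    this.outcomes = []; // true = failed, false = succeeded
  }

  record(failed) {
    this.outcomes.push(Boolean(failed));
    if (this.outcomes.length > this.windowSize) {
      this.outcomes.shift(); // drop the oldest outcome
    }
  }

  get failureRate() {
    if (this.outcomes.length === 0) return 0;
    return this.outcomes.filter(Boolean).length / this.outcomes.length;
  }

  get needsAttention() {
    return this.failureRate > this.threshold;
  }
}

const tracker = new FailureRateTracker(10, 0.05);
for (let i = 0; i < 9; i++) tracker.record(false);
tracker.record(true);
console.log(tracker.failureRate, tracker.needsAttention); // 0.1 true
```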
Model-Specific Considerations
Different models respond differently to the same prompt. What works for GPT-4 might fail for Claude or open-source models.
OpenAI (GPT-3.5, GPT-4)
GPT-4 follows instructions precisely and handles complex formats well. Use system messages to set behavior that applies to all requests:
const completion = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{
role: "system",
content: "You are a code review assistant. Always return structured feedback as JSON."
},
{
role: "user",
content: codeToReview
}
]
});
GPT-3.5 is faster and cheaper but less reliable for complex tasks. Use it for simple transformations; use GPT-4 for reasoning.
Anthropic Claude
Claude excels at long-context tasks and following detailed instructions. It is often described as less prone to hallucination than GPT models, though this varies by version and task, and its responses tend to be more verbose. Use explicit formatting instructions to keep responses concise:
const prompt = `${task}
IMPORTANT: Be concise. Return only the requested information, no preamble or explanation.`;
Claude's 200K context window makes it ideal for tasks requiring entire codebases or long documents as input.
Open-Source Models (Llama, Mistral)
Open models tend to follow instructions less reliably than commercial models. Use more explicit few-shot examples and simpler output formats. Avoid complex nested JSON and stick to flat structures.
For code-specific tasks, use code-specialized models like CodeLlama or StarCoder rather than general models.
Debugging Prompts When They Fail
When a prompt produces wrong results, systematically diagnose the issue:
1. Check input validity: Is your input malformed? Does it contain characters that confuse the model (e.g., unbalanced quotes in JSON)?
2. Inspect actual output: Log the raw model response. Often the issue is parsing/interpretation, not generation. The model might return valid JSON wrapped in markdown code blocks, which fails to parse.
3. Simplify the prompt: Remove complexity until it works, then add back piece by piece to find what breaks it.
4. Test with known-good examples: If the prompt fails on production data, test with simplified examples. If those work, the issue is input-specific, not prompt-specific.
5. Check token limits: Your prompt might be truncated. Calculate token count (roughly 4 characters per token) and ensure you're under the model's limit.
// Debug helper: Log everything about a prompt execution
async function debugPrompt(prompt, input) {
console.log('=== Prompt Debug ===');
console.log('Prompt template:', prompt);
console.log('Input:', input);
const fullPrompt = prompt.replace('{input}', input);
console.log('Full prompt length:', fullPrompt.length);
console.log('Estimated tokens:', Math.ceil(fullPrompt.length / 4));
try {
const startTime = Date.now();
const response = await llm.complete(fullPrompt);
const latency = Date.now() - startTime;
console.log('Response:', response);
console.log('Latency:', latency, 'ms');
// Try to parse if JSON expected
try {
const parsed = JSON.parse(response);
console.log('Parsed successfully:', parsed);
} catch (e) {
console.error('Parse failed:', e.message);
}
} catch (error) {
console.error('Completion failed:', error);
}
}
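Step 2 above notes that the model often returns valid JSON wrapped in a markdown code block or surrounded by prose. A tolerant extractor can recover many of those cases before you conclude the prompt is at fault. The sketch below is one possible approach; `extractJson` is a hypothetical helper that returns `undefined` when nothing parses, and the fence marker is built with `repeat` only to keep literal backtick runs out of the example.

```javascript
const FENCE = "`".repeat(3); // markdown code fence marker
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE, "i");

// Try progressively looser ways of finding a JSON object in a response.
function extractJson(response) {
  const candidates = [response.trim()];
  // 1. Contents of a fenced code block, with or without a "json" tag.
  const fenced = response.match(FENCED_BLOCK);
  if (fenced) candidates.push(fenced[1].trim());
  // 2. Everything from the first "{" to the last "}".
  const start = response.indexOf("{");
  const end = response.lastIndexOf("}");
  if (start !== -1 && end > start) {
    candidates.push(response.slice(start, end + 1));
  }
  for (const candidate of candidates) {
    try {
      return JSON.parse(candidate);
    } catch {
      // not valid JSON, try the next candidate
    }
  }
  return undefined;
}

const wrapped = `Here you go:\n${FENCE}json\n{"name": "Ada"}\n${FENCE}`;
console.log(extractJson(wrapped)); // { name: 'Ada' }
console.log(extractJson('Sure! {"a": 1} Hope that helps.')); // { a: 1 }
```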
Cost and Performance Optimization
Every token costs money and adds latency. Optimize prompts for efficiency without sacrificing quality.
Token reduction strategies:
- Remove unnecessary examples after testing (keep only the most effective ones)
- Use terse phrasing in system messages ("Return JSON" vs. "Please return the data formatted as JSON")
- Reference documentation by URL instead of including it (for models with web access)
- Cache static parts of prompts (many providers now support prompt caching)
// Token-efficient prompt (removes unnecessary verbosity)
// Bad: 45 tokens
const inefficient = `I would like you to carefully analyze the following text
and extract any email addresses that you find. Please return them in a JSON
array format. Here is the text: ${text}`;
// Good: 15 tokens
const efficient = `Extract emails from text as JSON array:\n${text}\nEmails:`;
Caching for repeated prompts: If your application uses the same prompt prefix repeatedly (like system instructions), use prompt caching where available (Anthropic Claude supports this, for example). The prefix is processed and cached on the first request; later requests that reuse it are billed for the cached tokens at a reduced rate and typically see lower latency, while new tokens are billed normally.
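As an illustration, Anthropic's Messages API marks a cacheable prefix with a `cache_control` field on a content block. The sketch below only builds the request body and sends nothing; `buildCachedRequest` is a hypothetical helper, the model ID is a placeholder, and minimum cacheable lengths and pricing should be checked against the provider's current documentation.

```javascript
// Build a Messages API request body whose system prompt is marked cacheable.
function buildCachedRequest({ model, systemInstructions, userMessage, maxTokens = 1024 }) {
  return {
    model,
    max_tokens: maxTokens,
    system: [
      {
        type: "text",
        text: systemInstructions,
        // Marks the end of the prefix that the provider may cache.
        cache_control: { type: "ephemeral" }
      }
    ],
    messages: [{ role: "user", content: userMessage }]
  };
}

const cachedRequest = buildCachedRequest({
  model: "your-claude-model-id", // placeholder, not a real model name
  systemInstructions: "You are a code review assistant. Always return structured feedback as JSON.",
  userMessage: "Review this function: ..."
});
console.log(JSON.stringify(cachedRequest, null, 2));
```

Only the stable part of the prompt goes in the cached block; per-request content stays in `messages` so the prefix remains byte-identical across calls.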
Model selection based on task complexity: Don't use GPT-4 for tasks GPT-3.5 can handle. Route requests to the cheapest model that meets quality requirements.
// Route to appropriate model based on task complexity
async function smartCompletion(task, input) {
const complexity = estimateComplexity(task);
if (complexity === 'simple') {
// Use fast, cheap model
return await gpt35.complete(input);
} else if (complexity === 'medium') {
return await gpt4.complete(input);
} else {
// Use most capable model
return await claude.complete(input);
}
}
function estimateComplexity(task) {
if (task.requiresReasoning || task.outputLength > 1000) {
return 'complex';
} else if (task.requiresFormatting || task.hasFewShotExamples) {
return 'medium';
} else {
return 'simple';
}
}
Frequently Asked Questions
How do I prevent the model from refusing tasks due to safety filters?
If legitimate requests trigger safety filters (e.g., generating code that handles sensitive data), rephrase to emphasize the legitimate purpose. "Write code to process credit cards" might be blocked; "Write code to tokenize payment information using industry-standard PCI compliance methods" likely won't be. Context matters—explain why you need what you're requesting.
Should I use temperature 0 for deterministic output?
Not always. Temperature 0 makes output far more repeatable (though most providers don't guarantee identical results on every call), but it can produce repetitive or lower-quality results for creative tasks. For factual extraction and code generation, use 0-0.2. For content generation and brainstorming, use 0.7-1.0. Always test with your specific use case.
How many tokens should I budget for typical tasks?
Simple extraction: 100-500 tokens. Code generation: 500-2000 tokens. Document summarization: 500-1500 tokens. Complex reasoning: 1000-3000 tokens. Always set max_tokens limits to prevent runaway generation—models will continue until they hit limits or decide they're done.
Can I use prompt engineering to make a small model perform like a large one?
No. Better prompts improve performance within a model's capability ceiling, but they can't overcome fundamental model limitations. A well-prompted 7B parameter model usually won't match even a poorly-prompted 70B model on complex reasoning. Prompt engineering maximizes what your chosen model can do.
How do I handle multilingual prompts?
Major models (GPT-4, Claude) handle multiple languages well. Provide examples in the target language for few-shot prompts. Be explicit about which language to use for output. For code-switching (input in one language, output in another), state this clearly: "Translate this English text to French JSON: ..."
What's the best way to have the model use external tools or APIs?
Use function calling (OpenAI) or tool use (Claude) features, which are designed for this. The model decides when to call a function and generates the parameters. Your code executes the actual API call, then returns results to the model for incorporation into the response. This is more reliable than asking the model to "write code that calls X".
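The "your code executes the actual API call" step can be sketched as a small dispatcher. This assumes the model's request has already been reduced to a `{ name, arguments }` object with `arguments` as a JSON string, which mirrors the OpenAI shape used earlier in this guide; `dispatchToolCall` and the `get_weather` tool are hypothetical.

```javascript
// Execute a model-requested tool call against a registry of allowed tools.
// Errors are returned as data so they can be sent back to the model.
function dispatchToolCall(toolCall, registry) {
  // A Map avoids accidentally resolving inherited names like "constructor".
  const handler = registry.get(toolCall.name);
  if (typeof handler !== "function") {
    return { ok: false, error: `Unknown tool: ${toolCall.name}` };
  }
  let args;
  try {
    args = JSON.parse(toolCall.arguments);
  } catch {
    return { ok: false, error: "Tool arguments were not valid JSON" };
  }
  try {
    return { ok: true, result: handler(args) };
  } catch (error) {
    return { ok: false, error: error.message };
  }
}

const toolRegistry = new Map([
  ["get_weather", ({ city }) => `Sunny in ${city}`] // stand-in for a real API call
]);

console.log(dispatchToolCall({ name: "get_weather", arguments: '{"city": "Paris"}' }, toolRegistry));
// { ok: true, result: 'Sunny in Paris' }
console.log(dispatchToolCall({ name: "delete_everything", arguments: "{}" }, toolRegistry));
// { ok: false, error: 'Unknown tool: delete_everything' }
```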
How do I make prompts work across different model versions?
Stick to simple, explicit instructions that don't rely on model-specific quirks. Test prompts on multiple models before deploying. If you plan to support multiple models, avoid depending on provider-specific features (such as one vendor's function-calling format) and prefer universal patterns like few-shot examples.
Can I chain multiple LLM calls to improve quality?
Yes—this is common for complex tasks. Generate an initial response, then use another call to critique and refine it. Or use one model to plan the task, another to execute steps. Be aware this multiplies cost and latency, so only use for high-value tasks where quality justifies the expense.
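A generate, critique, refine loop might look like the sketch below. It assumes `llm` is any async function that takes a prompt string and resolves to a string; the prompt wording and the `generateWithCritique` name are illustrative. Each round adds up to two extra calls, which is where the cost multiplication comes from.

```javascript
// Draft a response, ask for a critique, and revise if problems are found.
async function generateWithCritique(task, llm, maxRounds = 1) {
  let draft = await llm(`${task}\n\nResponse:`);
  for (let round = 0; round < maxRounds; round++) {
    const critique = await llm(
      `Task: ${task}\n\nDraft:\n${draft}\n\n` +
      `List concrete problems with the draft, or reply "None" if it is acceptable.\nProblems:`
    );
    if (critique.trim().toLowerCase() === "none") break; // good enough, stop early
    draft = await llm(
      `Task: ${task}\n\nDraft:\n${draft}\n\nProblems found:\n${critique}\n\n` +
      `Rewrite the draft to fix these problems.\nRevised response:`
    );
  }
  return draft;
}

// Usage with a stub in place of a real model client.
const stubLlm = async prompt => {
  if (prompt.endsWith("Problems:")) return "Too vague";
  if (prompt.endsWith("Revised response:")) return "Improved draft";
  return "First draft";
};
generateWithCritique("Summarize the release notes", stubLlm).then(result => {
  console.log(result);
});
```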
How do I evaluate prompt quality objectively?
Build a test dataset with ground truth answers. Measure accuracy, validity (% of outputs that parse correctly), and consistency (same input produces same output with temperature 0). Track these metrics as you iterate on prompts. Also measure user acceptance in production—the ultimate test is whether humans find outputs useful.
Should I include instructions in every message or just the system message?
For chat-based models (like OpenAI's Chat Completions API), put persistent instructions in the system message and task-specific instructions in user messages. For completion models, include all instructions in each prompt. When in doubt, err on the side of repetition: models sometimes drift from system-message instructions, especially in long conversations.
Conclusion
Effective prompt engineering for developers is about reliability, not cleverness. The techniques that matter most—structured output, few-shot examples, context grounding, and systematic testing—all work by constraining the model's behavior to produce predictable, parseable results. This is fundamentally different from optimizing for impressive one-off responses in ChatGPT.
Start with structured output and validation. These two techniques alone will solve 80% of integration problems. Add few-shot examples when quality isn't sufficient with zero-shot prompts. Use chain-of-thought for complex reasoning tasks. Implement proper error handling and fallbacks before going to production.
Prompt engineering is empirical—test everything with real data, measure outcomes, and iterate. What works in theory often fails in practice, and what seems hacky sometimes produces the best results. Build a test suite, track metrics, and refine based on actual performance, not intuition.