Building a Conversational Form-Filling System with LLMs: An Iterative Prompt Engineering Journey

In this article, we’ll explore the journey of creating and improving a conversational form-filling system using Large Language Models (LLMs). This case study demonstrates how iterative prompt engineering and architectural decisions can significantly enhance AI-powered conversations, creating more natural and effective user experiences.

The Initial Approach

We started with a single, comprehensive prompt that attempted to handle multiple responsibilities simultaneously:


import OpenAI from "openai";

const openai = new OpenAI(); // assumes OPENAI_API_KEY is set in the environment

// This is a simplified version of our very first prompt, which tried to do everything at once
const prompt = `
You are a conversational form agent.
Tasks:
1. Read user message
2. Update form fields
3. Track completed fields
4. Generate follow-up question

Current form state: ${JSON.stringify(formState)}
User says: "${userMessage}"
`;

const completion = await openai.chat.completions.create({
  model: "gpt-4.1-mini",
  messages: [{ role: "system", content: prompt }]
});

While this approach showed promise, we encountered issues with tracking completed fields, and conversations occasionally ran on endlessly without ever reaching completion. The system would sometimes lose track of which fields had been addressed, creating frustrating user experiences.

Divide and Conquer Strategy

Analyzing the prompt’s responsibilities led us to split the interaction with the model into two dedicated rounds:


// Simplified code and prompts for clarity

// Extraction Prompt
const extractionPrompt = `
Extract any form data from: "${userMessage}"
Current fields: ${JSON.stringify(formFields)}
Output as JSON with keys matching field names.
`;

// Question Generation Prompt
const questionPrompt = `
Based on form state: ${JSON.stringify(formState)},
generate the next most natural question to ask the user.
`;

// Call them separately
const extraction = await openai.chat.completions.create({...});
const question = await openai.chat.completions.create({...});

This separation improved conversation quality, though occasional tracking issues remained and the extra model call increased latency. However, the architecture made each component easier to reason about and optimize.
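To make the two rounds concrete, here is a minimal sketch of one conversation turn, reusing the client from the first snippet. The FormState shape, the JSON-mode response format, and the merge logic are assumptions for illustration rather than the exact production code.

// A minimal sketch of one conversation turn using the two rounds above.
// The FormState shape and the merge logic are illustrative assumptions.
type FormState = Record<string, string | null>;

async function runTurn(formState: FormState, userMessage: string) {
  // Round 1: extract structured data from the user's message
  const extraction = await openai.chat.completions.create({
    model: "gpt-4.1-mini",
    messages: [{
      role: "system",
      content: `Extract any form data from: "${userMessage}"
Current fields: ${JSON.stringify(Object.keys(formState))}
Output as JSON with keys matching field names.`
    }],
    response_format: { type: "json_object" } // request strict JSON output
  });
  const extracted = JSON.parse(extraction.choices[0].message.content ?? "{}");

  // Merge extracted values into the form state in plain, deterministic code
  const updatedState: FormState = { ...formState, ...extracted };

  // Round 2: generate the next question from the updated state
  const question = await openai.chat.completions.create({
    model: "gpt-4.1-mini",
    messages: [{
      role: "system",
      content: `Based on form state: ${JSON.stringify(updatedState)},
generate the next most natural question to ask the user.`
    }]
  });

  return {
    formState: updatedState,
    question: question.choices[0].message.content ?? ""
  };
}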

Adding Data Validation

Next, we introduced data validation using field constraints and examples. Rather than complicating existing prompts, we created a separate validation step:


// Step 1: Extract or infer form data from the user's message
let extraction = await extractFormDataFromMessage(...);

// Optional Step 2: Final validation of the collected information
if (isFormCompleted(...)) {
  // For this crucial step, we select a more powerful model
  const validation = await validateFormData(...);

  // Update extracted data—e.g., remove invalid values so they can be re-asked
  extraction.formData = validation.formData;
}

// Step 3: Generate the next follow-up message
// This function is also responsible for ending the conversation when appropriate
const followup = await getNextFollowupMessage(...);

This approach improved accuracy but further increased latency, though newer models like GPT-4.1 mini helped mitigate this issue. The validation layer ensured that collected information met the required format and constraints, reducing errors in the final submitted forms.
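To illustrate, here is one possible shape for the validateFormData step. The FieldSpec structure, the prompt wording, and the choice of gpt-4.1 as the "more powerful model" are assumptions made for this sketch, not the exact production implementation.

// One possible implementation sketch of the validation step.
// FieldSpec and the prompt wording are assumptions for illustration.
interface FieldSpec {
  name: string;
  constraint: string;   // e.g. "must be a valid email address"
  example: string;      // e.g. "jane.doe@example.com"
}

async function validateFormData(
  formData: Record<string, string>,
  fieldSpecs: FieldSpec[]
): Promise<{ formData: Record<string, string> }> {
  const validationPrompt = `
You are validating a completed form.
Field constraints and examples: ${JSON.stringify(fieldSpecs)}
Collected data: ${JSON.stringify(formData)}
Return JSON shaped as { "formData": { ... } }, keeping only values that satisfy
their constraints and dropping invalid values so they can be re-asked.
`;

  const completion = await openai.chat.completions.create({
    model: "gpt-4.1", // a more capable model for this crucial step
    messages: [{ role: "system", content: validationPrompt }],
    response_format: { type: "json_object" }
  });

  return JSON.parse(completion.choices[0].message.content ?? '{"formData":{}}');
}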

Avoiding Stateful Operations in LLMs

A crucial insight emerged: LLMs struggle with precise state management like tracking field attempt counts. While they handle this correctly most of the time, “most of the time” isn’t sufficient for production systems.

We redesigned the follow-up message generator to:

  • Generate natural next questions
  • Explicitly indicate which fields the question addresses
  • Leave the counting and tracking to deterministic code

// LLM suggests next question + targeted fields
const followUpPrompt = `
Form state: ${JSON.stringify(formState)}
User has answered: "${userMessage}"

Suggest next question and the targeted fields, e.g.:
{
  "question": "Could you confirm your email?",
  "targetedFields": ["email"]
}
`;

const followUp = await openai.chat.completions.create({...});
const { question, targetedFields } = JSON.parse(followUp.choices[0].message.content ?? "{}");

// Deterministic code tracks state
updateFormState(targetedFields, userMessage);

This hybrid approach eliminated tracking issues, provided better guarantees about form completion, and slightly improved latency. The key learning here was recognizing when to rely on LLMs for their strengths (natural language generation) and when to use traditional code for precise, deterministic operations.
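As an illustration, the deterministic side can be as simple as the sketch below. The TrackedField shape, the attempt limit, and the exact signatures of updateFormState and isFormCompleted are assumptions here; the snippets above deliberately elide them.

// A minimal sketch of the deterministic bookkeeping; the TrackedField shape,
// MAX_ATTEMPTS, and these exact signatures are illustrative assumptions.
interface TrackedField {
  value: string | null;
  attempts: number; // how many times we have asked about this field
}

const MAX_ATTEMPTS = 3;

function updateFormState(
  formState: Record<string, TrackedField>,
  targetedFields: string[],
  extractedData: Record<string, string>
): void {
  for (const field of targetedFields) {
    const tracked = formState[field];
    if (!tracked) continue;

    if (extractedData[field]) {
      tracked.value = extractedData[field]; // answered: store the value
    } else {
      tracked.attempts += 1;                // not answered: count the attempt
    }
  }
}

// Plain code, not the LLM, decides when the form is done
function isFormCompleted(formState: Record<string, TrackedField>): boolean {
  return Object.values(formState).every(
    (field) => field.value !== null || field.attempts >= MAX_ATTEMPTS
  );
}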

Transitioning from Acceptable to Production-Ready

At this stage of our journey, we had achieved a conversational system that performed acceptably in most scenarios. The form-filling process was more natural, the conversation flowed better, and we had eliminated most of the critical issues that plagued earlier iterations. However, we weren’t yet ready for production deployment.

Several challenges remained before we could confidently roll out this system:

  • Lack of systematic evaluation: We were primarily relying on manual testing and subjective assessments, making it difficult to quantify improvements or regressions.
  • Inconsistent performance across edge cases: While the system performed well in typical scenarios, we hadn’t thoroughly tested against the wide variety of user inputs that would appear in production.
  • Insufficient confidence in accuracy metrics: Without robust evaluation frameworks, we couldn’t provide stakeholders with concrete metrics about the system’s reliability.

At this point, our mindset shifted from speed of iteration to stability and measurement. Rather than continuing with ad-hoc improvements, we needed to establish systematic LLM and prompt evaluation methods that would allow us to:

  • Quantify the impact of each change
  • Build confidence in the system’s performance across diverse scenarios
  • Create a reliable benchmark against which to measure future optimizations

Since we already use targeted prompts for individual components, we can leverage evaluation frameworks like LangSmith to systematically test and assess each part. If we want to evaluate the entire chat experience end-to-end, a more comprehensive evaluation setup may be required, but this is typically reserved for situations where granular evaluations are insufficient.
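As a starting point, even a small scenario-based regression suite over the extraction step is valuable, and the same scenarios can later be moved into a framework like LangSmith for richer scoring and tracking. The sketch below is framework-agnostic; the scenarios and the declared extractFormDataFromMessage signature are assumptions for illustration.

// A framework-agnostic sketch of a scenario-based regression test for the
// extraction step. The scenarios and the declared signature are illustrative.
declare function extractFormDataFromMessage(
  userMessage: string
): Promise<{ formData: Record<string, string> }>;

interface ExtractionScenario {
  userMessage: string;
  expected: Record<string, string>;
}

const scenarios: ExtractionScenario[] = [
  {
    userMessage: "I'm Jane, you can reach me at jane@example.com",
    expected: { name: "Jane", email: "jane@example.com" }
  },
  { userMessage: "I'd rather not share my email", expected: {} }
];

async function evaluateExtraction(): Promise<number> {
  let passed = 0;
  for (const scenario of scenarios) {
    const { formData } = await extractFormDataFromMessage(scenario.userMessage);
    const ok =
      Object.keys(formData).length === Object.keys(scenario.expected).length &&
      Object.entries(scenario.expected).every(
        ([field, value]) => formData[field] === value
      );
    if (ok) passed += 1;
  }
  return passed / scenarios.length; // fraction of scenarios passing exactly
}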

This transition from experimental prompt engineering to structured evaluation represents a critical maturation in LLM-based system development. It’s the bridge between “works well enough in demos” and “reliable enough for thousands of real users.”

A Methodical Approach to Prompt Engineering

This case study illustrates a structured approach to prompt engineering:

  1. Start simple with a single-prompt solution
  2. Evaluate acceptance: if unacceptable, iterate rapidly by improving prompts, using better models, or dividing responsibilities
  3. When acceptable: implement evaluation frameworks to systematically improve quality (e.g., scenario-based evals, regression tests)
  4. Production readiness check: continue improving until the solution meets production requirements
  5. Monitor in production: watch real-world performance
  6. Optimize continuously: improve performance and costs while maintaining quality, using evaluation frameworks

The key to production-ready LLM systems lies in starting simple, iterating fast, and rigorously validating before scaling. This methodical cycle ensures continuous improvement while maintaining quality, demonstrating how prompt engineering can evolve from simple experimentation to robust production systems.