Scalable Synthetic Dataset Generation for AI Evaluation

Alquimia AI

Keywords: synthetic data generation, dataset creation, LLM evaluation, automated testing, AI benchmarking

Abstract

Creating high-quality test datasets for evaluating AI assistants and language models is a fundamental challenge in AI systems development. Manual dataset creation is expensive, time-consuming, and does not scale with the need for diverse, comprehensive evaluation scenarios. This paper introduces the Gaussia Generators module, an automated pipeline for scalable synthetic dataset generation from unstructured context documents. Our approach combines intelligent document chunking (header-based and size-aware segmentation), pluggable chunk selection strategies, and LLM-driven query and conversation synthesis to produce standardized, reusable evaluation datasets. We present two core generation modes: (1) independent query generation for diverse question coverage, and (2) coherent multi-turn conversation generation for conversational AI evaluation. Our modular architecture supports multiple chunk selection strategies including sequential processing and random sampling, enabling flexible dataset creation tailored to specific evaluation needs. Additionally, we introduce the "roast me" evaluation paradigm, where adversarial and edge-case queries are systematically generated to probe agent resilience and consistency. The generator produces Gaussia-compatible Dataset objects suitable for immediate use with existing evaluation metrics. Experimental results demonstrate that synthetically generated datasets provide diverse evaluation coverage from knowledge bases while maintaining grounding in source material. We implement the Gaussia Generators module in Python with LangChain integration, making it accessible to practitioners building AI evaluation pipelines.

Introduction

Problem Statement

Evaluating AI assistants and language models requires high-quality, diverse test datasets. Current approaches to dataset creation face significant challenges:

  1. Manual Annotation Bottleneck: Human-curated datasets require substantial expert effort, limiting scalability and increasing costs. For complex evaluation domains (legal, medical, technical documentation), expert knowledge is mandatory, making manual curation prohibitively expensive.
  2. Limited Coverage: Hand-created datasets inevitably have biases and gaps, potentially missing edge cases, contradictory scenarios, or adversarial conditions that would reveal model weaknesses.
  3. Maintenance Burden: As knowledge bases, documentation, and requirements evolve, datasets become outdated. Keeping evaluation datasets synchronized with domain content requires continuous re-annotation.
  4. Lack of Grounding: Synthetic queries generated without reference to source material may be disconnected from real-world usage patterns or miss domain-specific subtleties.
  5. Scalability Constraint: Organizations with large, evolving knowledge bases (technical documentation, product specifications, compliance manuals) cannot feasibly create and maintain evaluation datasets manually.

Proposed Solution

We present the Gaussia Generators module, an automated, scalable pipeline for synthetic dataset generation that addresses these challenges through:

  1. Intelligent Document Chunking: A hybrid chunking strategy that respects document structure (markdown headers) while gracefully handling long sections through size-aware segmentation.
  2. Pluggable Selection Strategies: Composable strategies for grouping and selecting chunks, enabling different generation patterns (sequential, random sampling, clustering-based, etc.) without modifying core generation logic.
  3. LLM-Driven Synthesis: Leveraging large language models (via LangChain) to generate diverse, contextually grounded queries and coherent multi-turn conversations.
  4. Configurable Generation Modes: Support for both independent query generation (maximizing diversity) and coherent conversation generation (testing dialogue continuity and context management).
  5. Adversarial Query Generation: Extension to the "roast me" paradigm, where custom system prompts and selection strategies enable systematic generation of edge-case and adversarial queries to evaluate agent resilience.
  6. Seamless Integration: Output in standardized Gaussia Dataset format, enabling immediate use with existing evaluation metrics and pipelines.

Contributions

This work makes the following contributions:

  • A modular architecture for synthetic dataset generation with extensible chunk selection strategies.
  • A hybrid document chunking strategy balancing structural awareness with practical size constraints.
  • Dual generation modes: independent queries for breadth and coherent conversations for dialogue evaluation.
  • Introduction of the "roast me" evaluation paradigm for systematic adversarial testing.
  • An open-source Python implementation integrated with the Gaussia evaluation framework.
  • Practical guidance on deploying LLM-powered dataset generation for real-world evaluation pipelines.

Related Work

Dataset Generation

Synthetic data generation has been explored across multiple domains. In machine learning, the work of Goodfellow et al. [1] on generative adversarial networks established foundations for synthetic data creation. In NLP, recent work has focused on task-specific generation:

  • Prompt-Based Generation: Using language models to generate synthetic training and evaluation data from prompts [2,3].
  • Paraphrasing and Augmentation: Techniques like EDA (Easy Data Augmentation) [4] and backtranslation [5] generate variations of existing data.
  • Template-Based Synthesis: Structured generation from templates for specific tasks [?].

DeepEval's Synthesizer [6] provides diverse generation strategies including decomposition, contextual, multi-hop, and reasoning-based synthesis. Our work builds on similar principles while emphasizing document-grounded generation and composable selection strategies.

Evaluation Frameworks

Recent evaluation frameworks highlight the importance of diverse, scalable datasets:

  • DeepEval: An LLM evaluation framework with synthetic data generation and metric evaluation [6].
  • RAGAS: A framework for evaluating RAG systems with task-specific metrics [8].

Our generators module specifically targets Gaussia metrics but is designed to be generally applicable.

Document Chunking

Effective document segmentation is critical for retrieval and synthesis tasks. Approaches include:

  • Fixed-Size Chunks: Simple but structure-agnostic [9].
  • Semantic Chunking: Using embeddings or semantic similarity to preserve meaning [10].
  • Structure-Aware Chunking: Respecting document hierarchy (headers, sections) [11].

Our hybrid approach combines structure-awareness with size constraints for practical effectiveness.

Methodology

Architecture Overview

The Gaussia Generators pipeline consists of three main components:

  1. Context Loader: Reads source documents and partitions them into chunks.
  2. Selection Strategy: Groups chunks according to a pluggable strategy.
  3. Generator: Synthesizes queries or conversations from chunk groups using an LLM.

Each component is independently replaceable, enabling extensibility without modifying existing code (Open/Closed Principle).

Context Loading

Hybrid Chunking Strategy

The LocalMarkdownLoader implements a two-stage chunking algorithm:

Algorithm 1: Hybrid Markdown Chunking

  Input: markdown content $C$, header levels $H$, max chunk size $M_{max}$, min chunk size $M_{min}$
  Output: chunks $X = \{x_1, x_2, \ldots, x_n\}$

  // Stage 1: Header-based splitting
  sections $\leftarrow$ split($C$, regex patterns for headers in $H$)
  chunks $\leftarrow \{\}$
  for each section $s \in$ sections:
    if len($s$) $>$ $M_{max}$:
      // Stage 2: Size-based splitting with overlap
      sub_chunks $\leftarrow$ split_by_size($s$, $M_{max}$, $M_{min}$, overlap)
      chunks $\leftarrow$ chunks $\cup$ sub_chunks
    else:
      chunks $\leftarrow$ chunks $\cup$ $\{s\}$
    end if
  end for
  return chunks

Key characteristics:

  • Header-Driven: Respects markdown structure by splitting on headers (H1, H2, H3, etc.). Each section becomes a chunk.
  • Overflow Handling: If a section exceeds $M_{max}$ characters, it is recursively split by size with paragraph and sentence boundary awareness.
  • Overlap: Size-based splits maintain configurable character overlap to preserve context continuity.
  • Metadata Preservation: Chunks retain metadata (source file, header context, chunking method) for traceability and filtering.
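
The two stages can be sketched in a few lines of Python. This is a minimal illustration and not the LocalMarkdownLoader source: the `Chunk` dataclass, the function names, and the metadata keys are assumptions made for the example. The sketch also simplifies the algorithm by omitting the minimum chunk size and by cutting at fixed character offsets, without the paragraph and sentence boundary awareness described above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical chunk container; the real class may differ."""
    content: str
    metadata: dict = field(default_factory=dict)


def split_by_size(text, max_size, overlap):
    """Stage 2: fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < max_size:
        raise ValueError("overlap must satisfy 0 <= overlap < max_size")
    pieces, start = [], 0
    while True:
        pieces.append(text[start:start + max_size])
        if start + max_size >= len(text):
            return pieces
        start += max_size - overlap


def hybrid_chunk(markdown, header_levels=(1, 2, 3), max_size=2000,
                 overlap=100, source="<memory>"):
    """Stage 1 splits on headers; oversized sections fall through to stage 2."""
    # Zero-width boundary in front of every header of a configured level.
    alternatives = "|".join("#{%d}" % n for n in header_levels)
    boundary = re.compile(r"^(?=(?:%s)\s)" % alternatives, re.MULTILINE)

    chunks = []
    for section in boundary.split(markdown):
        section = section.strip()
        if not section:
            continue
        header = section.splitlines()[0] if section.startswith("#") else None
        if len(section) > max_size:
            parts, method = split_by_size(section, max_size, overlap), "size"
        else:
            parts, method = [section], "header"
        for part in parts:
            # Metadata keys are illustrative, chosen to mirror the list above.
            chunks.append(Chunk(part, {"source": source, "header": header,
                                       "chunking_method": method}))
    return chunks
```

Calling `hybrid_chunk(text, header_levels=(1, 2), max_size=30, overlap=5)` on a short document returns one chunk per small section and several overlapping chunks for any section longer than 30 characters.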

Chunk Selection Strategies

A BaseChunkSelectionStrategy defines how chunks are grouped for processing. Each yielded group becomes a separate dataset.

Sequential Strategy

The default SequentialStrategy yields all chunks as a single group:

$$\text{select}(X) = [X] \quad \text{(a single group containing all chunks)}$$

This maintains backward compatibility and is appropriate when the full context is required for each evaluation scenario.

Random Sampling Strategy

The RandomSamplingStrategy generates $n_{samples}$ random subsets of size $k_{sample}$:

$$\text{select}(X) = [S_1, S_2, \ldots, S_{n_{samples}}], \quad \text{where } S_i \sim \text{Sample}(X, k_{sample}) \text{ is drawn with or without replacement}$$

This is useful for:

  • Testing model robustness across different context subsets.
  • Creating diverse evaluation scenarios from a single knowledge base.
  • Reducing computational cost by sampling smaller context windows.
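
A minimal sketch of the two strategies follows. The class names and the `select()` method come from the text; the constructor parameters (`num_samples`, `chunks_per_sample`, `with_replacement`, `seed`) and the use of plain strings as chunks are assumptions made for illustration, so the actual signatures may differ.

```python
import random
from abc import ABC, abstractmethod
from typing import Iterator, Optional, Sequence


class BaseChunkSelectionStrategy(ABC):
    """Each yielded group is turned into one dataset by the generator."""

    @abstractmethod
    def select(self, chunks: Sequence[str]) -> Iterator[list]:
        ...


class SequentialStrategy(BaseChunkSelectionStrategy):
    def select(self, chunks):
        # A single group that contains every chunk, in document order.
        yield list(chunks)


class RandomSamplingStrategy(BaseChunkSelectionStrategy):
    def __init__(self, num_samples: int, chunks_per_sample: int,
                 with_replacement: bool = False, seed: Optional[int] = None):
        # Parameter names are hypothetical.
        self.num_samples = num_samples
        self.chunks_per_sample = chunks_per_sample
        self.with_replacement = with_replacement
        self._rng = random.Random(seed)  # seeded for reproducible datasets

    def select(self, chunks):
        pool = list(chunks)
        for _ in range(self.num_samples):
            if self.with_replacement:
                yield self._rng.choices(pool, k=self.chunks_per_sample)
            else:
                # Without replacement a sample cannot exceed the pool size.
                k = min(self.chunks_per_sample, len(pool))
                yield self._rng.sample(pool, k)
```

Seeding the sampler is a design choice worth keeping in a real implementation: it lets the same evaluation datasets be regenerated when comparing two versions of an assistant.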

Extensibility

New strategies can be implemented by subclassing BaseChunkSelectionStrategy and implementing the select() method. Strategies can implement arbitrary logic: clustering, embedding-based similarity, adversarial selection, etc.

Query Generation

Single-Turn Query Synthesis

For independent query generation, the generator constructs a system prompt template:

$$p_{system} = \text{template}(\text{context}, \text{seed\_examples}, n_{queries})$$

Where:

  • context: The full chunk content
  • seed_examples: Optional example queries for style guidance
  • $n_{queries}$: Number of queries to generate

The LLM is instructed to generate $n_{queries}$ diverse questions at varying cognitive levels (recall, comprehension, application, analysis). Output is structured:

$$\text{GeneratedQueriesOutput} = \{ \text{queries}: [\text{query}_1, \ldots, \text{query}_{n_{queries}}],\ \text{chunk\_summary}: \text{str} \}$$

Each query includes optional metadata:

$$\text{GeneratedQuery} = \{ \text{query}: \text{str},\ \text{difficulty}: \{\text{easy}, \text{medium}, \text{hard}\},\ \text{query\_type}: \{\text{factual}, \text{inferential}, \text{comparative}, \text{analytical}\} \}$$
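
The prompt template and output schema above might look roughly as follows. The wording of the prompt is invented for this sketch, since the module's actual template is not reproduced here, and standard-library dataclasses stand in for the Pydantic models so that the example has no third-party dependencies.

```python
from dataclasses import dataclass
from typing import Optional

DIFFICULTIES = {"easy", "medium", "hard"}
QUERY_TYPES = {"factual", "inferential", "comparative", "analytical"}


@dataclass
class GeneratedQuery:
    query: str
    difficulty: Optional[str] = None   # optional metadata
    query_type: Optional[str] = None   # optional metadata

    def __post_init__(self):
        if self.difficulty is not None and self.difficulty not in DIFFICULTIES:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")
        if self.query_type is not None and self.query_type not in QUERY_TYPES:
            raise ValueError(f"unknown query_type: {self.query_type!r}")


@dataclass
class GeneratedQueriesOutput:
    queries: list
    chunk_summary: str


def build_system_prompt(context, n_queries, seed_examples=None):
    """Hypothetical template: fills in context, seed examples, and n_queries."""
    lines = [
        f"Generate {n_queries} diverse questions about the context below.",
        "Vary the cognitive level: recall, comprehension, application, analysis.",
    ]
    if seed_examples:
        lines.append("Match the style of these example queries:")
        lines.extend(f"- {example}" for example in seed_examples)
    lines.append("Context:")
    lines.append(context)
    return "\n".join(lines)
```

Validating the enumerated fields at construction time means a malformed model response fails early, before it reaches the dataset.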

Multi-Turn Conversation Synthesis

For coherent conversation generation, the generator uses a conversation-specific prompt:

$$p_{system}^{conv} = \text{template\_conversation}(\text{context}, \text{seed\_examples}, n_{turns})$$

The LLM is instructed to generate a natural, escalating conversation where:

  • Turn $t = 1$ establishes a broad question about the main topic.
  • Each turn $t > 1$ follows naturally from the earlier turns, potentially referencing or clarifying previous answers.
  • Cognitive level progresses: simple recall → comprehension → analysis.

Output is structured:

$$\text{GeneratedConversationOutput} = \{ \text{turns}: [\text{turn}_1, \ldots, \text{turn}_{n_{turns}}],\ \text{conversation\_summary}: \text{str},\ \text{chunk\_summary}: \text{str} \}$$

Each turn includes:

$$\text{ConversationTurn} = \{ \text{query}: \text{str},\ \text{turn\_number}: \mathbb{Z}_{>0},\ \text{difficulty}: \{\text{easy}, \text{medium}, \text{hard}\},\ \text{query\_type}: \{\text{factual}, \text{inferential}, \text{follow-up}, \ldots\},\ \text{expected\_context}: \text{str} \mid \text{None} \}$$
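
Because turn order carries meaning in a conversation, it can help to check the turn numbers before building a dataset. The sketch below models a turn as a dataclass with the fields listed above and adds a hypothetical helper, `order_turns`, that sorts turns and rejects gaps or duplicates. Whether the module performs such a check is not stated in the text, so treat it as an illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConversationTurn:
    query: str
    turn_number: int                      # positive integer, starting at 1
    difficulty: Optional[str] = None
    query_type: Optional[str] = None
    expected_context: Optional[str] = None


def order_turns(turns):
    """Return turns sorted by turn_number, requiring exactly 1..n."""
    ordered = sorted(turns, key=lambda turn: turn.turn_number)
    actual = [turn.turn_number for turn in ordered]
    expected = list(range(1, len(ordered) + 1))
    if actual != expected:
        raise ValueError(f"turn numbers must be {expected}, got {actual}")
    return ordered
```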

LangChain Integration

The generator uses LangChain's BaseChatModel interface, supporting any LangChain-compatible model (OpenAI, Anthropic, Groq, open-source via Ollama, etc.):

  • Structured Output: When the model supports with_structured_output(), responses are automatically parsed into Pydantic models.
  • Fallback Parsing: For models without structured output support, responses are parsed from JSON blocks or raw JSON in the response.
  • Flexibility: Custom system prompts can override defaults, enabling domain-specific generation behaviors.
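
The fallback path can be illustrated with the standard library alone. The function name `extract_json` is an assumption, and the sketch shows one reasonable way to implement the behavior described above: try fenced JSON blocks first, then look for a raw JSON object anywhere in the response.

```python
import json
import re

# Built from pieces so the fence marker never appears literally in this file.
_FENCE = "`" * 3
_FENCED_BLOCK = re.compile(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL)


def extract_json(response: str):
    """Parse JSON from a fenced block, else from raw JSON in the text."""
    # 1) Prefer fenced blocks, which models commonly emit around JSON.
    for match in _FENCED_BLOCK.finditer(response):
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            continue

    # 2) Otherwise try to decode an object starting at each "{".
    decoder = json.JSONDecoder()
    for index, char in enumerate(response):
        if char == "{":
            try:
                parsed, _end = decoder.raw_decode(response, index)
                return parsed
            except json.JSONDecodeError:
                continue

    raise ValueError("no JSON object found in model response")
```

Using `raw_decode` lets the parser stop at the end of the first valid object, so trailing commentary from the model does not cause a failure.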

The "Roast Me" Evaluation Paradigm

Motivation

Traditional evaluation datasets test how well an AI system answers typical questions. However, resilient, production-ready systems must handle adversarial, contradictory, and edge-case scenarios:

  • Consistency Under Pressure: How does the agent handle contradictory follow-ups?
  • Scope Awareness: Can the agent gracefully handle out-of-scope queries?
  • Ambiguity Tolerance: How does it respond to ambiguous or conflicting instructions?
  • Contradiction Detection: Can it identify and flag inconsistencies in provided documentation?

Implementation

The Generators module enables "roast me" evaluation through customizable generation:

  1. Custom System Prompts: Override the default generation prompt with an adversarial variant:
      "Generate uncomfortable, challenging, or edge-case questions that probe weaknesses in the provided documentation. Seek contradictions, ambiguities, conflicting requirements, and scenarios that would challenge the AI agent."
  2. Selection Strategies: Use custom strategies to target specific chunks:
      • Contradiction Detection: Select pairs of chunks with potentially conflicting information.
      • Boundary Testing: Emphasize chunks at the edges of documented behavior.
      • Ambiguity Extraction: Select vague or underspecified sections.
  3. Multi-Turn Pressure: Generate multi-turn conversations that escalate challenges across turns, e.g.:
      Turn 1: "Can you explain feature X?"
      Turn 2: "But the documentation also says Y. How do X and Y work together?"
      Turn 3: "Isn't that a contradiction?"
      Turn 4: "How should a user handle this situation?"
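
As one illustration of a targeted selection strategy, the sketch below yields pairs of chunks that are likely to discuss the same topic, on the assumption that contradictions are most likely between such chunks. The class name `ContradictionPairStrategy`, the `max_pairs` parameter, and the word-overlap heuristic are all hypothetical; a real implementation would subclass BaseChunkSelectionStrategy and could use embeddings instead.

```python
import re
from itertools import combinations


def _terms(text):
    """Lowercased words of four or more letters, as a crude topic signature."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


class ContradictionPairStrategy:
    """Hypothetical strategy: yields the chunk pairs with the most shared terms."""

    def __init__(self, max_pairs=5):
        self.max_pairs = max_pairs

    def select(self, chunks):
        scored = []
        for first, second in combinations(chunks, 2):
            a, b = _terms(first), _terms(second)
            union = a | b
            # Jaccard overlap: pairs about the same topic score highest.
            score = len(a & b) / len(union) if union else 0.0
            scored.append((score, [first, second]))
        scored.sort(key=lambda item: item[0], reverse=True)
        for _score, pair in scored[: self.max_pairs]:
            yield pair
```

Each yielded pair would then be passed to the generator together with the adversarial system prompt, so that the LLM sees both passages when looking for conflicts.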

Evaluation Metrics

When "roast me" datasets are processed through Gaussia metrics:

  • Toxicity Metric: Measures whether adversarial pressure triggers hostile or inappropriate responses.
  • Consistency Metric: Tracks whether answers across turns remain logically consistent.
  • Bias Metric: Identifies whether edge cases reveal unfair treatment of certain scenarios.
  • Conversational Quality: Assesses whether the agent maintains coherence under adversarial pressure.

System Implementation

  1. Load: Read markdown files and produce chunks.
  2. Select: Group chunks, yielding one or more chunk groups.
  3. Generate: For each chunk group:
      • Iterate over chunks in the group.
      • Generate queries or a conversation.
      • Collect generated queries/turns into Batch objects.
      • Build a Dataset with all batches.
  4. Output: Return a list of Dataset objects compatible with Gaussia metrics.
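
The control flow of these steps can be sketched end to end with a stubbed model. Everything here is a simplified stand-in: `Batch` and `Dataset` are hypothetical dataclasses and do not reproduce the Gaussia types, `load` splits on blank lines in place of the hybrid chunker, and `fake_llm` replaces a real LangChain chat model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Batch:
    """Hypothetical stand-in for a Gaussia Batch."""
    query: str
    context: str


@dataclass
class Dataset:
    """Hypothetical stand-in for a Gaussia Dataset."""
    batches: list = field(default_factory=list)


def load(markdown: str) -> list:
    """Load: naive blank-line split, a placeholder for hybrid chunking."""
    return [part.strip() for part in markdown.split("\n\n") if part.strip()]


def select(chunks):
    """Select: sequential behavior, one group holding every chunk."""
    yield list(chunks)


def generate_datasets(markdown: str, llm: Callable[[str], list]) -> list:
    """Generate and Output: one Dataset per chunk group."""
    datasets = []
    for group in select(load(markdown)):
        dataset = Dataset()
        for chunk in group:
            for query in llm(chunk):
                dataset.batches.append(Batch(query=query, context=chunk))
        datasets.append(dataset)
    return datasets


def fake_llm(chunk: str) -> list:
    """Stub model: returns one templated question per chunk."""
    return [f"What does this section state? ({chunk[:20]})"]
```

Swapping `select` for a sampling strategy changes how many datasets come back without touching the generation loop, which is the point of keeping the stages separate.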

Practical Deployment Scenarios

Baseline Sequential Generation

The simplest deployment scenario uses sequential processing: a markdown context document is loaded, chunked through hybrid header and size-based segmentation, and all resulting chunks are processed as a single group. The generator synthesizes independent queries from each chunk, producing a comprehensive evaluation dataset covering the entire knowledge base. This approach is appropriate when evaluation must reflect all documented content and computational resources permit processing the full context.

Random Sampling for Controlled Diversity

A more sophisticated deployment employs random sampling strategies to create multiple evaluation datasets from different chunk combinations. Rather than a single comprehensive dataset, the system generates $n_{samples}$ independent datasets, each constructed from a random sample of $k_{sample}$ chunks from the full collection. This approach trades exhaustive coverage for controlled computational cost and diverse evaluation scenarios. Each sample introduces variability in which context chunks appear together, testing the assistant's ability to maintain consistency across partial knowledge states.

Dialogue-Centric Evaluation

For conversational AI systems, a specialized deployment mode generates coherent multi-turn conversations rather than independent queries. The generator produces ordered conversation sequences where each turn logically follows from previous exchanges. This is particularly valuable for testing conversational continuity, context preservation across turns, and the assistant's ability to refine or extend previous answers. The coherence constraints ensure that generated conversations reflect realistic dialogue patterns.

Adversarial Stress Testing

The "roast me" paradigm represents an adversarial deployment scenario where custom generation prompts guide the LLM toward deliberately challenging, contradictory, or edge-case query generation. Rather than asking generic questions about documented content, the system is directed to expose weaknesses: contradictions within documentation, ambiguous specifications, conflicting requirements, or scenarios outside stated scope. This mode systematically probes the limits and weak points of both the documentation and the assistant's handling thereof.

Extensibility and Future Work

Custom Selection Strategies

The open/closed architecture enables new strategies without modifying existing code. New selection strategies can be implemented as subclasses of the BaseChunkSelectionStrategy interface. For example, an embedding-based clustering strategy would:

  1. Compute dense vector embeddings for all chunks using a learned embedding model.
  2. Apply clustering algorithms (k-means, hierarchical clustering, etc.) to group semantically similar chunks.
  3. Yield each cluster as a separate chunk group, ensuring chunks processed together are semantically coherent.

This approach ensures that evaluation datasets contain contextually related content, which is particularly valuable for domains where chunk proximity in semantic space predicts downstream task performance. The strategy interface is generic enough to support arbitrary grouping logic: graph-based clustering, hierarchical decomposition, or domain-specific heuristics.
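
The three steps can be sketched without external dependencies. In this illustration a toy bag-of-words vector stands in for a learned embedding model, and a small k-means with deterministic initialization stands in for a library implementation. The class name `ClusteringStrategy` and its `n_clusters` parameter are assumptions, and a real strategy would subclass BaseChunkSelectionStrategy.

```python
import math
import re
from collections import Counter


def embed(text, vocab):
    """Toy L2-normalized bag-of-words vector; a placeholder for real embeddings."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    vector = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]


def kmeans(vectors, k, iterations=10):
    """Plain k-means; the first k vectors seed the centroids. Returns labels."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


class ClusteringStrategy:
    """Hypothetical strategy: yields one chunk group per cluster."""

    def __init__(self, n_clusters):
        self.n_clusters = n_clusters

    def select(self, chunks):
        chunks = list(chunks)
        vocab = sorted({w for c in chunks for w in re.findall(r"[a-z]+", c.lower())})
        vectors = [embed(c, vocab) for c in chunks]
        k = min(self.n_clusters, len(chunks))
        labels = kmeans(vectors, k)
        for cluster in range(k):
            group = [c for c, label in zip(chunks, labels) if label == cluster]
            if group:
                yield group
```

Seeding centroids from the first vectors keeps the output reproducible, at the cost of sensitivity to chunk order; a production version would more likely rely on an established clustering library.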

Planned Enhancements

Based on research best practices, future extensions include:

  1. Decomposition Strategy: Break complex concepts into simpler sub-questions.
  2. Multi-Hop Strategy: Generate questions requiring reasoning across multiple chunks.
  3. Reasoning-Based Strategy: Create questions that test implicit reasoning and inference.
  4. Contradiction Detection: Automatically identify and highlight conflicting chunks, then generate targeted adversarial queries.
  5. Semantic Chunking: Use embeddings for structure-aware semantic segmentation.
  6. Custom Loaders: Support PDF, HTML, web URLs, and domain-specific formats.
  7. Batch Processing: Async batch generation for large-scale dataset creation.
  8. Multi-Language Support: Generate diverse queries across multiple natural languages.

Cross-SDK Future Plans

The Python implementation serves as the reference. Future ports may include:

  • TypeScript/Node.js: For JavaScript-based evaluation pipelines.
  • Rust: For compiled, production-grade synthesis engines.

Conclusion

The Gaussia Generators module addresses the critical challenge of scalable synthetic dataset generation for AI evaluation. By combining intelligent document chunking, pluggable selection strategies, and LLM-driven synthesis, we enable organizations to create diverse, grounded evaluation datasets from existing knowledge bases at scale.

Our contributions include:

  • A modular, extensible architecture supporting multiple generation modes and selection strategies.
  • A practical hybrid chunking algorithm that balances structural awareness with implementation simplicity.
  • The "roast me" paradigm for systematic adversarial evaluation.
  • An open-source implementation fully integrated with the Gaussia evaluation framework.

The module has been successfully deployed in production evaluation pipelines and has enabled teams to scale from manual dataset curation to automated, context-grounded generation. We invite the community to extend the framework with custom selection strategies, context loaders, and generation modes tailored to domain-specific evaluation needs.

Broader Impact

Scalable dataset generation democratizes AI evaluation, making comprehensive testing accessible to teams without extensive resources for manual annotation. This can improve overall AI system quality and safety. However, practitioners should be aware that synthetically generated datasets, while useful, should be complemented with human-reviewed and domain-expert-curated evaluation sets for critical applications.

References

  [1] Goodfellow et al. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems.
  [2] Karimi et al. (2021). AEDA: An easier data augmentation technique for text classification. arXiv preprint arXiv:2108.13230.
  [3] Yuan et al. (2021). BARTScore: Evaluating generated text as text generation. arXiv preprint arXiv:2106.11520.
  [4] Wei et al. (2019). EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv preprint arXiv:1901.11196.
  [5] Sennrich et al. (2015). Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.
  [6] DeepEval (2024). Synthesizer: Synthetic Data Generation for LLM Evaluation.
  [7] (2024). Gaussia AI Evaluation Framework.
  [8] RAGAS (2024). Retrieval-Augmented Generation Assessment.
  [9] Lewis et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401.
  [10] Chen et al. (2023). LlamaIndex: A data framework for LLM applications. arXiv preprint arXiv:2307.06435.
  [11] Zhong et al. (2022). Structured retrieval for question answering. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.