2025-11-11 | PreviewProof Team

The Synthetic Data Problem: When Test Fixtures Aren't Enough for Realistic Previews

synthetic data, test fixtures, preview environments, data generation, QA

The seed file your team checked in three years ago has three users, two orgs, and four projects. It exists because your tests needed something to assert against. It’s been quietly serving as the preview seed ever since because nobody wanted to write a second one. And every reviewer keeps not finding the bug, because the bug shows up at the seventeenth row and the seed only generates four.

Test fixtures and preview seed data are different problems with different goals. Conflating them is why previews end up feeling fake.

Two different jobs

A test fixture drives a test. Its job is to be the smallest, most controlled set of records that exercises a specific code path. Two users, one admin and one not: assert that the non-admin gets a 403. Adding a third user makes the test slower and harder to reason about with no benefit.

A preview seed makes the application feel like production. Its job is to populate the UI with enough data and variety that a reviewer notices broken layouts on long names, slow queries on dense tables, off-by-one errors in pagination.

These goals are in tension. The fixture wants minimal and isolated. The preview wants representative and varied. Using one for both leaves both jobs done badly.

What “realistic” actually means

Most preview seeds fail at least one of these:

Volume. Production has 50K users; your seed has 50. Pagination doesn’t trigger, the activity feed has three entries instead of three hundred.

Distribution. Most orgs have 2 users, some have 5,000. Your seed has every org with exactly 5. The “too many users” handling never triggers.

Time spread. Production spans years. Your seed is all NOW(). The “last 30 days” filter looks identical to “all time.”

Content variety. Product names in 14 languages, notes with emoji, addresses with apostrophes. Your seed has "Test Product 1". I18n and escaping bugs hide.

Edge cases. The user with 247 saved searches. The order with one cent of tax. The customer dormant for two years. None of these exist.

A seed that doesn’t surface these conditions makes reviewers confident in features that break the moment a real user looks at them.
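One way to make this checklist enforceable is a small audit that runs over the seeded data. The sketch below is illustrative only: `audit_seed`, its parameters, and every threshold (page size of 25, a 10x gap between largest and median org) are assumptions, not recommendations from any tool.

```python
from datetime import datetime, timedelta
from statistics import median


def audit_seed(org_sizes, timestamps, names, now, page_size=25):
    """Return a list of problems found in seeded data (thresholds are assumptions)."""
    problems = []
    # Volume: need more than one page of rows for pagination to trigger.
    if len(timestamps) <= page_size:
        problems.append("volume: no more than one page of rows")
    # Distribution: a skewed seed has a largest org far above the median.
    if max(org_sizes) < 10 * median(org_sizes):
        problems.append("distribution: org sizes are too uniform")
    # Time spread: "last 30 days" should not be identical to "all time".
    cutoff = now - timedelta(days=30)
    if all(ts >= cutoff for ts in timestamps):
        problems.append("time spread: every row is within the last 30 days")
    # Content variety: expect non-ASCII text and at least one apostrophe.
    if all(name.isascii() for name in names):
        problems.append("content: no non-ASCII names")
    if not any("'" in name or "\u2019" in name for name in names):
        problems.append("content: no names with apostrophes")
    return problems


now = datetime(2025, 11, 11)

# A fixture-style seed: tiny, uniform, all created "now".
flat = audit_seed([5] * 4, [now] * 4, ["Test Product 1"] * 4, now)

# A more varied seed: skewed orgs, rows spread over ~2.5 years, mixed names.
spread = [now - timedelta(days=30 * i) for i in range(30)]
varied = audit_seed([2, 2, 3, 4, 900], spread, ["Zoë O'Brien", "田中"], now)
```

Here `flat` collects one problem per failed check, while `varied` comes back empty. The edge-case item on the list is deliberately left out: which edge cases matter is specific to each application.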

Patterns that work

Faker-style generation with a fixed seed

The baseline. Same seed across every preview means the same data, which means reproducible bug reports.

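from faker import Faker

fake = Faker()
Faker.seed(42)  # fixed seed: same names and emails on every run

# Volume: enough rows that pagination triggers and dashboards populate
for _ in range(500):
    signup_at = fake.date_time_between(start_date="-2y")
    User.create(  # User is your application's ORM model
        email=fake.unique.email(),
        name=fake.name(),
        signup_at=signup_at,
        # Active within the last 30 days, but never before signup
        last_active=max(signup_at, fake.date_time_between(start_date="-30d")),
    )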

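Faker handles the basics. Use fake.unique for fields with uniqueness constraints and date_time_between to spread timestamps. Relative ranges like "-2y" are resolved against the current time, so timestamps shift between runs even with a fixed seed. The weakness is uniformity: every Faker user looks like every other one. That is fine for filling the screen, but it doesn’t surface long-tail bugs.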

Distributions and weights

Real systems are skewed. Synthesize with explicit distributions:

import numpy as np

rng = np.random.default_rng(42)  # seeded, so org sizes match across previews

# Power law: most orgs small, a few enormous (capped at 2,000 users)
sizes = rng.zipf(a=1.7, size=200).clip(max=2000)
for size in sizes:
    org = Organization.create(name=fake.company())  # fake: the seeded Faker above
    for _ in range(int(size)):
        User.create(organization=org, email=fake.unique.email())

A Zipf-distributed company size exercises the “large org” code path. Exponential decay on activity creates the dormant-user case. More work than faker.name(), but the difference in preview quality is large.
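The exponential-decay idea can be sketched with the standard library alone. Everything named here is an assumption for illustration: the 60-day mean, the pinned `NOW`, and the `sample_last_active` helper. Pinning the reference time rather than calling `datetime.now()` keeps the output identical across previews.

```python
import random
from datetime import datetime, timedelta

rng = random.Random(42)          # fixed seed for reproducible previews
NOW = datetime(2025, 11, 11)     # pinned reference time (assumption)
MEAN_DAYS_SINCE_ACTIVE = 60      # hypothetical; tune to your product


def sample_last_active(signup_at):
    """Days since last activity ~ Exponential(mean=60), never before signup."""
    days_ago = rng.expovariate(1 / MEAN_DAYS_SINCE_ACTIVE)
    return max(signup_at, NOW - timedelta(days=days_ago))


signup = NOW - timedelta(days=730)
samples = [sample_last_active(signup) for _ in range(1000)]

# Users whose last activity is more than 180 days old count as dormant here.
dormant = sum(1 for ts in samples if NOW - ts > timedelta(days=180))
```

With a 60-day mean, the probability of a gap longer than 180 days is exp(-3), about 5%, so most users look recently active while a small tail looks abandoned.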

Markov-chain or LLM content for free text

Faker’s lorem ipsum looks like garbage. For comments, reviews, support tickets, you want content that reads as plausibly human. A Markov chain on a public corpus (Project Gutenberg, Wikipedia) is decent. LLM calls produce excellent results at higher cost. Generate once, commit or cache the output — you don’t want a per-preview LLM bill.
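A word-level Markov chain fits in a few lines. This is a minimal sketch, not a tuned generator: `build_chain` and `generate` are illustrative names, and `CORPUS` is a placeholder sentence standing in for a real public corpus, which would need to be far larger to produce varied output.

```python
import random
from collections import defaultdict

# Placeholder corpus (assumption); substitute text from a public source.
CORPUS = (
    "the order arrived late and the box was damaged but the support team "
    "replied quickly and the refund arrived the next day"
)


def build_chain(text, order=2):
    """Map each `order`-word prefix to the words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, rng, max_words=30):
    """Walk the chain from a random prefix until a dead end or max_words."""
    prefix = rng.choice(sorted(chain))  # sorted for a deterministic choice
    out = list(prefix)
    while len(out) < max_words:
        followers = chain.get(tuple(out[-len(prefix):]))
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)


comment = generate(build_chain(CORPUS), random.Random(42), max_words=12)
```

Passing a seeded `random.Random` keeps the generated text identical across runs, which matters for the same reason the Faker seed does. Note that `generate` assumes a non-empty chain, so the corpus must contain more words than `order`.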

Statistically representative sampling without copying

If you have access to production statistics (without production rows), you can generate synthetic data matching the shape. Aggregate counts, distribution of order sizes, ratio of free to paid users — not PII, enough to drive a realistic generator.

stats.json (committed in the repo, derived from production aggregates):

{
  "users_per_org_p50": 4,
  "users_per_org_p99": 380,
  "orders_per_user_mean": 2.3,
  "free_to_paid_ratio": 0.92,
  "languages": {"en": 0.71, "es": 0.12, "fr": 0.07, "ja": 0.05, "de": 0.05}
}

Generators consuming these stats produce previews that feel right without ever touching a real record. “PII in preview seed data” covers why that bar matters.
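As a rough sketch of such a generator: the code below fits a log-normal distribution to the p50/p99 pair and samples languages by weight. The choice of log-normal is an assumption made for illustration (two percentiles are enough to pin down its two parameters); your production data may follow a different shape. The sampler function names are hypothetical, and the stats are inlined so the example runs on its own.

```python
import json
import math
import random

# Inline copy of a subset of the aggregates; a real generator would read stats.json.
STATS = json.loads("""
{
  "users_per_org_p50": 4,
  "users_per_org_p99": 380,
  "languages": {"en": 0.71, "es": 0.12, "fr": 0.07, "ja": 0.05, "de": 0.05}
}
""")

Z_99 = 2.326  # standard normal quantile for the 99th percentile


def org_size_sampler(stats, rng):
    """Fit a log-normal to the p50/p99 pair and return a sampling function."""
    mu = math.log(stats["users_per_org_p50"])  # log-normal median = exp(mu)
    sigma = (math.log(stats["users_per_org_p99"]) - mu) / Z_99
    return lambda: max(1, round(rng.lognormvariate(mu, sigma)))


def language_sampler(stats, rng):
    """Return a function that picks a language code by its production share."""
    langs = list(stats["languages"])
    weights = list(stats["languages"].values())
    return lambda: rng.choices(langs, weights=weights)[0]


rng = random.Random(42)
next_size = org_size_sampler(STATS, rng)
next_lang = language_sampler(STATS, rng)

org_sizes = [next_size() for _ in range(2000)]
user_langs = [next_lang() for _ in range(2000)]
```

The generator only ever sees percentiles and ratios, so nothing in its output can be traced back to an individual production row.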

LLM-generated scenarios for narrative coherence

When the story of the data matters — a user who signed up, hit a paywall, upgraded, churned, came back — Faker can’t produce that narrative. Generate scenarios with an LLM as a one-time step, commit the output, load deterministically. Overkill for most apps. Right answer for analytics, customer journey, lifecycle features.
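The "commit the output, load deterministically" step might look like the following. The scenario format is hypothetical: a list of events with day offsets from a start date, inlined here in place of a committed file. The `materialize` helper is likewise an illustrative name.

```python
import json
from datetime import datetime, timedelta

# Hypothetical scenario format; in practice this would be a committed file
# produced by a one-time LLM run and reviewed like any other change.
SCENARIOS = json.loads("""
[
  {"user": "churn-and-return",
   "events": [
     {"day": 0,   "type": "signed_up"},
     {"day": 12,  "type": "hit_paywall"},
     {"day": 13,  "type": "upgraded"},
     {"day": 140, "type": "churned"},
     {"day": 210, "type": "reactivated"}
   ]}
]
""")


def materialize(scenarios, start):
    """Turn day offsets into absolute timestamps, in order, with no randomness."""
    rows = []
    for scenario in scenarios:
        for event in sorted(scenario["events"], key=lambda e: e["day"]):
            rows.append({
                "user": scenario["user"],
                "type": event["type"],
                "at": start + timedelta(days=event["day"]),
            })
    return rows


rows = materialize(SCENARIOS, start=datetime(2025, 1, 1))
```

Storing offsets rather than absolute dates lets the same committed scenario be replayed against any start date, and no LLM call happens at seed time.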

How much is enough

How many records should the seed produce? Enough to make the most data-dense screen feel real, not much more. For most apps, 200 to 5,000 of the dominant entity. Below 200, dashboards look anemic and pagination doesn’t trigger. Above 5,000, the seed step takes minutes and costs go up for marginal benefit.

Useful exercise: identify the densest screen in your app and make sure your seed fills it convincingly. Everything else can be lighter.

The freshness problem

Seeds go stale as schemas evolve. New required column, the seed crashes. New entity in the dashboard, the seed doesn’t generate any. Two practices help:

Treat the seed as part of the schema change. Same PR that adds a column updates the seed. PRs with schema changes and no seed update are incomplete.

Run the seed in CI on every PR. Not just “the migration applies” — actually run the seed against the migrated schema and assert it completes. Catches “you added a column with no default and forgot the seed” before any preview breaks.
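A CI check of this kind can be sketched against an in-memory SQLite database. The schema, the two seed functions, and `seed_runs_cleanly` are all stand-ins invented for this example; a real check would call your actual migrations and seed script against the database engine you deploy on.

```python
import sqlite3


def migrate(conn):
    """Stand-in for real migrations (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " email TEXT NOT NULL,"
        " plan TEXT NOT NULL)"  # newly added required column, no default
    )


def stale_seed(conn):
    """Seed written before the `plan` column existed."""
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")


def updated_seed(conn):
    """Seed updated in the same PR as the schema change."""
    conn.execute(
        "INSERT INTO users (email, plan) VALUES ('a@example.com', 'free')"
    )


def seed_runs_cleanly(seed):
    """Apply migrations, run the seed, and confirm it produced rows."""
    conn = sqlite3.connect(":memory:")
    try:
        migrate(conn)
        seed(conn)
        count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
        return count > 0
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()
```

`seed_runs_cleanly(updated_seed)` succeeds, while `seed_runs_cleanly(stale_seed)` fails on the NOT NULL constraint. In an actual pipeline you would likely let the exception propagate so the job fails with the full error message instead of returning `False`.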

When to give up and use snapshots

There’s a point where synthetic data hits a ceiling — risk scoring, fraud detection, recommendation systems, anywhere the content of real behavior is the test.

The answer is rarely “use production data anyway.” It’s “synthetic for previews, realistic eval data for the model in a separate hardened environment with proper access controls.” Don’t make previews carry both jobs.

The takeaway

Preview seed data is not test fixture data. The job is different, the constraints are different, the failure modes are different. Synthetic data is its own discipline — distributions matter, freshness matters, narrative matters for some apps.

If you’re using a four-record test fixture as your preview seed, replace it. A real synthetic seed is one afternoon of work. Having reviewers approve features that look fine in a sparse preview and then break in production is far more expensive.

If you’d rather not spend the afternoon — or the ongoing maintenance — PreviewProof runs your seed script on every preview against a fresh database. Bring your generator. We make sure it runs, every time, and the resulting data is what your reviewers see.