2025-11-01 | PreviewProof Team

How to Seed a Postgres Database for Ephemeral Preview Environments

PostgreSQLseed datapreview environmentsdatabaseephemeral environments

A preview environment is only as useful as its database. Schema is the easy part — prisma migrate deploy runs in your container entrypoint and you’re done. Data is where most implementations succeed or quietly fail. An empty database produces a preview where nothing renders. A copy of production produces a compliance liability before the URL is shared.

There are four real strategies, each with trade-offs, and most mature setups combine them.

Strategy 1: Synthetic generation

Generate the data programmatically — Faker, factory libraries, scripted SQL. The seed runs against a fresh database every preview boot.

seeds/users.py
from faker import Faker
from app.models import User, Organization
fake = Faker()
Faker.seed(42) # deterministic across previews
def seed():
org = Organization.objects.create(name="Acme Corp", plan="pro")
User.objects.create(
name="Test Admin",
organization=org,
role="admin",
)
for _ in range(40):
User.objects.create(
email=fake.unique.email(),
name=fake.name(),
organization=org,
role=fake.random_element(["member", "member", "viewer"]),
created_at=fake.date_time_between(start_date="-1y"),
)

The default for most teams and the right starting point. Deterministic (with a fixed PRNG seed), versioned, free of production data. Downsides:

  • Maintenance. Every schema change touching seeded tables means updating the seed. Add a non-null column to users and forget the seed and your next preview crashes.
  • Realism ceiling. Synthetic data is uniform. Bugs in long tails (user with 47 addresses, order with one cent of tax) don’t surface.
  • Scale. Seeding 100K rows at boot adds minutes per preview.

Use this when your data model is small and realistic distributions don’t matter.

Strategy 2: Anonymized snapshots

Take a periodic dump of production, run it through an anonymization pipeline, and store the sanitized snapshot as a Postgres dump or a prebuilt Docker image. Every preview restores from that snapshot.

Terminal window
# Nightly job — produces previews/db-2026-04-27.dump
pg_dump --format=custom --no-owner production_db > raw.dump
pg_restore --no-owner raw.dump | \
python anonymize.py | \
pg_dump --format=custom > previews/db-$(date +%F).dump
aws s3 cp previews/db-$(date +%F).dump s3://preview-data/latest.dump

The benefit is realism — same distributions, edge cases, and historical artifacts as production. The cost is the anonymization pipeline, which is harder than it looks. Anonymization is not just UPDATE users SET email = '[email protected]'. You need to handle referential integrity across PII (the user’s email is in users, audit_log, support_tickets), quasi-identifiers (zip + signup date + plan tier can re-identify), and free-text fields where PII hides invisibly.

Handling PII in preview seed data goes deep. Short version: snapshots are powerful when done right and dangerous when done wrong. If you can’t commit to maintaining the pipeline, don’t go down this path.

Strategy 3: Fixture libraries

A curated set of records — JSON, YAML, or SQL — checked into the repo and loaded in order. Less generated than synthetic, less risky than snapshots.

fixtures/projects.yml
- id: proj_001
name: "Migration to Postgres 16"
status: archived
owner: user_alice
created_at: 2025-11-04
- id: proj_002
name: "Q2 Onboarding Redesign"
status: active
owner: user_alice
members: [user_bob, user_carol]
created_at: 2026-02-01

Fixtures are predictable and good for “canonical example state.” They’re also prone to going stale. Use fixtures for the small set of records that must always exist (admin user, tenant, pricing tier) and combine with synthetic generation for volume.

Strategy 4: On-demand factories

Don’t seed at all at boot — instead, expose factory endpoints or CLI commands that create exactly the data needed for a given test scenario.

// previewable scenarios
POST /preview/scenarios/onboarding-incomplete
POST /preview/scenarios/billing-failed
POST /preview/scenarios/large-team-with-history

Trades boot speed for per-scenario setup. Powerful when scenarios are well-defined. Frustrating when reviewers just want to click around. Works best as a complement to one of the other strategies, not a replacement.

Postgres-specific gotchas

Sequences. Insert rows with explicit PKs and the underlying sequence isn’t advanced. Next INSERT without an explicit ID collides:

SELECT setval(pg_get_serial_sequence('users', 'id'),
COALESCE(MAX(id), 1)) FROM users;

Run for every table you bulk-loaded. Generate the statements from information_schema.columns.

FKs during bulk load. Use SET session_replication_role = replica; to disable triggers during load, then re-enable.

Generated columns. Postgres 12+ rejects explicit values on generated columns. Snapshots predating the column will fail to restore.

Large reference data. A zip_codes or currencies table with 50K rows shouldn’t be regenerated per preview. Bake it into a base image and exclude from pg_dump (--exclude-table=zip_codes).

Extensions. pgvector, postgis, pg_trgm need to be in your base image with CREATE EXTENSION running before any dependent migration.

What we recommend

For most teams the stack that works is:

  1. Synthetic generation as the default. Faker-driven scripts with a fixed PRNG seed, run at container boot.
  2. A baked base image for reference data. Currencies, country codes, taxonomies — anything large and stable.
  3. Fixed YAML fixtures for known entities. Admin user, demo org, sample project — so reviewers always know how to log in.
  4. Snapshots only when realism matters more than the anonymization pipeline costs. Analytics features may need it. A settings page doesn’t.

Synthetic data vs test fixtures for realistic previews covers when synthetic generation hits its ceiling.

If you want the seed pipeline to be something you write once and forget — migration step, deterministic seed, per-PR isolation — that’s the part PreviewProof handles. You bring the seed script. We run it on every preview against a clean database, and the URL goes to your reviewers with data already loaded. No manual setup per PR, no stale shared database, no production copies in S3.