2025-11-06 | PreviewProof Team
Handling PII in Preview Environment Seed Data
A preview environment with a real customer’s email is a data breach waiting to be discovered. The URL being unguessable doesn’t matter. “Only engineers see it” doesn’t matter. The moment customer-identifiable data leaves an environment approved for that data, you’ve widened the blast radius of every future incident.
The hill we’ll die on: preview environments should never contain real PII. The work is figuring out how to get realistic-feeling data without any.
Why “we anonymize it” is rarely enough
The standard answer is “we snapshot production and run an anonymization script.” This is usually wrong, because anonymization is harder than it looks. A real pipeline has to handle four classes of problem:
Direct identifiers. Names, emails, phones. Replace with fake values — but emails appear in users.email, audit_log.actor_email, support_messages.from, email_log.recipient, webhook_deliveries.payload. They all need to be anonymized consistently — same real email always becoming the same fake one — or you break referential integrity.
Quasi-identifiers. A user with a unique combination of zip code, signup date, plan tier, and last-login pattern is identifiable even with their name redacted. Mitigation: date shifting, bucketing, k-anonymity checks.
Free text. Ticket bodies, comments, notes. People put SSNs, credit card numbers, and addresses in free text. Schema-level rules don’t catch it. You need a content scanner (Microsoft Presidio, AWS Comprehend) — and none are perfect.
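Behavioral data. Even with identifiers stripped, behavior itself identifies people: in one published study of credit card records, four transactions were enough to uniquely identify 90% of individuals. A real purchase history with the name removed is not anonymous.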
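The k-anonymity check mentioned under quasi-identifiers can be sketched in a few lines. This is a minimal illustration, not a full implementation: the column names (`zip3`, `signup_month`, `plan`) and the sample rows are hypothetical, and it assumes bucketing has already been applied.

```python
from collections import Counter


def k_anonymity_violations(rows, quasi_identifiers, k):
    """Return quasi-identifier combinations shared by fewer than k rows."""
    counts = Counter(
        tuple(row[column] for column in quasi_identifiers) for row in rows
    )
    return [combo for combo, count in counts.items() if count < k]


# Hypothetical, already-bucketed rows: 3-digit zip prefix, signup month, plan tier.
rows = [
    {"zip3": "941", "signup_month": "2024-03", "plan": "pro"},
    {"zip3": "941", "signup_month": "2024-03", "plan": "pro"},
    {"zip3": "100", "signup_month": "2023-11", "plan": "enterprise"},
]

violations = k_anonymity_violations(rows, ["zip3", "signup_month", "plan"], k=2)
print(violations)  # [('100', '2023-11', 'enterprise')]
```

Any combination that comes back is a group small enough to single someone out. The usual responses are coarser buckets or dropping those rows.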
If your team isn’t going to maintain a pipeline that handles all four, don’t anonymize production data.
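For teams that do maintain such a pipeline, the consistency requirement for direct identifiers (same real email, same fake email, in every table) can be met with a keyed hash. The sketch below is one possible approach under stated assumptions: `PSEUDONYM_KEY` and `pseudonymize_email` are hypothetical names, and the key is hard-coded only for illustration. The output uses `example.com`, which is reserved for documentation and cannot belong to a real mailbox.

```python
import hashlib
import hmac

# Assumption: in a real pipeline this secret comes from a secret manager,
# not from source code.
PSEUDONYM_KEY = b"replace-me-with-a-real-secret"


def pseudonymize_email(real_email: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Map a real email to a stable fake one; same input always gives same output."""
    # Normalize first so "Alice@Corp.com" and " alice@corp.com " map identically.
    normalized = real_email.strip().lower()
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user-{digest[:16]}@example.com"


# The same function is applied to every column that holds an email
# (users.email, audit_log.actor_email, email_log.recipient, ...),
# so joins on email still line up after anonymization.
fake_a = pseudonymize_email("Alice@Corp.com")
fake_b = pseudonymize_email(" alice@corp.com ")
```

The key matters: with an unkeyed hash, anyone can hash a guessed email and look for it in the preview data. With a key, the mapping is only as private as the key is.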
The four real strategies
1. Deterministic synthetic generation
The default for most teams. Use a generation library with a fixed PRNG seed so every preview gets the same data, but the data is wholly fabricated.
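```typescript
import { faker } from '@faker-js/faker'

faker.seed(42)
// Pin the reference date too: date.past() is otherwise relative to "now",
// so dates would drift between runs even with a fixed seed.
faker.setDefaultRefDate('2025-01-01T00:00:00.000Z')

// Same seed -> same emails, same names, same dates
// across every preview environment, forever.
const users = Array.from({ length: 50 }, () => ({
  email: faker.internet.email(),
  name: faker.person.fullName(),
  phone: faker.phone.number(),
  createdAt: faker.date.past({ years: 2 }),
}))
```

The fixed seed matters. Without it, bug reports become unreproducible: “I saw this on the preview” means nothing if Alice’s data is different on every refresh. Pin the Faker version as well, since generated values can change between releases.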
Works for most domains. Falls short when the shape of real data carries information synthetic data can’t capture (risk scoring, fraud detection). See “The synthetic data problem” for when this matters.
2. Format-preserving encryption (FPE)
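When you need data that looks like the real thing (same character classes, same lengths) but is reversibly transformed, FPE is the right primitive. 4532-1234-5678-9010 becomes another 16-digit number in the same format, such as 4716-8851-3392-1407. FPE on its own preserves format, not checksums: to keep a Luhn check valid, encrypt the digits before the check digit and recompute the check digit afterwards.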
Use FPE when downstream systems validate format (Luhn on cards, Mod 11 on tax IDs) and random fakes would fail validation. Reach for an existing implementation such as pyffx or AWS FPE primitives. Don’t roll your own.
Catch: anyone with the key can de-anonymize. FPE pushes security onto key management.
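When the downstream validator is a Luhn check, the transformed number needs a matching check digit. The sketch below covers only that step and assumes the FPE transform itself has already been done by a vetted library. The function names are hypothetical, and the 15-digit payload stands in for an FPE output.

```python
def luhn_check_digit(payload: str) -> str:
    """Compute the Luhn check digit for a digit string that lacks one."""
    total = 0
    # Walk right to left; the rightmost payload digit is doubled first.
    for index, char in enumerate(reversed(payload)):
        digit = int(char)
        if index % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)


def is_luhn_valid(number: str) -> bool:
    """Validate a full number, ignoring separators such as dashes or spaces."""
    digits = "".join(c for c in number if c.isdigit())
    return len(digits) > 1 and luhn_check_digit(digits[:-1]) == digits[-1]


# Stand-in for the first 15 digits of a format-preserving transform.
transformed_payload = "471688513392140"
card_number = transformed_payload + luhn_check_digit(transformed_payload)
print(card_number)  # 4716885133921404
```

Recomputing the check digit keeps the value acceptable to validators without weakening the transform, because the check digit is derived entirely from the other digits.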
3. Deterministic synthetic from a per-record seed
Halfway between Faker and FPE. Generate fake data deterministically from each real record’s primary key, so the user with id=42 always gets the same fake email across every preview.
```python
import hashlib

from faker import Faker


def anonymize_user(user_id: int) -> dict:
    # Derive a stable per-record seed from the primary key, never from PII.
    seed = int(hashlib.sha256(f"user-{user_id}".encode()).hexdigest(), 16)
    fake = Faker()
    fake.seed_instance(seed)
    return {
        "email": fake.email(),
        "name": fake.name(),
        "phone": fake.phone_number(),
    }
```

Stable references (user_id=42 is “Alice Patterson” everywhere) without touching real PII. Useful when your pipeline operates on production’s PK structure but generates fully synthetic content. Realism of “user 42 has a long history” without any of user 42’s actual data.
4. Full synthetic from a model
Generate the entire dataset from a model of “what your customers look like.” Synth, Mockaroo, Tonic.ai, and LLM-based generators produce statistically representative datasets that look real without being real. Most work upfront, cleanest from a compliance standpoint. Hard part: keeping the model in sync as your business changes.
The related-PII problem
Wherever a user’s email exists, related PII lives in places you forgot:
- Audit logs. audit_log.payload often contains the old email when it was changed.
- Soft-deleted records. WHERE deleted_at IS NOT NULL rows still have PII.
- Backup tables and _archive schemas. Easy to miss.
- JSON columns. metadata jsonb with {"original_email": "..."} is invisible to schema-level rules.
- Search indexes. Elasticsearch, Typesense, Meilisearch keep the real values.
- Object storage. Avatars, attachments, exported reports. Database-only snapshots leave these intact.
Principle: anonymize holistically. Either the whole data plane is sanitized or none of it is.
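One way to catch the JSON-column and free-text cases is a content scan that walks every nested value instead of trusting the schema. This is a minimal sketch with deliberately simple patterns. `find_pii`, the pattern set, and the allow-list of reserved domains are all assumptions for illustration, and it will miss things a dedicated scanner would catch.

```python
import json
import re

PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
# Reserved documentation domains: emails here are synthetic by construction.
ALLOWED_EMAIL_DOMAINS = {"example.com", "example.org", "example.net"}


def find_pii(value, path="$"):
    """Recursively walk decoded JSON and yield (path, kind, match) for suspicious strings."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from find_pii(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from find_pii(child, f"{path}[{index}]")
    elif isinstance(value, str):
        for kind, pattern in PATTERNS.items():
            for match in pattern.findall(value):
                if kind == "email":
                    domain = match.rsplit("@", 1)[1].lower()
                    if domain in ALLOWED_EMAIL_DOMAINS:
                        continue  # synthetic address, not a finding
                yield (path, kind, match)


metadata = json.loads(
    '{"original_email": "jane.doe@corp.io",'
    ' "notes": ["SSN 123-45-6789 on file"],'
    ' "contact": "user-1@example.com"}'
)
hits = list(find_pii(metadata))
```

Here `hits` contains the leftover email under `$.original_email` and the SSN under `$.notes[0]`, while the `example.com` address passes. Run as a gate after seeding, a non-empty result fails the preview build.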
When even anonymized data isn’t enough
Some data shouldn’t be anonymized for previews — it shouldn’t be near a preview at all:
- PHI under HIPAA. De-identified PHI requires Safe Harbor or Expert Determination. Build synthetic from day one.
- CUI for federal contractors. Your preview probably isn’t in your authorization boundary. See “Federal contractor guide to PII and CUI in previews.”
- Data with residency rules. GDPR, PIPL, DPDP: if your preview runs outside the region your production data must stay in, anonymization doesn’t help.
- Credit card data under PCI DSS. Use Stripe test tokens. See “Stripe, Twilio, SendGrid test mode.”
For these, “anonymize and seed” isn’t an option.
A simple rule
If you can’t explain to your DPO, CISO, or compliance lead exactly what’s in your preview seed data and where it came from, your strategy is wrong. The fix is almost always: less production data, more synthetic generation, deterministic seed, fixtures for known accounts.
The boring answer scales. Start with Faker, fix the seed, move to something more sophisticated only when you hit a concrete realism limit you can name.
If you want the preview pipeline to enforce these defaults — synthetic seed running on a fresh per-preview database, no path for production data to sneak in, evidence captured of what data the reviewer saw — that’s how PreviewProof handles it. The seed runs your code, not a snapshot, and the audit trail records exactly what got loaded into each environment. Building that yourself is doable. Operating it for years isn’t free.