StupiPHI

Open-source engine that turns production healthcare-like data into safe, realistic dev datasets without exposing PHI/PII.

The Problem

Healthcare teams can't use real production data in development because of privacy rules. StupiPHI lets them create realistic, safe dev data while keeping schema and relationships intact.

The tool is built for HIPAA-style environments and treats privacy and verification as first-class concerns.

Detection

  • ML-based NER (Hugging Face bert-base-NER) for entities in free text
  • Rule-based patterns (email, phone) in encounter notes
  • Structured fields (patient name, DOB, address, phone, email)
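The rule-based layer can be pictured roughly as follows. This is a minimal sketch, not the project's code: the regular expressions, the `find_patterns` helper, and the `EMAIL`/`PHONE` labels are assumptions made for illustration, and the phone pattern only targets North American style numbers.

```python
import re

# Hypothetical patterns; the project's actual expressions may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def find_patterns(text):
    """Return (entity_type, start, end) spans for email and phone matches."""
    spans = []
    for entity_type, pattern in (("EMAIL", EMAIL_RE), ("PHONE", PHONE_RE)):
        for match in pattern.finditer(text):
            spans.append((entity_type, match.start(), match.end()))
    # Sort by position so downstream redaction can walk the text in order.
    return sorted(spans, key=lambda span: span[1])


note = "Call 555-123-4567 or mail jane@example.com"
for entity_type, start, end in find_patterns(note):
    print(entity_type, note[start:end])
```

Spans from a pass like this could be merged with the NER output before transformation, so that an entity missed by one detector can still be caught by the other.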

Transformation

  • Redaction and pseudonymization
  • Deterministic pseudonyms with optional salt for cross-record consistency
  • Safety-first bias: when detection is uncertain, over-redact (accept more false positives in exchange for fewer false negatives)
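One common way to get deterministic, salted pseudonyms is a keyed hash. The sketch below assumes HMAC-SHA256 and tiny hard-coded name pools; the function name `pseudonymize_name` and the pools are hypothetical, and the project itself may derive replacement values differently (for example with Faker).

```python
import hashlib
import hmac

# Hypothetical replacement pools; a real tool would use far larger ones.
FIRST_NAMES = ["Alex", "Jordan", "Morgan", "Riley", "Casey", "Taylor"]
LAST_NAMES = ["Reed", "Hayes", "Brooks", "Lane", "Ellis", "Shaw"]


def pseudonymize_name(value, salt=""):
    """Map a real name to a stable fake name for a given salt."""
    # Normalize so "Jane Doe" and " jane doe" map to the same pseudonym.
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(salt.encode("utf-8"), normalized, hashlib.sha256).digest()
    first = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[digest[1] % len(LAST_NAMES)]
    return f"{first} {last}"


# The same input and salt always give the same pseudonym across records.
print(pseudonymize_name("Jane Doe", salt="dev-2024"))
print(pseudonymize_name("jane doe ", salt="dev-2024"))
```

Keeping the salt secret matters here: without it, anyone who can guess candidate names could recompute the mapping. With pools this small, distinct people will sometimes collide on one pseudonym, which is acceptable for a sketch but not for realistic data.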

Database Slice Transfer

Case-centric extract: each case is pulled together with its patient, appointments, therapists, and payments.

  • Configurable column-level policy (preserve, redact, pseudonymize, mask, placeholder)
  • Replay into dev with new dev-only IDs (no traceability back to prod)
  • Username/password handling: pseudonymize usernames, never copy password hashes; optional placeholder for dev logins
  • Foreign key relationships preserved
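The combination of a column-level policy and fresh dev IDs might look something like this. Everything here is illustrative: `apply_policy`, `IdRemapper`, the "keep the last four characters" masking rule, the `dev-placeholder` value, and the redact-by-default behavior for unlisted columns are assumptions, not the project's documented behavior.

```python
import itertools


def apply_policy(row, policy, pseudonymize):
    """Apply a column-level policy to one row; unlisted columns are redacted."""
    out = {}
    for column, value in row.items():
        action = policy.get(column, "redact")  # safe default (assumption)
        if action == "preserve":
            out[column] = value
        elif action == "pseudonymize":
            out[column] = pseudonymize(value)
        elif action == "mask":
            text = str(value)
            out[column] = "*" * max(len(text) - 4, 0) + text[-4:]
        elif action == "placeholder":
            out[column] = "dev-placeholder"
        elif action == "redact":
            out[column] = None
        else:
            raise ValueError(f"unknown action: {action}")
    return out


class IdRemapper:
    """Hand out new dev-only IDs while keeping references consistent."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._mapping = {}

    def remap(self, table, prod_id):
        key = (table, prod_id)
        if key not in self._mapping:
            self._mapping[key] = next(self._counter)
        return self._mapping[key]


remapper = IdRemapper()
patient = {"id": 9001, "name": "Jane Doe", "phone": "5551234567"}
case = {"id": 77, "patient_id": 9001, "status": "open"}

dev_patient = apply_policy(
    patient, {"name": "pseudonymize", "phone": "mask"}, lambda v: "Alex Reed"
)
dev_patient["id"] = remapper.remap("patient", patient["id"])

dev_case = apply_policy(case, {"status": "preserve"}, lambda v: v)
dev_case["id"] = remapper.remap("case", case["id"])
# The foreign key goes through the same mapping, so it still points
# at the replayed patient row.
dev_case["patient_id"] = remapper.remap("patient", case["patient_id"])
```

In this sketch the prod-to-dev mapping lives only in memory for the duration of the transfer. Discarding it afterwards is one way to avoid leaving a lookup table that links dev rows back to prod.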

Audit and Verification

  • Audit payload with modifications (field_path, action_type, entity_type) and verification status
  • No raw PHI in audit events
  • User-provided audit sink (file, DB, or queue)
  • Post-sanitization checks for residual email/phone patterns
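A rough sketch of an audit payload and a residual-pattern check is shown below. The field names `field_path`, `action_type`, and `entity_type` come from the list above; the `Modification` class, the `passed`/`failed` status strings, the residual regexes, and the callable-sink convention are assumptions for illustration.

```python
import json
import re
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Modification:
    field_path: str   # where the change happened, e.g. "patient.name"
    action_type: str  # what was done, e.g. "pseudonymize"
    entity_type: str  # what was detected, e.g. "PERSON"
    # Deliberately no "original value" field: raw PHI never enters the audit.


# Hypothetical residual patterns used after sanitization.
RESIDUAL_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"(?<!\d)\d{3}[-.\s]\d{3}[-.\s]\d{4}(?!\d)"),
}


def residual_hits(text):
    """Count leftover matches per pattern; counts only, never the matches."""
    return {name: len(p.findall(text)) for name, p in RESIDUAL_PATTERNS.items()}


def build_audit_payload(modifications, sanitized_text):
    hits = residual_hits(sanitized_text)
    return {
        "modifications": [asdict(m) for m in modifications],
        "residual_hits": hits,
        "verification_status": "failed" if any(hits.values()) else "passed",
    }


def emit(payload, sink):
    """Send one JSON line to a user-provided sink (any callable)."""
    sink(json.dumps(payload, sort_keys=True))


events = []  # stands in for a file, DB, or queue writer
mods = [Modification("patient.name", "pseudonymize", "PERSON")]
emit(build_audit_payload(mods, "Patient Alex Reed, contact [EMAIL]"), events.append)
```

Treating the sink as a plain callable keeps the core independent of the storage backend: a file writer, a database insert, or a queue publisher can all be passed in the same way.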

Security

  • Explicit opt-in env var for prod→dev transfers
  • Automatic downgrade of the preserve policy to a stricter action on sensitive columns (password, token, SSN, etc.)
  • Single transaction for extract→replay to avoid partial writes
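The first two guards could be sketched as below. The environment variable name `STUPIPHI_ALLOW_PROD_TRANSFER`, the sensitive-name regex, and the choice of `redact` as the downgrade target are all hypothetical; the project's actual variable name and fallback action are not stated here.

```python
import os
import re

# Hypothetical variable name and accepted value.
OPT_IN_VAR = "STUPIPHI_ALLOW_PROD_TRANSFER"

# Hypothetical list of column-name fragments treated as sensitive.
SENSITIVE_COLUMN_RE = re.compile(r"password|passwd|token|secret|ssn", re.IGNORECASE)


def require_opt_in(environ=None):
    """Refuse to run a prod-to-dev transfer unless explicitly enabled."""
    environ = os.environ if environ is None else environ
    if environ.get(OPT_IN_VAR) != "1":
        raise PermissionError(
            f"prod->dev transfer is disabled; set {OPT_IN_VAR}=1 to opt in"
        )


def effective_action(column, requested):
    """Downgrade 'preserve' on sensitive-looking columns (here: to 'redact')."""
    if requested == "preserve" and SENSITIVE_COLUMN_RE.search(column):
        return "redact"
    return requested


print(effective_action("password_hash", "preserve"))  # downgraded
print(effective_action("status", "preserve"))         # left alone
```

Failing closed in both places means a misconfigured policy file or a forgotten environment variable results in less data being copied, not more.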

Evaluation

Synthetic records with injected PHI for validation:

  • False negative rate and residual pattern metrics
  • Difficulty levels (easy, hard)
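As an illustration of the false negative metric, the sketch below counts injected values that survive sanitization verbatim. The helper name `false_negative_rate` and the exact-substring matching rule are assumptions; the real harness may match more loosely (for example on partial or reformatted values).

```python
def false_negative_rate(injected_values, sanitized_text):
    """Fraction of injected PHI values still present after sanitization."""
    if not injected_values:
        return 0.0
    missed = [value for value in injected_values if value in sanitized_text]
    return len(missed) / len(injected_values)


# Three PHI values were injected into a synthetic note; the sanitizer
# under test caught the name and email but left the phone number behind.
injected = ["Jane Doe", "jane@example.com", "555-123-4567"]
sanitized = "Patient [NAME] can be reached at [EMAIL] or 555-123-4567."

rate = false_negative_rate(injected, sanitized)
print(f"false negative rate: {rate:.2f}")
```

Because the evaluation records are synthetic and the injected values are known in advance, this metric can be computed without ever touching real patient data.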

Tech Stack & CLI

Python 3.10+ · Hugging Face Transformers · PyTorch · Faker · PostgreSQL (psycopg)

stupiphi sanitize                      # Sanitize one synthetic record
stupiphi run-eval --count 100          # Evaluation harness
stupiphi transfer-case --case-id 42    # Extract + sanitize + replay to dev