Back to all posts
AIMCPTest dataPostgreSQLCursorTest data generationDatabase

Generate Realistic Postgres Test Data with AI and MCP

Use Seedmancer's MCP server to let your AI agent generate schema-aware, reproducible test data for PostgreSQL — without hand-rolling fixtures or losing data to schema drift.

Asking an LLM to write INSERT statements is easy. Getting those statements to run reliably against your actual schema — and to produce the same rows every time — is a different problem entirely.

Raw AI-generated SQL drifts out of sync the moment you rename a column or add a NOT NULL constraint. And a one-off prompt in a chat window produces data nobody else on your team can reproduce. The MCP server built into Seedmancer bridges that gap: it gives your AI agent typed tools that work against your live schema, generate test data locally, and snapshot the result as an immutable revision that anyone can seed in under a second.


What "AI + MCP" means for test data

The Model Context Protocol is a standard that lets AI agents call external tools rather than just emitting text. Instead of the agent guessing table names or column types, it calls describe_schema to fetch the real structure, generates schema-aware test data, then calls generate_dataset_local to apply it and capture the output as CSVs.

This matters for three reasons:

  1. Schema-aware — the agent knows your actual columns, types, and constraints before generating a single row.
  2. Reproducible — the dataset is saved alongside the revision. Anyone who clones the repo and runs seedmancer seed billing/pro gets the exact same rows.
  3. Local-first — data is generated and applied inside a local transaction. Nothing leaves your machine during generation; cloud quota is not consumed.

Setup in two minutes

Install the CLI (single static binary):

# macOS
brew install KazanKK/tap/seedmancer

# or go install
go install github.com/KazanKK/seedmancer@latest

Point it at your PostgreSQL database and initialise:

cd your-project
seedmancer init

init creates seedmancer.yaml with a local environment pointing at the database URL you provide. Open the file and confirm the url field is correct:

envs:
  local:
    url: postgres://user:pass@localhost:5432/myapp

Export a baseline snapshot so Seedmancer has a schema fingerprint to work against:

seedmancer export myapp/baseline

Now add the MCP server to your AI host. In Cursor (~/.cursor/mcp.json) or Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "seedmancer": {
      "command": "seedmancer",
      "args": ["mcp"]
    }
  }
}

Restart the host. The agent now has access to tools like describe_schema, generate_dataset_local, seed_database, and list_schemas.


Generate a scenario from a prompt

Open Cursor (or any MCP-capable AI host) and describe the state you need. You do not write SQL — the agent does.

"Create a scenario called billing/pro with a premium workspace, three members, and an expired trial subscription."

Behind the scenes the agent:

  1. Calls describe_schema to inspect your real tables and columns.
  2. Generates a full idempotent dataset for every affected table.
  3. Calls generate_dataset_local with scenario: "billing/pro" and inherit: "myapp/baseline".

The inherit flag seeds the baseline into your local database first so FK dependencies are satisfied. The agent's SQL then runs on top and the result is captured as r001 under .seedmancer/scenarios/billing/pro/.

Seed it immediately:

seedmancer seed billing/pro --yes

Your database is now in the exact state you described.


Why it is reproducible — not just a one-off

When you ask a plain LLM for SQL, you get text. When you use the MCP workflow, you get a revision:

.seedmancer/
  scenarios/
    billing/
      pro/
        revisions/
          r001/
            data/
              workspaces.csv
              users.csv
              subscriptions.csv
            manifest.json
            dataset.sql

Three properties make this reproducible:

Immutable revisions. Once r001 is written, it is never modified. Subsequent generations create r002, r003, and so on. latest always points at the most recent one, but any historical revision is still seedable with --revision r001.

Content-addressed schema fingerprint. Every revision is bound to the schema at the time it was created. If you later rename a column, seedmancer check billing/pro detects the mismatch before seeding, so you do not silently restore incompatible data.

Saved dataset. The dataset.sql file inside the revision is Seedmancer's internal source of truth. The agent can retrieve it to understand what was generated and use it as the base for the next iteration.

Commit .seedmancer/ to your repository. It contains no secrets — only CSVs and JSON derived from your data. Every developer and every CI runner can then seed the same revision without touching the cloud.


Iterate on existing scenarios

Real schemas change. When they do, your AI host fixes the drift locally: the agent runs check_state_schema, retrieves the saved generation SQL with get_dataset_sql, rewrites it for the new schema, and calls generate_dataset_local again. Alternatively, the CLI seedmancer refresh billing/pro (Pro plan) lets the Seedmancer cloud AI adapt the scenario automatically.

To make targeted changes yourself, ask the agent:

"Add a fourth member to billing/pro who has accepted an invite but not yet set a password."

The agent retrieves the current scenario state, produces an updated complete dataset, and calls generate_dataset_local again. A new revision is created; the old one is untouched.


Reset between tests and in CI

Once a scenario exists, resetting to it is a single command:

seedmancer seed billing/pro --yes

Compose it into your test scripts so every run starts from the same state:

{
  "scripts": {
    "test:billing": "seedmancer seed billing/pro --yes && npx playwright test tests/billing"
  }
}

In CI, expose your database URL as an environment variable and add the seed step before the test command:

- name: Seed billing scenario
  run: seedmancer seed billing/pro --yes
  env:
    SEEDMANCER_DATABASE_URL: ${{ secrets.TEST_DATABASE_URL }}

- name: Run billing tests
  run: npx playwright test tests/billing

The --yes flag skips the interactive confirmation so pipelines do not hang. Because the revision is checked into the repository, no cloud account is needed in CI.


Multiple scenarios for different test states

One baseline is rarely enough. State-based test data means each test suite or feature area gets its own named starting point:

| Scenario | What it contains | |---|---| | myapp/baseline | Migrations applied, reference data loaded, no user rows | | billing/free | One workspace on the free plan | | billing/pro | Premium workspace, three members, active subscription | | billing/expired | Premium workspace with an expired trial | | auth/locked | User account that has been locked after failed logins | | qa/smoke | Minimal data for smoke tests across all major flows |

Each is a separate directory under .seedmancer/scenarios/. Each has its own revisions. Generating a new one is a prompt away; seeding takes under a second.


Local generation vs cloud generation

generate_dataset_local (MCP) runs entirely on your machine — no cloud quota, no data leaves your environment. This is the recommended path for day-to-day scenario creation when using an MCP-capable AI host.

The cloud AI generation route (seedmancer generate) is an alternative for teams who prefer a simpler CLI workflow without a local AI host. It requires a Pro plan — local generation through MCP stays free on every plan.

Cloud push and pull — syncing revisions to your team's Seedmancer account — require a Pro plan. For most projects, committing .seedmancer/ to the repository is sufficient for sharing datasets across machines.

See the CLI documentation for the full flag reference and environment variable support.


Try it

Install the CLI, run seedmancer init, and ask your AI agent to generate a scenario. The whole loop — describe a state, generate a dataset, seed it into your database — takes a few minutes the first time and a few seconds every run after that.