GDPR-Safe Contract Extraction with Presidio and Claude Code

How to build a two-stage pipeline on Ubuntu 24.04 LTS that strips all personal data before it leaves your server, then uses Claude Code to extract structured legal intelligence from the sanitized text.

Stack

Platform: Ubuntu 24.04 LTS
Stack: Python 3.12 · Presidio · Claude Code
Compliance: GDPR Article 25 (Privacy by Design)

Prerequisites

You need the following before you start. Each item is covered in detail in the steps that follow.

Requirement -> Details
Ubuntu 24.04 LTS -> Fresh install or existing server. Root or sudo access required.
Python 3.12 -> Ships with Ubuntu 24.04. No separate install needed.
Internet access -> Required to download packages and to reach the Anthropic API at runtime.
Anthropic account -> A paid Anthropic account at console.anthropic.com to generate an API key.
Claude Code -> Installed via npm. Node.js is a dependency (covered below).
4 GB RAM minimum -> The large spaCy English model (en_core_web_lg) requires approximately 1.5 GB at load time.
2 GB disk space -> For Python packages, the spaCy model, and working files.

Before you begin

All commands in this article must be run as a non-root user with sudo privileges. Running as root directly is not recommended. If you are unsure which user you are, run whoami and confirm you see your username rather than root.

How the pipeline works

Before touching a terminal, it is worth understanding the data flow you are building. The pipeline has three distinct layers. Each layer has a clear responsibility and a clear boundary. Nothing crosses from one layer to the next until it has been processed.

  RAW CONTRACT
  (PDF / DOCX / TXT)
         │
         ▼
┌─────────────────────────────┐
│  LAYER 1: Ingestion         │  Converts any format to plain text.
│  pdfplumber · python-docx   │  No analysis. No decisions.
└─────────────────────────────┘
         │
         ▼  plain text
         │
┌─────────────────────────────┐
│  LAYER 2: Anonymization     │  Finds every personal identifier.
│  spaCy · Microsoft Presidio │  Replaces each with a typed token.
└─────────────────────────────┘  Saves the mapping locally. Never transmitted.
         │
         │  sanitized text         ← Only this crosses the network
         │  e.g. [PARTY_A], [EMAIL_1]
         ▼
┌─────────────────────────────┐
│  LAYER 3: Claude Code Agent │  Reads sanitized text.
│  Anthropic API (HTTPS)      │  Extracts structured legal fields.
└─────────────────────────────┘  Returns JSON output.
         │
         ▼
  STRUCTURED OUTPUT
  (JSON with placeholder tokens)
         │
         ▼  optional
  RE-HYDRATION  ← Swap tokens back to real values locally if needed

Under GDPR Article 25 (Data Protection by Design and by Default), you are required to implement appropriate technical measures to integrate data protection into processing activities. This pipeline addresses that requirement for the API call specifically: the anonymization layer replaces every identifier it detects before any text is transmitted to Anthropic’s infrastructure. What reaches the API is pseudonymised text in which those identifiers appear only as typed placeholder tokens. The mapping that resolves the tokens never leaves your server, so Anthropic receives nothing that lets it link a token back to a natural person. The guarantee is only as strong as the detection step, which is why the audit later in this guide matters.

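The tokens below are the pipeline's target coverage for personal data as defined in GDPR Article 4. Presidio's built-in recognizers handle most of these types; some national identifier formats need a custom recognizer.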

Token format -> What it replaces
[PARTY_A], [PARTY_B] -> Named parties and signatories (persons and legal entities)
[PERSON_1], [PERSON_2] -> Any other named individuals in body text
[EMAIL_1] -> Email addresses
[PHONE_1] -> Phone and fax numbers
[ADDRESS_1] -> Street addresses, postcodes, building names
[ORG_1], [ORG_2] -> Company and organisation names
[DATE_1], [DATE_2] -> Specific dates (birth dates, signing dates, etc.)
[ID_1] -> National IDs, VAT numbers, company registration numbers
[IBAN_1] -> Bank account and IBAN numbers
[IP_1] -> IP addresses
[URL_1] -> URLs containing personal identifiers
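
To make the token format concrete, the following minimal Python sketch counts the placeholder tokens found in a piece of sanitized text. The regular expression and the name find_tokens are illustrative assumptions derived from the table above; they are not part of Presidio.

```python
import re
from collections import Counter

# Assumed token shape: [TYPE_N] with a numeric suffix, or a single-letter
# suffix for party tokens such as [PARTY_A].
TOKEN_RE = re.compile(r"\[([A-Z]+)_(\d+|[A-Z])\]")


def find_tokens(text: str) -> Counter:
    """Count placeholder tokens in sanitized text, grouped by type prefix."""
    return Counter(match.group(1) for match in TOKEN_RE.finditer(text))


if __name__ == "__main__":
    sample = "[PARTY_A] shall invoice [PERSON_1] at [EMAIL_1]; [PERSON_2] approves."
    print(dict(find_tokens(sample)))
```

A count like this is a quick sanity check that a sanitized file contains the token types you expect before it is sent anywhere.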

System setup

Update the package index

Always start with a full package update to ensure you are installing the latest available versions of all dependencies.

sudo apt update && sudo apt upgrade -y

Install system dependencies

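These packages provide the Python build tools needed when a dependency compiles native code, plus the Poppler utilities. pdfplumber itself does not depend on Poppler; poppler-utils supplies the pdftotext command used in the Troubleshooting section.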

sudo apt install -y \
  python3-pip \
  python3-venv \
  python3-dev \
  build-essential \
  libpoppler-cpp-dev \
  pkg-config \
  poppler-utils \
  curl \
  git

Install Node.js 20 LTS

Claude Code is installed as an npm package and requires Node.js version 18 or higher. Ubuntu 24.04’s apt repository ships Node.js 18, which has reached end of life upstream, so you will install Node.js 20 LTS using the NodeSource setup script.

curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt install -y nodejs

Confirm both Node.js and npm installed correctly:

node --version
npm --version

You should see output similar to v20.x.x and 10.x.x. If either command returns command not found, the NodeSource install did not complete. Re-run the two commands above and check for errors in the output.

Create the project directory

Create a dedicated directory for the pipeline. All files, scripts, and outputs will live here.

mkdir -p ~/contract-pipeline/{contracts,sanitized,output,mappings}
cd ~/contract-pipeline

This creates four subdirectories inside ~/contract-pipeline:

contract-pipeline/
    contracts/
    sanitized/
    output/
    mappings/

Python environment and dependencies

Create a virtual environment

A virtual environment isolates the Python packages for this project from the rest of your system. This prevents version conflicts and makes the pipeline reproducible.

cd ~/contract-pipeline
python3 -m venv venv
source venv/bin/activate

Your terminal prompt will change to show (venv) at the start. This confirms the virtual environment is active. You must activate the virtual environment every time you open a new terminal session before running pipeline commands. The activation command is always source ~/contract-pipeline/venv/bin/activate.
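
If you want to confirm from Python itself that the virtual environment is active (useful inside scripts, where no prompt is visible), a minimal sketch is shown below. The function name in_virtualenv is an illustrative choice; the check relies on the standard sys.prefix and sys.base_prefix attributes.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the system installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("venv active" if in_virtualenv() else "venv NOT active")
```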

Install Python packages

Install all required packages in one command. This may take two to three minutes depending on your connection speed.

pip install \
  pdfplumber \
  python-docx \
  spacy \
  presidio-analyzer \
  presidio-anonymizer \
  cryptography

Download the spaCy language model

Presidio uses spaCy’s named entity recognition (NER) to detect personal identifiers in text. The large English model (en_core_web_lg) provides the best accuracy for legal document text and is required for reliable GDPR-grade detection.

python3 -m spacy download en_core_web_lg

This downloads approximately 560 MB. When complete, verify it loaded correctly:

python3 -c "import spacy; nlp = spacy.load('en_core_web_lg'); print('Model loaded OK')"

The output should be Model loaded OK. If you see an error about the model not being found, the download may have been interrupted. Re-run the download command above.

Why the large model?

spaCy ships four English pipelines: small (sm), medium (md), large (lg), and a transformer-based pipeline (trf). The small pipeline has no static word vectors and the medium pipeline ships a reduced vector table, so both tend to miss more named entities in dense legal text. The large pipeline ships the full vector table and generally achieves higher recall on person names, organisation names, and location references. It is also the model Presidio loads by default. For a GDPR pipeline where a missed entity is a compliance failure, the large model is the correct choice.

Building the ingestion layer

The ingestion layer has one job: convert any supported file format into a single plain text string that the anonymization layer can process. Create the following file exactly as shown.

nano ~/contract-pipeline/ingest.py

Paste the following content into the editor, then press Ctrl+O to save and Ctrl+X to exit.

"""
ingest.py
Converts PDF, DOCX, or plain text contracts to a single text string.
No analysis is performed here. Output is always UTF-8 plain text.
"""

import pathlib
import sys
import pdfplumber
from docx import Document


def extract_text(file_path: str) -> str:
    """Extract plain text from PDF, DOCX, or TXT. Returns a string."""
    path = pathlib.Path(file_path)

    if not path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")

    suffix = path.suffix.lower()

    if suffix == ".pdf":
        return _extract_pdf(path)
    elif suffix == ".docx":  # python-docx cannot read legacy .doc files
        return _extract_docx(path)
    elif suffix in (".txt", ".md", ".text"):
        return path.read_text(encoding="utf-8", errors="replace")
    else:
        raise ValueError(f"Unsupported file type: {suffix}")


def _extract_pdf(path) -> str:
    """Extract text from all pages of a PDF."""
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                pages.append(text)
    return "\n\n".join(pages)


def _extract_docx(path) -> str:
    """Extract text from all paragraphs of a DOCX file."""
    doc = Document(path)
    return "\n".join(
        para.text for para in doc.paragraphs if para.text.strip()
    )


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python3 ingest.py <path/to/contract>")
        sys.exit(1)
    result = extract_text(sys.argv[1])
    print(result)

Test the ingestion layer

Drop a sample contract file into the contracts/ directory and run a quick test to confirm extraction is working before proceeding.

cd ~/contract-pipeline
source venv/bin/activate
python3 ingest.py contracts/sample-contract.pdf

You should see the contract text printed to the terminal. If a PDF produces blank output, it probably contains scanned images rather than embedded text. Scanned PDFs need an OCR step before they can enter the pipeline; the Troubleshooting section shows one way to add a text layer with ocrmypdf, or you can ask your document management team for text-layer PDFs.
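
A partially scanned PDF can produce some text and still hide blank pages. The sketch below flags pages whose extracted text is missing or suspiciously short. The function name pages_without_text and the 20-character threshold are assumptions for illustration; the input is the list you would get from calling page.extract_text() on each pdfplumber page.

```python
from typing import Optional


def pages_without_text(page_texts: list[Optional[str]], min_chars: int = 20) -> list[int]:
    """Return 1-based page numbers whose extracted text is missing or very short."""
    empty = []
    for number, text in enumerate(page_texts, start=1):
        # extract_text() can yield None or an empty string for image-only pages.
        if text is None or len(text.strip()) < min_chars:
            empty.append(number)
    return empty


if __name__ == "__main__":
    # Hypothetical extraction result: page 2 is a scanned image.
    texts = ["This Agreement is made between the parties named below.", None]
    print(pages_without_text(texts))
```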

Building the anonymization layer

This is the core GDPR safeguard. The anonymization script runs Presidio’s analyzer over the extracted text, identifies every personal identifier, replaces each one with a typed placeholder token, and saves the token-to-value mapping as an encrypted JSON file on your server. The mapping never leaves the server.

nano ~/contract-pipeline/anonymize.py

"""
anonymize.py
Strips all GDPR-relevant personal identifiers from extracted contract text.
Replaces each with a typed placeholder token (e.g. [PERSON_1], [EMAIL_1]).
Saves the token-to-value mapping as an encrypted JSON file locally.
The sanitized text is safe to transmit to external APIs.
"""

import json
import os
import pathlib
import sys
from collections import defaultdict
from cryptography.fernet import Fernet
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# ── Entity types to detect ──────────────────────────────────────────────────
# Built-in Presidio entity types relevant to GDPR Article 4 personal data.
ENTITIES = [
    "PERSON",
    "EMAIL_ADDRESS",
    "PHONE_NUMBER",
    "LOCATION",
    "ORGANIZATION", # Presidio's name for spaCy ORG; some default configs ignore it
    "DATE_TIME",
    "NRP",          # National/religious/political group
    "IP_ADDRESS",
    "URL",
    "IBAN_CODE",
    "MEDICAL_LICENSE",
    "US_SSN",
    "UK_NHS",
    "IN_PAN",       # Indian PAN card
    "IN_AADHAAR",   # Indian Aadhaar
    "AU_ABN",       # Australian business number
    "AU_ACN",
    "AU_TFN",       # Australian tax file number
    "SG_NRIC_FIN",  # Singapore ID
    "CRYPTO",       # Cryptocurrency wallet addresses
]

# ── Encryption key management ───────────────────────────────────────────────
KEY_FILE = pathlib.Path(os.path.expanduser("~/.contract_pipeline_key"))


def _get_or_create_key() -> bytes:
    """Load the encryption key or create one on first run."""
    if KEY_FILE.exists():
        return KEY_FILE.read_bytes()
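    key = Fernet.generate_key()
    # Create the file with mode 0600 from the start so the key is never
    # briefly readable by other users.
    fd = os.open(KEY_FILE, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as key_file:
        key_file.write(key)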
    print(f"Encryption key created at {KEY_FILE} — back this up securely.")
    return key


# ── Token counter — gives each entity type its own numbered sequence ────────
_counters: dict = defaultdict(int)
_value_to_token: dict = {}
_token_to_value: dict = {}

# Map Presidio entity type names to readable token prefixes
_PREFIX_MAP = {
    "PERSON":           "PERSON",
    "EMAIL_ADDRESS":    "EMAIL",
    "PHONE_NUMBER":     "PHONE",
    "LOCATION":         "ADDRESS",
    "ORGANIZATION":     "ORG",
    "DATE_TIME":        "DATE",
    "NRP":              "GROUP",
    "IP_ADDRESS":       "IP",
    "URL":              "URL",
    "IBAN_CODE":        "IBAN",
    "MEDICAL_LICENSE":  "ID",
    "US_SSN":           "ID",
    "UK_NHS":           "ID",
    "IN_PAN":           "ID",
    "IN_AADHAAR":       "ID",
    "AU_ABN":           "ID",
    "AU_ACN":           "ID",
    "AU_TFN":           "ID",
    "SG_NRIC_FIN":      "ID",
    "CRYPTO":           "WALLET",
}


def _make_token(entity_type: str, value: str) -> str:
    """Return a stable token for this value, creating one if first seen."""
    if value in _value_to_token:
        return _value_to_token[value]
    prefix = _PREFIX_MAP.get(entity_type, entity_type)
    _counters[prefix] += 1
    token = f"[{prefix}_{_counters[prefix]}]"
    _value_to_token[value] = token
    _token_to_value[token] = value
    return token


def anonymize(text: str) -> tuple[str, dict]:
    """
    Anonymize text. Returns (sanitized_text, token_to_value_mapping).
    The mapping is the only record of what each token represents.
    Store it locally and never transmit it.
    """
    # Reset per-document state
    _counters.clear()
    _value_to_token.clear()
    _token_to_value.clear()

    analyzer = AnalyzerEngine()

    # Analyze: find all entity spans in the text
    results = analyzer.analyze(
        text=text,
        entities=ENTITIES,
        language="en",
    )

    # Presidio operators are keyed by entity type, so a single "replace"
    # operator would give every PERSON the same token. Replace the spans
    # directly instead so each distinct value gets its own token.

    # Drop overlapping spans, preferring higher scores, then longer spans
    kept = []
    for result in sorted(results, key=lambda r: (-r.score, r.start - r.end)):
        if all(result.end <= k.start or result.start >= k.end for k in kept):
            kept.append(result)

    # Substitute in reading order so token numbers follow the document
    pieces, cursor = [], 0
    for result in sorted(kept, key=lambda r: r.start):
        pieces.append(text[cursor:result.start])
        pieces.append(_make_token(result.entity_type, text[result.start:result.end]))
        cursor = result.end
    pieces.append(text[cursor:])

    return "".join(pieces), dict(_token_to_value)


def save_mapping(mapping: dict, mapping_path: str) -> None:
    """Encrypt and save the token-to-value mapping locally."""
    key = _get_or_create_key()
    fernet = Fernet(key)
    plaintext = json.dumps(mapping, indent=2).encode()
    encrypted = fernet.encrypt(plaintext)
    pathlib.Path(mapping_path).write_bytes(encrypted)
    pathlib.Path(mapping_path).chmod(0o600)


def load_mapping(mapping_path: str) -> dict:
    """Decrypt and load a previously saved mapping."""
    key = _get_or_create_key()
    fernet = Fernet(key)
    encrypted = pathlib.Path(mapping_path).read_bytes()
    plaintext = fernet.decrypt(encrypted)
    return json.loads(plaintext)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python3 anonymize.py <input.txt> <output_sanitized.txt>")
        sys.exit(1)
    raw_text = pathlib.Path(sys.argv[1]).read_text(encoding="utf-8")
    sanitized, mapping = anonymize(raw_text)
    out_path = pathlib.Path(sys.argv[2])
    out_path.write_text(sanitized, encoding="utf-8")
    mapping_path = pathlib.Path("mappings") / (out_path.stem + ".enc")
    save_mapping(mapping, str(mapping_path))
    print(f"Sanitized text written to: {out_path}")
    print(f"Encrypted mapping saved to: {mapping_path}")
    print(f"Entities replaced: {len(mapping)}")
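
The core idea of the script is that each distinct value receives its own numbered token and that repeated values reuse the same token. The sketch below shows that idea in isolation, without Presidio. The function replace_spans and the hand-written (start, end, entity_type) tuples are hypothetical stand-ins for analyzer results, chosen only so the example runs on its own.

```python
from collections import defaultdict


def replace_spans(text, spans, prefixes=None):
    """Replace (start, end, entity_type) spans with numbered tokens.

    Returns (sanitized_text, token_to_value). Spans must not overlap.
    """
    prefixes = prefixes or {}
    counters = defaultdict(int)
    value_to_token = {}
    token_to_value = {}
    pieces = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        value = text[start:end]
        if value not in value_to_token:
            # First sighting of this value: mint the next token for its type.
            prefix = prefixes.get(entity_type, entity_type)
            counters[prefix] += 1
            token = f"[{prefix}_{counters[prefix]}]"
            value_to_token[value] = token
            token_to_value[token] = value
        pieces.append(text[cursor:start])
        pieces.append(value_to_token[value])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), token_to_value


if __name__ == "__main__":
    text = "Alice Smith hired Bob Jones. Alice Smith signed."
    spans = [(0, 11, "PERSON"), (18, 27, "PERSON"), (29, 40, "PERSON")]
    print(replace_spans(text, spans))
```

Reusing one token per distinct value matters for extraction quality: the model can tell that the person who signed is the same person named in the first clause, even though it never sees the name.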

GDPR note: The encryption key

The script generates a Fernet symmetric encryption key on first run and saves it to ~/.contract_pipeline_key with permissions 0600 (read and write for the owner only). This key is the only way to decrypt the mapping files. Back it up to a secure offline location. If you lose it, the mapping files cannot be decrypted and you cannot re-hydrate placeholder tokens back to real values. Apart from that backup, the key file must never leave the server.

Installing and configuring Claude Code

Install Claude Code globally

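Claude Code is installed as a global npm package. The -g flag installs the claude command outside any single project so it is available from every directory. Anthropic’s documentation advises against running the global install with sudo, so first point npm’s global prefix at a directory you own:

mkdir -p ~/.npm-global
npm config set prefix ~/.npm-global
echo 'export PATH="$HOME/.npm-global/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
npm install -g @anthropic-ai/claude-code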

Verify the installation:

claude --version

Obtain an Anthropic API key

Claude Code requires an Anthropic API key to function. To get one:

a) Open the Anthropic Console
Go to console.anthropic.com in a browser and sign in to your Anthropic account.

b) Navigate to API Keys
In the left sidebar, click API Keys, then click Create Key.

c) Name and copy the key
Give the key a name such as contract-pipeline-prod. Copy the key value immediately. It is only shown once.

d) Set the key as an environment variable
Add the key to your shell profile so it is available in every session.

echo 'export ANTHROPIC_API_KEY="your-key-here"' >> ~/.bashrc
source ~/.bashrc

Replace your-key-here with the actual key value you copied.
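
To confirm the variable is set without printing the secret to your terminal or shell history, you can use a small check like the one below. It is a minimal sketch; the function name describe_key and the choice to show only the last four characters are illustrative assumptions.

```python
import os


def describe_key(value):
    """Summarise an API key without revealing it."""
    if not value:
        return "ANTHROPIC_API_KEY is not set"
    if value == "your-key-here":
        return "ANTHROPIC_API_KEY still holds the placeholder text"
    # Show only the length and the last four characters.
    return f"ANTHROPIC_API_KEY is set ({len(value)} chars, ends in ...{value[-4:]})"


if __name__ == "__main__":
    print(describe_key(os.environ.get("ANTHROPIC_API_KEY")))
```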

Never commit API keys to version control

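If you are using Git for this project, add .env to your .gitignore immediately, along with contracts/, sanitized/, and mappings/, which hold personal data. The key exported in ~/.bashrc lives outside the repository, so it cannot be committed by accident, but never paste it into a tracked file. An exposed API key gives anyone who finds it the ability to call the Anthropic API using your account and billing.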

Authenticate Claude Code

Run the Claude Code authentication flow to confirm the key is working:

cd ~/contract-pipeline
claude

Claude Code will start an interactive session. Type hello and press Enter. If you receive a response, authentication is working. Type /exit (or press Ctrl+C twice) to leave the interactive session before continuing.

Defining the extraction agent

Claude Code reads a file called CLAUDE.md from the project directory. This file defines the agent’s role, its instructions, and the exact output format you expect. Create it now.

nano ~/contract-pipeline/CLAUDE.md

# Contract Intelligence Agent

## Role

You are a contract analysis agent. You extract structured legal and
commercial information from sanitized contract text.

The text you receive has been pre-processed by a GDPR anonymization
layer. All personal identifiers have been replaced with typed
placeholder tokens (e.g. [PERSON_1], [ORG_1], [EMAIL_1], [DATE_1]).
You will see these tokens throughout the text. This is expected and
correct. Do not attempt to guess or restore what any token represents.
Use the token labels as-is in your output.

## Your task

Read the contract text provided in the file given to you and extract
every field listed in the Output Schema below. For each field:

- Extract the actual value from the contract text.
- If a field is present but uses placeholder tokens, include the tokens
  in your output exactly as they appear.
- If a field is not present in the contract, set its value to null.
- Do not invent or infer values that are not explicitly stated.
- Do not summarise or paraphrase legal language — quote it exactly
  where instructed.

## Output format

Return ONLY a single valid JSON object. No preamble, no explanation,
no markdown code fences. Start the output with { and end with }.

## Output schema

{
  "contract_type": "string | null — e.g. Service Agreement, NDA, SLA",
  "effective_date": "string | null — exact text from contract",
  "expiry_date": "string | null — exact text from contract",
  "duration": "string | null — e.g. 12 months, 3 years",
  "parties": [
    {
      "token": "string — placeholder token e.g. [ORG_1]",
      "role": "string — role in contract e.g. Service Provider, Client"
    }
  ],
  "payment_terms": {
    "amount": "string | null",
    "currency": "string | null",
    "frequency": "string | null — e.g. monthly, quarterly, annually",
    "due_date_terms": "string | null — e.g. 30 days from invoice",
    "late_payment_penalty": "string | null",
    "exact_clause": "string | null — verbatim clause text"
  },
  "termination": {
    "notice_period": "string | null",
    "termination_for_cause": "string | null",
    "termination_for_convenience": "string | null",
    "exact_clause": "string | null — verbatim clause text"
  },
  "liability": {
    "cap_amount": "string | null",
    "cap_basis": "string | null — e.g. total fees paid in prior 12 months",
    "exclusions": ["string"],
    "exact_clause": "string | null — verbatim clause text"
  },
  "renewal": {
    "auto_renews": "boolean | null",
    "renewal_notice_period": "string | null",
    "renewal_terms": "string | null",
    "exact_clause": "string | null — verbatim clause text"
  },
  "governing_law": {
    "jurisdiction": "string | null",
    "dispute_resolution": "string | null — e.g. arbitration, litigation",
    "exact_clause": "string | null — verbatim clause text"
  },
  "sla": {
    "uptime_commitment": "string | null",
    "response_time": "string | null",
    "resolution_time": "string | null",
    "penalty_for_breach": "string | null",
    "exact_clause": "string | null — verbatim clause text"
  },
  "confidentiality": {
    "duration": "string | null",
    "scope": "string | null",
    "exclusions": ["string"],
    "exact_clause": "string | null — verbatim clause text"
  },
  "intellectual_property": {
    "ownership": "string | null",
    "licence_granted": "string | null",
    "exact_clause": "string | null — verbatim clause text"
  },
  "indemnification": {
    "indemnifying_party": "string | null — use token if applicable",
    "scope": "string | null",
    "exact_clause": "string | null — verbatim clause text"
  },
  "force_majeure": {
    "present": "boolean | null",
    "exact_clause": "string | null — verbatim clause text"
  },
  "amendments": {
    "amendment_process": "string | null",
    "exact_clause": "string | null — verbatim clause text"
  },
  "notices": {
    "method": "string | null — e.g. registered post, email",
    "contact_tokens": ["string — placeholder tokens only"],
    "exact_clause": "string | null — verbatim clause text"
  },
  "entire_agreement": "boolean | null — true if entire agreement clause present",
  "severability": "boolean | null — true if severability clause present",
  "extraction_notes": "string | null — flag any ambiguities or missing sections"
}

Customising the schema

The schema above covers the full standard set of commercial contract fields. If your contracts regularly contain additional clauses specific to your industry (for example data processing agreements, escrow arrangements, or benchmark testing clauses), you can add fields to the schema following the same pattern. Add the field name, expected type, and a brief description after the colon. Claude Code will attempt to extract any field you define.
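
When you add or remove schema fields, it helps to check each extraction result against the list of expected top-level keys. The sketch below does that with the standard library only. EXPECTED_KEYS is copied from the schema above; the function name check_keys is an illustrative assumption, and you would extend the set whenever you customise the schema.

```python
import json

# Top-level keys taken from the Output schema in CLAUDE.md.
EXPECTED_KEYS = {
    "contract_type", "effective_date", "expiry_date", "duration", "parties",
    "payment_terms", "termination", "liability", "renewal", "governing_law",
    "sla", "confidentiality", "intellectual_property", "indemnification",
    "force_majeure", "amendments", "notices", "entire_agreement",
    "severability", "extraction_notes",
}


def check_keys(raw_json: str) -> tuple[set, set]:
    """Return (missing_keys, unexpected_keys) for one extraction result."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    found = set(data)
    return EXPECTED_KEYS - found, found - EXPECTED_KEYS


if __name__ == "__main__":
    missing, extra = check_keys('{"contract_type": "NDA", "escrow": null}')
    print("missing:", sorted(missing))
    print("unexpected:", sorted(extra))
```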

Running the full pipeline

Now that all layers are in place, create the main pipeline runner script that ties everything together.

nano ~/contract-pipeline/run_pipeline.sh

#!/usr/bin/env bash
# run_pipeline.sh
# Runs the full three-layer contract extraction pipeline for a single file.
# Usage: ./run_pipeline.sh contracts/my-contract.pdf

set -e

CONTRACT_FILE="$1"

if [ -z "$CONTRACT_FILE" ]; then
  echo "Usage: ./run_pipeline.sh <path/to/contract>"
  exit 1
fi

BASENAME=$(basename "$CONTRACT_FILE" | sed 's/\.[^.]*$//')
RAW_TEXT="sanitized/${BASENAME}_raw.txt"
SANITIZED="sanitized/${BASENAME}_sanitized.txt"
OUTPUT="output/${BASENAME}_extracted.json"

echo "[1/3] Ingesting: $CONTRACT_FILE"
source venv/bin/activate
python3 ingest.py "$CONTRACT_FILE" > "$RAW_TEXT"
echo "      Raw text saved to: $RAW_TEXT"

echo "[2/3] Anonymizing..."
python3 anonymize.py "$RAW_TEXT" "$SANITIZED"
echo "      Sanitized text saved to: $SANITIZED"

echo "[3/3] Running Claude Code extraction..."
claude \
  --print \
  --max-turns 3 \
  "Read the contract in the file $SANITIZED and extract all fields per CLAUDE.md. Output only valid JSON." \
  > "$OUTPUT"

echo "      Extraction complete. JSON saved to: $OUTPUT"
echo ""
echo "Done. Files produced:"
echo "  Raw text:   $RAW_TEXT"
echo "  Sanitized:  $SANITIZED"
echo "  Mapping:    mappings/${BASENAME}_sanitized.enc"
echo "  Output:     $OUTPUT"

Make the script executable:

chmod +x ~/contract-pipeline/run_pipeline.sh
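
The file names the script produces follow a fixed convention, and the mapping name in particular comes from anonymize.py rather than from the shell script. The Python sketch below restates that convention in one place; derive_paths is a hypothetical helper written for illustration and is not called by the pipeline.

```python
from pathlib import Path


def derive_paths(contract_file: str) -> dict:
    """Mirror the naming used by run_pipeline.sh and anonymize.py."""
    base = Path(contract_file).stem  # file name without its last extension
    sanitized = Path("sanitized") / f"{base}_sanitized.txt"
    return {
        "raw": Path("sanitized") / f"{base}_raw.txt",
        "sanitized": sanitized,
        # anonymize.py names the mapping after the sanitized file's stem
        "mapping": Path("mappings") / f"{sanitized.stem}.enc",
        "output": Path("output") / f"{base}_extracted.json",
    }


if __name__ == "__main__":
    for label, path in derive_paths("contracts/your-contract.pdf").items():
        print(f"{label:10} {path}")
```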

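ingest.py already prints the extracted text to standard output, so the shell redirect in run_pipeline.sh captures it without any change to the extraction logic. If you want the script to tolerate extra arguments, you can optionally replace the if __name__ == "__main__" block at the bottom with this version, which only relaxes the argument check: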

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python3 ingest.py <path/to/contract>")
        sys.exit(1)
    result = extract_text(sys.argv[1])
    print(result)  # shell redirect captures this to a file

Run the pipeline on a contract

Place a contract file in the contracts/ directory, then run:

cd ~/contract-pipeline
./run_pipeline.sh contracts/your-contract.pdf

A successful run produces output similar to this:

[1/3] Ingesting: contracts/your-contract.pdf
      Raw text saved to: sanitized/your-contract_raw.txt
[2/3] Anonymizing...
Encryption key created at /home/youruser/.contract_pipeline_key — back this up securely.
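Sanitized text written to: sanitized/your-contract_sanitized.txt
Encrypted mapping saved to: mappings/your-contract_sanitized.enc
Entities replaced: 34
      Sanitized text saved to: sanitized/your-contract_sanitized.txt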
[3/3] Running Claude Code extraction...
      Extraction complete. JSON saved to: output/your-contract_extracted.json

Done. Files produced:
  Raw text:   sanitized/your-contract_raw.txt
  Sanitized:  sanitized/your-contract_sanitized.txt
  Mapping:    mappings/your-contract_sanitized.enc
  Output:     output/your-contract_extracted.json

Inspect the extracted JSON:

cat output/your-contract_extracted.json | python3 -m json.tool

The python3 -m json.tool part pretty-prints the JSON so it is readable in the terminal. You should see a structured object with all fields from the schema, populated where the contract contained the relevant information and set to null where it did not.
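
The optional re-hydration step from the pipeline diagram amounts to walking the extracted JSON and swapping each token for its stored value. A minimal sketch follows. The function rehydrate and the token regular expression are assumptions for illustration, and the mapping shown is invented sample data; in practice you would pass the dictionary returned by load_mapping in anonymize.py, and run this only on your own server.

```python
import re

# Assumed token shape, e.g. [PERSON_1], [ORG_2], [PARTY_A].
TOKEN_RE = re.compile(r"\[[A-Z][A-Z_]*_(?:\d+|[A-Z])\]")


def rehydrate(node, mapping):
    """Recursively swap placeholder tokens for their original values."""
    if isinstance(node, str):
        # Unknown tokens are left in place rather than guessed at.
        return TOKEN_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), node)
    if isinstance(node, list):
        return [rehydrate(item, mapping) for item in node]
    if isinstance(node, dict):
        return {key: rehydrate(value, mapping) for key, value in node.items()}
    return node  # numbers, booleans, and null pass through unchanged


if __name__ == "__main__":
    sample_mapping = {"[ORG_1]": "Example Holdings Ltd", "[DATE_1]": "1 March 2024"}
    extracted = {
        "effective_date": "[DATE_1]",
        "parties": [{"token": "[ORG_1]", "role": "Client"}],
    }
    print(rehydrate(extracted, sample_mapping))
```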

Auditing that no PII reached the API

Before putting this pipeline into production, perform a manual audit pass to confirm the anonymization layer is catching all identifiers. The results give you evidence for your DPIA and help document the technical measures you have taken under Article 25.

Visual inspection of the sanitized file

Open the sanitized text file and scan it. Every personal name, company name, email, phone number, address, date, and ID number should have been replaced with a bracketed token.

cat sanitized/your-contract_sanitized.txt | less

Use arrow keys to scroll. Press q to exit.

Search for common PII patterns that may have been missed

Run these grep commands against the sanitized file to check for common patterns that might have slipped through. A clean result (no output) means nothing was found.

# Check for email addresses
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' \
  sanitized/your-contract_sanitized.txt

# Check for phone number patterns
grep -oE '(\+?[0-9]{1,3}[[:space:]-]?)?(\([0-9]{2,4}\)[[:space:]-]?)?[0-9]{3,5}[[:space:]-]?[0-9]{3,5}' \
  sanitized/your-contract_sanitized.txt

# Check for IBAN patterns
grep -oE '[A-Z]{2}[0-9]{2}[A-Z0-9]{4}[0-9]{7,}' \
  sanitized/your-contract_sanitized.txt

# Check for postcode / zip code patterns
grep -oE '[A-Z]{1,2}[0-9][0-9A-Z]?\s?[0-9][A-Z]{2}|[0-9]{5}(-[0-9]{4})?' \
  sanitized/your-contract_sanitized.txt

When grep finds something

If any of these commands return a result, Presidio did not catch that entity. This can happen with unusual formatting (for example, phone numbers written as words, or addresses that use non-standard punctuation). For each miss you find, note the pattern and consider adding a custom Presidio recognizer for it. The Presidio documentation at microsoft.github.io/presidio covers writing custom recognizers using regular expressions or ML models.
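
If you prefer to run the same kind of check from Python, for example as part of a batch job, a minimal sketch is shown below. The patterns are heuristics in the spirit of the grep commands rather than exact copies, and the names PATTERNS, TOKEN_RE, and audit are illustrative assumptions. Placeholder tokens are blanked out first so they cannot cause a match.

```python
import re

# Heuristic patterns only; they mirror the intent of the grep checks.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "iban": re.compile(r"\b[A-Z]{2}[0-9]{2}[A-Z0-9]{4}[0-9]{7,}\b"),
    "ipv4": re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b"),
}
TOKEN_RE = re.compile(r"\[[A-Z][A-Z_]*_(?:\d+|[A-Z])\]")


def audit(text: str) -> dict:
    """Return {pattern_name: [matches]} for anything that looks like leftover PII."""
    stripped = TOKEN_RE.sub(" ", text)
    hits = {}
    for name, pattern in PATTERNS.items():
        found = pattern.findall(stripped)
        if found:
            hits[name] = found
    return hits


if __name__ == "__main__":
    print(audit("Contact [EMAIL_1] or write to jane.doe@example.com."))
```

An empty dictionary corresponds to the clean grep result described above; any entry is a candidate for a custom recognizer.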

Verify the mapping file is encrypted

Confirm the mapping file is not readable as plain text:

file mappings/your-contract_sanitized.enc
head -c 100 mappings/your-contract_sanitized.enc | cat

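The file command will typically report ASCII text, because a Fernet token is URL-safe base64 rather than raw binary. The head output should be an opaque string beginning with gAAAAA, not readable JSON. If you can read names or email addresses in the output, the encryption step did not run correctly. Do not proceed to production until this is resolved.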
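
The check can also be automated. Fernet output is URL-safe base64 text with a fixed internal layout, so a structural test can tell a Fernet token apart from plain JSON without needing the key. The sketch below uses only the standard library; the function name looks_like_fernet_token is an illustrative assumption, and the check does not verify the token's signature or decrypt anything.

```python
import base64
import binascii


def looks_like_fernet_token(data: bytes) -> bool:
    """Structural check only: does this look like a Fernet token?"""
    try:
        raw = base64.urlsafe_b64decode(data.strip())
    except (binascii.Error, ValueError):
        return False
    # Layout: version (1) + timestamp (8) + IV (16) + ciphertext (16n) + HMAC (32)
    if len(raw) < 73 or raw[0] != 0x80:
        return False
    return (len(raw) - 57) % 16 == 0


if __name__ == "__main__":
    # Plain JSON, as an unencrypted mapping would look, fails the check.
    print(looks_like_fernet_token(b'{"[PERSON_1]": "Alice Smith"}'))
```

To apply it to a real mapping, pass the result of pathlib.Path("mappings/your-contract_sanitized.enc").read_bytes().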

What a passing audit looks like

No grep commands return results, apart from false positives you have reviewed (the postcode pattern also matches any five-digit number). The sanitized file contains only bracketed tokens where identifiers were. The mapping file is an opaque Fernet token with no readable names or addresses. The extracted JSON contains tokens in place of any personal identifiers. This state is what you document in your DPIA as evidence of the technical safeguard.

Troubleshooting

The spaCy model fails to load with “Can’t find model ‘en_core_web_lg’”

The model download did not complete or was installed outside the active virtual environment. Activate the virtual environment first with source ~/contract-pipeline/venv/bin/activate, then re-run python3 -m spacy download en_core_web_lg. Confirm with python3 -c "import spacy; spacy.load('en_core_web_lg')".

pdfplumber returns empty text from a PDF

The PDF is likely a scanned image rather than a text-layer PDF. Run pdftotext your-contract.pdf - and check for output. If that also returns nothing, the PDF requires OCR processing. Use a tool such as ocrmypdf to add a text layer before running the pipeline: sudo apt install ocrmypdf then ocrmypdf input.pdf output.pdf.

Claude Code returns “Authentication error” or “Invalid API key”

The ANTHROPIC_API_KEY environment variable is either not set or set incorrectly. Run echo $ANTHROPIC_API_KEY to check its value. If it is empty, re-run source ~/.bashrc. If it shows the key, log into console.anthropic.com and confirm the key is active and not expired.

The extracted JSON is malformed or wrapped in markdown fences

Claude Code occasionally wraps output in markdown code fences (```json … ```) despite instructions not to. Strip them programmatically: cat output/file.json | sed 's/```json//g' | sed 's/```//g' | python3 -m json.tool. If this happens consistently, add the phrase “Do not use markdown code fences” explicitly to the first line of the CLAUDE.md output instructions.
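
If you post-process results in Python rather than in the shell, the same clean-up can be done before parsing. This is a minimal sketch; the name parse_model_json and the regular expression are assumptions for illustration, and it handles only a single fence wrapped around the whole output.

```python
import json
import re

# One optional fence around the entire output, with an optional language tag.
FENCE_RE = re.compile(r"^\s*```[A-Za-z]*\s*\n(.*?)\n?\s*```\s*$", re.DOTALL)


def parse_model_json(raw: str):
    """Parse JSON output, tolerating one surrounding markdown code fence."""
    match = FENCE_RE.match(raw)
    body = match.group(1) if match else raw
    return json.loads(body)


if __name__ == "__main__":
    fenced = '```json\n{"contract_type": "NDA"}\n```\n'
    print(parse_model_json(fenced))
```

Unlike the sed approach, this leaves any backticks inside clause text untouched, because only the outer fence is removed.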

Presidio is replacing contract clause dates with [DATE_1] tokens

This is correct and expected behaviour. The schema asks Claude Code to extract exact clause text, which will contain date tokens. The extracted JSON will show "effective_date": "[DATE_1]". To resolve a token back to its original value, decrypt the mapping file: python3 -c "from anonymize import load_mapping; import json; print(json.dumps(load_mapping('mappings/your-contract_sanitized.enc'), indent=2))".

The pipeline runs but produces “Entities replaced: 0”

Presidio found no personal identifiers in the text. This may mean the contract is already anonymized, or the text extraction produced garbled output that Presidio cannot parse. Check the raw text file first: head -50 sanitized/your-contract_raw.txt. If the text looks garbled, re-check the ingestion step for the specific file format.

Permission denied when running run_pipeline.sh

The script is not marked as executable. Run chmod +x ~/contract-pipeline/run_pipeline.sh and try again.

node: command not found after installing Node.js

The NodeSource install completed but the shell may still be using a stale command lookup. Run hash -r or open a new terminal session. If node --version still fails, confirm the install ran without errors by re-running curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash - and checking for any error lines in the output.