Claude API Structured Output Guide: JSON Extraction, Data Parsing, and Schema Validation
Why Structured Output Is Claude’s Killer Feature for Developers
Most LLM use cases in production are not chatbots — they are data extraction pipelines. A company needs to parse invoices into accounting records. A recruiter needs to extract skills from resumes. A legal team needs to pull key terms from contracts. In every case, the output must be structured, validated, and machine-readable — not free-form text.
Claude excels at structured output because of three features: strong instruction following (it respects schema definitions), prefill (you can force the response to start with { for guaranteed JSON), and large context (you can process entire documents in one call). Combined with schema validation using Pydantic or Zod, you get a reliable extraction pipeline.
This guide covers the patterns for building production-grade structured output with Claude API.
The Basics: Getting JSON from Claude
Simple JSON Extraction
import anthropic
import json

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a data extraction assistant. Always respond with valid JSON only. No markdown, no explanation, no text outside the JSON object.",
    messages=[
        {
            "role": "user",
            "content": """Extract the following information from this text:

"John Smith, CEO of Acme Corp, announced today that the company
raised $50 million in Series C funding led by Sequoia Capital.
The round values the company at $500 million. Acme Corp, founded
in 2019, provides cloud infrastructure for AI startups."

Extract: person_name, title, company, funding_amount, funding_round,
lead_investor, valuation, founded_year, company_description"""
        },
        {
            "role": "assistant",
            "content": "{"  # Prefill forces JSON output
        }
    ]
)

# Complete the JSON (Claude continues from the prefill)
json_str = "{" + message.content[0].text
data = json.loads(json_str)
print(json.dumps(data, indent=2))
Output:
{
  "person_name": "John Smith",
  "title": "CEO",
  "company": "Acme Corp",
  "funding_amount": "$50 million",
  "funding_round": "Series C",
  "lead_investor": "Sequoia Capital",
  "valuation": "$500 million",
  "founded_year": 2019,
  "company_description": "Provides cloud infrastructure for AI startups"
}
The Prefill Technique Explained
The key line is:
{"role": "assistant", "content": "{"}
This tells Claude that the assistant’s response has already started with {. Claude will continue from there, producing valid JSON. Without this, Claude may add explanatory text before or after the JSON.
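Because the API response contains only the text generated after the prefill, you must prepend the `{` yourself before parsing. A small helper keeps the prefill and the reassembly step in sync (a sketch; `complete_prefilled_json` is an illustrative name, not part of the SDK):

```python
import json

PREFILL = "{"

def complete_prefilled_json(continuation: str, prefill: str = PREFILL) -> dict:
    """Rejoin the assistant prefill with the model's continuation and parse it.

    The Messages API returns only the text generated *after* the prefill,
    so the prefill must be prepended before json.loads.
    """
    return json.loads(prefill + continuation)
```

With the example above, this is simply `data = complete_prefilled_json(message.content[0].text)`.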
Schema-Driven Extraction with Pydantic
Define Your Schema
from pydantic import BaseModel, Field
from typing import Optional, List
from enum import Enum

class FundingRound(str, Enum):
    seed = "seed"
    series_a = "series_a"
    series_b = "series_b"
    series_c = "series_c"
    series_d = "series_d"

class FundingEvent(BaseModel):
    # Note: Optional fields need an explicit default (None), otherwise
    # Pydantic v2 still treats them as required.
    company_name: str = Field(description="Name of the company")
    funding_amount_usd: Optional[int] = Field(default=None, description="Amount in USD, null if not stated")
    funding_round: Optional[FundingRound] = Field(default=None, description="Type of funding round")
    lead_investors: List[str] = Field(description="List of lead investors")
    valuation_usd: Optional[int] = Field(default=None, description="Company valuation in USD")
    date: Optional[str] = Field(default=None, description="Date of announcement, ISO format")
    sector: str = Field(description="Industry sector")
    headquarters: Optional[str] = Field(default=None, description="City and country")
Generate the Schema Description for Claude
def schema_to_prompt(model_class):
    """Render a Pydantic model's JSON schema as field-by-field instructions."""
    schema = model_class.model_json_schema()
    properties = schema.get("properties", {})
    required_fields = set(schema.get("required", []))
    lines = ["Return a JSON object with these fields:"]
    for name, prop in properties.items():
        # Optional fields serialize as anyOf [type, null] in Pydantic v2
        if "anyOf" in prop:
            type_str = " or ".join(sub.get("type", "object") for sub in prop["anyOf"])
        else:
            type_str = prop.get("type", "string")
        desc = prop.get("description", "")
        nullable = "required" if name in required_fields else "null if not found"
        lines.append(f' - "{name}": {type_str} — {desc} ({nullable})')
    return "\n".join(lines)
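The same traversal works on any JSON Schema dict, not just Pydantic output. A minimal stdlib-only illustration (`schema_to_prompt_from_dict` is a hypothetical variant, shown with a hand-written schema):

```python
def schema_to_prompt_from_dict(schema: dict) -> str:
    """Render a raw JSON Schema dict as field-by-field instructions."""
    properties = schema.get("properties", {})
    required_fields = set(schema.get("required", []))
    lines = ["Return a JSON object with these fields:"]
    for name, prop in properties.items():
        type_str = prop.get("type", "string")
        desc = prop.get("description", "")
        nullable = "required" if name in required_fields else "null if not found"
        lines.append(f' - "{name}": {type_str} — {desc} ({nullable})')
    return "\n".join(lines)

schema = {
    "properties": {
        "company_name": {"type": "string", "description": "Name of the company"},
        "valuation_usd": {"type": "integer", "description": "Company valuation in USD"},
    },
    "required": ["company_name"],
}
print(schema_to_prompt_from_dict(schema))
```

The printed prompt marks `company_name` as required and `valuation_usd` as "null if not found", which is exactly the contract the validation step later enforces.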
Extract and Validate
def extract_structured(text: str, schema_class):
    schema_prompt = schema_to_prompt(schema_class)
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=f"""You are a precise data extraction assistant.
Extract information from the provided text and return valid JSON.

{schema_prompt}

Rules:
- Return ONLY the JSON object, no other text
- Use null for fields where information is not available
- Use exact values from the text, do not infer or guess
- Dates in ISO 8601 format (YYYY-MM-DD)
- Monetary amounts as integers in USD (convert if needed)""",
        messages=[
            {"role": "user", "content": text},
            {"role": "assistant", "content": "{"}
        ]
    )
    json_str = "{" + message.content[0].text
    # Validate with Pydantic
    try:
        return schema_class.model_validate_json(json_str)
    except Exception as e:
        # Handle validation failure
        return {"error": str(e), "raw": json_str}

# Usage
text = "Acme Corp raised $50M Series C from Sequoia..."
result = extract_structured(text, FundingEvent)
print(result.model_dump_json(indent=2))
Advanced Extraction Patterns
Multi-Entity Extraction
Extract multiple entities from a single document:
class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    amount: float

# LineItem must be defined before Invoice references it
class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    vendor_address: Optional[str] = None
    issue_date: str
    due_date: str
    line_items: List[LineItem]
    subtotal: float
    tax_amount: float
    total: float
    currency: str
    payment_terms: Optional[str] = None
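A useful post-extraction step is checking the invoice arithmetic, since misread line items or totals are a common failure mode. A sketch operating on plain dicts (field names follow the Invoice schema above; `check_invoice_totals` is an illustrative helper):

```python
def check_invoice_totals(invoice: dict, tolerance: float = 0.01) -> list:
    """Return a list of arithmetic inconsistencies in an extracted invoice.

    An empty list means line items, subtotal, tax, and total all add up
    within the given tolerance.
    """
    errors = []
    line_sum = sum(item["amount"] for item in invoice.get("line_items", []))
    if abs(line_sum - invoice["subtotal"]) > tolerance:
        errors.append(f"line items sum to {line_sum}, subtotal is {invoice['subtotal']}")
    expected_total = invoice["subtotal"] + invoice["tax_amount"]
    if abs(expected_total - invoice["total"]) > tolerance:
        errors.append(f"subtotal + tax = {expected_total}, total is {invoice['total']}")
    return errors
```

Any non-empty result is a strong signal to re-run the extraction or route the document to human review.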
Conditional Fields
Handle fields that depend on context:
class JobPosting(BaseModel):
    title: str
    company: str
    location: str
    remote_policy: str = Field(description="'remote', 'hybrid', 'onsite', or 'not_specified'")
    salary_min: Optional[int] = Field(default=None, description="Minimum salary in USD/year, null if not listed")
    salary_max: Optional[int] = Field(default=None, description="Maximum salary in USD/year, null if not listed")
    experience_years: Optional[int] = Field(default=None, description="Minimum years of experience required")
    skills_required: List[str]
    skills_preferred: List[str]
    benefits: List[str]
    posted_date: Optional[str] = None
Extraction with Confidence Scores
class ExtractedField(BaseModel):
    value: str
    confidence: float = Field(description="0.0 to 1.0, how confident you are in this extraction")
    source_text: str = Field(description="The exact text span this was extracted from")

class ContractExtraction(BaseModel):
    parties: List[ExtractedField]
    effective_date: ExtractedField
    termination_date: Optional[ExtractedField] = None
    total_value: Optional[ExtractedField] = None
    payment_terms: Optional[ExtractedField] = None
    governing_law: Optional[ExtractedField] = None
    key_obligations: List[ExtractedField]
Adding confidence scores helps downstream systems decide when to flag extractions for human review.
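The routing decision itself can be a few lines. A sketch over plain dicts in the ExtractedField shape above (`fields_needing_review` is an illustrative helper; the 0.7 threshold is an assumption to tune per use case):

```python
def fields_needing_review(extraction: dict, threshold: float = 0.7) -> list:
    """Return names of extracted fields whose confidence falls below threshold.

    Expects values in the ExtractedField shape {"value": ..., "confidence": ...,
    "source_text": ...}; None fields are skipped, lists are checked element-wise.
    """
    flagged = []
    for name, field in extraction.items():
        if field is None:
            continue
        items = field if isinstance(field, list) else [field]
        if any(item["confidence"] < threshold for item in items):
            flagged.append(name)
    return flagged
```

Flagged fields go to a human-review queue; everything else flows straight into the downstream system.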
Handling Edge Cases
Malformed JSON Recovery
import json
import re

def parse_json_safely(text: str):
    """Attempt to parse JSON with fallback recovery."""
    # Try direct parse
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Try to find a JSON object embedded in surrounding text
    match = re.search(r'\{.*\}', text, re.DOTALL)
    candidate = match.group() if match else text
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    # Try to fix common issues: trailing commas before closing brace/bracket
    fixed = re.sub(r',\s*}', '}', candidate)
    fixed = re.sub(r',\s*]', ']', fixed)
    try:
        return json.loads(fixed)
    except json.JSONDecodeError:
        return None
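Another common failure is the model wrapping otherwise-valid JSON in markdown code fences despite instructions not to. Stripping them first makes the recovery path above succeed more often (a sketch; `strip_markdown_fences` is an illustrative name):

```python
import re

def strip_markdown_fences(text: str) -> str:
    """Remove ```json ... ``` fences that models sometimes emit
    despite instructions to return bare JSON."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    return match.group(1) if match else text
```

Call it before `parse_json_safely`, e.g. `parse_json_safely(strip_markdown_fences(raw))`.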
Missing Fields Handling
def extract_with_defaults(text: str, schema_class):
    result = extract_structured(text, schema_class)
    if isinstance(result, dict) and "error" in result:
        # Retry with more explicit instructions appended to the input
        result = extract_structured(
            text + "\n\nIMPORTANT: You MUST include all required fields. "
            "Use null for any field where the information is truly not available.",
            schema_class
        )
    return result
Ambiguous Input Handling
system_prompt = """...

When the text is ambiguous:
- If a field could have multiple interpretations, choose the most likely one and set confidence to 0.5-0.7
- If a field is genuinely not present in the text, set it to null
- If a field is implied but not explicitly stated, include it with confidence 0.3-0.5 and note the inference in source_text
- NEVER fabricate data that is not in or implied by the text"""
Production Pipeline Architecture
Batch Processing
import asyncio
from anthropic import AsyncAnthropic

async_client = AsyncAnthropic()

async def extract_batch(texts: List[str], schema_class, max_concurrent=5):
    # extraction_system_prompt: the schema-driven system prompt built earlier
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(text, index):
        async with semaphore:
            try:
                message = await async_client.messages.create(
                    model="claude-sonnet-4-20250514",
                    max_tokens=2048,
                    system=extraction_system_prompt,
                    messages=[
                        {"role": "user", "content": text},
                        {"role": "assistant", "content": "{"}
                    ]
                )
                json_str = "{" + message.content[0].text
                result = schema_class.model_validate_json(json_str)
                return {"index": index, "result": result, "status": "success"}
            except Exception as e:
                return {"index": index, "error": str(e), "status": "failed"}

    tasks = [process_one(text, i) for i, text in enumerate(texts)]
    results = await asyncio.gather(*tasks)
    return sorted(results, key=lambda x: x["index"])
Retry with Escalation
async def extract_with_retry(text, schema_class, max_retries=3):
    # Escalation order: cheapest model first
    models = ["claude-haiku-4-5-20251001", "claude-sonnet-4-20250514", "claude-opus-4-20250514"]
    for attempt in range(max_retries):
        model = models[min(attempt, len(models) - 1)]
        try:
            # extract_one: a single-document variant of extract_batch's process_one
            result = await extract_one(text, schema_class, model=model)
            if result["status"] == "success":
                return result
        except Exception:
            continue
    return {"status": "failed", "error": "All retries exhausted"}
This pattern starts with the cheapest model (Haiku) and escalates to more capable models only when extraction fails — optimizing cost while maintaining reliability.
Cost Optimization
Model Selection by Task Complexity
| Task | Recommended Model | Cost per 1K extractions |
|---|---|---|
| Simple field extraction (name, date, amount) | Claude Haiku 4.5 | ~$0.50 |
| Multi-field document parsing | Claude Sonnet 4 | ~$5.00 |
| Complex contract analysis | Claude Opus 4 | ~$50.00 |
Reducing Token Usage
- Trim input: remove boilerplate, headers, footers before sending
- Batch related extractions: extract all fields in one call instead of separate calls per field
- Cache results: identical inputs should return cached results, not new API calls
- Use Haiku for validation: after Sonnet extracts, use Haiku to verify specific fields
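The caching bullet can be as simple as keying on a hash of the input text and schema. A sketch with an in-memory dict (swap in Redis or a database for persistence; `cached_extract` is an illustrative helper):

```python
import hashlib

_cache = {}

def cached_extract(text: str, schema_name: str, extract_fn):
    """Return a cached extraction for identical (schema, text) inputs.

    extract_fn (the real API call) runs only on a cache miss, so repeated
    documents cost nothing after the first extraction.
    """
    key = hashlib.sha256(f"{schema_name}:{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = extract_fn(text)
    return _cache[key]
```

Including the schema name in the key matters: the same text extracted against two different schemas must not share a cache entry.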
Frequently Asked Questions
Does Claude support native JSON mode like OpenAI?
Claude does not have a dedicated JSON mode equivalent to OpenAI's response_format parameter. Instead, use the prefill technique (start the assistant response with {) combined with clear system prompt instructions. In practice this produces reliably parseable JSON.
How do I handle very long documents?
Claude’s 200K token context window handles most documents. For extremely long documents (books, legal filings), split by section and extract from each section independently, then merge results.
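The split step can be a simple paragraph-boundary chunker (a sketch; `split_by_sections` is an illustrative helper, and the character budget is an assumption to tune against your token limits):

```python
def split_by_sections(document: str, max_chars: int = 50_000) -> list:
    """Split a long document on blank-line boundaries into chunks under max_chars.

    Each chunk can be sent to Claude independently and the extracted
    records merged afterwards.
    """
    chunks, current, size = [], [], 0
    for para in document.split("\n\n"):
        if size + len(para) > max_chars and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para) + 2  # account for the rejoined separator
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries rather than a fixed character offset keeps each entity's context intact, which matters more for extraction accuracy than perfectly even chunk sizes.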
Can Claude extract data from images?
Yes. Claude’s vision capability can extract structured data from images of forms, receipts, and documents. Include the image in the message and use the same schema-driven approach.
What about extraction in languages other than English?
Claude handles multilingual extraction well. You can provide source text in Korean, Japanese, or other languages and extract into English-labeled JSON fields. Or use native-language field labels if preferred.
How accurate is Claude for data extraction?
For well-structured documents (invoices, job postings, press releases), accuracy is typically 95-99% for clearly stated fields. For ambiguous or implied information, accuracy drops to 70-85%. Always validate critical extractions.
Should I use Claude or a dedicated OCR/NER tool?
For structured documents with standard layouts, dedicated tools (AWS Textract, Google Document AI) may be more cost-effective at very high volumes. Claude excels when: the document structure varies, you need custom schemas, or the extraction requires reasoning beyond pattern matching.