Claude API Structured Output Guide: JSON Extraction, Data Parsing, and Schema Validation
Why Structured Output Is Claude’s Killer Feature for Developers
Most LLM use cases in production are not chatbots — they are data extraction pipelines. A company needs to parse invoices into accounting records. A recruiter needs to extract skills from resumes. A legal team needs to pull key terms from contracts. In every case, the output must be structured, validated, and machine-readable — not free-form text.
Claude excels at structured output because of three features: strong instruction following (it respects schema definitions), prefill (you can force the response to start with { for guaranteed JSON), and large context (you can process entire documents in one call). Combined with schema validation using Pydantic or Zod, you get a reliable extraction pipeline.
This guide covers the patterns for building production-grade structured output with Claude API.
The Basics: Getting JSON from Claude
Simple JSON Extraction
import anthropic
import json

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a data extraction assistant. Always respond with valid JSON only. No markdown, no explanation, no text outside the JSON object.",
    messages=[
        {
            "role": "user",
            "content": """Extract the following information from this text:

"John Smith, CEO of Acme Corp, announced today that the company
raised $50 million in Series C funding led by Sequoia Capital.
The round values the company at $500 million. Acme Corp, founded
in 2019, provides cloud infrastructure for AI startups."

Extract: person_name, title, company, funding_amount, funding_round,
lead_investor, valuation, founded_year, company_description"""
        },
        {
            "role": "assistant",
            "content": "{"  # Prefill forces JSON output
        }
    ]
)

# Complete the JSON (Claude continues from the prefill)
json_str = "{" + message.content[0].text
data = json.loads(json_str)
print(json.dumps(data, indent=2))
Output:
{
  "person_name": "John Smith",
  "title": "CEO",
  "company": "Acme Corp",
  "funding_amount": "$50 million",
  "funding_round": "Series C",
  "lead_investor": "Sequoia Capital",
  "valuation": "$500 million",
  "founded_year": 2019,
  "company_description": "Provides cloud infrastructure for AI startups"
}
The Prefill Technique Explained
The key line is:
{"role": "assistant", "content": "{"}
This tells Claude that the assistant’s response has already started with {. Claude will continue from there, producing valid JSON. Without this, Claude may add explanatory text before or after the JSON.
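Because the API response contains only the text generated after the prefill, you must prepend the `{` yourself before parsing. A small helper keeps the prefill and the reassembly step in sync (a sketch; `complete_prefilled_json` is an illustrative name, not part of the SDK):

```python
import json

PREFILL = "{"

def complete_prefilled_json(continuation: str, prefill: str = PREFILL) -> dict:
    """Rejoin the assistant prefill with the model's continuation and parse it.

    The Messages API returns only the text generated *after* the prefill,
    so the prefill must be prepended before json.loads.
    """
    return json.loads(prefill + continuation)
```

With the example above, this is simply `data = complete_prefilled_json(message.content[0].text)`.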
Schema-Driven Extraction with Pydantic
Define Your Schema
from pydantic import BaseModel, Field
from typing import Optional, List
from enum import Enum

class FundingRound(str, Enum):
    seed = "seed"
    series_a = "series_a"
    series_b = "series_b"
    series_c = "series_c"
    series_d = "series_d"

class FundingEvent(BaseModel):
    # Note: Optional fields need an explicit default (None), otherwise
    # Pydantic v2 still treats them as required.
    company_name: str = Field(description="Name of the company")
    funding_amount_usd: Optional[int] = Field(default=None, description="Amount in USD, null if not stated")
    funding_round: Optional[FundingRound] = Field(default=None, description="Type of funding round")
    lead_investors: List[str] = Field(description="List of lead investors")
    valuation_usd: Optional[int] = Field(default=None, description="Company valuation in USD")
    date: Optional[str] = Field(default=None, description="Date of announcement, ISO format")
    sector: str = Field(description="Industry sector")
    headquarters: Optional[str] = Field(default=None, description="City and country")
Generate the Schema Description for Claude
def schema_to_prompt(model_class):
    """Render a Pydantic model's JSON schema as field-by-field instructions."""
    schema = model_class.model_json_schema()
    properties = schema.get("properties", {})
    required_fields = set(schema.get("required", []))
    lines = ["Return a JSON object with these fields:"]
    for name, prop in properties.items():
        # Optional fields serialize as anyOf [type, null] in Pydantic v2
        if "anyOf" in prop:
            type_str = " or ".join(sub.get("type", "object") for sub in prop["anyOf"])
        else:
            type_str = prop.get("type", "string")
        desc = prop.get("description", "")
        nullable = "required" if name in required_fields else "null if not found"
        lines.append(f' - "{name}": {type_str} — {desc} ({nullable})')
    return "\n".join(lines)
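The same traversal works on any JSON Schema dict, not just Pydantic output. A minimal stdlib-only illustration (`schema_to_prompt_from_dict` is a hypothetical variant, shown with a hand-written schema):

```python
def schema_to_prompt_from_dict(schema: dict) -> str:
    """Render a raw JSON Schema dict as field-by-field instructions."""
    properties = schema.get("properties", {})
    required_fields = set(schema.get("required", []))
    lines = ["Return a JSON object with these fields:"]
    for name, prop in properties.items():
        type_str = prop.get("type", "string")
        desc = prop.get("description", "")
        nullable = "required" if name in required_fields else "null if not found"
        lines.append(f' - "{name}": {type_str} — {desc} ({nullable})')
    return "\n".join(lines)

schema = {
    "properties": {
        "company_name": {"type": "string", "description": "Name of the company"},
        "valuation_usd": {"type": "integer", "description": "Company valuation in USD"},
    },
    "required": ["company_name"],
}
print(schema_to_prompt_from_dict(schema))
```

The printed prompt marks `company_name` as required and `valuation_usd` as "null if not found", which is exactly the contract the validation step later enforces.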
Extract and Validate
def extract_structured(text: str, schema_class):
    schema_prompt = schema_to_prompt(schema_class)
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=f"""You are a precise data extraction assistant.
Extract information from the provided text and return valid JSON.

{schema_prompt}

Rules:
- Return ONLY the JSON object, no other text
- Use null for fields where information is not available
- Use exact values from the text, do not infer or guess
- Dates in ISO 8601 format (YYYY-MM-DD)
- Monetary amounts as integers in USD (convert if needed)""",
        messages=[
            {"role": "user", "content": text},
            {"role": "assistant", "content": "{"}
        ]
    )
    json_str = "{" + message.content[0].text
    # Validate with Pydantic
    try:
        return schema_class.model_validate_json(json_str)
    except Exception as e:
        # Handle validation failure
        return {"error": str(e), "raw": json_str}

# Usage
text = "Acme Corp raised $50M Series C from Sequoia..."
result = extract_structured(text, FundingEvent)
print(result.model_dump_json(indent=2))
Advanced Extraction Patterns
Multi-Entity Extraction
Extract multiple entities from a single document:
class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    amount: float

# LineItem must be defined before Invoice references it
class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    vendor_address: Optional[str] = None
    issue_date: str
    due_date: str
    line_items: List[LineItem]
    subtotal: float
    tax_amount: float
    total: float
    currency: str
    payment_terms: Optional[str] = None
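A useful post-extraction step is checking the invoice arithmetic, since misread line items or totals are a common failure mode. A sketch operating on plain dicts (field names follow the Invoice schema above; `check_invoice_totals` is an illustrative helper):

```python
def check_invoice_totals(invoice: dict, tolerance: float = 0.01) -> list:
    """Return a list of arithmetic inconsistencies in an extracted invoice.

    An empty list means line items, subtotal, tax, and total all add up
    within the given tolerance.
    """
    errors = []
    line_sum = sum(item["amount"] for item in invoice.get("line_items", []))
    if abs(line_sum - invoice["subtotal"]) > tolerance:
        errors.append(f"line items sum to {line_sum}, subtotal is {invoice['subtotal']}")
    expected_total = invoice["subtotal"] + invoice["tax_amount"]
    if abs(expected_total - invoice["total"]) > tolerance:
        errors.append(f"subtotal + tax = {expected_total}, total is {invoice['total']}")
    return errors
```

Any non-empty result is a strong signal to re-run the extraction or route the document to human review.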
Conditional Fields
Handle fields that depend on context:
class JobPosting(BaseModel):
    title: str
    company: str
    location: str
    remote_policy: str = Field(description="'remote', 'hybrid', 'onsite', or 'not_specified'")
    salary_min: Optional[int] = Field(default=None, description="Minimum salary in USD/year, null if not listed")
    salary_max: Optional[int] = Field(default=None, description="Maximum salary in USD/year, null if not listed")
    experience_years: Optional[int] = Field(default=None, description="Minimum years of experience required")
    skills_required: List[str]
    skills_preferred: List[str]
    benefits: List[str]
    posted_date: Optional[str] = None
Extraction with Confidence Scores
class ExtractedField(BaseModel):
    value: str
    confidence: float = Field(description="0.0 to 1.0, how confident you are in this extraction")
    source_text: str = Field(description="The exact text span this was extracted from")

class ContractExtraction(BaseModel):
    parties: List[ExtractedField]
    effective_date: ExtractedField
    termination_date: Optional[ExtractedField] = None
    total_value: Optional[ExtractedField] = None
    payment_terms: Optional[ExtractedField] = None
    governing_law: Optional[ExtractedField] = None
    key_obligations: List[ExtractedField]
Adding confidence scores helps downstream systems decide when to flag extractions for human review.
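The routing decision itself can be a few lines. A sketch over plain dicts in the ExtractedField shape above (`fields_needing_review` is an illustrative helper; the 0.7 threshold is an assumption to tune per use case):

```python
def fields_needing_review(extraction: dict, threshold: float = 0.7) -> list:
    """Return names of extracted fields whose confidence falls below threshold.

    Expects values in the ExtractedField shape {"value": ..., "confidence": ...,
    "source_text": ...}; None fields are skipped, lists are checked element-wise.
    """
    flagged = []
    for name, field in extraction.items():
        if field is None:
            continue
        items = field if isinstance(field, list) else [field]
        if any(item["confidence"] < threshold for item in items):
            flagged.append(name)
    return flagged
```

Flagged fields go to a human-review queue; everything else flows straight into the downstream system.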
Handling Edge Cases
Malformed JSON Recovery
import json
import re

def parse_json_safely(text: str):
    """Attempt to parse JSON with fallback recovery."""
    # Try direct parse
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Try to find a JSON object embedded in surrounding text
    match = re.search(r'\{.*\}', text, re.DOTALL)
    candidate = match.group() if match else text
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    # Try to fix common issues: trailing commas before closing brace/bracket
    fixed = re.sub(r',\s*}', '}', candidate)
    fixed = re.sub(r',\s*]', ']', fixed)
    try:
        return json.loads(fixed)
    except json.JSONDecodeError:
        return None
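Another common failure is the model wrapping otherwise-valid JSON in markdown code fences despite instructions not to. Stripping them first makes the recovery path above succeed more often (a sketch; `strip_markdown_fences` is an illustrative name):

```python
import re

def strip_markdown_fences(text: str) -> str:
    """Remove ```json ... ``` fences that models sometimes emit
    despite instructions to return bare JSON."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    return match.group(1) if match else text
```

Call it before `parse_json_safely`, e.g. `parse_json_safely(strip_markdown_fences(raw))`.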
Missing Fields Handling
def extract_with_defaults(text: str, schema_class):
    result = extract_structured(text, schema_class)
    if isinstance(result, dict) and "error" in result:
        # Retry with more explicit instructions appended to the input
        result = extract_structured(
            text + "\n\nIMPORTANT: You MUST include all required fields. "
            "Use null for any field where the information is truly not available.",
            schema_class
        )
    return result
Ambiguous Input Handling
system_prompt = """...

When the text is ambiguous:
- If a field could have multiple interpretations, choose the most likely one and set confidence to 0.5-0.7
- If a field is genuinely not present in the text, set it to null
- If a field is implied but not explicitly stated, include it with confidence 0.3-0.5 and note the inference in source_text
- NEVER fabricate data that is not in or implied by the text"""
Production Pipeline Architecture
Batch Processing
import asyncio
from anthropic import AsyncAnthropic

async_client = AsyncAnthropic()

async def extract_batch(texts: List[str], schema_class, max_concurrent=5):
    # extraction_system_prompt: the schema-driven system prompt built earlier
    semaphore = asyncio.Semaphore(max_concurrent)

    async def process_one(text, index):
        async with semaphore:
            try:
                message = await async_client.messages.create(
                    model="claude-sonnet-4-20250514",
                    max_tokens=2048,
                    system=extraction_system_prompt,
                    messages=[
                        {"role": "user", "content": text},
                        {"role": "assistant", "content": "{"}
                    ]
                )
                json_str = "{" + message.content[0].text
                result = schema_class.model_validate_json(json_str)
                return {"index": index, "result": result, "status": "success"}
            except Exception as e:
                return {"index": index, "error": str(e), "status": "failed"}

    tasks = [process_one(text, i) for i, text in enumerate(texts)]
    results = await asyncio.gather(*tasks)
    return sorted(results, key=lambda x: x["index"])
Retry with Escalation
async def extract_with_retry(text, schema_class, max_retries=3):
    # Escalation order: cheapest model first
    models = ["claude-haiku-4-5-20251001", "claude-sonnet-4-20250514", "claude-opus-4-20250514"]
    for attempt in range(max_retries):
        model = models[min(attempt, len(models) - 1)]
        try:
            # extract_one: a single-document variant of extract_batch's process_one
            result = await extract_one(text, schema_class, model=model)
            if result["status"] == "success":
                return result
        except Exception:
            continue
    return {"status": "failed", "error": "All retries exhausted"}
This pattern starts with the cheapest model (Haiku) and escalates to more capable models only when extraction fails — optimizing cost while maintaining reliability.
Cost Optimization
Model Selection by Task Complexity
| Task | Recommended Model | Cost per 1K extractions |
|---|---|---|
| Simple field extraction (name, date, amount) | Claude Haiku 4.5 | ~$0.50 |
| Multi-field document parsing | Claude Sonnet 4 | ~$5.00 |
| Complex contract analysis | Claude Opus 4 | ~$50.00 |
Reducing Token Usage
- Trim input: remove boilerplate, headers, footers before sending
- Batch related extractions: extract all fields in one call instead of separate calls per field
- Cache results: identical inputs should return cached results, not new API calls
- Use Haiku for validation: after Sonnet extracts, use Haiku to verify specific fields
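The caching bullet can be as simple as keying on a hash of the input text and schema. A sketch with an in-memory dict (swap in Redis or a database for persistence; `cached_extract` is an illustrative helper):

```python
import hashlib

_cache = {}

def cached_extract(text: str, schema_name: str, extract_fn):
    """Return a cached extraction for identical (schema, text) inputs.

    extract_fn (the real API call) runs only on a cache miss, so repeated
    documents cost nothing after the first extraction.
    """
    key = hashlib.sha256(f"{schema_name}:{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = extract_fn(text)
    return _cache[key]
```

Including the schema name in the key matters: the same text extracted against two different schemas must not share a cache entry.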
Frequently Asked Questions
Does Claude support native JSON mode like OpenAI?
Claude does not have a dedicated JSON mode equivalent to OpenAI's response_format parameter. Instead, use the prefill technique (start the assistant response with {) combined with clear system prompt instructions. In practice this produces reliably parseable JSON.
How do I handle very long documents?
Claude’s 200K token context window handles most documents. For extremely long documents (books, legal filings), split by section and extract from each section independently, then merge results.
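The split step can be a simple paragraph-boundary chunker (a sketch; `split_by_sections` is an illustrative helper, and the character budget is an assumption to tune against your token limits):

```python
def split_by_sections(document: str, max_chars: int = 50_000) -> list:
    """Split a long document on blank-line boundaries into chunks under max_chars.

    Each chunk can be sent to Claude independently and the extracted
    records merged afterwards.
    """
    chunks, current, size = [], [], 0
    for para in document.split("\n\n"):
        if size + len(para) > max_chars and current:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para) + 2  # account for the rejoined separator
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries rather than a fixed character offset keeps each entity's context intact, which matters more for extraction accuracy than perfectly even chunk sizes.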
Can Claude extract data from images?
Yes. Claude’s vision capability can extract structured data from images of forms, receipts, and documents. Include the image in the message and use the same schema-driven approach.
What about extraction in languages other than English?
Claude handles multilingual extraction well. You can provide source text in Korean, Japanese, or other languages and extract into English-labeled JSON fields. Or use native-language field labels if preferred.
How accurate is Claude for data extraction?
For well-structured documents (invoices, job postings, press releases), accuracy is typically 95-99% for clearly stated fields. For ambiguous or implied information, accuracy drops to 70-85%. Always validate critical extractions.
Should I use Claude or a dedicated OCR/NER tool?
For structured documents with standard layouts, dedicated tools (AWS Textract, Google Document AI) may be more cost-effective at very high volumes. Claude excels when: the document structure varies, you need custom schemas, or the extraction requires reasoning beyond pattern matching.