How to Build an Enterprise RAG System with Claude: Knowledge Base That Answers Employee Questions
Why Every Enterprise Needs an Internal Knowledge Base (And Why Search Is Not Enough)
Enterprise knowledge is scattered across hundreds of documents: HR policies in Google Drive, engineering runbooks in Confluence, product specs in Notion, compliance guidelines in SharePoint, and tribal knowledge in Slack threads. When an employee has a question — “What is our parental leave policy?” or “How do I set up a VPN connection?” — they search across 3-4 tools, find 5 potentially relevant documents, read through them, and hope they found the current answer.
Traditional search (keyword matching) fails because:
- The employee’s question (“Can I work from another country?”) does not match the document title (“Remote Work Policy — International Assignments”)
- Multiple documents contain conflicting information (an old policy and the current one)
- The answer is buried in paragraph 7 of a 20-page document
- Search returns documents, not answers
RAG (Retrieval-Augmented Generation) with Claude solves this by:
- Understanding the intent of the question (semantic search, not keyword matching)
- Finding the relevant sections across all documents (not just document titles)
- Synthesizing an answer from the relevant sections
- Citing the source document so the employee can verify
This guide covers building a production RAG system with the Claude API.
Architecture Overview
Employee Question
↓
[Query Processing]
↓
[Vector Search] → finds top 5-10 relevant document chunks
↓
[Context Assembly] → formats chunks into a prompt
↓
[Claude API] → generates an answer grounded in the context
↓
[Post-Processing] → adds citations, checks for hallucination
↓
Answer with Sources
Step 1: Ingest Documents
Document Sources
Common enterprise document sources:
- Google Drive / SharePoint (policies, procedures)
- Confluence / Notion (engineering docs, product specs)
- Internal wikis (tribal knowledge)
- Slack / Teams (frequently asked questions, decisions)
- Ticketing systems (common issues and resolutions)
- Knowledge base articles (existing help content)
- PDF manuals and handbooks
Document Processing Pipeline
import os
from pathlib import Path
def ingest_documents(source_dir):
documents = []
for file_path in Path(source_dir).rglob("*"):
if file_path.suffix in [".md", ".txt"]:
content = file_path.read_text(encoding="utf-8")
elif file_path.suffix == ".pdf":
content = extract_pdf_text(file_path)
elif file_path.suffix in [".docx", ".doc"]:
content = extract_docx_text(file_path)
elif file_path.suffix == ".html":
content = extract_html_text(file_path)
else:
continue
documents.append({
"content": content,
"source": str(file_path),
"title": file_path.stem,
"last_modified": os.path.getmtime(file_path),
"file_type": file_path.suffix
})
return documents
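The extraction helpers called above are not defined in this guide. A minimal sketch follows, assuming pypdf, python-docx, and BeautifulSoup as the parsing libraries (any equivalents would work). Note that python-docx only reads .docx; legacy .doc files need conversion first, and scanned PDFs need OCR rather than text extraction.

from pypdf import PdfReader
from docx import Document as DocxDocument
from bs4 import BeautifulSoup

def extract_pdf_text(file_path):
    # Concatenate the text layer of every page
    reader = PdfReader(str(file_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def extract_docx_text(file_path):
    # Paragraph text only; tables and headers/footers need extra handling
    return "\n".join(p.text for p in DocxDocument(str(file_path)).paragraphs)

def extract_html_text(file_path):
    # Strip markup and keep the visible text
    html = file_path.read_text(encoding="utf-8")
    return BeautifulSoup(html, "html.parser").get_text(separator="\n")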
Metadata Extraction
For every document, extract metadata that improves retrieval:
Metadata fields:
- title: document title
- source: file path or URL
- department: which team owns the document
- document_type: policy, procedure, guide, FAQ, runbook
- last_updated: when the document was last modified
- audience: all employees, engineering, HR, managers
- confidentiality: public, internal, restricted
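How you populate these fields depends on how your sources are organized. The sketch below assumes a hypothetical department/document_type/file directory convention and fixed defaults for audience and confidentiality; in practice you would adapt the parsing to your own layout or pull the fields from the source system's API.

from pathlib import Path

def extract_metadata(document):
    # Assumes a <department>/<document_type>/<file> layout (an assumption,
    # not a requirement) and conservative defaults for the remaining fields
    parts = Path(document["source"]).parts
    return {
        **document,
        "department": parts[-3] if len(parts) >= 3 else "unknown",
        "document_type": parts[-2] if len(parts) >= 2 else "unknown",
        "audience": "all_employees",
        "confidentiality": "internal",
    }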
Step 2: Chunk and Embed
Chunking Strategy
def chunk_document(document, chunk_size=500, overlap=50):
"""Split document into overlapping chunks."""
text = document["content"]
chunks = []
# Split by headers first (semantic chunking)
sections = split_by_headers(text)
for section in sections:
if len(section.split()) <= chunk_size:
# Section fits in one chunk
chunks.append({
"text": section,
"metadata": {
**document,
"chunk_type": "section"
}
})
else:
# Section too long — split with overlap
words = section.split()
for i in range(0, len(words), chunk_size - overlap):
chunk_text = " ".join(words[i:i + chunk_size])
chunks.append({
"text": chunk_text,
"metadata": {
**document,
"chunk_type": "partial_section"
}
})
return chunks
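The split_by_headers helper used above is not shown. A minimal sketch, assuming Markdown-style "#" headings; Confluence or HTML exports would need a different pattern:

import re

def split_by_headers(text):
    # Split before each Markdown heading so every section keeps its heading
    sections = re.split(r"\n(?=#{1,6} )", text)
    return [s.strip() for s in sections if s.strip()]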
Embedding Generation
from voyageai import Client as VoyageClient
voyage = VoyageClient()
def embed_chunks(chunks):
texts = [chunk["text"] for chunk in chunks]
embeddings = voyage.embed(
texts,
model="voyage-3",
input_type="document"
)
for i, chunk in enumerate(chunks):
chunk["embedding"] = embeddings.embeddings[i]
return chunks
Vector Storage
import pinecone
# Initialize Pinecone
pc = pinecone.Pinecone(api_key="your-api-key")
index = pc.Index("enterprise-knowledge")
def store_chunks(chunks):
    vectors = []
    position_in_doc = {}
    for chunk in chunks:
        source = chunk["metadata"]["source"]
        # Deterministic ID (source path + position within the document) so
        # re-ingesting a document overwrites its old vectors instead of
        # creating duplicates
        position = position_in_doc.get(source, 0)
        position_in_doc[source] = position + 1
        vectors.append({
            "id": f"{source}#chunk{position}",
            "values": chunk["embedding"],
            "metadata": {
                "text": chunk["text"],
                "source": source,
                "title": chunk["metadata"]["title"],
                "department": chunk["metadata"].get("department", "unknown"),
                "last_updated": chunk["metadata"].get("last_modified", 0)
            }
        })
    # Upsert in batches of 100 to stay within request size limits
    for batch_start in range(0, len(vectors), 100):
        batch = vectors[batch_start:batch_start + 100]
        index.upsert(vectors=batch)
Step 3: Build the Retrieval Pipeline
Query Processing
def retrieve_context(query, top_k=5):
# Embed the query
query_embedding = voyage.embed(
[query],
model="voyage-3",
input_type="query"
).embeddings[0]
# Search vector store
results = index.query(
vector=query_embedding,
top_k=top_k,
include_metadata=True
)
# Format results
context_chunks = []
for match in results.matches:
context_chunks.append({
"text": match.metadata["text"],
"source": match.metadata["source"],
"title": match.metadata["title"],
"score": match.score
})
return context_chunks
Relevance Filtering
def filter_relevant_chunks(chunks, threshold=0.7):
"""Remove chunks below relevance threshold."""
return [c for c in chunks if c["score"] >= threshold]
Step 4: Connect Claude for Generation
The RAG Prompt
import anthropic
client = anthropic.Anthropic()
def generate_answer(query, context_chunks):
# Build context string
context = ""
for i, chunk in enumerate(context_chunks):
context += f"[Source {i+1}: {chunk['title']}]\n"
context += f"{chunk['text']}\n\n"
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system="""You are an internal knowledge assistant for our company.
Answer employee questions based ONLY on the provided context.
Rules:
- Only use information from the provided sources
- Cite your sources using [Source N] notation
- If the context does not contain the answer, say:
"I could not find information about this in our documentation.
Please contact [relevant department] for help."
- Never make up information or answer from general knowledge
- Be concise and direct
- If sources conflict, note the conflict and cite the
most recently updated source""",
messages=[{
"role": "user",
"content": f"""Context from company documents:
{context}
Employee question: {query}
Answer based on the context above, citing sources:"""
}]
)
return response.content[0].text
Step 5: Add Guardrails
Hallucination Detection
def check_grounding(answer, context_chunks):
"""Verify the answer is grounded in the context."""
verification = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=256,
messages=[{
"role": "user",
"content": f"""Check if this answer is fully grounded
in the provided sources. Flag any claims that are NOT
supported by the sources.
Sources:
{format_chunks(context_chunks)}
Answer to check:
{answer}
Respond with JSON:
{{"grounded": true/false, "ungrounded_claims": [list of claims
not supported by sources]}}"""
}]
)
return parse_json(verification.content[0].text)
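The format_chunks and parse_json helpers are assumed above. One possible implementation, where parse_json tolerates a Markdown code fence around the model's JSON and fails closed if the output cannot be parsed:

import json

def format_chunks(chunks):
    # Same [Source N] layout the generation prompt uses
    return "\n\n".join(
        f"[Source {i+1}: {c['title']}]\n{c['text']}"
        for i, c in enumerate(chunks)
    )

def parse_json(text):
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop an optional ```json ... ``` wrapper
        cleaned = cleaned.strip("`")
        cleaned = cleaned.removeprefix("json").strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Treat unparseable verifier output as "not verified"
        return {"grounded": False, "ungrounded_claims": []}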
Citation Formatting
def format_with_citations(answer, context_chunks):
"""Convert [Source N] references to clickable links."""
for i, chunk in enumerate(context_chunks):
source_ref = f"[Source {i+1}]"
source_link = f"[{chunk['title']}]({chunk['source']})"
answer = answer.replace(source_ref, source_link)
return answer
Step 6: Deploy and Monitor
API Endpoint
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    query: str
    user_id: str

@app.post("/ask")
async def ask_question(request: AskRequest):
    # Retrieve relevant context
    chunks = retrieve_context(request.query, top_k=5)
    chunks = filter_relevant_chunks(chunks)
    if not chunks:
        return {"answer": "No relevant documents found.",
                "sources": []}
    # Generate answer
    answer = generate_answer(request.query, chunks)
    # Check grounding
    grounding = check_grounding(answer, chunks)
    if not grounding["grounded"]:
        answer += "\n\nNote: Some parts of this answer may not be "
        answer += "fully supported by our documentation."
    # Format citations
    answer = format_with_citations(answer, chunks)
    # Log for monitoring
    log_query(request.query, answer, chunks, request.user_id)
    return {
        "answer": answer,
        "sources": [{"title": c["title"], "url": c["source"]}
                    for c in chunks]
    }
Quality Monitoring
Track weekly:
- Total queries
- Queries with no relevant context found (knowledge gaps)
- User satisfaction (thumbs up/down)
- Most common question topics
- Average retrieval relevance score
- Hallucination detection triggers

Monthly:
- Review lowest-rated answers
- Identify missing documents (topics with no source material)
- Update stale documents
- Add new documents from recent policy changes
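Most of these metrics can be computed from the query log written by the log_query call in the /ask endpoint. A minimal sketch that appends one JSON line per query; in production you would more likely write to a warehouse or an observability tool:

import json
import time

def log_query(query, answer, chunks, user_id, log_path="rag_queries.jsonl"):
    record = {
        "timestamp": time.time(),
        "user_id": user_id,
        "query": query,
        "answer": answer,
        "num_chunks": len(chunks),
        "top_score": max((c["score"] for c in chunks), default=0),
        "sources": [c["source"] for c in chunks],
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")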
Common Enterprise RAG Challenges
Stale Documents
Problem: Old policies remain in the system alongside current ones.
Solution: Add "last_updated" to metadata. When sources conflict, prefer the most recently updated document. Add a document review pipeline that flags documents older than 12 months for human review.
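A sketch of such a review check over the ingested document list, treating anything untouched for roughly 12 months as stale (the threshold is an assumption to tune per document type):

import time

STALE_AFTER_SECONDS = 365 * 24 * 3600  # ~12 months

def flag_stale_documents(documents):
    # Returns the documents that should be routed to a human reviewer
    now = time.time()
    return [
        doc for doc in documents
        if now - doc["last_modified"] > STALE_AFTER_SECONDS
    ]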
Access Control
Problem: Not all employees should see all documents.
Solution: Tag documents with access levels. Filter retrieved chunks against the querying user's permissions before passing them to Claude. Never include restricted documents in an unrestricted user's context.
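One way to enforce this is a permission filter between retrieval and generation. The sketch below assumes the confidentiality tag is stored in each chunk's vector metadata (and returned by retrieve_context), and that get_user_access_levels is a hypothetical lookup against your identity provider; many vector stores, including Pinecone, can also apply the same restriction at query time via a metadata filter.

def filter_by_permission(chunks, user_id):
    # get_user_access_levels is a placeholder for your own IAM lookup,
    # e.g. returning {"public", "internal"} for a regular employee
    allowed = get_user_access_levels(user_id)
    return [
        c for c in chunks
        # Default to the most restrictive level when the tag is missing
        if c.get("confidentiality", "restricted") in allowed
    ]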
Multi-Language Support
Problem: Documents exist in multiple languages.
Solution: Use a multilingual embedding model (Voyage multilingual or Cohere multilingual). Claude handles multilingual context well: a Korean document can answer an English query as long as the embedding model matches them semantically.
Frequently Asked Questions
How many documents can a RAG system handle?
Vector databases scale to millions of chunks. A typical enterprise with 10,000 documents produces 50,000-200,000 chunks — well within any vector database’s capacity.
What embedding model should I use?
Voyage AI voyage-3 or Cohere embed-v3 are strong choices. OpenAI text-embedding-3-large is also effective. The choice matters less than consistent usage — do not mix embedding models.
How accurate are RAG answers?
With good retrieval (relevant chunks found) and proper grounding (Claude stays within the context), accuracy is 90-95% for factual questions. For judgment questions (“Should I do X?”), accuracy depends on whether the documents contain guidance on that specific scenario.
What is the cost per query?
Embedding: ~$0.0001 per query. Vector search: ~$0.0001 per query. Claude Sonnet: ~$0.01-0.03 per query (depends on context size). Total: ~$0.01-0.03 per query. At 1,000 queries/day: $10-30/day.
How do I handle documents that change frequently?
Implement an incremental update pipeline: when a document is modified, re-chunk and re-embed only the changed document. Most vector databases support upsert operations for updating specific chunks without rebuilding the entire index.
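A sketch of that incremental path for a single changed text file, reusing the pipeline functions defined earlier. Because store_chunks keys each vector ID to the source path and chunk position, re-upserting overwrites the document's old vectors; if the new version produces fewer chunks than before, delete the leftover IDs as well. Other formats would route through the same extraction helpers as ingest_documents.

import os
from pathlib import Path

def update_document(file_path):
    path = Path(file_path)
    document = {
        "content": path.read_text(encoding="utf-8"),
        "source": str(path),
        "title": path.stem,
        "last_modified": os.path.getmtime(path),
        "file_type": path.suffix,
    }
    # Re-chunk, re-embed, and upsert only this document
    chunks = embed_chunks(chunk_document(document))
    store_chunks(chunks)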
Should I use Claude Haiku or Sonnet for the generation step?
Use Sonnet for complex questions that require reasoning across multiple sources, and Haiku for simple factual lookups where the answer sits in a single chunk. Routing each query to the cheapest model that can handle it keeps cost down without sacrificing quality on the hard cases.
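One lightweight way to route, not a definitive rule: send the query to Haiku when retrieval is confident and concentrated in a single document, otherwise fall back to Sonnet. The 0.85 cutoff is an assumption to tune against your own traffic, and generate_answer would need to accept the chosen model as a parameter.

def pick_model(context_chunks):
    sources = {c["source"] for c in context_chunks}
    top_score = max((c["score"] for c in context_chunks), default=0)
    if len(sources) == 1 and top_score >= 0.85:
        # Single-source, high-confidence lookups go to the cheaper model
        return "claude-haiku-4-5-20251001"
    return "claude-sonnet-4-20250514"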