How to Build an Enterprise RAG System with Claude: Knowledge Base That Answers Employee Questions
Why Every Enterprise Needs an Internal Knowledge Base (And Why Search Is Not Enough)
Enterprise knowledge is scattered across hundreds of documents: HR policies in Google Drive, engineering runbooks in Confluence, product specs in Notion, compliance guidelines in SharePoint, and tribal knowledge in Slack threads. When an employee has a question — “What is our parental leave policy?” or “How do I set up a VPN connection?” — they search across 3-4 tools, find 5 potentially relevant documents, read through them, and hope they found the current answer.
Traditional search (keyword matching) fails because:
- The employee’s question (“Can I work from another country?”) does not match the document title (“Remote Work Policy — International Assignments”)
- Multiple documents contain conflicting information (an old policy and the current one)
- The answer is buried in paragraph 7 of a 20-page document
- Search returns documents, not answers
RAG (Retrieval-Augmented Generation) with Claude solves this by:
- Understanding the intent of the question (semantic search, not keyword matching)
- Finding the relevant sections across all documents (not just document titles)
- Synthesizing an answer from the relevant sections
- Citing the source document so the employee can verify
This guide covers building a production RAG system with the Claude API.
Architecture Overview
Employee Question
↓
[Query Processing]
↓
[Vector Search] → finds top 5-10 relevant document chunks
↓
[Context Assembly] → formats chunks into a prompt
↓
[Claude API] → generates an answer grounded in the context
↓
[Post-Processing] → adds citations, checks for hallucination
↓
Answer with Sources
Step 1: Ingest Documents
Document Sources
Common enterprise document sources:
- Google Drive / SharePoint (policies, procedures)
- Confluence / Notion (engineering docs, product specs)
- Internal wikis (tribal knowledge)
- Slack / Teams (frequently asked questions, decisions)
- Ticketing systems (common issues and resolutions)
- Knowledge base articles (existing help content)
- PDF manuals and handbooks
Document Processing Pipeline
import os
from pathlib import Path
def ingest_documents(source_dir):
documents = []
for file_path in Path(source_dir).rglob("*"):
if file_path.suffix in [".md", ".txt"]:
content = file_path.read_text(encoding="utf-8")
elif file_path.suffix == ".pdf":
content = extract_pdf_text(file_path)
elif file_path.suffix in [".docx", ".doc"]:
content = extract_docx_text(file_path)
elif file_path.suffix == ".html":
content = extract_html_text(file_path)
else:
continue
documents.append({
"content": content,
"source": str(file_path),
"title": file_path.stem,
"last_modified": os.path.getmtime(file_path),
"file_type": file_path.suffix
})
return documents
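The extraction helpers called above are not defined in this guide. A minimal sketch follows, assuming pypdf, python-docx, and BeautifulSoup as the parsing libraries (any equivalents would work). Note that python-docx only reads .docx; legacy .doc files need conversion first, and scanned PDFs need OCR rather than text extraction.

from pypdf import PdfReader
from docx import Document as DocxDocument
from bs4 import BeautifulSoup

def extract_pdf_text(file_path):
    # Concatenate the text layer of every page
    reader = PdfReader(str(file_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def extract_docx_text(file_path):
    # Paragraph text only; tables and headers/footers need extra handling
    return "\n".join(p.text for p in DocxDocument(str(file_path)).paragraphs)

def extract_html_text(file_path):
    # Strip markup and keep the visible text
    html = file_path.read_text(encoding="utf-8")
    return BeautifulSoup(html, "html.parser").get_text(separator="\n")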
Metadata Extraction
For every document, extract metadata that improves retrieval:
Metadata fields:
- title: document title
- source: file path or URL
- department: which team owns the document
- document_type: policy, procedure, guide, FAQ, runbook
- last_updated: when the document was last modified
- audience: all employees, engineering, HR, managers
- confidentiality: public, internal, restricted
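How you populate these fields depends on how your sources are organized. The sketch below assumes a hypothetical department/document_type/file directory convention and fixed defaults for audience and confidentiality; in practice you would adapt the parsing to your own layout or pull the fields from the source system's API.

from pathlib import Path

def extract_metadata(document):
    # Assumes a <department>/<document_type>/<file> layout (an assumption,
    # not a requirement) and conservative defaults for the remaining fields
    parts = Path(document["source"]).parts
    return {
        **document,
        "department": parts[-3] if len(parts) >= 3 else "unknown",
        "document_type": parts[-2] if len(parts) >= 2 else "unknown",
        "audience": "all_employees",
        "confidentiality": "internal",
    }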
Step 2: Chunk and Embed
Chunking Strategy
def chunk_document(document, chunk_size=500, overlap=50):
"""Split document into overlapping chunks."""
text = document["content"]
chunks = []
# Split by headers first (semantic chunking)
sections = split_by_headers(text)
for section in sections:
if len(section.split()) <= chunk_size:
# Section fits in one chunk
chunks.append({
"text": section,
"metadata": {
**document,
"chunk_type": "section"
}
})
else:
# Section too long — split with overlap
words = section.split()
for i in range(0, len(words), chunk_size - overlap):
chunk_text = " ".join(words[i:i + chunk_size])
chunks.append({
"text": chunk_text,
"metadata": {
**document,
"chunk_type": "partial_section"
}
})
return chunks
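The split_by_headers helper used above is not shown. A minimal sketch, assuming Markdown-style "#" headings; Confluence or HTML exports would need a different pattern:

import re

def split_by_headers(text):
    # Split before each Markdown heading so every section keeps its heading
    sections = re.split(r"\n(?=#{1,6} )", text)
    return [s.strip() for s in sections if s.strip()]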
Embedding Generation
from voyageai import Client as VoyageClient
voyage = VoyageClient()
def embed_chunks(chunks):
texts = [chunk["text"] for chunk in chunks]
embeddings = voyage.embed(
texts,
model="voyage-3",
input_type="document"
)
for i, chunk in enumerate(chunks):
chunk["embedding"] = embeddings.embeddings[i]
return chunks
Vector Storage
import pinecone
# Initialize Pinecone
pc = pinecone.Pinecone(api_key="your-api-key")
index = pc.Index("enterprise-knowledge")
def store_chunks(chunks):
    vectors = []
    position_in_doc = {}
    for chunk in chunks:
        source = chunk["metadata"]["source"]
        # Deterministic ID (source path + position within the document) so
        # re-ingesting a document overwrites its old vectors instead of
        # creating duplicates
        position = position_in_doc.get(source, 0)
        position_in_doc[source] = position + 1
        vectors.append({
            "id": f"{source}#chunk{position}",
            "values": chunk["embedding"],
            "metadata": {
                "text": chunk["text"],
                "source": source,
                "title": chunk["metadata"]["title"],
                "department": chunk["metadata"].get("department", "unknown"),
                "last_updated": chunk["metadata"].get("last_modified", 0)
            }
        })
    # Upsert in batches of 100 to stay within request size limits
    for batch_start in range(0, len(vectors), 100):
        batch = vectors[batch_start:batch_start + 100]
        index.upsert(vectors=batch)
Step 3: Build the Retrieval Pipeline
Query Processing
def retrieve_context(query, top_k=5):
# Embed the query
query_embedding = voyage.embed(
[query],
model="voyage-3",
input_type="query"
).embeddings[0]
# Search vector store
results = index.query(
vector=query_embedding,
top_k=top_k,
include_metadata=True
)
# Format results
context_chunks = []
for match in results.matches:
context_chunks.append({
"text": match.metadata["text"],
"source": match.metadata["source"],
"title": match.metadata["title"],
"score": match.score
})
return context_chunks
Relevance Filtering
def filter_relevant_chunks(chunks, threshold=0.7):
"""Remove chunks below relevance threshold."""
return [c for c in chunks if c["score"] >= threshold]
Step 4: Connect Claude for Generation
The RAG Prompt
import anthropic
client = anthropic.Anthropic()
def generate_answer(query, context_chunks):
# Build context string
context = ""
for i, chunk in enumerate(context_chunks):
context += f"[Source {i+1}: {chunk['title']}]\n"
context += f"{chunk['text']}\n\n"
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
system="""You are an internal knowledge assistant for our company.
Answer employee questions based ONLY on the provided context.
Rules:
- Only use information from the provided sources
- Cite your sources using [Source N] notation
- If the context does not contain the answer, say:
"I could not find information about this in our documentation.
Please contact [relevant department] for help."
- Never make up information or answer from general knowledge
- Be concise and direct
- If sources conflict, note the conflict and cite the
most recently updated source""",
messages=[{
"role": "user",
"content": f"""Context from company documents:
{context}
Employee question: {query}
Answer based on the context above, citing sources:"""
}]
)
return response.content[0].text
Step 5: Add Guardrails
Hallucination Detection
def check_grounding(answer, context_chunks):
"""Verify the answer is grounded in the context."""
verification = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=256,
messages=[{
"role": "user",
"content": f"""Check if this answer is fully grounded
in the provided sources. Flag any claims that are NOT
supported by the sources.
Sources:
{format_chunks(context_chunks)}
Answer to check:
{answer}
Respond with JSON:
{{"grounded": true/false, "ungrounded_claims": [list of claims
not supported by sources]}}"""
}]
)
return parse_json(verification.content[0].text)
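The format_chunks and parse_json helpers are assumed above. One possible implementation, where parse_json tolerates a Markdown code fence around the model's JSON and fails closed if the output cannot be parsed:

import json

def format_chunks(chunks):
    # Same [Source N] layout the generation prompt uses
    return "\n\n".join(
        f"[Source {i+1}: {c['title']}]\n{c['text']}"
        for i, c in enumerate(chunks)
    )

def parse_json(text):
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop an optional ```json ... ``` wrapper
        cleaned = cleaned.strip("`")
        cleaned = cleaned.removeprefix("json").strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Treat unparseable verifier output as "not verified"
        return {"grounded": False, "ungrounded_claims": []}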
Citation Formatting
def format_with_citations(answer, context_chunks):
"""Convert [Source N] references to clickable links."""
for i, chunk in enumerate(context_chunks):
source_ref = f"[Source {i+1}]"
source_link = f"[{chunk['title']}]({chunk['source']})"
answer = answer.replace(source_ref, source_link)
return answer
Step 6: Deploy and Monitor
API Endpoint
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    query: str
    user_id: str

@app.post("/ask")
async def ask_question(request: AskRequest):
    # Retrieve relevant context
    chunks = retrieve_context(request.query, top_k=5)
    chunks = filter_relevant_chunks(chunks)
    if not chunks:
        return {"answer": "No relevant documents found.",
                "sources": []}
    # Generate answer
    answer = generate_answer(request.query, chunks)
    # Check grounding
    grounding = check_grounding(answer, chunks)
    if not grounding["grounded"]:
        answer += "\n\nNote: Some parts of this answer may not be "
        answer += "fully supported by our documentation."
    # Format citations
    answer = format_with_citations(answer, chunks)
    # Log for monitoring
    log_query(request.query, answer, chunks, request.user_id)
    return {
        "answer": answer,
        "sources": [{"title": c["title"], "url": c["source"]}
                    for c in chunks]
    }
Quality Monitoring
Track weekly:
- Total queries
- Queries with no relevant context found (knowledge gaps)
- User satisfaction (thumbs up/down)
- Most common question topics
- Average retrieval relevance score
- Hallucination detection triggers

Monthly:
- Review lowest-rated answers
- Identify missing documents (topics with no source material)
- Update stale documents
- Add new documents from recent policy changes
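Most of these metrics can be computed from the query log written by the log_query call in the /ask endpoint. A minimal sketch that appends one JSON line per query; in production you would more likely write to a warehouse or an observability tool:

import json
import time

def log_query(query, answer, chunks, user_id, log_path="rag_queries.jsonl"):
    record = {
        "timestamp": time.time(),
        "user_id": user_id,
        "query": query,
        "answer": answer,
        "num_chunks": len(chunks),
        "top_score": max((c["score"] for c in chunks), default=0),
        "sources": [c["source"] for c in chunks],
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")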
Common Enterprise RAG Challenges
Stale Documents
Problem: Old policies remain in the system alongside current ones.
Solution: Add "last_updated" to metadata. When sources conflict, prefer the most recently updated document. Add a document review pipeline that flags documents older than 12 months for human review.
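A sketch of such a review check over the ingested document list, treating anything untouched for roughly 12 months as stale (the threshold is an assumption to tune per document type):

import time

STALE_AFTER_SECONDS = 365 * 24 * 3600  # ~12 months

def flag_stale_documents(documents):
    # Returns the documents that should be routed to a human reviewer
    now = time.time()
    return [
        doc for doc in documents
        if now - doc["last_modified"] > STALE_AFTER_SECONDS
    ]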
Access Control
Problem: Not all employees should see all documents.
Solution: Tag documents with access levels. Filter retrieved chunks against the querying user's permissions before passing them to Claude. Never include restricted documents in an unrestricted user's context.
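One way to enforce this is a permission filter between retrieval and generation. The sketch below assumes the confidentiality tag is stored in each chunk's vector metadata (and returned by retrieve_context), and that get_user_access_levels is a hypothetical lookup against your identity provider; many vector stores, including Pinecone, can also apply the same restriction at query time via a metadata filter.

def filter_by_permission(chunks, user_id):
    # get_user_access_levels is a placeholder for your own IAM lookup,
    # e.g. returning {"public", "internal"} for a regular employee
    allowed = get_user_access_levels(user_id)
    return [
        c for c in chunks
        # Default to the most restrictive level when the tag is missing
        if c.get("confidentiality", "restricted") in allowed
    ]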
Multi-Language Support
Problem: Documents exist in multiple languages.
Solution: Use a multilingual embedding model (Voyage multilingual or Cohere multilingual). Claude handles multilingual context well: a Korean document can answer an English query as long as the embedding model matches them semantically.
Frequently Asked Questions
How many documents can a RAG system handle?
Vector databases scale to millions of chunks. A typical enterprise with 10,000 documents produces 50,000-200,000 chunks — well within any vector database’s capacity.
What embedding model should I use?
Voyage AI voyage-3 or Cohere embed-v3 are strong choices. OpenAI text-embedding-3-large is also effective. The choice matters less than consistent usage — do not mix embedding models.
How accurate are RAG answers?
With good retrieval (relevant chunks found) and proper grounding (Claude stays within the context), accuracy is 90-95% for factual questions. For judgment questions (“Should I do X?”), accuracy depends on whether the documents contain guidance on that specific scenario.
What is the cost per query?
Embedding: ~$0.0001 per query. Vector search: ~$0.0001 per query. Claude Sonnet: ~$0.01-0.03 per query (depends on context size). Total: ~$0.01-0.03 per query. At 1,000 queries/day: $10-30/day.
How do I handle documents that change frequently?
Implement an incremental update pipeline: when a document is modified, re-chunk and re-embed only the changed document. Most vector databases support upsert operations for updating specific chunks without rebuilding the entire index.
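A sketch of that incremental path for a single changed text file, reusing the pipeline functions defined earlier. Because store_chunks keys each vector ID to the source path and chunk position, re-upserting overwrites the document's old vectors; if the new version produces fewer chunks than before, delete the leftover IDs as well. Other formats would route through the same extraction helpers as ingest_documents.

import os
from pathlib import Path

def update_document(file_path):
    path = Path(file_path)
    document = {
        "content": path.read_text(encoding="utf-8"),
        "source": str(path),
        "title": path.stem,
        "last_modified": os.path.getmtime(path),
        "file_type": path.suffix,
    }
    # Re-chunk, re-embed, and upsert only this document
    chunks = embed_chunks(chunk_document(document))
    store_chunks(chunks)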
Should I use Claude Haiku or Sonnet for the generation step?
Use Sonnet for complex questions that require reasoning across multiple sources, and Haiku for simple factual lookups where the answer sits in a single chunk. Routing each query to the cheapest model that can handle it keeps cost down without sacrificing quality on the hard cases.
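One lightweight way to route, not a definitive rule: send the query to Haiku when retrieval is confident and concentrated in a single document, otherwise fall back to Sonnet. The 0.85 cutoff is an assumption to tune against your own traffic, and generate_answer would need to accept the chosen model as a parameter.

def pick_model(context_chunks):
    sources = {c["source"] for c in context_chunks}
    top_score = max((c["score"] for c in context_chunks), default=0)
    if len(sources) == 1 and top_score >= 0.85:
        # Single-source, high-confidence lookups go to the cheaper model
        return "claude-haiku-4-5-20251001"
    return "claude-sonnet-4-20250514"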