
By Dima Kramskoy — Senior Cloud Architect at DoiT International

Why This Post Exists

Most Bedrock Knowledge Base tutorials follow the same script: upload a PDF to S3, create a Knowledge Base, ask it a question. Done. Blog post written.

That's fine for a demo. It's useless for production.

I recently built a policy audit system for a LATAM fintech company, one that evaluates expenses against internal policies in real time using AI. The system processes ~500 queries per day across 100+ policy documents in two languages. It runs in production. People's expense reports get approved or denied based on what it says.

Here's what I actually learned — the architecture decisions, the chunking mistakes, the cost surprises, and the gotchas that no AWS documentation warns you about.

Your First Bedrock Knowledge Base (5 Minutes)

Before I go deep, let's make sure you have the basics. If you've already built a KB, skip to the next section. If not, here's the fastest path to "it works":

Step 1: Create an S3 bucket with your documents

aws s3 mb s3://my-kb-source-docs
aws s3 cp ./policies/ s3://my-kb-source-docs/ --recursive

Step 2: Create the Knowledge Base (Console)

Go to Amazon Bedrock → Knowledge Bases → Create
Name it, select an embedding model (Titan Embeddings v2 is the default — it's fine to start)
Point it at your S3 bucket
For vector store: choose S3 Vectors (serverless, zero config) or let it create one for you
Click Create → wait 2-3 minutes for sync

Step 3: Query it

import boto3

client = boto3.client('bedrock-agent-runtime')

response = client.retrieve_and_generate(
    input={'text': 'What is our policy on travel expenses?'},
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': 'YOUR_KB_ID',
            'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0'
        }
    }
)
print(response['output']['text'])

That's it. You have a working RAG system. It answers questions from your docs.
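Because every decision later has to be traceable to a source paragraph, it's worth looking at what comes back alongside the answer text. Here is a minimal sketch, assuming the `retrieve_and_generate` response carries a `citations` list whose entries hold `retrievedReferences`. The helper name `extract_sources` is hypothetical, and the sample response is hand-written for illustration, not real output:

```python
def extract_sources(response: dict) -> list[dict]:
    """Flatten the citations of a retrieve_and_generate response.

    Returns one entry per retrieved reference with the cited text and,
    for S3-backed sources, the URI of the object it came from.
    """
    sources = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            location = ref.get("location", {})
            sources.append({
                "text": ref.get("content", {}).get("text", ""),
                "uri": location.get("s3Location", {}).get("uri"),
                "metadata": ref.get("metadata", {}),
            })
    return sources


# Hand-written sample in the shape the API returns (not real output).
sample_response = {
    "output": {"text": "Meals over $75 require manager approval."},
    "citations": [{
        "retrievedReferences": [{
            "content": {"text": "Meals over $75 require manager approval."},
            "location": {
                "type": "S3",
                "s3Location": {"uri": "s3://my-kb-source-docs/travel-policy.pdf"},
            },
            "metadata": {"language": "en"},
        }],
    }],
}

for source in extract_sources(sample_response):
    print(source["uri"], "->", source["text"])
```

In the Step 3 snippet you would call `extract_sources(response)` right after printing the answer and store the result next to the decision.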

Now here's the thing: this gets you 60% of the way. The remaining 40% — where production quality lives — is what the rest of this post is about. Chunking strategy, cost control, latency, the transformation pipeline, and the gotchas that will bite you at scale.

The Use Case

The client had a problem every growing company eventually hits: complex internal policies that nobody reads, applied inconsistently, across multiple countries and languages.

Specifically: a company with 100+ internal policy documents (English and Spanish) needed AI to evaluate employee expenses against those policies in real time. Not "search for a policy," but actually decide whether an expense complies, cite the relevant rule, and explain why.

The requirements were clear:

Sub-3-second decision latency
Every decision traceable to a source policy paragraph
Human approval gate for policy changes (legal liability)
Multi-language support (English/Spanish)
Cost-effective at scale (~500 queries/day, growing)

Architecture Overview

Here's the high-level flow:

Transaction Evaluation Path:

API Gateway → SQS → Step Functions → Bedrock Nova Pro (receipt validation)
→ Bedrock AgentCore (policy decision via KB) → Aurora MySQL (persist decision)

Policy Ingestion Path:

S3 Upload → Step Functions (human approval via task token)
→ LLM transformation (restructure policy, extract expenses, structure new view)
→ Transformed chunks stored in S3 (transition state)
→ API exposes chunks for review → Human approval
→ Ingested into Bedrock Knowledge Base

Two distinct paths. One for real-time decisions, one for policy lifecycle management. Step Functions orchestrates both — and that's not an accident. State machines give you exactly the visibility and retry semantics you need when legal compliance is on the line.

Why S3 Vectors (Not OpenSearch)

This is where I'll save you weeks of deliberation.

When I started this project, the default vector store for Bedrock KB was OpenSearch Serverless. It works. It's battle-tested. It's also dramatically over-engineered for a corpus under 10K documents.

S3 Vectors launched as a simpler alternative, and for this use case, it was the obvious choice:

Factor	OpenSearch Serverless	S3 Vectors
Monthly base cost	~$700+ (2 OCUs minimum)	Pay-per-query
Operational overhead	Index management, scaling	Zero
Setup complexity	Moderate	Minimal
Query latency (p50)	~200ms	~350ms
Sweet spot	10K+ docs, complex queries	<10K docs, straightforward retrieval

The decision framework is simple: If your document corpus is under 10K documents and you don't need complex filtering or hybrid search, S3 Vectors saves you money and operational headaches. If you need sub-200ms latency or have tens of thousands of documents with complex metadata queries, go OpenSearch.

For our ~100 policy documents? S3 Vectors was a no-brainer. We're paying cents per day instead of $700/month minimum. The latency tradeoff (an extra ~150ms) is invisible in a workflow that includes LLM inference anyway.
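If you want that rule of thumb in a form you can drop into a design review, here is a small sketch. The `Workload` fields and the `choose_vector_store` function are hypothetical names, and the thresholds (10K documents, 200ms) are simply the ones quoted in this section, not AWS guidance:

```python
from dataclasses import dataclass


@dataclass
class Workload:
    document_count: int
    needs_hybrid_search: bool = False
    needs_complex_filtering: bool = False
    retrieval_latency_budget_ms: int = 500


def choose_vector_store(w: Workload) -> str:
    """Encode this post's rule of thumb; the thresholds are the post's own."""
    if w.document_count >= 10_000:
        return "opensearch-serverless"
    if w.needs_hybrid_search or w.needs_complex_filtering:
        return "opensearch-serverless"
    if w.retrieval_latency_budget_ms < 200:
        return "opensearch-serverless"
    return "s3-vectors"


# Our case: ~100 policy documents, plain retrieval, relaxed latency budget.
print(choose_vector_store(Workload(document_count=100)))
```

It's deliberately crude. The point is to make the decision explicit and revisit it when one of the inputs changes, not to automate it.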

Chunking Strategies That Actually Matter

Before I share what worked for us, here's the landscape — because most guides only show you one or two options:

Strategy	How It Works	Best For	Watch Out
Fixed-size (default)	Splits every N characters/tokens	Quick start, generic docs	Splits mid-sentence, mid-rule — causes hallucinated answers
Sentence-based	Splits on sentence boundaries	Simple documents, FAQs	Doesn't respect logical sections; a policy rule may span 5+ sentences
Semantic / Section-based	Splits on document structure (headings, sections)	Structured docs with clear hierarchy	Requires parsing doc structure; chunks vary in size
Hierarchical (parent-child)	Parent chunks (full section) + child chunks (paragraphs)	Best retrieval quality — match child, return parent for context	More complex, higher storage, slower indexing
LLM-assisted	LLM restructures the document BEFORE chunking	High-stakes docs where wrong retrieval = wrong decision	Adds cost + latency at ingestion; worth it when accuracy matters

Our choice: LLM-assisted. For policy documents where a wrong retrieval means a wrong expense decision, the upfront cost of LLM transformation pays for itself immediately. More on this in the pipeline section below.

But first, here's the failure mode that led us here. It's where I made my most expensive mistake early on.

What didn't work: Default chunking

Bedrock KB's default chunking splits documents by size (roughly 300 tokens per chunk), with no awareness of where one rule ends and the next begins. For generic documents, that's fine. For policy documents, it's catastrophic.

Here's why: a policy rule might say "Meals over $75 require manager approval, except during client-facing travel where the limit is $150." Default chunking can split this mid-sentence. The retrieval then returns "Meals over $75 require manager approval" without the exception. The agent denies a legitimate $100 client dinner. Your users lose trust in the system on day one.
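You can reproduce the failure in a few lines. This sketch uses a deliberately naive character-window chunker as a stand-in for any size-based splitter. It is not Bedrock's actual implementation, and `fixed_size_chunks` plus the window size of 40 are assumptions picked to make the split visible:

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Naive character-window chunker: windows of `size` chars, stepping by size - overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]


rule = ("Meals over $75 require manager approval, except during "
        "client-facing travel where the limit is $150.")

chunks = fixed_size_chunks(rule, size=40, overlap=5)
for chunk in chunks:
    print(repr(chunk))

# The first chunk carries the rule but not its exception.
first = chunks[0]
print("exception present in first chunk:", "except" in first)
```

If retrieval returns only that first chunk, the model never sees the $150 limit. Overlap narrows the gap, but it doesn't close it.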

What worked: Semantic boundary chunking

We pre-process documents before ingestion, chunking by policy section — each rule or sub-rule becomes its own chunk with full context preserved.

import re
from dataclasses import dataclass

MIN_SECTION_CHARS = 50    # Skip trivial fragments (stray headers, page numbers)
MAX_CHUNK_CHARS = 1500    # Longer sections are split by paragraph

@dataclass
class PolicyChunk:
    content: str
    metadata: dict

def chunk_policy_document(text: str, doc_id: str, language: str) -> list[PolicyChunk]:
    """Chunk policy documents by semantic boundaries (section headers)."""

    def make_chunk(content: str, section_index: int, sub_index: int) -> PolicyChunk:
        return PolicyChunk(
            content=content,
            metadata={
                "doc_id": doc_id,
                "section_index": section_index,
                "sub_index": sub_index,
                "language": language,
                "chunk_type": "policy_rule",
            },
        )

    # Split before numbered rules, Markdown headers, "Article N", or "Artículo N"
    section_pattern = r'\n(?=\d+\.\s|#{1,3}\s|Article\s+\d+|Artículo\s+\d+)'
    sections = re.split(section_pattern, text)

    chunks = []
    for i, section in enumerate(sections):
        section = section.strip()
        if len(section) < MIN_SECTION_CHARS:
            continue

        # The first line is the section header; the rest is split on blank lines.
        header, _, body = section.partition('\n')
        paragraphs = [p.strip() for p in body.split('\n\n') if p.strip()]
        if len(section) > MAX_CHUNK_CHARS and paragraphs:
            # Oversized section: one chunk per paragraph, each prefixed with
            # the header so every rule keeps its context.
            for j, para in enumerate(paragraphs, 1):
                chunks.append(make_chunk(f"{header}\n\n{para}", i, j))
        else:
            # Small enough (or no paragraph breaks to split on): keep it whole
            chunks.append(make_chunk(section, i, 0))

    return chunks

Practical guidance

Chunk size sweet spot: 200–1500 characters for policy documents. Smaller chunks improve precision; larger chunks preserve context. Find your balance.
Overlap: If you must use character-based chunking, use 20% overlap minimum. But seriously, chunk by semantic boundaries.
Metadata is retrieval: Tag every chunk with language, policy_type, effective_date, and department. You'll filter on these later — it's not optional.
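For an S3 data source, one way to attach those tags is a sidecar file next to each document, named `<document>.metadata.json` and containing a `metadataAttributes` object. Here is a minimal sketch of generating one. The helper name `build_metadata_sidecar`, the object key, and the attribute values are all made up for illustration:

```python
import json


def build_metadata_sidecar(source_key: str, attributes: dict) -> tuple[str, str]:
    """Return (sidecar_key, sidecar_body) for one source document in S3."""
    sidecar_key = f"{source_key}.metadata.json"
    sidecar_body = json.dumps(
        {"metadataAttributes": attributes}, ensure_ascii=False, indent=2
    )
    return sidecar_key, sidecar_body


# Hypothetical document key and tag values.
key, body = build_metadata_sidecar(
    "policies/travel-expenses-es.md",
    {
        "language": "es",
        "policy_type": "travel",
        "effective_date": "2025-01-01",
        "department": "finance",
    },
)
print(key)
print(body)
```

Upload the sidecar to the same prefix as the document before syncing. Keeping values as plain strings plays well with equality filters.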

The Ingestion Pipeline

Policy documents aren't blog posts. You can't just throw them into a vector store and hope for the best. A wrong policy interpretation has legal consequences.

Here's the pipeline:

The LLM Transformation + Human-in-the-Loop Pattern

The key design insight: transform BEFORE approval. The flow:

S3 Upload (raw policy PDF)
    ↓
LLM Transformation (restructure, extract expense rules, structure into desired view)
    ↓
Transformed chunks stored in S3 (transition state — cached for re-use)
    ↓
API exposes structured chunks for review
    ↓
Reviewer approves / rejects (one click)
    ↓
Approved chunks ingested into Bedrock Knowledge Base

This is a logical approval gate, not a heavyweight orchestration. The LLM does the hard work upfront — parsing multi-page PDFs, extracting individual expense rules, and structuring them consistently. By the time a reviewer sees the output, they're looking at clean, structured chunks — not raw documents.

Why this matters for production:

Re-ingestion is fast — the transformation is cached in S3. Policy updates don't require re-processing from scratch.
Reviewers see quality output — they approve structured rules, not walls of PDF text.
Fewer errors at retrieval time — because the LLM pre-structures the content, the KB receives consistently formatted chunks every time.
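The last step of the flow, ingesting approved chunks, comes down to starting an ingestion job and waiting for it to finish. Here is a sketch of that step, assuming a `bedrock-agent` client and its `start_ingestion_job` / `get_ingestion_job` calls. The function name, polling interval, and timeout are my assumptions, and `FakeAgentClient` is only an offline stand-in so the example runs without AWS credentials:

```python
import time

TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def ingest_and_wait(agent_client, kb_id: str, data_source_id: str,
                    poll_seconds: float = 5.0, timeout_seconds: float = 600.0) -> dict:
    """Start an ingestion job and block until it reaches a terminal status."""
    job = agent_client.start_ingestion_job(
        knowledgeBaseId=kb_id, dataSourceId=data_source_id
    )["ingestionJob"]

    deadline = time.monotonic() + timeout_seconds
    while job["status"] not in TERMINAL_STATUSES:
        if time.monotonic() > deadline:
            raise TimeoutError(f"ingestion job {job['ingestionJobId']} still {job['status']}")
        time.sleep(poll_seconds)
        job = agent_client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=data_source_id,
            ingestionJobId=job["ingestionJobId"],
        )["ingestionJob"]

    if job["status"] != "COMPLETE":
        raise RuntimeError(f"ingestion ended with status {job['status']}")
    return job


class FakeAgentClient:
    """Offline stand-in for boto3.client("bedrock-agent"), for illustration only."""

    def __init__(self):
        self.polls = 0

    def start_ingestion_job(self, **kwargs):
        return {"ingestionJob": {"ingestionJobId": "job-1", "status": "STARTING"}}

    def get_ingestion_job(self, **kwargs):
        self.polls += 1
        status = "COMPLETE" if self.polls >= 2 else "IN_PROGRESS"
        return {"ingestionJob": {"ingestionJobId": kwargs["ingestionJobId"], "status": status}}


fake = FakeAgentClient()
job = ingest_and_wait(fake, "YOUR_KB_ID", "YOUR_DATA_SOURCE_ID", poll_seconds=0)
print(job["status"], "after", fake.polls, "polls")
```

Against the real service you would pass `boto3.client("bedrock-agent")` instead of the stand-in. In a Step Functions workflow, a Wait state plus a status check does the same job without a blocking loop.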

Why human-in-the-loop matters

I've seen teams skip the approval gate because "we trust our policy team." Then someone uploads a draft document, it gets embedded, and the AI starts enforcing draft rules. One incident like that, and you've lost organizational trust in the system.

Anti-pattern: Auto-ingest on S3 upload. Never do this for compliance-sensitive documents.

Pattern: Upload → LLM transforms & structures the policy → transformed chunks cached in S3 → API exposes chunks for review → human approves/rejects → then ingest into KB.
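The approve/reject step maps naturally onto the Step Functions task token callback: the state machine pauses on a task that hands out a token, and the review API resumes it with the verdict. Here is a sketch of the resume side, assuming a Step Functions client and its `send_task_success` / `send_task_failure` calls. `resolve_review`, the `PolicyRejected` error name, and the payload fields are hypothetical, and `RecordingClient` is an offline stand-in so the example runs without AWS:

```python
import json


def resolve_review(sfn_client, task_token: str, approved: bool,
                   reviewer: str, reason: str = "") -> None:
    """Resume the paused execution with the reviewer's verdict."""
    if approved:
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True, "reviewer": reviewer}),
        )
    else:
        sfn_client.send_task_failure(
            taskToken=task_token,
            error="PolicyRejected",
            cause=reason or f"Rejected by {reviewer}",
        )


class RecordingClient:
    """Offline stand-in for boto3.client("stepfunctions"); records calls."""

    def __init__(self):
        self.calls = []

    def send_task_success(self, **kwargs):
        self.calls.append(("success", kwargs))

    def send_task_failure(self, **kwargs):
        self.calls.append(("failure", kwargs))


sfn = RecordingClient()
resolve_review(sfn, "token-abc", approved=True, reviewer="legal@example.com")
resolve_review(sfn, "token-def", approved=False, reviewer="legal@example.com",
               reason="Draft version")
for kind, kwargs in sfn.calls:
    print(kind, kwargs["taskToken"])
```

A rejection surfaces as a task failure, so the state machine can catch `PolicyRejected` and branch to a cleanup path instead of ingesting.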

Retrieval Latency & Cost

Real numbers from production (500 queries/day workload):

Latency (S3 Vectors)

p50 = median response time (typical experience). p99 = 99th percentile (worst case excluding extreme outliers).

p50: 340ms (retrieval only, excluding LLM inference)
p99: 890ms
End-to-end decision (including AgentCore): p50 ~2.1s, p99 ~4.8s

Monthly Cost Breakdown

Component	Monthly Cost
S3 Vectors (storage + queries)	~$12
Bedrock KB API calls	~$8
Titan Embeddings (ingestion)	~$3
Nova Pro (receipt validation)	~$45
AgentCore (policy decisions)	~$120
Step Functions	~$5
Aurora MySQL (persistence)	~$65
Total	~$258/month

Compare that to OpenSearch Serverless alone at $700/month minimum. Architecture choices compound.

Gotchas Nobody Warns You About

After three months in production, here's my list:

Sync delay after upload. After you call StartIngestionJob, the KB isn't immediately queryable with new content. Expect 30–90 seconds for small updates. Plan for this in your UX — show "policy update processing" states.
Metadata filtering is exact-match only (S3 Vectors). You can't do range queries or partial matching on metadata. Design your metadata schema around equality filters. If you need "all policies updated after January 2025," you'll need a different approach.
Embedding model choice is permanent. Once you create a KB with Titan Embeddings v2, you can't switch to Cohere without re-creating the entire Knowledge Base. Choose carefully upfront. (We went with Titan v2 — good balance of cost and quality for bilingual content.)
Multi-language retrieval isn't magic. A query in Spanish will retrieve Spanish chunks well, but cross-language retrieval (Spanish query → English policy) is unreliable. We solved this by maintaining parallel chunks in both languages and filtering by the query's detected language.
Re-indexing cost spikes. If you sync your entire KB frequently (instead of incremental updates), embedding costs spike. A full re-index of 100 documents costs ~$2. Do that hourly by mistake, and you're burning roughly $1,400/month ($2 × 24 runs/day × 30 days) on embeddings alone.
KB quotas are surprisingly low. Default concurrent ingestion jobs: 1. Default document size: 50MB. Request quota increases before you hit production scale.
The "confident wrong answer" problem. When the agent retrieves a chunk that's close but not right, it will confidently apply the wrong rule. Mitigate this by setting a similarity score threshold (we use 0.7) and routing low-confidence retrievals to human review.
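Two of the mitigations above, filtering by language and routing low-score retrievals to a human, fit in a few lines. This is a sketch, assuming the Retrieve API's `retrievalConfiguration` shape (an `equals` filter on a metadata key) and a per-result `score` field. The helper names are hypothetical and the sample results are hand-written, not real output:

```python
SCORE_THRESHOLD = 0.7  # the value quoted above; tune it for your corpus


def build_retrieval_config(language: str, top_k: int = 5) -> dict:
    """Equality filter on the `language` metadata key for the Retrieve API."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            "filter": {"equals": {"key": "language", "value": language}},
        }
    }


def split_by_confidence(results: list[dict], threshold: float = SCORE_THRESHOLD):
    """Separate retrieval results into (confident, needs_human_review)."""
    confident = [r for r in results if r.get("score", 0.0) >= threshold]
    uncertain = [r for r in results if r.get("score", 0.0) < threshold]
    return confident, uncertain


# Hand-written results in the shape of response["retrievalResults"].
sample_results = [
    {"content": {"text": "Comidas de más de $75 requieren aprobación."}, "score": 0.82},
    {"content": {"text": "Política de viajes: hoteles."}, "score": 0.55},
]
confident, uncertain = split_by_confidence(sample_results)
print(len(confident), "confident,", len(uncertain), "routed to human review")
```

In use, the config goes into `client.retrieve(knowledgeBaseId=..., retrievalQuery={"text": query}, retrievalConfiguration=build_retrieval_config("es"))`, and an empty `confident` list is the signal to skip the automated decision entirely.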

How to Get Started

Five steps to your first production Bedrock KB implementation:

Start with 10 documents, not 100. Get your chunking strategy right on a small set. Validate retrieval quality manually before scaling.
Choose S3 Vectors unless you have a reason not to. For most Knowledge Base use cases under 10K documents, it's cheaper and simpler. Graduate to OpenSearch when you actually need it.
Invest in chunking before anything else. Default chunking is a trap for structured documents. Spend a week on your chunking strategy — it's the highest-leverage work you'll do.
Build the approval pipeline from day one. Even if you don't need human-in-the-loop today, you will when stakes get higher. The task token pattern in Step Functions makes this cheap to build in up front, but expensive to retrofit.
Instrument everything. Log retrieval scores, chunk IDs, decision confidence. You can't improve what you can't measure. When retrieval quality degrades (and it will, as your corpus grows), you need data to diagnose why.
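For the instrumentation step, one structured log line per decision goes a long way. Here is a sketch under the assumption that retrieval results carry `score` and an S3 `location`. The field names, the logger name `policy_audit`, and the sample values are all hypothetical:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("policy_audit")


def retrieval_log_record(query_id: str, results: list[dict],
                         decision: str, confidence: float) -> dict:
    """Build one structured record per decision for later analysis."""
    scores = [r.get("score", 0.0) for r in results]
    return {
        "ts": time.time(),
        "query_id": query_id,
        "decision": decision,
        "decision_confidence": confidence,
        "chunk_uris": [
            r.get("location", {}).get("s3Location", {}).get("uri") for r in results
        ],
        "scores": scores,
        "top_score": max(scores, default=None),
    }


# Hand-written retrieval results for illustration.
record = retrieval_log_record(
    "q-123",
    [
        {"score": 0.82, "location": {"s3Location": {"uri": "s3://my-kb-source-docs/a.md"}}},
        {"score": 0.61, "location": {"s3Location": {"uri": "s3://my-kb-source-docs/b.md"}}},
    ],
    decision="APPROVED",
    confidence=0.9,
)
logger.info(json.dumps(record))
```

With records like this you can chart the top score over time and see retrieval drift before users report it.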

Wrapping Up

Bedrock Knowledge Base is genuinely good infrastructure for production RAG systems. But the gap between "demo" and "production" is where every interesting engineering decision lives — chunking strategies, ingestion pipelines, cost optimization, evidence chains, failure modes.

S3 Vectors made this project economically viable at a scale where OpenSearch would have been overkill. Step Functions gave us the orchestration guarantees that compliance demands. And AgentCore turned retrieval into structured, auditable decisions.

The stack works. The hard part was never the AWS services — it was the data engineering around them.

Build it right the first time. Your future self (and your client's legal team) will thank you.

Dima Kramskoy is a Senior Cloud Architect at DoiT International and an AWS Community Builder (2026), with 20+ years in software engineering and 10 AWS certifications. He helps organizations build production AI/ML systems on AWS.