Production-ready RAG: Investor Q&A over Financial Filings
Summary: This tutorial walks through building a Retrieval-Augmented Generation (RAG) system for a finance use case — an Investor Q&A assistant that answers questions using a company's SEC filings, earnings call transcripts, and analyst research. You'll get architecture guidance, production considerations, and runnable sample code (indexer + API) plus Docker and deployment tips.
Why RAG for Finance?
Finance teams need accurate, up-to-date, and source-grounded answers. RAG is ideal because it:
- Combines retrieval (fast lookups of factual documents) with generation (clear, consolidated answers).
- Keeps answers grounded in the original filings (important for regulatory/compliance needs).
- Handles large corpora (10-Ks, 10-Qs, transcripts) by retrieving relevant snippets instead of trying to fit everything into an LLM context window.
Example user story
An investor or analyst asks: “What did Company X say about guidance for FY26, and which assumptions were highlighted?” The RAG assistant returns a short, sourced answer: a 2–3 sentence summary plus citations to specific filing sections and timestamps in the earnings transcript.
High-level architecture
- Ingestion & normalization — fetch SEC filings, earnings transcripts, research PDFs; convert to text, normalize.
- Chunking — split long documents into overlapping chunks (e.g., 800 tokens, 100-token overlap).
- Embeddings — convert chunks to vectors (OpenAI or sentence-transformers).
- Vector store / index — FAISS (local) or managed store (Pinecone, Weaviate, Milvus) for fast k-NN retrieval.
- Retriever — fetch top-k chunks for a query.
- Reranking / filtering (optional) — lexical scoring, recency filter, source weighting.
- LLM generator — use the retrieved chunks as context + the query to produce a grounded answer.
- API & app — serve queries, add caching, rate limits, logging.
- Monitoring & evaluation — record user feedback, track hallucinations, latency, and cost.
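The optional reranking step can be as simple as multiplying each retrieval score by a recency decay. The sketch below is one possible approach, not something the sample code does: `recency_weight`, `rerank_by_recency`, the `half_life_days` parameter, and the `filed` field are hypothetical names, and it assumes a similarity score where higher is better (FAISS L2 distances would have to be converted first).

```python
from datetime import date


def recency_weight(filed: date, today: date, half_life_days: float = 365.0) -> float:
    """Exponential decay: a document that is half_life_days old gets weight 0.5."""
    age_days = max((today - filed).days, 0)
    return 0.5 ** (age_days / half_life_days)


def rerank_by_recency(hits, today, half_life_days=365.0):
    """Sort hits by similarity * recency weight, highest first.

    Each hit is a dict with a 'score' (higher = more similar) and a 'filed' date.
    """
    def adjusted(hit):
        return hit["score"] * recency_weight(hit["filed"], today, half_life_days)

    return sorted(hits, key=adjusted, reverse=True)


# Hypothetical hits: an older filing with a slightly better raw score,
# and a recent filing with a slightly worse one.
hits = [
    {"doc_id": "10-K-2023", "score": 0.90, "filed": date(2023, 2, 1)},
    {"doc_id": "10-Q-2025-Q3", "score": 0.80, "filed": date(2025, 10, 30)},
]
ranked = rerank_by_recency(hits, today=date(2025, 11, 1))
print([h["doc_id"] for h in ranked])
```

With a one-year half-life the recent 10-Q moves ahead of the older 10-K. A longer half-life softens the effect, and a hard date filter is the limiting case.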
Implementation overview (what we'll build)
- ingest.py: fetch local sample documents, convert to text, chunk, and store embeddings in FAISS.
- app.py: FastAPI service that accepts queries, retrieves relevant chunks, and calls the LLM to generate a grounded answer with citations.
- Dockerfile and k8s/ example manifest for deployment.
The sample uses OpenAI embeddings & completion for clarity, but you can substitute any embeddings model and LLM.
Setup & dependencies
Install (recommended in a virtualenv):
pip install fastapi uvicorn openai faiss-cpu tiktoken numpy python-multipart requests
(If you prefer LangChain, you can adapt the code to use langchain's Document, TextSplitter, FAISS, etc.)
Environment variables (example):
export OPENAI_API_KEY="sk-..."
export EMBEDDING_MODEL="text-embedding-3-small"
export GENERATION_MODEL="gpt-4o-mini" # or gpt-4o, or local LLM
1) Ingest & index: ingest.py
This script:
- Reads sample .txt files from data/ (PDF conversion is omitted here; assume pre-extracted text).
- Splits text into chunks with overlap.
- Creates embeddings using OpenAI.
- Builds a FAISS index and saves it to disk with metadata (source, doc id, chunk id, text preview).
# ingest.py
import json
import os
from pathlib import Path

import faiss
import numpy as np
from openai import OpenAI


# Sliding-window splitter. Whitespace-separated words stand in for tokens here;
# replace with tiktoken if you need exact token counts.
def chunk_text(text, chunk_tokens=800, overlap_tokens=100):
    words = text.split()
    i = 0
    chunks = []
    while i < len(words):
        chunk = words[i:i + chunk_tokens]
        chunks.append(' '.join(chunk))
        if i + chunk_tokens >= len(words):
            break
        i += chunk_tokens - overlap_tokens
    return chunks


# openai>=1.0 client; it reads OPENAI_API_KEY from the environment
client = OpenAI()
EMB_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
DATA_DIR = Path("data")
INDEX_DIR = Path("index")
INDEX_DIR.mkdir(exist_ok=True)

# load documents (expects .txt files in data/); sorted for a stable chunk order
docs = []
for p in sorted(DATA_DIR.glob("*.txt")):
    docs.append({
        "id": p.stem,
        "text": p.read_text(encoding='utf-8'),
        "path": str(p)
    })

# create embeddings and metadata arrays
metadatas = []
vectors = []
texts = []
for doc in docs:
    chunks = chunk_text(doc["text"])  # defaults
    for i, c in enumerate(chunks):
        # one embeddings call per chunk (simple, but slow for large corpora)
        resp = client.embeddings.create(input=c, model=EMB_MODEL)
        emb = resp.data[0].embedding
        vectors.append(np.array(emb, dtype='float32'))
        metadatas.append({
            "doc_id": doc["id"],
            "chunk_id": i,
            "source": doc["path"],
            "text_preview": c[:200]
        })
        texts.append(c)

if not vectors:
    raise SystemExit("No text chunks found in data/; nothing to index")
vectors = np.vstack(vectors)

# FAISS index (exact search over L2 distance)
d = vectors.shape[1]
index = faiss.IndexFlatL2(d)
index.add(vectors)

# persist; line N of each .jsonl file corresponds to vector N in the index
faiss.write_index(index, str(INDEX_DIR / "faiss.index"))
with open(INDEX_DIR / "metadatas.jsonl", "w", encoding='utf-8') as f:
    for m in metadatas:
        f.write(json.dumps(m) + "\n")
with open(INDEX_DIR / "texts.jsonl", "w", encoding='utf-8') as f:
    for t in texts:
        f.write(json.dumps({"text": t}) + "\n")
print("Index created with", vectors.shape[0], "vectors")
Notes:
- For accurate chunk token counts and to match your LLM prompt size, use tiktoken to split by tokens.
- For production, prefer batched embedding requests to reduce the number of API calls and the round-trip overhead.
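Batching could look like the sketch below. `batched`, `embed_in_batches`, and `fake_embed` are hypothetical names; `embed_fn` stands in for whatever call sends a list of texts to your embedding provider and returns one vector per text, in the same order. The batch size of 64 is an arbitrary placeholder, so check your provider's per-request limits.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    chunks: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed all chunks with one embed_fn call per batch, preserving order."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        result = embed_fn(list(batch))
        if len(result) != len(batch):
            raise RuntimeError("embedding call returned the wrong number of vectors")
        vectors.extend(result)
    return vectors


# Stand-in for a real embedding call; it records how large each request was.
calls = []


def fake_embed(texts: List[str]) -> List[List[float]]:
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]


chunks = [f"chunk {i}" for i in range(150)]
vectors = embed_in_batches(chunks, fake_embed, batch_size=64)
print(len(vectors), calls)
```

Here 150 chunks go out in three requests instead of 150. The length check guards against a response that silently drops an item, which would otherwise misalign vectors and metadata.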
2) Inference API: app.py
A simple FastAPI app that:
- Loads FAISS index and metadata.
- On a query: computes embedding for the question, retrieves top-k chunks, formats a prompt with citations, and calls the LLM for a grounded answer.
# app.py
import json
import os

import faiss
import numpy as np
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel, Field

# openai>=1.0 client; it reads OPENAI_API_KEY from the environment
client = OpenAI()
EMB_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
GEN_MODEL = os.getenv("GENERATION_MODEL", "gpt-4o-mini")
INDEX_DIR = "index"

# load index and metadata (line N of each file matches vector N in the index)
index = faiss.read_index(os.path.join(INDEX_DIR, "faiss.index"))
with open(os.path.join(INDEX_DIR, "metadatas.jsonl"), "r", encoding="utf-8") as f:
    metadatas = [json.loads(line) for line in f]
with open(os.path.join(INDEX_DIR, "texts.jsonl"), "r", encoding="utf-8") as f:
    texts = [json.loads(line)["text"] for line in f]

app = FastAPI()


class QueryIn(BaseModel):
    query: str
    # the upper bound of 20 is an arbitrary example limit
    top_k: int = Field(default=6, ge=1, le=20)


# Plain "def" rather than "async def": the OpenAI and FAISS calls below block,
# so FastAPI runs this handler in its thread pool instead of on the event loop.
@app.post("/query")
def query_endpoint(q: QueryIn):
    # embed query
    emb_resp = client.embeddings.create(input=q.query, model=EMB_MODEL)
    q_emb = np.array(emb_resp.data[0].embedding, dtype='float32')
    # retrieve; IndexFlatL2 returns squared L2 distances, so lower = closer
    D, I = index.search(q_emb.reshape(1, -1), q.top_k)
    retrieved = []
    for score, idx in zip(D[0], I[0]):
        if idx < 0:  # FAISS pads with -1 when the index holds fewer than top_k vectors
            continue
        m = metadatas[idx]
        retrieved.append({
            "score": float(score),
            "source": m["source"],
            "doc_id": m["doc_id"],
            "chunk_id": m["chunk_id"],
            "text": texts[idx]
        })
    # build prompt. Be explicit about provenance
    prompt_chunks = []
    for i, r in enumerate(retrieved):
        label = f"[SOURCE {i+1}] {r['source']} (doc={r['doc_id']}, chunk={r['chunk_id']})"
        prompt_chunks.append(f"{label}\n{r['text']}\n")
    prompt = (
        "You are a helpful assistant for financial filings. Use only the information in the sources below to answer the user query. "
        "If the answer is not in the sources, say 'I couldn't find a definitive answer in the provided documents.'\n\n"
        "SOURCES:\n" + "\n---\n".join(prompt_chunks) + "\n\nUSER QUERY:\n" + q.query + "\n\nProvide a short answer (2-4 sentences) and then list the sources you used by their SOURCE label."
    )
    # call LLM (chat completions API)
    resp = client.chat.completions.create(
        model=GEN_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
        temperature=0.0,
    )
    answer = resp.choices[0].message.content
    return {"answer": answer, "retrieved": retrieved}


# To run: uvicorn app:app --reload --port 8000
Security & Safety notes:
- Validate user input and rate-limit requests.
- Never log full user queries in production without PII-scrubbing if regulatory concerns exist.
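One way to keep obvious identifiers out of logs is to redact them before the query is written anywhere. The sketch below is illustrative only: `PII_PATTERNS` and `scrub` are hypothetical names, and the two regular expressions (email addresses and US Social Security numbers in the 123-45-6789 form) are nowhere near complete PII coverage.

```python
import re

# Illustrative patterns only; real PII detection needs much broader coverage.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]


def scrub(text: str) -> str:
    """Replace each pattern match with its placeholder before logging."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


print(scrub("Email jane.doe@example.com about SSN 123-45-6789 and FY26 guidance"))
```

The financial terms in the query survive, so the scrubbed text is still useful for debugging retrieval quality.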
3) Dockerfile (simple)
FROM python:3.11-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir fastapi uvicorn openai faiss-cpu tiktoken numpy
ENV PYTHONUNBUFFERED=1
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
4) Production considerations
Vector store choice
- FAISS: great for local or single-node setups.
- Managed (Pinecone/Weaviate/Milvus): better for scale, persistence, and distributed queries.
Embeddings & LLM
- Use batching for embeddings to reduce API calls.
- Consider using a cheaper embedding model for indexing and a higher-quality model for reranking.
- For closed environments or on-premise compliance, use a local embedding model (e.g., sentence-transformers) and a hosted/certified LLM that meets regulatory needs.
Prompt design & grounding
- Always include the retrieved chunks and a strict instruction to only use the sources.
- Include explicit citation labels ([SOURCE 1]) and prompt the model to return them.
- Use a low temperature (e.g., temperature=0) for factual answers; this makes output more repeatable, though not strictly deterministic.
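Because the prompt asks for `[SOURCE n]` labels, the API can check that every label in the answer points at a chunk that was actually retrieved. A minimal sketch, where `check_citations` is a hypothetical helper and the label format is assumed to match the one used in the prompt:

```python
import re

CITATION_RE = re.compile(r"\[SOURCE (\d+)\]")


def check_citations(answer: str, num_sources: int):
    """Return (valid, invalid) citation numbers found in the answer.

    A citation is valid when it falls in 1..num_sources, i.e. it refers to
    one of the chunks that was actually placed in the prompt.
    """
    cited = sorted({int(n) for n in CITATION_RE.findall(answer)})
    valid = [n for n in cited if 1 <= n <= num_sources]
    invalid = [n for n in cited if not 1 <= n <= num_sources]
    return valid, invalid


answer = "Revenue guidance was raised [SOURCE 1]. Margins were flat [SOURCE 7]."
valid, invalid = check_citations(answer, num_sources=6)
print(valid, invalid)
```

An answer with invalid labels, or with no labels at all, is a reasonable candidate for a retry or for flagging in the response.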
Caching and latency
- Cache embeddings for repeated queries or identical documents.
- Cache LLM answers for frequently asked questions.
- Use asynchronous request handling and connection pooling for LLM calls.
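A query-embedding cache can be sketched as a dictionary keyed by a hash of the normalized text. `EmbeddingCache` and `fake_embed` are hypothetical names, `embed_fn` stands in for the real embedding call, and the normalization (lowercasing and collapsing whitespace) is an assumption you may not want if casing matters for your queries.

```python
import hashlib


class EmbeddingCache:
    """In-memory cache of embeddings keyed by a hash of the normalized text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str):
        key = self._key(text)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]


# Stand-in for a real embedding call; it records every text it is asked to embed.
calls = []


def fake_embed(text):
    calls.append(text)
    return [float(len(text))]


cache = EmbeddingCache(fake_embed)
cache.get("What is FY26 guidance?")
cache.get("  what is  FY26 guidance? ")  # same query after normalization
print(cache.hits, cache.misses, len(calls))
```

This version grows without bound and lives in one process. A production setup would more likely add an eviction policy or use a shared store, and would include the embedding model name in the key.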
Monitoring and evaluation
- Track: request rate, average latency, token usage, cost per query, and hallucination rate (via human feedback).
- Store sample queries + retrieved docs + final answer for offline evaluation.
- Ask users for feedback (thumbs up/down) and use it to fine-tune rerankers or prompt templates.
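The tracked metrics can be computed from stored per-request records. In this sketch the record fields (`latency_ms`, `tokens`, `hallucinated`) and the `summarize` function are hypothetical; `hallucinated` is assumed to come from human review and to be `None` when a request has not been reviewed.

```python
from statistics import mean


def summarize(records):
    """Aggregate per-request records into a few dashboard numbers."""
    reviewed = [r for r in records if r.get("hallucinated") is not None]
    flagged = sum(1 for r in reviewed if r["hallucinated"])
    return {
        "requests": len(records),
        "avg_latency_ms": mean(r["latency_ms"] for r in records) if records else None,
        "total_tokens": sum(r["tokens"] for r in records),
        # rate over reviewed requests only; None when nothing has been reviewed
        "hallucination_rate": flagged / len(reviewed) if reviewed else None,
    }


# Made-up records for illustration.
records = [
    {"latency_ms": 800, "tokens": 900, "hallucinated": False},
    {"latency_ms": 1200, "tokens": 1100, "hallucinated": True},
    {"latency_ms": 1000, "tokens": 1000, "hallucinated": None},
]
summary = summarize(records)
print(summary)
```

Reporting the hallucination rate over reviewed requests only avoids treating unreviewed answers as correct.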
Logging & Audit
- Log what sources were used to create an answer (not necessarily the entire chunk text) for auditability.
- For regulated workflows, keep immutable audit logs and consider redaction policies.
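One common approximation of an immutable log is a tamper-evident one, where each entry stores a hash of the previous entry. The sketch below assumes that approach; `append_entry`, `verify`, and the record fields are hypothetical, and real deployments usually pair this with write-once storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _entry_hash(prev_hash: str, record: dict) -> str:
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list, record: dict) -> None:
    """Append a record whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "record": record,
        "prev_hash": prev_hash,
        "hash": _entry_hash(prev_hash, record),
    })


def verify(log: list) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_entry(log, {"query_id": "q1", "sources": ["acme_10q:3"]})
append_entry(log, {"query_id": "q2", "sources": ["acme_10k:12", "acme_call:5"]})
print(verify(log))

log[0]["record"]["sources"] = ["something_else:1"]  # simulate tampering
print(verify(log))
```

The records hold source identifiers rather than chunk text, which matches the advice above to log what was used without storing full content.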
Security
- Protect API keys and secrets with a secrets manager.
- Use network policies and encryption at rest for persistent stores.
- Rate-limit user queries and add throttling.
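Rate limiting can be sketched as a per-client sliding window. `SlidingWindowLimiter` is a hypothetical class, the limits are placeholders, and the state lives in one process, so a multi-replica deployment would need a shared store or a gateway-level limiter instead.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # client_id -> timestamps of accepted requests

    def allow(self, client_id: str, now: float) -> bool:
        events = self._events[client_id]
        # drop timestamps that have fallen out of the window
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_requests:
            return False
        events.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60.0)
results = [
    limiter.allow("client-a", now=0.0),
    limiter.allow("client-a", now=1.0),
    limiter.allow("client-a", now=2.0),   # third request inside the window
    limiter.allow("client-b", now=2.0),   # other clients are unaffected
    limiter.allow("client-a", now=60.5),  # the first request has expired
]
print(results)
```

Passing `now` explicitly keeps the class easy to test; in a request handler you would pass something like `time.monotonic()` and return HTTP 429 when `allow` is false.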
5) Testing & CI
- Unit tests for: ingestion (chunks count), embeddings shape, retrieval top-k stability, prompt formatting.
- Integration tests: run a small index in CI, call /query with seeded docs, assert returned citations contain expected doc ids.
- Example GitHub Actions job: run tests, build Docker image, push to registry, and deploy to staging.
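A unit test for the chunking step might look like the sketch below. To keep it self-contained, `chunk_text` is copied from ingest.py; in the repo you would import it instead, which means moving the script's top-level indexing code behind a main guard so the import does not trigger API calls. The test names are hypothetical.

```python
def chunk_text(text, chunk_tokens=800, overlap_tokens=100):
    # Copied from ingest.py: sliding window over whitespace-separated words.
    words = text.split()
    i = 0
    chunks = []
    while i < len(words):
        chunk = words[i:i + chunk_tokens]
        chunks.append(' '.join(chunk))
        if i + chunk_tokens >= len(words):
            break
        i += chunk_tokens - overlap_tokens
    return chunks


def test_chunk_count_and_overlap():
    text = " ".join(f"w{i}" for i in range(2000))
    chunks = chunk_text(text, chunk_tokens=800, overlap_tokens=100)
    # windows start at words 0, 700, and 1400
    assert len(chunks) == 3
    first, second = chunks[0].split(), chunks[1].split()
    # the last 100 words of one chunk open the next chunk
    assert first[-100:] == second[:100]
    # nothing is lost at the end of the document
    assert chunks[-1].split()[-1] == "w1999"


def test_short_and_empty_documents():
    assert chunk_text("only a few words") == ["only a few words"]
    assert chunk_text("") == []


if __name__ == "__main__":
    test_chunk_count_and_overlap()
    test_short_and_empty_documents()
    print("chunking tests passed")
```

Both functions follow pytest's naming convention, so `pytest` would pick them up from a file under tests/ without further setup.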
6) Example conversation & expected output
User: "What did Acme Corp say about FY2026 revenue guidance in their most recent 10-Q?"
System (RAG): Short answer with citations: two sentences summarizing the guidance, followed by a bullet list showing [SOURCE 1] filings/10-Q-Acme_2025_Q3.txt chunk=3 (so the human can inspect the original chunk).
This improves trust and compliance because you can show the original evidence.
7) Improvements & advanced topics
- Reranking with a cross-encoder: use a cross-attention reranker to reorder the retrieved chunks before generation.
- Temporal / recency weighting: prefer more recent filings and transcripts; include a time decay or explicit recency filter.
- Query rewriting: expand/normalize queries for better retrieval (e.g., map 'EPS' -> 'earnings per share').
- Hybrid retrieval: combine lexical search (BM25) with vector search for exact-match numeric queries.
- Red-teaming & hallucination detection: run adversarial prompts and monitor hallucination metrics.
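For hybrid retrieval, one widely used way to merge a lexical ranking with a vector ranking is reciprocal rank fusion (RRF), which needs only the rank positions and not comparable scores. This is a sketch of that technique rather than part of the sample code: `reciprocal_rank_fusion` is a hypothetical name, the chunk ids are made up, and `k=60` is a commonly used default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by id for stable output.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Made-up results: BM25 and vector search each return their own top 3.
bm25_ranking = ["c3", "c1", "c7"]
vector_ranking = ["c1", "c5", "c3"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Chunks found by both retrievers (c1 and c3) rise to the top, which is the behavior you want for queries that mix exact figures with paraphrased concepts.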
8) Cost estimate & operational tips
- Embeddings: typically the largest per-document up-front cost. Use cheaper models for indexing, better models for reranking only when needed.
- Generation: cost per query depends on the model, the prompt length (including retrieved chunks), and the response length. Use a conservative max_tokens and short summarization prompts.
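The arithmetic behind a rough estimate is simple enough to script. Every number in the sketch below is a placeholder, not a real price or measurement: `estimate_costs` is a hypothetical helper, and the per-million-token prices must be replaced with your provider's current rates.

```python
def estimate_costs(
    num_chunks,
    tokens_per_chunk,
    embed_price_per_1m,
    prompt_tokens,
    completion_tokens,
    input_price_per_1m,
    output_price_per_1m,
):
    """Return (one-off indexing cost, cost per query) in the price's currency."""
    indexing = num_chunks * tokens_per_chunk * embed_price_per_1m / 1_000_000
    per_query = (
        prompt_tokens * input_price_per_1m
        + completion_tokens * output_price_per_1m
    ) / 1_000_000
    return indexing, per_query


# Placeholder inputs: 10,000 chunks of ~800 tokens; each query sends 6 chunks
# plus ~200 tokens of instructions and receives up to 400 tokens back.
indexing_cost, query_cost = estimate_costs(
    num_chunks=10_000,
    tokens_per_chunk=800,
    embed_price_per_1m=0.10,   # placeholder price
    prompt_tokens=6 * 800 + 200,
    completion_tokens=400,
    input_price_per_1m=1.00,   # placeholder price
    output_price_per_1m=4.00,  # placeholder price
)
print(round(indexing_cost, 4), round(query_cost, 4))
```

The structure shows where the levers are: indexing cost scales with corpus size, while per-query cost is dominated by how many retrieved chunks go into the prompt.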
9) Example repo layout
rag-finance/
├─ data/ # .txt filings used for demo
├─ ingest.py
├─ app.py
├─ requirements.txt
├─ Dockerfile
├─ index/ # generated index files
└─ tests/
└─ test_integration.py
Closing notes
RAG systems are a pragmatic way to deliver accurate, auditable answers in finance. The pattern above — robust ingestion, smart chunking, appropriate vector store, and conservative prompt engineering — is a strong blueprint for production.
License: MIT