
Production-ready RAG: Investor Q&A over Financial Filings
Summary: This tutorial walks through building a Retrieval-Augmented Generation (RAG) system for a finance use case — an Investor Q&A assistant that answers questions using a company's SEC filings, earnings call transcripts, and analyst research. You'll get architecture guidance, production considerations, and runnable sample code (indexer + API) plus Docker and deployment tips.
Why RAG for Finance?
Finance teams need accurate, up-to-date, and source-grounded answers. RAG is ideal because it:
- Combines retrieval (fast lookups of factual documents) with generation (clear, consolidated answers).
- Keeps answers grounded in the original filings (important for regulatory/compliance needs).
- Handles large corpora (10-Ks, 10-Qs, transcripts) by retrieving relevant snippets instead of trying to fit everything into an LLM context window.
Example user story
An investor or analyst asks: “What did Company X say about guidance for FY26, and which assumptions were highlighted?” The RAG assistant returns a short, sourced answer: a 2–3 sentence summary plus citations to specific filing sections and timestamps in the earnings transcript.
High-level architecture
- Ingestion & normalization — fetch SEC filings, earnings transcripts, research PDFs; convert to text, normalize.
- Chunking — split long documents into overlapping chunks (e.g., 800 tokens, 100-token overlap).
- Embeddings — convert chunks to vectors (OpenAI or sentence-transformers).
- Vector store / index — FAISS (local) or managed store (Pinecone, Weaviate, Milvus) for fast k-NN retrieval.
- Retriever — fetch top-k chunks for a query.
- Reranking / filtering (optional) — lexical scoring, recency filter, source weighting.
- LLM generator — use the retrieved chunks as context + the query to produce a grounded answer.
- API & app — serve queries, add caching, rate limits, logging.
- Monitoring & evaluation — record user feedback, track hallucinations, latency, and cost.
Implementation overview (what we'll build)
- ingest.py — fetches local sample documents, converts them to text, chunks them, and stores embeddings in FAISS.
- app.py — FastAPI service that accepts queries, retrieves relevant chunks, and calls the LLM to generate a grounded answer with citations.
- Dockerfile and k8s/ — example manifests for deployment.
The sample uses OpenAI embeddings & completion for clarity, but you can substitute any embeddings model and LLM.
Setup & dependencies
Install (recommended in a virtualenv):
pip install fastapi uvicorn openai faiss-cpu tiktoken numpy python-multipart requests
(If you prefer LangChain, you can adapt the code to use langchain's Document, TextSplitter, FAISS wrappers, etc.)
Environment variables (example):
export OPENAI_API_KEY="sk-..."
export EMBEDDING_MODEL="text-embedding-3-small"
export GENERATION_MODEL="gpt-4o-mini" # or gpt-4o, or local LLM
1) Ingest & index: ingest.py
This script:
- Reads sample .txt or .pdf files (PDF conversion is omitted here — assume pre-extracted text).
- Splits text into chunks with overlap.
- Creates embeddings using OpenAI.
- Builds a FAISS index and saves it to disk with metadata (source, doc id, chunk id, char offsets).
# ingest.py
import os
import json
from pathlib import Path

import numpy as np
import faiss
from openai import OpenAI

# Simple tokenizer length estimate -- replace with tiktoken for exact splits if needed
def rough_token_count(text):
    return len(text.split())

# Sliding-window splitter (word-based approximation of token-sized chunks)
def chunk_text(text, chunk_tokens=800, overlap_tokens=100):
    words = text.split()
    i = 0
    chunks = []
    while i < len(words):
        chunk = words[i:i + chunk_tokens]
        chunks.append(' '.join(chunk))
        if i + chunk_tokens >= len(words):
            break
        i += chunk_tokens - overlap_tokens
    return chunks

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
EMB_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
DATA_DIR = Path("data")
INDEX_DIR = Path("index")
INDEX_DIR.mkdir(exist_ok=True)

# load documents (expects .txt files in data/)
docs = []
for p in DATA_DIR.glob("*.txt"):
    docs.append({
        "id": p.stem,
        "text": p.read_text(encoding="utf-8"),
        "path": str(p),
    })

# create embeddings and metadata arrays
metadatas = []
vectors = []
texts = []
for doc in docs:
    chunks = chunk_text(doc["text"])  # defaults: 800-word chunks, 100-word overlap
    for i, c in enumerate(chunks):
        # call OpenAI embeddings (one request per chunk here; batch in production)
        resp = client.embeddings.create(input=c, model=EMB_MODEL)
        emb = resp.data[0].embedding
        vectors.append(np.array(emb, dtype="float32"))
        metadatas.append({
            "doc_id": doc["id"],
            "chunk_id": i,
            "source": doc["path"],
            "text_preview": c[:200],
        })
        texts.append(c)

vectors = np.vstack(vectors)

# FAISS index (exact L2 search; swap in an IVF/HNSW index at scale)
d = vectors.shape[1]
index = faiss.IndexFlatL2(d)
index.add(vectors)

# persist index, metadata, and chunk texts
faiss.write_index(index, str(INDEX_DIR / "faiss.index"))
with open(INDEX_DIR / "metadatas.jsonl", "w", encoding="utf-8") as f:
    for m in metadatas:
        f.write(json.dumps(m) + "\n")
with open(INDEX_DIR / "texts.jsonl", "w", encoding="utf-8") as f:
    for t in texts:
        f.write(json.dumps({"text": t}) + "\n")

print("Index created with", vectors.shape[0], "vectors")
Notes:
- For accurate chunk token counts and to match your LLM prompt size, use tiktoken to split by tokens.
- For production, prefer batched embedding requests to reduce API calls and cost (see the sketch below).
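Here is a minimal sketch of both notes, using the same OpenAI client as the script above; the file name and the helpers chunk_by_tokens / embed_in_batches are illustrative and not part of the sample repo.
# token_chunking.py -- sketch: token-accurate chunking with tiktoken + batched embedding requests
import os
import tiktoken
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
EMB_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")

def chunk_by_tokens(text, chunk_tokens=800, overlap_tokens=100, encoding_name="cl100k_base"):
    # encode once, then slide a window over token ids instead of words
    enc = tiktoken.get_encoding(encoding_name)
    ids = enc.encode(text)
    chunks, i = [], 0
    while i < len(ids):
        window = ids[i:i + chunk_tokens]
        chunks.append(enc.decode(window))
        if i + chunk_tokens >= len(ids):
            break
        i += chunk_tokens - overlap_tokens
    return chunks

def embed_in_batches(chunks, batch_size=64):
    # the embeddings endpoint accepts a list of inputs, so one request covers many chunks
    vectors = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        resp = client.embeddings.create(input=batch, model=EMB_MODEL)
        vectors.extend(item.embedding for item in resp.data)
    return vectors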
2) Inference API: app.py
A simple FastAPI app that:
- Loads FAISS index and metadata.
- On a query: computes embedding for the question, retrieves top-k chunks, formats a prompt with citations, and calls the LLM for a grounded answer.
# app.py
import os
import json

import numpy as np
import faiss
from fastapi import FastAPI
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
EMB_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
GEN_MODEL = os.getenv("GENERATION_MODEL", "gpt-4o-mini")
INDEX_DIR = "index"

# load index, metadata, and chunk texts produced by ingest.py
index = faiss.read_index(os.path.join(INDEX_DIR, "faiss.index"))
with open(os.path.join(INDEX_DIR, "metadatas.jsonl"), "r", encoding="utf-8") as f:
    metadatas = [json.loads(line) for line in f]
with open(os.path.join(INDEX_DIR, "texts.jsonl"), "r", encoding="utf-8") as f:
    texts = [json.loads(line)["text"] for line in f]

app = FastAPI()

class QueryIn(BaseModel):
    query: str
    top_k: int = 6

@app.post("/query")
async def query_endpoint(q: QueryIn):
    # embed the query with the same model used at indexing time
    emb_resp = client.embeddings.create(input=q.query, model=EMB_MODEL)
    q_emb = np.array(emb_resp.data[0].embedding, dtype="float32")

    # retrieve top-k nearest chunks (L2 distance: lower is closer)
    D, I = index.search(q_emb.reshape(1, -1), q.top_k)
    retrieved = []
    for score, idx in zip(D[0], I[0]):
        m = metadatas[idx]
        retrieved.append({
            "score": float(score),
            "source": m["source"],
            "doc_id": m["doc_id"],
            "chunk_id": m["chunk_id"],
            "text": texts[idx],
        })

    # build the prompt; be explicit about provenance
    prompt_chunks = []
    for i, r in enumerate(retrieved):
        label = f"[SOURCE {i+1}] {r['source']} (doc={r['doc_id']}, chunk={r['chunk_id']})"
        prompt_chunks.append(f"{label}\n{r['text']}\n")
    prompt = (
        "You are a helpful assistant for financial filings. Use only the information in the sources below to answer the user query. "
        "If the answer is not in the sources, say 'I couldn't find a definitive answer in the provided documents.'\n\n"
        "SOURCES:\n" + "\n---\n".join(prompt_chunks) + "\n\nUSER QUERY:\n" + q.query +
        "\n\nProvide a short answer (2-4 sentences) and then list the sources you used by their SOURCE label."
    )

    # call the LLM (chat completion)
    resp = client.chat.completions.create(
        model=GEN_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
        temperature=0.0,
    )
    answer = resp.choices[0].message.content
    return {"answer": answer, "retrieved": retrieved}

# To run: uvicorn app:app --reload --port 8000
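Once the service is up, you can exercise it from Python with requests (already in the dependency list above); the question text here is just an illustration.
# query_example.py -- call the running service
import requests

resp = requests.post(
    "http://localhost:8000/query",
    json={"query": "What did Acme Corp say about FY2026 revenue guidance?", "top_k": 4},
    timeout=60,
)
resp.raise_for_status()
body = resp.json()
print(body["answer"])
for r in body["retrieved"]:
    print(r["source"], "chunk", r["chunk_id"], "score", round(r["score"], 3))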
Security & Safety notes:
- Validate user input and rate-limit requests.
- Where regulatory concerns exist, never log full user queries in production without PII scrubbing.
3) Dockerfile (simple)
FROM python:3.11-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir fastapi uvicorn openai faiss-cpu tiktoken numpy
ENV PYTHONUNBUFFERED=1
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
4) Production considerations
Vector store choice
- FAISS: great for local or single-node setups.
- Managed (Pinecone/Weaviate/Milvus): better for scale, persistence, and distributed queries.
Embeddings & LLM
- Use batching for embeddings to reduce API calls.
- Consider using a cheaper embedding model for indexing and a higher-quality model for reranking.
- For closed environments or on-premise compliance, use a local embedding model (e.g., sentence-transformers) and a hosted/certified LLM that meets regulatory needs.
Prompt design & grounding
- Always include the retrieved chunks and a strict instruction to only use the sources.
- Include explicit citation labels ([SOURCE 1]) and prompt the model to return them.
- Use a deterministic temperature (e.g., temperature=0) for factual answers.
Caching and latency
- Cache embeddings for repeated queries or identical documents.
- Cache LLM answers for frequently asked questions (a minimal sketch follows this list).
- Use asynchronous request handling and connection pooling for LLM calls.
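To make the answer-caching point concrete, here is a minimal in-process sketch; the helper names are illustrative, and a multi-replica deployment would use a shared store such as Redis instead.
# answer_cache.py -- sketch: cache answers keyed by a normalized query hash
import hashlib

_answer_cache = {}  # query hash -> answer payload (in-memory only)

def cache_key(query: str) -> str:
    # normalize whitespace and case so trivially different phrasings share a key
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def get_cached_answer(query: str):
    return _answer_cache.get(cache_key(query))

def set_cached_answer(query: str, payload: dict):
    _answer_cache[cache_key(query)] = payload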
Monitoring and evaluation
- Track: request rate, average latency, token usage, cost per query, and hallucination rate (via human feedback).
- Store sample queries + retrieved docs + final answer for offline evaluation (see the sketch after this list).
- Ask users for feedback (thumbs up/down) and use it to fine-tune rerankers or prompt templates.
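A minimal sketch of storing samples for offline evaluation, assuming a local JSONL log file; the path and helper name are illustrative.
# eval_logging.py -- sketch: append query, retrieved sources, and answer to a JSONL file
import json
import time
from pathlib import Path

EVAL_LOG = Path("logs/eval_samples.jsonl")  # hypothetical location
EVAL_LOG.parent.mkdir(parents=True, exist_ok=True)

def log_sample(query: str, retrieved: list, answer: str, feedback: str | None = None):
    record = {
        "ts": time.time(),
        "query": query,                 # scrub PII before logging in regulated environments
        "sources": [
            {"doc_id": r["doc_id"], "chunk_id": r["chunk_id"], "score": r["score"]}
            for r in retrieved
        ],
        "answer": answer,
        "feedback": feedback,           # e.g. "up" / "down" from the UI
    }
    with EVAL_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")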
Logging & Audit
- Log what sources were used to create an answer (not necessarily the entire chunk text) for auditability.
- For regulated workflows, keep immutable audit logs and consider redaction policies.
Security
- Protect API keys and secrets with a secrets manager.
- Use network policies and encryption at rest for persistent stores.
- Rate-limit user queries and add throttling (a minimal sketch follows).
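As a sketch of the rate-limiting point, here is a naive per-client fixed-window limiter written as a FastAPI dependency; it is in-memory and illustrative only, so a real deployment would push this to an API gateway or a shared store.
# rate_limit.py -- sketch: naive per-client fixed-window limiter
import time
from collections import defaultdict

from fastapi import HTTPException, Request

WINDOW_SECONDS = 60
MAX_REQUESTS = 30                      # hypothetical budget per client per window
_hits = defaultdict(list)              # client id -> recent request timestamps

async def rate_limit(request: Request):
    client_id = request.client.host if request.client else "unknown"
    now = time.time()
    recent = [t for t in _hits[client_id] if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_REQUESTS:
        raise HTTPException(status_code=429, detail="Too many requests, slow down.")
    recent.append(now)
    _hits[client_id] = recent
Wire it into the endpoint with dependencies=[Depends(rate_limit)] on the route decorator (importing Depends from fastapi).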
5) Testing & CI
- Unit tests for: ingestion (chunk counts), embedding shapes, retrieval top-k stability, prompt formatting.
- Integration tests: run a small index in CI, call /query with seeded docs, and assert that the returned citations contain the expected doc ids (a sketch follows).
- Example GitHub Actions job: run tests, build the Docker image, push it to the registry, and deploy to staging.
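A sketch of the integration test, assuming a small index has already been built over seeded fixture docs in CI (for example by running ingest.py against a tests/fixtures corpus) and that an OPENAI_API_KEY is available; FastAPI's TestClient also needs httpx installed. The asserted doc id is a placeholder for whatever fixture you seed.
# tests/test_integration.py -- sketch
from fastapi.testclient import TestClient

from app import app

client = TestClient(app)

def test_query_returns_expected_citation():
    resp = client.post("/query", json={"query": "What FY2026 revenue guidance was given?", "top_k": 3})
    assert resp.status_code == 200
    body = resp.json()
    assert body["answer"]  # non-empty grounded answer
    doc_ids = {r["doc_id"] for r in body["retrieved"]}
    # placeholder doc id: replace with the id of your seeded fixture filing
    assert "10-Q-Acme_2025_Q3" in doc_ids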
6) Example conversation & expected output
User: "What did Acme Corp say about FY2026 revenue guidance in their most recent 10-Q?"
System (RAG): A short answer with citations: two sentences summarizing the guidance, followed by a bullet list such as [SOURCE 1] filings/10-Q-Acme_2025_Q3.txt chunk=3 (so a human can inspect the original chunk).
This improves trust and compliance because you can show the original evidence.
7) Improvements & advanced topics
- Reranking with a cross-encoder: use a cross-attention reranker to reorder the retrieved chunks before generation.
- Temporal / recency weighting: prefer more recent filings and transcripts; include a time decay or explicit recency filter.
- Query rewriting: expand/normalize queries for better retrieval (e.g., map 'EPS' -> 'earnings per share'); a minimal sketch follows this list.
- Hybrid retrieval: combine lexical search (BM25) with vector search for exact-match numeric queries.
- Red-teaming & hallucination detection: run adversarial prompts and monitor hallucination metrics.
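For example, query rewriting can be as simple as an alias table applied before embedding the query; the table below is illustrative, not exhaustive.
# query_rewrite.py -- sketch: naive alias expansion before retrieval
FINANCE_ALIASES = {
    "eps": "earnings per share",
    "fcf": "free cash flow",
    "yoy": "year over year",
    "capex": "capital expenditures",
}

def rewrite_query(query: str) -> str:
    # expand known abbreviations so the query text better matches filing language
    tokens = query.split()
    expanded = [FINANCE_ALIASES.get(t.lower().strip(",.?"), t) for t in tokens]
    return " ".join(expanded)

# Example: rewrite_query("How did EPS and FCF trend YoY?")
# -> "How did earnings per share and free cash flow trend year over year?"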
8) Cost estimate & operational tips
- Embeddings: typically the largest per-document up-front cost. Use cheaper models for indexing, better models for reranking only when needed.
- Generation: cost per query depends on the model and response length. Use conservative max_tokens values and short summarization prompts (a rough back-of-the-envelope calculator follows).
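For rough planning, a back-of-the-envelope cost model helps; the rates in the example are made up, so substitute your provider's current per-token prices.
# cost_estimate.py -- sketch: back-of-the-envelope cost model (placeholder prices)
def indexing_cost(total_tokens: int, embed_price_per_1k: float) -> float:
    # one-time embedding cost for the whole corpus
    return total_tokens / 1000 * embed_price_per_1k

def per_query_cost(prompt_tokens: int, completion_tokens: int,
                   in_price_per_1k: float, out_price_per_1k: float) -> float:
    # per-query generation cost: prompt (query + retrieved chunks) plus the answer
    return prompt_tokens / 1000 * in_price_per_1k + completion_tokens / 1000 * out_price_per_1k

# Example with made-up rates: a 5,000-token prompt and a 300-token answer at
# $0.001 / $0.002 per 1K tokens costs about 5000/1000*0.001 + 300/1000*0.002 = $0.0056 per query.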
9) Example repo layout
rag-finance/
├─ data/ # .txt filings used for demo
├─ ingest.py
├─ app.py
├─ requirements.txt
├─ Dockerfile
├─ index/ # generated index files
└─ tests/
└─ test_integration.py
Closing notes
RAG systems are a pragmatic way to deliver accurate, auditable answers in finance. The pattern above — robust ingestion, smart chunking, appropriate vector store, and conservative prompt engineering — is a strong blueprint for production.
Possible extensions:
- A ready-to-run GitHub repo with the full sample data and CI pipeline.
- Swapping the sample to a specific vector DB (Pinecone, Weaviate) and showing the exact integration.
- A UI demo (React) that highlights source citations alongside the generated answer.
License: MIT