Production-ready RAG: Investor Q&A over Financial Filings

Summary: This tutorial walks through building a Retrieval-Augmented Generation (RAG) system for a finance use case — an Investor Q&A assistant that answers questions using a company's SEC filings, earnings call transcripts, and analyst research. You'll get architecture guidance, production considerations, and runnable sample code (indexer + API) plus Docker and deployment tips.


Why RAG for Finance?

Finance teams need accurate, up-to-date, and source-grounded answers. RAG is ideal because it:

  • Combines retrieval (fast lookups of factual documents) with generation (clear, consolidated answers).
  • Keeps answers grounded in the original filings (important for regulatory/compliance needs).
  • Handles large corpora (10-Ks, 10-Qs, transcripts) by retrieving relevant snippets instead of trying to fit everything into an LLM context window.

Example user story

An investor or analyst asks: “What did Company X say about guidance for FY26, and which assumptions were highlighted?” The RAG assistant returns a short, sourced answer: a 2–3 sentence summary plus citations to specific filing sections and timestamps in the earnings transcript.


High-level architecture

  1. Ingestion & normalization — fetch SEC filings, earnings transcripts, research PDFs; convert to text, normalize.
  2. Chunking — split long documents into overlapping chunks (e.g., 800 tokens, 100-token overlap).
  3. Embeddings — convert chunks to vectors (OpenAI or sentence-transformers).
  4. Vector store / index — FAISS (local) or managed store (Pinecone, Weaviate, Milvus) for fast k-NN retrieval.
  5. Retriever — fetch top-k chunks for a query.
  6. Reranking / filtering (optional) — lexical scoring, recency filter, source weighting.
  7. LLM generator — use the retrieved chunks as context + the query to produce a grounded answer.
  8. API & app — serve queries, add caching, rate limits, logging.
  9. Monitoring & evaluation — record user feedback, track hallucinations, latency, and cost.

Implementation overview (what we'll build)

  • ingest.py — fetch local sample documents, convert to text, chunk, and store embeddings in FAISS.
  • app.py — FastAPI service that accepts queries, retrieves relevant chunks, and calls the LLM to generate a grounded answer with citations.
  • Dockerfile and k8s/ example manifest for deployment.

The sample uses OpenAI embeddings and chat completions for clarity, but you can substitute any embedding model and LLM.


Setup & dependencies

Install (recommended in a virtualenv):

pip install fastapi uvicorn openai faiss-cpu tiktoken numpy python-multipart requests

(If you prefer LangChain, you can adapt the code to use langchain's Document, TextSplitter, FAISS, etc.)

Environment variables (example):

export OPENAI_API_KEY="sk-..."
export EMBEDDING_MODEL="text-embedding-3-small"
export GENERATION_MODEL="gpt-4o-mini"    # or gpt-4o, or local LLM

1) Ingest & index: ingest.py

This script:

  • Reads sample .txt files from data/ (PDF conversion is omitted here; assume pre-extracted text).
  • Splits text into chunks with overlap.
  • Creates embeddings using OpenAI.
  • Builds a FAISS index and saves it to disk with metadata (source, doc id, chunk id, char offsets).
# ingest.py
import os
import json
from pathlib import Path
import numpy as np
import faiss
from openai import OpenAI

# Simple tokenizer length estimate -- replace with tiktoken for exact split if needed
def rough_token_count(text):
    return len(text.split())

# Sliding window splitter
def chunk_text(text, chunk_tokens=800, overlap_tokens=100):
    words = text.split()
    i = 0
    chunks = []
    while i < len(words):
        chunk = words[i:i+chunk_tokens]
        chunks.append(' '.join(chunk))
        if i + chunk_tokens >= len(words):
            break
        i += chunk_tokens - overlap_tokens
    return chunks

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
EMB_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")

DATA_DIR = Path("data")
INDEX_DIR = Path("index")
INDEX_DIR.mkdir(exist_ok=True)

# load documents (expects .txt files in data/)
docs = []
for p in DATA_DIR.glob("*.txt"):
    docs.append({
        "id": p.stem,
        "text": p.read_text(encoding='utf-8'),
        "path": str(p)
    })

# create embeddings and metadata arrays
metadatas = []
vectors = []
texts = []

for doc in docs:
    chunks = chunk_text(doc["text"])  # defaults
    for i, c in enumerate(chunks):
        # call the OpenAI embeddings endpoint (one chunk per call; batch in production)
        resp = client.embeddings.create(input=c, model=EMB_MODEL)
        emb = resp.data[0].embedding
        vectors.append(np.array(emb, dtype='float32'))
        metadatas.append({
            "doc_id": doc["id"],
            "chunk_id": i,
            "source": doc["path"],
            "text_preview": c[:200]
        })
        texts.append(c)

vectors = np.vstack(vectors)

# FAISS index
d = vectors.shape[1]
index = faiss.IndexFlatL2(d)
index.add(vectors)

# persist
faiss.write_index(index, str(INDEX_DIR / "faiss.index"))
with open(INDEX_DIR / "metadatas.jsonl", "w", encoding='utf-8') as f:
    for m in metadatas:
        f.write(json.dumps(m) + "\n")
with open(INDEX_DIR / "texts.jsonl", "w", encoding='utf-8') as f:
    for t in texts:
        f.write(json.dumps({"text": t}) + "\n")

print("Index created with", vectors.shape[0], "vectors")

Notes:

  • For chunk token counts that actually match your LLM's tokenizer (and therefore its prompt budget), use tiktoken to split by tokens instead of whitespace words.
  • For production, batch embedding requests to reduce API calls and cost. A sketch of both adjustments follows.
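
As a minimal sketch of both adjustments (token-based splitting with tiktoken and batched calls to the embeddings endpoint), assuming the same chunk sizes as above and the current openai Python client; the file name and batch size are illustrative:

# token_chunks.py -- illustrative sketch, not part of the sample repo
import os
import tiktoken
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
EMB_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent OpenAI models

def chunk_by_tokens(text, chunk_tokens=800, overlap_tokens=100):
    """Split on real token boundaries instead of whitespace words."""
    tokens = enc.encode(text)
    chunks, i = [], 0
    while i < len(tokens):
        window = tokens[i:i + chunk_tokens]
        chunks.append(enc.decode(window))
        if i + chunk_tokens >= len(tokens):
            break
        i += chunk_tokens - overlap_tokens
    return chunks

def embed_batched(chunks, batch_size=64):
    """One embeddings request per batch of chunks instead of one per chunk."""
    vectors = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        resp = client.embeddings.create(input=batch, model=EMB_MODEL)
        vectors.extend(item.embedding for item in resp.data)
    return vectors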

2) Inference API: app.py

A simple FastAPI app that:

  • Loads FAISS index and metadata.
  • On a query: computes embedding for the question, retrieves top-k chunks, formats a prompt with citations, and calls the LLM for a grounded answer.
# app.py
import os
import json
import numpy as np
import faiss
from fastapi import FastAPI
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
EMB_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
GEN_MODEL = os.getenv("GENERATION_MODEL", "gpt-4o-mini")

INDEX_DIR = "index"

# load index and metadata
index = faiss.read_index(os.path.join(INDEX_DIR, "faiss.index"))
with open(os.path.join(INDEX_DIR, "metadatas.jsonl"), "r", encoding="utf-8") as f:
    metadatas = [json.loads(line) for line in f]
with open(os.path.join(INDEX_DIR, "texts.jsonl"), "r", encoding="utf-8") as f:
    texts = [json.loads(line)["text"] for line in f]

app = FastAPI()

class QueryIn(BaseModel):
    query: str
    top_k: int = 6

@app.post("/query")
async def query_endpoint(q: QueryIn):
    # embed the query with the same model used at index time
    emb_resp = client.embeddings.create(input=q.query, model=EMB_MODEL)
    q_emb = np.array(emb_resp.data[0].embedding, dtype='float32')

    # retrieve
    D, I = index.search(q_emb.reshape(1, -1), q.top_k)
    retrieved = []
    for score, idx in zip(D[0], I[0]):
        if idx < 0:  # FAISS pads results with -1 when fewer than top_k vectors exist
            continue
        m = metadatas[idx]
        retrieved.append({
            "score": float(score),
            "source": m["source"],
            "doc_id": m["doc_id"],
            "chunk_id": m["chunk_id"],
            "text": texts[idx]
        })

    # build prompt. Be explicit about provenance
    prompt_chunks = []
    for i, r in enumerate(retrieved):
        label = f"[SOURCE {i+1}] {r['source']} (doc={r['doc_id']}, chunk={r['chunk_id']})"
        prompt_chunks.append(f"{label}\n{r['text']}\n")

    prompt = (
        "You are a helpful assistant for financial filings. Use only the information in the sources below to answer the user query. "
        "If the answer is not in the sources, say 'I couldn't find a definitive answer in the provided documents.'\n\n"
        "SOURCES:\n" + "\n---\n".join(prompt_chunks) + "\n\nUSER QUERY:\n" + q.query + "\n\nProvide a short answer (2-4 sentences) and then list the sources you used by their SOURCE label."
    )

    # call the LLM (chat completion)
    resp = client.chat.completions.create(
        model=GEN_MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=400,
        temperature=0.0,
    )
    answer = resp.choices[0].message.content

    return {"answer": answer, "retrieved": retrieved}

# To run: uvicorn app:app --reload --port 8000

Security & Safety notes:

  • Validate user input and rate-limit requests (a minimal rate-limiter sketch follows).
  • If regulatory concerns exist, never log full user queries in production without PII scrubbing.
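
A minimal, illustrative in-process rate limiter (fixed window per client IP) that could be wired into the FastAPI app as a dependency; the window and limit values are arbitrary, and in production an API gateway or a dedicated library is usually a better fit:

# rate_limit.py -- illustrative sketch only
import time
from collections import defaultdict
from fastapi import HTTPException, Request

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30       # arbitrary example limit
_hits = defaultdict(list)          # client ip -> request timestamps

async def rate_limit(request: Request):
    """FastAPI dependency: reject clients that exceed the per-window budget."""
    ip = request.client.host if request.client else "unknown"
    now = time.time()
    # drop timestamps that fell out of the window
    _hits[ip] = [t for t in _hits[ip] if now - t < WINDOW_SECONDS]
    if len(_hits[ip]) >= MAX_REQUESTS_PER_WINDOW:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    _hits[ip].append(now)

# usage in app.py (hypothetical):
#   from fastapi import Depends
#   @app.post("/query", dependencies=[Depends(rate_limit)])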

3) Dockerfile (simple)

FROM python:3.11-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir fastapi uvicorn openai faiss-cpu tiktoken numpy
ENV PYTHONUNBUFFERED=1
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

4) Production considerations

Vector store choice

  • FAISS: great for local or single-node setups.
  • Managed (Pinecone/Weaviate/Milvus): better for scale, persistence, and distributed queries.

Embeddings & LLM

  • Use batching for embeddings to reduce API calls.
  • Consider a cheaper bi-encoder embedding model for first-stage retrieval and a higher-quality model (e.g., a cross-encoder) only for reranking the top results.
  • For closed environments or on-premise compliance, use a local embedding model (e.g., sentence-transformers) and a hosted/certified LLM that meets regulatory needs; see the sketch below.
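
If embeddings must stay on-premise, a sentence-transformers bi-encoder is a drop-in replacement for the OpenAI embedding calls in ingest.py and app.py. A minimal sketch; the checkpoint name is just a common small default, and the same model must be used at index and query time:

# local_embeddings.py -- illustrative sketch
import numpy as np
from sentence_transformers import SentenceTransformer

# any bi-encoder checkpoint works; this one is small and fast
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts):
    """Return float32 vectors suitable for a FAISS index."""
    vecs = model.encode(texts, normalize_embeddings=True)  # normalized vectors -> inner product ~ cosine
    return np.asarray(vecs, dtype="float32")

# index-time and query-time must use the same model, e.g.:
#   index = faiss.IndexFlatIP(embed(chunks).shape[1]); index.add(embed(chunks))
#   D, I = index.search(embed([query]), top_k)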

Prompt design & grounding

  • Always include the retrieved chunks and a strict instruction to only use the sources.
  • Include explicit citation labels ([SOURCE 1]) and prompt the model to return them.
  • Use a low temperature (e.g., temperature=0) so factual answers are as deterministic as possible.

Caching and latency

  • Cache embeddings for repeated queries or identical documents.
  • Cache LLM answers for frequently asked questions (a minimal caching sketch follows this list).
  • Use asynchronous request handling and connection pooling for LLM calls.
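
A minimal answer cache keyed on the normalized query text, which could wrap the LLM call in /query; this sketch is an in-process dict with a TTL, whereas a multi-replica deployment would typically use Redis or similar (the names and TTL are illustrative):

# answer_cache.py -- illustrative sketch
import time
import hashlib

TTL_SECONDS = 15 * 60           # arbitrary example TTL
_cache = {}                     # key -> (expiry_timestamp, answer)

def _key(query: str) -> str:
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

def get_cached_answer(query: str):
    entry = _cache.get(_key(query))
    if entry and entry[0] > time.time():
        return entry[1]
    return None

def cache_answer(query: str, answer: str):
    _cache[_key(query)] = (time.time() + TTL_SECONDS, answer)

# in app.py (hypothetical): check get_cached_answer(q.query) before calling the LLM,
# and call cache_answer(q.query, answer) after a successful generation.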

Monitoring and evaluation

  • Track: request rate, average latency, token usage, cost per query, and hallucination rate (via human feedback).
  • Store sample queries + retrieved docs + final answer for offline evaluation (see the logging sketch below).
  • Ask users for feedback (thumbs up/down) and use it to fine-tune rerankers or prompt templates.
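
One simple way to capture sample queries, the retrieved docs, and the final answer is an append-only JSONL log written from the /query handler; a minimal sketch (the path and record fields are illustrative):

# eval_log.py -- illustrative sketch
import json
import time
from pathlib import Path

LOG_PATH = Path("logs/eval_samples.jsonl")   # hypothetical location
LOG_PATH.parent.mkdir(parents=True, exist_ok=True)

def log_interaction(query, retrieved, answer, feedback=None):
    """Append one query/answer pair, plus which sources were used, for later review."""
    record = {
        "ts": time.time(),
        "query": query,                       # scrub PII here if required
        "sources": [
            {"doc_id": r["doc_id"], "chunk_id": r["chunk_id"], "score": r["score"]}
            for r in retrieved
        ],
        "answer": answer,
        "feedback": feedback,                 # e.g. thumbs up/down, filled in later
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")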

Logging & Audit

  • Log what sources were used to create an answer (not necessarily the entire chunk text) for auditability.
  • For regulated workflows, keep immutable audit logs and consider redaction policies.

Security

  • Protect API keys and secrets with a secrets manager.
  • Use network policies and encryption at rest for persistent stores.
  • Rate-limit user queries and add throttling.

5) Testing & CI

  • Unit tests for: ingestion (chunk counts and overlap), embeddings shape, retrieval top-k stability, prompt formatting; see the example test below.
  • Integration tests: run a small index in CI, call /query with seeded docs, assert returned citations contain expected doc ids.
  • Example GitHub Actions job: run tests, build Docker image, push to registry, and deploy to staging.
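
A small pytest example for the chunking behavior, assuming chunk_text can be imported without side effects (the sample ingest.py runs its indexing at import time, so you would either guard that code with if __name__ == "__main__" or move the function into a shared module):

# tests/test_chunking.py -- illustrative sketch
from ingest import chunk_text   # assumes ingest's top-level indexing is guarded or factored out

def test_chunks_cover_all_words():
    text = " ".join(f"w{i}" for i in range(2000))
    chunks = chunk_text(text, chunk_tokens=800, overlap_tokens=100)
    # every word appears in at least one chunk
    recovered = set(w for c in chunks for w in c.split())
    assert recovered == set(text.split())

def test_consecutive_chunks_overlap():
    text = " ".join(f"w{i}" for i in range(2000))
    chunks = chunk_text(text, chunk_tokens=800, overlap_tokens=100)
    for a, b in zip(chunks, chunks[1:]):
        # the last 100 words of one chunk start the next chunk
        assert a.split()[-100:] == b.split()[:100]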

6) Example conversation & expected output

User: "What did Acme Corp say about FY2026 revenue guidance in their most recent 10-Q?"

System (RAG): A short answer with citations: two sentences summarizing the guidance, followed by a bullet list such as [SOURCE 1] filings/10-Q-Acme_2025_Q3.txt (chunk=3), so a human can inspect the original chunk.

This improves trust and compliance because you can show the original evidence.


7) Improvements & advanced topics

  • Reranking with a cross-encoder: use a cross-attention reranker to reorder the retrieved chunks before generation (a sketch follows this list).
  • Temporal / recency weighting: prefer more recent filings and transcripts; include a time decay or explicit recency filter.
  • Query rewriting: expand/normalize queries for better retrieval (e.g., map 'EPS' -> 'earnings per share').
  • Hybrid retrieval: combine lexical search (BM25) with vector search for exact-match numeric queries.
  • Red-teaming & hallucination detection: run adversarial prompts and monitor hallucination metrics.
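
Cross-encoder reranking can be added with a few lines of sentence-transformers on top of the retrieved list returned by the FAISS search. A sketch, assuming a commonly used public MS MARCO checkpoint rather than anything finance-specific:

# rerank.py -- illustrative sketch
from sentence_transformers import CrossEncoder

# a general-purpose passage reranker; swap in a domain-tuned model if you have one
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, retrieved, keep=4):
    """Score (query, chunk) pairs jointly and keep the best few for the prompt."""
    pairs = [(query, r["text"]) for r in retrieved]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, retrieved), key=lambda x: x[0], reverse=True)
    return [r for _, r in ranked[:keep]]

# in app.py (hypothetical): retrieved = rerank(q.query, retrieved) before building the prompt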

8) Cost estimate & operational tips

  • Embeddings: typically the largest per-document up-front cost. Use cheaper models for indexing, better models for reranking only when needed.
  • Generation: cost per query depends on the model and response length. Use a conservative max_tokens and short summarization prompts (see the back-of-the-envelope formula below).
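
As a back-of-the-envelope sketch of how per-query generation cost is usually computed (input and output tokens are billed separately; actual prices vary by model and over time, so they are left as inputs rather than hard-coded):

# cost_estimate.py -- illustrative sketch; plug in your provider's current per-token prices
def cost_per_query(prompt_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Rough per-query generation cost: input and output tokens priced separately."""
    return (prompt_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k

# e.g. a 4,000-token prompt (query + retrieved chunks) with a 300-token answer:
#   cost_per_query(4000, 300, price_in_per_1k=..., price_out_per_1k=...)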

9) Example repo layout

rag-finance/
├─ data/                 # .txt filings used for demo
├─ ingest.py
├─ app.py
├─ requirements.txt
├─ Dockerfile
├─ index/                # generated index files
└─ tests/
   └─ test_integration.py

Closing notes

RAG systems are a pragmatic way to deliver accurate, auditable answers in finance. The pattern above — robust ingestion, smart chunking, appropriate vector store, and conservative prompt engineering — is a strong blueprint for production.

Natural next steps:

  • Package this as a ready-to-run GitHub repo with the full sample data and CI pipeline.
  • Swap the sample to a specific vector DB (Pinecone, Weaviate) and show the exact integration.
  • Add a UI demo (React) that highlights source citations alongside the generated answer.

License: MIT