A multi-step agent that reads private-equity fund quarterly reports and runs a nine-item LP diligence checklist
Every LP and fund-of-funds manager reads hundreds of GP quarterly reports per quarter. Most of the work is locating the same nine or ten figures inside differently structured documents. This prototype performs that extraction in under a minute per report, with citation-required answers and an explicit refusal mode when data is missing or redacted.
Documents land as PDFs (PSERS quarterly reports) or HTML (SEC 10-Qs). A section detector splits each into named sections with page anchors. A deterministic sentence-aware chunker turns the text into ~500-token windows with overlap.
Each checklist item is its own retrieval + LLM call. Local sentence-transformers embeddings and sqlite-vec cosine search surface the top-8 chunks per query. Anthropic Claude synthesizes a citation-tagged answer with prompt caching on the system prompt.
The system prompt requires a citation for every numeric claim and forces the LLM to return 'data not available' when retrieved excerpts can't support an answer. Each response is tagged high, medium, or refused — and the refusals are evaluated as a first-class metric.
The agent runs the same checklist against documents that disclose different things. PSERS quarterly reports redact fee terms and don't discuss valuation methodology; SEC 10-Qs disclose both. The interesting signal isn't that the agent answers when data exists — it's that it refuses correctly when it doesn't.
A hand-curated golden set covers all five documents and all nine checklist items, including questions targeting redacted or out-of-scope content (which the agent must refuse to score well). Faithfulness, context recall, and context precision are scored by Claude Haiku as judge. Numbers below are published verbatim from the latest run, including the worst-performing rows. Tuning to beat the metrics is out of scope for v1; the next iteration would add chunk re-ranking and per-section query rewriting to lift retrieval precision.
Five publicly available documents. Three consecutive quarters of PSERS Hamilton Lane quarterly reports (2017), FOIA-released via the Pennsylvania Joint State Government Commission Act 5 archive — the closest public analog to GP-to-LP quarterly communications, with redacted fee terms that exercise the agent's refusal mode by design. Two Blackstone Private Equity Strategies Fund 10-Q filings (Q1 and Q3 2025) cover what the PSERS reports redact: fund-level fee detail and the Level 1/2/3 fair value hierarchy.
Five documents is enough to demonstrate both single-document retrieval and cross-document behavior without overstating coverage. A production deployment would replace the hand-curated corpus with the firm's existing fund-master tables and document store.
Every numeric claim must be followed by a bracketed citation label matching a retrieved chunk. The LLM cannot answer from general knowledge.
The eval penalizes both false refusals and false confidences. Refusing when data exists is as bad as fabricating when it doesn't.
Every API and MCP call returns the retrieved chunks, the citation labels cited, the model used, and input/output token counts.
Document parsing and embedding run locally. Only synthesis calls touch a hosted LLM. The synthesis provider is configurable per environment.
Python backend with FastAPI + sqlite-vec + sentence-transformers, Anthropic Claude for synthesis, Haiku as the eval judge. The same tools are exposed via an MCP server (usable directly from Claude Desktop) and a Next.js frontend. ~600 chunks across the five-document corpus. The agent uses prompt caching on the system prompt so repeated checklist runs amortize most of the input cost. A checklist run on Anthropic Sonnet 4.6 costs about $0.13 end-to-end.