CODY LEE

Building evaluation-driven, tool-using AI systems — and analyzing where they fail.

Forensic measurement frameworks, autonomous document agents, and adversarial stress-testing — across legal, financial, and security domains.

CS @ UC Riverside, graduating June 2026 · Los Angeles · Open to relocation

Get in touch GitHub → Resume →

Approach

Six projects. One through-line.

All six deal with systems whose outputs can be silently wrong. Each project taught me something the next one needed. The arc isn't a plan — it's what happened.

01 ChainTax Feb–Mar 2026

Started with a crypto tax engine. The work surfaced the real problem: reconciling data across exchanges, wallets, and chains is enormously painful, and the difficulty has nothing to do with tax law. Messy data is the engineering problem.

02 Aether Apr 2026

Took the messy-data lesson into LLM territory. Built an agentic reason-act-observe engine that reasons one step at a time, discovers its path at runtime rather than following a fixed pipeline, and refuses to fabricate when the evidence isn't there — every step traced and auditable. Validated on financial-document QA, then extended into a full GTM lead-triage and outbound platform running on the same loop.

03 Polymarket Autopsy Apr–Jun 2026

Applied the LLM-workflow toolkit at scale: a 3-layer classification pipeline over 2,409 trader wallets, feeding 180 paper bots through 417,008 simulated trades. Six bot architectures then traded real capital. The autopsy documents why paper performance was anti-predictive of live results, and which three classes of measurement bug caused it.

04 Production RAG Stack Forensics 2026

Combined the previous three: the LLM workflow from Aether, the forensic methodology from Polymarket, applied to the production stack AI engineering teams ship on — LangGraph, Pinecone, Langfuse, cross-provider inference, and a custom MCP server exposing the instrumentation as agent-queryable tools. The sharpest finding was in the measurement itself.

05 Meridian May 2026

The sharpest finding from RAG Forensics was in the measurement. Meridian turns that into the project: a deterministic evaluation framework rigorous enough to catch its own bugs, calibrated against an external published baseline. The retrieval pipeline is the proving ground; the measurement layer — not the agent — is the contribution.

06 Aegis Jun 2026

Meridian proved the forensic discipline on retrieval metrics. Aegis applies it to security patches: a fix can pass the entire test suite and still leave the bug wide open. Aegis is a deterministic verifier that scores whether a patch genuinely closes the vulnerability — validated on real CVEs and measured like a classifier against genuine-vs-gamed patches. Same contribution, harder target.

Projects

In conviction order

Aether — GTM Agent Platform + Reasoning Engine

Apr–Jun 2026

A go-to-market agent platform that triages inbound leads and runs outbound campaigns on a single reason-act-observe loop: each lead is parsed, enriched against a source-of-truth waterfall, scored, and drafted for outreach — with a source on every fact or an honest abstention. The same loop is independently validated as a document-QA engine on FinQA, so the architecture isn't just asserted to work; it has a number.
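A minimal sketch of a reason-act-observe loop of this shape. The function names, the action format, and the scripted policy standing in for the LLM are all illustrative assumptions, not the Aether code.

```python
from typing import Callable


def run_loop(question: str, decide: Callable, tools: dict, max_steps: int = 5) -> dict:
    """Run reason-act-observe until the policy finishes or the step budget runs out."""
    trace = []  # every step is recorded so the run can be audited afterwards
    for step in range(max_steps):
        action = decide(question, trace)  # reason: choose the next action
        if action["tool"] == "finish":
            return {"answer": action["input"], "trace": trace}
        observation = tools[action["tool"]](action["input"])  # act
        # observe: store what came back so the next decision can use it
        trace.append({"step": step, "action": action, "observation": observation})
    # Budget exhausted without a supported answer: abstain instead of guessing.
    return {"answer": None, "trace": trace}


# Hypothetical scripted policy standing in for the LLM.
def scripted_policy(question, trace):
    if not trace:
        return {"tool": "lookup", "input": "revenue_2023"}
    # Finish with whatever the evidence supports; None means "abstain".
    return {"tool": "finish", "input": trace[-1]["observation"]}


tools = {"lookup": {"revenue_2023": "4.2M"}.get}
result = run_loop("What was 2023 revenue?", scripted_policy, tools)
no_evidence = run_loop("What was 2023 revenue?", scripted_policy, {"lookup": {}.get})
```

Here `result["answer"]` carries the looked-up value with a one-step trace, while `no_evidence["answer"]` is `None` because the lookup found nothing to cite.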

What it proves

A production-shaped GTM platform built on a reasoning loop that stays honest. Enrichment runs a source-of-truth waterfall so every claim in an outreach draft traces to a record; scoring is deterministic with a clamped LLM nudge so the model can't run away with the verdict; and the agent abstains rather than fabricating. Because the loop is also scored on FinQA, "this architecture works" is a measured claim, not a demo.
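One way "deterministic with a clamped LLM nudge" could look in code. The 0-100 scale, the ±10 cap, and the tier cut-offs are hypothetical values chosen for illustration, not the platform's real configuration.

```python
def score_lead(base_score: float, llm_nudge: float, max_nudge: float = 10.0) -> float:
    """Combine a rule-based score with a bounded LLM adjustment (hypothetical scale)."""
    # Clamp the model's suggestion so it can only shift the score slightly.
    nudge = max(-max_nudge, min(max_nudge, llm_nudge))
    # Keep the final score on the assumed 0-100 scale.
    return max(0.0, min(100.0, base_score + nudge))


def tier(score: float) -> str:
    """Map a score to a tier using illustrative cut-offs."""
    if score >= 75:
        return "hot"
    if score >= 40:
        return "warm"
    return "cold"


# Even an extreme nudge cannot promote a cold-range lead to hot.
nudged = score_lead(base_score=30.0, llm_nudge=999.0)
```

The deterministic score carries the verdict; the model can tip a borderline lead but never override the rules.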

Key findings

GTM triage tier accuracy 62.9% (22/35 holdout); Hot precision 1.000 with zero false-hot — it never over-promotes a cold lead
Outbound grounding: 0% fabrication across 24 drafts (12 companies × 2 variants) — hard and soft hallucination both zero
Engine validated on FinQA (n=200): 68.5% strict / 75.5% lenient end-to-end; retrieval R@5 0.86, MRR@3 0.74, nDCG@5 0.78
Enrichment waterfall (PDL → Apollo → Brave → website → Productboard) puts a source on every fact; the engine returns a partial result with evidence rather than inventing what isn't there
Full observability: every step, tool call, and observation written to a SQLite trace and Langfuse; deployed frontend on Vercel, API on Render
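The waterfall idea from the findings above, sketched under assumptions: providers are tried in priority order, the first hit is returned with its source, and a miss is reported as a miss. The stub providers and their data are made up; real clients for PDL or Apollo would replace them.

```python
from typing import Callable, Optional

# A provider takes a company domain and returns a value, or None if it has nothing.
Provider = Callable[[str], Optional[str]]


def enrich_field(domain: str, providers: list[tuple[str, Provider]]) -> dict:
    """Try providers in priority order; return the first hit with its source."""
    for name, lookup in providers:
        value = lookup(domain)
        if value is not None:
            return {"value": value, "source": name}
    # No provider had the fact: abstain instead of inventing one.
    return {"value": None, "source": None}


# Stub providers with hypothetical data, standing in for real API clients.
def pdl_stub(domain: str) -> Optional[str]:
    return None


def apollo_stub(domain: str) -> Optional[str]:
    return "51-200" if domain == "example.com" else None


waterfall = [("PDL", pdl_stub), ("Apollo", apollo_stub)]
found = enrich_field("example.com", waterfall)
missing = enrich_field("unknown.test", waterfall)
```

`found` carries both the value and the provider that supplied it; `missing` stays empty so a downstream draft has nothing unsupported to quote.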

Stack

Engine — Python · reason-act-observe loop (direct SDK, no LangChain/LangGraph) · BM25 + all-MiniLM-L6-v2 · ChromaDB · RRF + flashrank rerank · DuckDB · pdfplumber/Camelot/PyMuPDF · Vega-Lite · Pydantic v2 · SQLite trace · Streamlit
Platform — FastAPI · Next.js 16 / React 19 / Tailwind 4 · Postgres (Neon) · HubSpot CRM · People Data Labs · Apollo · Brave · Productboard · MCP server · Langfuse · Vercel + Render
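For readers unfamiliar with the RRF step in the engine stack, a small sketch of reciprocal rank fusion over a lexical and an embedding ranking. The constant k=60 is the commonly used default and the document IDs are made up; the engine's actual parameters are not stated here.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]   # hypothetical lexical ranking
dense_hits = ["b", "c", "a"]  # hypothetical embedding ranking
fused = rrf([bm25_hits, dense_hits])
```

Document `b` comes out on top because it ranks well in both lists, which is the behavior fusion is meant to reward.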

Live demo → Repo →

Meridian — Forensic RAG Measurement Framework

May 2026

A two-layer forensic measurement framework for retrieval-augmented generation. Layer 1 is a deterministic span-overlap taxonomy — no LLM in the loop — that scores retrieval against human-annotated character offsets across four legal corpora (~79M characters, ~800 evaluation queries). Layer 2 is a separate LLM-judged layer for answer correctness and groundedness, isolated so a metric shift is never confused with a pipeline change. The measurement layer is the contribution; the retrieval pipeline is the proving ground.
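A sketch of character-level span-overlap scoring, to make Layer 1 concrete. It assumes half-open `(start, end)` offsets within a single document and reports precision and recall; Meridian's actual taxonomy is richer than these two numbers.

```python
Span = tuple[int, int]  # half-open character range [start, end)


def merge(spans: list[Span]) -> list[Span]:
    """Merge overlapping spans so no character is counted twice."""
    merged: list[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_chars(a: list[Span], b: list[Span]) -> int:
    """Count the characters covered by both span sets."""
    total = 0
    for s1, e1 in merge(a):
        for s2, e2 in merge(b):
            total += max(0, min(e1, e2) - max(s1, s2))
    return total


def span_precision_recall(retrieved: list[Span], gold: list[Span]) -> tuple[float, float]:
    """Precision and recall measured in characters, with no model in the loop."""
    hit = overlap_chars(retrieved, gold)
    retrieved_len = sum(end - start for start, end in merge(retrieved))
    gold_len = sum(end - start for start, end in merge(gold))
    precision = hit / retrieved_len if retrieved_len else 0.0
    recall = hit / gold_len if gold_len else 0.0
    return precision, recall


precision, recall = span_precision_recall(retrieved=[(0, 100)], gold=[(50, 150)])
```

Note that an empty `(0, 0)` span scores zero on both metrics without raising, so a default like that can silently zero a whole run unless something checks for it.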

What it proves

A deterministic measurement anchor rigorous enough to catch its own bugs — and isolated from the pipeline so you always know what moved and why. The two-layer separation means a metric change is never confused with a pipeline improvement. The deterministic Layer 1 can be run against an external published baseline to verify the ruler itself, not just the system it measures.

Key findings

Retrieval Precision@1 at 2–20× the published ZeroEntropy baseline (full optimized system vs. their dense-only baseline); 3 of 4 corpora calibrated to within a ~2–3pp drift floor
Answer accuracy lifted from 26% to 75% on contracts; +7.9pp average across four corpora after tuning the retrieval config
Four silent measurement bugs caught before shipping — a model-ID alias routing every "Pro" run to the cheaper model, a (0,0)-span default zeroing all retrieval metrics, a channel mismatch, and a chunk-size confound — each surfaced by verify-before-trust checks the layer separation made possible
Retrieval stack transferred zero-shot to NFCorpus (3,633 medical IR documents), beating classic BEIR baselines at nDCG@10 0.399

Stack

Python · Qdrant · voyage-4-large · BM25 · hybrid retrieval · convex-combination fusion · document routing + LLM selector · deterministic span-overlap taxonomy · DeepSeek (synthesis) · Langfuse · LegalBench-RAG (ContractNLI, CUAD, MAUD, PrivacyQA)

8-page research → Repo →

Aegis — Deterministic Patch Verifier + Eval Harness

Jun 2026

A deterministic verifier that judges whether a security patch genuinely closes a vulnerability instead of just passing tests — built from orthogonal signals that each defeat a named gaming class, with calibrated abstention. Validated on real CVEs (MLflow, LibreChat) on live Docker targets, and measured like a classifier against genuine-vs-gamed patches. The verifier is the contribution; the agent and harness are how it gets stress-tested. Extends Meridian's forensic measurement discipline from retrieval metrics to reward integrity.

What it proves

A passing test suite is not proof a vulnerability is fixed. Aegis scores a patch with orthogonal signals that each defeat a named gaming class — deleting the feature, tampering the test, patching the literal payload while leaving the technique live — and abstains when the signals disagree rather than guessing. Because it's evaluated like a classifier against a labeled gold set of genuine fixes and deliberately gamed solutions, "it catches reward hacking" is a claim with a number behind it, not an assertion.
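A sketch of a three-way verdict over independent signals. The exact decision rule is not spelled out here, so the rule below is an assumption: any concrete failure rejects the patch, all passes accept it, and an inconclusive signal leads to abstention. The signal names are hypothetical.

```python
from typing import Optional


def verdict(signals: dict[str, Optional[bool]]) -> str:
    """Each signal is True (passed), False (failed), or None (inconclusive)."""
    if not signals:
        return "abstain"  # nothing was measured, so nothing can be claimed
    if any(result is False for result in signals.values()):
        return "gamed"    # a concrete failure is evidence the fix is not real
    if all(result is True for result in signals.values()):
        return "genuine"
    return "abstain"      # no failure, but not every signal could confirm


# Deleting the feature blocks the exploit but breaks normal behavior.
feature_deleted = verdict(
    {"exploit_blocked": True, "fuzz_variants_blocked": True, "happy_path_intact": False}
)
# Patching only the literal payload leaves mutated variants working.
payload_only = verdict(
    {"exploit_blocked": True, "fuzz_variants_blocked": False, "happy_path_intact": True}
)
# The fuzzer could not run, so the verifier declines to decide.
inconclusive = verdict(
    {"exploit_blocked": True, "fuzz_variants_blocked": None, "happy_path_intact": True}
)
```

Each signal exists to catch one way of gaming the test suite, which is why a single pass on the original exploit is never enough on its own.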

Key findings

Genuine-vs-gamed detection at 100% precision with zero false positives; fuzzing lifts recall to 100% (+45pp on the Windows set, +15.4pp on Linux) over enumeration alone
Validated on real CVEs — MLflow (CVE-2024-1558) and LibreChat (CVE-2024-11170) on live Docker targets — and caught a maintainer's fix that passes the full suite yet leaves the bug open
Benchmark-agnostic parallel eval harness (CVE-Bench on Inspect, BountyBench), process-isolated for clean concurrent runs
Controlled bare-vs-scaffold agent study returned an honest null result — isolating the real bottleneck instead of chasing a headline number
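"Measured like a classifier" reduces to arithmetic over a labeled gold set. In this sketch the positive class is assumed to be "gamed" and an abstention counts as a miss for recall; both choices are assumptions, and the labels are made up.

```python
def precision_recall(predicted: list[str], gold: list[str], positive: str = "gamed"):
    """Precision and recall for one class; any non-positive prediction is a non-flag."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == positive and g == positive for p, g in pairs)
    fp = sum(p == positive and g != positive for p, g in pairs)
    fn = sum(p != positive and g == positive for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gold = ["gamed", "gamed", "genuine", "genuine"]          # hypothetical labels
predicted = ["gamed", "abstain", "genuine", "genuine"]   # hypothetical verdicts
precision, recall = precision_recall(predicted, gold)
```

With no genuine patch flagged, precision is perfect, and the abstention on a gamed patch shows up as lost recall. That gap is the kind an extra signal such as fuzzing can close.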

How it extends Meridian

Meridian asked: does the retrieval metric mean what you think it means? Aegis asks the same question of a security reward signal: does a passing test mean the bug is actually fixed? The verifier is scored against an external labeled ground truth, which is the same verification discipline applied to a patch-acceptance decision rather than a retrieval benchmark.

Stack

Python · inspect_ai (UK AISI) · Docker / Kali sandbox targets · behavioral exploit oracle · grammar + mutation fuzzer · happy-path regression · three-way scoring with calibrated abstention · process-isolated parallel harness · CVE-Bench · BountyBench · DeepSeek V4 Flash · GCP Compute Engine

Repo → Verifier report → Infra report → Agent report →

Polymarket Trading Bot Autopsy

Apr–Jun 2026

A 45-day systematic trading bot project documenting how measurement bugs made paper backtests anti-predictive of live performance.

What it proves

Forensic methodology applied to a real production system — from data pipeline, to LLM-driven analysis, to live execution, to public technical writeup. Built the system, ran it, broke it, documented why.

Key findings

Paper PnL inflated ~135× by three measurement bug classes — reversing real-world performance rankings
The bots that paper trading ranked best performed worst live
2,409 wallets classified via 3-layer LLM pipeline (Haiku → Sonnet → Opus)
180 paper bots, 417,008 simulated trades
Tiered LLM cost: $30–40 total vs ~$150 on Opus alone
Deployed real capital across 6 architectures over 35 hours of live execution
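A sketch of how a tiered classifier can keep cost down: cheaper tiers answer first and only uncertain cases escalate. The confidence threshold, the escalation rule, and the stub classifiers and labels are assumptions for illustration; the real pipeline's routing may differ.

```python
from typing import Callable

# A tier is (name, classify), where classify returns (label, confidence).
Tier = tuple[str, Callable[[dict], tuple[str, float]]]


def classify_tiered(wallet: dict, tiers: list[Tier], threshold: float = 0.8) -> tuple[str, str]:
    """Return (label, tier_name) from the first tier that is confident enough."""
    label, name = "unknown", "none"
    for name, classify in tiers:
        label, confidence = classify(wallet)
        if confidence >= threshold:
            break  # confident enough, so skip the more expensive tiers
    # If no tier is confident, the last (strongest) tier's answer is kept.
    return label, name


# Stub classifiers standing in for model calls (hypothetical labels and logic).
def small_model(wallet: dict) -> tuple[str, float]:
    if wallet["trades"] > 1000:
        return "market_maker", 0.9
    return "unclear", 0.3


def large_model(wallet: dict) -> tuple[str, float]:
    return "directional", 0.85


tiers = [("small", small_model), ("large", large_model)]
easy = classify_tiered({"trades": 5000}, tiers)
hard = classify_tiered({"trades": 12}, tiers)
```

The easy wallet never reaches the expensive tier, which is where a tiered budget saves money compared with sending every wallet to the largest model.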

Stack

Python · async websocket pipelines · Hyperliquid + Polymarket + Binance APIs · SQLite · tiered LLM classification (Anthropic SDK)

Autopsy repo → Live execution repo → 2-page summary → 15-page autopsy →

Production RAG Stack Forensics

2026

Forensic study of a production RAG system (LangGraph, Pinecone, Langfuse, and cross-provider inference) over the FastAPI documentation corpus. Extends the Polymarket autopsy methodology to the stack AI engineering teams actually ship on, with a custom MCP server exposing the instrumentation as agent-queryable tools.

What it proves

Forensic methodology generalizes from a trading system I built to the production AI stack companies hire for — industry-standard tools throughout, with measurement-driven analysis of where the abstractions help and where they get in the way. The sharpest finding was in the measurement itself: an evaluation bug that nearly shipped a wrong conclusion.

Key findings

Documented four failure modes with measured fixes against a 150-question eval suite
Caught an evaluation bug scoring answers against empty retrieved content — it had inverted a cross-provider result; the fix reversed it, preventing a wrong published finding
With strong retrieval in place, generation capability wasn't the bottleneck: a 0.23 faithfulness spread across providers with a 28× cost difference
Custom MCP server exposes eval results, retrieval latency, failure-mode breakdown, and per-stage cost as agent-queryable tools

Stack

Python · LangGraph · Pinecone · Langfuse · Anthropic / OpenAI / Google APIs · FastAPI · MCP (Model Context Protocol)

Repo → 5-page autopsy → MCP demo →

ChainTax — Crypto Tax Engine

Feb–Mar 2026

Cross-source data ingestion and reconciliation pipeline turning hundreds of thousands of crypto events into auditable IRS filings.

What it proves

Deterministic data engineering on adversarially messy inputs — multiple APIs, multiple chains, multiple semantic conventions — collapsed into a single reproducible pipeline with auditable output.

Key findings

540K+ events processed in a single pipeline run
Cross-chain bridge detection prevents double-counting the same dollar across chains
FIFO lot matching with SpecID retrospective comparison
Three competing IRS funding treatments produce three Form 8949 variants per run
Section 1092 offsetting-position detection across spot, perp, and correlated pairs
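FIFO lot matching, one of the items above, in miniature. This is a sketch with floats and made-up numbers; an engine producing real filings would more likely use `Decimal` and also track acquisition dates, fees, and holding periods.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Lot:
    quantity: float
    unit_cost: float  # USD per unit at acquisition


def fifo_gain(lots: deque, sell_qty: float, sell_price: float) -> float:
    """Consume the oldest lots first and return the realized gain in USD."""
    gain = 0.0
    remaining = sell_qty
    while remaining > 0:
        if not lots:
            raise ValueError("sale exceeds available lots")
        lot = lots[0]  # the oldest lot sits at the left end of the queue
        used = min(lot.quantity, remaining)
        gain += used * (sell_price - lot.unit_cost)
        lot.quantity -= used
        remaining -= used
        if lot.quantity == 0:
            lots.popleft()  # lot fully consumed
    return gain


# Hypothetical history: buy 1 unit at $100, then 2 units at $200.
lots = deque([Lot(1.0, 100.0), Lot(2.0, 200.0)])
gain = fifo_gain(lots, sell_qty=2.0, sell_price=300.0)
```

Selling 2 units consumes the whole $100 lot and half of the $200 lot, leaving one unit at a $200 basis. A SpecID comparison would rerun the same sale with a different lot-selection rule.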

Stack

Python · Hyperliquid API · Alchemy + DeBank APIs · pandas · Streamlit · deterministic multi-source reconciliation

Repo →

Contact

Get in touch

Email cody.lee.cl1@gmail.com

GitHub github.com/clee12111

Resume Cody_Lee_Resume.pdf