CODY LEE

Building end-to-end LLM pipelines on messy real-world data — retrieval, agent loops, evaluation, cost-tiered model routing.

Forensically documenting where they break across trading systems, financial documents, and codebases.

CS @ UC Riverside, graduating June 2026 · Los Angeles · Open to relocation

Four projects, one through-line

All four deal with messy data. Each project taught me something the next one needed. The arc isn't a plan — it's what happened.

  • 01 ChainTax · Feb–Mar 2026 · messy data is the real problem
  • 02 Aether · Apr 2026 · LLM workflow + retrieval
  • 03 Polymarket · Apr–Jun 2026 · LLM workflow + forensics
  • 04 vLLM Forensics · In Progress
Each carries the previous project's hard-won lesson forward.
01 ChainTax Feb–Mar 2026

Started with a crypto tax engine. The work surfaced the real problem: reconciling data across exchanges, wallets, and chains is enormously painful, and the difficulty has nothing to do with tax law. Messy data is the engineering problem.

02 Aether Apr 2026

Took the messy-data lesson into LLM territory. Built a workflow engine and retrieval pipeline that handles structurally inconsistent financial documents — hybrid retrieval, agent loops, audit trails, and cost-tiered model routing engineered against a real eval suite.

03 Polymarket Autopsy Apr–Jun 2026

Applied the LLM-workflow toolkit at scale: a 3-layer classification pipeline over thousands of trader wallets, feeding 180 paper bots through more than 417,000 simulated trades. Six architectures then traded real capital. The autopsy documents why paper performance was anti-predictive of live performance.

04 vLLM Retrieval Forensics In Progress, 2026

Combines the previous three: the messy-data lesson from ChainTax, the LLM workflow from Aether, and the forensic methodology from Polymarket, applied to LLM-based code retrieval over the vLLM codebase. Failures are documented stage by stage as the project proceeds.

In conviction order

Polymarket Trading Bot Autopsy

Apr–Jun 2026

A 45-day systematic trading bot project documenting how measurement bugs made paper backtests anti-predictive of live performance.

scrape → classify → simulate → deploy → diagnose
  • Scrape: Dune, 3,675 wallets
  • Classify: 3-layer LLM (Haiku → Sonnet → Opus), 2,409 classified
  • Simulate: paper simulation, 180 bot variants, cross-source data, 417K trades
  • Deploy: live deployment, 6 architectures, real capital, 35 hrs runtime
  • Diagnose: autopsy, 3 bug classes found, 135× inflation
What it proves
Forensic methodology applied to a real production system — from data pipeline, to LLM-driven analysis, to live execution, to public technical writeup. Built the system, ran it, broke it, documented why.
Key findings
  • Paper PnL inflated ~135× by three measurement bug classes
  • The bots that paper trading ranked best performed worst live
  • 2,409 wallets classified via 3-layer LLM pipeline (Haiku → Sonnet → Opus)
  • 180 paper bots, 417,008 simulated trades
  • Tiered LLM cost: $30–40 total vs ~$150 on Opus alone
  • Deployed real capital across 6 architectures over 35 hours of live execution
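The Haiku → Sonnet → Opus tiering listed above can be sketched as confidence-based escalation: try the cheapest tier first and pay for a stronger one only when the answer is not confident enough. This is a minimal sketch, not the project's code. The `Tier` type, the 0.8 threshold, the relative costs, the labels, and the stub classifiers are all hypothetical; real tiers would call models through the Anthropic SDK.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A classifier maps a wallet summary to (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


@dataclass
class Tier:
    name: str
    classify: Classifier
    cost_per_call: float  # hypothetical relative cost, not real pricing


def classify_tiered(item: str, tiers: List[Tier], threshold: float = 0.8):
    """Try tiers cheapest-first and escalate while confidence is below threshold.

    Returns (label, name of the tier that answered, total cost spent).
    If no tier is confident, the last tier's answer is returned as-is.
    """
    spent = 0.0
    label, tier_name = None, None
    for tier in tiers:
        label, confidence = tier.classify(item)
        spent += tier.cost_per_call
        tier_name = tier.name
        if confidence >= threshold:
            break  # confident enough, so skip the more expensive tiers
    return label, tier_name, spent


# Stub classifiers standing in for real model calls (illustration only).
def cheap(item: str) -> Tuple[str, float]:
    return ("market_maker", 0.95) if "quotes" in item else ("unknown", 0.3)


def middle(item: str) -> Tuple[str, float]:
    return ("momentum", 0.85) if "trend" in item else ("unknown", 0.4)


def strong(item: str) -> Tuple[str, float]:
    return ("discretionary", 0.9)


TIERS = [
    Tier("haiku", cheap, 1.0),
    Tier("sonnet", middle, 3.0),
    Tier("opus", strong, 15.0),
]

easy = classify_tiered("posts two-sided quotes", TIERS)
medium = classify_tiered("follows the trend", TIERS)
hard = classify_tiered("irregular activity", TIERS)
```

Under these assumptions, easy cases stop at the first tier and only ambiguous ones pay for every tier, which is the shape of saving the cost bullet above describes.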
Stack
Python · async websocket pipelines · Hyperliquid + Polymarket + Binance APIs · SQLite · tiered LLM classification (Anthropic SDK)

Aether — LLM Workflow Engine

Apr 2026

Hybrid RAG pipeline over financial compliance documents with a planner/executor/critic agent loop and full audit trails.

ingest → retrieve → rerank → reason → audit
  • Ingest: documents + chunking
  • Retrieve: hybrid retrieval, BM25 + dense (MiniLM-L6), ChromaDB
  • Rerank: RRF + flashrank cross-encoder, 96% precision
  • Reason: agent loop, Opus planner, Haiku critic, $0.65/run
  • Audit: audit trail + eval suite + output, 87% e2e
What it proves
Retrieval architecture decisions made deliberately — sparse + dense fusion, cross-encoder reranking, cost-tiered model routing — measured against a real eval suite with perfect reproducibility on the retrieval side.
Key findings
  • Retrieval precision: 96% on the eval suite, perfectly reproducible across 5 runs
  • End-to-end correctness 87% at $0.65 per run
  • Tiered model routing: Opus plans, Haiku critiques, executor LLM-free
  • Engine/domain separation — domain knowledge enters through configuration, architecture generalizes beyond financial documents
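The sparse + dense fusion step can be illustrated with Reciprocal Rank Fusion, which scores each document by summing 1 / (k + rank) over every ranking it appears in. This is a minimal sketch rather than Aether's implementation: the function name, the document ids, and the two example rankings are hypothetical, and k = 60 is a commonly used default, not necessarily the value used in the project.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Documents missing from a list add nothing for it.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical top-3 results from a sparse (BM25) and a dense retriever.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because only ranks are used, the BM25 and embedding scores never need to be put on a common scale. The fused list would then go to the cross-encoder reranker.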
Stack
Python · BM25 (rank_bm25) + all-MiniLM-L6-v2 dense embeddings · ChromaDB · Reciprocal Rank Fusion · flashrank cross-encoder (ms-marco-MiniLM-L-12-v2) · Anthropic SDK · DuckDB

ChainTax — Crypto Tax Engine

Feb–Mar 2026

Cross-source data ingestion and reconciliation pipeline turning hundreds of thousands of crypto events into auditable IRS filings.

three sources → one ledger → three variants of the same truth
  • Sources: Hyperliquid API, Alchemy API, DeBank API
  • Ledger: reconciliation + bridge detection, 540K+ events
  • Lot matching: FIFO + SpecID, plus Section 1092
  • Treatments: §163(d) / basis-adj / §165(c)(2)
  • Output: 3× Form 8949 + partner PDF
What it proves
Deterministic data engineering on adversarially messy inputs — multiple APIs, multiple chains, multiple semantic conventions — collapsed into a single reproducible pipeline with auditable output.
Key findings
  • 540K+ events processed in a single pipeline run
  • Cross-chain bridge detection prevents double-counting the same dollar across chains
  • FIFO lot matching with SpecID retrospective comparison
  • Three competing IRS funding treatments produce three Form 8949 variants per run
  • Section 1092 offsetting-position detection across spot, perp, and correlated pairs
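FIFO lot matching, named in the findings above, consumes the oldest open lots first when a sale is recorded. The sketch below shows only that core idea under simplifying assumptions: the `Lot` type and `fifo_sell` function are hypothetical names, and fees, holding periods, SpecID selection, and Section 1092 adjustments are all left out.

```python
from collections import deque
from dataclasses import dataclass
from decimal import Decimal
from typing import Deque, List


@dataclass
class Lot:
    qty: Decimal        # units still open in this lot
    unit_cost: Decimal  # cost basis per unit


def fifo_sell(lots: Deque[Lot], qty: Decimal, unit_price: Decimal) -> List[Decimal]:
    """Match a sale against the oldest lots first.

    Returns the realized gain for each lot slice consumed and mutates
    `lots` in place so later sales see the remaining quantities.
    """
    gains: List[Decimal] = []
    remaining = qty
    while remaining > 0:
        if not lots:
            raise ValueError("sell quantity exceeds open lots")
        lot = lots[0]  # oldest lot sits at the left end of the deque
        used = min(lot.qty, remaining)
        gains.append(used * (unit_price - lot.unit_cost))
        lot.qty -= used
        remaining -= used
        if lot.qty == 0:
            lots.popleft()  # lot fully consumed
    return gains


# Hypothetical example: two buys, then a sale that spans both lots.
open_lots = deque([
    Lot(Decimal("1"), Decimal("100")),
    Lot(Decimal("2"), Decimal("150")),
])
gains = fifo_sell(open_lots, Decimal("2"), Decimal("200"))
```

`Decimal` is used instead of floats so that basis arithmetic stays exact, which matters when output has to be auditable.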
Stack
Python · Hyperliquid API · Alchemy + DeBank APIs · pandas · Streamlit · deterministic multi-source reconciliation

vLLM Retrieval Forensics · In Progress

2026

Forensic study of retrieval and answer generation over the vLLM codebase, extending the Polymarket autopsy methodology to LLM infrastructure systems.

Planned architecture; implementation begins late May 2026.
  • vLLM codebase + chunking
  • Hybrid retrieval: BM25 + dense + rerank → code chunks
  • Tiered inference: multi-model routing → answers
  • Stratified eval: stage-by-stage attribution → where it fails
  • Forensic case studies: published as work proceeds
What it proves
The forensic measurement methodology generalizes from trading systems to LLM systems themselves. Project planning published; implementation underway with first case study expected within 4–6 weeks.
Stack
Python · Anthropic SDK · vector retrieval · evaluation methodology

Get in touch