I have a personal ETL pipeline that pulls financial data from a remote source, parses it, and writes it into SQLite. One market (USA), single Python process, runs on a schedule. It has worked well enough for the few weeks it has been live.

A few weeks ago I saw Karpathy’s autoresearch idea: give an agent a codebase, a goal, and a way to measure progress, then let it loop. Each iteration proposes a change, runs a benchmark, and decides whether to keep or revert. You define what “better” means, the agent does the tedious part. I bookmarked it and moved on.
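The keep-or-revert loop is simple enough to sketch. This is a toy version, not pi-autoresearch's actual implementation: propose_change() stands in for the agent and benchmark() for the harness, and lower score means better.

```python
def autoresearch(state, propose_change, benchmark, iterations=10):
    """Toy keep-or-revert loop: try a change, keep it only if the benchmark improves."""
    best = benchmark(state)
    for _ in range(iterations):
        candidate = propose_change(state)   # agent proposes a change
        score = benchmark(candidate)        # measure it
        if score < best:                    # lower wall-clock time is "better"
            state, best = candidate, score  # keep the change
        # otherwise: discard the candidate, i.e. revert
    return state, best
```

The only decisions you make up front are the benchmark and the definition of "better"; everything else is mechanical.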

Today I found pi-autoresearch, which adapts the same concept for the Pi agent. That lowered the barrier enough to actually try it.

My ETL has two phases: download (FTP) and processing (parse CSV, transform, persist to SQLite). Download is network-bound and not interesting to optimize. Processing is all local computation and SQL, so I isolated it: a small harness that generates 50,000 synthetic rows in the same format the real pipeline expects, feeds them to process_data() with enrichment disabled, and measures wall-clock time. No network, no API calls. Just parse, transform, persist. I first used Gemini 3 Pro to set up the autonomous research loop, then once it was running successfully I switched to Gemini 3 Flash.
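A minimal sketch of what that harness looks like. Everything here is illustrative: the column layout, the row generator, and the process_data() stub are stand-ins for the real pipeline, not its actual code.

```python
import csv
import io
import random
import string
import time

def make_synthetic_rows(n):
    """Generate n CSV rows shaped like the real feed (illustrative columns)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["ticker", "price", "volume"])
    for _ in range(n):
        ticker = "".join(random.choices(string.ascii_uppercase, k=4))
        writer.writerow([ticker, f"{random.uniform(1, 500):.2f}", random.randint(0, 10**6)])
    return buf.getvalue()

def process_data(csv_text, enrich=False):
    """Stand-in for the real processing phase; here it just parses and counts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(1 for _ in reader)

def bench(n=50_000, rounds=3):
    """Average wall-clock time over a few rounds. No network, no API calls."""
    data = make_synthetic_rows(n)
    times = []
    for _ in range(rounds):
        t0 = time.perf_counter()
        process_data(data, enrich=False)  # enrichment disabled
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)
```

The point of the harness is that the number it prints is the whole fitness function: if it goes down, the change stays.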

Five commits came out. One did almost all the work. The original _upsert_stock_metadata updated rows using correlated subqueries – one UPDATE per column, each with a WHERE EXISTS (SELECT ...) against a temp table. On 50,000 rows SQLite evaluates the subquery for every row in the outer table. Classic O(n^2). The fix was UPDATE ... FROM (available since SQLite 3.33.0), which joins the temp table directly. One statement, one pass. That single change took the benchmark from ~397 seconds to ~0.7 seconds.
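The shape of the fix, in miniature. Table and column names here are illustrative, not the real schema; the version check falls back to the original correlated-subquery form on SQLite older than 3.33.0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stocks (ticker TEXT PRIMARY KEY, name TEXT, sector TEXT);
    CREATE TEMP TABLE staging (ticker TEXT, name TEXT, sector TEXT);
    INSERT INTO stocks VALUES ('AAPL', 'old', 'old'), ('MSFT', 'old', 'old');
    INSERT INTO staging VALUES ('AAPL', 'Apple Inc.', 'Tech');
""")

if sqlite3.sqlite_version_info >= (3, 33, 0):
    # The fix: one statement, one join pass over the temp table.
    conn.execute("""
        UPDATE stocks
        SET name = staging.name, sector = staging.sector
        FROM staging
        WHERE stocks.ticker = staging.ticker
    """)
else:
    # The original O(n^2) shape: one correlated subquery per column,
    # re-evaluated for every row of the outer table.
    for col in ("name", "sector"):
        conn.execute(f"""
            UPDATE stocks
            SET {col} = (SELECT {col} FROM staging
                         WHERE staging.ticker = stocks.ticker)
            WHERE EXISTS (SELECT 1 FROM staging
                          WHERE staging.ticker = stocks.ticker)
        """)

print(conn.execute("SELECT * FROM stocks ORDER BY ticker").fetchall())
# → [('AAPL', 'Apple Inc.', 'Tech'), ('MSFT', 'old', 'old')]
```

Both branches produce identical rows; only the work done to get there differs.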

The other four commits stacked smaller wins: vectorized numeric coercion with pd.to_numeric() instead of per-row parsing, usecols and engine='c' on the CSV reader, vectorized string ops for FIGI cleaning instead of a per-value function, and dead code removal. Together about 8% on top of the SQL fix. Real, but marginal by comparison.
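Sketches of those smaller wins. The column names and the FIGI cleaning rule (strip whitespace, uppercase) are assumptions for illustration, not the pipeline's actual logic.

```python
import io
import pandas as pd

raw = io.StringIO(
    "ticker,price,volume,figi,unused\n"
    "AAPL,189.30,1000, bbg000b9xry4 ,x\n"
    "MSFT,bad,2000,BBG000BPH459,y\n"
)

# usecols skips columns the pipeline never reads; engine='c' is the fast C parser.
df = pd.read_csv(raw, usecols=["ticker", "price", "volume", "figi"], engine="c")

# Vectorized numeric coercion: bad values become NaN in a single pass,
# instead of a try/except per row.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Vectorized string ops for FIGI cleaning instead of a per-value function.
df["figi"] = df["figi"].str.strip().str.upper()
```

Each of these trades a Python-level loop for a single call into pandas' C internals, which is where the few-milliseconds-each savings come from.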

50,000 synthetic tickers, 3 rounds each, enrichment disabled:

| Step | Avg | Speedup |
| --- | --- | --- |
| Baseline (pre-optimization) | 397.0s | 1.0x |
| After UPDATE FROM | 0.735s | 540x |
| + vectorize with pd.to_numeric | 0.697s | 570x |
| + usecols, engine='c', vectorize FIGI | 0.675s | 588x |

The baseline number looks absurd, and it is. Correlated subqueries against an unindexed temp table are quadratic in SQLite. In production, with ~2,000 symbols per run, it was slow but not obviously broken. At 50,000 rows the cost becomes impossible to miss.

The loop found a real bug I had been living with. I knew the ingest was slower than it should be, I just had not sat down to profile it yet. The loop got there in minutes and the fix was correct. The smaller changes are things I would not have bothered with manually – a few milliseconds each – but the loop tries them because it is cheap to try.

I would not point this at architectural work or cross-module refactors. But “make this function faster, here is a benchmark” is a well-scoped problem, and the diffs were clean enough that I merged them without edits. Gemini 3 Pro on Pi helped get the loop working, then Gemini 3 Flash took over for the actual iterations – cheap model, fast cycles. Bad attempt gets reverted, loop keeps going. This pattern is worth testing.