What is Parameter Golf, and why I spent a month on it

March 18th, 2026. OpenAI posts the rules for something called Parameter Golf.

Train the best language model that fits in a 16MB artifact and trains in under 10 minutes on 8xH100s.

I read it twice and almost closed the tab. I trained a GAN a long time ago. I'd picked up transformer basics here and there. I had never trained a language model. I had never rented a GPU by the hour. Every word in the announcement was slightly beyond my reach.

I decided to try anyway.

A month later, my submission is open on OpenAI's repository at openai/parameter-golf#1747 — a hair behind the world record on the public leaderboard.

What follows is that month, told as a map of the ideas I had to understand along the way: partial rotary embeddings, quantized weights, test-time training, and a handful of things I'd never heard of on March 17th. The rest of this series unpacks them one at a time. This post is about why the challenge itself became the best curriculum I've ever found for getting into modern language modeling.

The rules, in plain language

Parameter Golf is a narrow, beautifully specified contest. You submit a single Python file — train_gpt.py — plus a compressed model blob. Together, they have to weigh less than 16 megabytes. The file is run on a rented 8-GPU machine (8×H100, which costs about twenty dollars an hour), and your training script has ten minutes to turn random weights into a working language model. Then your model is scored on a held-out slice of the web — the FineWeb validation set — using a measure called bits-per-byte.

Bits-per-byte is a tidy idea once it clicks. Your model is given real text it has never seen, and at each step it assigns a probability to the token that actually comes next. If the model thought the real next token was likely, it "pays" few bits; if the real next token surprises it, it pays many (the price is the negative base-2 logarithm of that probability). Sum the bill across the entire validation set, divide by the number of bytes, and you have the average number of bits your model needed to describe each byte of real text. Lower is better. The naive baseline OpenAI shipped with the repo scored 1.2244. A month later, the top of the leaderboard was 1.0810.
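Here is a minimal sketch of that arithmetic, assuming the model reports a natural-log probability for each token that actually occurred. The function name and the toy numbers are mine, not the challenge's scoring code.

```python
import math

def bits_per_byte(token_logprobs, num_bytes):
    """Average number of bits the model needs per byte of text.

    token_logprobs: natural-log probabilities the model assigned to each
                    token that actually occurred.
    num_bytes:      length of the same text in UTF-8 bytes.
    """
    total_nats = -sum(token_logprobs)        # total surprise, in nats
    total_bits = total_nats / math.log(2)    # convert nats to bits
    return total_bits / num_bytes            # normalize by bytes, not tokens

# Toy example: 3 tokens covering 12 bytes, each predicted with probability 0.25.
# Each token costs 2 bits, so the bill is 6 bits over 12 bytes.
logprobs = [math.log(0.25)] * 3
print(bits_per_byte(logprobs, 12))
```

The denominator is bytes rather than tokens, so a tokenizer that splits the same text into fewer pieces gets no automatic discount.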

The gap between those two numbers is the whole game. And because the bill is divided by bytes rather than by tokens, the measure is tokenizer-agnostic: nobody can cheat by picking a friendly vocabulary, and all scores sit on the same absolute scale.

What makes the rule set unusual isn't any single constraint. It's the combination of all three. Sixteen megabytes means you can't have many parameters; ten minutes means you can't train for long; a fixed evaluation means you can't tune against the test. Every trick someone uses is squeezed through all three constraints at once. Which is, it turns out, a phenomenal way to force you to actually understand what each trick does.

The best curriculum I've ever found

I've tried to "learn modern transformers" three times in the last few years. Each time I read a handful of papers, felt smarter for a week, and forgot most of it. Parameter Golf broke that pattern, and I think I know why.

The rules are public, and so is the code. Every record submission on the leaderboard is a pull request. You can click through and read the exact working Python that beat every previous entry. Papers describe ideas — often after the fact, often in selective detail. This challenge gives you running implementations, in a repository you can clone, with discussions attached to the PRs explaining why each change was made.

The surface area is small enough to hold in your head. A training script in Parameter Golf is about a thousand lines. You can read it in one sitting. Compare that with trying to learn from a production training codebase: you drown before you learn anything.

The feedback loop is honest and quick. You train for ten minutes. You get a number. The number goes on a leaderboard. A better number or a worse number is the only arbiter. There's no benchmark gaming, no cherry-picked evaluation. Either your idea works or it doesn't, and you find out before lunch.

The incentive structure is generous. OpenAI put a million dollars of compute credits on the table so that newcomers — me included — could afford to try. I wasn't expected to show up with my own cluster.

All of that adds up to something I've never had before in machine learning: a concrete, bounded, self-scoring problem where every successful idea is already in front of me as working code.

The month, roughly

The arc of my month wasn't week-by-week. It was concept-by-concept. Four ideas I didn't understand on March 17th, each of which unlocked the next phase.

Phase 1: tokenization is a compression scheme

I started by reading the naive baseline train_gpt.py line by line. I didn't type a thing for two days. I just wanted to understand what a small transformer (eleven layers, 512 hidden dimensions, a 1024-token vocabulary) actually looked like as code.

The first real decision I made was to make the vocabulary bigger. The intuition, which I picked up from the SentencePiece paper and a Hugging Face tutorial, is that the vocabulary is itself a compression scheme: a bigger vocabulary breaks sentences into fewer, larger tokens, so the same number of tokens per training step covers more text. The catch is that the embedding table, one learned vector per vocabulary entry, scales linearly with vocab size, and it has to fit in that 16MB budget. At 4096 entries it was already a quarter of the artifact. I ended up at 8192, the sweet spot where the token reduction paid for the extra bytes.
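The linear scaling is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a hidden size of 512 (the baseline figure quoted above), reads "16MB" as decimal megabytes, and tries two storage widths. The bits-per-weight values are illustrative assumptions, not a statement of how any submission stores its embeddings.

```python
def embedding_bytes(vocab_size, hidden_dim, bits_per_weight):
    """Raw storage for one embedding table, before any further compression."""
    return vocab_size * hidden_dim * bits_per_weight // 8

BUDGET = 16 * 1000 * 1000  # assumption: 16 MB read as decimal megabytes

for bits in (16, 8):                      # hypothetical storage widths
    for vocab in (1024, 4096, 8192):
        size = embedding_bytes(vocab, hidden_dim=512, bits_per_weight=bits)
        print(f"{bits:>2}-bit  vocab={vocab:<5} {size:>9} bytes  "
              f"{size / BUDGET:.1%} of budget")
```

At 16 bits per weight, 4096 entries come to 4,194,304 bytes, roughly 26% of the budget, which is in line with the "quarter of the artifact" figure. Doubling the vocabulary doubles that cost unless the weights are stored more compactly.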

One stumble worth naming: I tried to retokenize the dataset the obvious way — load a shard into memory, run it through the new tokenizer, write it out — and my machine died. A single training shard is 191 megabytes of raw tokens. I had to rewrite the pipeline using memory-mapped files. I spent a weekend learning what np.memmap does.
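For anyone hitting the same wall, here is a minimal sketch of the streaming pattern, not my actual pipeline. It assumes a shard is a flat binary file of uint16 token ids, and `stream_shard` and its `transform` argument are hypothetical names. In a real retokenization the transform would decode each chunk to text and re-encode it with the new tokenizer, taking care at chunk boundaries.

```python
import os
import tempfile

import numpy as np

def stream_shard(in_path, out_path, transform, chunk_tokens=1_000_000,
                 dtype=np.uint16):
    """Apply `transform` to a token shard one chunk at a time.

    np.memmap maps the file into virtual memory, so only the slice being
    read is paged in rather than the whole shard.
    """
    src = np.memmap(in_path, dtype=dtype, mode="r")
    written = 0
    with open(out_path, "wb") as out:
        for start in range(0, len(src), chunk_tokens):
            chunk = np.asarray(src[start:start + chunk_tokens])
            result = np.asarray(transform(chunk), dtype=dtype)
            result.tofile(out)            # append this chunk's output
            written += len(result)
    return written

# Usage on a tiny fake shard; "keep every second token" stands in for a
# real retokenizer, which would also emit fewer tokens than it reads.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "shard_in.bin")
dst_path = os.path.join(tmp_dir, "shard_out.bin")
np.arange(10, dtype=np.uint16).tofile(src_path)

written = stream_shard(src_path, dst_path, lambda c: c[::2], chunk_tokens=4)
result = np.fromfile(dst_path, dtype=np.uint16)
print(written, result)
```

The output is appended chunk by chunk instead of being preallocated because the new token count isn't known until the whole shard has been processed.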

Phase 2: test-time training is not cheating

My first working submission was a non-record — val_bpb 1.1573, mid-table at the time. But it was on OpenAI's repository, under my GitHub handle, and that mattered more than the score. I got it by adding a trick called LoRA test-time training: at evaluation time, after the model scores each chunk of text, it briefly fine-tunes itself on that chunk before predicting the next one. The first time I read about it I was sure it was cheating. It isn't — you're only training on tokens the model has already been graded on, not on tokens it will be scored against. The research lineage is well worth a read on its own.
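The ordering is the whole argument, so it helps to see it in code. The toy below is not LoRA and not my submission: it uses a unigram model over byte values as a stand-in, and the function name and learning rate are assumptions. What it does show is the score-first rule, where each chunk is graded with the current weights and only afterwards used for a gradient step.

```python
import numpy as np

def score_then_adapt(chunks, vocab_size=256, lr=1.0):
    """Toy score-first test-time training on a unigram model.

    Each chunk is scored with the current weights BEFORE the weights are
    updated on it, so no token is graded by a model that has already seen it.
    Returns the average bits per token over all chunks.
    """
    logits = np.zeros(vocab_size)   # stand-in for the adapter weights
    total_bits, total_tokens = 0.0, 0
    for chunk in chunks:
        chunk = np.asarray(chunk)
        # 1) Score: log-probabilities under the current weights.
        log_probs = logits - np.log(np.sum(np.exp(logits)))
        total_bits += -np.sum(log_probs[chunk]) / np.log(2)
        total_tokens += len(chunk)
        # 2) Adapt: one gradient step on the chunk that was just scored.
        counts = np.bincount(chunk, minlength=vocab_size)
        grad = np.exp(log_probs) - counts / len(chunk)  # d(mean NLL)/d(logits)
        logits -= lr * grad
    return total_bits / total_tokens

data = [list(b"aaaa bbbb aaaa ")] * 20
print("frozen :", score_then_adapt(data, lr=0.0))
print("adapted:", score_then_adapt(data, lr=1.0))
```

With `lr=0.0` the model never changes and pays the uniform cost of 8 bits per byte. With adaptation the average drops, even though every chunk was scored before the model trained on it.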

One stumble: my first pull request included all my local experiment files, because I hadn't yet learned how to keep a clean submission branch. I had to rebuild it from scratch off openai/parameter-golf:main and re-submit. Nobody on the team made me feel dumb about it, which I still appreciate.

Phase 3: the leaderboard is a curriculum

The most counter-intuitive thing about this challenge is that every record submission is public, working code. I wrote a forty-line Python script that could pull down a winning submission, decompress the packed model blob (the submissions use LZMA plus base85 encoding), and leave me with the full train_gpt.py someone had used to beat everyone else.
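The unpacking step itself is only a few lines of standard library. This sketch assumes the packed text is LZMA-compressed data that was then base85-encoded with Python's `b85` alphabet. The exact layout of any given submission may differ, and the function names are mine.

```python
import base64
import lzma

def pack_source(source: str) -> bytes:
    """Compress text with LZMA, then encode the bytes as base85."""
    return base64.b85encode(lzma.compress(source.encode("utf-8")))

def unpack_source(blob: bytes) -> str:
    """Reverse of pack_source: base85-decode, then LZMA-decompress."""
    return lzma.decompress(base64.b85decode(blob)).decode("utf-8")

# Round trip on a stand-in for a training script.
original = "print('hello from train_gpt.py')\n" * 50
blob = pack_source(original)
restored = unpack_source(blob)
print(len(original), len(blob), restored == original)
```

If a blob fails to decode this way, `base64.a85decode` (the Ascii85 variant) is the other standard-library option to try.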

Then I read them. Over and over. The same three or four ideas kept recurring at the top of the leaderboard: a quantization scheme called GPTQ, score-first test-time training, partial rotary position embeddings, and depth recurrence. Once I'd seen a name three times, I went and read the paper.

The shift in Phase 3 wasn't about writing new code. It was about stopping being a spectator. I ran ablations: take one idea from the leading submission, turn it off, retrain, measure the hit. Four ablations cost me about fifty dollars of GPU time. They told me, clearly, which single idea was worth porting into my own stack.

Phase 4: partial RoPE is obvious, once you see it

The change that got me near the top of the leaderboard was partial rotary embeddings. Rotary embeddings (RoPE) are how many modern transformers encode position: instead of adding a position vector to each token, you rotate the query and key vectors by a position-dependent angle, and the dot product between them ends up depending on position only through their relative distance. It's elegant.

What I didn't know before reading the SOTA submission is that you don't have to rotate every dimension. You can rotate only 16 out of 64 head dimensions and leave the other 48 untouched. The intuition, once I got it, felt obvious: position is real information, but it doesn't need all of the head's capacity to carry. Leaving 48 dimensions unrotated lets them specialize on content instead. I ran an ablation first (turned RoPE off entirely on an already-trained model, watched the score get 82% worse) to convince myself the position signal mattered at all. Then I retrained with partial RoPE instead of full RoPE, and bits-per-byte went down, which is the direction you want.
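A small NumPy sketch makes the idea concrete. It rotates the first 16 of 64 head dimensions and passes the rest through unchanged. The half-split pairing, the frequency base of 10000, and the choice of which 16 dimensions get rotated are assumptions for illustration, not a copy of any submission's code.

```python
import numpy as np

def partial_rope(x, positions, rotary_dims=16, base=10000.0):
    """Rotate only the first `rotary_dims` features of each head vector.

    x:         array of shape (seq_len, head_dim)
    positions: integer array of shape (seq_len,)
    """
    rot, rest = x[:, :rotary_dims], x[:, rotary_dims:]
    half = rotary_dims // 2
    inv_freq = base ** (-np.arange(half) / half)       # one frequency per pair
    angles = positions[:, None] * inv_freq[None, :]    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[:, :half], rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, rest], axis=1)     # last dims untouched

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64))
k = rng.standard_normal((1, 64))

# Same relative distance (4) at two different absolute offsets.
near = partial_rope(q, np.array([3])) @ partial_rope(k, np.array([7])).T
far = partial_rope(q, np.array([103])) @ partial_rope(k, np.array([107])).T
print(np.allclose(near, far))                          # True
print(np.array_equal(partial_rope(q, np.array([5]))[:, 16:], q[:, 16:]))  # True
```

The first check shows the attention score depends only on the relative offset. The second confirms that the 48 unrotated dimensions carry no position signal at all.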

I trained three seeds to prove it wasn't luck. I shipped the submission. The PR is open right now.

Four ideas I didn't know a month earlier, one pull request on the world's most-watched ML repo, and a leaderboard entry that sits a hair behind the record.

What the rest of this series will cover

This post is the entry point. The other four go deep on the mechanics:

Post 2 — A ground-up explanation of attention. Queries, keys, values; why three matrices instead of one; what causal masking is. No competition context. A foundation you can read on its own.
Post 3 — What moved the leaderboard. A survey of the attention-side ideas that showed up repeatedly at the top: partial RoPE, parallel residuals, depth recurrence, sliding-window attention, LeakyReLU². What each one actually does, and why it helped.
Post 4 — Fitting 36 million parameters into 16 megabytes. The compression leg most outside commentary skips. GPTQ, SDClip, mixed int6/int8 quantization, the "bits budget" mental model.
Post 5 — Landing at 1.0820. My submission, end to end: the stack, the three-seed validation, the cross-region GPU hacks when the US was out of stock, and what decompressing competitor code taught me about my own blind spots.

Anyone can — because the on-ramp is there

I'm not going to pretend this was effortless. I burned real money on failed runs. I re-did my first pull request from scratch because I'd committed garbage. I learned what DevToolsActivePort is for reasons unrelated to the challenge and what np.memmap is for reasons very related. I had a weekend where nothing I tried moved the score and I wondered if I was one of those people who was going to flame out before shipping anything.

I kept going because the on-ramp was there, and I'd like to make the case that it is here for you too. The rules are bounded and public. The code is working and readable. The leaderboard doesn't care who you are. A thousand-dollar GPU grant can be requested on OpenAI's site. If you've been feeling like modern machine learning is a field that moved on without you, Parameter Golf is a concrete, unambiguous way to walk yourself back in.

The barrier to getting started on hard things is almost never intelligence. It's the absence of a clear on-ramp. Here, for once, the on-ramp is obvious. I'm going to spend the next four posts walking you up it.
