I implemented a subset of Assignment 1 from Stanford's CS 336 (Language Modeling from Scratch) involving Byte-Pair Encoding (BPE) tokenization. I expected a "cute algorithm + a couple unit tests." What I got was a surprisingly real systems problem: data structures, CPU caches, file I/O, and a lot of "why is this slower?" moments.
Check out my code here
The lecture and assignment notes gave a clear sense of why modern subword tokenizers are used in most language models - subword tokenization provides a decent tradeoff between a larger vocabulary size and better compression of the input byte sequence. When I first read the Byte-Pair Encoding (BPE) algorithm used by Sennrich et al. (2016), and its naïve example in the assignment notes, I was excited to try implementing the algorithm myself, without the help of Claude Code and Cursor.
This is one of the benefits of being a full-time student - you are allowed to work on interesting problems with no immediate useful rewards!
The naïve implementation of BPE is actually quite simple. You split the input corpus into pre-tokens, then iteratively merge adjacent byte-level tokens, starting from the most frequent token pair. Every time you merge a pair, it becomes a "sub-word" token, and every occurrence of that adjacent pair is replaced by the merged token. Then the next most frequent pair is picked up, merged, and the process repeats.
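To make that loop concrete, here is a minimal sketch of the naïve algorithm over a toy, already pre-tokenized corpus (the names and the tie-breaking rule are simplified; this is not the assignment's reference interface):

```python
from collections import Counter

def naive_bpe(word_counts: dict[tuple[bytes, ...], int], num_merges: int):
    """Learn BPE merges over pre-tokenized words (each word is a tuple of byte tokens)."""
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the whole corpus, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in word_counts.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair (ties: arbitrary here)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word, replacing occurrences of the best pair with the merged token.
        new_counts: dict[tuple[bytes, ...], int] = {}
        for word, freq in word_counts.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            key = tuple(out)
            new_counts[key] = new_counts.get(key, 0) + freq
        word_counts = new_counts
    return merges

# Toy usage: two pre-tokenized "words" with their corpus frequencies.
corpus = {(b"l", b"o", b"w"): 5, (b"l", b"o", b"w", b"e", b"r"): 2}
print(naive_bpe(corpus, 3))  # [(b'l', b'o'), (b'lo', b'w'), ...]
```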
Simple, right? Well, not quite. It took me a couple of days to get my algorithm to pass all the test cases. I immediately tried jumping to the most "optimal" solution (thanks to years of leetcoding) and could not have guessed that the actual solution would end up involving a complex mix of data structures, optimization techniques, parallelization, and file I/O and computer-architecture internals. Phew! The catch is that "count all adjacent pairs, then update everything" becomes brutal once your dataset is gigabytes and your vocab target is 32K.
I start from the 256 base byte tokens plus the special tokens (e.g., <|endoftext|>). Then I run

merges = vocab_size - (256 + #specials)

merge iterations to reach the target vocabulary size.
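As a quick sanity check on what that formula implies (assumed numbers: a 32K vocab target and a single special token):

```python
# Assumed setup: 32K target vocab and one special token (<|endoftext|>).
vocab_size = 32_000
num_specials = 1

merges = vocab_size - (256 + num_specials)
print(merges)  # 31743 merge iterations to run
```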
BPE correctness is fragile: if your pair counts drift even slightly, the merge order changes and unit tests explode.
The bug pattern I hit was something simple, yet an edge case I completely missed: repeated pairs inside a token sequence (e.g., A B A B A B). Naively updating "before/after" counts per occurrence can double-subtract or miss boundary pairs.
💡 My fix: compute the before/after adjacent-pair multiset per affected word, then apply a diff to pair_counts (pair_counts is a dictionary with token pairs as keys and their frequency in the corpus as values). It's more expensive than a purely local update, but it's very hard to get wrong. That got me past the unit tests and gave stable merges on TinyStories Val (243 merges) and beyond.
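Here is a minimal sketch of that diff-based update, applied to the exact edge case above (the helper names are illustrative, not the ones in my repo):

```python
from collections import Counter

def adjacent_pairs(word: tuple[bytes, ...]) -> Counter:
    """Multiset of adjacent token pairs in one word."""
    return Counter(zip(word, word[1:]))

def apply_merge_to_word(word, pair, merged):
    """Replace each non-overlapping occurrence of `pair` in `word` with `merged`."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def update_pair_counts(pair_counts: Counter, word, new_word, freq: int):
    """Subtract the word's old pair multiset and add the new one (the diff)."""
    for p, c in adjacent_pairs(word).items():
        pair_counts[p] -= c * freq
        if pair_counts[p] <= 0:
            del pair_counts[p]
    for p, c in adjacent_pairs(new_word).items():
        pair_counts[p] += c * freq

# The tricky case: A B A B A B, with this word appearing once in the corpus.
word = (b"A", b"B", b"A", b"B", b"A", b"B")
pair_counts = adjacent_pairs(word)                        # {(A,B): 3, (B,A): 2}
new_word = apply_merge_to_word(word, (b"A", b"B"), b"AB")
update_pair_counts(pair_counts, word, new_word, freq=1)
print(new_word)     # (b'AB', b'AB', b'AB')
print(pair_counts)  # Counter({(b'AB', b'AB'): 2})
```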
✨ Nerd note: if you want both correctness and speed at scale, the next step is to track pair occurrences (positions), and update only local neighborhoods around each merge (no full rescans of the word). This is quite doable and I am thinking of using Cursor for a quick reimplementation. Will update here.
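For illustration, the core of that occurrence index could look like this (a sketch of the data structure only; the merge bookkeeping around it is the hard part and is omitted):

```python
from collections import defaultdict

def build_pair_index(words: list[tuple[bytes, ...]]):
    """Map each adjacent pair to the set of (word_id, position) slots where it occurs."""
    index = defaultdict(set)
    for wid, word in enumerate(words):
        for i, pair in enumerate(zip(word, word[1:])):
            index[pair].add((wid, i))
    return index

words = [(b"l", b"o", b"w"), (b"l", b"o", b"w", b"e", b"r")]
index = build_pair_index(words)
# Merging (b"l", b"o") only needs to touch these two slots and their neighbors:
print(index[(b"l", b"o")])  # {(0, 0), (1, 0)}
```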
I tried two strategies for selecting the most frequent pair in each iteration:
- An O(N) linear scan over pair_counts to find the argmax. (I use the max() function in Python to perform a linear scan and choose the pair with maximum frequency.)
- A max-heap keyed on pair frequency.

Honestly speaking, when I saw the problem at hand - I have to iteratively select the most frequent token pair from the corpus - I thought using a max-heap was the obvious choice. The heap invariant keeps the most frequent element on top: it can be peeked at in O(1) time, popped in O(log N) time, and any new frequencies can be pushed in O(log N) time, where N is the heap size.
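For reference, the two selection strategies boil down to something like this (Python's heapq is a min-heap, so counts are negated; this is a sketch, and tie-breaking is ignored):

```python
import heapq
from collections import Counter

pair_counts = Counter({(b"t", b"h"): 50, (b"h", b"e"): 40, (b"i", b"n"): 30})

# Strategy 1: O(N) linear scan on every merge iteration.
best_pair = max(pair_counts, key=pair_counts.get)

# Strategy 2: max-heap via negated counts.
heap = [(-count, pair) for pair, count in pair_counts.items()]
heapq.heapify(heap)
peeked = heap[0][1]                          # peek the top: O(1)
neg_count, heap_best = heapq.heappop(heap)   # pop the top: O(log N)

print(best_pair, heap_best)  # both pick (b't', b'h') here
```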
The improvement works! (and then it doesn't). For the TinyStories datasets and the validation split of OWT, using a heap brings noticeable improvements in merging time. However, this behavior doesn't extend to the (much larger) training split of OWT.
⚠️ There's a problem here ⚠️
When I merge the most frequent pair, I need to update all of its occurrences, which means the counts of some existing pairs have to be decremented. The pair_counts dictionary handles this just fine. However, the stale entries for those pairs persist in the heap, and there is no cheap way to remove them from the middle of a heap. Yikes!
This means that when a pair's count decreases, its stale entry still sits in the heap even though I just pushed its updated count. It also means the heap grows a lot on larger datasets, because the stale entries explode.
Lazy validation is simple:

- When you pop an entry from the heap, compare its count against the current value in pair_counts.
- If they don't match, the entry is stale: discard it and pop again.
- If the count matches pair_counts, meaning you have popped the latest updated entry, the pair is safe to merge.
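In code, the lazy pop looks roughly like this (a sketch, again assuming negated counts on heapq's min-heap and ignoring tie-breaking):

```python
import heapq

def pop_valid_pair(heap, pair_counts):
    """Pop entries until one matches the live count in pair_counts (lazy validation)."""
    while heap:
        neg_count, pair = heapq.heappop(heap)
        if pair_counts.get(pair) == -neg_count:
            return pair, -neg_count   # fresh entry: safe to merge this pair
        # Otherwise it's stale junk left behind by an earlier count update; drop it.
    return None, 0
```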
The heap looks like an asymptotic win, but there's a systems caveat:

- heappop() can end up discarding huge piles of junk and become slower than the scan.
- The heap occasionally has to be rebuilt (from pair_counts) to prevent pathological slowdown.

The run logs show both sides of the story:
- 20260109_162315 → total 21018.43s, merge 20672.69s

Translation: "O(log N)" is real for argmax extraction, but your actual runtime is dominated by how much you mutate counts + how much garbage the heap accumulates.
Pre-tokenization is easy to parallelize because chunks can be processed independently. BPE merging is fundamentally sequential because merge (t+1) depends on the exact state after merge (t).
I parallelized pre-tokenization with multiprocessing:
- Split the corpus into chunks at <|endoftext|> boundaries (avoid cross-document merges).
- Each worker returns a Counter of pre-token byte strings.

This sped up pre-tokenization substantially, but I learned the hard way that memory blows up if you're careless about how the chunks are built and handed to workers.
The fix was "boring but effective":
- Stream chunks instead of materializing giant split() lists in memory.
- Recycle workers aggressively (maxtasksperchild=1).

A sketch of this setup follows the numbers below. Concrete pre-tokenization wins from the experimental runs:
Same pretokenization code, ~3.4× faster. That's multiprocessing doing its job.
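Here is a minimal sketch of the parallel pre-tokenization pattern described above (assumptions: PAT is a stand-in regex, not the real pre-tokenization pattern, and the helper names are mine):

```python
import re
from collections import Counter
from multiprocessing import Pool

PAT = re.compile(r"\S+|\s+")  # stand-in pre-tokenization regex, not the real one

def pretokenize_chunk(chunk: str) -> Counter:
    """Count pre-tokens (as UTF-8 byte strings) in one document chunk."""
    return Counter(m.group(0).encode("utf-8") for m in PAT.finditer(chunk))

def iter_chunks(text: str, special: str = "<|endoftext|>"):
    """Yield document chunks split at the special token, so no merge crosses a document."""
    for chunk in text.split(special):
        if chunk:
            yield chunk

def parallel_pretokenize(text: str, processes: int = 4) -> Counter:
    total = Counter()
    # maxtasksperchild=1 recycles each worker after one chunk, keeping memory in check.
    with Pool(processes=processes, maxtasksperchild=1) as pool:
        for partial in pool.imap_unordered(pretokenize_chunk, iter_chunks(text)):
            total.update(partial)
    return total

if __name__ == "__main__":
    text = "the cat sat<|endoftext|>the dog sat<|endoftext|>"
    print(parallel_pretokenize(text, processes=2).most_common(3))
```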
I saw runs where "identical" pre-tokenization code was 2× faster on the second attempt. The reason was not magic - just OS page cache + CPU scheduling + thermals. Benchmarking on laptops requires discipline.
You can see this effect in the OWT Val O(N) runs, whose runtime varied randomly between 33 and 38 minutes.
Same dataset and method; different run time. Your laptop is a noisy lab instrument.
I learned that a good representation of total tokenization time should be an average across 3-5 runs, excluding the first run. For all my experiments, I averaged the time taken for 3 runs after excluding the first cold run.
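The timing harness boils down to something like this (a sketch; train_bpe below is a placeholder for whatever function you're benchmarking):

```python
import time

def benchmark(fn, *args, warmup: int = 1, repeats: int = 3) -> float:
    """Run fn a few times, discard the cold run(s), and return the mean wall-clock time."""
    for _ in range(warmup):
        fn(*args)  # cold run: warms the OS page cache, not measured
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Usage (placeholder): benchmark(train_bpe, corpus_path, vocab_size=32_000)
```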
Below are the headline results from multiple experimental runs.
Without multiprocessing:
With multiprocessing:
Takeaway: on tiny inputs, all optimizations are basically "meh"; overhead dominates. Pre-tokenization is the bottleneck, and parallelization helps bring down pre-tokenization time!
Without multiprocessing:
With multiprocessing:
Takeaway: this is the "sweet spot" where heap + MP looks god-tier. Pre-tokenization is still the bottleneck. Parallelization helps bring down pre-tokenization time, and the heap brings down merging time!
With multiprocessing (these runs are already merge-dominated):
O(N) scan:

- run_1 → total 1983.55s (pretok 6.70s, merge 1976.85s)
- run_2 → total 2315.59s (pretok 7.48s, merge 2308.11s)

Heap:

- run_1 → total 107.74s (pretok 7.15s, merge 100.60s)
- run_2 → total 111.81s (pretok 7.20s, merge 104.61s)

Takeaway: on OWT Val, the heap wins massively (merge time drops from ~2k seconds to ~100 seconds).
With multiprocessing:
Takeaway: this is where the "heap version should be faster" intuition died. The algorithmic story is bigger than argmax selection.
- The stale entries pile up until len(heap) >> len(pair_counts).

On OWT Train, both methods are ~98% merge time. That screams: optimize the merge update mechanics.
If you want to go full nerd, the real performance roadmap is the one from the nerd note above: index pair occurrences by position and update only the local neighborhood around each merge, instead of rescanning every affected word.
That's how you get from "hours" to "minutes" on 32k merges.
On OWT, my longest learned tokens sometimes decode into weird byte soup (e.g., repeated "Ã/Ä" glyphs). That's not necessarily a bug: BPE operates over bytes and happily merges non-UTF8-friendly sequences if they're frequent.
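A quick illustration of why (an assumed example, not an actual learned token from my runs): each byte of a multi-byte UTF-8 character is its own base token, so a frequent-but-arbitrary merged byte sequence can decode into exactly this kind of glyph soup.

```python
# 0xC3 0x83 is the UTF-8 encoding of "Ã", and 0xC3 0x84 is "Ä".
token_bytes = bytes([0xC3, 0x83, 0xC3, 0x84, 0xC3, 0x83])
print(token_bytes.decode("utf-8"))  # ÃÄÃ

# Half of a multi-byte character is not even valid UTF-8 on its own.
print(bytes([0xC3]).decode("utf-8", errors="replace"))  # �
```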
Tokenization looks like an NLP detail until you implement it. Then it becomes a systems problem.
If I were to take this further, the next frontier is a more advanced merging implementation: positional occurrence tracking plus local-neighborhood updates, so each merge only touches the words it actually affects.
Tokenization is fun. Tokenization is pain. Tokenization is real engineering.