Dmytro Nikolaiev

Building smoltok - BPE Tokenizer in Rust

This post is a quick reflection on the motivation and journey behind the smoltok project I built recently. For a more technical overview and benchmark results, check out the project README.

TL;DR: smoltok is a byte-pair encoding tokenizer I built primarily to learn Rust, Python bindings, and more about tokenization. It's an educational project with a hands-on exercise if you want to build your own.

Why I Built This

In order of priority (though even the last one was genuinely interesting):

1. Practice Rust with Python Bindings

This is my first "real" Rust project. But more than just practicing the language, I was interested in a specific pattern: write performance-critical code in Rust, use it seamlessly from Python.

As someone coming from Python, I wanted to:

This pattern feels especially powerful for Python developers who don't want to abandon their ecosystem (or learn Rust wholesale) but need raw performance for certain tasks.

2. Learn More About Tokenization

Tokenization sits at the foundation of modern LLMs, and has its own famous set of quirks and challenges, so I was genuinely curious to learn more about it in a practical way.

There is a set of promising approaches, like the Byte Latent Transformer, adapting trained language models to be tokenization-free, and other quite different ideas, but apparently none of these have been adopted by a big lab so far, so here we are. See also the resources section in the README.md.

3. Try ty as a Type Checker

Both uv and ruff are amazing tools. Luckily for me, during development, ty appeared in beta as a type checker built by the same team.

Both pyrefly (which I initially tried) and ty are in beta, so you still need mypy in the stack, but I think it's beneficial to have a faster type checker for quicker feedback, e.g. use it for local development and use mypy in CI.

Not Everything Should Be Optimized

This might sound obvious, but building smoltok gave me yet another reminder that blindly optimizing doesn't work.

Look at the benchmark results in the README. For the small 1.2 MB Wikitext dataset, the parallel implementation is actually slower than single-threaded Rust. Parallelization overhead only pays off with larger inputs, and even then, the gains may not justify the added complexity.

Before smoltok, I actually wanted to build fast time series calculations with the same Rust + Python bindings pattern. After quick experimentation, I discovered my Rust bindings were slower than pure Python - not even NumPy, just native Python.

Why? Operations like sum() or mean() over large arrays already run as highly optimized C code inside the interpreter. The overhead of crossing the Python-Rust boundary, plus data marshalling, can outweigh any gains when the actual computation is trivial.
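You can see the same effect without leaving Python. The built-in `sum()` makes one call into C and loops there, while an element-by-element Python loop pays per-item interpreter overhead - the same kind of per-call cost a naive foreign-function boundary adds. A small timing sketch (the numbers will vary by machine; `manual_sum` is just an illustrative helper):

```python
import timeit

data = list(range(1_000_000))

# One call into the C implementation: the whole loop runs in C.
t_builtin = timeit.timeit(lambda: sum(data), number=10)

# Per-element work in the interpreter: analogous to crossing a
# language boundary once per item instead of once per call.
def manual_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

t_manual = timeit.timeit(lambda: manual_sum(data), number=10)

print(f"built-in sum: {t_builtin:.3f}s, manual loop: {t_manual:.3f}s")
```

On my machine the manual loop is several times slower, even though both do "the same" additions - the work per element is just too cheap to amortize the per-element overhead.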

So Rust isn't magic. It shines in certain situations, and BPE is actually a good fit for that: iterative merges with lots of dictionary lookups and array/string manipulation.
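To make that concrete, here is a minimal Python sketch of byte-level BPE training - not smoltok's actual implementation, just the shape of the work: count adjacent pairs, merge the most frequent pair into a new token, repeat. Every iteration is dictionary counting plus list rewriting, exactly the kind of tight loop that benefits from a compiled language.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent id pairs and return the most frequent one (or None)."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn a merge table {pair: new_id}; new ids start after the 256 bytes."""
    ids = list(text.encode("utf-8"))  # byte-level BPE starts from raw bytes
    merges = {}
    for new_id in range(256, 256 + num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges, ids
```

For example, training on `"aaabdaaabac"` first merges the most frequent pair `(97, 97)` (two `a` bytes) into token 256, then `(256, 97)` into 257.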

On Using AI to Code

You could probably vibe-code something like this in a few hours; I deliberately took time since this was my first Rust experience.

Rust is hard (but awesome!). Initially, having just started the Rust book, I thought I would vibe-code something and learn by exploring the code. It doesn't work. Rust is the kind of language where you really need to understand the fundamentals first (and I still probably don't šŸ™‚).

I used LLMs as learning companions rather than code generators (it's 2024, hah), asking questions like:

Several times I'd implement a feature with a code agent, verify it worked, review the implementation, then throw it away and rewrite it from scratch myself. I think the effect of using powerful AI models while learning to code is unclear; it's easy to get lost.

Final Thoughts

smoltok is an educational project. I learned a lot building it, and I hope the exercise.md guide helps others do the same.

If even one person follows that exercise and builds their own tokenizer from scratch, that would be a great outcome. šŸ¦€


  1. which is a big problem by itself

#project #rust #smol