# Tokenizer

A tokenizer for LLM inference supporting the BPE, SentencePiece, and WordPiece algorithms. The goal of this package is to see whether a pure Go tokenizer can be fast and correct. It primarily supports the imagegen models, but it (or parts of it) could be considered as a replacement for Ollama's tokenizer in the `model` package.

## Features

- **BPE (Byte Pair Encoding)** - GPT-2/Llama style with byte-level encoding
- **SentencePiece** - Gemma style with space handling
- **WordPiece** - BERT style with `##` continuation tokens
- **Parallel encoding** - Automatic parallelization for inputs >4KB (a rough sketch follows this list)
- **HuggingFace compatible** - Loads `tokenizer.json` directly
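
The parallel-encoding idea can be illustrated as: split a large input at whitespace boundaries so no token can span a chunk, encode the chunks concurrently, and concatenate the results in order. The sketch below is a hypothetical illustration of that general approach, not the package's actual implementation; `encodeChunk` and the `[]int32` token type are placeholders.

```go
import "sync"

// encodeParallel is a hypothetical sketch of chunked parallel encoding.
// encodeChunk stands in for a single-threaded encode of one chunk.
func encodeParallel(text string, chunkSize int, encodeChunk func(string) []int32) []int32 {
    if len(text) <= chunkSize {
        return encodeChunk(text)
    }

    // Split at whitespace so BPE merges never cross a chunk boundary.
    var chunks []string
    for len(text) > chunkSize {
        cut := chunkSize
        for cut < len(text) && text[cut] != ' ' && text[cut] != '\n' {
            cut++
        }
        chunks = append(chunks, text[:cut])
        text = text[cut:]
    }
    if text != "" {
        chunks = append(chunks, text)
    }

    // Encode each chunk concurrently.
    results := make([][]int32, len(chunks))
    var wg sync.WaitGroup
    for i, c := range chunks {
        wg.Add(1)
        go func(i int, c string) {
            defer wg.Done()
            results[i] = encodeChunk(c)
        }(i, c)
    }
    wg.Wait()

    // Concatenate in the original order.
    var ids []int32
    for _, r := range results {
        ids = append(ids, r...)
    }
    return ids
}
```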

## Usage

```go
import (
    "log"

    "github.com/ollama/ollama/x/imagegen/tokenizer"
)

// Load from HuggingFace model directory
tok, err := tokenizer.Load("./weights/Llama-3.2-1B")
if err != nil {
    log.Fatal(err)
}

// Encode text to token IDs
ids := tok.Encode("Hello, world!", false) // false = don't add BOS

// Decode back to text
text := tok.Decode(ids)

// Check special tokens
if tok.IsEOS(ids[len(ids)-1]) {
    // End of sequence
}
```

## Performance

Benchmarks on Apple M3 Max:

| Input Size | Encode | Decode | Tokens |
|------------|-----------|----------|---------|
| 1 KB       | 14.5 MB/s | 267 MB/s | 231     |
| 10 KB      | 10.9 MB/s | 321 MB/s | 2,301   |
| 100 KB     | 8.9 MB/s  | 311 MB/s | 23,001  |
| 1 MB       | 9.6 MB/s  | 321 MB/s | 230,001 |
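
These numbers come from Go benchmarks (see `tokenizer_test.go`). The sketch below shows how such a throughput measurement can be reproduced with the standard `testing` package; the benchmark name, sample input, and weights path are illustrative, not the package's actual benchmark code.

```go
package tokenizer_test

import (
    "strings"
    "testing"

    "github.com/ollama/ollama/x/imagegen/tokenizer"
)

// BenchmarkEncode1KB is a hypothetical example; b.SetBytes makes
// `go test -bench` report throughput in MB/s.
func BenchmarkEncode1KB(b *testing.B) {
    tok, err := tokenizer.Load("./weights/Llama-3.2-1B") // example path
    if err != nil {
        b.Skip(err)
    }
    input := strings.Repeat("The quick brown fox jumps over the lazy dog. ", 23)[:1024]
    b.SetBytes(int64(len(input)))
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = tok.Encode(input, false)
    }
}
```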

Comparison with other implementations (10 MB input):

| Implementation  | Encode Speed | Notes                      |
|-----------------|--------------|----------------------------|
| Engine (this)   | ~10 MB/s     | stdlib RE2, parallel >4KB  |
| tiktoken (Rust) | ~17 MB/s     | Highly optimized regex     |
| Ollama (Go)     | ~2-3 MB/s    | regexp2 backtracking       |

## Performance Opportunities

Potential optimizations not yet implemented:

| Optimization                          | Expected Gain                 | Complexity |
|---------------------------------------|-------------------------------|------------|
| Aho-Corasick for special tokens       | 2-3x for many special tokens  | Medium     |
| Custom regex engine (like tiktoken)   | 1.5-2x                        | High       |
| SIMD byte scanning                    | 1.3-1.5x for pretokenizer     | Medium     |
| Assembly BPE merge loop               | 1.2-1.5x                      | High       |
| Memoization for repeated substrings   | Variable                      | Low        |
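
Of these, memoization is the lowest-hanging fruit: natural text repeats the same words, so token IDs can be cached per pretokenized piece. A hypothetical sketch (not the package's code; `encodePiece` stands in for the per-piece BPE merge step):

```go
// memoEncoder caches token IDs per pretokenized piece.
// Illustrative only; not safe for concurrent use as written.
type memoEncoder struct {
    cache       map[string][]int32
    encodePiece func(string) []int32
}

func (m *memoEncoder) encode(piece string) []int32 {
    if ids, ok := m.cache[piece]; ok {
        return ids
    }
    ids := m.encodePiece(piece)
    m.cache[piece] = ids
    return ids
}
```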

The current bottleneck is the pretokenizer regex (~60% of encode time). tiktoken achieves ~17 MB/s with a hand-tuned Rust regex engine.
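
For reference, a GPT-2-style pretokenizer expressed with Go's stdlib `regexp` looks roughly like the sketch below. This is an assumption for illustration, not the package's actual pattern; in particular, RE2 has no lookahead, so the original GPT-2 rule for trailing whitespace has to be approximated.

```go
import "regexp"

// Approximate GPT-2-style pretokenizer pattern for RE2 (illustrative only).
// Each matched piece is later byte-encoded and BPE-merged independently,
// which is why this split dominates encode time.
var pretokenizer = regexp.MustCompile(
    `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+`,
)

func pretokenize(text string) []string {
    return pretokenizer.FindAllString(text, -1)
}
```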

## Not Yet Implemented

| Feature               | Used By                   | Notes                          |
|-----------------------|---------------------------|--------------------------------|
| Unigram tokenizer     | T5, ALBERT, mBART         | Different algorithm (not BPE)  |
| Unicode normalizers   | Some multilingual models  | NFD, NFKC, lowercase, etc.     |
| Custom pretokenizers  | Model-specific            | Beyond standard patterns       |

Most HuggingFace models use BPE or SentencePiece, which are fully supported. WordPiece (BERT-style) is also supported with standard [UNK] fallback for out-of-vocabulary characters.

## Files

| File                | Description                     |
|---------------------|---------------------------------|
| `tokenizer.go`      | Main implementation (~1000 lines) |
| `tokenizer_test.go` | Tests and benchmarks            |
| `testdata/`         | Mini tokenizer for unit tests   |