
Project Overview

This repository contains the scripts, data samples, and artifacts for training and evaluating a multilingual SentencePiece BPE tokenizer (Kazakh, Russian, English), as well as for running inference with the Gemma 1B model.

The workflow encompasses:

Sampling & Corpus Preparation: Extracting text samples from large datasets (JSON, Parquet) and assembling a training corpus.

Tokenizer Training: Using SentencePiece to train a BPE tokenizer on the sampled corpus.

Evaluation: Measuring metrics (compression ratio, fertility, continued-word ratio) on language-specific test sets.

Inference: Generating text with the Gemma model using either the default or custom tokenizer.

├── .gitattributes
├── english_eval_texts.json        # Collected English test texts (~25 MB)
├── kazakh_eval_texts.json         # Collected Kazakh test texts (~25 MB)
├── russian_eval_texts.json        # Collected Russian test texts (~25 MB)
├── sentencepiece-bpe-tokenizer.py # Script: train tokenizer from multiple sources
├── test-tokenizer.py              # Script: evaluate custom tokenizer metrics
├── test-tokenizer-gemma-3-1b.py   # Script: evaluate Gemma tokenizer metrics
├── inference_gemma.py             # Script: run text generation with Gemma 1B
├── tokenizer_evaluation.json      # Saved evaluation metrics (per-language & overall)
│
└── spm_bpe_tokenizer_50000_new/   # Artifacts and sampled data
    ├── samples/                   # Per-source sampled texts used for training
    ├── training_corpus.txt        # Combined training corpus (one sentence per line)
    ├── tokenizer.model            # Trained SentencePiece model file
    ├── tokenizer.vocab            # Corresponding vocabulary file
    ├── tokenizer_config.json      # Hugging Face tokenizer config
    ├── tokenizer_multilingual.model # (Optional) alternate multilingual model
    └── tokenizer_multilingual.vocab # (Optional) alternate vocab file
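
The trained tokenizer.model can be loaded directly with the sentencepiece Python package. A minimal sketch, assuming the artifact paths above (the sample sentence is arbitrary):

```python
# Minimal sketch: load the trained SentencePiece model and tokenize a sample string.
# Only the model path comes from this repository; the example text is arbitrary.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_bpe_tokenizer_50000_new/tokenizer.model")

text = "Сәлем әлем! Привет, мир! Hello, world!"
pieces = sp.encode(text, out_type=str)   # subword pieces
ids = sp.encode(text, out_type=int)      # token ids

print(pieces)
print(sp.decode(ids))                    # round-trips back to the original text
```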

Usage

  1. Sample and Train Tokenizer

python sentencepiece-bpe-tokenizer.py

This will:

Read and randomly sample from the specified JSON and Parquet sources.

Write per-file samples into spm_bpe_tokenizer_50000_new/samples/.

Assemble training_corpus.txt in the same directory.

Train a BPE tokenizer with a vocabulary size of 50,000.

Output tokenizer.model, tokenizer.vocab, and tokenizer_config.json (a simplified sketch of this pipeline follows below).
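
The sketch below illustrates the overall flow only; the real sampling targets, source paths, and SentencePiece options live in sentencepiece-bpe-tokenizer.py, and the source file names, "text" field, and sample counts here are assumptions.

```python
# Simplified sketch only: the real script reads specific JSON/Parquet sources and
# applies its own sampling targets. Source paths, the "text" field name, and the
# sample counts below are illustrative assumptions.
import json
import os
import random

import pandas as pd
import sentencepiece as spm

out_dir = "spm_bpe_tokenizer_50000_new"
os.makedirs(out_dir, exist_ok=True)
corpus_path = os.path.join(out_dir, "training_corpus.txt")

def sample_json(path, n):
    # Expects a JSON array of records with a "text" field (assumption).
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return [r["text"] for r in random.sample(records, min(n, len(records)))]

def sample_parquet(path, n):
    # Expects a Parquet file with a "text" column (assumption).
    df = pd.read_parquet(path)
    return df["text"].sample(min(n, len(df))).tolist()

texts = (
    sample_json("kazakh_source.json", 100_000)
    + sample_parquet("russian_source.parquet", 100_000)
)

# One sentence per line, matching the training_corpus.txt layout described above.
with open(corpus_path, "w", encoding="utf-8") as f:
    for t in texts:
        for line in t.splitlines():
            if line.strip():
                f.write(line.strip() + "\n")

spm.SentencePieceTrainer.train(
    input=corpus_path,
    model_prefix=os.path.join(out_dir, "tokenizer"),
    vocab_size=50_000,
    model_type="bpe",
    character_coverage=1.0,  # keep full Cyrillic and Latin coverage
)
```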

  2. Evaluate Custom Tokenizer

python test-tokenizer.py

Generates per-language metrics for compression ratio, fertility, and continued-word ratio, and saves the results to tokenizer_evaluation.json.
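
A rough sketch of how such per-language metrics can be computed; the exact definitions used by test-tokenizer.py may differ, and the sketch assumes the *_eval_texts.json files contain plain lists of strings.

```python
# Rough sketch of per-language metrics; exact definitions in test-tokenizer.py may
# differ. Here: fertility = tokens per whitespace word, compression ratio =
# characters per token, continued-word ratio = share of pieces that do not start
# a new word (no leading "▁" marker).
import json
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_bpe_tokenizer_50000_new/tokenizer.model")

def evaluate(texts):
    n_tokens = n_words = n_chars = n_continued = 0
    for text in texts:
        pieces = sp.encode(text, out_type=str)
        n_tokens += len(pieces)
        n_words += len(text.split())
        n_chars += len(text)
        n_continued += sum(1 for p in pieces if not p.startswith("▁"))
    return {
        "fertility": n_tokens / max(n_words, 1),
        "compression_ratio": n_chars / max(n_tokens, 1),
        "continued_word_ratio": n_continued / max(n_tokens, 1),
    }

results = {}
for lang, path in [("kazakh", "kazakh_eval_texts.json"),
                   ("russian", "russian_eval_texts.json"),
                   ("english", "english_eval_texts.json")]:
    with open(path, encoding="utf-8") as f:
        results[lang] = evaluate(json.load(f))

with open("tokenizer_evaluation.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```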

  3. Evaluate Gemma’s Tokenizer

python test-tokenizer-gemma-3-1b.py

Computes the same metrics using the Gemma 3 1B tokenizer loaded via Hugging Face.
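
A minimal sketch using the Hugging Face AutoTokenizer; the checkpoint id google/gemma-3-1b-it is an assumption (the script may point at a different Gemma variant), and the gated Gemma repositories require an authenticated Hugging Face login.

```python
# Same metrics, but with the Gemma tokenizer. The model id is an assumption;
# test-tokenizer-gemma-3-1b.py may use a different checkpoint.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

def fertility(texts):
    # Tokens per whitespace word, as in the custom-tokenizer sketch above.
    n_tokens = sum(len(tok.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)

print(fertility(["Hello world", "Сәлем әлем", "Привет мир"]))
```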

  4. Run Inference with Gemma 1B

python inference_gemma.py

Prompts the Gemma model in English, Russian, and Kazakh (customizable in the script) and prints generated outputs.
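
A minimal generation sketch with the transformers pipeline; inference_gemma.py may load the model differently (custom tokenizer, quantization, sampling settings), and the checkpoint id is again an assumption.

```python
# Minimal generation sketch; the actual script may differ in model loading,
# tokenizer choice, and generation settings. Model id is assumed.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompts = [
    "Explain what a tokenizer does in one sentence.",   # English
    "Объясни одним предложением, что делает токенизатор.",  # Russian
    "Токенизатор не істейтінін бір сөйлеммен түсіндір.",    # Kazakh
]
for prompt in prompts:
    out = generator(prompt, max_new_tokens=64, do_sample=False)
    print(out[0]["generated_text"])
```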

Reproducing & Modifying

Re-run sampling: tweak target_samples or the word-count targets in sentencepiece-bpe-tokenizer.py and re-run to regenerate samples/ and training_corpus.txt.

Re-train: adjust vocab_size, model_type, or other SentencePiece parameters in the same script.

Re-evaluate: adjust the parameters in test-tokenizer.py (e.g., test corpus size) and re-run.