Project Overview
This repository contains all the scripts, data samples, and artifacts for training, evaluating, and testing a multilingual SentencePiece BPE tokenizer (Kazakh, Russian, English) and running inference with the Gemma 1B model.
The workflow encompasses:
Sampling & Corpus Preparation: Extracting text samples from large datasets (JSON, Parquet) and assembling a training corpus.
Tokenizer Training: Using SentencePiece to train a BPE tokenizer on the sampled corpus.
Evaluation: Measuring metrics (compression ratio, fertility, continued-word ratio) on language-specific test sets.
Inference: Generating text with the Gemma model using either the default or the custom tokenizer.
```text
├── .gitattributes
├── english_eval_texts.json          # Collected English test texts (~25 MB)
├── kazakh_eval_texts.json           # Collected Kazakh test texts (~25 MB)
├── russian_eval_texts.json          # Collected Russian test texts (~25 MB)
├── sentencepiece-bpe-tokenizer.py   # Script: train tokenizer from multiple sources
├── test-tokenizer.py                # Script: evaluate custom tokenizer metrics
├── test-tokenizer-gemma-3-1b.py     # Script: evaluate Gemma tokenizer metrics
├── inference_gemma.py               # Script: run text generation with Gemma 1B
├── tokenizer_evaluation.json        # Saved evaluation metrics (per-language & overall)
│
└── spm_bpe_tokenizer_50000_new/     # Artifacts and sampled data
    ├── samples/                     # Per-source sampled texts used for training
    ├── training_corpus.txt          # Combined training corpus (one sentence per line)
    ├── tokenizer.model              # Trained SentencePiece model file
    ├── tokenizer.vocab              # Corresponding vocabulary file
    ├── tokenizer_config.json        # Hugging Face tokenizer config
    ├── tokenizer_multilingual.model # (Optional) alternate multilingual model
    └── tokenizer_multilingual.vocab # (Optional) alternate vocab file
```
Usage
1. Sample and Train Tokenizer
python sentencepiece-bpe-tokenizer.py
This will:
Read and randomly sample from the specified JSON and Parquet sources.
Write per-file samples into spm_bpe_tokenizer_50000_new/samples/.
Assemble training_corpus.txt in the same directory.
Train a BPE tokenizer with a vocabulary size of 50,000.
Output tokenizer.model, tokenizer.vocab, and tokenizer_config.json.
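The sketch below approximates this sampling-and-training flow. The source file names, the target_samples value, and the assumption that each source exposes a "text" field are illustrative; the actual sources and sampling logic live in sentencepiece-bpe-tokenizer.py.

```python
# Condensed sketch of the sampling + training flow (assumed sources and field names).
import json
import random
from pathlib import Path

import pandas as pd
import sentencepiece as spm

out_dir = Path("spm_bpe_tokenizer_50000_new")
samples_dir = out_dir / "samples"
samples_dir.mkdir(parents=True, exist_ok=True)

target_samples = 200_000  # per-source sample count (assumed value)

def sample_json(path: Path, n: int) -> list[str]:
    """Randomly sample text fields from a JSON file assumed to hold a list of records."""
    records = json.loads(path.read_text(encoding="utf-8"))
    texts = [r["text"] for r in records if r.get("text")]
    return random.sample(texts, min(n, len(texts)))

def sample_parquet(path: Path, n: int) -> list[str]:
    """Randomly sample a 'text' column from a Parquet file."""
    col = pd.read_parquet(path, columns=["text"])["text"].dropna()
    return col.sample(min(n, len(col))).tolist()

# Assemble the combined corpus, one line per sampled text.
corpus_path = out_dir / "training_corpus.txt"
with corpus_path.open("w", encoding="utf-8") as corpus:
    for src in [Path("kazakh.json"), Path("russian.parquet"), Path("english.parquet")]:  # hypothetical sources
        texts = sample_json(src, target_samples) if src.suffix == ".json" else sample_parquet(src, target_samples)
        (samples_dir / f"{src.stem}_sample.txt").write_text("\n".join(texts), encoding="utf-8")
        corpus.write("\n".join(texts) + "\n")

# Train the BPE tokenizer with a 50,000-token vocabulary.
spm.SentencePieceTrainer.train(
    input=str(corpus_path),
    model_prefix=str(out_dir / "tokenizer"),
    vocab_size=50_000,
    model_type="bpe",
    character_coverage=1.0,  # keep full Cyrillic and Latin coverage for Kazakh/Russian/English
)
```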
2. Evaluate Custom Tokenizer
python test-tokenizer.py
Generates compression ratio, fertility, and continued-word ratio metrics for each language and saves the results to tokenizer_evaluation.json.
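A minimal sketch of how these metrics can be computed with the trained model follows. The metric definitions below are common conventions and an assumption about how the script defines them, as is the assumption that the *_eval_texts.json files hold plain lists of strings; see test-tokenizer.py for the authoritative implementation.

```python
# Sketch: per-language tokenizer metrics with the trained SentencePiece model (assumed definitions).
import json
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_bpe_tokenizer_50000_new/tokenizer.model")

def evaluate(texts: list[str]) -> dict:
    total_chars = total_words = total_tokens = continued = 0
    for text in texts:
        pieces = sp.encode(text, out_type=str)
        total_chars += len(text)
        total_words += len(text.split())
        total_tokens += len(pieces)
        # Pieces not starting with the SentencePiece word marker continue a previous word.
        continued += sum(1 for p in pieces if not p.startswith("▁"))
    return {
        "compression_ratio": total_chars / total_tokens,   # characters per token
        "fertility": total_tokens / total_words,            # tokens per whitespace word
        "continued_word_ratio": continued / total_tokens,   # share of word-internal pieces
    }

results = {}
for lang, path in {
    "kazakh": "kazakh_eval_texts.json",
    "russian": "russian_eval_texts.json",
    "english": "english_eval_texts.json",
}.items():
    with open(path, encoding="utf-8") as f:
        results[lang] = evaluate(json.load(f))

with open("tokenizer_evaluation.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```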
3. Evaluate Gemma's Tokenizer
python test-tokenizer-gemma-3-1b.py
Computes the same metrics using the Gemma 3 1B tokenizer loaded via Hugging Face.
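Loading the Gemma tokenizer for comparison is straightforward with a recent transformers release; the checkpoint name below is an assumption based on the script name, and Gemma checkpoints are gated, so you may need to accept the license and authenticate (e.g. huggingface-cli login) first.

```python
# Sketch: load the Gemma tokenizer from Hugging Face (assumed checkpoint name).
from transformers import AutoTokenizer

gemma_tok = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

text = "Tokenizer comparison example."
pieces = gemma_tok.tokenize(text)
print(len(pieces), pieces)  # feed the same eval texts through this tokenizer for comparison
```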
4. Run Inference with Gemma 1B
python inference_gemma.py
Prompts the Gemma model in English, Russian, and Kazakh (customizable in the script) and prints the generated outputs.
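A minimal generation sketch along these lines is shown below; the checkpoint name, prompts, and generation settings are assumptions rather than the script's exact values, and a GPU is recommended.

```python
# Sketch: multilingual text generation with Gemma 1B (assumed checkpoint and prompts).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # assumed checkpoint; inference_gemma.py may use another
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)

prompts = {
    "english": "Describe the weather in one sentence.",
    "russian": "Опиши погоду одним предложением.",
    "kazakh": "Ауа райын бір сөйлеммен сипатта.",
}

for lang, prompt in prompts.items():
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64)
    print(f"[{lang}] {tokenizer.decode(output[0], skip_special_tokens=True)}")
```

Swapping in the custom tokenizer only makes sense with a Gemma checkpoint whose embedding matrix matches the new vocabulary, so the default Gemma tokenizer is the safe choice for plain generation.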
Reproducing & Modifying
Re-run sampling: tweak target_samples or the word-count targets in sentencepiece-bpe-tokenizer.py and re-run to regenerate samples/ and training_corpus.txt.
Re-train: adjust vocab_size, model_type, or other SentencePiece parameters in the same script.
Re-evaluate: modify test-tokenizer.py parameters (e.g. test corpus size) and re-run.