# Project Overview
This repository contains all the scripts, data samples, and artifacts for training, evaluating, and testing a multilingual SentencePiece BPE tokenizer (Kazakh, Russian, English) and running inference with the Gemma 1B model.
The workflow encompasses:
- **Sampling & Corpus Preparation:** extracting text samples from large datasets (JSON, Parquet) and assembling a training corpus.
- **Tokenizer Training:** using SentencePiece to train a BPE tokenizer on the sampled corpus.
- **Evaluation:** measuring metrics (compression ratio, fertility, continued-word ratio) on language-specific test sets.
- **Inference:** generating text with the Gemma model using either the default or the custom tokenizer.
```text
├── .gitattributes
├── english_eval_texts.json          # Collected English test texts (~25 MB)
├── kazakh_eval_texts.json           # Collected Kazakh test texts (~25 MB)
├── russian_eval_texts.json          # Collected Russian test texts (~25 MB)
├── sentencepiece-bpe-tokenizer.py   # Script: train tokenizer from multiple sources
├── test-tokenizer.py                # Script: evaluate custom tokenizer metrics
├── test-tokenizer-gemma-3-1b.py     # Script: evaluate Gemma tokenizer metrics
├── inference_gemma.py               # Script: run text generation with Gemma 1B
├── tokenizer_evaluation.json        # Saved evaluation metrics (per-language & overall)
│
└── spm_bpe_tokenizer_50000_new/     # Artifacts and sampled data
    ├── samples/                     # Per-source sampled texts used for training
    ├── training_corpus.txt          # Combined training corpus (one sentence per line)
    ├── tokenizer.model              # Trained SentencePiece model file
    ├── tokenizer.vocab              # Corresponding vocabulary file
    ├── tokenizer_config.json        # Hugging Face tokenizer config
    ├── tokenizer_multilingual.model # (Optional) alternate multilingual model
    └── tokenizer_multilingual.vocab # (Optional) alternate vocab file
```
## Usage

### 1. Sample and Train Tokenizer

```bash
python sentencepiece-bpe-tokenizer.py
```
This will:

- Read and randomly sample from the specified JSON and Parquet sources.
- Write per-file samples into `spm_bpe_tokenizer_50000_new/samples/`.
- Assemble `training_corpus.txt` in the same directory.
- Train a BPE tokenizer with a vocabulary size of 50,000.
- Output `tokenizer.model`, `tokenizer.vocab`, and `tokenizer_config.json`.
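For orientation, here is a minimal sketch of the kind of training call the script makes, assuming the `sentencepiece` package is installed. The file paths follow the repository layout above; the `character_coverage` value is an assumption, so consult `sentencepiece-bpe-tokenizer.py` for the exact parameters.

```python
import sentencepiece as spm

# Train a BPE model on the assembled corpus (a sketch; the script's exact
# flags may differ).
spm.SentencePieceTrainer.train(
    input="spm_bpe_tokenizer_50000_new/training_corpus.txt",  # one sentence per line
    model_prefix="spm_bpe_tokenizer_50000_new/tokenizer",     # writes tokenizer.model / tokenizer.vocab
    vocab_size=50_000,
    model_type="bpe",
    character_coverage=1.0,  # assumption: full coverage for Cyrillic + Latin scripts
)
```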
### 2. Evaluate Custom Tokenizer

```bash
python test-tokenizer.py
```
Generates per-language metrics for compression ratio, fertility, and continued-word ratio, and saves the results to `tokenizer_evaluation.json`.
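As a rough illustration of how such metrics can be computed (not necessarily how `test-tokenizer.py` computes them), here is a sketch using the trained model file:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm_bpe_tokenizer_50000_new/tokenizer.model")

def evaluate(texts):
    n_chars = n_words = n_tokens = n_continued = 0
    for text in texts:
        pieces = sp.encode(text, out_type=str)
        n_tokens += len(pieces)
        n_chars += len(text)
        n_words += len(text.split())
        # Pieces that do not start with the SentencePiece word-boundary
        # marker continue a previous word.
        n_continued += sum(1 for p in pieces if not p.startswith("▁"))
    return {
        "compression_ratio": n_chars / n_tokens,    # characters per token (higher = denser)
        "fertility": n_tokens / n_words,            # tokens per whitespace-separated word
        "continued_word_ratio": n_continued / n_tokens,
    }
```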
### 3. Evaluate Gemma’s Tokenizer

```bash
python test-tokenizer-gemma-3-1b.py
```
Computes the same metrics using the Gemma 3 1B tokenizer loaded via Hugging Face.
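A minimal sketch of loading a Gemma tokenizer through `transformers`; the checkpoint name `google/gemma-3-1b-it` is an assumption based on the script's name, and the model is gated, so an authenticated Hugging Face session is required:

```python
from transformers import AutoTokenizer

# Assumed checkpoint; the script may point at a different Gemma variant.
tok = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")

pieces = tok.tokenize("Сәлем, әлем!")  # Kazakh: "Hello, world!"
print(pieces, len(pieces))
```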
### 4. Run Inference with Gemma 1B

```bash
python inference_gemma.py
```
Prompts the Gemma model in English, Russian, and Kazakh (customizable in the script) and prints generated outputs.
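The sketch below shows one common way to generate text with a Gemma checkpoint through `transformers`; the model id and generation settings are assumptions, not a transcript of `inference_gemma.py`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-1b-it"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Қазақстан туралы қысқаша айтып беріңіз."  # Kazakh: "Briefly tell me about Kazakhstan."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```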
## Reproducing & Modifying

- **Re-run sampling:** tweak `target_samples` or the word-count targets in `sentencepiece-bpe-tokenizer.py` and re-run to regenerate `samples/` and `training_corpus.txt`.
- **Re-train:** adjust `vocab_size`, `model_type`, or other SentencePiece parameters in the same script (see the training sketch above).
- **Re-evaluate:** modify `test-tokenizer.py` parameters (e.g. test corpus size) and re-run.