	---
	language:
	- zh
	- en
	library_name: pytorch
	tags:
	- translation
	- transformer
	- tiny-llm
	- zh-en
	pipeline_tag: translation
	---

	# Tiny-LLM (ZH→EN) Checkpoint

	Minimal Transformer encoder–decoder for Chinese → English translation. This repository hosts the inference assets (checkpoint and tokenizer) usable in Python or Gradio apps.

	## Files

	- `translate-step=290000.ckpt` — PyTorch Lightning checkpoint; the model weights are stored under the `state_dict` key
	- `tokenizer.json` — Hugging Face Tokenizers (BPE) with special tokens `[UNK]`, `[PAD]`, `[SOS]`, `[EOS]`

	## Quick start

	Download the files with `huggingface_hub`, then load them into your own model code:

	```python
	import torch
	from huggingface_hub import hf_hub_download
	from tokenizers import Tokenizer

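	# Replace with your repo id if you fork.
	# Note: this card is hosted under "caixiaoshun/tiny-translator-zh2en"; try that id if the one below does not resolve.
	REPO_ID = "caixiaoshun/tiny-llm-zh2en"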

	ckpt_path = hf_hub_download(repo_id=REPO_ID, filename="translate-step=290000.ckpt")
	tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="tokenizer.json")

	# Example: integrate with a minimal Transformer implementation
	# from src.config import Config
	# from src.model import TranslateModel

	# config = Config()
	# config.tokenizer_file = tokenizer_path
	# model = TranslateModel(config)
	# state = torch.load(ckpt_path, map_location="cpu")["state_dict"]

	# # Strip potential Lightning/compile prefixes
	# prefix = "net._orig_mod."
	# state = { (k[len(prefix):] if k.startswith(prefix) else k): v for k, v in state.items() }
	# model.load_state_dict(state, strict=True)
	# model.eval()

	# tokenizer = Tokenizer.from_file(tokenizer_path)
	```
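
The prefix-stripping step in the commented code above can be pulled into a small helper. This is a minimal sketch, not the reference implementation: `strip_prefix` is a hypothetical name, and a plain dict with made-up keys stands in for the real state dict so the snippet runs without the checkpoint.

```python
def strip_prefix(state, prefix="net._orig_mod."):
    """Return a copy of `state` with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state.items()
    }


# Toy stand-in for a checkpoint's "state_dict"; the key names are made up for illustration.
toy_state = {
    "net._orig_mod.encoder.weight": 1,
    "net._orig_mod.decoder.weight": 2,
    "step": 3,
}
cleaned = strip_prefix(toy_state)
print(sorted(cleaned))
```

Keys that lack the prefix pass through unchanged, so the helper only rewrites entries that actually carry it.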

	If you deploy on Hugging Face Spaces or ModelScope, set these environment variables so that your app fetches the assets from this repo:

	```bash
	export HF_REPO_ID=caixiaoshun/tiny-llm-zh2en
	export CKPT_FILE=translate-step=290000.ckpt
	export TOKENIZER_FILE=tokenizer.json
	```
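
On the Python side, an app could read these variables roughly as follows. This is a sketch under assumptions: the reference app's actual code is not shown in this README, `resolve_assets` and `DEFAULTS` are hypothetical names, and `my-run.ckpt` is a made-up file name used only to demonstrate an override.

```python
import os

# Defaults mirror the values shown in this README.
DEFAULTS = {
    "HF_REPO_ID": "caixiaoshun/tiny-llm-zh2en",
    "CKPT_FILE": "translate-step=290000.ckpt",
    "TOKENIZER_FILE": "tokenizer.json",
}


def resolve_assets(env=None):
    """Read repo id and file names from environment variables, falling back to DEFAULTS."""
    env = os.environ if env is None else env
    # `or default` also covers variables that are set but empty.
    return {name: env.get(name) or default for name, default in DEFAULTS.items()}


# Pass an explicit mapping so the example does not depend on the real environment.
settings = resolve_assets({"CKPT_FILE": "my-run.ckpt"})
print(settings)
```

The resulting values can then be passed to `hf_hub_download(repo_id=..., filename=...)` as in the quick start.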

	## Notes

	- Trained on a Chinese→English parallel dataset (CSV layout with ZH in column 0 and EN in column 1). When loading the checkpoint, use the tokenizer and model hyperparameters from the training run that produced it.
	- Decoding strategies supported in the reference app: greedy, nucleus (top-p), and beam search.
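
To illustrate how two of the decoding strategies named above differ, here is a minimal pure-Python sketch of greedy selection and nucleus (top-p) sampling for a single decoding step. It is not the reference app's code: the function names are hypothetical, the probabilities are made up, and beam search is left out for brevity.

```python
import random


def greedy_pick(probs):
    """Return the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def nucleus_pick(probs, top_p=0.9, rng=random):
    """Sample one token index from the top-p filtered distribution."""
    filtered = top_p_filter(probs, top_p)
    indices = list(filtered)
    return rng.choices(indices, weights=[filtered[i] for i in indices], k=1)[0]


# Toy next-token distribution over a 4-token vocabulary (made-up numbers).
probs = [0.5, 0.3, 0.15, 0.05]
print(greedy_pick(probs))
print(sorted(top_p_filter(probs, top_p=0.7)))
print(nucleus_pick(probs, top_p=0.7, rng=random.Random(0)))
```

Greedy decoding always takes the single most likely token, while nucleus sampling draws from the truncated distribution, so its output can vary between runs unless the random generator is seeded.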

	## Intended use

	- Educational and demo purposes for small-scale translation tasks.
	- Not intended for production use without further training or fine-tuning and evaluation.

	## Limitations

	- Small model capacity; outputs may be inaccurate or inconsistent on complex inputs.
	- Tokenizer and checkpoint must match; mismatches lead to degraded results or load errors.
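
One way to catch a tokenizer or checkpoint mismatch before loading is to compare parameter shapes. The sketch below is an assumption-laden illustration: `compare_shapes` is a hypothetical helper, and the parameter names and vocabulary sizes are invented. Plain tuples stand in for tensor shapes so the snippet runs without PyTorch.

```python
def compare_shapes(expected, loaded):
    """Compare two {parameter name: shape} mappings and report the differences."""
    missing = sorted(set(expected) - set(loaded))
    unexpected = sorted(set(loaded) - set(expected))
    mismatched = sorted(
        name for name in set(expected) & set(loaded)
        if tuple(expected[name]) != tuple(loaded[name])
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


# Made-up names and shapes: the model is built for a 32000-token vocabulary,
# while the checkpoint comes from a run with a 16000-token tokenizer.
expected = {"embed.weight": (32000, 256), "proj.weight": (32000, 256)}
loaded = {"embed.weight": (16000, 256), "proj.weight": (16000, 256), "extra.bias": (256,)}
report = compare_shapes(expected, loaded)
print(report)
```

With real objects, the two mappings could be built as `{k: tuple(v.shape) for k, v in model.state_dict().items()}` and the same expression over the checkpoint's state dict.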

	## Acknowledgements

	- PyTorch for the deep learning framework
	- Hugging Face Tokenizers for fast BPE