---
language:
- zh
- en
library_name: pytorch
tags:
- translation
- transformer
- tiny-llm
- zh-en
pipeline_tag: translation
---

# Tiny-LLM (ZH→EN) Checkpoint

A minimal Transformer encoder–decoder for Chinese→English translation. This repository hosts the inference assets (checkpoint and tokenizer) for use in plain Python scripts or Gradio apps.

## Files

- `translate-step=290000.ckpt` — PyTorch checkpoint; the weights live under the `state_dict` key (Lightning format)
- `tokenizer.json` — Hugging Face Tokenizers (BPE) file with the special tokens `[UNK]`, `[PAD]`, `[SOS]`, `[EOS]`

## Quick start

Download the files with `huggingface_hub`, then load them with your own model code:

```python
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Replace with your own repo id if you fork
REPO_ID = "caixiaoshun/tiny-llm-zh2en"

ckpt_path = hf_hub_download(repo_id=REPO_ID, filename="translate-step=290000.ckpt")
tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="tokenizer.json")

# Example: integrate with a minimal Transformer implementation
# from src.config import Config
# from src.model import TranslateModel

# config = Config()
# config.tokenizer_file = tokenizer_path
# model = TranslateModel(config)
# state = torch.load(ckpt_path, map_location="cpu")["state_dict"]

# # Strip the prefix that Lightning/torch.compile may add to parameter names
# prefix = "net._orig_mod."
# state = {(k[len(prefix):] if k.startswith(prefix) else k): v for k, v in state.items()}
# model.load_state_dict(state, strict=True)
# model.eval()

# tokenizer = Tokenizer.from_file(tokenizer_path)
```

If you deploy on Hugging Face Spaces or ModelScope, set these environment variables so your app fetches its assets from this repo:

```bash
export HF_REPO_ID=caixiaoshun/tiny-llm-zh2en
export CKPT_FILE=translate-step=290000.ckpt
export TOKENIZER_FILE=tokenizer.json
```

## Notes

- Trained on a Chinese→English parallel dataset (CSV layout with ZH in column 0 and EN in column 1). Make sure the tokenizer and model hyperparameters match your training run.
- Decoding strategies supported in the reference app: greedy, nucleus (top-p) sampling, and beam search.
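Of those strategies, greedy decoding is the simplest to reproduce; a framework-free sketch of the loop, assuming a `step` callable that maps the current target prefix to next-token scores (the SOS/EOS ids below are illustrative and must match the ids in `tokenizer.json`):

```python
def greedy_decode(step, sos_id, eos_id, max_len=64):
    """Repeatedly pick the arg-max token until EOS or max_len is reached."""
    tokens = [sos_id]
    for _ in range(max_len):
        scores = step(tokens)  # one score per vocab entry for the next position
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the leading SOS

# Toy "model": first scores token 3 highest, then EOS (id 2)
def toy_step(tokens):
    return [0.0, 0.0, 1.0, 0.0] if len(tokens) > 1 else [0.0, 0.0, 0.0, 1.0]

print(greedy_decode(toy_step, sos_id=1, eos_id=2))  # -> [3]
```

Nucleus sampling and beam search replace the arg-max step with sampling from the top-p mass and with tracking several candidate prefixes, respectively.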

## Intended use

- Educational and demo purposes for small-scale translation tasks.
- Not intended for production-grade translation quality without further training/fine-tuning and evaluation.

## Limitations

- Small model capacity; outputs may be inaccurate or inconsistent on complex inputs.
- The tokenizer and checkpoint must match; a mismatch leads to degraded results or load errors.
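One cheap guard against such mismatches is to compare the tokenizer's vocabulary size with the embedding matrix in the checkpoint before loading; a sketch, assuming the embedding weight sits under a key like `embedding.weight` (the real key name depends on the model definition):

```python
def vocab_matches(vocab_size, state_dict, embed_key="embedding.weight"):
    """True if the checkpoint's embedding rows equal the tokenizer's vocab size."""
    rows = state_dict[embed_key].shape[0]
    return rows == vocab_size

# With the real assets this would look like:
# tokenizer = Tokenizer.from_file(tokenizer_path)
# state = torch.load(ckpt_path, map_location="cpu")["state_dict"]
# assert vocab_matches(tokenizer.get_vocab_size(), state, embed_key=...)
```

Running this before `load_state_dict` turns a silent quality degradation into an explicit failure.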

## Acknowledgements

- PyTorch for the deep learning framework
- Hugging Face Tokenizers for fast BPE tokenization