---
language:
- zh
- en
library_name: pytorch
tags:
- translation
- transformer
- tiny-llm
- zh-en
pipeline_tag: translation
---
# Tiny-LLM (ZH→EN) Checkpoint
Minimal Transformer encoder–decoder for Chinese → English translation. This repository hosts the inference assets (checkpoint and tokenizer) usable in Python or Gradio apps.
## Files
- `translate-step=290000.ckpt`: Lightning-format PyTorch checkpoint; the model weights are stored under the `state_dict` key
- `tokenizer.json`: Hugging Face Tokenizers (BPE) file with the special tokens `[UNK]`, `[PAD]`, `[SOS]`, `[EOS]`
## Quick start
Download the files with `huggingface_hub` and load them into your own model code:
```python
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer
# Replace with your repo id if you fork
REPO_ID = "caixiaoshun/tiny-llm-zh2en"
ckpt_path = hf_hub_download(repo_id=REPO_ID, filename="translate-step=290000.ckpt")
tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="tokenizer.json")
# Example: integrate with a minimal Transformer implementation
# from src.config import Config
# from src.model import TranslateModel
# config = Config()
# config.tokenizer_file = tokenizer_path
# model = TranslateModel(config)
# state = torch.load(ckpt_path, map_location="cpu")["state_dict"]
# # Strip potential Lightning/compile prefixes
# prefix = "net._orig_mod."
# state = { (k[len(prefix):] if k.startswith(prefix) else k): v for k, v in state.items() }
# model.load_state_dict(state, strict=True)
# model.eval()
# tokenizer = Tokenizer.from_file(tokenizer_path)
```
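The prefix-stripping step in the commented example can be tried in isolation, without the model code. The following is a minimal sketch, assuming the checkpoint keys carry the `net._orig_mod.` prefix mentioned above; `strip_prefix` is a hypothetical helper name, and the dummy keys are illustrative rather than the real key names in this checkpoint.

```python
from typing import Any, Dict


def strip_prefix(state: Dict[str, Any], prefix: str = "net._orig_mod.") -> Dict[str, Any]:
    """Return a copy of `state` with `prefix` removed from every key that starts with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state.items()
    }


# Illustrative keys only; the real checkpoint keys depend on the model code.
dummy_state = {
    "net._orig_mod.encoder.weight": 1,
    "decoder.bias": 2,
}
cleaned = strip_prefix(dummy_state)
print(sorted(cleaned))
```

Keys without the prefix pass through unchanged, so the helper is safe to apply whether or not the weights were saved from a compiled module.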
If you deploy on Hugging Face Spaces or ModelScope, set environment variables to make your app fetch from this repo:
```bash
export HF_REPO_ID=caixiaoshun/tiny-llm-zh2en
export CKPT_FILE="translate-step=290000.ckpt"
export TOKENIZER_FILE=tokenizer.json
```
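How the app consumes these variables is not shown in this README, so the following is only a sketch of one possible approach. `read_asset_settings` is a hypothetical helper, and the fallback values are simply the file names listed in this repository.

```python
import os
from typing import Dict, Mapping


def read_asset_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the repo id and file names from the environment, with fallbacks."""
    return {
        "repo_id": env.get("HF_REPO_ID", "caixiaoshun/tiny-llm-zh2en"),
        "ckpt_file": env.get("CKPT_FILE", "translate-step=290000.ckpt"),
        "tokenizer_file": env.get("TOKENIZER_FILE", "tokenizer.json"),
    }


settings = read_asset_settings()
# The values can then be passed on, for example:
# hf_hub_download(repo_id=settings["repo_id"], filename=settings["ckpt_file"])
```

Taking the environment as a parameter keeps the helper easy to test with a plain dictionary.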
## Notes
- Trained on a Chinese→English parallel dataset (CSV layout with ZH at column 0 and EN at column 1). Ensure the tokenizer and model hyperparameters match your training run.
- Decoding strategies supported in the reference app: greedy, nucleus (top-p), and beam search.
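To illustrate the difference between greedy and nucleus (top-p) decoding, here is a minimal sketch for a single decoding step. It is not the reference app's implementation, which is not shown here: it works on a plain list of next-token probabilities instead of tensors, the function names are hypothetical, and beam search is left out for brevity.

```python
import random
from typing import List, Optional


def greedy(probs: List[float]) -> int:
    """Pick the index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_p_filter(probs: List[float], top_p: float) -> List[float]:
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches top_p, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    cumulative = 0.0
    for idx in order:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


def sample_top_p(probs: List[float], top_p: float,
                 rng: Optional[random.Random] = None) -> int:
    """Sample a token index from the top-p filtered distribution."""
    rng = rng or random.Random()
    filtered = top_p_filter(probs, top_p)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_p_filter(probs, top_p=0.7)  # only the two most probable tokens survive
token = sample_top_p(probs, top_p=0.7, rng=random.Random(0))
```

Greedy decoding always returns the same token for a given distribution, while top-p sampling trades some determinism for variety by drawing from the truncated distribution.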
## Intended use
- Educational and demo purposes for small-scale translation tasks.
- Not intended for production-grade translation without further training or fine-tuning and evaluation.
## Limitations
- Small model capacity; outputs may be inaccurate or inconsistent on complex inputs.
- Tokenizer and checkpoint must match; mismatches lead to degraded results or load errors.
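When a load error does occur, comparing parameter names on both sides can help locate the mismatch. The sketch below is an assumption about how one might diagnose this, not part of the reference code; `compare_keys` is a hypothetical helper and the key names are made up for illustration.

```python
from typing import Dict, Iterable, List


def compare_keys(model_keys: Iterable[str], ckpt_keys: Iterable[str]) -> Dict[str, List[str]]:
    """Report parameter names present on one side but not the other."""
    model_set, ckpt_set = set(model_keys), set(ckpt_keys)
    return {
        "missing_in_checkpoint": sorted(model_set - ckpt_set),
        "unexpected_in_checkpoint": sorted(ckpt_set - model_set),
    }


# Made-up key names; with real objects one would pass
# model.state_dict().keys() and the loaded checkpoint's state.keys().
report = compare_keys(
    model_keys=["encoder.weight", "decoder.weight"],
    ckpt_keys=["encoder.weight", "net._orig_mod.decoder.weight"],
)
print(report)
```

Matching names do not guarantee matching shapes, so a vocabulary-size difference between tokenizer and checkpoint can still surface as a shape error at load time.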
## Acknowledgements
- PyTorch for the deep learning framework
- Hugging Face Tokenizers for fast BPE