---
language:
  - zh
  - en
library_name: pytorch
tags:
  - translation
  - transformer
  - tiny-llm
  - zh-en
pipeline_tag: translation
---

# Tiny-LLM (ZH→EN) Checkpoint

Minimal Transformer encoder–decoder for Chinese → English translation. This repository hosts the inference assets (checkpoint and tokenizer) usable in Python or Gradio apps.

## Files

- `translate-step=290000.ckpt`: PyTorch Lightning checkpoint; the model weights are stored under the `state_dict` key
- `tokenizer.json`: Hugging Face Tokenizers file (BPE) with the special tokens `[UNK]`, `[PAD]`, `[SOS]`, `[EOS]`

## Quick start

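Download the files with `huggingface_hub`, then load them with your own model code. The model-specific lines are commented out because `src.config` and `src.model` stand for modules in your own project: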

```python
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Replace with your repo id if you fork
REPO_ID = "caixiaoshun/tiny-llm-zh2en"

ckpt_path = hf_hub_download(repo_id=REPO_ID, filename="translate-step=290000.ckpt")
tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="tokenizer.json")

# Example: integrate with a minimal Transformer implementation
# from src.config import Config
# from src.model import TranslateModel

# config = Config()
# config.tokenizer_file = tokenizer_path
# model = TranslateModel(config)
# state = torch.load(ckpt_path, map_location="cpu")["state_dict"]

# # Strip potential Lightning/compile prefixes
# prefix = "net._orig_mod."
# state = { (k[len(prefix):] if k.startswith(prefix) else k): v for k, v in state.items() }
# model.load_state_dict(state, strict=True)
# model.eval()

# tokenizer = Tokenizer.from_file(tokenizer_path)
```
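
The prefix-stripping step in the commented code can be tried on its own, without downloading anything. The following is a minimal sketch: `strip_prefix` is a hypothetical helper name, and the keys in `toy_state` are made up for illustration (a real `state_dict` maps parameter names to tensors).

```python
def strip_prefix(state, prefix="net._orig_mod."):
    """Return a copy of `state` with `prefix` removed from every key that starts with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state.items()
    }


# Toy stand-in for a checkpoint's state_dict; the key names are hypothetical.
toy_state = {
    "net._orig_mod.encoder.weight": 1,
    "net._orig_mod.decoder.weight": 2,
    "step_counter": 3,  # no prefix, so it is left unchanged
}

cleaned = strip_prefix(toy_state)
print(sorted(cleaned))
```

Keys without the prefix pass through untouched, so the helper is safe to apply even when the checkpoint was saved without `torch.compile`.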

If you deploy on Hugging Face Spaces or ModelScope, set environment variables to make your app fetch from this repo:

```bash
export HF_REPO_ID=caixiaoshun/tiny-llm-zh2en
export CKPT_FILE="translate-step=290000.ckpt"
export TOKENIZER_FILE=tokenizer.json
```
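
On the app side, these variables could be read roughly as follows. This is a sketch under assumptions: `resolve_assets` is a hypothetical helper, and falling back to this repo's file names when a variable is unset is a design choice made here, not something the reference app is known to do.

```python
import os


def resolve_assets(env=None):
    """Read the repo id and file names from the environment, with this repo's values as defaults."""
    env = os.environ if env is None else env
    return {
        "repo_id": env.get("HF_REPO_ID", "caixiaoshun/tiny-llm-zh2en"),
        "ckpt_file": env.get("CKPT_FILE", "translate-step=290000.ckpt"),
        "tokenizer_file": env.get("TOKENIZER_FILE", "tokenizer.json"),
    }


assets = resolve_assets()
print(assets["repo_id"], assets["ckpt_file"], assets["tokenizer_file"])
```

The resulting values can be passed straight to `hf_hub_download(repo_id=..., filename=...)` as in the quick start.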

## Notes

- Trained on a Chinese→English parallel dataset (CSV layout with ZH in column 0 and EN in column 1). When you load this checkpoint, the tokenizer and model hyperparameters must match those of the training run that produced it.
- Decoding strategies supported in the reference app: greedy, nucleus (top-p), and beam search.
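
To illustrate two of the decoding strategies named above, here is a minimal pure-Python sketch of greedy selection and nucleus (top-p) sampling for a single decoding step. It is not the reference app's implementation: the function names are hypothetical, and a real decoder would operate on tensors of model logits rather than Python lists.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Greedy decoding step: pick the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p, renormalized."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for idx in ranked:
        kept.append(idx)
        mass += probs[idx]
        if mass >= top_p:
            break
    return {idx: probs[idx] / mass for idx in kept}


def sample_top_p(logits, top_p=0.9, rng=random):
    """Nucleus sampling step: sample a token id from the renormalized top-p set."""
    nucleus = top_p_filter(softmax(logits), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


step_logits = [0.1, 5.0, -1.0]  # toy scores for a 3-token vocabulary
print(greedy(step_logits), sample_top_p(step_logits, top_p=0.5))
```

Greedy decoding is deterministic, while top-p sampling trades some determinism for variety; a lower `top_p` shrinks the candidate set and makes the two behave more alike.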

## Intended use

- Educational and demo purposes for small-scale translation tasks.
- Not intended for production use without further training or fine-tuning and evaluation.

## Limitations

- Small model capacity; outputs may be inaccurate or inconsistent on complex inputs.
- Tokenizer and checkpoint must match; mismatches lead to degraded results or load errors.
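
One cheap way to catch a tokenizer/checkpoint mismatch before calling `load_state_dict` is to compare the tokenizer's vocabulary size with the first dimension of the embedding weights. The sketch below assumes you know which parameters are vocabulary-sized; the key names and shapes are made up for illustration and are not taken from this checkpoint.

```python
def vocab_mismatches(vocab_size, param_shapes, embedding_keys):
    """Return the embedding keys whose first dimension differs from vocab_size."""
    return [
        key
        for key in embedding_keys
        if key in param_shapes and param_shapes[key][0] != vocab_size
    ]


# Toy shapes with hypothetical parameter names; not read from the real checkpoint.
shapes = {
    "src_embed.weight": (8000, 256),
    "tgt_embed.weight": (8200, 256),
}

bad = vocab_mismatches(8000, shapes, ["src_embed.weight", "tgt_embed.weight"])
print(bad)
```

With real assets, `vocab_size` would come from `tokenizer.get_vocab_size()` and `param_shapes` from `{k: tuple(v.shape) for k, v in state.items()}`.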

## Acknowledgements

- PyTorch for the deep learning framework
- Hugging Face Tokenizers for fast BPE