caixiaoshun committed 7c41697 (verified; parent: 52c2e61)

Update README.md

Files changed (1): README.md (+82, −3)
---
language:
- zh
- en
library_name: pytorch
tags:
- translation
- transformer
- tiny-llm
- zh-en
pipeline_tag: translation
---

# Tiny-LLM (ZH→EN) Checkpoint

Minimal Transformer encoder–decoder for Chinese → English translation. This repository hosts the inference assets (checkpoint and tokenizer) for use from Python scripts or Gradio apps.
17
+
18
+ ## Files
19
+
20
+ - `translate-step=290000.ckpt` — PyTorch state_dict checkpoint (Lightning-format state under `state_dict`)
21
+ - `tokenizer.json` — Hugging Face Tokenizers (BPE) with special tokens `[UNK]`, `[PAD]`, `[SOS]`, `[EOS]`
22
+

## Quick start

Download the checkpoint and tokenizer with `huggingface_hub`, then wire them into your own model code:

```python
import torch
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Replace with your repo id if you fork
REPO_ID = "caixiaoshun/tiny-llm-zh2en"

ckpt_path = hf_hub_download(repo_id=REPO_ID, filename="translate-step=290000.ckpt")
tokenizer_path = hf_hub_download(repo_id=REPO_ID, filename="tokenizer.json")

# The tokenizer loads standalone
tokenizer = Tokenizer.from_file(tokenizer_path)

# Example: integrate with a minimal Transformer implementation
# from src.config import Config
# from src.model import TranslateModel

# config = Config()
# config.tokenizer_file = tokenizer_path
# model = TranslateModel(config)
# state = torch.load(ckpt_path, map_location="cpu")["state_dict"]

# # Strip potential Lightning/compile prefixes
# prefix = "net._orig_mod."
# state = {(k[len(prefix):] if k.startswith(prefix) else k): v for k, v in state.items()}
# model.load_state_dict(state, strict=True)
# model.eval()
```
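The prefix-stripping step above can be exercised on a toy state dict; `net._orig_mod.` is the prefix that Lightning plus `torch.compile` typically prepend to parameter names. The helper below is a plain-Python sketch of that one step, not part of the repository's code:

```python
def strip_prefix(state: dict, prefix: str = "net._orig_mod.") -> dict:
    """Remove a leading prefix from every matching key in a state dict."""
    return {
        (k[len(prefix):] if k.startswith(prefix) else k): v
        for k, v in state.items()
    }

# Toy state dict mimicking a compiled Lightning checkpoint
state = {
    "net._orig_mod.encoder.weight": 1,
    "net._orig_mod.decoder.weight": 2,
    "head.bias": 3,  # keys without the prefix pass through unchanged
}
print(strip_prefix(state))
# {'encoder.weight': 1, 'decoder.weight': 2, 'head.bias': 3}
```

Loading with `strict=True` afterwards is a useful sanity check: any key the stripping missed will raise immediately instead of silently leaving weights uninitialized.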

If you deploy on Hugging Face Spaces or ModelScope, set environment variables so your app fetches from this repo:

```bash
export HF_REPO_ID=caixiaoshun/tiny-llm-zh2en
export CKPT_FILE=translate-step=290000.ckpt
export TOKENIZER_FILE=tokenizer.json
```
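On the app side, a minimal sketch of reading those variables (the fallback defaults below are assumptions for illustration, not mandated by the repo):

```python
import os

# Fall back to this repo's published filenames when the variables are unset
repo_id = os.environ.get("HF_REPO_ID", "caixiaoshun/tiny-llm-zh2en")
ckpt_file = os.environ.get("CKPT_FILE", "translate-step=290000.ckpt")
tokenizer_file = os.environ.get("TOKENIZER_FILE", "tokenizer.json")

# These values then feed straight into hf_hub_download, e.g.:
# ckpt_path = hf_hub_download(repo_id=repo_id, filename=ckpt_file)
# tokenizer_path = hf_hub_download(repo_id=repo_id, filename=tokenizer_file)
print(repo_id, ckpt_file, tokenizer_file)
```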

## Notes

- Trained on a Chinese→English parallel dataset (CSV layout with Chinese in column 0 and English in column 1). Ensure the tokenizer and model hyperparameters match your training run.
- The reference app supports three decoding strategies: greedy, nucleus (top-p), and beam search.

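Of those strategies, nucleus (top-p) filtering is easy to sketch in isolation: keep the smallest set of tokens whose probabilities sum to at least `p`, then renormalize and sample only from that set. The helper below is a pure-Python illustration of the idea, not the reference app's actual implementation:

```python
def top_p_filter(probs: dict, p: float = 0.9) -> dict:
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (scanning in descending order), renormalized to sum to 1."""
    kept, total = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return {t: pr / total for t, pr in kept.items()}

dist = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(top_p_filter(dist, p=0.7))  # "the" and "a" survive; low-probability tail is cut
```

A sampler would then draw from the renormalized survivors, which trims unlikely tokens while keeping more diversity than greedy decoding.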
## Intended use

- Educational and demo purposes for small-scale translation tasks.
- Not intended for production-grade translation quality without further training/fine-tuning and evaluation.

## Limitations

- Small model capacity; outputs may be inaccurate or inconsistent on complex inputs.
- The tokenizer and checkpoint must match; mismatches lead to degraded results or load errors.

## Acknowledgements

- PyTorch for the deep learning framework
- Hugging Face Tokenizers for fast BPE