---
language:
  - zh
  - en
tags:
  - translation
  - chinese
  - nlp
  - text-processing
  - markdown
  - offline
  - deep-translator
  - marianmt
license: mit
library_name: transformers
pipeline_tag: translation
model-index:
  - name: Helsinki-NLP/opus-mt-zh-en
    results: []
---

# ChineseFileTranslator

[![Python](https://img.shields.io/badge/Python-3.9%2B-blue?logo=python&logoColor=white)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Version](https://img.shields.io/badge/version-1.0.0-orange)](CHANGELOG.md)
[![Offline Support](https://img.shields.io/badge/offline-Helsinki--NLP%2Fopus--mt--zh--en-blueviolet)](https://huggingface.co/Helsinki-NLP/opus-mt-zh-en)
[![Hugging Face](https://img.shields.io/badge/HuggingFace-Model-yellow?logo=huggingface)](https://huggingface.co/Helsinki-NLP/opus-mt-zh-en)
[![Maintenance](https://img.shields.io/badge/Maintained-yes-brightgreen)](https://github.com/algorembrant/ChineseFileTranslator)
[![Code Style](https://img.shields.io/badge/code%20style-PEP8-informational)](https://peps.python.org/pep-0008/)

Translate Chinese text (Simplified, Traditional, Cantonese, Classical) inside `.txt` and `.md` files
to English. Preserves full Markdown syntax. Supports Google Translate, Microsoft Translator, and a
fully offline Helsinki-NLP MarianMT backend with vectorized batching.

---

## Key Features

- **'Never Miss' Translation Strategy**: Finds every run of Chinese text in the document while leaving the surrounding structure untouched.
- **Inclusive CJK Detection**: Covers the basic CJK block, Extensions A-E (including the supplementary-plane ranges), and CJK symbols and punctuation.
- **Markdown Protection**: Frontmatter, code blocks, links, and HTML are tokenized before translation so they pass through unchanged.
- **Robust Placeholder Restoration**: Space-lenient, case-insensitive matching restores placeholders even when a translation engine alters them.
- **Backend Resilience**: Failures are detected explicitly, retried automatically, and handled with a fallback instead of a crash.
- **Offline Option**: Fully local Helsinki-NLP MarianMT backend with vectorized batching.
- **Bilingual Mode**: Optional side-by-side Chinese and English output.
- **Batch Processing**: Translate entire directories with recursive file discovery and persistent configuration.

---

## Project Structure

```
ChineseFileTranslator/
├── chinese_file_translator.py   # Main script (single-file, no extra modules)
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── .gitattributes               # Git line-ending and LFS rules
├── .gitignore                   # Ignored paths
└── LICENSE                      # MIT License
```

---

## Quickstart

### 1. Clone the repository

```bash
git clone https://github.com/algorembrant/ChineseFileTranslator.git
cd ChineseFileTranslator
```

### 2. Create and activate a virtual environment (recommended)

```bash
python -m venv venv
# Windows
venv\Scripts\activate
# Linux / macOS
source venv/bin/activate
```

### 3. Install core dependencies

```bash
pip install -r requirements.txt
```

### 4. (Optional) Install offline translation backend

Choose the correct PyTorch build for your system:

```bash
# CPU only
pip install torch --index-url https://download.pytorch.org/whl/cpu

# CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Then install Transformers stack
pip install transformers sentencepiece sacremoses
```

The Helsinki-NLP/opus-mt-zh-en model (~300 MB) downloads automatically on first use.

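As a rough illustration of what vectorized batching with this model can look like, the sketch below groups input strings into fixed-size batches and translates each batch in one padded forward pass. It is not the tool's actual code: the helper names `batched` and `translate_offline` and the default batch size are assumptions made for the example, while the `transformers` and `torch` calls are the standard MarianMT ones. Calling `translate_offline(["你好，世界"])` triggers the model download on first use.

```python
from typing import Iterator, List

MODEL_NAME = "Helsinki-NLP/opus-mt-zh-en"


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_offline(texts: List[str], batch_size: int = 10,
                      use_gpu: bool = False) -> List[str]:
    """Translate Chinese strings to English, one padded batch at a time.

    Hypothetical helper for illustration; not taken from the tool itself.
    """
    # Imported lazily so the optional dependencies are only needed here.
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = "cuda" if use_gpu and torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
    model = MarianMTModel.from_pretrained(MODEL_NAME).to(device)
    model.eval()

    results: List[str] = []
    for batch in batched(texts, batch_size):
        # Padding lets sentences of different lengths share one tensor.
        encoded = tokenizer(batch, return_tensors="pt",
                            padding=True, truncation=True).to(device)
        with torch.no_grad():
            generated = model.generate(**encoded)
        results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return results
```
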
---

## Usage

### Command Reference

| Command | Description |
|---|---|
| `python chinese_file_translator.py input.txt` | Translate a plain-text file (Google backend) |
| `python chinese_file_translator.py input.md` | Translate a Markdown file, preserve structure |
| `python chinese_file_translator.py input.txt -o out.txt` | Set explicit output path |
| `python chinese_file_translator.py input.txt --backend offline` | Use offline MarianMT model |
| `python chinese_file_translator.py input.txt --backend microsoft` | Use Microsoft Translator |
| `python chinese_file_translator.py input.txt --offline --gpu` | Offline + GPU (CUDA) |
| `python chinese_file_translator.py input.txt --lang simplified` | Force Simplified Chinese |
| `python chinese_file_translator.py input.txt --lang traditional` | Force Traditional Chinese |
| `python chinese_file_translator.py input.txt --bilingual` | Keep Chinese + show English |
| `python chinese_file_translator.py input.txt --extract-only` | Extract Chinese lines only |
| `python chinese_file_translator.py input.txt --stdout` | Print output to terminal |
| `python chinese_file_translator.py --batch ./docs/` | Batch translate a directory |
| `python chinese_file_translator.py --batch ./in/ --batch-out ./out/` | Batch with output dir |
| `python chinese_file_translator.py input.txt --chunk-size 2000` | Custom chunk size |
| `python chinese_file_translator.py input.txt --export-history h.json` | Export history |
| `python chinese_file_translator.py input.txt --verbose` | Debug logging |
| `python chinese_file_translator.py --version` | Print version |
| `python chinese_file_translator.py --help` | Full help |

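To make the `--batch` and `--batch-out` rows more concrete, here is a minimal sketch of how recursive discovery and output naming could work. The function name `plan_batch`, the `_translated` suffix default, and the rule that skips files already carrying that suffix are assumptions for illustration, not the script's confirmed behavior.

```python
from pathlib import Path
from typing import List, Optional, Tuple

SUPPORTED_SUFFIXES = {".txt", ".md"}


def plan_batch(batch_dir: str, batch_out: Optional[str] = None,
               output_suffix: str = "_translated") -> List[Tuple[Path, Path]]:
    """Return (source, destination) pairs for every .txt/.md file under batch_dir."""
    src_root = Path(batch_dir)
    # Without an explicit output directory, results land next to the inputs.
    dst_root = Path(batch_out) if batch_out else src_root
    plan: List[Tuple[Path, Path]] = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        # Assumption: skip outputs of an earlier run so they are not re-translated.
        if src.stem.endswith(output_suffix):
            continue
        relative = src.relative_to(src_root)
        new_name = f"{src.stem}{output_suffix}{src.suffix}"
        plan.append((src, dst_root / relative.with_name(new_name)))
    return plan
```

With this sketch, `plan_batch("./in/", "./out/")` would map `./in/sub/b.md` to `./out/sub/b_translated.md`, mirroring the input directory layout under the output directory.
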
### Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| `input` | positional | - | Path to `.txt` or `.md` file |
| `-o / --output` | string | `<name>_translated.<ext>` | Output file path |
| `--batch DIR` | string | - | Directory to batch translate |
| `--batch-out DIR` | string | same as `--batch` | Output directory for batch |
| `--backend` | choice | `google` | `google`, `microsoft`, `offline` |
| `--offline` | flag | `false` | Shorthand for `--backend offline` |
| `--lang` | choice | `auto` | `auto`, `simplified`, `traditional` |
| `--gpu` | flag | `false` | Use CUDA for offline model |
| `--confidence` | float | `0.05` | Min Chinese character ratio for detection |
| `--chunk-size` | int | `4000` | Max chars per translation request |
| `--bilingual` | flag | `false` | Output both Chinese and English |
| `--extract-only` | flag | `false` | Save only the detected Chinese lines |
| `--stdout` | flag | `false` | Print result to stdout |
| `--export-history` | string | - | Save session history to JSON |
| `--verbose` | flag | `false` | Enable DEBUG logging |
| `--version` | flag | - | Show version and exit |

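The `--confidence` threshold can be pictured with the sketch below. It assumes the ratio is the share of Han characters among non-whitespace characters and checks only the basic block and Extension A; the tool's real definition and character ranges may differ, and the names `chinese_ratio` and `looks_chinese` are hypothetical. Under these assumptions, `chinese_ratio("你好 ab")` evaluates to `0.5`.

```python
import re

# Assumption: only the basic CJK block and Extension A count as "Chinese" here.
HAN_RE = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def chinese_ratio(text: str) -> float:
    """Share of Han characters among the non-whitespace characters of `text`."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    han = sum(1 for ch in visible if HAN_RE.match(ch))
    return han / len(visible)


def looks_chinese(text: str, confidence: float = 0.05) -> bool:
    """True when the Han-character ratio reaches the confidence threshold."""
    return chinese_ratio(text) >= confidence
```
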
---

## Configuration

The tool writes a JSON config file on first run:

```
~/.chinese_file_translator/config.json
```

Example `config.json`:

```json
{
  "backend": "google",
  "lang": "auto",
  "use_gpu": false,
  "chunk_size": 4000,
  "batch_size": 10,
  "bilingual": false,
  "microsoft_api_key": "YOUR_KEY_HERE",
  "microsoft_region": "eastus",
  "offline_model_dir": "~/.chinese_file_translator/models",
  "output_suffix": "_translated",
  "retry_attempts": 3,
  "retry_delay_seconds": 1.5,
  "max_history": 1000
}
```

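One way a script could consume this file is to start from built-in defaults and overlay whatever the user has saved. The sketch below assumes exactly that merge behavior; `load_config` and `DEFAULTS` are illustrative names, and only a subset of the keys from the example is shown.

```python
import json
from pathlib import Path
from typing import Any, Dict

CONFIG_PATH = Path.home() / ".chinese_file_translator" / "config.json"

# Subset of the keys shown in the example config, used as fallbacks.
DEFAULTS: Dict[str, Any] = {
    "backend": "google",
    "lang": "auto",
    "use_gpu": False,
    "chunk_size": 4000,
    "retry_attempts": 3,
    "retry_delay_seconds": 1.5,
}


def load_config(path: Path = CONFIG_PATH) -> Dict[str, Any]:
    """Return the defaults overlaid with any values stored in the JSON file."""
    config = dict(DEFAULTS)
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            config.update(json.load(fh))
    return config
```
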
---

## Supported Chinese Variants

| Variant | Notes |
|---|---|
| Simplified Chinese | Mandarin, mainland China standard |
| Traditional Chinese | Taiwan, Hong Kong, Macau standard |
| Cantonese / Yue | Detected via CJK Unicode ranges |
| Classical Chinese | Treated as Traditional for translation |
| Mixed Chinese-English | Code-switching text handled transparently |

---

## Translation Backends

| Backend | Requires | Speed | Quality | Internet |
|---|---|---|---|---|
| Google Translate | `deep-translator` | Fast | High | Yes |
| Microsoft Translator | Azure API key + `deep-translator` | Fast | High | Yes |
| Helsinki-NLP MarianMT | `transformers`, `torch` | Medium | Good | No (after download) |

Google Translate is the default. If it fails, the tool falls back to the offline model automatically.

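The retry-then-fallback behavior can be sketched as follows. This is an assumed shape rather than the tool's code: backends are modeled as plain callables, an empty result is treated as a failure, and the original text is returned if every backend fails. The name `translate_with_fallback` is hypothetical; the parameter names echo the `retry_attempts` and `retry_delay_seconds` config keys.

```python
import time
from typing import Callable, Optional

Translator = Callable[[str], str]


def translate_with_fallback(text: str, primary: Translator,
                            fallback: Optional[Translator] = None,
                            retry_attempts: int = 3,
                            retry_delay_seconds: float = 1.5) -> str:
    """Try `primary` up to `retry_attempts` times, then `fallback`, then give up."""
    for attempt in range(1, retry_attempts + 1):
        try:
            result = primary(text)
            # Assumption: an empty or whitespace-only reply counts as a failure.
            if result and result.strip():
                return result
        except Exception:
            pass  # fall through to the retry delay
        if attempt < retry_attempts:
            time.sleep(retry_delay_seconds)
    if fallback is not None:
        try:
            return fallback(text)
        except Exception:
            pass
    # Non-crashing last resort: hand back the untranslated text.
    return text
```
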
---

## Technical Strategy: 'Never Miss' Logic

The tool uses a "Global Surgical" approach: it scans the whole document for Chinese text, so fragments nested deep inside JSON, HTML, or complex Markdown are still found and translated.

### 1. Surgical Block Extraction
Instead of line-by-line translation, the script identifies every continuous block of CJK characters (including ideographic symbols and punctuation) across the entire document. This ensures that contextually related characters are translated together for better accuracy.

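A minimal sketch of this kind of block extraction, using a single regular expression over the whole document, is shown below. The exact ranges the script uses are not documented here, so the character class is an assumption based on the ranges named in this README, and `find_cjk_blocks` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Assumed ranges; the script's real character class may differ.
CJK_BLOCK_RE = re.compile(
    "["
    "\u3000-\u303f"          # CJK symbols and punctuation
    "\u3400-\u4dbf"          # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"          # CJK Unified Ideographs (basic block)
    "\uf900-\ufaff"          # CJK compatibility ideographs
    "\uff00-\uffef"          # halfwidth and fullwidth forms
    "\U00020000-\U0002ceaf"  # Extensions B through E (supplementary planes)
    "]+"
)


def find_cjk_blocks(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, block) for every continuous run of CJK characters."""
    return [(m.start(), m.end(), m.group()) for m in CJK_BLOCK_RE.finditer(text)]


sample = '{"title": "你好，世界", "note": "ok 测试"}'
print([block for _, _, block in find_cjk_blocks(sample)])
# prints ['你好，世界', '测试']
```

Because the fullwidth comma falls inside the matched ranges, `你好，世界` is captured as one block rather than two, which is what lets related characters be translated together.
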
### 2. Structural Protection
Markdown and metadata structures are tokenized using unique, collision-resistant placeholders (`___MY_PROTECT_PH_{idx}___`).
- **YAML/TOML**: Frontmatter is protected globally.
- **Code Fences**: Backticks and language identifiers are protected; Chinese content *inside* comments or strings remains translatable.
- **Links & HTML**: URLs and tag names are guarded, while display text is surgically translated.

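The placeholder idea can be illustrated with the simplified sketch below. It covers only code-fence lines, inline code, link URLs, and HTML tags with deliberately rough patterns; frontmatter handling is omitted, and the regular expressions and the `protect` helper are assumptions for the example, not the script's own. Only the placeholder format is taken from this README.

```python
import re
from typing import List, Tuple

PLACEHOLDER = "___MY_PROTECT_PH_{idx}___"

# Simplified, assumed patterns; order matters (fences before inline code).
PROTECT_RE = re.compile(
    r"```[^\n]*"                # code-fence line with optional language identifier
    r"|`[^`\n]+`"               # inline code span
    r"|(?<=\]\()[^)\s]+(?=\))"  # URL part of [text](url)
    r"|</?[A-Za-z][^>]*>"       # HTML tag
)


def protect(text: str) -> Tuple[str, List[str]]:
    """Swap protected spans for numbered placeholders and return both pieces."""
    saved: List[str] = []

    def stash(match: re.Match) -> str:
        saved.append(match.group(0))
        return PLACEHOLDER.format(idx=len(saved) - 1)

    return PROTECT_RE.sub(stash, text), saved


masked, saved = protect("见 [文档](https://example.com/a) 和 `pip install x`")
print(masked)
# prints 见 [文档](___MY_PROTECT_PH_0___) 和 ___MY_PROTECT_PH_1___
```

The link's display text stays in the masked string and remains translatable, while the URL and the inline code are held in `saved` until restoration.
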
### 3. Verification & Restoration
- **Longest-First Replacement**: Translated segments are restored starting from the longest strings to prevent partial match overwrites.
- **Fuzzy Restoration**: The restoration logic is space-lenient and case-insensitive to handle cases where online translation engines mangle the placeholder tokens.

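Both restoration steps can be sketched in a few lines. The tolerant pattern below is an assumption about what "space-lenient and case-insensitive" could mean in practice, and `restore_placeholders` and `apply_translations` are hypothetical names. With a glossary of `中文` and `中文文档`, replacing the longer key first turns `中文文档 和 中文` into `Chinese documentation 和 Chinese` instead of leaving a stray `文档` behind.

```python
import re
from typing import Dict, List

# Tolerates changed case and stray spaces or underscores inside the token.
TOKEN_RE = re.compile(r"_+\s*my[\s_]*protect[\s_]*ph[\s_]*(\d+)\s*_+", re.IGNORECASE)


def restore_placeholders(text: str, saved: List[str]) -> str:
    """Put protected spans back, even if the engine altered the placeholder."""
    def put_back(match: re.Match) -> str:
        idx = int(match.group(1))
        # Unknown indices are left untouched rather than raising an error.
        return saved[idx] if idx < len(saved) else match.group(0)

    return TOKEN_RE.sub(put_back, text)


def apply_translations(text: str, translations: Dict[str, str]) -> str:
    """Replace source segments longest-first so short keys cannot clobber long ones."""
    for source in sorted(translations, key=len, reverse=True):
        text = text.replace(source, translations[source])
    return text
```
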
---

## Markdown Preservation

The following elements are protected during translation:

| Element | Example | Protection Method |
|---|---|---|
| Front Matter | `---\ntitle: ...\n---` | Full Tokenization |
| Fenced Code | ` ```python ... ``` ` | Boundary Tokenization |
| Inline Code | `` `code` `` | Full Tokenization |
| Links / Images | `[text](url)` | URL Tokenization |
| HTML Tags | `<div class="...">` | Tag Tokenization |
| Symbols | `&copy;`, `&#x...;` | Entity Tokenization |

---

## Microsoft Translator Setup

1. Go to the [Azure portal](https://portal.azure.com/) and open Azure Cognitive Services
2. Create a Translator resource (Free tier: 2M chars/month)
3. Copy your API key and region
4. Add them to `~/.chinese_file_translator/config.json`:

```json
{
  "microsoft_api_key": "abc123...",
  "microsoft_region": "eastus"
}
```

Then run:

```bash
python chinese_file_translator.py input.txt --backend microsoft
```

---

## Files Generated

| Path | Description |
|---|---|
| `~/.chinese_file_translator/config.json` | Persistent settings |
| `~/.chinese_file_translator/history.json` | Session history log |
| `~/.chinese_file_translator/app.log` | Application log file |
| `~/.chinese_file_translator/models/` | Offline model cache (if used) |

---

## Author

**algorembrant**

---

## License

MIT License. See [LICENSE](LICENSE) for details.