---
language:
- zh
- en
tags:
- translation
- chinese
- nlp
- text-processing
- markdown
- offline
- deep-translator
- marianmt
license: mit
library_name: transformers
pipeline_tag: translation
model-index:
- name: Helsinki-NLP/opus-mt-zh-en
results: []
---
# ChineseFileTranslator
Translate Chinese text (Simplified, Traditional, Cantonese, Classical) inside `.txt` and `.md` files
to English. Preserves full Markdown syntax. Supports Google Translate, Microsoft Translator, and a
fully offline Helsinki-NLP MarianMT backend with vectorized batching.
---
### Key Features
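- **'Never Miss' Global Surgical Translation**: Scans the whole document for Chinese text blocks instead of translating line by line, while protecting document structure.
- **Inclusive CJK Detection**: Covers CJK Unified Ideographs (basic block and Extensions A-E, including code points beyond U+FFFF) plus CJK symbols and punctuation.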
- **Proactive Markdown Protection**: Frontmatter, code blocks, links, and HTML are safely tokenized.
- **Robust Placeholder Restoration**: Space-lenient, case-insensitive restoration handles engine mangling.
- **Backend Resilience**: Explicit failure detection with automatic retries and non-crashing fallbacks.
- **Offline First Option**: Fully local Helsinki-NLP MarianMT backend with vectorized batching.
- **Bilingual Mode**: Optional side-by-side Chinese and English output.
- **Batch Processing**: Translate entire directories with recursive discovery and persistent configuration.
---
## Project Structure
```
ChineseFileTranslator/
├── chinese_file_translator.py   # Main script (single-file, no extra modules)
├── requirements.txt             # Python dependencies
├── README.md                    # This file
├── .gitattributes               # Git line-ending and LFS rules
├── .gitignore                   # Ignored paths
└── LICENSE                      # MIT License
```
---
## Quickstart
### 1. Clone the repository
```bash
git clone https://github.com/algorembrant/ChineseFileTranslator.git
cd ChineseFileTranslator
```
### 2. Create and activate a virtual environment (recommended)
```bash
python -m venv venv
# Windows
venv\Scripts\activate
# Linux / macOS
source venv/bin/activate
```
### 3. Install core dependencies
```bash
pip install -r requirements.txt
```
### 4. (Optional) Install offline translation backend
Choose the correct PyTorch build for your system:
```bash
# CPU only
pip install torch --index-url https://download.pytorch.org/whl/cpu
# CUDA 12.1
pip install torch --index-url https://download.pytorch.org/whl/cu121
# Then install Transformers stack
pip install transformers sentencepiece sacremoses
```
The Helsinki-NLP/opus-mt-zh-en model (~300 MB) downloads automatically on first use.
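
As a rough illustration of what batched offline translation can look like, the sketch below calls the model through the `transformers` Marian classes. The function names `batched` and `translate_offline` and the default batch size are assumptions made for this example, not the script's actual internals.

```python
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_offline(texts: List[str], batch_size: int = 10,
                      device: str = "cpu") -> List[str]:
    """Translate Chinese strings to English, one padded batch at a time."""
    # Imported lazily so `batched` works without the optional dependencies.
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-zh-en"
    tokenizer = MarianTokenizer.from_pretrained(model_name)  # downloads on first use
    model = MarianMTModel.from_pretrained(model_name).to(device)

    results: List[str] = []
    for chunk in batched(texts, batch_size):
        # Padding lets the whole chunk go through the model in a single call.
        inputs = tokenizer(chunk, return_tensors="pt",
                           padding=True, truncation=True).to(device)
        generated = model.generate(**inputs)
        results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return results
```

Pass `device="cuda"` to run on a GPU when a CUDA build of PyTorch is installed.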
---
## Usage
### Command Reference
| Command | Description |
|---|---|
| `python chinese_file_translator.py input.txt` | Translate a plain-text file (Google backend) |
| `python chinese_file_translator.py input.md` | Translate a Markdown file, preserve structure |
| `python chinese_file_translator.py input.txt -o out.txt` | Set explicit output path |
| `python chinese_file_translator.py input.txt --backend offline` | Use offline MarianMT model |
| `python chinese_file_translator.py input.txt --backend microsoft` | Use Microsoft Translator |
| `python chinese_file_translator.py input.txt --offline --gpu` | Offline + GPU (CUDA) |
| `python chinese_file_translator.py input.txt --lang simplified` | Force Simplified Chinese |
| `python chinese_file_translator.py input.txt --lang traditional` | Force Traditional Chinese |
| `python chinese_file_translator.py input.txt --bilingual` | Keep Chinese + show English |
| `python chinese_file_translator.py input.txt --extract-only` | Extract Chinese lines only |
| `python chinese_file_translator.py input.txt --stdout` | Print output to terminal |
| `python chinese_file_translator.py --batch ./docs/` | Batch translate a directory |
| `python chinese_file_translator.py --batch ./in/ --batch-out ./out/` | Batch with output dir |
| `python chinese_file_translator.py input.txt --chunk-size 2000` | Custom chunk size |
| `python chinese_file_translator.py input.txt --export-history h.json` | Export history |
| `python chinese_file_translator.py input.txt --verbose` | Debug logging |
| `python chinese_file_translator.py --version` | Print version |
| `python chinese_file_translator.py --help` | Full help |
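
For batch mode, recursive discovery and output naming could be sketched as follows. The helper `plan_batch` is hypothetical; it only assumes the `.txt`/`.md` filter and the `<name>_translated.<ext>` naming described in this README, and skipping earlier outputs is a choice made for the example.

```python
from pathlib import Path
from typing import List, Optional, Tuple

SUPPORTED_SUFFIXES = {".txt", ".md"}


def plan_batch(batch_dir: str, out_dir: Optional[str] = None,
               suffix: str = "_translated") -> List[Tuple[Path, Path]]:
    """Pair every .txt/.md file under batch_dir with its output path."""
    src_root = Path(batch_dir)
    dst_root = Path(out_dir) if out_dir else src_root
    pairs = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in SUPPORTED_SUFFIXES:
            continue
        if src.stem.endswith(suffix):
            continue  # assumption: skip outputs left by an earlier run
        rel = src.relative_to(src_root)
        # Mirror the source sub-directory layout under the output root.
        dst = dst_root / rel.with_name(f"{src.stem}{suffix}{src.suffix}")
        pairs.append((src, dst))
    return pairs
```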
### Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| `input` | positional | - | Path to `.txt` or `.md` file |
| `-o / --output` | string | `<name>_translated.<ext>` | Output file path |
| `--batch DIR` | string | - | Directory to batch translate |
| `--batch-out DIR` | string | same as `--batch` | Output directory for batch |
| `--backend` | choice | `google` | `google`, `microsoft`, `offline` |
| `--offline` | flag | `false` | Shorthand for `--backend offline` |
| `--lang` | choice | `auto` | `auto`, `simplified`, `traditional` |
| `--gpu` | flag | `false` | Use CUDA for offline model |
| `--confidence` | float | `0.05` | Min Chinese character ratio for detection |
| `--chunk-size` | int | `4000` | Max chars per translation request |
| `--bilingual` | flag | `false` | Output both Chinese and English |
| `--extract-only` | flag | `false` | Save only the detected Chinese lines |
| `--stdout` | flag | `false` | Print result to stdout |
| `--export-history` | string | - | Save session history to JSON |
| `--verbose` | flag | `false` | Enable DEBUG logging |
| `--version` | flag | - | Show version and exit |
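
To make `--confidence` and `--chunk-size` more concrete, here is a minimal sketch of one way such values could be applied. The functions `chinese_ratio`, `looks_chinese`, and `split_chunks` are illustrative names, and both the ratio definition (ideographs over non-whitespace characters) and the line-boundary splitting are assumptions rather than the script's confirmed behavior.

```python
import re
from typing import List

# Basic block, Extension A, and Extensions B-E of CJK Unified Ideographs.
CJK_CHAR = re.compile("[\u3400-\u4dbf\u4e00-\u9fff\U00020000-\U0002ceaf]")


def chinese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK ideographs."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    hits = sum(1 for ch in visible if CJK_CHAR.match(ch))
    return hits / len(visible)


def looks_chinese(text: str, confidence: float = 0.05) -> bool:
    """True when the ratio reaches the --confidence threshold."""
    return chinese_ratio(text) >= confidence


def split_chunks(text: str, chunk_size: int = 4000) -> List[str]:
    """Split text into pieces of at most chunk_size chars, preferring line breaks."""
    chunks: List[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is cut at fixed offsets.
        while len(line) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:chunk_size])
            line = line[chunk_size:]
        if len(current) + len(line) > chunk_size:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks
```

Joining the chunks returned by `split_chunks` reproduces the input exactly, so translated pieces can be concatenated in order.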
---
## Configuration
The tool writes a JSON config file on first run:
```
~/.chinese_file_translator/config.json
```
Example `config.json`:
```json
{
"backend": "google",
"lang": "auto",
"use_gpu": false,
"chunk_size": 4000,
"batch_size": 10,
"bilingual": false,
"microsoft_api_key": "YOUR_KEY_HERE",
"microsoft_region": "eastus",
"offline_model_dir": "~/.chinese_file_translator/models",
"output_suffix": "_translated",
"retry_attempts": 3,
"retry_delay_seconds": 1.5,
"max_history": 1000
}
```
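
Reading this file with fallbacks to defaults might look like the sketch below. `load_config` and the `DEFAULTS` dictionary are hypothetical; the default values are copied from the example above, and merging user values over defaults is an assumption about how missing keys are handled.

```python
import json
import os
from pathlib import Path
from typing import Optional

# Defaults copied from the example config; the real script may define more keys.
DEFAULTS = {
    "backend": "google",
    "lang": "auto",
    "use_gpu": False,
    "chunk_size": 4000,
    "batch_size": 10,
    "bilingual": False,
    "offline_model_dir": "~/.chinese_file_translator/models",
    "output_suffix": "_translated",
    "retry_attempts": 3,
    "retry_delay_seconds": 1.5,
    "max_history": 1000,
}


def load_config(path: Optional[str] = None) -> dict:
    """Return DEFAULTS overlaid with whatever the JSON file provides."""
    if path is None:
        path = os.path.expanduser("~/.chinese_file_translator/config.json")
    config = dict(DEFAULTS)
    file_path = Path(path)
    if file_path.exists():
        with file_path.open(encoding="utf-8") as fh:
            config.update(json.load(fh))
    # Expand "~" so later file operations receive an absolute path.
    config["offline_model_dir"] = os.path.expanduser(config["offline_model_dir"])
    return config
```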
---
## Supported Chinese Variants
| Variant | Notes |
|---|---|
| Simplified Chinese | Mandarin, mainland China standard |
| Traditional Chinese | Taiwan, Hong Kong, Macau standard |
| Cantonese / Yue | Detected via CJK Unicode ranges |
| Classical Chinese | Treated as Traditional for translation |
| Mixed Chinese-English | Code-switching text handled transparently |
---
## Translation Backends
| Backend | Requires | Speed | Quality | Internet |
|---|---|---|---|---|
| Google Translate | `deep-translator` | Fast | High | Yes |
| Microsoft Translator | Azure API key + `deep-translator` | Fast | High | Yes |
| Helsinki-NLP MarianMT | `transformers`, `torch` | Medium | Good | No (after download) |
Google Translate is the default. If it fails, the tool falls back to the offline model automatically.
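
The retry-then-fallback behavior can be pictured with the following sketch. `translate_with_fallback` is a hypothetical helper: backends are passed in as plain callables, the retry settings reuse the `retry_attempts` and `retry_delay_seconds` names from the config example, and returning the source text when every backend fails is an assumption about what "non-crashing" means here.

```python
import logging
import time
from typing import Callable, Sequence

log = logging.getLogger(__name__)


def translate_with_fallback(text: str,
                            backends: Sequence[Callable[[str], str]],
                            retry_attempts: int = 3,
                            retry_delay_seconds: float = 1.5,
                            sleep: Callable[[float], None] = time.sleep) -> str:
    """Try each backend in order, retrying failures before moving on."""
    for backend in backends:
        for attempt in range(1, retry_attempts + 1):
            try:
                result = backend(text)
                # An empty result counts as a failure, not a translation.
                if result and result.strip():
                    return result
                log.warning("empty result (attempt %d)", attempt)
            except Exception as exc:  # network errors, quota limits, etc.
                log.warning("backend failed (attempt %d): %s", attempt, exc)
            if attempt < retry_attempts:
                sleep(retry_delay_seconds)
    # Assumption: keep the original text instead of raising.
    return text
```

In practice the list would hold the online backend first and the offline model second, for example `[google_translate, offline_translate]`, where both names are placeholders for real callables.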
---
## Technical Strategy: 'Never Miss' Logic
The tool uses a "Global Surgical" approach: it scans the whole document for Chinese fragments so that none is overlooked, however deeply it is nested in JSON, HTML, or complex Markdown.
### 1. Surgical Block Extraction
Instead of line-by-line translation, the script identifies every continuous block of CJK characters (including ideographic symbols and punctuation) across the entire document. This ensures that contextually related characters are translated together for better accuracy.
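
A minimal sketch of block extraction, assuming a regular expression over the CJK ranges listed under Key Features. The exact ranges, the `CJK_BLOCK` pattern, and the helper names are assumptions for illustration; the script's own character classes may differ.

```python
import re
from typing import Callable, List, Tuple

CJK_BLOCK = re.compile(
    "["
    "\u3000-\u303f"          # CJK symbols and punctuation
    "\u3400-\u4dbf"          # Extension A
    "\u4e00-\u9fff"          # basic CJK Unified Ideographs
    "\uff00-\uffef"          # halfwidth and fullwidth forms
    "\U00020000-\U0002ceaf"  # Extensions B-E
    "]+"
)


def find_cjk_blocks(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, block) for every continuous run of CJK characters."""
    return [(m.start(), m.end(), m.group()) for m in CJK_BLOCK.finditer(text)]


def translate_blocks(text: str, translate: Callable[[str], str]) -> str:
    """Replace each CJK block in place, leaving all other characters untouched."""
    return CJK_BLOCK.sub(lambda m: translate(m.group()), text)
```

Because the match works on character runs rather than lines, a Chinese value inside a JSON string or HTML attribute is found without parsing the surrounding format.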
### 2. Structural Protection
Markdown and metadata structures are tokenized using unique, collision-resistant placeholders (`___MY_PROTECT_PH_{idx}___`).
- **YAML/TOML**: Frontmatter is protected globally.
- **Code Fences**: Backticks and language identifiers are protected; Chinese content *inside* comments or strings remains translatable.
- **Links & HTML**: URLs and tag names are guarded, while display text is surgically translated.
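
The protection step can be sketched roughly as below, using the placeholder format shown above. The regular expressions and the `protect` function are simplified assumptions covering front matter, fence boundary lines, and inline code only.

```python
import re
from typing import List, Tuple

PLACEHOLDER = "___MY_PROTECT_PH_{idx}___"

PROTECT_PATTERNS = [
    re.compile(r"\A---\n.*?\n---\n", re.DOTALL),  # YAML front matter at file start
    re.compile(r"^```[^\n]*$", re.MULTILINE),     # fence lines only, not their content
    re.compile(r"`[^`\n]+`"),                     # inline code spans
]


def protect(text: str) -> Tuple[str, List[str]]:
    """Swap protected spans for numbered placeholders and return both parts."""
    saved: List[str] = []

    def stash(match: "re.Match[str]") -> str:
        saved.append(match.group())
        return PLACEHOLDER.format(idx=len(saved) - 1)

    for pattern in PROTECT_PATTERNS:
        text = pattern.sub(stash, text)
    return text, saved
```

Only the fence lines are stashed, so a Chinese comment inside a code block stays visible to the translator while the backticks and language identifier do not.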
### 3. Verification & Restoration
- **Longest-First Replacement**: Translated segments are restored starting from the longest strings to prevent partial match overwrites.
- **Fuzzy Restoration**: The restoration logic is space-lenient and case-insensitive to handle cases where online translation engines mangle the placeholder tokens.
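
Both restoration ideas can be illustrated with a short sketch. The `FUZZY_PH` pattern and the two helper functions are assumptions written for this example; they show one way to tolerate added spaces and changed letter case, not necessarily the script's own matching rules.

```python
import re
from typing import Dict, List

# Accepts variants such as "___ my_protect_ph_ 3 ___" produced by a translation engine.
FUZZY_PH = re.compile(
    r"_{2,3}\s*MY\s*_?\s*PROTECT\s*_?\s*PH\s*_?\s*(\d+)\s*_{2,3}",
    re.IGNORECASE,
)


def restore_placeholders(text: str, saved: List[str]) -> str:
    """Put protected spans back, even if the token was slightly mangled."""
    def put_back(match: "re.Match[str]") -> str:
        idx = int(match.group(1))
        # Leave unknown indices untouched rather than raising.
        return saved[idx] if idx < len(saved) else match.group()

    return FUZZY_PH.sub(put_back, text)


def apply_translations(text: str, translations: Dict[str, str]) -> str:
    """Replace source segments longest first so short ones cannot split long ones."""
    for source in sorted(translations, key=len, reverse=True):
        text = text.replace(source, translations[source])
    return text
```

Without the longest-first ordering, replacing a short segment first would break up any longer segment that contains it and leave stray characters behind.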
---
## Markdown Preservation
The following elements are protected during translation:
| Element | Example | Protection Method |
|---|---|---|
| Front Matter | `---\ntitle: ...\n---` | Full Tokenization |
| Fenced Code | ` ```python ... ``` ` | Boundary Tokenization |
| Inline Code | `` `code` `` | Full Tokenization |
| Links / Images | `[text](url)` | URL Tokenization |
| HTML Tags | `<div class="...">` | Tag Tokenization |
| Symbols | `©`, `&#x...;` | Entity Tokenization |
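
For the link, HTML, and entity rows, a simplified sketch could look like this. The patterns and the `protect_inline` function are assumptions for illustration; in particular, this version tokenizes each HTML tag as a whole, attributes included, which may be coarser than what the script does.

```python
import re
from typing import List, Tuple

LINK = re.compile(r"(!?\[[^\]]*\]\()([^)\s]+)(\))")  # [text](url) and ![alt](url)
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")
ENTITY = re.compile(r"&(?:[A-Za-z]+|#\d+|#x[0-9A-Fa-f]+);")


def protect_inline(text: str) -> Tuple[str, List[str]]:
    """Tokenize URLs, HTML tags, and entities; keep link text translatable."""
    saved: List[str] = []

    def token(value: str) -> str:
        saved.append(value)
        return f"___MY_PROTECT_PH_{len(saved) - 1}___"

    # Only the URL (group 2) is replaced; the display text stays in place.
    text = LINK.sub(lambda m: m.group(1) + token(m.group(2)) + m.group(3), text)
    text = HTML_TAG.sub(lambda m: token(m.group()), text)
    text = ENTITY.sub(lambda m: token(m.group()), text)
    return text, saved
```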
---
## Microsoft Translator Setup
1. Go to [Azure Cognitive Services](https://portal.azure.com/)
2. Create a Translator resource (Free tier: 2M chars/month)
3. Copy your API key and region
4. Add them to `~/.chinese_file_translator/config.json`:
```json
{
"microsoft_api_key": "abc123...",
"microsoft_region": "eastus"
}
```
Then run:
```bash
python chinese_file_translator.py input.txt --backend microsoft
```
---
## Files Generated
| Path | Description |
|---|---|
| `~/.chinese_file_translator/config.json` | Persistent settings |
| `~/.chinese_file_translator/history.json` | Session history log |
| `~/.chinese_file_translator/app.log` | Application log file |
| `~/.chinese_file_translator/models/` | Offline model cache (if used) |
---
## Author
**algorembrant**
---
## License
MIT License. See [LICENSE](LICENSE) for details.