Peter Yang committed
Commit 5bdee4b · 1 parent: daf9263

Add LLM translation feasibility analysis and development workflow guide

DEVELOPMENT_WORKFLOW.md ADDED
# Development & Debugging Workflow
## Testing LLM Translation Locally Before HF Spaces Deployment

---

## Overview

**You don't need to connect your IDE to Hugging Face Spaces.** Instead, develop and test locally first, then deploy to HF Spaces. This is faster and more efficient.

---

## Recommended Workflow

### Phase 1: Local Development & Testing

#### 1.1 Set Up Local Environment

```bash
# Create virtual environment (if not already done)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install additional dependencies for LLM
pip install bitsandbytes accelerate
```

#### 1.2 Test Locally with Sample Code

Create a test script to verify LLM translation works:

```python
# test_llm_translation.py
import asyncio
from document_processing_agent import DocumentProcessingAgent

async def test_llm_translation():
    """Test LLM translation locally"""
    processor = DocumentProcessingAgent("http://localhost:8080")

    # Test Chinese text
    chinese_text = "今天我们要学习神的话语,让我们一起来祷告。"

    print("Testing LLM translation...")
    result = await processor._translate_text(chinese_text, 'zh', 'en')

    print(f"Chinese: {chinese_text}")
    print(f"English: {result}")

    return result

if __name__ == "__main__":
    asyncio.run(test_llm_translation())
```

#### 1.3 Debug in Your IDE

- **Cursor/VSCode**: Set breakpoints, inspect variables, step through code
- **Print statements**: Use `print()` for quick debugging
- **Logging**: Use Python's `logging` module for better debugging

```python
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# In your code
logger.debug(f"Translating text: {text[:50]}...")
logger.info(f"Model loaded on device: {device}")
logger.error(f"Translation failed: {error}")
```

---
## Phase 2: Simulate HF Spaces Environment Locally

### 2.1 Match HF Spaces Environment

HF Spaces uses:
- Python 3.10
- Standard Linux environment
- Limited resources (16GB RAM on free tier)

**Test with similar constraints**:

```python
# Check memory usage
import os
import psutil

def check_memory():
    process = psutil.Process(os.getpid())
    memory_mb = process.memory_info().rss / 1024 / 1024
    print(f"Memory usage: {memory_mb:.2f} MB")

    if memory_mb > 14000:  # Leave some headroom
        print("⚠️ Warning: High memory usage!")
```

### 2.2 Test with CPU (Simulate Free Tier)

```python
# Force CPU usage (like free tier)
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # Disable GPU

# Test translation on CPU
# This will be slow but matches free tier behavior
```

### 2.3 Test with GPU (If Available)

```python
# Use GPU if available (matches Pro tier)
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
```

---
## Phase 3: Deploy to HF Spaces

### 3.1 Push Code to Repository

```bash
# Commit changes
git add document_processing_agent.py requirements.txt
git commit -m "Add Qwen2.5 LLM translation support"
git push origin hf-gradio
```

### 3.2 Deploy to HF Spaces

```bash
# Push to HF Spaces
git push huggingface hf-gradio:main --force
```

### 3.3 Monitor Build & Logs

**HF Spaces provides**:
- **Build Logs**: See installation progress
- **Runtime Logs**: See application output
- **Error Messages**: See what went wrong

**Access Logs**:
1. Go to your Space: https://huggingface.co/spaces/NextDrought/worship
2. Click the "Logs" tab
3. View real-time output

---
## Debugging Strategies

### Strategy 1: Local First (Recommended)

**Advantages**:
- ✅ Fast iteration (no build time)
- ✅ Full IDE debugging support
- ✅ Can test multiple scenarios quickly
- ✅ No resource limits

**Workflow**:
```
1. Write code locally
2. Test with sample data
3. Debug in IDE
4. Fix issues
5. Repeat until working
6. Deploy to HF Spaces
```

### Strategy 2: Use HF Spaces Logs

**When to use**:
- Production issues
- Environment-specific problems
- Verifying deployment

**How to use**:
```python
# Add detailed logging
import logging
import sys

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout)  # Goes to HF Spaces logs
    ]
)

logger = logging.getLogger(__name__)

# Use throughout your code
logger.info("Loading translation model...")
logger.debug(f"Model name: {model_name}")
logger.error(f"Translation failed: {error}", exc_info=True)
```

### Strategy 3: Test Mode Flag

Add a test mode to your app:

```python
# app.py
import os
TEST_MODE = os.getenv("TEST_MODE", "false").lower() == "true"

if TEST_MODE:
    # Show detailed errors in UI
    demo = gr.Blocks(title="Worship Program Generator (TEST MODE)")
    # ... add error display components
else:
    # Production mode - hide errors
    demo = gr.Blocks(title="Worship Program Generator")
```

---
## Common Debugging Scenarios

### Scenario 1: Model Loading Fails

**Local Debugging**:
```python
try:
    model = AutoModelForCausalLM.from_pretrained(model_name)
except Exception as e:
    print(f"Error loading model: {e}")
    import traceback
    traceback.print_exc()
    # Check: internet connection, model name, disk space
```

**HF Spaces Debugging**:
- Check build logs for download errors
- Check runtime logs for loading errors
- Verify the model name is correct

### Scenario 2: Out of Memory

**Local Debugging**:
```python
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

# Monitor memory
import psutil
process = psutil.Process()
print(f"Memory: {process.memory_info().rss / 1e9:.2f} GB")
```

**HF Spaces Debugging**:
- Check logs for OOM errors
- Use a smaller model or quantization
- Request a GPU tier (more memory)

### Scenario 3: Translation Quality Issues

**Local Debugging**:
```python
# Test with known good/bad examples (run inside an async function)
test_cases = [
    ("今天天气很好", "The weather is nice today"),
    ("我们要祷告", "We need to pray"),
    # ... more test cases
]

for chinese, expected in test_cases:
    result = await translate(chinese)
    print(f"Input: {chinese}")
    print(f"Expected: {expected}")
    print(f"Got: {result}")
    print(f"Match: {result.lower() == expected.lower()}")
    print("---")
```

---
## IDE Setup Recommendations

### Cursor/VSCode Configuration

**`.vscode/launch.json`** (for debugging; launch.json is parsed as JSONC, so the `//` comment is allowed):
```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": true,
            "env": {
                "TRANSLATION_METHOD": "llm",
                "CUDA_VISIBLE_DEVICES": "" // Force CPU for testing
            }
        },
        {
            "name": "Python: Test Translation",
            "type": "python",
            "request": "launch",
            "program": "${workspaceFolder}/test_llm_translation.py",
            "console": "integratedTerminal",
            "justMyCode": false
        }
    ]
}
```

**`.vscode/settings.json`**:
```json
{
    "python.defaultInterpreterPath": "${workspaceFolder}/venv/bin/python",
    "python.linting.enabled": true,
    "python.linting.pylintEnabled": false,
    "python.linting.flake8Enabled": true,
    "python.formatting.provider": "black"
}
```

---
## Quick Reference: Debugging Commands

### Local Testing

```bash
# Run test script
python test_llm_translation.py

# Run app locally
python app.py

# Check memory usage
python -c "import psutil; print(f'{psutil.virtual_memory().used / 1e9:.2f} GB used')"

# Test with specific environment variable
TRANSLATION_METHOD=llm python app.py
```

### HF Spaces Debugging

```bash
# View logs (via HF website)
# Go to: https://huggingface.co/spaces/NextDrought/worship/logs

# Check build status
# Go to: https://huggingface.co/spaces/NextDrought/worship

# View files (if needed)
# Go to: https://huggingface.co/spaces/NextDrought/worship/files
```

---
## Best Practices

### ✅ DO

1. **Develop locally first** - much faster iteration
2. **Use version control** - commit working code before deploying
3. **Add logging** - helps debug production issues
4. **Test with sample data** - verify before deploying
5. **Use environment variables** - easy to toggle features

### ❌ DON'T

1. **Don't develop directly on HF Spaces** - too slow
2. **Don't skip local testing** - it wastes build time
3. **Don't ignore error messages** - they tell you what's wrong
4. **Don't deploy untested code** - it breaks production

---
## Troubleshooting Guide

### Issue: Model won't load locally

**Solutions**:
- Check internet connection (needs to download the model)
- Verify the model name is correct
- Check disk space (models are large)
- Try a smaller model first

### Issue: Out of memory locally

**Solutions**:
- Use quantization (4-bit)
- Use a smaller model (0.5B instead of 1.5B)
- Close other applications
- Use CPU instead of GPU

### Issue: Works locally but fails on HF Spaces

**Solutions**:
- Check HF Spaces logs for the specific error
- Verify all dependencies are in requirements.txt
- Check memory limits (use quantization)
- Verify the model name is accessible on HF Hub

### Issue: Slow performance on HF Spaces

**Solutions**:
- Request a GPU tier (free tier available)
- Use quantization to reduce memory
- Implement batch processing
- Cache translations
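
Caching is the cheapest of these wins: worship documents repeat many short lines (greetings, section headings, liturgy), so memoizing on the source text avoids re-running the model. A minimal sketch; the wrapped `translate_fn` is a hypothetical stand-in for whichever translation backend the app uses:

```python
import functools

def make_cached_translator(translate_fn, maxsize=1024):
    """Wrap a text -> text translation function with an in-memory LRU cache.

    Repeated inputs (common liturgical phrases) hit the cache instead of
    the model. The cache lives only for the process lifetime.
    """
    @functools.lru_cache(maxsize=maxsize)
    def translate_cached(text: str) -> str:
        return translate_fn(text)

    return translate_cached
```

Note that `functools.lru_cache` only wraps synchronous callables; for the async translation path, a plain dict keyed by the source text achieves the same effect.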

---

## Summary

**You don't need an IDE connection to HF Spaces.** Instead:

1. ✅ **Develop locally** - use Cursor/VSCode with full debugging
2. ✅ **Test locally** - verify everything works
3. ✅ **Deploy to HF Spaces** - push code via git
4. ✅ **Monitor logs** - use the HF Spaces web interface
5. ✅ **Iterate** - fix issues locally, redeploy

This workflow is:
- **Faster** - no build time during development
- **More efficient** - full IDE features
- **More reliable** - test before deploying
- **Standard practice** - how most developers work

---

**Next Steps**:
1. Set up the local test script
2. Implement Qwen2.5 translation locally
3. Test and debug in your IDE
4. Once working, deploy to HF Spaces
LLM_TRANSLATION_FEASIBILITY.md ADDED
# LLM Translation Feasibility Analysis
## Using Qwen/Kimi Models on Hugging Face Spaces

**Date**: 2025-11-12
**Purpose**: Analyze feasibility of replacing OPUS-MT with LLM-based translation (Qwen/Kimi) on HF Spaces

---

## Executive Summary

**Current State**: Using Helsinki-NLP OPUS-MT (small NMT model, ~500MB, CPU-friendly)
**Proposed**: Replace with LLM models (Qwen2.5 or Kimi) for better translation quality
**Verdict**: **FEASIBLE** with considerations - Qwen2.5 recommended; Kimi not available on HF

---

## 1. Current Translation Setup

### 1.1 OPUS-MT Implementation

```txt
Current model: Helsinki-NLP/opus-mt-zh-en
Model size:    ~500MB
Device:        CPU (auto-detects CUDA if available)
Speed:         ~1-2 seconds per paragraph on CPU
Memory:        ~500MB RAM
Quality:       Good for general text; struggles with:
               - Domain-specific terminology (religious texts)
               - Context-dependent translations
               - Long-form content with cross-paragraph context
```

### 1.2 Current Limitations

- **Quality Issues**:
  - Loses nuance in religious/formal language
  - No cross-paragraph context awareness
  - May mistranslate idioms and cultural references

- **Performance**:
  - Sequential processing (slow for large documents)
  - No batching capability

- **Context Loss**:
  - Each paragraph translated independently
  - No document-level understanding

---
## 2. LLM Options Analysis

### 2.1 Qwen2.5 Models (Recommended ✅)

#### Available Models on Hugging Face

| Model | Size | Parameters | Memory (CPU) | Memory (GPU) | Speed (CPU) | Speed (GPU) | Quality |
|-------|------|------------|--------------|--------------|-------------|-------------|---------|
| **Qwen2.5-0.5B-Instruct** | ~1GB | 0.5B | ~2GB | ~1GB | Slow | Fast | Good |
| **Qwen2.5-1.5B-Instruct** | ~3GB | 1.5B | ~4GB | ~2GB | Very slow | Fast | Better |
| **Qwen2.5-7B-Instruct** | ~14GB | 7B | ~16GB | ~8GB | Not feasible | Fast | Excellent |
| **Qwen2.5-14B-Instruct** | ~28GB | 14B | ~32GB | ~16GB | Not feasible | Fast | Excellent |

#### Recommended: Qwen2.5-1.5B-Instruct

**Why**:
- ✅ Small enough for CPU inference (though slow)
- ✅ Better quality than OPUS-MT
- ✅ Supports Chinese-English translation
- ✅ Available on the Hugging Face Hub
- ✅ Can use quantization (4-bit/8-bit) to reduce memory

**Hugging Face Model Card**: `Qwen/Qwen2.5-1.5B-Instruct`

#### Implementation Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

class LLMTranslator:
    def __init__(self, model_name="Qwen/Qwen2.5-1.5B-Instruct"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

        # Option 1: Full precision (requires GPU or lots of RAM)
        # self.model = AutoModelForCausalLM.from_pretrained(
        #     model_name,
        #     torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
        # )

        # Option 2: 4-bit quantization (note: bitsandbytes quantization
        # requires a CUDA GPU; on CPU, load in full precision instead)
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quantization_config if torch.cuda.is_available() else None,
            device_map="auto"
        )

    async def translate(self, chinese_text: str) -> str:
        prompt = f"""You are a professional translator specializing in religious and formal texts.
Translate the following Chinese text to English. Maintain the meaning, tone, and style.

Chinese text:
{chinese_text}

English translation:"""

        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=512,
                temperature=0.3,  # Lower temperature for more consistent translation
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )

        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Extract translation (remove prompt)
        translation = response.split("English translation:")[-1].strip()
        return translation
```

### 2.2 Kimi Models (Not Available ❌)

**Status**: Kimi is Moonshot AI's proprietary model and is **NOT available on the Hugging Face Hub**.

**Alternatives**:
- Use the Moonshot AI API (paid service)
- Use similar open-source models (Qwen, Llama, etc.)

**If using the Moonshot API**:
```python
import aiohttp

async def translate_with_kimi_api(text: str, api_key: str) -> str:
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.moonshot.cn/v1/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": "moonshot-v1-8k",
                "messages": [
                    {"role": "system", "content": "You are a professional translator."},
                    {"role": "user", "content": f"Translate to English: {text}"}
                ]
            }
        ) as response:
            result = await response.json()
            return result["choices"][0]["message"]["content"]
```

**Note**: Requires an API key and has usage costs.

---
## 3. Resource Requirements Comparison

### 3.1 Memory Requirements

| Model | CPU RAM | GPU VRAM | HF Spaces Compatible |
|-------|---------|----------|----------------------|
| **OPUS-MT** (current) | ~500MB | N/A | ✅ Yes (CPU) |
| **Qwen2.5-0.5B** | ~2GB | ~1GB | ✅ Yes (CPU slow, GPU fast) |
| **Qwen2.5-1.5B** | ~4GB | ~2GB | ⚠️ CPU very slow, GPU recommended |
| **Qwen2.5-7B** | ~16GB | ~8GB | ❌ CPU not feasible, GPU required |
| **Qwen2.5-1.5B (4-bit)** | ~2.5GB | ~1GB | ✅ Yes (CPU acceptable) |

### 3.2 Hugging Face Spaces Hardware Options

| Tier | CPU | RAM | GPU | Cost |
|------|-----|-----|-----|------|
| **Free (CPU)** | 2 vCPU | 16GB | None | Free |
| **Free (GPU T4)** | 2 vCPU | 16GB | T4 (16GB) | Free (limited hours) |
| **Pro (CPU)** | 4 vCPU | 32GB | None | $9/month |
| **Pro (GPU)** | 4 vCPU | 32GB | T4/A10G | $9/month |

**Recommendation**:
- **Free GPU tier**: Use Qwen2.5-1.5B with 4-bit quantization
- **CPU-only**: Use Qwen2.5-0.5B or stick with OPUS-MT

---

## 4. Performance Comparison

### 4.1 Speed Comparison (Estimated)

| Model | CPU (per paragraph) | GPU (per paragraph) | Batch Processing |
|-------|---------------------|---------------------|------------------|
| **OPUS-MT** | 1-2 seconds | 0.5 seconds | ❌ No |
| **Qwen2.5-0.5B** | 5-10 seconds | 1-2 seconds | ✅ Yes |
| **Qwen2.5-1.5B** | 15-30 seconds | 2-3 seconds | ✅ Yes |
| **Qwen2.5-1.5B (4-bit)** | 8-15 seconds | 1-2 seconds | ✅ Yes |

**Note**: LLMs can process multiple paragraphs in a batch, which is potentially faster overall.

### 4.2 Quality Comparison

| Aspect | OPUS-MT | Qwen2.5-1.5B | Qwen2.5-7B |
|--------|---------|--------------|------------|
| **General Translation** | Good | Better | Excellent |
| **Religious Terminology** | Fair | Good | Excellent |
| **Context Awareness** | None | Good | Excellent |
| **Idioms/Cultural** | Poor | Good | Excellent |
| **Formal Tone** | Fair | Good | Excellent |

---
## 5. Implementation Feasibility

### 5.1 Code Changes Required

**Minimal Changes Needed**:

1. **Update the `_get_translation_model()` method**:
```python
def _get_translation_model(self):
    """Lazy-load the LLM translation model"""
    if self._translation_model is None:
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

        model_name = "Qwen/Qwen2.5-1.5B-Instruct"

        # 4-bit quantization cuts memory roughly 4x (requires a CUDA GPU)
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16
        )

        self._translation_tokenizer = AutoTokenizer.from_pretrained(model_name)
        self._translation_model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quantization_config,
            device_map="auto"
        )
        self._translation_model.eval()

    return self._translation_model, self._translation_tokenizer, self.device
```

2. **Update the `_translate_text()` method**:
```python
async def _translate_text(self, text: str, source_lang: str = 'zh', target_lang: str = 'en') -> str | None:
    """Translate using the LLM"""
    if source_lang != 'zh' or target_lang != 'en':
        return None

    model, tokenizer, device = self._get_translation_model()

    prompt = f"""Translate the following Chinese text to English. Maintain meaning and tone.

Chinese: {text}
English:"""

    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.3,
            do_sample=True
        )

    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    translation = response.split("English:")[-1].strip()
    return translation if translation else None
```

3. **Update `requirements.txt`**:
```txt
# Add for quantization support
bitsandbytes  # For 4-bit quantization
accelerate    # For efficient model loading
```

### 5.2 Backward Compatibility

**Strategy**: Keep OPUS-MT as a fallback.

```python
import os

TRANSLATION_METHOD = os.getenv("TRANSLATION_METHOD", "llm")  # "llm" or "opus"

if TRANSLATION_METHOD == "llm":
    ...  # Use Qwen2.5
else:
    ...  # Use OPUS-MT (current implementation)
```

---
## 6. Cost Analysis

### 6.1 Hugging Face Spaces

| Option | Cost | Limitations |
|--------|------|-------------|
| **Free CPU** | $0 | Slow, limited hours |
| **Free GPU** | $0 | Limited GPU hours/month |
| **Pro** | $9/month | More GPU hours, better performance |

### 6.2 Model Download

- **First Load**: Downloads the model (~3GB for Qwen2.5-1.5B)
- **Subsequent Loads**: Uses the cache (fast)
- **Storage**: Model stored in the HF cache (not counted against Space storage)

### 6.3 API Alternatives (If Not Using Direct Model)

| Service | Cost | Quality |
|---------|------|---------|
| **OpenAI GPT-4** | $0.03/1K tokens | Excellent |
| **Moonshot Kimi** | ~$0.01/1K tokens | Excellent |
| **HF Inference API** | Free tier available | Good |
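
For the API route, a back-of-the-envelope cost check is easy. The numbers below are illustrative assumptions (token-per-character ratios vary by tokenizer and model), combined with the ~$0.01/1K-token price from the table:

```python
# Rough per-document cost estimate for API-based translation.
# Assumptions (illustrative only): ~1 token per Chinese character in,
# ~1.5 tokens per Chinese character of English out.
chars_per_sermon = 3000
tokens_in = chars_per_sermon * 1.0
tokens_out = chars_per_sermon * 1.5
price_per_1k_tokens = 0.01  # USD per 1K tokens

cost = (tokens_in + tokens_out) / 1000 * price_per_1k_tokens
print(f"~${cost:.3f} per sermon")  # ~$0.075 under these assumptions
```

Even with generous margins of error, per-document API costs stay in the cents range, so the API option is a budget fallback rather than a blocker.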

---

## 7. Recommended Implementation Plan

### Phase 1: Proof of Concept (Week 1)
1. Test Qwen2.5-0.5B on a local machine
2. Compare quality with OPUS-MT
3. Measure performance (speed, memory)

### Phase 2: Integration (Week 2)
1. Add the LLM translation option to the codebase
2. Implement the fallback mechanism (LLM → OPUS-MT)
3. Add an environment variable toggle
4. Test on HF Spaces (free GPU tier)

### Phase 3: Optimization (Week 3)
1. Implement batch processing
2. Add caching for repeated translations
3. Optimize prompts for better quality
4. Monitor performance and adjust

### Phase 4: Production (Week 4)
1. Deploy to HF Spaces Pro (if needed)
2. Monitor usage and costs
3. Gather user feedback
4. Iterate on improvements

---
## 8. Specific Recommendations

### 8.1 For Hugging Face Spaces Deployment

**Recommended Setup**:
```python
import torch

# Use Qwen2.5-1.5B with 4-bit quantization
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
USE_QUANTIZATION = True  # Reduces memory by roughly 4x
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
```

**Space Configuration** (in the README.md front matter):
```yaml
---
sdk: gradio
suggested_hardware: t4-small  # Request a GPU for better performance
---
```

### 8.2 Prompt Engineering

**Optimized Prompt for Religious Texts**:
```python
TRANSLATION_PROMPT = """You are a professional translator specializing in Christian religious texts and sermons.

Translate the following Chinese text to English. Requirements:
1. Maintain the religious terminology accurately
2. Preserve the formal and respectful tone
3. Keep the structure and formatting
4. Translate idioms and cultural references appropriately

Chinese text:
{text}

English translation:"""
```

### 8.3 Batch Processing

**Process Multiple Paragraphs Together**:
```python
from typing import List

async def translate_paragraphs_batch(self, paragraphs: List[str]) -> List[str]:
    """Translate multiple paragraphs in one LLM call"""
    combined_text = "\n\n".join(f"Paragraph {i+1}: {p}" for i, p in enumerate(paragraphs))

    prompt = f"""Translate the following Chinese paragraphs to English.
Maintain the paragraph structure.

{combined_text}

English translation (keep paragraph structure):"""

    # Single LLM call for all paragraphs
    translation = await self._translate_with_llm(prompt)

    # Split back into paragraphs
    return translation.split("\n\n")
```

**Benefits**:
- Faster (one call instead of N calls)
- Better context awareness
- More consistent terminology

---
## 9. Risks & Mitigations

### 9.1 Risks

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| **Memory OOM** | High | Medium | Use quantization, smaller model |
| **Slow Performance** | Medium | High (CPU) | Use GPU, batch processing |
| **Quality Issues** | Low | Low | Test prompts, fine-tune if needed |
| **Cost Overruns** | Low | Low | Free tier sufficient for testing |
| **Model Availability** | Low | Low | Multiple model options available |

### 9.2 Fallback Strategy

```python
try:
    # Try LLM translation
    translation = await self._translate_with_llm(text)
except Exception as e:
    print(f"LLM translation failed: {e}, falling back to OPUS-MT")
    # Fall back to OPUS-MT
    translation = await self._translate_with_opus(text)
```

---
## 10. Conclusion

### 10.1 Feasibility Verdict

**✅ FEASIBLE** - Using Qwen2.5 models directly on Hugging Face Spaces is feasible with:

1. **Recommended Model**: Qwen2.5-1.5B-Instruct with 4-bit quantization
2. **Hardware**: Free GPU tier (T4) or Pro tier for better performance
3. **Implementation**: Moderate complexity (~2-3 days of development)
4. **Cost**: Free (using the HF Spaces free GPU tier)

### 10.2 Key Advantages

- ✅ **Better Quality**: Significant improvement over OPUS-MT
- ✅ **Context Awareness**: Can understand cross-paragraph context
- ✅ **Domain Adaptation**: Better handling of religious terminology
- ✅ **Batch Processing**: Can translate multiple paragraphs together
- ✅ **Free**: No API costs when using direct model hosting

### 10.3 Next Steps

1. **Immediate**: Test Qwen2.5-0.5B locally to validate the approach
2. **Short-term**: Implement Qwen2.5-1.5B with quantization
3. **Long-term**: Consider fine-tuning on a religious text corpus

### 10.4 Alternative: Hybrid Approach

**Best of Both Worlds**:
- Use the LLM for main content translation (better quality)
- Use OPUS-MT for quick translations (prayer points, announcements)
- Balance quality vs. speed
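
The hybrid split reduces to a small router: short boilerplate items go to the fast OPUS-MT path, longer passages to the LLM. A sketch with hypothetical backend names (`translate_llm`, `translate_opus`) and an assumed length threshold:

```python
SHORT_ITEM_CHARS = 40  # assumed cutoff; tune against real documents

def route_translation(text: str, translate_llm, translate_opus) -> str:
    """Send short items (prayer points, announcements) to OPUS-MT and
    longer passages to the LLM, trading a little quality for speed."""
    if len(text.strip()) <= SHORT_ITEM_CHARS:
        return translate_opus(text)
    return translate_llm(text)
```

The same routing rule could also key on document section type instead of length, if the parser already labels announcements and prayer points.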

---

## Appendix A: Code Implementation Template

See `document_processing_agent.py` for the current implementation.
The new LLM-based implementation can be added as an alternative method.

## Appendix B: Model Comparison Table

| Feature | OPUS-MT | Qwen2.5-0.5B | Qwen2.5-1.5B | Qwen2.5-7B |
|---------|---------|--------------|--------------|------------|
| **Quality** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Speed (CPU)** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐ |
| **Speed (GPU)** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| **Memory** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| **Context** | ⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

---

**Document Version**: 1.0
**Last Updated**: 2025-11-12
**Status**: Ready for Implementation