Commit 02fb1fa by Peter Yang (1 parent: 1c3d743)
Add documentation for translation truncation fixes
TRANSLATION_FIX_SUMMARY.md (added, +154 -0)
# Translation Fix Summary: Incomplete Qwen2.5 Translations

**Date**: 2025-11-12
**Issue**: Qwen2.5 translations were incomplete: only the first sentence of a long paragraph was translated

---

## Problem

A user reported that Qwen2.5 translated only the first sentence of long paragraphs. For example:

**Input (Chinese)**:
```
作为光明的子女,顺从圣灵的引领,自然就会结出圣灵的果子。相反,如果生活中仍充满嫉妒,恶毒,虚伪,诡诈,就显明还活在黑暗里,并不是光明的子女。约翰一书 1:6我们若说是与神相交,却仍在黑暗里行,就是说谎话,不行真理了。这就提醒我们,要常常反思,我是否结出了光明的果子?是否行出了良善,公义,诚实呢?如果结出了光明的果子,感谢赞美主!如果所行的是黑暗,就要及时悔改。求主赐下智慧,加添力量,帮助我们活出光明的子女该有的生活,结出光明的果子。
```

**Output (English)** - first sentence only:
```
As children of light, we naturally follow the guidance of the Holy Spirit and will therefore bear fruit from the Spirit.
```

---

## Root Causes

1. **Input Truncation**: `max_length=512` tokens was too small for long paragraphs, so longer prompts were truncated
2. **Output Limitation**: `max_new_tokens=200` was insufficient for complete translations
3. **Aggressive Cleanup**: Cleanup logic cut each translation off at the first sentence ending

---

## Fixes Applied

### 1. Increased Input Token Limit ✅

**File**: `document_processing_agent.py::_translate_text_qwen()`

**Change**: Increased `max_length` from 512 to 1024 tokens

```python
# Excerpt: tokenizer, prompt, and device are defined outside this snippet
max_input_length = 1024  # Increased from 512 to handle longer paragraphs
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_input_length).to(device)
```
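Because `truncation=True` silently drops any tokens beyond `max_length`, a very long prompt can still lose its tail at 1024 tokens. The following is a minimal sketch of a guard that reports when this would happen, assuming the caller supplies a token-counting function. `truncation_report` and the stand-in counter are hypothetical and not part of `document_processing_agent.py`.

```python
from typing import Callable


def truncation_report(prompt: str,
                      count_tokens: Callable[[str], int],
                      max_input_length: int = 1024) -> dict:
    """Report how many tokens truncation would drop (hypothetical helper)."""
    total = count_tokens(prompt)
    dropped = max(0, total - max_input_length)
    return {"total_tokens": total, "dropped_tokens": dropped, "truncated": dropped > 0}


# Stand-in counter for illustration only: one token per character.
# With a Hugging Face tokenizer, a counter such as
# `lambda s: len(tokenizer(s)["input_ids"])` could be passed instead.
report = truncation_report("x" * 1500, count_tokens=len)
print(report)  # {'total_tokens': 1500, 'dropped_tokens': 476, 'truncated': True}
```

Logging such a report before generation would make it visible when a paragraph exceeds the limit, rather than producing a shortened translation without warning.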

### 2. Dynamic Output Token Limit ✅

**Change**: Made `max_new_tokens` dynamic based on input length

```python
# Approximate the input token count by character count (rough estimate)
input_tokens = len(text)

# For very long paragraphs, we need to increase max_new_tokens proportionally
# Estimate: Chinese to English translation is roughly 1:1.5 ratio
estimated_output_tokens = int(input_tokens * 1.5)
max_new_tokens = min(max(estimated_output_tokens + 100, 300), 800)  # At least 300, up to 800 tokens
```

**Benefits**:
- Short paragraphs: Minimum 300 tokens (sufficient for 1-2 sentences)
- Long paragraphs: Up to 800 tokens (sufficient for 4-5 sentences)
- Reduces the risk of truncated output while capping generation length
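To make the clamping concrete, the same formula can be wrapped in a function and evaluated for a few input lengths. This is a sketch only: the function name `estimate_max_new_tokens` and its keyword parameters are hypothetical, while the constants are the ones shown above.

```python
def estimate_max_new_tokens(text: str,
                            ratio: float = 1.5,
                            margin: int = 100,
                            floor: int = 300,
                            ceiling: int = 800) -> int:
    """Clamp a length-based output budget between floor and ceiling."""
    input_tokens = len(text)  # character count as a rough token estimate
    estimated_output_tokens = int(input_tokens * ratio)
    return min(max(estimated_output_tokens + margin, floor), ceiling)


for n_chars in (20, 133, 134, 200, 467, 1000):
    print(n_chars, estimate_max_new_tokens("字" * n_chars))
# 20 300
# 133 300
# 134 301
# 200 400
# 467 800
# 1000 800
```

With these constants the budget only varies for inputs of roughly 134 to 466 characters. Shorter inputs receive the 300-token floor, and inputs of 467 characters or more hit the 800-token cap, so very long paragraphs could still be cut short.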

### 3. Less Aggressive Cleanup ✅

**Change**: Modified cleanup logic to preserve full translations

**Before**: Cut off at first sentence ending
```python
first_period = translation.find('.')
if first_period > 0:
    translation = translation[:first_period + 1].strip()
```

**After**: Only clean up trailing explanatory text, preserve full translation
```python
# Don't aggressively cut off at first sentence - let the full translation through
# Only clean up if there's clearly extra explanatory text after the translation

# Remove trailing explanatory text that starts with common phrases
trailing_markers = [
    " If you", " Note:", " Here is", " The translation",
    " Translation:", " Chinese:", " English:"
]
for marker in trailing_markers:
    idx = translation.find(marker)
    if idx > len(translation) * 0.5:  # Only if marker is in second half (likely trailing)
        translation = translation[:idx].strip()

# Ensure translation ends with proper punctuation
if translation and translation[-1] not in ['.', '!', '?', '"', "'"]:
    # Try to find the last complete sentence (only if near the end)
    last_period = translation.rfind('.')
    if last_period > len(translation) * 0.8:  # Only if within last 20%
        translation = translation[:last_period + 1].strip()
```

**Benefits**:
- Preserves complete multi-sentence translations
- Only removes clearly trailing explanatory text
- Handles edge cases better (quotes, multiple sentences)
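The difference between the two cleanup strategies can be seen by running both on the same raw model output. The sketch below wraps the snippets above in two functions. The names `old_cleanup` and `clean_translation` and the sample string are hypothetical, chosen for illustration. One caveat follows directly from the logic: a marker such as ` If you` can also match genuine translated text in the second half of a paragraph, so the marker list may need tuning.

```python
TRAILING_MARKERS = [
    " If you", " Note:", " Here is", " The translation",
    " Translation:", " Chinese:", " English:",
]


def old_cleanup(translation: str) -> str:
    """Previous behaviour: keep only the text up to the first period."""
    first_period = translation.find(".")
    if first_period > 0:
        translation = translation[:first_period + 1].strip()
    return translation


def clean_translation(translation: str) -> str:
    """Revised behaviour, wrapped in a function for illustration."""
    for marker in TRAILING_MARKERS:
        idx = translation.find(marker)
        if idx > len(translation) * 0.5:  # marker starts in the second half
            translation = translation[:idx].strip()
    if translation and translation[-1] not in (".", "!", "?", '"', "'"):
        last_period = translation.rfind(".")
        if last_period > len(translation) * 0.8:  # period within the last 20%
            translation = translation[:last_period + 1].strip()
    return translation


raw = ("As children of light, we bear fruit. "
       "If we walk in darkness, we lie. Note: this is a translation.")
print(old_cleanup(raw))
# As children of light, we bear fruit.
print(clean_translation(raw))
# As children of light, we bear fruit. If we walk in darkness, we lie.
```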

---

## Testing

A test script, `test_full_translation.py`, was created to verify the fixes:

```bash
python3 test_full_translation.py
```

This script:
1. Translates the DOCX file (`2025-09-28-MQD-RCCA-sript-for-translator.docx`)
2. Generates a worship program with PDF (`RCCA-worship-bulletin-2025-11-09.pdf`)
3. Creates markdown and DOCX output files
4. Verifies that translations are complete
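This summary does not show how step 4 checks completeness. One possible heuristic, offered as an assumption rather than a description of `test_full_translation.py`, is to compare the number of sentence endings in the source and in the translation. The names `count_sentences` and `looks_complete` and the `min_ratio` threshold are hypothetical. The check is deliberately coarse, since translations can legitimately merge or split sentences.

```python
import re

# Runs of sentence-ending punctuation, full-width and ASCII.
_TERMINATORS = re.compile(r"[。!?.!?]+")


def count_sentences(text: str) -> int:
    """Count runs of sentence-ending punctuation as a proxy for sentences."""
    return len(_TERMINATORS.findall(text))


def looks_complete(source: str, translation: str, min_ratio: float = 0.5) -> bool:
    """Flag translations with far fewer sentence endings than the source."""
    source_count = count_sentences(source)
    if source_count == 0:
        # No punctuation to compare against: accept any non-empty translation.
        return bool(translation.strip())
    return count_sentences(translation) / source_count >= min_ratio


source = "你好。你好吗?我很好!"
print(looks_complete(source, "Hello. How are you? I am fine!"))  # True
print(looks_complete(source, "Hello."))                          # False
```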

---

## Expected Results

After the fix, the same paragraph should translate to:

```
As children of light, following the guidance of the Holy Spirit, we will naturally bear the fruit of the Spirit. On the contrary, if our lives are still filled with jealousy, malice, hypocrisy, and deceit, it shows that we are still living in darkness and are not children of light. 1 John 1:6 says, "If we claim to have fellowship with him and yet walk in the darkness, we lie and do not live out the truth." This reminds us to constantly reflect: Have I borne the fruit of light? Have I acted with goodness, righteousness, and honesty? If we have borne the fruit of light, praise the Lord! If what we do is darkness, we must repent in time. We ask the Lord to grant wisdom and add strength to help us live the life that children of light should have and bear the fruit of light.
```

**Key Improvements**:
- ✅ Complete translation of all sentences
- ✅ Includes scripture reference (1 John 1:6)
- ✅ Includes all questions and statements
- ✅ Proper punctuation and formatting

---

## Files Modified

- `document_processing_agent.py` - Fixed `_translate_text_qwen()` method
- `test_full_translation.py` - Created test script

---

## Next Steps

1. **Test locally**: Run `test_full_translation.py` to verify fixes
2. **Check bilingual file**: Verify `*_bilingual.txt` contains full translations
3. **Verify worship program**: Check that markdown and DOCX files have complete message sections
4. **Deploy**: Push changes to Hugging Face Spaces for testing

---

**Status**: ✅ **Fixes applied - ready for testing**