Peter Yang committed
Commit 02fb1fa · 1 Parent(s): 1c3d743

Add documentation for translation truncation fixes

Files changed (1)
  1. TRANSLATION_FIX_SUMMARY.md +154 -0
TRANSLATION_FIX_SUMMARY.md ADDED
@@ -0,0 +1,154 @@
+ # Translation Fix Summary: Incomplete Qwen2.5 Translations
+
+ **Date**: 2025-11-12
+ **Issue**: Qwen2.5 translations were incomplete: only the first sentence of each long paragraph was translated
+
+ ---
+
+ ## Problem
+
+ The user reported that Qwen2.5 translated only the first sentence of long paragraphs. For example:
+
+ **Input (Chinese)**:
+ ```
+ 作为光明的子女,顺从圣灵的引领,自然就会结出圣灵的果子。相反,如果生活中仍充满嫉妒,恶毒,虚伪,诡诈,就显明还活在黑暗里,并不是光明的子女。约翰一书 1:6我们若说是与神相交,却仍在黑暗里行,就是说谎话,不行真理了。这就提醒我们,要常常反思,我是否结出了光明的果子?是否行出了良善,公义,诚实呢?如果结出了光明的果子,感谢赞美主!如果所行的是黑暗,就要及时悔改。求主赐下智慧,加添力量,帮助我们活出光明的子女该有的生活,结出光明的果子。
+ ```
+
+ **Output (English)**, first sentence only:
+ ```
+ As children of light, we naturally follow the guidance of the Holy Spirit and will therefore bear fruit from the Spirit.
+ ```
+
+ ---
+
+ ## Root Causes
+
+ 1. **Input Truncation**: `max_length=512` tokens was too small for long paragraphs, so the tokenizer truncated the prompt
+ 2. **Output Limitation**: `max_new_tokens=200` was too low for the model to generate a complete translation
+ 3. **Aggressive Cleanup**: The cleanup logic cut every translation at the first period
+
+ ---
+
+ ## Fixes Applied
+
+ ### 1. Increased Input Token Limit ✅
+
+ **File**: `document_processing_agent.py::_translate_text_qwen()`
+
+ **Change**: Increased `max_length` from 512 to 1024 tokens
+
+ ```python
+ max_input_length = 1024  # Increased from 512 to handle longer paragraphs
+ inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_input_length).to(device)
+ ```
+
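Because `truncation=True` drops overflow tokens silently, it can be useful to check whether a prompt still exceeds the new limit. The following is a minimal sketch, assuming a Hugging Face-style tokenizer whose call returns a mapping with an `input_ids` list. The names `exceeds_input_limit` and `char_tokenizer` are hypothetical and not part of `document_processing_agent.py`; the character-level stand-in is there only so the example runs without loading a model.

```python
def exceeds_input_limit(tokenizer, prompt, max_input_length=1024):
    """Return True if `prompt` tokenizes to more than `max_input_length` tokens."""
    # Tokenize without truncation so the true length is visible.
    token_ids = tokenizer(prompt)["input_ids"]
    return len(token_ids) > max_input_length


def char_tokenizer(text):
    """Hypothetical stand-in: one token per character (not Qwen's real tokenizer)."""
    return {"input_ids": [ord(ch) for ch in text]}


if __name__ == "__main__":
    short_prompt = "Translate to English: 你好"
    long_prompt = "光" * 2000
    print(exceeds_input_limit(char_tokenizer, short_prompt))
    print(exceeds_input_limit(char_tokenizer, long_prompt))
```

With the real tokenizer passed in place of the stand-in, a `True` result would indicate that part of the paragraph is still being cut before translation.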
+ ### 2. Dynamic Output Token Limit ✅
+
+ **Change**: Made `max_new_tokens` dynamic based on input length
+
+ ```python
+ # Approximate the input token count by character count (a rough estimate)
+ input_tokens = len(text)
+
+ # For very long paragraphs, we need to increase max_new_tokens proportionally
+ # Estimate: Chinese-to-English translation expands at roughly a 1:1.5 ratio
+ estimated_output_tokens = int(input_tokens * 1.5)
+ max_new_tokens = min(max(estimated_output_tokens + 100, 300), 800)  # At least 300, at most 800 tokens
+ ```
+
+ **Benefits**:
+ - Short paragraphs: Minimum 300 tokens (sufficient for 1-2 sentences)
+ - Long paragraphs: Up to 800 tokens (sufficient for 4-5 sentences)
+ - Prevents truncation while avoiding excessive generation
+
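The clamping formula can be wrapped in a small function to see which budget a given input receives. This is a sketch only: `compute_max_new_tokens` and its parameter names are hypothetical, while the constants (1.5, 100, 300, 800) come from the snippet above.

```python
def compute_max_new_tokens(text, ratio=1.5, margin=100, floor=300, ceiling=800):
    """Estimate a generation budget from the input length, clamped to [floor, ceiling]."""
    # len(text) counts characters; it is used as a rough proxy for the token count.
    estimated_output_tokens = int(len(text) * ratio)
    return min(max(estimated_output_tokens + margin, floor), ceiling)


if __name__ == "__main__":
    for n_chars in (50, 200, 1000):
        print(n_chars, compute_max_new_tokens("字" * n_chars))
```

With these constants, inputs of 133 characters or fewer receive the 300-token floor, and inputs of 467 characters or more hit the 800-token ceiling, so a very long paragraph can still be cut short if its translation needs more than 800 tokens.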
+ ### 3. Less Aggressive Cleanup ✅
+
+ **Change**: Modified cleanup logic to preserve full translations
+
+ **Before**: Cut off at the first period
+ ```python
+ first_period = translation.find('.')
+ if first_period > 0:
+     translation = translation[:first_period + 1].strip()
+ ```
+
+ **After**: Only clean up trailing explanatory text, preserve full translation
+ ```python
+ # Don't aggressively cut off at first sentence - let the full translation through
+ # Only clean up if there's clearly extra explanatory text after the translation
+
+ # Remove trailing explanatory text that starts with common phrases
+ trailing_markers = [
+     " If you", " Note:", " Here is", " The translation",
+     " Translation:", " Chinese:", " English:"
+ ]
+ for marker in trailing_markers:
+     idx = translation.find(marker)
+     if idx > len(translation) * 0.5:  # Only if marker is in second half (likely trailing)
+         translation = translation[:idx].strip()
+
+ # Ensure translation ends with proper punctuation
+ if translation and translation[-1] not in ['.', '!', '?', '"', "'"]:
+     # Try to find the last complete sentence (only if near the end)
+     last_period = translation.rfind('.')
+     if last_period > len(translation) * 0.8:  # Only if within last 20%
+         translation = translation[:last_period + 1].strip()
+ ```
+
+ **Benefits**:
+ - Preserves complete multi-sentence translations
+ - Only removes clearly trailing explanatory text
+ - Handles edge cases better (quotes, multiple sentences)
+
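To try the new cleanup rules in isolation, the logic can be wrapped in a standalone function. This is a sketch rather than the actual method body: `clean_translation` and the module-level constant names are hypothetical, and the sample strings are made up for illustration.

```python
TRAILING_MARKERS = (
    " If you", " Note:", " Here is", " The translation",
    " Translation:", " Chinese:", " English:",
)
SENTENCE_ENDINGS = (".", "!", "?", '"', "'")


def clean_translation(translation):
    """Strip trailing explanatory text and a dangling fragment, keeping full sentences."""
    translation = translation.strip()
    for marker in TRAILING_MARKERS:
        idx = translation.find(marker)
        # find() returns -1 when the marker is absent, which never passes this check.
        if idx > len(translation) * 0.5:
            translation = translation[:idx].strip()
    if translation and translation[-1] not in SENTENCE_ENDINGS:
        last_period = translation.rfind(".")
        # Trim only when the unfinished fragment sits in the last 20% of the text.
        if last_period > len(translation) * 0.8:
            translation = translation[:last_period + 1].strip()
    return translation


if __name__ == "__main__":
    with_note = ("The light shines in the darkness. We must walk in the light. "
                 "Note: literal rendering.")
    print(clean_translation(with_note))
    with_fragment = "Walk in the light. Bear the fruit of the light. Repent in time. An"
    print(clean_translation(with_fragment))
```

One consequence of the marker list is that a genuine sentence beginning with "If you" in the second half of a translation would also be removed, so the list may need tuning against real output.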
+ ---
+
+ ## Testing
+
+ A test script `test_full_translation.py` has been created to verify the fixes:
+
+ ```bash
+ python3 test_full_translation.py
+ ```
+
+ This script:
+ 1. Translates the DOCX file (`2025-09-28-MQD-RCCA-sript-for-translator.docx`)
+ 2. Generates a worship program with PDF (`RCCA-worship-bulletin-2025-11-09.pdf`)
+ 3. Creates markdown and DOCX output files
+ 4. Verifies that translations are complete
+
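The summary does not show how `test_full_translation.py` decides that a translation is complete. One possible heuristic, sketched here purely as an assumption, compares the number of sentence endings in the Chinese source with the number in the English output. The function name `looks_complete` and the 0.6 threshold are hypothetical.

```python
import re

# Sentence terminators; the source text mixes full-width and ASCII punctuation.
ZH_SENTENCE_END = re.compile(r"[。！？!?]")
EN_SENTENCE_END = re.compile(r"[.!?]")


def looks_complete(source_zh, translation_en, min_ratio=0.6):
    """Heuristic: the translation should have a comparable number of sentences."""
    source_count = len(ZH_SENTENCE_END.findall(source_zh))
    if source_count == 0:
        # Nothing to compare against; accept any non-empty translation.
        return bool(translation_en.strip())
    translated_count = len(EN_SENTENCE_END.findall(translation_en))
    return translated_count / source_count >= min_ratio


if __name__ == "__main__":
    source = "你好。再见。谢谢！"
    print(looks_complete(source, "Hello."))
    print(looks_complete(source, "Hello. Goodbye. Thank you!"))
```

A check like this would flag the single-sentence output from the Problem section, but it only counts punctuation and says nothing about translation quality.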
+ ---
+
+ ## Expected Results
+
+ After the fix, the same paragraph should translate to:
+
+ ```
+ As children of light, following the guidance of the Holy Spirit, we will naturally bear the fruit of the Spirit. On the contrary, if our lives are still filled with jealousy, malice, hypocrisy, and deceit, it shows that we are still living in darkness and are not children of light. 1 John 1:6 says, "If we claim to have fellowship with him and yet walk in the darkness, we lie and do not live out the truth." This reminds us to constantly reflect: Have I borne the fruit of light? Have I acted with goodness, righteousness, and honesty? If we have borne the fruit of light, praise the Lord! If what we do is darkness, we must repent in time. We ask the Lord to grant wisdom and add strength to help us live the life that children of light should have and bear the fruit of light.
+ ```
+
+ **Key Improvements**:
+ - ✅ Complete translation of all sentences
+ - ✅ Includes scripture reference (1 John 1:6)
+ - ✅ Includes all questions and statements
+ - ✅ Proper punctuation and formatting
+
+ ---
+
+ ## Files Modified
+
+ - `document_processing_agent.py` - Fixed `_translate_text_qwen()` method
+ - `test_full_translation.py` - Created test script
+
+ ---
+
+ ## Next Steps
+
+ 1. **Test locally**: Run `test_full_translation.py` to verify fixes
+ 2. **Check bilingual file**: Verify `*_bilingual.txt` contains full translations
+ 3. **Verify worship program**: Check that markdown and DOCX files have complete message sections
+ 4. **Deploy**: Push changes to Hugging Face Spaces for testing
+
+ ---
+
+ **Status**: ✅ **Fixes applied - ready for testing**
+