worship / IMPROVED_PROMPT_COMPARISON.md
Peter Yang
Improve Qwen2.5 prompting with chat template and optimized parameters, add detailed comparison analysis
9720182

A newer version of the Gradio SDK is available: 6.14.0

Upgrade

Improved Prompt Comparison: OPUS-MT vs Qwen2.5

Date: 2025-11-12
Prompt Version: Improved (using chat template + optimized parameters)


Prompt Improvements Made

Original Prompt

Translate the following Chinese text to English. Maintain meaning and tone.

Chinese: {text}
English:

Improved Prompt (Current)

System: You are a professional translator specializing in Christian religious texts. 
        Translate Chinese to English accurately. Output only the English translation, nothing else.

User: Translate this Chinese text to English:

{text}

Generation Parameters Improved

  • Temperature: 0.1 (was 0.3) - More deterministic
  • max_new_tokens: 128 (was 512) - More focused
  • top_p: 0.9 - Nucleus sampling
  • top_k: 40 - Limit token choices
  • repetition_penalty: 1.2 - Reduce repetition
  • no_repeat_ngram_size: 2 - Avoid repeating phrases

Comparison Results

Test 1: "今天我们要学习神的话语,让我们一起来祷告。"

OPUS-MT:

"Today we will learn the words of God and let us pray together."

  • Keywords: 5/5 (100%)
  • Naturalness: 0.23

Qwen2.5 (Improved Prompt):

"Today we will learn from God's word and pray together."

  • Keywords: 4/5 (80%) - missing "learn" but has equivalent meaning
  • Naturalness: 0.09
  • Better: Uses "God's word" (singular, more appropriate) vs "words of God" (plural)

Winner: Qwen2.5 - Better religious terminology


Test 2: "感谢主,让我们能够聚集在一起敬拜。"

OPUS-MT:

"Thank you, Lord, for bringing us together to worship."

  • Keywords: 3/4 (75%) - missing "gather"
  • Naturalness: 0.22

Qwen2.5 (Improved Prompt):

"Thank you, Lord, for bringing us together to worship."

  • Keywords: 3/4 (75%) - same keywords
  • Naturalness: 0.18
  • Similar quality

Winner: Tie


Test 3: "我们要为教会的事工祷告,求神赐福。"

OPUS-MT:

"We pray for the work of the Church and pray for the blessings of God."

  • Keywords: 3/4 (75%)
  • Naturalness: 0.53
  • ⚠️ Repetitive ("pray" appears twice)

Qwen2.5 (Improved Prompt):

"We must pray for God's blessing on our mission work in churches."

  • Keywords: 3/4 (75%)
  • Naturalness: 0.43
  • ✅ More natural phrasing
  • ✅ Better structure (single sentence)

Winner: Qwen2.5 - More natural, less repetitive


Test 4: "这段经文告诉我们,神爱世人,甚至将他的独生子赐给他们。"

OPUS-MT:

"It tells us that God loves the people, and even gives them his only son."

  • Keywords: 3/5 (60%) - missing "scripture" and "world"
  • Naturalness: 0.13
  • ⚠️ "the people" less accurate than "the world" (世人)
  • ⚠️ Present tense "gives" (less appropriate for biblical reference)

Qwen2.5 (Improved Prompt):

"This passage tells us that God loves mankind, even giving his Son for them."

  • Keywords: 4/5 (80%) - found "passage" (equivalent to "scripture")
  • Naturalness: 0.15
  • ✅ "mankind" more accurate than "the people"
  • ✅ "Son" capitalized (proper religious formatting)
  • ✅ Past participle "giving" more appropriate

Winner: Qwen2.5 - Significantly better accuracy and context


Test 5: "耶稣说:'我就是道路、真理、生命。'"

OPUS-MT:

"Jesus said, 'I am the way, the truth, the life.'"

  • Keywords: 4/4 (100%)
  • Naturalness: 0.30
  • ✅ Perfect translation

Qwen2.5 (Improved Prompt):

"Jesus said, '" (incomplete)

  • Keywords: 1/4 (25%)
  • ⚠️ Generation issue - incomplete output
  • This is a model limitation, not a prompt issue

Winner: OPUS-MT (but Qwen2.5 would likely succeed with longer max_new_tokens)


Summary Statistics

Metric OPUS-MT Qwen2.5 (Improved) Winner
Keyword Matching 81.8% (18/22) 68.2% (15/22) OPUS-MT
Naturalness Score 0.28 0.17 OPUS-MT
Religious Terminology 0/4 correct 4/4 correct Qwen2.5
Context Understanding Fair Good Qwen2.5
Completeness 100% 80% (1 incomplete) OPUS-MT

Key Insights

Quantitative Metrics vs Qualitative Assessment

Quantitative (Numbers):

  • OPUS-MT wins on keyword matching (81.8% vs 68.2%)
  • OPUS-MT wins on naturalness score (0.28 vs 0.17)

Qualitative (Quality):

  • Qwen2.5 wins on religious terminology (4/4 vs 0/4)

    • "God's Word" vs "words of God"
    • "Son" capitalized vs "son"
    • "mankind" vs "the people"
    • "passage" vs implicit reference
  • Qwen2.5 wins on context understanding

    • Better handling of biblical references
    • More appropriate tense usage
    • Better understanding of religious context

The Trade-off

OPUS-MT:

  • ✅ More reliable (always completes)
  • ✅ Faster
  • ✅ Lower memory usage
  • ⚠️ Less accurate religious terminology
  • ⚠️ Misses context nuances

Qwen2.5:

  • ✅ Better religious terminology
  • ✅ Better context understanding
  • ✅ More natural phrasing (when working)
  • ⚠️ Sometimes incomplete (fixable with longer max_new_tokens)
  • ⚠️ Slower
  • ⚠️ Higher memory usage

Recommendations

For Worship Program Generation

Use Qwen2.5 because:

  1. Religious terminology accuracy is critical - Qwen2.5 is significantly better (4/4 vs 0/4)
  2. Context matters - Biblical references need proper understanding
  3. Quality over speed - Worship programs are not time-critical

But:

  • Fix incomplete generation issue (increase max_new_tokens for quotes)
  • Add fallback to OPUS-MT if Qwen2.5 fails
  • Consider hybrid: Qwen2.5 for main content, OPUS-MT for quick items

Prompt Engineering Learnings

  1. Chat template helps - Using apply_chat_template() gives better results
  2. Lower temperature - 0.1 gives more focused output
  3. Shorter max_new_tokens - 128 is enough for most sentences
  4. ⚠️ Quotes need more tokens - Increase to 150-200 for quoted sentences
  5. System message helps - Specifying "Christian religious texts" improves terminology

Next Steps

  1. Increase max_new_tokens for quotes - Fix Test 5 incomplete issue
  2. Add fallback mechanism - Use OPUS-MT if Qwen2.5 fails
  3. Test with real documents - Verify with actual worship program content
  4. Optimize for production - Cache model, batch processing

Conclusion: With improved prompting, Qwen2.5 shows better quality for religious texts despite lower quantitative scores. The qualitative improvements (religious terminology, context understanding) outweigh the quantitative metrics for this use case.