Japanese text output quality degradation compared to base model (Qwen3-VL-30B-A3B-Instruct)

#4
by miminashi - opened

Summary

I've identified significant Japanese text quality issues in bu-30b-a3b-preview when used with browser-use's system prompt and DOM state context. Through controlled experiments, I confirmed that the fine-tuned model shows severe degradation in Japanese output quality compared to the base model (Qwen3-VL-30B-A3B-Instruct) under browser-use operating conditions.

Environment

  • Inference: llama.cpp (llama-server)
  • Quantization: Q8_0 (via bartowski/browser-use_bu-30b-a3b-preview-GGUF)
  • Hardware: AMD Radeon Instinct MI25 x4 / NVIDIA Tesla P100 x4
  • Context size: 24576-32768 tokens
  • Version: Latest as of 2025-12-25

Experiment Design

I conducted three phases of testing to isolate the cause:

Phase 1: Direct API Requests (Baseline)

Simple Japanese text repetition tasks without any system prompt:

  • Task: Output "しぐれうい" (Shigure Ui) 5 times
  • Task: Output "文鳥と暮らしています" (I live with a Java sparrow) 5 times

Results: Both the base model and bu-30b achieved 100% accuracy
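The Phase 1 baseline can be reproduced against llama-server's OpenAI-compatible chat endpoint. A minimal Python sketch (the URL, model name, and sampling values below are assumptions consistent with the appendix, not an excerpt from my actual harness):

```python
import json
import urllib.request

# Hypothetical endpoint; llama-server listens on port 8080 by default.
API_URL = "http://localhost:8080/v1/chat/completions"

def build_request(task: str) -> dict:
    """Build a bare chat-completions payload with no system prompt (Phase 1)."""
    return {
        "model": "bu-30b-a3b-preview",
        "messages": [{"role": "user", "content": task}],
        "temperature": 0.7,
        "max_tokens": 512,
    }

def send(payload: dict) -> str:
    """POST the payload and return the assistant message text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

payload = build_request('Output "しぐれうい" 5 times.')
print(payload["messages"][0]["role"])  # user
```

Calling `send(payload)` against a running server returns the raw completion for exact-match checking.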

Phase 2: Real-world Browser-Use Task

Web browsing task: Search "しぐれうい" on Google, visit 3 sites, extract summaries.
15 test runs with bu-30b-a3b-preview.

Results: Significant Japanese text corruption observed

Phase 3: Controlled Condition Tests

Added browser-use-specific conditions incrementally:

  1. Exp1: Add browser-use system prompt only
  2. Exp2: Add long context (~3,000 tokens of English text)
  3. Exp3: Add DOM extraction text (Japanese Wikipedia-style content)
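The three conditions can be layered programmatically. A sketch of how the message stacks were composed (`SYSTEM_PROMPT`, `LONG_CONTEXT`, and `DOM_STATE` are abbreviated stand-ins, not the exact texts used in the experiments):

```python
# Illustrative stand-ins; the real browser-use system prompt and DOM dump
# are much longer than these placeholders.
SYSTEM_PROMPT = ("You are a browser-use agent. You automate browser tasks "
                 "by outputting structured JSON actions.")
LONG_CONTEXT = "The quick brown fox jumps over the lazy dog. " * 350  # filler standing in for ~3,000 tokens of English
DOM_STATE = "[1]<div>しぐれうい - Wikipedia</div>"  # stand-in for extracted DOM text
TASK = 'Output "しぐれうい" 5 times.'

def build_messages(exp: int) -> list[dict]:
    """Exp1: system prompt only; Exp2: adds long context; Exp3: adds DOM text."""
    parts = []
    if exp >= 2:
        parts.append(LONG_CONTEXT)
    if exp >= 3:
        parts.append(DOM_STATE)
    parts.append(TASK)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "\n\n".join(parts)},
    ]
```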

Results

Accuracy Comparison (5 runs × 5 repetitions = 25 samples per test)

Condition         Base (dakuten)  Base (kanji)  bu-30b (dakuten)  bu-30b (kanji)
Direct API        100%            100%          100%              100%
+ System Prompt   100%            92%           28%               28%
+ Long Context    92%             76%           0%                8%
+ DOM State       96%             88%           0%                0%

Summary Statistics

Model                             Average Accuracy (Exp1-3)  Range
Base (Qwen3-VL-30B-A3B-Instruct)  90.7%                      76-100%
bu-30b-a3b-preview                9.3%                       0-28%

Error Patterns Observed

1. Dakuten (Voiced Consonant Mark) Errors

Expected: しぐれうい (Shigure Ui)

Actual Output   Frequency  Error Type
しくれうい      73%        Missing dakuten (ぐ→く)
しだれうい      7%         Wrong character (ぐ→だ)
しつれうい      7%         Wrong character (ぐ→つ)
しっくれうい    7%         Extra っ + missing dakuten

2. Kanji Conversion Errors

Expected: 文鳥と暮らしています (I live with a Java sparrow)

Actual Output           Error
文鳥と流らしています    暮→流 (wrong kanji)
文鳥と浦らしています    暮→浦 (wrong kanji)
文鳥とくらしています    暮→く (kanji to hiragana)
文鳥とらましています    暮ら→らま (corruption)

3. Completely Garbled Output (from real browser-use runs)

  • "222万人の動画を近期的に選み可能なテーンデートの中"
  • "イラストレーターと倉画家であるテーシメドの16歳（仮）でお経推きを動いでいる"

These are completely meaningless in Japanese.

Analysis

Key Finding

The base model (Qwen3-VL-30B-A3B-Instruct) maintains 76-100% accuracy across all conditions, while bu-30b-a3b-preview drops to 0-28% when browser-use system prompt and DOM context are added.

This strongly suggests that the fine-tuning process degraded the model's Japanese language capabilities, particularly when operating in the structured output format required by browser-use.

Possible Causes

  1. Training data predominantly English: The fine-tuning dataset may have been mostly English browser automation examples
  2. JSON output format interference: Training to output structured JSON may have disrupted Japanese token generation
  3. Prompt sensitivity: The model may have become overly sensitive to specific prompt structures, causing instability in Japanese generation

Reproduction Steps

  1. Load bu-30b-a3b-preview with llama.cpp
  2. Use the browser-use system prompt:

     You are a browser-use agent. You automate browser tasks by outputting structured JSON actions.
     ...

  3. Add Japanese DOM content to the user message
  4. Request Japanese text output
  5. Compare with the base model (Qwen3-VL-30B-A3B-Instruct) under the same conditions

Requests

  1. Could you confirm the language distribution of the fine-tuning dataset? Was Japanese (or other non-English languages) included?

  2. Are there plans to improve multilingual support? Japanese matters here because browser automation is widely applied to Japanese-language websites.

  3. Any recommended workarounds? For example:

    • Using base model with custom prompts?
    • Specific inference parameters that might help?

Appendix: Test Methodology

  • Server: llama-server with OpenAI-compatible API
  • Temperature: 0.7
  • Max tokens: 512
  • Test runs: 5 runs per condition, 5 repetitions per run = 25 samples
  • Evaluation: Exact string match counting
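The exact-match evaluation can be expressed as a short scoring function. A sketch (the actual harness is not published with this report; the sample list below is illustrative):

```python
def exact_match_accuracy(outputs: list[str], expected: str) -> float:
    """Fraction of samples that reproduce the expected string exactly."""
    hits = sum(1 for out in outputs if out.strip() == expected)
    return hits / len(outputs)

# Example: 25 samples with one missing-dakuten error and one extra-っ error.
samples = ["しぐれうい"] * 23 + ["しくれうい", "しっくれうい"]
print(exact_match_accuracy(samples, "しぐれうい"))  # 0.92
```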

Thank you for developing this specialized browser automation model! I hope this feedback helps improve multilingual support in future versions.
