Japanese text output quality degradation compared to base model (Qwen3-VL-30B-A3B-Instruct)
Summary
I've identified significant Japanese text quality issues in bu-30b-a3b-preview when used with browser-use's system prompt and DOM state context. Through controlled experiments, I confirmed that the fine-tuned model shows severe degradation in Japanese output quality compared to the base model (Qwen3-VL-30B-A3B-Instruct) under browser-use operating conditions.
Environment
- Inference: llama.cpp (llama-server)
- Quantization: Q8_0 (via bartowski/browser-use_bu-30b-a3b-preview-GGUF)
- Hardware: AMD Radeon Instinct MI25 x4 / NVIDIA Tesla P100 x4
- Context size: 24576-32768 tokens
- Version: Latest as of 2025-12-25
Experiment Design
I conducted three phases of testing to isolate the cause:
Phase 1: Direct API Requests (Baseline)
Simple Japanese text repetition tasks without any system prompt:
- Task: Output "しぐれうい" (Shigure Ui) 5 times
- Task: Output "文鳥と暮らしています" (I live with a Java sparrow) 5 times
Results: Both the base model and bu-30b achieve 100% accuracy
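The Phase 1 baseline can be driven against llama-server's OpenAI-compatible endpoint roughly as follows. This is a sketch: the URL, model name, and exact task wording are my assumptions, not values taken from the report (sampling parameters follow the appendix).

```python
import json
import urllib.request

# Assumed local llama-server endpoint (adjust host/port as needed).
BASE_URL = "http://localhost:8080/v1/chat/completions"

def repetition_prompt(text: str, n: int = 5) -> str:
    """Build the Phase 1 task: ask the model to echo `text` n times."""
    return f"Output the following string exactly {n} times, one per line: {text}"

def run_phase1(text: str, n: int = 5) -> str:
    """Send the bare repetition task with no system prompt at all."""
    payload = {
        "model": "bu-30b-a3b-preview",  # or Qwen3-VL-30B-A3B-Instruct
        "messages": [{"role": "user", "content": repetition_prompt(text, n)}],
        "temperature": 0.7,
        "max_tokens": 512,
    }
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```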
Phase 2: Real-world Browser-Use Task
Web browsing task: Search "しぐれうい" on Google, visit 3 sites, extract summaries.
15 test runs with bu-30b-a3b-preview.
Results: Significant Japanese text corruption observed
Phase 3: Controlled Condition Tests
Added browser-use-specific conditions incrementally:
- Exp1: Add browser-use system prompt only
- Exp2: Add long context (~3,000 tokens of English text)
- Exp3: Add DOM extraction text (Japanese Wikipedia-style content)
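The incremental layering of conditions can be expressed as message construction, roughly like this. `SYSTEM_PROMPT`, `FILLER`, and `DOM_TEXT` are stand-in placeholders, not the actual strings used in the experiments:

```python
# Stand-ins for the real experimental inputs (assumptions, abridged).
SYSTEM_PROMPT = "You are a browser-use agent. ..."
FILLER = "lorem ipsum " * 750  # rough stand-in for ~3,000 tokens of English text
DOM_TEXT = "<div>...Japanese Wikipedia-style DOM extraction...</div>"

TASK = "Output the following string exactly 5 times: しぐれうい"

def build_messages(exp: int) -> list[dict]:
    """Exp1: system prompt only; Exp2: + long context; Exp3: + DOM state."""
    msgs = [{"role": "system", "content": SYSTEM_PROMPT}]
    user = TASK
    if exp >= 2:
        user = FILLER + "\n\n" + user
    if exp >= 3:
        user = DOM_TEXT + "\n\n" + user
    msgs.append({"role": "user", "content": user})
    return msgs
```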
Results
Accuracy Comparison (5 runs × 5 repetitions = 25 samples per test)
| Condition | Base Model (dakuten) | Base Model (kanji) | bu-30b (dakuten) | bu-30b (kanji) |
|---|---|---|---|---|
| Direct API | 100% | 100% | 100% | 100% |
| + System Prompt | 100% | 92% | 28% | 28% |
| + Long Context | 92% | 76% | 0% | 8% |
| + DOM State | 96% | 88% | 0% | 0% |
Summary Statistics
| Model | Average Accuracy (Exp1-3) | Range |
|---|---|---|
| Base (Qwen3-VL-30B-A3B-Instruct) | 90.7% | 76-100% |
| bu-30b-a3b-preview | 9.3% | 0-28% |
Error Patterns Observed
1. Dakuten (Voiced Consonant Mark) Errors
Expected: しぐれうい (Shigure Ui)
| Actual Output | Frequency | Error Type |
|---|---|---|
| しくれうい | 73% | Missing dakuten (ぐ→く) |
| しだれうい | 7% | Wrong character (ぐ→だ) |
| しつれうい | 7% | Wrong character (ぐ→つ) |
| しっくれうい | 7% | Extra っ + missing dakuten |
2. Kanji Conversion Errors
Expected: 文鳥と暮らしています (I live with a Java sparrow)
| Actual Output | Error |
|---|---|
| 文鳥と流らしています | 暮→流 (wrong kanji) |
| 文鳥と浦らしています | 暮→浦 (wrong kanji) |
| 文鳥とくらしています | 暮→く (kanji to hiragana) |
| 文鳥と◯ましています | 暮ら→◯ま (corruption) |
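When scoring outputs like these, dakuten-only errors can be separated from other corruption via Unicode NFD decomposition, which splits a voiced kana into its base character plus a combining sound mark. This is a sketch of one possible classifier; the report does not specify how errors were categorized:

```python
import unicodedata

def strip_dakuten(s: str) -> str:
    """Remove voiced/semi-voiced sound marks (U+3099/U+309A) after NFD."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if ch not in "\u3099\u309a")

def classify(output: str, expected: str) -> str:
    """Label an output as exact, dakuten-only error, or other corruption."""
    if output == expected:
        return "exact"
    if strip_dakuten(output) == strip_dakuten(expected):
        return "dakuten_error"
    return "other"
```

For example, しくれうい differs from しぐれうい only by a missing dakuten, while しつれうい substitutes an unrelated character.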
3. Completely Garbled Output (from real browser-use runs)
- "222ไธไบบใฎๅ็ปใ่ฟๆ็ใซ้ธใฟๅฏ่ฝใชใใผใณใใผใใฎไธญ"
- "ใคใฉในใใฌใผใฟใผใจๅ็ปๅฎถใงใใใใผใทใกใใฎ16ๆญณ๏ผไปฎ๏ผใงใ็ตๆจใใๅใใงใใ"
These are completely meaningless in Japanese.
Analysis
Key Finding
The base model (Qwen3-VL-30B-A3B-Instruct) maintains 76-100% accuracy across all conditions, while bu-30b-a3b-preview drops to 0-28% when browser-use system prompt and DOM context are added.
This strongly suggests that the fine-tuning process degraded the model's Japanese language capabilities, particularly when operating in the structured output format required by browser-use.
Possible Causes
- Training data predominantly English: The fine-tuning dataset may have been mostly English browser automation examples
- JSON output format interference: Training to output structured JSON may have disrupted Japanese token generation
- Prompt sensitivity: The model may have become overly sensitive to specific prompt structures, causing instability in Japanese generation
Reproduction Steps
- Load bu-30b-a3b-preview with llama.cpp
- Use the browser-use system prompt:
You are a browser-use agent. You automate browser tasks by outputting structured JSON actions.
...
- Add Japanese DOM content to the user message
- Request Japanese text output
- Compare with base model (Qwen3-VL-30B-A3B-Instruct) under same conditions
Requests
Could you confirm the language distribution of the fine-tuning dataset? Was Japanese (or other non-English languages) included?
Are there plans to improve multilingual support? Japanese output is essential for browser automation targeting Japanese websites.
Any recommended workarounds? For example:
- Using base model with custom prompts?
- Specific inference parameters that might help?
Appendix: Test Methodology
- Server: llama-server with OpenAI-compatible API
- Temperature: 0.7
- Max tokens: 512
- Test runs: 5 runs per condition, 5 repetitions per run = 25 samples
- Evaluation: Exact string match counting
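The exact-match counting described above can be sketched as follows (hypothetical helper names; the actual evaluation script is not included in this report):

```python
def score_run(lines: list[str], expected: str) -> int:
    """Count repetitions in one run that exactly match the expected string."""
    return sum(1 for line in lines if line.strip() == expected)

def accuracy(runs: list[list[str]], expected: str) -> float:
    """Pool all repetitions across runs (5 runs x 5 reps = 25 samples)."""
    total = sum(len(run) for run in runs)
    hits = sum(score_run(run, expected) for run in runs)
    return hits / total if total else 0.0
```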
Thank you for developing this specialized browser automation model! I hope this feedback helps improve multilingual support in future versions.