---
title: Mecari Morpheme Analyzer
emoji: 🧩
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.37.2
app_file: app.py
pinned: false
---
# Mecari (Japanese Morphological Analysis with Graph Neural Networks)

## Demo

You can try Mecari at https://huggingface.co/spaces/zbller/Mecari.
## Overview

Mecari is a GNN (Graph Neural Network)-based Japanese morphological analyzer following Google's published approach [1]. It supports training from partially annotated graphs (only `+`/`-` labels where available; `?` is ignored) and aims for fast training and inference.
## Graph

The graph is built from MeCab morpheme candidates.
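As an illustration, such a candidate lattice can be represented as a simple adjacency structure: each candidate morpheme becomes a node, and an edge connects two candidates when one ends exactly where the other begins. The candidate spans below are hard-coded for the example (in the actual pipeline they would come from MeCab):

```python
# Build a simple lattice graph: nodes are candidate morphemes
# (start, end, surface); edges connect candidates that are adjacent
# in the sentence (end offset of one == start offset of the next).
def build_lattice_graph(candidates):
    nodes = list(candidates)
    edges = [(i, j)
             for i, (s1, e1, _) in enumerate(nodes)
             for j, (s2, e2, _) in enumerate(nodes)
             if e1 == s2]
    return nodes, edges

# Hypothetical candidates for "外国人参政権" (character offsets, surface form)
cands = [(0, 2, "外国"), (2, 3, "人"), (0, 3, "外国人"),
         (3, 5, "参政"), (2, 4, "人参"), (5, 6, "権"), (3, 6, "参政権")]
nodes, edges = build_lattice_graph(cands)
print(edges)
```

Note how the wrong segmentation 人参 ("carrot") appears as a node with an incoming edge but no outgoing one here, so it cannot take part in a full path through this toy lattice.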
## Annotation

Annotations are created by matching morpheme candidates to gold labels and serve as the training targets (supervision) during learning.

- `+`: a candidate that exactly matches the gold.
- `-`: any other candidate that overlaps by at least one character with a `+` candidate.
- `?`: all other candidates (ignored during training).
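A minimal sketch of this labeling rule, with spans as `(start, end)` character offsets (`annotate` is a hypothetical helper for illustration, not part of the repository):

```python
def annotate(candidates, gold_spans):
    """Label candidate spans against gold spans:
    '+' exact match, '-' overlaps a '+' candidate, '?' otherwise."""
    gold = set(gold_spans)
    # '+' candidates: those exactly matching a gold span
    plus = [c for c in candidates if c in gold]
    labels = []
    for s, e in candidates:
        if (s, e) in plus:
            labels.append("+")
        elif any(s < pe and ps < e for ps, pe in plus):
            labels.append("-")  # overlaps a '+' candidate by >= 1 character
        else:
            labels.append("?")  # no evidence either way; ignored in training
    return labels

cands = [(0, 2), (2, 3), (0, 3), (3, 5)]
gold = [(0, 2), (2, 3)]          # partial annotation: gold known only up to offset 3
print(annotate(cands, gold))
```

Because `?` labels are skipped at training time, a sentence only partially covered by gold annotations still contributes useful `+`/`-` supervision for the covered region.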
## Training

Nodes are featurized with JUMAN++-style unigram features, edges are modeled as undirected (bidirectional), and a GATv2 [2] is trained on the resulting graphs.
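For intuition, GATv2 [2] scores each edge as `e(i, j) = aᵀ LeakyReLU(W [h_i ; h_j])` and normalizes the scores with a softmax over a node's neighbors. The repository presumably uses a library implementation such as PyTorch Geometric's `GATv2Conv`; the dependency-free toy sketch below only illustrates the scoring rule:

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0.0 else slope * x

def gatv2_attention(h, W, a, i, neighbors):
    """GATv2 attention for node i over its neighbors:
    e(i, j) = a^T LeakyReLU(W [h_i ; h_j]), softmax-normalized."""
    def score(j):
        concat = h[i] + h[j]  # [h_i ; h_j]
        z = [leaky_relu(sum(w * x for w, x in zip(row, concat))) for row in W]
        return sum(ar * zr for ar, zr in zip(a, z))
    raw = [score(j) for j in neighbors]
    m = max(raw)                       # subtract max for numerical stability
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 3 nodes with 2-dim features, arbitrary small weights
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.2, 0.1, 0.3],
     [-0.1, 0.4, 0.2, -0.3]]
a = [1.0, -1.0]
alpha = gatv2_attention(h, W, a, 0, [1, 2])
print(alpha)  # two attention weights summing to 1
```

Unlike the original GAT, the LeakyReLU is applied *inside* the dot product with `a`, which is what makes GATv2's attention dynamic rather than static.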
## Inference

Use the model's node scores and run Viterbi to search for the optimal non-overlapping path.
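Given per-candidate node scores from the model, the search reduces to a dynamic program over span end positions. A hypothetical, simplified sketch (the repository's implementation may differ):

```python
def viterbi_segment(n, candidates, scores):
    """Highest-scoring sequence of non-overlapping candidate spans
    covering characters 0..n. candidates: (start, end) pairs;
    scores: one model score per candidate."""
    NEG = float("-inf")
    best = [NEG] * (n + 1)   # best[p] = best total score covering 0..p
    back = [None] * (n + 1)  # index of candidate ending the best path at p
    best[0] = 0.0
    for end in range(1, n + 1):
        for idx, (s, e) in enumerate(candidates):
            if e == end and best[s] > NEG and best[s] + scores[idx] > best[end]:
                best[end] = best[s] + scores[idx]
                back[end] = idx
    # Trace back from the end of the sentence
    path, pos = [], n
    while pos > 0:
        idx = back[pos]
        path.append(idx)
        pos = candidates[idx][0]
    return list(reversed(path))

# Toy lattice over a 6-character sentence with made-up scores
cands = [(0, 2), (2, 3), (0, 3), (3, 5), (2, 4), (5, 6), (3, 6)]
scores = [2.0, 1.0, 2.5, 1.0, 0.5, 1.0, 2.5]
print(viterbi_segment(6, cands, scores))
```

The DP runs in O(positions × candidates) here; indexing candidates by their end position would make it linear in the lattice size.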
## Results (KWDLC test)

- Trained model (`sample_model`): Seg F1 0.9725, POS F1 0.9562
- MeCab (JUMANDIC) baseline: Seg F1 0.9677, POS F1 0.9465

The GATv2 model trained with this repository (current code and `configs/gatv2.yaml`) using the official KWDLC split outperforms MeCab on both segmentation and POS accuracy.
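For reference, segmentation F1 can be computed by treating each analysis as a set of `(start, end)` spans and counting exact boundary matches. This is a hypothetical sketch of the metric, not the repository's `evaluate.py`:

```python
def seg_f1(pred_spans, gold_spans):
    """Segmentation F1: a predicted span counts as correct only when
    both of its boundaries match a gold span exactly."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

pred = [(0, 2), (2, 3), (3, 6)]
gold = [(0, 2), (2, 3), (3, 5), (5, 6)]
print(seg_f1(pred, gold))  # 2 matches out of 3 predicted / 4 gold spans
```

POS F1 would be computed the same way over `(start, end, pos_tag)` triples, so a span with correct boundaries but a wrong tag counts against it.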
## Environment Setup

### Tested Environment

- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
- Python: 3.11.3
- PyTorch: 2.2.2+cu121
- CUDA (runtime): 12.1 (cu121)
- MeCab (binary): 0.996
- JUMANDIC: `/var/lib/mecab/dic/juman-utf8`
### MeCab Setup (Ubuntu 24.04)

Install the packages (this includes the JUMANDIC dictionary):

```bash
sudo apt update
sudo apt install -y mecab mecab-utils libmecab-dev mecab-jumandic-utf8
```

Verify the installation:

```bash
mecab -v  # e.g., mecab of 0.996
test -d /var/lib/mecab/dic/juman-utf8 && echo "JUMANDIC OK"
```
### Project Setup

```bash
# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a venv and install dependencies
uv venv
source .venv/bin/activate
uv sync
```
## Running on Hugging Face Spaces (CPU)

- Use Python 3.11 in the Space settings (or metadata).
- Add `requirements.txt` and `packages.txt` from this repo to the Space root: `requirements.txt` pins `numpy<2` and CPU wheels for PyTorch 2.2.x and PyG, and `packages.txt` installs MeCab and JUMANDIC via apt.
- Rebuild the Space and verify that `python -c "import numpy, torch; print(numpy.__version__, torch.__version__)"` shows NumPy 1.26.x and Torch 2.2.x.
## Quickstart (Morphological Analysis)

```bash
# Analyze a single sentence with the bundled sample model
python infer.py --text "東京都の外国人参政権"

# Interactive mode
python infer.py

# After training, specify an experiment to use a custom model
python infer.py --experiment gatv2_YYYYMMDD_HHMMSS --text "..."
```
Note: when no experiment is specified, the model at `sample_model/` is loaded by default.
## Train It Yourself

### KWDLC Setup (Required)

```bash
cd /path/to/Mecari
git clone --depth 1 https://github.com/ku-nlp/KWDLC
```

- Training requires KWDLC (non-KWDLC training is not supported at the moment).
- Splits strictly follow the official `dev.id`/`test.id` files.
### Preprocessing

```bash
python preprocess.py --config configs/gatv2.yaml
```
### Training

```bash
python train.py --config configs/gatv2.yaml
```

- Outputs are saved under `experiments/<name>/`.
- The bundled model was trained with the current codebase and configuration (`configs/gatv2.yaml`).
### Evaluation

```bash
python evaluate.py --max-samples 50 \
  --experiment gatv2_YYYYMMDD_HHMMSS
```
## License

CC BY-NC 4.0 (non-commercial use only)
## Acknowledgments

- [1] Gleb Mazovetskiy and Taku Kudo, "Data processing for Japanese text-to-pronunciation models", NLP2024 Workshop on Japanese Language Resources, pp. 19–23. URL: https://jedworkshop.github.io/JLR2024/materials/b-2.pdf
- [2] Shaked Brody, Uri Alon, and Eran Yahav, "How Attentive Are Graph Attention Networks?", 10th International Conference on Learning Representations (ICLR 2022), 2022.
## Disclaimer

- Independent academic implementation for educational and research purposes.
- Core concepts (graph-based morpheme boundary annotation) follow the published work; implementation details and code structure are our interpretation.
- Not affiliated with, endorsed by, or connected to Google or its subsidiaries.
## Purpose

- Academic research
- Education
- Technical skill development
- Understanding of NLP techniques