--- title: Mecari Morpheme Analyzer emoji: 🧩 colorFrom: indigo colorTo: blue sdk: gradio sdk_version: 4.37.2 app_file: app.py pinned: false --- # Mecari (Japanese Morphological Analysis with Graph Neural Networks) ## Demo You can try Mecari in https://huggingface.co/spaces/zbller/Mecari ## Overview Mecari [1] is Google's GNN(Graph Neural Network)‑based Japanese morphological analyzer. It supports training from partially annotated graphs (only '+'/'-' where available; '?' is ignored) and aims for fast training and inference.

Overview

### Graph The graph is built from MeCab morpheme candidates. ### Annotation Annotations are created by matching morpheme candidates to gold labels. Annotations serve as the training targets (supervision) during learning. - `+`: Candidate that exactly matches the gold. - `-`: Any other candidate that overlaps by 1+ character with a `+` candidate. - `?`: All other candidates (ignored during training). ### Training Nodes are featurized with JUMAN++‑style unigram features, edges are modeled as undirected (bidirectional), and a GATv2 [2] is trained on the resulting graphs. ### Inference Use the model’s node scores and run Viterbi to search the optimal non‑overlapping path. ### Results (KWDLC test) - Trained model (sample_model): Seg F1 0.9725, POS F1 0.9562 - MeCab (JUMANDIC) baseline: Seg F1 0.9677, POS F1 0.9465 The GATv2 model trained with this repository (current code and `configs/gatv2.yaml`) using the official KWDLC split outperforms MeCab on both segmentation and POS accuracy. ## Environmental Setup ### Tested Environment - OS: Ubuntu 24.04.3 LTS (Noble Numbat) - Python: 3.11.3 - PyTorch: 2.2.2+cu121 - CUDA (runtime): 12.1 (cu121) - MeCab (binary): 0.996 - JUMANDIC: `/var/lib/mecab/dic/juman-utf8` ### MeCab Setup (Ubuntu 24.04) 1) Install packages (includes the JUMANDIC dictionary) ```bash sudo apt update sudo apt install -y mecab mecab-utils libmecab-dev mecab-jumandic-utf8 ``` 2) Verify installation ```bash mecab -v # e.g., mecab of 0.996 test -d /var/lib/mecab/dic/juman-utf8 && echo "JUMANDIC OK" ``` ### Project Setup ```bash # Install uv if needed curl -LsSf https://astral.sh/uv/install.sh | sh # Create venv and install dependencies uv venv source .venv/bin/activate uv sync ``` ### Running on Hugging Face Spaces (CPU) - Use Python 3.11 in Space settings (or metadata). - Add `requirements.txt` and `packages.txt` from this repo to the Space root. - `requirements.txt` pins `numpy<2` and CPU wheels for PyTorch 2.2.x and PyG. - `packages.txt` installs MeCab and JUMANDIC via apt. - Rebuild the Space and verify: - `python -c "import numpy, torch; print(numpy.__version__, torch.__version__)"` shows NumPy 1.26.x and Torch 2.2.x. ## Quickstart (Morphological analysis) ```bash # Analyze a single sentence with the bundled sample model python infer.py --text "東京都の外国人参政権" # Interactive mode python infer.py # After training, specify an experiment to use a custom model python infer.py --experiment gatv2_YYYYMMDD_HHMMSS --text "..." ``` Note - When no experiment is specified, the model at `sample_model/` is loaded by default. ## Train by yourself ### KWDLC Setup (Required) ```bash cd /path/to/Mecari git clone --depth 1 https://github.com/ku-nlp/KWDLC ``` - Training requires KWDLC (non‑KWDLC training is not supported at the moment). - Splits strictly follow the official `dev.id` / `test.id` files. ### Preprocessing ```bash python preprocess.py --config configs/gatv2.yaml ``` ### Training ```bash python train.py --config configs/gatv2.yaml ``` - Outputs are saved under `experiments//`. - The bundled model was trained with the current codebase and configuration (`configs/gatv2.yaml`). ### Evaluation ```bash python evaluate.py --max-samples 50 \ --experiment gatv2_YYYYMMDD_HHMMSS ``` ## License CC BY‑NC 4.0 (non‑commercial use only) ## Acknowledgments - [1] “Data processing for Japanese text‑to‑pronunciation models”, Gleb Mazovetskiy, Taku Kudo, NLP2024 Workshop on Japanese Language Resources, URL: https://jedworkshop.github.io/JLR2024/materials/b-2.pdf (pp. 19–23) - [2] "HOW ATTENTIVE ARE GRAPH ATTENTION NETWORKS?.", Graph architecture: Brody, Shaked, Uri Alon, and Eran Yahav, 10th International Conference on Learning Representations, ICLR 2022. 2022. ## Disclaimer - Independent academic implementation for educational and research purposes. - Core concepts (graph‑based morpheme boundary annotation) follow the published work; implementation details and code structure are our interpretation. - Not affiliated with, endorsed by, or connected to Google or its subsidiaries. ## Purpose - Academic research - Education - Technical skill development - Understanding of NLP techniques