---
title: Mecari Morpheme Analyzer
emoji: 🧩
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.37.2
app_file: app.py
pinned: false
---
# Mecari (Japanese Morphological Analysis with Graph Neural Networks)
## Demo
You can try Mecari at https://huggingface.co/spaces/zbller/Mecari
## Overview
Mecari is a GNN (Graph Neural Network)‑based Japanese morphological analyzer that follows the approach published by Google researchers [1]. It supports training from partially annotated graphs (using `+`/`-` labels where available; `?` is ignored) and aims for fast training and inference.
<p align="center">
<img src="abst.png" alt="Overview" width="70%" />
</p>
### Graph
The graph is built from MeCab morpheme candidates.
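As a rough illustration of how such a lattice graph can be assembled, the sketch below connects candidate spans that are adjacent in the text. The `(start, end, surface, pos)` tuples and the `build_edges` helper are illustrative assumptions, not Mecari's actual data structures:

```python
# Sketch: build a lattice-style graph from morpheme candidates.
# Candidates are (start, end, surface, pos) character-offset spans;
# these names are illustrative, not the repository's actual API.

def build_edges(candidates):
    """Connect candidate i -> j when j starts exactly where i ends."""
    edges = []
    for i, (_, e1, *_rest1) in enumerate(candidates):
        for j, (s2, _, *_rest2) in enumerate(candidates):
            if e1 == s2:
                edges.append((i, j))
    return edges

# Candidate spans for the text "東京都"
cands = [
    (0, 2, "東京", "名詞"),
    (2, 3, "都", "名詞"),
    (0, 3, "東京都", "名詞"),
]
print(build_edges(cands))  # [(0, 1)]
```

In training (see below), each such edge is treated as undirected, so both directions are added to the graph.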
### Annotation
Annotations are created by matching morpheme candidates to gold labels.
Annotations serve as the training targets (supervision) during learning.
- `+`: Candidate that exactly matches the gold.
- `-`: Any other candidate that overlaps a `+` candidate by at least one character.
- `?`: All other candidates (ignored during training).
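The labeling scheme above can be sketched over `(start, end)` character-offset spans as follows. This is a minimal reconstruction of the rules, not the repository's preprocessing code:

```python
# Sketch: assign +/-/? labels to candidate spans against gold spans.
# Spans are (start, end) character offsets; partial annotation means
# regions with no gold span simply yield '?' (ignored in training).

def annotate(candidates, gold):
    gold_set = set(gold)
    labels = []
    for c in candidates:
        if c in gold_set:
            labels.append("+")  # exact match with a gold morpheme
        elif any(c[0] < g[1] and g[0] < c[1] for g in gold_set):
            labels.append("-")  # overlaps a `+` span by >= 1 character
        else:
            labels.append("?")  # no supervision available
    return labels

gold = [(0, 2), (2, 3)]                    # gold segmentation 東京|都
cands = [(0, 2), (2, 3), (0, 3), (5, 7)]   # (5, 7) lies in an unannotated region
print(annotate(cands, gold))               # ['+', '+', '-', '?']
```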
### Training
Nodes are featurized with JUMAN++‑style unigram features, edges are modeled as undirected (bidirectional), and a GATv2 [2] is trained on the resulting graphs.
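For intuition about the GATv2 layer, the toy sketch below computes its attention weights, `e_ij = a · LeakyReLU(W [h_i || h_j])` followed by a softmax over the neighborhood, in plain Python with tiny fixed-size matrices. The actual model uses a full PyG `GATv2Conv` stack; the shapes and weights here are illustrative only:

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gatv2_scores(h_i, neighbors, W, a):
    """GATv2 attention (single head, toy sizes):
    e_ij = a . LeakyReLU(W [h_i || h_j]), softmax-normalised over neighbors.
    h_i and each neighbor are 2-dim features; W is 2x4, a is 2-dim."""
    def score(h_j):
        z = [h_i[0], h_i[1], h_j[0], h_j[1]]  # concatenation [h_i || h_j]
        Wz = [sum(W[r][c] * z[c] for c in range(4)) for r in range(2)]
        return sum(a[r] * leaky_relu(Wz[r]) for r in range(2))

    e = [score(h_j) for h_j in neighbors]
    m = max(e)                                # stabilised softmax
    exp = [math.exp(v - m) for v in e]
    s = sum(exp)
    return [v / s for v in exp]

alpha = gatv2_scores([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]],
                     W=[[1, 0, 0, 0], [0, 0, 0, 1]], a=[1.0, 1.0])
print(alpha)  # attention weights over the two neighbors, summing to 1
```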
### Inference
Use the model’s node scores and run Viterbi to search the optimal non‑overlapping path.
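A minimal sketch of this decoding step, assuming per-candidate node scores and `(start, end)` character spans (the function and argument names are hypothetical, not `infer.py`'s interface):

```python
# Sketch: Viterbi-style search for the best non-overlapping sequence of
# candidate spans covering the whole sentence, given node scores.

def viterbi(n_chars, candidates, scores):
    """candidates: list of (start, end) spans; scores: model score per span.
    Returns indices of the highest-scoring path covering [0, n_chars)."""
    best = [float("-inf")] * (n_chars + 1)
    back = [None] * (n_chars + 1)
    best[0] = 0.0
    for pos in range(n_chars):
        if best[pos] == float("-inf"):
            continue  # position unreachable by any candidate sequence
        for idx, (s, e) in enumerate(candidates):
            if s == pos and best[pos] + scores[idx] > best[e]:
                best[e] = best[pos] + scores[idx]
                back[e] = idx
    # Trace back from the end of the sentence
    path, pos = [], n_chars
    while pos > 0:
        idx = back[pos]
        path.append(idx)
        pos = candidates[idx][0]
    return list(reversed(path))

# "東京都": 3 characters, three candidate spans
spans = [(0, 2), (2, 3), (0, 3)]
print(viterbi(3, spans, [1.0, 0.5, 2.0]))  # [2] -> the single span 東京都
```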
### Results (KWDLC test)
- Trained model (sample_model): Seg F1 0.9725, POS F1 0.9562
- MeCab (JUMANDIC) baseline: Seg F1 0.9677, POS F1 0.9465
The GATv2 model trained with this repository (current code and `configs/gatv2.yaml`) using the official KWDLC split outperforms MeCab on both segmentation and POS accuracy.
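Segmentation F1 here is the standard span-level metric; a minimal sketch over `(start, end)` span sets is shown below (the repository's `evaluate.py` may differ in detail):

```python
# Sketch: span-level segmentation F1 (a common definition, assumed here).

def seg_f1(pred_spans, gold_spans):
    """F1 over (start, end) span sets: a span counts as correct only if
    both boundaries match a gold span exactly."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(seg_f1([(0, 3)], [(0, 2), (2, 3)]))  # 0.0: no boundary agrees exactly
```

POS F1 is computed analogously, requiring the POS tag to match in addition to both boundaries.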
## Environment Setup
### Tested Environment
- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
- Python: 3.11.3
- PyTorch: 2.2.2+cu121
- CUDA (runtime): 12.1 (cu121)
- MeCab (binary): 0.996
- JUMANDIC: `/var/lib/mecab/dic/juman-utf8`
### MeCab Setup (Ubuntu 24.04)
1) Install packages (includes the JUMANDIC dictionary)
```bash
sudo apt update
sudo apt install -y mecab mecab-utils libmecab-dev mecab-jumandic-utf8
```
2) Verify installation
```bash
mecab -v # e.g., mecab of 0.996
test -d /var/lib/mecab/dic/juman-utf8 && echo "JUMANDIC OK"
```
### Project Setup
```bash
# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create venv and install dependencies
uv venv
source .venv/bin/activate
uv sync
```
### Running on Hugging Face Spaces (CPU)
- Use Python 3.11 in Space settings (or metadata).
- Add `requirements.txt` and `packages.txt` from this repo to the Space root.
- `requirements.txt` pins `numpy<2` and CPU wheels for PyTorch 2.2.x and PyG.
- `packages.txt` installs MeCab and JUMANDIC via apt.
- Rebuild the Space and verify:
- `python -c "import numpy, torch; print(numpy.__version__, torch.__version__)"` shows NumPy 1.26.x and Torch 2.2.x.
## Quickstart (Morphological analysis)
```bash
# Analyze a single sentence with the bundled sample model
python infer.py --text "東京都の外国人参政権"
# Interactive mode
python infer.py
# After training, specify an experiment to use a custom model
python infer.py --experiment gatv2_YYYYMMDD_HHMMSS --text "..."
```
Note
- When no experiment is specified, the model at `sample_model/` is loaded by default.
## Train by yourself
### KWDLC Setup (Required)
```bash
cd /path/to/Mecari
git clone --depth 1 https://github.com/ku-nlp/KWDLC
```
- Training requires KWDLC (non‑KWDLC training is not supported at the moment).
- Splits strictly follow the official `dev.id` / `test.id` files.
### Preprocessing
```bash
python preprocess.py --config configs/gatv2.yaml
```
### Training
```bash
python train.py --config configs/gatv2.yaml
```
- Outputs are saved under `experiments/<name>/`.
- The bundled model was trained with the current codebase and configuration (`configs/gatv2.yaml`).
### Evaluation
```bash
python evaluate.py --max-samples 50 \
--experiment gatv2_YYYYMMDD_HHMMSS
```
## License
CC BY‑NC 4.0 (non‑commercial use only)
## Acknowledgments
- [1] Gleb Mazovetskiy and Taku Kudo. “Data processing for Japanese text‑to‑pronunciation models.” NLP2024 Workshop on Japanese Language Resources, pp. 19–23. URL: https://jedworkshop.github.io/JLR2024/materials/b-2.pdf
- [2] Shaked Brody, Uri Alon, and Eran Yahav. “How Attentive Are Graph Attention Networks?” 10th International Conference on Learning Representations (ICLR 2022).
## Disclaimer
- Independent academic implementation for educational and research purposes.
- Core concepts (graph‑based morpheme boundary annotation) follow the published work; implementation details and code structure are our interpretation.
- Not affiliated with, endorsed by, or connected to Google or its subsidiaries.
## Purpose
- Academic research
- Education
- Technical skill development
- Understanding of NLP techniques