---
title: Mecari Morpheme Analyzer
emoji: 🧩
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.37.2
app_file: app.py
pinned: false
---

# Mecari (Japanese Morphological Analysis with Graph Neural Networks)

## Demo
You can try Mecari at https://huggingface.co/spaces/zbller/Mecari

## Overview

Mecari [1] is Google's GNN (Graph Neural Network)-based Japanese morphological analyzer. It supports training from partially annotated graphs (only candidates labeled `+` or `-` contribute to the loss; `?` candidates are ignored) and aims for fast training and inference.

<p align="center">
  <img src="abst.png" alt="Overview" width="70%" />
</p>

### Graph
The graph is built from MeCab morpheme candidates.
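As an illustration, one plausible way to connect candidate nodes is to link spans that could be consecutive morphemes. This is an assumption about the graph structure for explanatory purposes, not the repository's actual construction (which lives in the preprocessing code):

```python
# Illustrative sketch (assumed adjacency structure, not the repo's exact code):
# nodes are candidate (start, end) character spans, and an edge links two spans
# when the second starts exactly where the first ends.

def build_edges(candidates):
    """Return (i, j) index pairs for candidates that can be consecutive."""
    edges = []
    for i, (s1, e1) in enumerate(candidates):
        for j, (s2, e2) in enumerate(candidates):
            if e1 == s2:          # c_j starts where c_i ends
                edges.append((i, j))
    return edges

cands = [(0, 2), (2, 3), (0, 3)]  # e.g., 東京 / 都 / 東京都
print(build_edges(cands))          # [(0, 1)]
```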

### Annotation
Annotations are created by matching morpheme candidates against the gold segmentation and serve as the training targets (supervision) during learning.
- `+`: Candidate that exactly matches the gold.
- `-`: Any other candidate that overlaps a `+` candidate by at least one character.
- `?`: All other candidates (ignored during training).
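The labeling rules above can be sketched as follows (a hypothetical helper over `(start, end)` character spans, not the repository's actual code):

```python
# Sketch of the +/-/? annotation scheme: candidates and gold morphemes are
# (start, end) character spans over the sentence.

def annotate(candidates, gold):
    """Label each candidate: '+' exact gold match, '-' overlaps a '+' span,
    '?' otherwise (ignored during training)."""
    gold_set = set(gold)
    plus = [c for c in candidates if c in gold_set]
    labels = {}
    for c in candidates:
        if c in gold_set:
            labels[c] = "+"
        elif any(c[0] < p[1] and p[0] < c[1] for p in plus):
            labels[c] = "-"   # shares at least one character with a '+' span
        else:
            labels[c] = "?"
    return labels

# Gold segmentation 東京 / 都; the candidate 東京都 overlaps both '+' spans.
cands = [(0, 2), (2, 3), (0, 3)]
print(annotate(cands, gold=[(0, 2), (2, 3)]))
# {(0, 2): '+', (2, 3): '+', (0, 3): '-'}
```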

### Training
Nodes are featurized with JUMAN++‑style unigram features, edges are modeled as undirected (bidirectional), and a GATv2 [2] is trained on the resulting graphs.
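Because `?` candidates are ignored, the loss is effectively masked to annotated nodes only. A minimal stdlib sketch of such a masked per-node binary cross-entropy (the loss form is an assumption for illustration; the real training loop is in `train.py`):

```python
import math

def masked_bce(scores, labels):
    """scores: per-node probabilities in (0, 1); labels: '+', '-', or '?'.
    Only annotated nodes contribute; '?' nodes carry no gradient signal."""
    total, count = 0.0, 0
    for p, y in zip(scores, labels):
        if y == "?":
            continue                       # unannotated: skip entirely
        t = 1.0 if y == "+" else 0.0
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
        count += 1
    return total / count

# Three annotated nodes, one '?' node that is excluded from the average.
loss = masked_bce([0.9, 0.8, 0.2, 0.5], ["+", "+", "-", "?"])
print(round(loss, 5))  # 0.18388
```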

### Inference
Inference uses the model's node scores and runs a Viterbi search for the optimal sequence of non-overlapping candidates.
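This decoding step can be sketched in pure Python (illustrative only, not the repository's implementation): dynamic programming over character positions, picking the non-overlapping spans that cover the sentence with maximal total score.

```python
def viterbi(candidates, n):
    """candidates: list of (start, end, score); n: sentence length in chars.
    Returns the best-scoring full segmentation as a list of (start, end)."""
    best = {0: (0.0, None)}            # position -> (best score, back-pointer)
    by_start = {}
    for c in candidates:
        by_start.setdefault(c[0], []).append(c)
    for pos in range(n):
        if pos not in best:
            continue
        score_here, _ = best[pos]
        for start, end, s in by_start.get(pos, []):
            cand = score_here + s
            if end not in best or cand > best[end][0]:
                best[end] = (cand, (start, end))
    path, pos = [], n                  # follow back-pointers from the end
    while pos > 0:
        _, span = best[pos]
        path.append(span)
        pos = span[0]
    return path[::-1]

cands = [(0, 2, 1.5), (2, 3, 1.0), (0, 3, 2.0)]  # 東京, 都, 東京都
print(viterbi(cands, 3))
# [(0, 2), (2, 3)]  -> total 2.5 beats the single span 東京都 at 2.0
```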

### Results (KWDLC test)

- Trained model (sample_model): Seg F1 0.9725, POS F1 0.9562
- MeCab (JUMANDIC) baseline:   Seg F1 0.9677, POS F1 0.9465

The GATv2 model trained with this repository (current code and `configs/gatv2.yaml`) on the official KWDLC split outperforms the MeCab (JUMANDIC) baseline on both segmentation and POS F1.

## Environment Setup
### Tested Environment

- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
- Python: 3.11.3
- PyTorch: 2.2.2+cu121
- CUDA (runtime): 12.1 (cu121)
- MeCab (binary): 0.996
- JUMANDIC: `/var/lib/mecab/dic/juman-utf8`

### MeCab Setup (Ubuntu 24.04)
1) Install packages (includes the JUMANDIC dictionary)

```bash
sudo apt update
sudo apt install -y mecab mecab-utils libmecab-dev mecab-jumandic-utf8
```

2) Verify installation

```bash
mecab -v                       # e.g., mecab of 0.996
test -d /var/lib/mecab/dic/juman-utf8 && echo "JUMANDIC OK"
```

### Project Setup

```bash
# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create venv and install dependencies
uv venv
source .venv/bin/activate
uv sync
```

### Running on Hugging Face Spaces (CPU)
- Use Python 3.11 in Space settings (or metadata).
- Add `requirements.txt` and `packages.txt` from this repo to the Space root.
  - `requirements.txt` pins `numpy<2` and CPU wheels for PyTorch 2.2.x and PyG.
  - `packages.txt` installs MeCab and JUMANDIC via apt.
- Rebuild the Space and verify:
  - `python -c "import numpy, torch; print(numpy.__version__, torch.__version__)"` shows NumPy 1.26.x and Torch 2.2.x.

## Quickstart (Morphological analysis)

```bash
# Analyze a single sentence with the bundled sample model
python infer.py --text "東京都の外国人参政権"

# Interactive mode
python infer.py

# After training, specify an experiment to use a custom model
python infer.py --experiment gatv2_YYYYMMDD_HHMMSS --text "..."
```

Note
- When no experiment is specified, the model at `sample_model/` is loaded by default.

## Train by yourself
### KWDLC Setup (Required)

```bash
cd /path/to/Mecari
git clone --depth 1 https://github.com/ku-nlp/KWDLC
```

- Training requires KWDLC (non‑KWDLC training is not supported at the moment).
- Splits strictly follow the official `dev.id` / `test.id` files.


### Preprocessing

```bash
python preprocess.py --config configs/gatv2.yaml
```

### Training

```bash
python train.py --config configs/gatv2.yaml
```

- Outputs are saved under `experiments/<name>/`.
- The bundled model was trained with the current codebase and configuration (`configs/gatv2.yaml`).

### Evaluation

```bash
python evaluate.py --max-samples 50 \
  --experiment gatv2_YYYYMMDD_HHMMSS
```


## License

CC BY‑NC 4.0 (non‑commercial use only)

## Acknowledgments
- [1] Gleb Mazovetskiy and Taku Kudo, “Data processing for Japanese text-to-pronunciation models”, NLP2024 Workshop on Japanese Language Resources, pp. 19–23. URL: https://jedworkshop.github.io/JLR2024/materials/b-2.pdf
- [2] Shaked Brody, Uri Alon, and Eran Yahav, “How Attentive Are Graph Attention Networks?”, 10th International Conference on Learning Representations (ICLR 2022), 2022.

## Disclaimer
- Independent academic implementation for educational and research purposes.
- Core concepts (graph‑based morpheme boundary annotation) follow the published work; implementation details and code structure are our interpretation.
- Not affiliated with, endorsed by, or connected to Google or its subsidiaries.

## Purpose
- Academic research
- Education
- Technical skill development
- Understanding of NLP techniques