	---
	title: Mecari Morpheme Analyzer
	emoji: 🧩
	colorFrom: indigo
	colorTo: blue
	sdk: gradio
	sdk_version: 4.37.2
	app_file: app.py
	pinned: false
	---

	# Mecari (Japanese Morphological Analysis with Graph Neural Networks)

	## Demo
	You can try Mecari at https://huggingface.co/spaces/zbller/Mecari

	## Overview

	Mecari [1] is Google's GNN (Graph Neural Network)‑based Japanese morphological analyzer; this repository is an independent implementation of it. It supports training from partially annotated graphs (only `+`/`-` labels are used where available; `?` is ignored) and aims for fast training and inference.

	<p align="center">
	<img src="abst.png" alt="Overview" width="70%" />
	<!-- Adjust width (e.g., 60%, 50%, or px) as desired -->

	</p>

	### Graph
	The graph is built from MeCab morpheme candidates.
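
As a rough illustration, the candidate lattice can be pictured as one node per candidate span plus edges between candidates that are adjacent in the text. The sketch below hard-codes a few hypothetical candidates instead of calling MeCab; the `Candidate` class, the `build_edges` helper, and the adjacency rule are assumptions made for illustration, not this repository's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    start: int    # inclusive character offset
    end: int      # exclusive character offset
    surface: str  # surface form covered by the span


def build_edges(candidates):
    """Connect candidate i to candidate j when j starts where i ends (assumed rule)."""
    edges = []
    for i, a in enumerate(candidates):
        for j, b in enumerate(candidates):
            if a.end == b.start:
                edges.append((i, j))
    return edges


# Hypothetical candidates for the text "東京都の" (not real MeCab output).
candidates = [
    Candidate(0, 1, "東"),
    Candidate(0, 2, "東京"),
    Candidate(0, 3, "東京都"),
    Candidate(1, 3, "京都"),
    Candidate(2, 3, "都"),
    Candidate(3, 4, "の"),
]
edges = build_edges(candidates)
print(edges)
```

Each edge links two candidates that could appear consecutively in a segmentation, so every full segmentation corresponds to a path through this graph.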

	### Annotation
	Annotations are created by matching morpheme candidates to gold labels.
	Annotations serve as the training targets (supervision) during learning.
	- `+`: Candidate that exactly matches the gold.
	- `-`: Any other candidate that overlaps a `+` candidate by at least one character.
	- `?`: All other candidates (ignored during training).
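
The three labeling rules above can be sketched as a small function. This is a minimal illustration that assumes candidates and gold morphemes are compared as `(start, end, surface)` tuples; the `annotate` helper and the toy data are hypothetical, and the repository may match on additional fields such as POS.

```python
def annotate(candidates, gold):
    """Label each candidate '+', '-' or '?' following the rules above.

    candidates, gold: lists of (start, end, surface) tuples, end exclusive.
    """
    gold_set = set(gold)
    plus = [c for c in candidates if c in gold_set]
    labels = []
    for c in candidates:
        if c in gold_set:
            labels.append("+")  # exact match with a gold morpheme
        elif any(c[0] < p[1] and p[0] < c[1] for p in plus):
            labels.append("-")  # shares at least one character with a '+' span
        else:
            labels.append("?")  # no information; ignored during training
    return labels


# Hypothetical candidates for "東京都の", with gold available only for "東京都".
candidates = [
    (0, 1, "東"), (0, 2, "東京"), (0, 3, "東京都"),
    (1, 3, "京都"), (2, 3, "都"), (3, 4, "の"),
]
gold = [(0, 3, "東京都")]
labels = annotate(candidates, gold)
print(labels)
```

Because "の" has no gold annotation here and does not overlap any `+` span, it stays `?`, which is how a partially annotated sentence can still contribute training signal.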

	### Training
	Nodes are featurized with JUMAN++‑style unigram features, edges are modeled as undirected (bidirectional), and a GATv2 [2] is trained on the resulting graphs.
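
Two of the ideas above can be sketched without any deep learning library: making edges bidirectional, and computing a loss only on nodes labeled `+` or `-`. The snippet is a simplified illustration; `to_undirected`, `masked_bce`, and the choice of binary cross-entropy are assumptions, not the repository's training code.

```python
import math


def to_undirected(edges):
    """Return the edge list with every (i, j) also present as (j, i)."""
    return sorted(set(edges) | {(j, i) for i, j in edges})


def masked_bce(scores, labels):
    """Mean binary cross-entropy over '+'/'-' nodes; '?' nodes are skipped.

    scores: raw node logits; labels: '+', '-' or '?' for each node.
    """
    total, count = 0.0, 0
    for s, y in zip(scores, labels):
        if y == "?":
            continue  # unlabeled node: contributes nothing to the loss
        target = 1.0 if y == "+" else 0.0
        # Numerically stable form of BCE-with-logits.
        total += max(s, 0.0) - s * target + math.log1p(math.exp(-abs(s)))
        count += 1
    return total / count if count else 0.0


print(to_undirected([(0, 1), (1, 2)]))
print(masked_bce([0.0, 0.0, 5.0], ["+", "-", "?"]))
```

The score of a `?` node can take any value without changing the loss, which is the property that makes training on partially annotated graphs possible.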

	### Inference
	Run Viterbi over the model's node scores to find the optimal non‑overlapping path.
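
The search step can be sketched as dynamic programming over character positions. The example assumes that a path's score is the sum of its node scores and that the path must cover the whole text; `viterbi_path`, the candidates, and the scores are hypothetical, and the repository's decoder may differ in detail.

```python
def viterbi_path(n_chars, candidates, scores):
    """Return indices of the best-scoring candidates that tile [0, n_chars).

    candidates: list of (start, end, surface) tuples, end exclusive.
    scores: one float per candidate (e.g. model node scores).
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (n_chars + 1)  # best[p]: best score of a tiling of [0, p)
    back = [None] * (n_chars + 1)     # back[p]: last candidate in that tiling
    best[0] = 0.0
    # Visit candidates by end offset so best[start] is final when it is read.
    for i in sorted(range(len(candidates)), key=lambda k: candidates[k][1]):
        start, end, _ = candidates[i]
        if best[start] == neg_inf:
            continue  # no tiling reaches this candidate's start
        total = best[start] + scores[i]
        if total > best[end]:
            best[end], back[end] = total, i
    if back[n_chars] is None:
        raise ValueError("no candidate sequence covers the whole text")
    path, pos = [], n_chars
    while pos > 0:  # follow back-pointers from the end of the text
        i = back[pos]
        path.append(i)
        pos = candidates[i][0]
    return path[::-1]


candidates = [
    (0, 1, "東"), (0, 2, "東京"), (0, 3, "東京都"),
    (1, 3, "京都"), (2, 3, "都"), (3, 4, "の"),
]
scores = [-1.0, 0.5, 2.0, 0.3, 0.1, 1.5]  # made-up node scores
path = viterbi_path(4, candidates, scores)
print([candidates[i][2] for i in path])
```

Because each chosen candidate starts exactly where the previous one ends, the resulting path is non-overlapping by construction.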

	### Results (KWDLC test)

	- Trained model (sample_model): Seg F1 0.9725, POS F1 0.9562
	- MeCab (JUMANDIC) baseline: Seg F1 0.9677, POS F1 0.9465

	The GATv2 model trained with this repository (current code and `configs/gatv2.yaml`) on the official KWDLC split outperforms MeCab on both segmentation F1 and POS F1.
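
For readers unfamiliar with the metric, segmentation F1 is commonly computed over predicted and gold spans. The sketch below shows one such span-level F1 on toy data; it is an assumption about the general shape of the metric, not necessarily how `evaluate.py` produces the numbers above, and `span_f1` is a hypothetical helper. Adding a POS tag to each tuple would give a POS F1 in the same way.

```python
def span_f1(pred, gold):
    """F1 over exact span matches; pred and gold are iterables of hashable spans."""
    pred_set, gold_set = set(pred), set(gold)
    tp = len(pred_set & gold_set)  # spans predicted exactly right
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy spans (start, end): the prediction splits the first gold morpheme in two.
gold = [(0, 3), (3, 4), (4, 7)]
pred = [(0, 2), (2, 3), (3, 4), (4, 7)]
f1 = span_f1(pred, gold)
print(round(f1, 4))
```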

	## Environment Setup
	### Tested Environment

	- OS: Ubuntu 24.04.3 LTS (Noble Numbat)
	- Python: 3.11.3
	- PyTorch: 2.2.2+cu121
	- CUDA (runtime): 12.1 (cu121)
	- MeCab (binary): 0.996
	- JUMANDIC: `/var/lib/mecab/dic/juman-utf8`

	### MeCab Setup (Ubuntu 24.04)
	1) Install packages (includes the JUMANDIC dictionary)

	```bash
	sudo apt update
	sudo apt install -y mecab mecab-utils libmecab-dev mecab-jumandic-utf8
	```

	2) Verify installation

	```bash
	mecab -v # e.g., mecab of 0.996
	test -d /var/lib/mecab/dic/juman-utf8 && echo "JUMANDIC OK"
	```

	### Project Setup

	```bash
	# Install uv if needed
	curl -LsSf https://astral.sh/uv/install.sh | sh

	# Create venv and install dependencies
	uv venv
	source .venv/bin/activate
	uv sync
	```

	### Running on Hugging Face Spaces (CPU)
	- Use Python 3.11 in Space settings (or metadata).
	- Add `requirements.txt` and `packages.txt` from this repo to the Space root.
	- `requirements.txt` pins `numpy<2` and CPU wheels for PyTorch 2.2.x and PyG.
	- `packages.txt` installs MeCab and JUMANDIC via apt.
	- Rebuild the Space and verify that `python -c "import numpy, torch; print(numpy.__version__, torch.__version__)"` shows NumPy 1.26.x and Torch 2.2.x.

	## Quickstart (Morphological analysis)

	```bash
	# Analyze a single sentence with the bundled sample model
	python infer.py --text "東京都の外国人参政権"

	# Interactive mode
	python infer.py

	# After training, specify an experiment to use a custom model
	python infer.py --experiment gatv2_YYYYMMDD_HHMMSS --text "..."
	```

	Note: when no experiment is specified, the model at `sample_model/` is loaded by default.

	## Train your own model
	### KWDLC Setup (Required)

	```bash
	cd /path/to/Mecari
	git clone --depth 1 https://github.com/ku-nlp/KWDLC
	```

	- Training requires KWDLC (non‑KWDLC training is not supported at the moment).
	- Splits strictly follow the official `dev.id` / `test.id` files.
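
Following ID files for the split can be sketched as a lookup from document ID to split name. The snippet assumes each ID file lists one document ID per line and that any document not listed in `dev.id` or `test.id` is used for training; `read_ids`, `assign_split`, and the IDs shown are hypothetical, and the preprocessing code may handle this differently.

```python
def read_ids(lines):
    """Collect non-empty, stripped IDs from the lines of an ID file."""
    return {line.strip() for line in lines if line.strip()}


def assign_split(doc_id, dev_ids, test_ids):
    """Return 'dev', 'test' or 'train' for a document ID (assumed rule)."""
    if doc_id in test_ids:
        return "test"
    if doc_id in dev_ids:
        return "dev"
    return "train"


# Stand-ins for the contents of dev.id / test.id (made-up IDs).
dev_ids = read_ids(["doc-001\n", "doc-002\n", "\n"])
test_ids = read_ids(["doc-003\n"])
splits = {d: assign_split(d, dev_ids, test_ids)
          for d in ["doc-001", "doc-003", "doc-004"]}
print(splits)
```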


	### Preprocessing

	```bash
	python preprocess.py --config configs/gatv2.yaml
	```

	### Training

	```bash
	python train.py --config configs/gatv2.yaml
	```

	- Outputs are saved under `experiments/<name>/`.
	- The bundled model was trained with the current codebase and configuration (`configs/gatv2.yaml`).

	### Evaluation

	```bash
	python evaluate.py --max-samples 50 \
	--experiment gatv2_YYYYMMDD_HHMMSS
	```


	## License

	CC BY‑NC 4.0 (non‑commercial use only)

	## Acknowledgments
	- [1] “Data processing for Japanese text‑to‑pronunciation models”, Gleb Mazovetskiy, Taku Kudo, NLP2024 Workshop on Japanese Language Resources, URL: https://jedworkshop.github.io/JLR2024/materials/b-2.pdf (pp. 19–23)
	- [2] “How Attentive are Graph Attention Networks?”, Shaked Brody, Uri Alon, Eran Yahav, 10th International Conference on Learning Representations (ICLR 2022)

	## Disclaimer
	- Independent academic implementation for educational and research purposes.
	- Core concepts (graph‑based morpheme boundary annotation) follow the published work; implementation details and code structure are our interpretation.
	- Not affiliated with, endorsed by, or connected to Google or its subsidiaries.

	## Purpose
	- Academic research
	- Education
	- Technical skill development
	- Understanding of NLP techniques