---
title: Mecari Morpheme Analyzer
emoji: 🧩
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.37.2
app_file: app.py
pinned: false
---

Mecari (Japanese Morphological Analysis with Graph Neural Networks)

Demo

You can try Mecari at https://huggingface.co/spaces/zbller/Mecari

Overview

Mecari is a GNN (Graph Neural Network)‑based Japanese morphological analyzer that independently implements the approach described by Google researchers [1]. It supports training from partially annotated graphs ('+' and '-' labels are used where available; '?' is ignored) and aims for fast training and inference.


Graph

The graph is built from MeCab morpheme candidates.
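As a minimal sketch of what such a candidate graph can look like, assume each MeCab candidate is a (start, end) character span and edges connect candidates whose boundaries meet; the repository's actual graph builder may differ:

```python
def build_graph(candidates):
    """Connect candidate morpheme spans whose boundaries are adjacent.

    candidates: list of (start, end) character spans from a MeCab lattice.
    Returns a directed edge list (treated as undirected downstream).
    Illustrative sketch only -- not the repository's implementation.
    """
    edges = []
    for i, (_, end_i) in enumerate(candidates):
        for j, (start_j, _) in enumerate(candidates):
            if end_i == start_j:  # candidate j begins where candidate i ends
                edges.append((i, j))
    return edges
```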

Annotation

Annotations are created by matching morpheme candidates to gold labels. Annotations serve as the training targets (supervision) during learning.

  • +: Candidate that exactly matches the gold.
  • -: Any other candidate that overlaps a + candidate by at least one character.
  • ?: All other candidates (ignored during training).
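The three labeling rules above can be illustrated with a few lines of Python over (start, end) character spans; `annotate` is a hypothetical helper for illustration, not a function from this repository:

```python
def annotate(candidates, gold):
    """Label candidate spans against gold spans with '+', '-', or '?'."""
    gold_set = set(gold)
    plus = [c for c in candidates if c in gold_set]

    def overlaps(a, b):
        # Spans share at least one character.
        return a[0] < b[1] and b[0] < a[1]

    labels = {}
    for c in candidates:
        if c in gold_set:
            labels[c] = "+"          # exact match with a gold morpheme
        elif any(overlaps(c, p) for p in plus):
            labels[c] = "-"          # conflicts with a + candidate
        else:
            labels[c] = "?"          # ignored during training
    return labels
```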

Training

Nodes are featurized with JUMAN++‑style unigram features, edges are modeled as undirected (bidirectional), and a GATv2 [2] is trained on the resulting graphs.
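The undirected modeling amounts to symmetrizing the edge list before message passing; with PyTorch Geometric one would typically call `torch_geometric.utils.to_undirected`, but a dependency‑free sketch of the idea looks like this:

```python
def to_undirected(edges):
    """Add the reverse of every directed edge (deduplicated), so that
    attention/messages flow in both directions between candidate nodes.
    Sketch only; the repository may rely on PyG utilities instead.
    """
    out, seen = [], set()
    for edge in edges:
        for pair in (edge, (edge[1], edge[0])):
            if pair not in seen:
                seen.add(pair)
                out.append(pair)
    return out
```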

Inference

Using the model’s node scores, run a Viterbi search to find the optimal path of non‑overlapping candidates.
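The decoding step can be sketched as a simple dynamic program over character positions, assuming candidates are (start, end) spans with per‑node scores; `best_path` is illustrative, not the repository's decoder:

```python
def best_path(candidates, scores, length):
    """Viterbi-style DP: pick the highest-scoring sequence of
    non-overlapping candidate spans covering positions 0..length."""
    NEG = float("-inf")
    best = [NEG] * (length + 1)   # best[i] = best score covering [0, i)
    back = [None] * (length + 1)  # candidate index ending at i on best path
    best[0] = 0.0
    for i in range(1, length + 1):
        for k, (s, e) in enumerate(candidates):
            if e == i and best[s] > NEG and best[s] + scores[k] > best[i]:
                best[i] = best[s] + scores[k]
                back[i] = k
    # Recover the chosen spans by walking the back pointers.
    path, i = [], length
    while i > 0 and back[i] is not None:
        k = back[i]
        path.append(candidates[k])
        i = candidates[k][0]
    return list(reversed(path))
```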

Results (KWDLC test)

  • Trained model (sample_model): Seg F1 0.9725, POS F1 0.9562
  • MeCab (JUMANDIC) baseline: Seg F1 0.9677, POS F1 0.9465

The GATv2 model trained with this repository (current code and configs/gatv2.yaml) using the official KWDLC split outperforms MeCab on both segmentation and POS accuracy.
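For reference, a span‑based segmentation F1 of the kind reported above can be computed as follows (sketch only; evaluate.py may define the metric differently):

```python
def seg_f1(pred_spans, gold_spans):
    """Segmentation F1 over morpheme spans; a span counts as a true
    positive only when it exactly matches a gold span."""
    pred, gold = set(pred_spans), set(gold_spans)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0
```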

Environment Setup

Tested Environment

  • OS: Ubuntu 24.04.3 LTS (Noble Numbat)
  • Python: 3.11.3
  • PyTorch: 2.2.2+cu121
  • CUDA (runtime): 12.1 (cu121)
  • MeCab (binary): 0.996
  • JUMANDIC: /var/lib/mecab/dic/juman-utf8

MeCab Setup (Ubuntu 24.04)

  1. Install packages (includes the JUMANDIC dictionary)
sudo apt update
sudo apt install -y mecab mecab-utils libmecab-dev mecab-jumandic-utf8
  2. Verify the installation
mecab -v                       # e.g., mecab of 0.996
test -d /var/lib/mecab/dic/juman-utf8 && echo "JUMANDIC OK"

Project Setup

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create venv and install dependencies
uv venv
source .venv/bin/activate
uv sync

Running on Hugging Face Spaces (CPU)

  • Use Python 3.11 in Space settings (or metadata).
  • Add requirements.txt and packages.txt from this repo to the Space root.
    • requirements.txt pins numpy<2 and CPU wheels for PyTorch 2.2.x and PyG.
    • packages.txt installs MeCab and JUMANDIC via apt.
  • Rebuild the Space and verify:
    • python -c "import numpy, torch; print(numpy.__version__, torch.__version__)" shows NumPy 1.26.x and Torch 2.2.x.

Quickstart (Morphological analysis)

# Analyze a single sentence with the bundled sample model
python infer.py --text "東京都の外国人参政権"

# Interactive mode
python infer.py

# After training, specify an experiment to use a custom model
python infer.py --experiment gatv2_YYYYMMDD_HHMMSS --text "..."

Note

  • When no experiment is specified, the model at sample_model/ is loaded by default.

Train Your Own Model

KWDLC Setup (Required)

cd /path/to/Mecari
git clone --depth 1 https://github.com/ku-nlp/KWDLC
  • Training requires KWDLC (non‑KWDLC training is not supported at the moment).
  • Splits strictly follow the official dev.id / test.id files.

Preprocessing

python preprocess.py --config configs/gatv2.yaml

Training

python train.py --config configs/gatv2.yaml
  • Outputs are saved under experiments/<name>/.
  • The bundled model was trained with the current codebase and configuration (configs/gatv2.yaml).

Evaluation

python evaluate.py --max-samples 50 \
  --experiment gatv2_YYYYMMDD_HHMMSS

License

CC BY‑NC 4.0 (non‑commercial use only)

Acknowledgments

  • [1] “Data processing for Japanese text‑to‑pronunciation models”, Gleb Mazovetskiy, Taku Kudo, NLP2024 Workshop on Japanese Language Resources, URL: https://jedworkshop.github.io/JLR2024/materials/b-2.pdf (pp. 19–23)
  • [2] Shaked Brody, Uri Alon, and Eran Yahav, “How Attentive Are Graph Attention Networks?”, 10th International Conference on Learning Representations (ICLR 2022).

Disclaimer

  • Independent academic implementation for educational and research purposes.
  • Core concepts (graph‑based morpheme boundary annotation) follow the published work; implementation details and code structure are our interpretation.
  • Not affiliated with, endorsed by, or connected to Google or its subsidiaries.

Purpose

  • Academic research
  • Education
  • Technical skill development
  • Understanding of NLP techniques