GAWA - Gaussian-Weighted Abstraction for Word Architecture

GAWA model architecture diagram

Model Overview

GAWA is a word-level morphological autoencoder. It maps a word (a sequence of characters) to a single dense vector (e_word) using a Gaussian positional prior, then reconstructs the word with an autoregressive decoder.
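The exact pooling is defined in the source code; the sketch below only illustrates the general idea of a Gaussian positional prior over characters, with all names and parameters (mu, sigma, dimensions) hypothetical:

```python
import numpy as np

def gaussian_pool(char_embs: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """Pool per-character embeddings of shape (L, d) into one word
    vector of shape (d,) using Gaussian weights over character positions."""
    L = char_embs.shape[0]
    pos = np.linspace(0.0, 1.0, L)                 # relative character positions
    w = np.exp(-0.5 * ((pos - mu) / sigma) ** 2)   # Gaussian weight per position
    w = w / w.sum()                                # normalize into a prior
    return w @ char_embs                           # weighted sum over characters

# Pool 5 character embeddings of dimension 8, centered mid-word
e_word = gaussian_pool(np.random.rand(5, 8), mu=0.5, sigma=0.2)
```

Characters near the Gaussian's center contribute most to the word vector; the real model learns where to place that mass rather than fixing it by hand.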

Why this matters:

  • No subword vocabulary: works with any character-based language.
  • Handles unseen words: char-based encoding avoids UNK issues.
  • Compact sequence length: one word -> one vector.

What This Repo Provides

  • A pretrained checkpoint (see checkpoints/ on the model page)
  • Python utilities to encode and decode words
  • CLI utilities for training, evaluation, and encoding
  • Source code: https://github.com/AiRukua/gawa

Installation

pip install gawa

How To Use

from gawa import GAWAModel

# Load from Hugging Face Hub
model = GAWAModel.from_pretrained("AiRukua/gawa")

# Encode words to vectors, or round-trip them through the decoder
kept_words, embs = model.encode_words(["makan", "memakan", "makanan"])
kept_words, recs = model.decode_words(["makan", "memakan", "makanan"])
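Once encoded, word vectors can be compared directly, for example with cosine similarity. The vectors below are placeholders standing in for `encode_words` outputs, not real GAWA embeddings:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholders for embeddings of "makan" and "memakan"
e_makan = np.array([1.0, 0.2, 0.0])
e_memakan = np.array([0.9, 0.3, 0.1])
sim = cosine_sim(e_makan, e_memakan)  # values near 1.0 mean similar vectors
```

Morphologically related words (e.g. "makan" / "memakan") would be expected to land closer together than unrelated ones.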

Intended Use

  • Produce word embeddings for downstream models
  • Reconstruct words for qualitative evaluation
  • Explore morphology-aware word representations
  • Swap a BPE tokenizer for GAWA word vectors
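Swapping BPE for GAWA amounts to embedding each whitespace-separated word directly, so a sentence of N words yields exactly N vectors. A minimal sketch, with a dummy encoder standing in for the real model:

```python
import numpy as np

def embed_sentence(sentence: str, encode, dim: int) -> np.ndarray:
    """One vector per word, instead of a variable number of BPE tokens."""
    words = sentence.split()
    # encode(words) stands in for model.encode_words(words)
    _, embs = encode(words)
    return np.stack(embs)  # shape: (num_words, dim)

# Dummy encoder returning zero vectors, in place of the real GAWA model
dummy_encode = lambda ws: (ws, [np.zeros(4) for _ in ws])
mat = embed_sentence("saya makan nasi", dummy_encode, dim=4)
```

With BPE, the same sentence could produce a different, data-dependent number of tokens; here the sequence length is fixed by the word count.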

Limitations

  • Reconstruction quality depends on training data and config.
  • Very long or rare character patterns may be filtered by max_word_len.
  • This model focuses on word-level encoding; it does not model full sentences by itself.
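The max_word_len filter can be mirrored on the caller side to predict which inputs will be kept. The limit of 32 below is hypothetical; the real value comes from the model configuration:

```python
MAX_WORD_LEN = 32  # hypothetical; the actual limit is set in the model config

def filter_words(words: list[str], max_len: int = MAX_WORD_LEN) -> list[str]:
    """Drop words longer than max_len, mirroring the encoder's filtering."""
    return [w for w in words if len(w) <= max_len]

kept = filter_words(["makan", "x" * 40])  # the 40-character word is dropped
```

This is why `encode_words` returns `kept_words` alongside the embeddings: filtered words have no corresponding vector.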

Training Data

This model was trained on ~8.2 million unique words extracted from Indo4B (https://huggingface.co/datasets/taufiqdp/Indo4B).

Training Details

  • Decoder training: 2 epochs
  • Reconstruction accuracy: 94%
  • Dataset: ~8.2 million words extracted from Indo4B
  • Training time: ~12 hours
  • Hardware: NVIDIA T4 (Kaggle)

License

MIT License. See LICENSE in the repository.
