GAWA - Gaussian-Weighted Abstraction for Word Architecture
Model Overview
GAWA is a word-level morphological autoencoder. It maps a word (sequence of characters)
into a single dense vector (eword) using a Gaussian positional prior, then reconstructs
the word with an autoregressive decoder.
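The pooling idea can be illustrated with a short sketch. This is not the repository's actual code; gaussian_pool, mu, and sigma are illustrative names for a Gaussian positional prior over character positions.

import numpy as np

def gaussian_pool(char_states, mu, sigma):
    # Pool per-character states of shape (T, d) into a single word vector of shape (d,)
    # using a Gaussian prior over character positions centered at mu.
    positions = np.arange(char_states.shape[0])
    weights = np.exp(-0.5 * ((positions - mu) / sigma) ** 2)  # Gaussian positional prior
    weights /= weights.sum()                                  # normalize to a distribution
    return weights @ char_states                              # weighted sum over positions

# Example: 6 characters, 8-dimensional character states, prior centered mid-word
e_word = gaussian_pool(np.random.randn(6, 8), mu=2.5, sigma=1.5)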
Why this matters:
- No subword vocabulary: works with any character-based language.
- Handles unseen words: char-based encoding avoids UNK issues.
- Compact sequence length: one word -> one vector.
What This Repo Provides
- A pretrained checkpoint (see checkpoints/ on the model page)
- Python utilities to encode and decode words
- CLI utilities for training, evaluation, and encoding
- Source code: https://github.com/AiRukua/gawa
Installation
pip install gawa
How To Use
from gawa import GAWAModel
# Load from Hugging Face Hub
model = GAWAModel.from_pretrained("AiRukua/gawa")
# Encode words into vectors (embs) and reconstruct them (recs);
# kept_words lists the inputs that passed filtering
kept_words, embs = model.encode_words(["makan", "memakan", "makanan"])
kept_words, recs = model.decode_words(["makan", "memakan", "makanan"])
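A quick follow-up use of the embeddings, assuming embs is array-like with one row per kept word (convert first if your version returns a tensor):

import numpy as np

vecs = np.asarray(embs)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))  # makan vs memakan
print(cosine(vecs[0], vecs[2]))  # makan vs makanan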
Intended Use
- Produce word embeddings for downstream models
- Reconstruct words for qualitative evaluation
- Explore morphology-aware word representations
- Replace a BPE/subword tokenizer with GAWA word embeddings (see the sketch after this list)
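As a hypothetical sketch of the last point, a sentence can be split into words and encoded into one vector per word, which then replaces a subword embedding lookup in a downstream model. The exact shape and ordering of embs are assumptions here, not documented behavior.

import numpy as np

sentence = "saya suka makan nasi goreng"
words = sentence.split()

# One vector per kept word; words filtered out (e.g., by max_word_len) are dropped.
kept_words, embs = model.encode_words(words)
features = np.asarray(embs)  # assumed shape: (num_kept_words, embedding_dim)

# features can now be fed to a sequence model (e.g., a Transformer encoder)
# in place of BPE subword token embeddings.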
Limitations
- Reconstruction quality depends on training data and config.
- Very long or rare character patterns may be filtered out by max_word_len.
- This model focuses on word-level encoding; it does not model full sentences by itself.
Training Data
This model was trained on ~8.2 million unique words extracted from the Indo4B corpus (https://huggingface.co/datasets/taufiqdp/Indo4B).
Training Details
- Decoder training: 2 epochs
- Reconstruction accuracy: 94%
- Dataset: ~8.2 million words extracted from Indo4B
- Training time: ~12 hours
- Hardware: NVIDIA T4 (Kaggle)
License
MIT License. See LICENSE in the repository.