GAWA - Gaussian-Weighted Abstraction for Word Architecture
Model Overview
GAWA is a word-level morphological autoencoder. It maps a word (sequence of characters)
into a single dense vector (eword) using a Gaussian positional prior, then reconstructs
the word with an autoregressive decoder.
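The pooling idea can be illustrated with a short sketch. This is not the repository's actual code; gaussian_pool, mu, and sigma are illustrative names for a Gaussian positional prior over character positions.

import numpy as np

def gaussian_pool(char_states, mu, sigma):
    # Pool per-character states of shape (T, d) into a single word vector of shape (d,)
    # using a Gaussian prior over character positions centered at mu.
    positions = np.arange(char_states.shape[0])
    weights = np.exp(-0.5 * ((positions - mu) / sigma) ** 2)  # Gaussian positional prior
    weights /= weights.sum()                                  # normalize to a distribution
    return weights @ char_states                              # weighted sum over positions

# Example: 6 characters, 8-dimensional character states, prior centered mid-word
e_word = gaussian_pool(np.random.randn(6, 8), mu=2.5, sigma=1.5)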
Why this matters:
- No subword vocabulary: works with any character-based language.
- Handles unseen words: char-based encoding avoids UNK issues.
- Compact sequence length: one word -> one vector.
What This Repo Provides
- A pretrained checkpoint (see checkpoints/ on the model page)
- Python utilities to encode and decode words
- CLI utilities for training, evaluation, and encoding
- Source code: https://github.com/AiRukua/gawa
Installation
pip install gawa
How To Use
from gawa import GAWAModel
# Load from Hugging Face Hub
model = GAWAModel.from_pretrained("AiRukua/gawa")
# Encode words into vectors (embs) and reconstruct them (recs);
# kept_words lists the inputs that passed filtering
kept_words, embs = model.encode_words(["makan", "memakan", "makanan"])
kept_words, recs = model.decode_words(["makan", "memakan", "makanan"])
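A quick follow-up use of the embeddings, assuming embs is array-like with one row per kept word (convert first if your version returns a tensor):

import numpy as np

vecs = np.asarray(embs)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))  # makan vs memakan
print(cosine(vecs[0], vecs[2]))  # makan vs makanan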
Intended Use
- Produce word embeddings for downstream models
- Reconstruct words for qualitative evaluation
- Explore morphology-aware word representations
- Replace a BPE/subword tokenizer with GAWA word embeddings (see the sketch after this list)
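As a hypothetical sketch of the last point, a sentence can be split into words and encoded into one vector per word, which then replaces a subword embedding lookup in a downstream model. The exact shape and ordering of embs are assumptions here, not documented behavior.

import numpy as np

sentence = "saya suka makan nasi goreng"
words = sentence.split()

# One vector per kept word; words filtered out (e.g., by max_word_len) are dropped.
kept_words, embs = model.encode_words(words)
features = np.asarray(embs)  # assumed shape: (num_kept_words, embedding_dim)

# features can now be fed to a sequence model (e.g., a Transformer encoder)
# in place of BPE subword token embeddings.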
Limitations
- Reconstruction quality depends on training data and config.
- Very long or rare character patterns may be filtered out by max_word_len.
- This model focuses on word-level encoding; it does not model full sentences by itself.
Training Data
This model was trained on ~8.2 million unique words extracted from the Indo4B corpus (https://huggingface.co/datasets/taufiqdp/Indo4B).
Training Details
- Decoder training: 2 epochs
- Reconstruction accuracy: 94%
- Dataset: ~8.2 million words extracted from Indo4B
- Training time: ~12 hours
- Hardware: NVIDIA T4 (Kaggle)
License
MIT License. See LICENSE in the repository.