SG Transit Prefix Encoder

A lightweight (~110k params) character-level Siamese encoder that maps messy keyboard input to official Singapore bus-stop and MRT station names.

Built for on-device autocomplete in a mobile transit app.

What it handles

User types	Matches
`bed`	Bedok
`choachukang` (no spaces)	Choa Chu Kang
`cck`, `amk`, `ns4`	the corresponding station (initialism or station code)
`block 209`	Blk 209
`rd`, `stn`, `int`	Road, Station, Interchange
`bedik` (typo)	Bedok

Architecture

Tokenizer: custom character-level, fixed alphabet of 47 chars (a-z, 0-9, space, '-.&/,). Max length = 40 chars.
Encoder: Embedding(48) -> 1-layer BiGRU(96) -> masked mean-pool -> Linear(128). Shared (Siamese) weights for query and candidate.
Training: in-batch InfoNCE contrastive loss over 260k (input, official) pairs generated from LTA DataMall bus stops and MRT/LRT station data.
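The masked mean-pool and the in-batch InfoNCE loss named above can be sketched in NumPy. This is an illustrative sketch, not the repository's training code: the function names, the temperature of 0.07 and the array shapes are assumptions, and the real model does the same work on PyTorch tensors.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average per-character states over real (non-padding) positions only.

    hidden: (batch, seq_len, dim) encoder outputs
    mask:   (batch, seq_len), 1 for real characters and 0 for padding
    """
    mask = mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1), 1.0)  # guard against all-padding rows
    return summed / counts

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def info_nce(queries, candidates, temperature=0.07):
    """In-batch InfoNCE: row i of `queries` is the positive for row i of
    `candidates`; every other row in the batch serves as a negative.
    The temperature value is an assumption, not taken from config.json."""
    q = l2_normalize(queries)
    c = l2_normalize(candidates)
    logits = (q @ c.T) / temperature                      # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Because the weights are shared, the same encoder would produce both `queries` (noisy inputs) and `candidates` (official names) before the loss is computed.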

Usage

Two approaches are available, both suitable for Android / mobile apps.

Option A: embeddings.bin (no ML runtime required)

Ship only embeddings.bin and embeddings_names.json. The vectors are precomputed, so the app needs no TorchScript runtime and runs no model inference. Ideal when you want a simple exact/prefix lookup without running a neural net on every keystroke.

from huggingface_hub import hf_hub_download

REPO = "Luke-Yong/sg-transit-prefix-encoder"
bin_path = hf_hub_download(REPO, "embeddings.bin")
names_path = hf_hub_download(REPO, "embeddings_names.json")

embeddings.bin layout: uint32 N, uint32 D (little-endian), then N * D float32.
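To make the layout concrete, here is a small standard-library round trip that writes and reads a file in this format. The helper names `write_embeddings` and `read_embeddings` are hypothetical, and the sketch assumes row-major order (row i occupies D consecutive floats). The size check is an addition of this sketch that catches truncated downloads.

```python
import struct

HEADER = struct.Struct("<II")  # uint32 N, uint32 D, little-endian

def write_embeddings(path, rows):
    """Write a non-empty list of equal-length float rows in the layout above."""
    n, d = len(rows), len(rows[0])
    with open(path, "wb") as f:
        f.write(HEADER.pack(n, d))
        for row in rows:
            f.write(struct.pack(f"<{d}f", *row))

def read_embeddings(path):
    """Read the file back into a list of N rows with D floats each."""
    with open(path, "rb") as f:
        n, d = HEADER.unpack(f.read(HEADER.size))
        payload = f.read()
    expected = n * d * 4  # 4 bytes per float32
    if len(payload) != expected:
        raise ValueError(f"expected {expected} payload bytes, got {len(payload)}")
    flat = struct.unpack(f"<{n * d}f", payload)
    return [list(flat[i * d:(i + 1) * d]) for i in range(n)]
```

The same header-then-payload read translates directly to a little-endian byte buffer on Android.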

Option B: TorchScript model (full embedding inference)

Ship prefix_encoder_scripted.pt, char_vocab.json and busstops_and_stations_words_prefixes.csv, and run the encoder on-device via PyTorch Mobile / LibTorch for real-time embedding inference and typo-tolerant search.

from huggingface_hub import hf_hub_download

REPO = "Luke-Yong/sg-transit-prefix-encoder"
model_path = hf_hub_download(REPO, "prefix_encoder_scripted.pt")
vocab_path = hf_hub_download(REPO, "char_vocab.json")
csv_path  = hf_hub_download(REPO, "busstops_and_stations_words_prefixes.csv")

Python example (full pipeline)

import json, struct
import torch
import numpy as np
# char_encoder.py ships in this repo; download it and keep it on your import path
from char_encoder import CharEncoder, CharTokenizer
from huggingface_hub import hf_hub_download

REPO = "Luke-Yong/sg-transit-prefix-encoder"

# Load model and tokenizer from Hugging Face
model = CharEncoder.from_pretrained(REPO)
tokenizer = CharTokenizer.from_pretrained(REPO)

# Load precomputed embeddings + name lookup
bin_path = hf_hub_download(REPO, "embeddings.bin")
names_path = hf_hub_download(REPO, "embeddings_names.json")

with open(bin_path, "rb") as f:
    n, d = struct.unpack("<II", f.read(8))
    embeddings = np.frombuffer(f.read(), dtype=np.float32).reshape(n, d)

with open(names_path, encoding="utf-8") as f:
    display_to_index = json.load(f)

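# Build two lookups from the JSON keys, which split on ":" into (alias, official):
#   alias_to_officials: lowercase alias -> list of official names
#   idx_to_name:        embedding row index -> official name
idx_to_name = {}
alias_to_officials = {}
for disp, idx in display_to_index.items():
    parts = [p.strip() for p in disp.split(":") if p.strip()]
    if len(parts) != 2:
        continue  # skip keys that do not split into exactly two parts
    alias, official = parts
    alias_to_officials.setdefault(alias.lower(), []).append(official)
    # Keep the first official name seen for each embedding row
    idx_to_name.setdefault(int(idx), official)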

# Encode a query and return the top_k distinct official names by cosine similarity
def search(query, top_k=5):
    enc = tokenizer.encode_one(query)
    with torch.no_grad():
        q = model(enc["input_ids"].unsqueeze(0), enc["attention_mask"].unsqueeze(0))
    q = q.squeeze(0).numpy()
    q = q / np.linalg.norm(q)
    sims = embeddings @ q   # cosine similarity (embeddings are already normalized)
    order = np.argsort(sims)[::-1]
    seen = set()
    results = []
    for i in order:
        name = idx_to_name.get(i)
        if name and name not in seen:
            seen.add(name)
            results.append((name, float(sims[i])))
            if len(results) >= top_k:
                break
    return results

# Try it
for q in ["choachukang", "cck", "block 209", "blk 209", "bedok"]:
    print(f"\n{q}:")
    for name, score in search(q):
        print(f"  {name:<30} {score:.4f}")

Files

File	Purpose
`prefix_encoder.pt`	Trained model weights
`prefix_encoder_scripted.pt`	TorchScript export (for Android / PyTorch Mobile)
`char_vocab.json`	Tokenizer vocabulary (fixed character alphabet)
`char_encoder.py`	Model + tokenizer Python definition
`config.json`	Hyperparameters
`embeddings.bin`	Precomputed float32 vectors for all official names
`embeddings_names.json`	Alias → index lookup for the above
`busstops_and_stations_words_prefixes.csv`	Training data (alias → official name pairs)

What you need on Android

App approach	Files needed
Simple lookup (no ML)	`embeddings.bin` + `embeddings_names.json`
Full inference (typo-tolerant)	`prefix_encoder_scripted.pt` + `char_vocab.json` + `busstops_and_stations_words_prefixes.csv`

Inference strategies

Lookup is organised into three tiers, tried in order. Which tiers are available depends on the option you ship:

Strategy	Trigger	What it does
1. Exact alias	`amk`, `cck`, `blk 209`, `ns4`	Returns all mapped officials with score 1.0. No vector math.
2. Restricted embedding	prefix/substring match (`bed`, `choachu`)	Narrows candidates to officials whose aliases contain the query, then ranks by cosine similarity.
3. Full embedding search	typo / unknown (`tauseng` → `Tai Seng`)	Encodes the query through the model and searches all precomputed embeddings.

Strategies 1 and 2 work with Option A (precomputed embeddings, no model). Strategy 3 requires Option B (TorchScript model).
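The dispatch order of the three tiers can be sketched as follows. This is a minimal illustration, not the app's code: `tiered_search`, the `(alias, official, vector)` entry format and the `encode` callable are hypothetical. The card does not say how a query vector is obtained for tier 2 when no model is shipped, so the sketch assumes the vector of the shortest matching alias stands in for the query in that case.

```python
import numpy as np

def _rank(query_vec, entries, top_k):
    """Rank by cosine similarity, keeping the best score per official name."""
    best = {}
    for _alias, official, vec in entries:
        score = float(np.dot(query_vec, vec))  # vectors assumed L2-normalised
        if score > best.get(official, -np.inf):
            best[official] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

def tiered_search(query, entries, encode=None, top_k=5):
    """Return (tier, [(official_name, score), ...]).

    entries: list of (lowercase alias, official name, L2-normalised vector)
    encode:  optional callable str -> vector (the model); None mimics Option A
    """
    q = query.strip().lower()
    if not q:
        return 0, []

    # Tier 1: exact alias hit, no vector math; every mapped official scores 1.0
    exact = []
    for alias, official, _vec in entries:
        if alias == q and official not in exact:
            exact.append(official)
    if exact:
        return 1, [(name, 1.0) for name in exact]

    # Tier 2: keep only entries whose alias contains the query, then rank them
    matching = [e for e in entries if q in e[0]]
    if matching:
        if encode is not None:
            query_vec = encode(q)
        else:
            # ASSUMPTION: without a model, borrow the shortest matching alias's vector
            query_vec = min(matching, key=lambda e: len(e[0]))[2]
        return 2, _rank(query_vec, matching, top_k)

    # Tier 3: full search over every entry; only possible with the model
    if encode is None:
        return 3, []
    return 3, _rank(encode(q), entries, top_k)
```

Calling it with `encode=None` reproduces the Option A behaviour described above: typos that match no alias return an empty tier 3 result instead of a guess.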

Data

Trained on Singapore bus stops from LTA DataMall and MRT/LRT station data. The training CSV is generated by applying deterministic augmentation rules (prefixes, joined words, initialisms, abbreviation expansion/reduction, vowel-drop, and keyboard-adjacent typos) to ~5.6k official names, producing ~260k positive pairs.
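The augmentation rules listed above could look roughly like this. It is a hedged sketch rather than the actual generator, which is not included in this card: `generate_aliases`, the abbreviation table and the tiny keyboard-neighbour map are all illustrative assumptions.

```python
# Illustrative tables only; the real rule set is not published in this card.
ABBREVIATIONS = {"road": "rd", "station": "stn", "interchange": "int", "block": "blk"}
EXPANSIONS = {short: full for full, short in ABBREVIATIONS.items()}
ADJACENT_KEYS = {"a": "s", "e": "wr", "i": "uo", "o": "ip", "u": "yi"}  # QWERTY excerpt
VOWELS = set("aeiou")

def _swap_words(words, table):
    """Replace each word found in `table`, leaving other words unchanged."""
    return " ".join(table.get(w, w) for w in words)

def generate_aliases(official):
    """Return a sorted list of deterministic noisy variants of one official name."""
    name = " ".join(official.lower().split())
    if not name:
        return []
    words = name.split(" ")
    aliases = {name}

    # Prefixes of two or more characters
    for i in range(2, len(name)):
        prefix = name[:i].rstrip()
        if len(prefix) >= 2:
            aliases.add(prefix)

    # Joined words and initialisms, e.g. "choachukang" and "cck"
    if len(words) > 1:
        aliases.add("".join(words))
        aliases.add("".join(w[0] for w in words))

    # Abbreviation reduction ("road" -> "rd") and expansion ("blk" -> "block")
    aliases.add(_swap_words(words, ABBREVIATIONS))
    aliases.add(_swap_words(words, EXPANSIONS))

    # Vowel drop: keep each word's first letter, drop later vowels
    aliases.add(" ".join(w[0] + "".join(c for c in w[1:] if c not in VOWELS) for w in words))

    # Keyboard-adjacent typos: substitute one character with a neighbouring key
    for i, ch in enumerate(name):
        for neighbour in ADJACENT_KEYS.get(ch, ""):
            aliases.add(name[:i] + neighbour + name[i + 1:])

    return sorted(aliases)
```

Each alias paired with its official name gives one positive pair. Because the rules are deterministic, regenerating the CSV from the same official names yields the same pairs.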

Acknowledgments

Data sourced from the Land Transport Authority (LTA) of Singapore.
