SG Transit Prefix Encoder

A lightweight (~110k params) character-level Siamese encoder that maps messy keyboard input to official Singapore bus-stop and MRT station names.

Built for on-device autocomplete in a mobile transit app.

What it handles

User types               | Matches
bed                      | Bedok
choachukang (no spaces)  | Choa Chu Kang
cck, amk, ns4            | the right station
block 209                | Blk 209
rd, stn, int             | Road, Station, Interchange
bedik (typo)             | Bedok

Architecture

  • Tokenizer: custom character-level, fixed alphabet of 47 chars (a-z, 0-9, space, '-.&/,). Max length = 40 chars.
  • Encoder: Embedding(48) -> 1-layer BiGRU(96) -> masked mean-pool -> Linear(128). Shared (Siamese) weights for query and candidate.
  • Training: in-batch InfoNCE contrastive loss over 260k (input, official) pairs generated from LTA DataMall bus stops and MRT/LRT station data.
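As a rough illustration of the in-batch InfoNCE objective named above, the sketch below computes the loss for a tiny batch in plain Python. It assumes the embeddings are already L2-normalised, that queries[i] pairs with candidates[i], and that every other candidate in the batch acts as a negative. The function name info_nce and the temperature of 0.07 are illustrative assumptions; this card does not state the temperature used in training.

```python
import math

def info_nce(queries, candidates, temperature=0.07):
    """Mean in-batch InfoNCE loss over a batch of paired vectors.

    queries[i] and candidates[i] form the positive pair; all other
    candidates in the batch serve as negatives. Vectors are assumed to be
    L2-normalised, so the dot product equals the cosine similarity.
    The temperature value is an assumption, not taken from the model card.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, c)) / temperature for c in candidates]
        # log-sum-exp with the maximum subtracted for numerical stability
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching candidate as the target class
        total += log_denominator - logits[i]
    return total / len(queries)

# Correctly paired vectors should score a much lower loss than mismatched ones
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In a PyTorch training loop the same quantity is usually written as a cross-entropy over the query-candidate similarity matrix with the diagonal as targets.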

Usage

Two approaches are available, both suitable for Android / mobile apps.

Option A: embeddings.bin (no ML runtime required)

Requires only embeddings.bin and embeddings_names.json. The vectors are precomputed, so there is no TorchScript and no model inference. Ideal when you want a simple exact/prefix lookup without running a neural net on every keystroke.

from huggingface_hub import hf_hub_download

REPO = "Luke-Yong/sg-transit-prefix-encoder"
bin_path = hf_hub_download(REPO, "embeddings.bin")
names_path = hf_hub_download(REPO, "embeddings_names.json")

embeddings.bin layout: uint32 N, uint32 D (little-endian), then N * D float32.
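To make the layout concrete, here is a minimal sketch that reads a single vector by seeking to its offset instead of loading the whole file, which may suit memory-constrained clients. It assumes the floats are little-endian and stored row by row (one D-length vector per name). The helper names write_embeddings and read_row and the toy file are hypothetical and exist only for this round-trip demonstration.

```python
import os
import struct
import tempfile

HEADER = struct.Struct("<II")  # uint32 N, uint32 D, little-endian

def write_embeddings(path, rows):
    """Write equal-length rows of floats using the layout described above."""
    n, d = len(rows), len(rows[0])
    with open(path, "wb") as f:
        f.write(HEADER.pack(n, d))
        for row in rows:
            f.write(struct.pack(f"<{d}f", *row))

def read_row(path, index):
    """Read one D-dimensional vector without loading the whole file."""
    with open(path, "rb") as f:
        n, d = HEADER.unpack(f.read(HEADER.size))
        # Sanity check: header plus N * D float32 values (4 bytes each)
        expected = HEADER.size + n * d * 4
        if os.path.getsize(path) != expected:
            raise ValueError(f"expected {expected} bytes for N={n}, D={d}")
        if not 0 <= index < n:
            raise IndexError(index)
        f.seek(HEADER.size + index * d * 4)
        return list(struct.unpack(f"<{d}f", f.read(d * 4)))

# Round trip on a toy file; the values are exactly representable in float32
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
write_embeddings(demo_path, [[0.5, -1.0, 2.0], [0.25, 0.0, 1.5]])
print(read_row(demo_path, 1))
```

The size check catches truncated downloads or a byte-order mismatch before any vector is used.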

Option B: TorchScript model (full embedding inference)

Requires prefix_encoder_scripted.pt, char_vocab.json and busstops_and_stations_words_prefixes.csv. Run the encoder on-device via PyTorch Mobile / LibTorch for real-time embedding inference and typo-tolerant search.

from huggingface_hub import hf_hub_download

REPO = "Luke-Yong/sg-transit-prefix-encoder"
model_path = hf_hub_download(REPO, "prefix_encoder_scripted.pt")
vocab_path = hf_hub_download(REPO, "char_vocab.json")
csv_path  = hf_hub_download(REPO, "busstops_and_stations_words_prefixes.csv")

Python example (full pipeline)

import json, struct
import torch
import numpy as np
from char_encoder import CharEncoder, CharTokenizer
from huggingface_hub import hf_hub_download

REPO = "Luke-Yong/sg-transit-prefix-encoder"

# Load model and tokenizer from Hugging Face
model = CharEncoder.from_pretrained(REPO)
tokenizer = CharTokenizer.from_pretrained(REPO)

# Load precomputed embeddings + name lookup
bin_path = hf_hub_download(REPO, "embeddings.bin")
names_path = hf_hub_download(REPO, "embeddings_names.json")

with open(bin_path, "rb") as f:
    n, d = struct.unpack("<II", f.read(8))
    embeddings = np.frombuffer(f.read(), dtype=np.float32).reshape(n, d)

with open(names_path, encoding="utf-8") as f:
    display_to_index = json.load(f)

# Build two lookups: alias -> [official names] and embedding row index -> official name
idx_to_name = {}
alias_to_officials = {}
for disp, idx in display_to_index.items():
    # Keys are expected to look like "alias: official"; skip anything else
    parts = [p.strip() for p in disp.split(":") if p.strip()]
    if len(parts) != 2:
        continue
    alias, official = parts
    alias_to_officials.setdefault(alias.lower(), []).append(official)
    # Keep the first official name seen for each embedding row
    idx_to_name.setdefault(int(idx), official)

# Encode a query and rank official names by cosine similarity
def search(query, top_k=5):
    enc = tokenizer.encode_one(query)
    with torch.no_grad():
        q = model(enc["input_ids"].unsqueeze(0), enc["attention_mask"].unsqueeze(0))
    q = q.squeeze(0).numpy()
    q = q / np.linalg.norm(q)
    sims = embeddings @ q   # cosine similarity (embeddings are already normalized)
    order = np.argsort(sims)[::-1]
    seen = set()
    results = []
    for i in order:
        name = idx_to_name.get(i)
        if name and name not in seen:
            seen.add(name)
            results.append((name, float(sims[i])))
            if len(results) >= top_k:
                break
    return results

# Try it
for q in ["choachukang", "cck", "block 209", "blk 209", "bedok"]:
    print(f"\n{q}:")
    for name, score in search(q):
        print(f"  {name:<30} {score:.4f}")

Files

File                                      | Purpose
prefix_encoder.pt                         | Trained model weights
prefix_encoder_scripted.pt                | TorchScript export (for Android / PyTorch Mobile)
char_vocab.json                           | Tokenizer vocabulary (45-char alphabet)
char_encoder.py                           | Model + tokenizer Python definition
config.json                               | Hyperparameters
embeddings.bin                            | Precomputed float32 vectors for all officials
embeddings_names.json                     | Alias → index lookup for the above
busstops_and_stations_words_prefixes.csv  | Training data (alias → official name pairs)

What you need on Android

App approach                    | Files needed
Simple lookup (no ML)           | embeddings.bin + embeddings_names.json
Full inference (typo-tolerant)  | prefix_encoder_scripted.pt + char_vocab.json + busstops_and_stations_words_prefixes.csv

Inference strategies

Lookup is organised into three tiers:

Strategy                 | Trigger                                 | What it does
1. Exact alias           | amk, cck, blk 209, ns4                  | Returns all mapped officials with score 1.0. No vector math.
2. Restricted embedding  | prefix/substring match (bed, choachu)   | Narrows candidates to officials whose aliases contain the query, then ranks by cosine similarity.
3. Full embedding search | typo / unknown (tauseng → Tai Seng)     | Encodes the query through the model and searches all precomputed embeddings.

Strategies 1 and 2 work with Option A (embeddings.bin only). Strategy 3 requires Option B (TorchScript model).
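The tiers can be sketched as a single fall-through function. This is a toy illustration under stated assumptions, not the app's implementation: tiered_search, rank and toy_encode are hypothetical names, the vectors are three-dimensional stand-ins, and the query vector for tier 2 comes from the same encode callable as tier 3 because this card does not spell out how it is obtained without the model.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank(query_vec, names, vectors, top_k):
    """Score the given official names against the query and keep the best top_k."""
    scored = [(name, dot(query_vec, vectors[name])) for name in names]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def tiered_search(query, alias_to_officials, vectors, encode, top_k=5):
    """alias_to_officials: alias -> list of official names
    vectors: official name -> unit-length embedding
    encode: callable mapping text to a unit-length vector (hypothetical)
    """
    key = query.strip().lower()
    # Tier 1: exact alias hit, no vector math
    if key in alias_to_officials:
        return [(name, 1.0) for name in alias_to_officials[key]]
    query_vec = encode(key)
    # Tier 2: restrict to officials that have an alias containing the query
    restricted = []
    for alias, officials in alias_to_officials.items():
        if key in alias:
            restricted.extend(n for n in officials if n not in restricted)
    if restricted:
        return rank(query_vec, restricted, vectors, top_k)
    # Tier 3: rank every official
    return rank(query_vec, list(vectors), vectors, top_k)

# Toy data: one axis per official name
aliases = {"cck": ["Choa Chu Kang"], "choa chu kang": ["Choa Chu Kang"],
           "bedok": ["Bedok"], "tai seng": ["Tai Seng"]}
vectors = {"Choa Chu Kang": [1.0, 0.0, 0.0], "Bedok": [0.0, 1.0, 0.0],
           "Tai Seng": [0.0, 0.0, 1.0]}

def toy_encode(text):
    # Stand-in for the real encoder: picks an axis from the first letter
    return {"c": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0]}.get(text[:1], [0.0, 0.0, 1.0])

for q in ["cck", "bed", "tauseng"]:
    print(q, tiered_search(q, aliases, vectors, toy_encode, top_k=1))
```

Checking the cheap tiers first keeps common queries such as known abbreviations off the encoder entirely.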

Data

The model is trained on Singapore bus-stop data from LTA DataMall and on MRT/LRT station data. The training CSV is generated by applying deterministic augmentation rules (prefixes, joined words, initialisms, abbreviation expansion/reduction, vowel-drop, and keyboard-adjacent typos) to ~5.6k official names, producing ~260k positive pairs.
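The actual generator is not shown here, so the following is only a guess at what such deterministic rules might look like. The function augment, the abbreviation table, the partial keyboard-neighbour map and the minimum prefix length of 3 are all illustrative assumptions.

```python
VOWELS = set("aeiou")
# Assumed abbreviation table, based on the examples listed earlier
REDUCTIONS = {"road": "rd", "station": "stn", "interchange": "int", "block": "blk"}
EXPANSIONS = {short: full for full, short in REDUCTIONS.items()}
# Deliberately tiny, partial QWERTY neighbour map (assumption)
ADJACENT_KEYS = {"o": "ip", "e": "wr", "a": "qs"}

def swap_words(text, table):
    """Replace whole words that appear in the table."""
    return " ".join(table.get(word, word) for word in text.split())

def augment(official, min_prefix=3):
    """Return sorted (alias, official) pairs produced by the toy rules."""
    text = official.lower()
    words = text.split()
    aliases = set()
    # Prefixes of the full name
    aliases.update(text[:i].rstrip() for i in range(min_prefix, len(text)))
    # Joined words and initialism (multi-word names only)
    if len(words) > 1:
        aliases.add("".join(words))
        aliases.add("".join(w[0] for w in words))
    # Abbreviation reduction and expansion
    aliases.add(swap_words(text, REDUCTIONS))
    aliases.add(swap_words(text, EXPANSIONS))
    # Vowel drop: keep each word's first letter, drop later vowels
    aliases.add(" ".join(w[0] + "".join(c for c in w[1:] if c not in VOWELS)
                         for w in words))
    # Keyboard-adjacent typos: substitute one character at a time
    for i, ch in enumerate(text):
        for neighbour in ADJACENT_KEYS.get(ch, ""):
            aliases.add(text[:i] + neighbour + text[i + 1:])
    # Leave out the unchanged name; this sketch lists altered forms only
    aliases.discard(text)
    return sorted((alias, official) for alias in aliases)

for name in ["Bedok", "Choa Chu Kang", "Blk 209"]:
    print(name, [alias for alias, _ in augment(name)])
```

Because every rule is deterministic, rerunning the generator on the same official names reproduces the same pairs.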

Acknowledgments

Data sourced from the Land Transport Authority (LTA) of Singapore.
