SG Transit Prefix Encoder
A lightweight (~110k params) character-level Siamese encoder that maps messy keyboard input to official Singapore bus-stop and MRT station names.
Built for on-device autocomplete in a mobile transit app.
What it handles
| User types | Matches |
|---|---|
| bed | Bedok |
| choachukang (no spaces) | Choa Chu Kang |
| cck, amk, ns4 | the right station |
| block 209 | Blk 209 |
| rd, stn, int | Road, Station, Interchange |
| bedik (typo) | Bedok |
Architecture
- Tokenizer: custom character-level, fixed alphabet of 47 chars (a-z, 0-9, space, '-.&/,). Max length = 40 chars.
- Encoder: Embedding(48) -> 1-layer BiGRU(96) -> masked mean-pool -> Linear(128). Shared (Siamese) weights for query and candidate.
- Training: in-batch InfoNCE contrastive loss over 260k (input, official) pairs generated from LTA DataMall bus stops and MRT/LRT station data.
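The in-batch InfoNCE objective named in the Training bullet can be illustrated with a small, dependency-free sketch. It assumes L2-normalised vectors, treats row i of the candidates as the positive for query i and every other row in the batch as a negative, and uses a hypothetical temperature of 0.07; the actual training code and temperature are not given in this README.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(queries, candidates, temperature=0.07):
    """Mean in-batch InfoNCE loss; queries[i] and candidates[i] form a positive pair."""
    q = [l2_normalize(v) for v in queries]
    c = [l2_normalize(v) for v in candidates]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every candidate in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, cj)) / temperature for cj in c]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching candidate as the target class.
        total += log_denom - logits[i]
    return total / len(q)

# Toy batch: correctly aligned pairs give a near-zero loss, swapped pairs a large one.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned={aligned:.4f} swapped={swapped:.4f}")
```

Because the query and candidate towers share weights, a single encoder would produce both lists of vectors in a real training step.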
Usage
Two approaches are available, both suitable for Android / mobile apps.
Option A: embeddings.bin (no ML runtime required)
Requires only embeddings.bin and embeddings_names.json. The vectors are precomputed: no
TorchScript, no model inference. Ideal when you want a simple exact/prefix
lookup without running a neural net on every keystroke.

    from huggingface_hub import hf_hub_download

    REPO = "Luke-Yong/sg-transit-prefix-encoder"
    bin_path = hf_hub_download(REPO, "embeddings.bin")
    names_path = hf_hub_download(REPO, "embeddings_names.json")

embeddings.bin layout: uint32 N, uint32 D (little-endian), then N * D float32.
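To make the layout concrete, the following sketch writes and then re-reads a tiny file in this format using only the standard library. The helper names `write_embeddings` and `read_embeddings` and the file name `demo_embeddings.bin` are illustrative and not part of the repository; row-major ordering is assumed, matching the `reshape(n, d)` used in the Python example.

```python
import os
import struct
import tempfile

def write_embeddings(path, rows):
    """Write equal-length float rows as: uint32 N, uint32 D, then N*D float32 (little-endian)."""
    n, d = len(rows), len(rows[0])
    with open(path, "wb") as f:
        f.write(struct.pack("<II", n, d))
        for row in rows:
            f.write(struct.pack(f"<{d}f", *row))

def read_embeddings(path):
    """Read the file back into N rows of D floats, checking the payload size first."""
    with open(path, "rb") as f:
        n, d = struct.unpack("<II", f.read(8))
        payload = f.read()
    if len(payload) != n * d * 4:
        raise ValueError(f"expected {n * d * 4} payload bytes, got {len(payload)}")
    flat = struct.unpack(f"<{n * d}f", payload)
    return [list(flat[i * d:(i + 1) * d]) for i in range(n)]

# Round-trip two 3-dimensional vectors through a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_embeddings.bin")
write_embeddings(demo_path, [[1.0, 0.0, 0.5], [0.25, -1.0, 2.0]])
rows = read_embeddings(demo_path)
print(len(rows), len(rows[0]), rows[1])
```

The size check is cheap and catches truncated downloads before any vector math runs; the same header parsing carries over directly to Kotlin or Java with a little-endian `ByteBuffer`.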
Option B: TorchScript model (full embedding inference)
Requires prefix_encoder_scripted.pt, char_vocab.json, and busstops_and_stations_words_prefixes.csv.
Run the encoder on-device via PyTorch Mobile / LibTorch for real-time embedding
inference and typo-tolerant search.

    from huggingface_hub import hf_hub_download

    REPO = "Luke-Yong/sg-transit-prefix-encoder"
    model_path = hf_hub_download(REPO, "prefix_encoder_scripted.pt")
    vocab_path = hf_hub_download(REPO, "char_vocab.json")
    csv_path = hf_hub_download(REPO, "busstops_and_stations_words_prefixes.csv")

Python example (full pipeline)

    import json
    import struct

    import numpy as np
    import torch
    from huggingface_hub import hf_hub_download

    from char_encoder import CharEncoder, CharTokenizer  # char_encoder.py from this repo

    REPO = "Luke-Yong/sg-transit-prefix-encoder"

    # Load model and tokenizer from Hugging Face
    model = CharEncoder.from_pretrained(REPO)
    tokenizer = CharTokenizer.from_pretrained(REPO)

    # Load precomputed embeddings + name lookup
    bin_path = hf_hub_download(REPO, "embeddings.bin")
    names_path = hf_hub_download(REPO, "embeddings_names.json")

    with open(bin_path, "rb") as f:
        n, d = struct.unpack("<II", f.read(8))
        embeddings = np.frombuffer(f.read(), dtype=np.float32).reshape(n, d)

    with open(names_path, encoding="utf-8") as f:
        display_to_index = json.load(f)

    # Build two lookups: alias -> official names, and embedding index -> official name
    idx_to_name = {}
    alias_to_officials = {}
    for disp, idx in display_to_index.items():
        parts = [p.strip() for p in disp.split(":") if p.strip()]
        if len(parts) == 2:
            alias, official = parts
            alias_to_officials.setdefault(alias.lower(), []).append(official)
        idx = int(idx)
        if idx not in idx_to_name and len(parts) == 2:
            idx_to_name[idx] = parts[1]

    # Encode a query and rank all precomputed embeddings against it
    def search(query, top_k=5):
        enc = tokenizer.encode_one(query)
        with torch.no_grad():
            q = model(enc["input_ids"].unsqueeze(0), enc["attention_mask"].unsqueeze(0))
        q = q.squeeze(0).numpy()
        q = q / np.linalg.norm(q)
        sims = embeddings @ q  # cosine similarity (embeddings are already normalized)
        order = np.argsort(sims)[::-1]
        seen = set()
        results = []
        for i in order:
            name = idx_to_name.get(int(i))
            if name and name not in seen:
                # Keep only the best-scoring row per official name
                seen.add(name)
                results.append((name, float(sims[i])))
            if len(results) >= top_k:
                break
        return results

    # Try it
    for q in ["choachukang", "cck", "block 209", "blk 209", "bedok"]:
        print(f"\n{q}:")
        for name, score in search(q):
            print(f"  {name:<30} {score:.4f}")

Files
| File | Purpose |
|---|---|
| prefix_encoder.pt | Trained model weights |
| prefix_encoder_scripted.pt | TorchScript export (for Android / PyTorch Mobile) |
| char_vocab.json | Tokenizer vocabulary (45-char alphabet) |
| char_encoder.py | Model + tokenizer Python definition |
| config.json | Hyperparameters |
| embeddings.bin | Precomputed float32 vectors for all officials |
| embeddings_names.json | Alias → index lookup for the above |
| busstops_and_stations_words_prefixes.csv | Training data (alias → official name pairs) |
What you need on Android
| App approach | Files needed |
|---|---|
| Simple lookup (no ML) | embeddings.bin + embeddings_names.json |
| Full inference (typo-tolerant) | prefix_encoder_scripted.pt + char_vocab.json + busstops_and_stations_words_prefixes.csv |
Inference strategies
Lookup proceeds through up to three tiers, depending on which option is deployed:
| Strategy | Trigger | What it does |
|---|---|---|
| 1. Exact alias | amk, cck, blk 209, ns4 | Returns all mapped officials with score 1.0. No vector math. |
| 2. Restricted embedding | prefix/substring match (bed, choachu) | Narrows candidates to officials whose aliases contain the query, then ranks by cosine similarity. |
| 3. Full embedding search | typo / unknown (tauseng → Tai Seng) | Encodes the query through the model and searches all precomputed embeddings. |
Strategies 1 and 2 work with Option A (embeddings.bin only). Strategy 3 requires Option B (TorchScript model).
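The tiers can be sketched as a single dispatch function. This is a simplified illustration on toy 2-D vectors, not the app's implementation: `lookup`, `alias_vecs`, `official_vecs`, and the stub `fake_encode` are hypothetical names, and passing `encode=None` mimics Option A. How tier 2 obtains a query vector without a model is not specified here, so the sketch assumes it borrows the vector of the shortest matching alias.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(q_vec, names, official_vecs, top_k):
    """Score the given officials against q_vec and return the best top_k."""
    scored = [(name, cosine(q_vec, official_vecs[name])) for name in names]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

def lookup(query, alias_to_officials, alias_vecs, official_vecs, encode=None, top_k=5):
    q = query.strip().lower()
    # Tier 1: exact alias hit, no vector math.
    if q in alias_to_officials:
        return [(name, 1.0) for name in alias_to_officials[q]][:top_k]
    # Tier 2: restrict to officials with an alias that contains the query.
    matches = [alias for alias in alias_to_officials if q in alias]
    if matches:
        # Assumption: without an encoder, reuse the shortest matching alias's vector.
        q_vec = encode(q) if encode else alias_vecs[min(matches, key=len)]
        candidates = {o for alias in matches for o in alias_to_officials[alias]}
        return rank(q_vec, candidates, official_vecs, top_k)
    # Tier 3: full search over every official; only possible with an encoder.
    if encode is None:
        return []
    return rank(encode(q), official_vecs.keys(), official_vecs, top_k)

# Toy data standing in for embeddings.bin / embeddings_names.json.
official_vecs = {"Ang Mo Kio": [1.0, 0.0], "Bedok": [0.0, 1.0],
                 "Bedok North": [0.6, 0.8], "Tai Seng": [-1.0, 0.0]}
alias_to_officials = {"amk": ["Ang Mo Kio"], "ang mo kio": ["Ang Mo Kio"],
                      "bedok": ["Bedok"], "bedok north": ["Bedok North"],
                      "tai seng": ["Tai Seng"]}
alias_vecs = {alias: official_vecs[names[0]] for alias, names in alias_to_officials.items()}

def fake_encode(text):
    """Hypothetical stand-in for the TorchScript encoder; returns a fixed vector."""
    return [-0.9, 0.1]

tier1 = lookup("AMK", alias_to_officials, alias_vecs, official_vecs)
tier2 = lookup("bed", alias_to_officials, alias_vecs, official_vecs)
tier3_option_a = lookup("tauseng", alias_to_officials, alias_vecs, official_vecs)
tier3_option_b = lookup("tauseng", alias_to_officials, alias_vecs, official_vecs, encode=fake_encode)
print(tier1, tier2, tier3_option_a, tier3_option_b[0], sep="\n")
```

Ordering the tiers from cheapest to most expensive means the encoder only runs for queries that neither an exact nor a substring match can resolve.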
Data
Trained on Singapore bus stops from LTA DataMall and MRT/LRT station data. The training CSV is generated by applying deterministic augmentation rules (prefixes, joined words, initialisms, abbreviation expansion/reduction, vowel-drop, and keyboard-adjacent typos) to ~5.6k official names, producing ~260k positive pairs.
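A few of these augmentation rules are simple enough to sketch directly. The function here is an illustrative approximation, not the generator used to build the training CSV: the minimum prefix length, the abbreviation table, and the exact vowel-drop rule are assumptions, and keyboard-adjacent typos are left out.

```python
# Hypothetical abbreviation table; the real rule set is not listed in this README.
ABBREVIATIONS = {"road": "rd", "station": "stn", "interchange": "int", "block": "blk"}

def augment(official, min_prefix=3):
    """Return a sorted list of lowercase alias variants for one official name."""
    name = official.lower()
    words = name.split()
    variants = set()
    # Prefixes of the full name, from min_prefix characters up to the whole name.
    variants.update(name[:i].rstrip() for i in range(min_prefix, len(name) + 1))
    if len(words) > 1:
        # Joined words ("choa chu kang" -> "choachukang").
        variants.add("".join(words))
        # Initialism ("choa chu kang" -> "cck").
        variants.add("".join(w[0] for w in words))
    # Abbreviation reduction ("bedok station" -> "bedok stn").
    variants.add(" ".join(ABBREVIATIONS.get(w, w) for w in words))
    # Vowel drop: keep each word's first character, remove later vowels.
    variants.add(" ".join(w[0] + "".join(ch for ch in w[1:] if ch not in "aeiou")
                          for w in words))
    return sorted(variants)

# Each variant becomes one (input, official) positive pair.
official = "Choa Chu Kang"
pairs = [(alias, official) for alias in augment(official)]
station_variants = augment("Bedok Station")
print(len(pairs), pairs[:3])
```

Because the rules are deterministic, regenerating the pairs from the same list of official names always yields the same training set.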
Acknowledgments
Data sourced from the Land Transport Authority (LTA) of Singapore.