---
license: apache-2.0
library_name: transformers
tags:
- multimodal
- swipe-keyboard
- gesture-recognition
- text-prediction
- character-prediction
- embeddings
- feature-extraction
language:
- en
datasets:
- futo-org/swipe.futo.org
metrics:
- accuracy
---
# SwipeALot Base Model

Multimodal, multi-objective transformer for swipe keyboard prediction, trained on the `futo-org/swipe.futo.org` dataset.
This model is trained with the following objectives:
- Masked character prediction (MLM)
- Masked path prediction
- Text length prediction (CLS token)
- Path/text embedding (SEP token, contrastive + Matryoshka at 64, 128, 384, and 768 dimensions)

Unless you are using the embedding mode, this model should be further fine-tuned for a specific task. For example, length prediction improves significantly in a single-task setting.
## Quick Start

```python
from datasets import load_dataset
from transformers import AutoModel, AutoProcessor
import torch

# Load model and processor
model = AutoModel.from_pretrained("dleemiller/SwipeALot-base", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("dleemiller/SwipeALot-base", trust_remote_code=True)
model.eval()

# Load a sample
dataset = load_dataset("futo-org/swipe.futo.org", split="test[:10]")
item = dataset[4]

# Preprocess the swipe path using processor methods
# 1. Normalize timestamps (x, y are already normalized in the futo dataset)
normalized = processor.normalize_coordinates(item["data"], item["canvas_width"], item["canvas_height"])

# 2. Resample to fixed length (max_path_len=128)
#    - Pads with zeros if path < 128 points
#    - Interpolates if path > 128 points
path_coords, _ = processor.sample_path_points(normalized, processor.max_path_len)
path = torch.tensor([path_coords], dtype=torch.float32)

# Get predictions
inputs = processor(path_coords=path, text=None, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Length prediction
predicted_length = outputs.length_logits.argmax(dim=-1).item()
print(f"Predicted word length: {predicted_length}")
```
## Model Details
- Architecture: Transformer encoder (768-dim, 12 layers, 12 heads)
- Parameters: 87M
- Training Data: futo-org/swipe.futo.org dataset
- Max Path Length: 128 points (paths are interpolated down or padded up to this length)
- Max Word Length: 48 characters (words are truncated or padded to this length)
- Vocab Size: 43 (a-z, 0-9, special tokens)
**Input Constraints:**
- Path coordinates must be normalized to [0, 1] range for x, y
- Timestamps must be normalized to [0, 1] range
- Paths longer than 128 points are downsampled using linear interpolation
- Text longer than 48 characters is truncated with EOS preserved
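As an illustration of the resampling behavior, here is a minimal NumPy sketch. The processor's `sample_path_points` handles this internally and may differ in detail; the function below only mirrors the constraints described above.

```python
import numpy as np

def resample_path(points: np.ndarray, max_len: int = 128) -> np.ndarray:
    """Pad or downsample an (N, 3) array of normalized (x, y, t) rows
    to exactly max_len points."""
    n = len(points)
    if n >= max_len:
        # Downsample via linear interpolation at evenly spaced positions
        idx = np.linspace(0, n - 1, max_len)
        lo = np.floor(idx).astype(int)
        hi = np.ceil(idx).astype(int)
        frac = (idx - lo)[:, None]
        return points[lo] * (1 - frac) + points[hi] * frac
    # Pad short paths with zeros up to max_len
    return np.vstack([points, np.zeros((max_len - n, points.shape[1]))])
```

Note that downsampling preserves the first and last points of the gesture, while padding leaves the trailing rows as zeros.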
## Capabilities

### 1. Character Prediction
Predict characters from swipe paths with partial text context.
Trained via masked language modeling with a sophisticated pairwise masking strategy that creates two augmented views of each input for contrastive learning. Training uses focal loss to focus on hard-to-predict characters and frequency-based weighting to handle character imbalance (rare letters like 'z' vs common letters like 'e').
**Pairwise Masking Strategy:**

**Inverted Mode (80%)**: Asymmetric augmentation pairs
- Query view: Heavy masking (50-70% of path points and characters randomly masked) with gradients
- Key view: Light masking (10-20% of path points and characters randomly masked) with stop gradient
- Teaches robust representations invariant to noise and occlusion
**Modality Mode (20%)**: Cross-modal alignment pairs
- Query view: Text fully masked, path visible (teaches path → semantic representation) with gradients
- Key view: Path fully masked, text visible (provides alignment target) with stop gradient
- Teaches correspondence between path geometry and text meaning
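The schedule above can be sketched as a small sampling helper. This is illustrative only; the function name and structure are hypothetical and not part of the released code.

```python
import random

def sample_masking_plan(rng: random.Random):
    """Return (mode, query_mask_rate, key_mask_rate) following the
    pairwise masking schedule described above. Illustrative sketch."""
    if rng.random() < 0.8:
        # Inverted mode: heavy masking on the query view, light on the key view
        return ("inverted", rng.uniform(0.5, 0.7), rng.uniform(0.1, 0.2))
    # Modality mode: query masks all text (path visible); key masks all path
    return ("modality", 1.0, 1.0)
```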
### 2. Length Prediction
Predict word length from swipe path alone.
Trained as an auxiliary task where the CLS token aggregates path information to predict word length (0-48 characters). This helps the model learn geometric properties of swipe gestures that correlate with word length, such as path extent and complexity.
Length supervision occurs only during modality mode when text attention is fully zeroed (10% of training batches: 20% modality mode × 50% zero-attention probability). This trains the model to predict length from path geometry alone without any text length cues. Uses 10% of the total loss weight to encourage learning without dominating the primary objectives.
### 3. Path Reconstruction
Reconstruct missing path coordinates.
Trained via masked path prediction as part of the pairwise masking strategy. During inverted mode (80% of batches), path points are randomly masked at 50-70% for heavy augmentation and 10-20% for light augmentation. During modality mode (20% of batches), either all path points are masked (key view) or none are masked (query view). The model learns to reconstruct spatial-temporal structure from partial path information and text context, teaching it the geometric and temporal patterns of swipe gestures. Uses 50% of the character prediction loss weight, making it a significant secondary objective.
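The masked-path objective amounts to a reconstruction error computed only over masked coordinates. A minimal sketch, assuming an MSE formulation (the actual loss weighting lives in the training code, which is not part of this release):

```python
import torch

def masked_path_mse(pred: torch.Tensor, target: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
    """MSE over masked path positions only.
    pred/target: [batch, seq_len, 3] (x, y, t);
    mask: bool [batch, seq_len], True where a point was masked."""
    sq_err = (pred - target) ** 2   # elementwise squared error
    return sq_err[mask].mean()      # average over masked points only
```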
### 4. Embedding Extraction

Extract fixed-size embeddings for similarity search.

**Dimension:** 768
Trained via contrastive learning where the SEP token produces fixed-size embeddings for path-text pairs. The pairwise masking strategy is central to embedding training:
- Inverted mode (80%): Pulls embeddings of heavily-masked and lightly-masked versions of the same input close together, teaching invariance to noise and occlusion
- Modality mode (20%): Pulls embeddings of path-only and text-only views of the same word close together, teaching cross-modal alignment between gesture geometry and semantic meaning
The contrastive loss (15% weight, temperature 0.07) pulls matching pairs together in embedding space while pushing non-matches apart. Uses Matryoshka embeddings to create nested representations at multiple dimensions (64, 128, 384, 768), with stronger weight on lower-dimensional representations (2.0×, 1.5×, 1.0×, 1.0×) to ensure the first 64 dimensions are highly informative on their own.
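The nested weighting described above can be sketched as an InfoNCE loss summed over Matryoshka prefixes. This is a sketch under stated assumptions; the function and argument names are illustrative, not the released training code.

```python
import torch
import torch.nn.functional as F

def matryoshka_info_nce(query: torch.Tensor, key: torch.Tensor,
                        dims=(64, 128, 384, 768),
                        weights=(2.0, 1.5, 1.0, 1.0),
                        temperature=0.07) -> torch.Tensor:
    """Contrastive (InfoNCE) loss over nested embedding prefixes,
    with extra weight on the smaller dimensions.
    query/key: [batch, 768] SEP-token embeddings of paired views."""
    total = query.new_zeros(())
    for d, w in zip(dims, weights):
        q = F.normalize(query[:, :d], dim=-1)
        k = F.normalize(key[:, :d], dim=-1)
        logits = q @ k.T / temperature        # pairwise cosine similarities
        labels = torch.arange(q.size(0))      # matching pairs on the diagonal
        total = total + w * F.cross_entropy(logits, labels)
    return total / sum(weights)
```

Truncating both views to the same prefix before normalizing is what makes the first 64 dimensions usable as a standalone embedding.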
## Usage Examples

### Masked Character Prediction

```python
# Process the full word, then manually mask positions
inputs = processor(path_coords=path, text="hello", return_tensors="pt")
mask_token_id = processor.tokenizer.mask_token_id
char_ids = inputs["input_ids"][0].tolist()
char_ids[2] = mask_token_id  # Mask 'l' at position 2
inputs["input_ids"] = torch.tensor([char_ids], dtype=torch.long)

# The model predicts the masked character from path + context
```
### Full Word Reconstruction

```python
# Process the word, then mask all character positions
inputs = processor(path_coords=path, text="hello", return_tensors="pt")
char_ids = inputs["input_ids"][0].tolist()
mask_token_id = processor.tokenizer.mask_token_id
masked_ids = [mask_token_id if cid != 0 else 0 for cid in char_ids]
inputs["input_ids"] = torch.tensor([masked_ids], dtype=torch.long)

# Predict from path only
```
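To turn `char_logits` into a word, a greedy per-position decode works. The sketch below uses a toy id-to-character map; in practice, decode with the released tokenizer's actual vocabulary.

```python
import torch

def greedy_decode(char_logits: torch.Tensor, id_to_char: dict) -> str:
    """Greedy argmax decode of [seq_len, vocab_size] character logits.
    Ids missing from the map (padding/special tokens) decode to ''."""
    ids = char_logits.argmax(dim=-1).tolist()
    return "".join(id_to_char.get(i, "") for i in ids)

# Hypothetical 3-token vocabulary for illustration only
toy_vocab = {1: "h", 2: "i"}  # id 0 = padding, decodes to ''
```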
### Length Prediction

```python
inputs = processor(path_coords=path, text=None, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_length = outputs.length_logits.argmax(dim=-1).item()
```
## Performance Metrics

Evaluated on 49,970 test samples:

| Task | Metric | Score |
|---|---|---|
| Masked Prediction (30%) | Character Accuracy | 96.1% |
| | Top-3 Accuracy | 97.6% |
| | Word Accuracy | 94.3% |
| Full Reconstruction (100%) | Character Accuracy | 93.1% |
| | Word Accuracy | 76.7% |
| Length Prediction | Exact Accuracy | 93.2% |
| | Within ±1 | 98.9% |
| | Within ±2 | 99.8% |
| Path Reconstruction | MSE (masked) | 0.000697 |
## Model Outputs

```python
outputs = model(**inputs)

# Available outputs:
outputs.char_logits        # [batch, seq_len, vocab_size] - character predictions
outputs.length_logits      # [batch, max_length] - length predictions
outputs.path_logits        # [batch, seq_len, 3] - path coordinate predictions
outputs.pooler_output      # [batch, d_model] - SEP token embedding for similarity
outputs.last_hidden_state  # [batch, seq_len, d_model] - hidden representations
```
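For similarity search with `pooler_output`, cosine similarity over a Matryoshka prefix is a reasonable sketch (the prefix sizes follow the nested dimensions described above; the helper name is illustrative):

```python
import torch
import torch.nn.functional as F

def embedding_similarity(a: torch.Tensor, b: torch.Tensor,
                         dim: int = 768) -> torch.Tensor:
    """Cosine similarity between batches of pooler_output embeddings,
    optionally truncated to a Matryoshka prefix (64, 128, 384, or 768)."""
    a = F.normalize(a[:, :dim], dim=-1)
    b = F.normalize(b[:, :dim], dim=-1)
    return (a * b).sum(dim=-1)
```

Truncating to `dim=64` trades a little accuracy for a 12x smaller index, which the weighted Matryoshka training is designed to make viable.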
## Citation

```bibtex
@software{swipealot2025,
  title={SwipeALot: Multimodal Swipe Keyboard Transformer},
  author={Lee Miller},
  year={2025},
  url={https://huggingface.co/dleemiller/SwipeALot-base}
}
```

## License

Apache 2.0