File size: 7,644 Bytes
345ea55 9393187 345ea55 fbc41ee 345ea55 fbc41ee 345ea55 9393187 345ea55 8359464 345ea55 7c00d97 345ea55 7c00d97 fbc41ee 7c00d97 345ea55 9393187 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 | ---
license: other
license_name: futo-model-weights-license-1.0
license_link: LICENSE.md
library_name: executorch
tags:
- swipe-typing
- gesture-typing
- shape-writing
- keyboard
- on-device
- executorch
- ctc
- mobile
datasets:
- futo-org/swipe.futo.org
- futo-org/swipe-negatives
---
# FUTO Swipe
Mobile-oriented models for decoding swipe gestures into text.
<p align="center">
<img src="https://huggingface.co/futo-org/futo-swipe/resolve/main/animations/swipe_demo_computer.gif" width="640" alt="Swipe decode of the word 'computer'">
</p>
See the paper [(*coming soon*)](https://huggingface.co/futo-org/futo-swipe)
for more details.
## Models
This repository contains 3 CNN models that compose together.
Only the encoder is required. The decoder and language model are
additional refinements, leveraging specific layout and language information.
The encoder can decode for **any** keyboard layout,
while the decoder is English/QWERTY-only and the language model is English-only.
| Model | Codename | Role | Params | Size (fp32) |
|-------|----------|------|-------:|------------:|
| **Encoder** | `honorable_sturgeon` | Maps a swipe trajectory to per-timestep character emissions. Layout-agnostic — works on any keyboard supplied at runtime. | 635 K | 2.65 MB |
| **Decoder** | `magic_macaw` | (Optional) per-layout refinement over the encoder's frozen features. Lifts top-k where layout-specific training data exists (English here). | 304 K | 1.25 MB |
| **Context LM** | `hungry_jellyfish` | (Optional) next-word and beam-rerank language model that blends sentence context into candidate ranking. | 1.5M | 6.25 MB |
### Encoder — `honorable_sturgeon`
A 1D temporal convolutional network (TCN) reads the raw `(x, y)` touch trajectory
and emits a 64-coefficient spectral pattern and a scalar
"intention" gate for each timestep. Per-key character scores are read off by evaluating a
fixed cosine (DCT) basis at the layout key centers. Switching layouts on
device requires no retraining, just a different key-coordinate tensor.
### Decoder (English/QWERTY) — `magic_macaw`
A small DFSMN decoder over the frozen encoder features. It refines the
character distribution on specific layouts.
Currently we only have data for training an English/QWERTY decoder.
### Context LM (English) — `hungry_jellyfish`
A causal DFSMN language model with a hash embedding for large vocabularies.
It can supply a rerank score and perform next-word prediction. This model
can be used with or without a decoder.
## Getting started
The example below demonstrates the encoder on CPU (x86) with the ExecuTorch
runtime and greedy-decodes a swipe into characters. Note that greedy decoding
is fairly inaccurate and should generally be improved by constraining to a
lexicon (eg. trie, WFST).
```python
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from executorch.runtime import Runtime
# QWERTY letter centers in the normalized [0,1] keyboard frame. This is the
# only layout-specific input (can swap in another layout's key coordinates to
# decode a keyboard the encoder never saw at training time).
QWERTY = {
"a": (0.10, 0.500), "b": (0.60, 0.833), "c": (0.40, 0.833), "d": (0.30, 0.500),
"e": (0.25, 0.167), "f": (0.40, 0.500), "g": (0.50, 0.500), "h": (0.60, 0.500),
"i": (0.75, 0.167), "j": (0.70, 0.500), "k": (0.80, 0.500), "l": (0.90, 0.500),
"m": (0.80, 0.833), "n": (0.70, 0.833), "o": (0.85, 0.167), "p": (0.95, 0.167),
"q": (0.05, 0.167), "r": (0.35, 0.167), "s": (0.20, 0.500), "t": (0.45, 0.167),
"u": (0.65, 0.167), "v": (0.50, 0.833), "w": (0.15, 0.167), "x": (0.30, 0.833),
"y": (0.55, 0.167), "z": (0.20, 0.833),
}
LETTERS = sorted(QWERTY)
MAX_KEYS = 64 # export-time padding bound
# A real swipe for the word "computer": normalized x, y and timestamps (ms).
PX = [0.4141, 0.4478, 0.5, 0.5741, 0.6599, 0.7256, 0.7744, 0.8098, 0.8485, 0.867,
0.8737, 0.8653, 0.8418, 0.8182, 0.8098, 0.7963, 0.7946, 0.8081, 0.8418, 0.8704,
0.9057, 0.9259, 0.9545, 0.9697, 0.968, 0.9529, 0.9141, 0.8468, 0.7811, 0.7273,
0.6869, 0.6616, 0.6582, 0.6431, 0.6061, 0.5572, 0.5067, 0.4663, 0.4495, 0.4461,
0.4411, 0.4192, 0.3872, 0.362, 0.3283, 0.2795, 0.2391, 0.2323, 0.2407, 0.2593,
0.2879, 0.3249, 0.3468, 0.3569]
PY = [0.8991, 0.858, 0.7876, 0.6702, 0.5352, 0.4237, 0.3357, 0.2653, 0.1655, 0.142,
0.142, 0.2183, 0.3709, 0.588, 0.7347, 0.8462, 0.8697, 0.811, 0.6115, 0.4707,
0.3122, 0.2066, 0.1303, 0.1068, 0.1068, 0.1068, 0.1185, 0.1596, 0.1772, 0.1772,
0.1772, 0.189, 0.189, 0.189, 0.1831, 0.189, 0.189, 0.189, 0.189, 0.189,
0.1831, 0.1831, 0.1831, 0.1831, 0.1831, 0.1948, 0.189, 0.1948, 0.189, 0.189,
0.189, 0.1831, 0.1831, 0.1831]
PT = [0.0, 100, 149, 197, 246, 297, 348, 399, 449, 498, 548, 598, 648, 698, 749, 799,
849, 949, 999, 1047, 1100, 1152, 1197, 1248, 1314, 1364, 1414, 1465, 1515, 1565,
1614, 1666, 1715, 1851, 1898, 1951, 1998, 2049, 2097, 2165, 2231, 2279, 2331,
2382, 2431, 2481, 2532, 2584, 2649, 2700, 2751, 2798, 2848, 2899]
def resample(px, py, pt, T=64):
"""Resample a variable-length trajectory to T evenly-spaced points -> [2, T]."""
x, y, t = map(np.asarray, (px, py, pt))
t = t - t[0]
if t[-1] > 1e-3: # uniform 60 Hz resample, then to T points
n60 = max(2, round(t[-1] / (1000.0 / 60.0)) + 1)
tt = np.linspace(0.0, t[-1], n60)
x, y = np.interp(tt, t, x), np.interp(tt, t, y)
idx = np.linspace(0, len(x) - 1, T)
rx = np.interp(idx, np.arange(len(x)), x)
ry = np.interp(idx, np.arange(len(y)), y)
return np.stack([rx, ry], axis=0).astype(np.float32)
def greedy_ctc(log_emissions):
"""Collapse the per-timestep argmax into a string (blank is the last class)."""
blank = log_emissions.shape[-1] - 1
out, prev = [], -1
for c in log_emissions[0].argmax(axis=-1):
if c != prev and c != blank and c < len(LETTERS):
out.append(LETTERS[c])
prev = c
return "".join(out)
# Load the encoder .pte and run one forward pass.
pte = hf_hub_download("futo-org/futo-swipe", "honorable_sturgeon/model_fp32.pte")
encoder = Runtime.get().load_program(pte).load_method("forward")
features = torch.from_numpy(resample(PX, PY, PT)[None]) # [1, 2, 64]
keys = torch.zeros(1, MAX_KEYS, 2) # [1, 64, 2]
mask = torch.zeros(1, MAX_KEYS, dtype=torch.bool) # [1, 64]
for i, ch in enumerate(LETTERS):
keys[0, i] = torch.tensor(QWERTY[ch])
mask[0, i] = True
log_emissions, coefficients, lambda_ = encoder.execute((features, keys, mask))
print("greedy decode:", greedy_ctc(log_emissions.numpy())) # -> "computer"
```
Example output:
```
greedy decode: computer
```
### Encoder inputs and outputs
| | Tensor | Shape | Meaning |
|---|--------|-------|---------|
| **input** | `features` | `[1, 2, 64]` | Swipe trajectory `(x, y)` resampled to 64 points |
| **input** | `layout_keys` | `[1, 64, 2]` | Per-key `(x, y)` centers, padded to 64 keys |
| **input** | `layout_mask` | `[1, 64]` | Boolean mask of valid keys |
| **output** | `log_emissions` | `[1, 32, 65]` | Log-probabilities over 64 keys + blank |
| **output** | `coefficients` | `[1, 32, 64]` | Spectral coefficients (decoder features) |
| **output** | `lambda` | `[1, 32, 1]` | *Intention* gate (decoder features) |
The output time dimension is 32, half the 64 input points. The encoder
applies a 2× temporal downsample (a stride-2 adapter) inside the network, so
the 64 trajectory steps become 32 emission steps.
## License
Released under the [FUTO Model Weights License 1.0](LICENSE.md). |