dleemiller/SwipeALot-base

@@ -98,28 +98,41 @@ print(f"Predicted word length: {predicted_length}")
 ### 1. Character Prediction
 Predict characters from swipe paths with partial text context.
-**Use Case**: Autocorrection, suggestion ranking
 ### 2. Length Prediction
 Predict word length from swipe path alone.
-**Accuracy**: 89% exact, 96% within ±1
-**Use Case**: Pre-filtering candidate words
 ### 3. Path Reconstruction
 Reconstruct missing path coordinates.
-**MSE**: 0.005 on masked points
-**Use Case**: Noise reduction, gesture smoothing
 ### 4. Embedding Extraction
 Extract fixed-size embeddings for similarity search.
 **Dimension**: 768
-**Use Case**: Similar gesture search, deduplication
 ## Usage Examples
@@ -158,19 +171,19 @@ predicted_length = outputs.length_logits.argmax(dim=-1).item()
 ## Performance Metrics
-Evaluated on 200 test samples:
 | Task | Metric | Score |
 |------|--------|-------|
-| Masked Prediction (30%) | Character Accuracy | 98.7% |
-| | Top-3 Accuracy | 100% |
-| | Word Accuracy | 97.3% |
-| Full Reconstruction (100%) | Character Accuracy | 94% |
-| | Word Accuracy | 83.7% |
-| Length Prediction | Exact Accuracy | 89% |
-| | Within ±1 | 96% |
-| | Within ±2 | 99% |
-| Path Reconstruction | MSE (masked) | 0.005 |
 ## Model Outputs

 ### 1. Character Prediction
 Predict characters from swipe paths with partial text context.
+Trained via masked language modeling with a pairwise masking strategy that creates two augmented views of each input for contrastive learning. Training uses focal loss to concentrate on hard-to-predict characters and frequency-based weighting to handle character imbalance (rare letters like 'z' vs. common letters like 'e').
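The combination of focal loss and frequency-based weighting described above can be sketched for a single character position as follows. This is a minimal illustration, not the model's training code: the inverse-frequency weighting scheme, the `gamma=2.0` focusing parameter, and the function names `inverse_frequency_weights` and `focal_loss` are all assumptions, since the text does not specify them.

```python
import math
from collections import Counter


def inverse_frequency_weights(corpus, alphabet):
    """Hypothetical weighting: weight is proportional to 1 / count, scaled to mean 1."""
    counts = Counter(ch for word in corpus for ch in word)
    raw = {c: 1.0 / max(counts.get(c, 0), 1) for c in alphabet}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}


def focal_loss(probs, target, class_weights, gamma=2.0):
    """Focal loss for one position: -w_t * (1 - p_t)^gamma * log(p_t).

    gamma=2.0 is an assumed value; the (1 - p_t)^gamma factor shrinks the
    loss of confidently correct predictions so hard characters dominate.
    """
    p_t = probs[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Toy corpus in which 'e' is common and 'z' is rare.
weights = inverse_frequency_weights(["eee", "zebra", "tree"], "abertz")

# A confident prediction of a common letter vs. an unsure prediction of a rare one.
easy = focal_loss({"e": 0.95, "z": 0.05}, "e", weights)
hard = focal_loss({"e": 0.60, "z": 0.40}, "z", weights)
```

In this toy setup the rare, uncertain `'z'` contributes far more loss than the common, confident `'e'`, which is the behavior the two mechanisms are meant to produce together.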
+**Pairwise Masking Strategy:**
+- **Inverted Mode (80%)**: Asymmetric augmentation pairs
+  - Query view: Heavy masking (50-70% of path points and characters randomly masked); gradients flow through this view
+  - Key view: Light masking (10-20% of path points and characters randomly masked); gradients are stopped on this view
+  - Teaches robust representations invariant to noise and occlusion
+- **Modality Mode (20%)**: Cross-modal alignment pairs
+  - Query view: Text fully masked, path visible, with gradients (teaches path → semantic representation)
+  - Key view: Path fully masked, text visible, with stop gradient (provides the alignment target)
+  - Teaches correspondence between path geometry and text meaning
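The two masking modes listed above can be sketched as a sampler that picks a mode per batch and returns the mask ratios for each view. This is an illustrative sketch only: `sample_view_pair` and the dictionary keys are hypothetical names, and drawing the path and text mask ratios independently within the stated ranges is an assumption the text does not confirm.

```python
import random

INVERTED_PROB = 0.80  # from the text; modality mode takes the remaining 20%


def sample_view_pair(rng):
    """Return (mode, query_cfg, key_cfg) with mask ratios for path and text."""
    if rng.random() < INVERTED_PROB:
        # Inverted mode: heavily masked query, lightly masked key.
        # Assumption: path and text ratios are drawn independently.
        query = {"path_mask": rng.uniform(0.5, 0.7),
                 "text_mask": rng.uniform(0.5, 0.7),
                 "grad": True}
        key = {"path_mask": rng.uniform(0.1, 0.2),
               "text_mask": rng.uniform(0.1, 0.2),
               "grad": False}  # stop gradient on the key view
        return "inverted", query, key
    # Modality mode: path-only query, text-only key.
    query = {"path_mask": 0.0, "text_mask": 1.0, "grad": True}
    key = {"path_mask": 1.0, "text_mask": 0.0, "grad": False}
    return "modality", query, key


rng = random.Random(0)
pairs = [sample_view_pair(rng) for _ in range(10_000)]
inverted_share = sum(mode == "inverted" for mode, _, _ in pairs) / len(pairs)
```

Over many draws, `inverted_share` settles near 0.8, matching the 80/20 split stated in the list.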
 ### 2. Length Prediction
 Predict word length from swipe path alone.
+Trained as an auxiliary task where the CLS token aggregates path information to predict word length (0-48 characters). This helps the model learn geometric properties of swipe gestures that correlate with word length, such as path extent and complexity.
+Length supervision occurs only during modality mode when text attention is fully zeroed, which is 10% of training batches (20% modality mode × 50% zero-attention probability). This trains the model to predict length from path geometry alone, without any text length cues. The length loss carries 10% of the total loss weight, enough to encourage learning without dominating the primary objectives.
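The 10% figure follows from multiplying the two probabilities, and a quick simulation confirms it. The sketch below assumes the two events are sampled independently per batch; the constant and function names are hypothetical.

```python
import random

MODALITY_PROB = 0.20        # share of batches in modality mode (from the text)
ZERO_ATTENTION_PROB = 0.50  # chance text attention is fully zeroed (from the text)

# Expected share of batches that receive length supervision: 0.20 * 0.50.
expected_fraction = MODALITY_PROB * ZERO_ATTENTION_PROB


def length_loss_active(rng):
    """True when this batch would contribute to the length loss."""
    in_modality_mode = rng.random() < MODALITY_PROB
    text_attention_zeroed = rng.random() < ZERO_ATTENTION_PROB
    return in_modality_mode and text_attention_zeroed


rng = random.Random(0)
trials = 100_000
observed_fraction = sum(length_loss_active(rng) for _ in range(trials)) / trials
```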
 ### 3. Path Reconstruction
 Reconstruct missing path coordinates.
+Trained via masked path prediction as part of the pairwise masking strategy. During inverted mode (80% of batches), path points are randomly masked at 50-70% for the heavily masked view and 10-20% for the lightly masked view. During modality mode (20% of batches), either all path points are masked (key view) or none are (query view). The model learns to reconstruct spatiotemporal structure from partial path information and text context, which teaches it the geometric and temporal patterns of swipe gestures. The path loss uses 50% of the character prediction loss weight, making it a significant secondary objective.
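A masked reconstruction loss of this kind is typically computed only over the points that were hidden from the model. The following is a minimal sketch under that assumption; `masked_mse` is a hypothetical helper, and the two-coordinate points are illustrative (the function works for any per-point dimensionality).

```python
def masked_mse(predicted, target, masked):
    """Mean squared error over the coordinates of masked points only."""
    total, count = 0.0, 0
    for pred_pt, true_pt, is_masked in zip(predicted, target, masked):
        if not is_masked:
            continue  # visible points do not contribute to the loss
        for p, t in zip(pred_pt, true_pt):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


target = [(0.10, 0.20), (0.30, 0.40), (0.50, 0.60)]
predicted = [(0.90, 0.90), (0.32, 0.38), (0.50, 0.60)]
masked = [False, True, True]

# The large error on the first point is ignored because that point was visible.
loss = masked_mse(predicted, target, masked)
```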
 ### 4. Embedding Extraction
 Extract fixed-size embeddings for similarity search.
 **Dimension**: 768
+Trained via contrastive learning where the SEP token produces fixed-size embeddings for path-text pairs. The pairwise masking strategy is central to embedding training:
+- **Inverted mode (80%)**: Pulls embeddings of heavily-masked and lightly-masked versions of the same input close together, teaching invariance to noise and occlusion
+- **Modality mode (20%)**: Pulls embeddings of path-only and text-only views of the same word close together, teaching cross-modal alignment between gesture geometry and semantic meaning
+The contrastive loss (weight 15%, temperature 0.07) pulls matching pairs together in embedding space while pushing non-matches apart. It uses Matryoshka embeddings to create nested representations at multiple dimensions (64, 128, 384, 768), with heavier loss weights on the lower-dimensional prefixes (2.0×, 1.5×, 1.0×, 1.0×, respectively) so that the first 64 dimensions are highly informative on their own.
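One way to picture the Matryoshka weighting is as a weighted sum of contrastive losses computed on nested prefixes of the embedding. The sketch below assumes an InfoNCE-style loss with in-batch negatives and an unnormalized weighted sum; neither detail is confirmed by the text, and the function names are hypothetical. Plain Python cannot express the stop gradient on the key view, so that part is omitted.

```python
import math
import random

DIMS = (64, 128, 384, 768)          # nested prefix sizes (from the text)
DIM_WEIGHTS = (2.0, 1.5, 1.0, 1.0)  # per-prefix loss weights (from the text)
TEMPERATURE = 0.07                  # from the text


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def info_nce(queries, keys, temperature):
    """Mean cross-entropy where queries[i] should match keys[i] among all keys."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    loss = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        loss += log_norm - logits[i]
    return loss / len(q)


def matryoshka_contrastive_loss(queries, keys):
    """Weighted sum of InfoNCE losses over the first `dim` components."""
    total = 0.0
    for dim, weight in zip(DIMS, DIM_WEIGHTS):
        total += weight * info_nce([v[:dim] for v in queries],
                                   [v[:dim] for v in keys],
                                   TEMPERATURE)
    return total


rng = random.Random(0)
queries = [[rng.gauss(0, 1) for _ in range(768)] for _ in range(4)]
keys = [[v + rng.gauss(0, 0.1) for v in q] for q in queries]  # noisy copies

aligned = matryoshka_contrastive_loss(queries, keys)
shuffled = matryoshka_contrastive_loss(queries, keys[1:] + keys[:1])
```

With correctly paired views the loss is small, while rotating the keys so that every pair is mismatched makes it much larger. Because the 64-dimensional prefix carries the largest weight, mistakes that show up in the first 64 dimensions are penalized most.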
 ## Usage Examples
 ## Performance Metrics
+Evaluated on 49,970 test samples:
 | Task | Metric | Score |
 |------|--------|-------|
+| Masked Prediction (30%) | Character Accuracy | 96.1% |
+| | Top-3 Accuracy | 97.6% |
+| | Word Accuracy | 94.3% |
+| Full Reconstruction (100%) | Character Accuracy | 93.1% |
+| | Word Accuracy | 76.7% |
+| Length Prediction | Exact Accuracy | 93.2% |
+| | Within ±1 | 98.9% |
+| | Within ±2 | 99.8% |
+| Path Reconstruction | MSE (masked) | 0.000697 |
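For readers reproducing the table, the tolerance-based length metrics and the character/word accuracy split can be computed as sketched below. This is an assumed reading of the metric definitions (a word counts as correct only if every character matches), not the project's evaluation script; the function names and sample data are hypothetical.

```python
def within_tolerance(predicted, actual, tolerance):
    """Share of predictions whose absolute error is at most `tolerance`."""
    hits = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return hits / len(actual)


def char_and_word_accuracy(predicted_words, target_words):
    """Character accuracy over all positions, and exact-match word accuracy."""
    correct_chars = total_chars = correct_words = 0
    for pred, target in zip(predicted_words, target_words):
        total_chars += len(target)
        correct_chars += sum(p == t for p, t in zip(pred, target))
        correct_words += pred == target
    return correct_chars / total_chars, correct_words / len(target_words)


predicted_lengths = [5, 3, 7, 4, 9]
actual_lengths = [5, 4, 7, 6, 9]
exact = within_tolerance(predicted_lengths, actual_lengths, 0)
within_1 = within_tolerance(predicted_lengths, actual_lengths, 1)
within_2 = within_tolerance(predicted_lengths, actual_lengths, 2)

char_acc, word_acc = char_and_word_accuracy(["hello", "wprld"], ["hello", "world"])
```

Under this definition a single wrong character fails the whole word, which is consistent with word accuracy sitting below character accuracy in both the masked and full-reconstruction rows.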
 ## Model Outputs