# From "Play My Workout Playlist" to a Real Android Tap Plan

**How a 3B-parameter model turns messy phone requests into replayable UI automation — without shipping your life to a cloud API.**

*Built for the [Build Small Hackathon](https://huggingface.co/build-small-hackathon) — Backyard AI track, sponsored by Modal.*

**Published on Hugging Face:** [From "Play My Workout Playlist" to a Real Android Tap Plan](https://huggingface.co/blog/build-small-hackathon/android-skill-router)

---

## Table of contents

1. [The problem](#the-problem-with-phone-automation-today)
2. [The architecture](#the-architecture-classify--route--replay)
3. [Recording trajectories](#step-1-record-real-ui-flows-on-android)
4. [Training the classifier](#step-2-train-a-tiny-classifier-not-a-general-agent)
5. [Synthetic data at scale](#step-3-synthetic-data-at-scale)
6. [Deployment and demo](#step-4-deploy-inference-on-modal-demo-on-gradio)
7. [Evaluation and benchmarks](#evaluation-how-we-measure-generalization)
8. [Why this approach works](#why-this-approach-works)
9. [Parameterized replay](#parameterized-replay-classify--bind--replay)
10. [Try it yourself](#try-it-yourself)

---

## The problem with phone automation today

You say: *"text mom on whatsapp i'm on my way."*

A voice assistant might reply with a web search, a generic "I can't do that," or a cloud API call that only works if WhatsApp cooperates. What you actually want is simpler and more direct: open WhatsApp, find Mom, type the message, send it.

That gap — between **natural language** and **deterministic UI actions on a real device** — is what **Android Skill Router** is built to close.

### Why cloud agents fall short for personal automation

Most phone automation today follows one of three paths:

| Approach | Strength | Weakness |
| --- | --- | --- |
| **Cloud voice assistants** | Understand broad language | Can't tap your apps; privacy concerns; needs network |
| **Macro/script tools** | Deterministic replay | Require exact trigger phrases; no natural language |
| **Vision-based agents** | Flexible | Slow, expensive, hallucinate UI coordinates |

Android Skill Router takes a different path: **a small local classifier that understands messy language, paired with pre-recorded UI trajectories that an accessibility runtime replays exactly.**

The core insight:

> You don't need a 70B frontier model to *do* the tapping. You need a 3B model to understand *what you mean*, then hand off to a fixed replay plan.

```
"play my workout playlist"
    → spotify_play_playlist
    → trajectories/spotify_play_playlist.json
    → Pocket Automator replays taps on device
```

This is the classifier layer of the **[Pocket Automator](https://github.com/kriyanshii/pocket-automator)** stack: record once on your phone, route forever with a tiny local model.

---

## The architecture: classify → route → replay

The system has three layers, each deliberately small and composable.

```mermaid
flowchart LR
    A[Natural language prompt] --> B[Fine-tuned Qwen2.5-3B]
    B --> C["Structured intent\n{skill, parameters}"]
    C --> D[Skill Router]
    D --> E[Trajectory JSON]
    E --> F[Pocket Automator replay]
```

### Layer 1: Intent classifier

A fine-tuned **Qwen2.5-3B-Instruct** model receives a user prompt and returns structured JSON:

```json
{
  "skill": "whatsapp_send_message",
  "parameters": {
    "contact": "mom",
    "message": "i'm on my way"
  }
}
```

The model handles slang, typos, incomplete phrasing, and app disambiguation (WhatsApp vs Gmail vs Slack). It never invents UI steps — only picks from 15 known skills and extracts parameter slots.

### Layer 2: Skill router

A deterministic lookup table maps skill names to trajectory files:

```python
SKILL_TO_TRAJECTORY = {
    "whatsapp_send_message": "trajectories/whatsapp_send_message.json",
    "spotify_play_playlist": "trajectories/spotify_play_playlist.json",
    # ... 15 skills total
}
```

If the model returns `whatsapp_send_message`, the router loads `trajectories/whatsapp_send_message.json`. No guessing, no hallucination. If the skill doesn't exist or the file is missing, the system fails loudly with a clear error.
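
The lookup-then-fail-loudly behavior can be sketched in a few lines. This is a minimal illustration, not the code in `src/skill_router.py`: the `SkillRoutingError` class and the `load_trajectory` function name are hypothetical, and only two of the 15 skills are listed.

```python
import json
from pathlib import Path

# Two entries shown for brevity; the real table covers all 15 skills.
SKILL_TO_TRAJECTORY = {
    "whatsapp_send_message": "trajectories/whatsapp_send_message.json",
    "spotify_play_playlist": "trajectories/spotify_play_playlist.json",
}


class SkillRoutingError(Exception):
    """Hypothetical error type raised when a skill cannot be routed."""


def load_trajectory(skill: str, root: Path = Path(".")) -> dict:
    """Return the parsed trajectory for `skill`, or raise a descriptive error."""
    relative_path = SKILL_TO_TRAJECTORY.get(skill)
    if relative_path is None:
        known = ", ".join(sorted(SKILL_TO_TRAJECTORY))
        raise SkillRoutingError(f"Unknown skill {skill!r}; known skills: {known}")
    path = root / relative_path
    if not path.is_file():
        raise SkillRoutingError(f"Trajectory file missing for {skill!r}: {path}")
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Raising a dedicated exception keeps the two failure modes (unknown label, missing file) distinguishable for whatever calls the router, for example a UI that wants to show a different message for each.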

The router also includes **defensive parsing**: skill aliases (`send_whatsapp` → `whatsapp_send_message`), JSON extraction from noisy model output, and keyword fallbacks when the model returns an unknown label.
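
As a rough sketch of what such defensive parsing might look like: the alias below comes from the paragraph above, but the keyword table, the reduced skill set, and all function names are assumptions made for illustration, not the contents of `src/skill_utils.py`.

```python
import json
import re
from typing import Optional

KNOWN_SKILLS = {"whatsapp_send_message", "spotify_play_playlist", "create_alarm"}
SKILL_ALIASES = {"send_whatsapp": "whatsapp_send_message"}
# Hypothetical keyword table; the project's real fallback rules are not shown here.
KEYWORD_FALLBACKS = {"whatsapp": "whatsapp_send_message", "alarm": "create_alarm"}


def extract_json(raw: str) -> dict:
    """Parse the span from the first '{' to the last '}'; return {} on failure."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def normalize_skill(label: str, prompt: str) -> Optional[str]:
    """Resolve aliases first, then fall back to keywords in the user prompt."""
    label = label.strip().lower()
    label = SKILL_ALIASES.get(label, label)
    if label in KNOWN_SKILLS:
        return label
    for keyword, skill in KEYWORD_FALLBACKS.items():
        if re.search(rf"\b{re.escape(keyword)}\b", prompt.lower()):
            return skill
    return None  # nothing matched: the caller decides how to fail


def parse_prediction(raw: str, prompt: str) -> dict:
    """Turn raw model text into a {skill, parameters} dict."""
    data = extract_json(raw)
    skill = normalize_skill(str(data.get("skill", "")), prompt)
    parameters = data.get("parameters")
    if not isinstance(parameters, dict):
        parameters = {}
    return {"skill": skill, "parameters": parameters}
```

The keyword fallback only runs when the model's label is unusable, so a correct prediction is never overridden by a keyword match.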

### Layer 3: Trajectory replay

Each trajectory is a JSON file exported from **[Pocket Automator](https://github.com/kriyanshii/pocket-automator)** — an Android accessibility recorder. It contains:

- A **task description** (the original human intent)
- The **target app package** (`com.whatsapp`, `com.spotify.music`, etc.)
- A sequence of **steps**, each with a full UI tree snapshot and an action

Example step from a WhatsApp trajectory:

```json
{
  "timestamp": 4024,
  "screen": { /* full accessibility tree */ },
  "action": {
    "type": "click",
    "resourceId": "com.motorola.launcher3:id/icon",
    "contentDescription": "WhatsApp",
    "path": [0, 0, 0, 0, 2, 0, 0]
  },
  "packageName": "com.motorola.launcher3"
}
```

Action types include `click`, `set_text`, and scroll gestures. Pocket Automator resolves nodes at replay time using resource IDs, content descriptions, and tree paths — so minor UI changes don't break the flow.
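
To make the resolution idea concrete, here is a small sketch of a selector fallback chain. It assumes the accessibility tree is a nested dict with a `children` list, and the order of the fallbacks is a guess; Pocket Automator's actual resolver runs on device and may work differently.

```python
from typing import Iterator, List, Optional


def walk(node: dict) -> Iterator[dict]:
    """Yield every node in the tree, depth-first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def follow_path(root: dict, path: List[int]) -> Optional[dict]:
    """Follow child indices from the root; None if the path no longer exists."""
    node = root
    for index in path:
        children = node.get("children", [])
        if not 0 <= index < len(children):
            return None
        node = children[index]
    return node


def resolve_node(root: dict, action: dict) -> Optional[dict]:
    """Try the most specific selector first, then progressively weaker ones."""
    resource_id = action.get("resourceId")
    description = action.get("contentDescription")
    nodes = list(walk(root))
    if resource_id and description:
        for node in nodes:
            if (node.get("resourceId") == resource_id
                    and node.get("contentDescription") == description):
                return node
    if description:
        for node in nodes:
            if node.get("contentDescription") == description:
                return node
    if resource_id:
        # A bare resource ID is only trusted when it is unambiguous.
        matches = [n for n in nodes if n.get("resourceId") == resource_id]
        if len(matches) == 1:
            return matches[0]
    path = action.get("path")
    return follow_path(root, path) if path else None
```

With this ordering, an icon that moved to a different grid slot is still found by its ID and description, and the recorded path is only a last resort.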

### The separation of concerns

| Component | Responsibility | Can fail? |
| --- | --- | --- |
| Language model | Understand intent | Gracefully — fallbacks exist |
| Skill router | Map intent → file | Only loudly (unknown skill or missing file raises a clear error) |
| Trajectory | Store ground-truth UI steps | Never — fixed recording |
| Pocket Automator | Execute on device | Only if UI changed drastically |

This is the design bet: **language understanding is fuzzy; automation must be exact.**

---

## Step 1: Record real UI flows on Android

Every skill starts on hardware you own. No synthetic UI trees, no emulated taps — real recordings from a real Motorola device.

### Pocket Automator: the Android recorder

**[Pocket Automator](https://github.com/kriyanshii/pocket-automator)** is an Android accessibility app that:

1. **Records** taps, text input, and scrolls while you use any app
2. **Captures** the full accessibility tree at each step (node IDs, bounds, class names, text)
3. **Exports** recordings as JSON for training pipelines
4. **Replays** saved recordings with smart node resolution

Requirements: Android 10+ (API 29), accessibility service enabled, overlay permission.

### The recording workflow

1. Open Pocket Automator and tap **Record**
2. Name your task (e.g., "message hi to biraj on WhatsApp")
3. Perform the task naturally on your phone
4. Stop recording from the floating overlay
5. Export the JSON to your development machine
6. Place it in `trajectories/` and run `scripts/generate_skill_dataset.py`

The script reads each trajectory's `task` and `app` fields, derives a snake_case skill name, and writes `data/skills.jsonl`:

```json
{"skill": "whatsapp_send_message", "task": "message hi to biraj on WhatsApp"}
{"skill": "spotify_play_playlist", "task": "play liked songs playlist from Spotify"}
{"skill": "create_alarm", "task": "create alarm for 7 am tomorrow"}
```

Skill name derivation uses app package and task keywords — WhatsApp tasks become `whatsapp_send_message`, Spotify pause tasks become `spotify_pause`, and so on.
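
One way to picture this derivation is an ordered rule table. The rules below are illustrative guesses rather than the logic in `scripts/generate_skill_dataset.py`; only the `task` and `app` field names and the skill names come from the text above.

```python
import json
from typing import Optional

# Hypothetical rules: (app package, keyword required in the task, skill name).
# An empty keyword matches any task for that package. First match wins,
# so specific rules must come before the catch-all for the same package.
RULES = [
    ("com.whatsapp", "", "whatsapp_send_message"),
    ("com.spotify.music", "pause", "spotify_pause"),
    ("com.spotify.music", "playlist", "spotify_play_playlist"),
    ("com.spotify.music", "", "spotify_search_play"),
]


def derive_skill_name(app: str, task: str) -> Optional[str]:
    """Return the first skill whose package and keyword match, else None."""
    task_lower = task.lower()
    for package, keyword, skill in RULES:
        if app == package and keyword in task_lower:
            return skill
    return None


def skills_jsonl_line(trajectory: dict) -> str:
    """Build one line of skills.jsonl from a trajectory's `app` and `task` fields."""
    skill = derive_skill_name(trajectory["app"], trajectory["task"])
    if skill is None:
        raise ValueError(f"No skill rule matches task: {trajectory['task']!r}")
    return json.dumps({"skill": skill, "task": trajectory["task"]})
```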

### The 15 skills

| Skill | App | Example task |
| --- | --- | --- |
| `create_alarm` | Clock | Set alarm for 7 am tomorrow |
| `calendar_create_event` | Calendar | Create event tomorrow 4 pm |
| `wifi_enable` | Settings | Enable Wi-Fi |
| `bluetooth_enable` | Settings | Turn on Bluetooth |
| `whatsapp_send_message` | WhatsApp | Message a contact |
| `gmail_send_email` | Gmail | Send email to recipient |
| `slack_open_channel` | Slack | Open a channel |
| `spotify_play_playlist` | Spotify | Play a playlist |
| `spotify_search_play` | Spotify | Search and play music |
| `spotify_pause` | Spotify | Pause playback |
| `uber_request_ride` | Uber | Request ride to destination |
| `youtube_search` | YouTube | Search for videos |
| `linkedin_search_person` | LinkedIn | Search for a person |
| `contacts_search` | Contacts | Find a contact |
| `camera_take_photo` | Camera | Take a picture |

Each trajectory file is large (often 5,000+ lines) because it includes the full accessibility tree at every step. That's intentional — replay engines need rich node metadata to resolve targets reliably.

### Why real recordings matter

Synthetic UI automation data is brittle. Real recordings capture:

- **Launcher states** — how your home screen looks with your app icons
- **Keyboard transitions** — when the soft keyboard appears during text input
- **Scroll positions** — where list items sit after scrolling
- **Timing** — natural pauses between actions

These details can't be generated. They're the ground truth that makes replay work on your specific device.

---

## Step 2: Train a tiny classifier, not a general agent

The model is **[Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)** — deliberately under 4B parameters for the Build Small Hackathon's *Tiny Titan* achievement.

### Why 3B is enough

The classification task is narrow:

- **15 skill labels** (not open-ended tool use)
- **Structured JSON output** (not free-form text)
- **Parameter slot-filling** (contact, message, time — not reasoning chains)

A 3B instruct model already understands apps, contacts, times, and natural language phrasing. Fine-tuning teaches it *your* skill taxonomy and output format — not general Android knowledge.

### Training configuration

Training runs on **Modal** GPUs via `modal_apps/train_modal.py`:

| Hyperparameter | Value |
| --- | --- |
| Base model | Qwen2.5-3B-Instruct |
| Method | 4-bit QLoRA + SFT (Unsloth) |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| Target modules | q/k/v/o_proj, gate/up/down_proj |
| Epochs | 5 |
| Batch size | 8 |
| Learning rate | 2e-4 |
| Optimizer | AdamW 8-bit |
| Max sequence length | 2048 |
| GPU | Modal A10G |

The training pipeline:

1. Upload `data/train_intent.jsonl` to a Modal Volume
2. Load base model in 4-bit quantization
3. Apply QLoRA adapters to attention and MLP layers
4. Format examples with Qwen 2.5 chat template
5. Train with TRL's `SFTTrainer`
6. Save LoRA adapter to `/model/adapter`
7. Save merged 16-bit model to `/model/merged`

```bash
python scripts/generate_intent_dataset.py
modal run modal_apps/train_modal.py --dataset train_intent.jsonl
modal volume get android-dataset-model adapter ./trained_model/adapter
```

### V1 → V2: from labels to intents

**V1 (skill classification only)** mapped prompts to a skill name:

```
"play my workout playlist" → {"skill": "spotify_play_playlist"}
```

Training data: ~510 examples in `data/train.jsonl` (~30 variations per skill).

**V2 (structured intent extraction)** adds parameter slot-filling:

```
"text mom on whatsapp i'm on my way"
→ {"skill": "whatsapp_send_message", "parameters": {"contact": "mom", "message": "i'm on my way"}}
```

Training data: ~15,000 examples in `data/train_intent.jsonl` (~1,000 per skill).

### Parameter schemas

Each skill declares its parameters in `data/skill_schemas.json`:

```json
{
  "whatsapp_send_message": {
    "description": "Send a WhatsApp message to a contact",
    "parameters": {
      "contact": {"type": "string", "required": true},
      "message": {"type": "string", "required": true}
    }
  },
  "create_alarm": {
    "description": "Set an alarm at a specific time",
    "parameters": {
      "time": {"type": "string", "required": true},
      "day": {"type": "string", "required": false}
    }
  },
  "wifi_enable": {
    "description": "Enable Wi-Fi on the device",
    "parameters": {}
  }
}
```

Skills with no variable inputs (`wifi_enable`, `bluetooth_enable`, `spotify_pause`, `camera_take_photo`) return empty parameter objects.
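
A schema in this shape also makes it cheap to sanity-check a predicted intent before replay. The validator below is a sketch under that assumption; the `validate_intent` function is hypothetical and the post does not say whether the project performs this check.

```python
from typing import List

# Subset of data/skill_schemas.json, copied from the example above.
SCHEMAS = {
    "whatsapp_send_message": {
        "parameters": {
            "contact": {"type": "string", "required": True},
            "message": {"type": "string", "required": True},
        }
    },
    "create_alarm": {
        "parameters": {
            "time": {"type": "string", "required": True},
            "day": {"type": "string", "required": False},
        }
    },
    "wifi_enable": {"parameters": {}},
}


def validate_intent(intent: dict, schemas: dict = SCHEMAS) -> List[str]:
    """Return a list of problems; an empty list means the intent fits its schema."""
    skill = intent.get("skill")
    if skill not in schemas:
        return [f"unknown skill: {skill!r}"]
    declared = schemas[skill]["parameters"]
    given = intent.get("parameters", {})
    problems = []
    for name, spec in declared.items():
        value = given.get(name)
        if spec.get("required") and (value is None or not str(value).strip()):
            problems.append(f"missing required parameter: {name}")
    for name in given:
        if name not in declared:
            problems.append(f"unexpected parameter: {name}")
    return problems
```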

### The system prompt

The model receives a tight, deterministic instruction:

```
You extract structured Android automation intents from natural language.
Reply with JSON only: {"skill": "<skill_name>", "parameters": {<extracted_fields>}}.
Pick exactly one skill. Extract all relevant parameters mentioned in the request
(contact names, messages, times, destinations, channel names, search queries, etc.).
Use an empty object for parameters when the skill needs none.
Use the app or action named in the request (contacts, Gmail, Slack, YouTube, etc.)
to pick the correct skill.
```

No chain-of-thought. No tool descriptions. No examples in the prompt. Just JSON.

### Training example format

Each row in `train_intent.jsonl` is a three-message chat (system, user, assistant):

```json
{
  "messages": [
    {"role": "system", "content": "You extract structured Android automation intents..."},
    {"role": "user", "content": "whatsapp message Vikram see you tonight"},
    {"role": "assistant", "content": "{\"skill\":\"whatsapp_send_message\",\"parameters\":{\"contact\":\"Vikram\",\"message\":\"see you tonight\"}}"}
  ]
}
```

The assistant always responds with compact JSON — no markdown fences, no explanation.
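
A row in this format can be built with the standard library alone. In the sketch below, `build_row` is a hypothetical helper and `SYSTEM_PROMPT` is abbreviated the same way as in the example above; the real generator lives in `scripts/generate_intent_dataset.py`.

```python
import json

# Abbreviated here exactly as in the example row; substitute the full system prompt.
SYSTEM_PROMPT = "You extract structured Android automation intents..."


def build_row(prompt: str, skill: str, parameters: dict) -> dict:
    """Build one SFT example whose assistant turn is compact JSON."""
    target = json.dumps(
        {"skill": skill, "parameters": parameters},
        separators=(",", ":"),  # no spaces after ',' or ':' keeps the target compact
        ensure_ascii=False,     # keep non-ASCII names readable in the target
    )
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": target},
        ]
    }


row = build_row(
    "whatsapp message Vikram see you tonight",
    "whatsapp_send_message",
    {"contact": "Vikram", "message": "see you tonight"},
)
jsonl_line = json.dumps(row, ensure_ascii=False)  # one line of train_intent.jsonl
```

Fixing the key order (`skill` first, then `parameters`) and the separators means every target has the same surface form, which gives the model one consistent pattern to learn.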

---

## Step 3: Synthetic data at scale

Fifteen real trajectories can't train a robust classifier alone. The project generates **~15,000 synthetic SFT examples** locally via `scripts/generate_intent_dataset.py`.

### How data generation works

The generator follows a four-step pipeline:

```
skill_schemas.json + skills.jsonl
        ↓
   Entity pools (contacts, messages, times, destinations...)
        ↓
   Template variations (24+ templates per skill)
        ↓
   train_intent.jsonl (~1000 examples/skill)
   eval_intent_prompts.json (~6 held-out prompts/skill)
```

### Entity pools

Realistic but synthetic entities ensure diversity without privacy concerns:

| Pool | Examples |
| --- | --- |
| **Contacts** | Ri, Biraj, Mom, Parag Shah, grandma, my roommate |
| **Messages** | "see you soon", "running late", "project update attached" |
| **Alarm times** | 5 am, 6:30 am, 7 am, noon, 10 pm |
| **Alarm days** | today, tomorrow, monday, next friday |
| **Destinations** | airport, train station, home, office |
| **Playlists** | workout, liked songs, chill vibes, focus |
| **Channels** | engineering, general, data contributors |
| **Search queries** | pasta recipes, jazz music, ghibli food |

### Template variations

Each skill has 15–30 prompt templates with placeholder slots:

**WhatsApp templates:**
```
"message {message} to {contact} on whatsapp"
"text {contact} {message} on whatsapp"
"whatsapp {contact} saying {message}"
"ping {contact} on whatsapp with {message}"
```

**Alarm templates:**
```
"create alarm for {time} {day}"
"wake me up at {time} {day}"
"set a {time} alarm for {day}"
"{time} alarm {day} please"
```

**Uber templates:**
```
"get an uber to {destination}"
"uber me to {destination}"
"book a cab to {destination} via uber"
```

Templates are crossed with random entity samples to produce unique training pairs. The same intent can appear as:

- "set an alarm for 7 am tomorrow"
- "wake me up at seven tomorrow morning"
- "7am alarm pls"
- "please alarm 7 am tomorrow thanks"
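
The template-times-entity crossing can be sketched as follows. The templates and pool entries are taken from the lists above, while `generate_examples`, the seeding, and the de-duplication strategy are assumptions rather than a description of `scripts/generate_intent_dataset.py`.

```python
import random
from string import Formatter
from typing import List

TEMPLATES = {
    "uber_request_ride": [
        "get an uber to {destination}",
        "uber me to {destination}",
        "book a cab to {destination} via uber",
    ],
    "whatsapp_send_message": [
        "message {message} to {contact} on whatsapp",
        "text {contact} {message} on whatsapp",
    ],
}
POOLS = {
    "destination": ["airport", "train station", "home", "office"],
    "contact": ["Ri", "Biraj", "Mom"],
    "message": ["see you soon", "running late"],
}


def slot_names(template: str) -> List[str]:
    """List the {placeholders} that appear in a template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


def generate_examples(skill: str, count: int, seed: int = 0) -> List[dict]:
    """Sample up to `count` unique prompts, each paired with its slot values."""
    rng = random.Random(seed)  # seeded so a run is reproducible
    seen, examples = set(), []
    for _ in range(count * 20):  # bounded retries in case the space is small
        if len(examples) == count:
            break
        template = rng.choice(TEMPLATES[skill])
        parameters = {slot: rng.choice(POOLS[slot]) for slot in slot_names(template)}
        prompt = template.format(**parameters)
        if prompt in seen:
            continue
        seen.add(prompt)
        examples.append({"prompt": prompt, "skill": skill, "parameters": parameters})
    return examples
```

Because the slot values are chosen before the prompt is rendered, the ground-truth `parameters` come for free with every generated prompt.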

### V1 training data (skill-only)

The earlier `scripts/generate_training_data.py` produces ~510 examples for V1 classification:

- 30 variations per skill from `skills.jsonl` task descriptions
- Guaranteed inclusion of Gradio demo prompts
- Regex-based parsing of task strings to derive alarm times, contacts, etc.

### Held-out evaluation sets

Two held-out evaluation sets check that the model generalizes beyond its training templates:

| File | Size | Purpose |
| --- | --- | --- |
| `data/eval_intent_prompts.json` | ~90 prompts | Structured eval during training |
| `data/pocket_benchmark_prompts.json` | 200 prompts | Real-world messy language benchmark |

The Pocket Automator benchmark is intentionally unlike training data — slang, typos, incomplete phrasing, conversational filler:

```
"yo set an alrm for like 5:45 tmrw morning pls"
"need to b up at 6ish on monday ngl"
"hit up zoe on whatsapp say im omw"
"wa msg marcus 'running 20 min late'"
"lowkey need 11:11 pm alarm tonight"
"deadass need alarm sunday noon"
```

Each benchmark case is tagged with `domain` (alarms, whatsapp, spotify...) and `styles` (slang, typo, incomplete, conversational). Prompts are filtered against training data to ensure zero overlap.
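
A zero-overlap filter of this kind could look like the sketch below. The normalization (lowercasing, stripping punctuation, collapsing whitespace) is an assumption; the post does not specify how strictly the project compares prompts.

```python
import re
from typing import Iterable, List


def normalize_prompt(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def filter_overlap(benchmark_prompts: Iterable[str], training_prompts: Iterable[str]) -> List[str]:
    """Keep only benchmark prompts whose normalized form is absent from training."""
    seen = {normalize_prompt(prompt) for prompt in training_prompts}
    return [p for p in benchmark_prompts if normalize_prompt(p) not in seen]


training = ["Set an alarm for 7 am tomorrow."]
benchmark = [
    "set an alarm for 7 am tomorrow",
    "yo set an alrm for like 5:45 tmrw morning pls",
]
held_out = filter_overlap(benchmark, training)
```

Comparing normalized forms rather than raw strings catches near-duplicates that differ only in casing or punctuation, which would otherwise leak into the benchmark.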

---

## Step 4: Deploy inference on Modal, demo on Gradio

### Modal inference API

Training and inference both run on **Modal** — serverless GPU infrastructure with persistent volumes.

`modal_apps/predict_api.py` deploys a FastAPI endpoint:

```bash
modal deploy modal_apps/predict_api.py
# → https://<workspace>--android-skill-predict-api-skillpredictor-web.modal.run
```

Architecture:

- **Container class** `SkillPredictor` loads the QLoRA model once via `@modal.enter()`
- **4-bit quantized** base model + LoRA adapter from Modal Volume
- **Greedy decoding** (`do_sample=False`) for deterministic JSON output
- **128 max new tokens** — enough for any intent JSON
- **5-minute scale-down window** — containers stay warm between requests

Request/response:

```bash
curl -X POST https://.../predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "text mom on whatsapp i am on my way"}'
```

```json
{
  "skill": "whatsapp_send_message",
  "parameters": {
    "contact": "mom",
    "message": "i am on my way"
  }
}
```

The API applies the same post-processing as local evaluation: JSON extraction, skill normalization, alias resolution, and keyword fallbacks.
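
For callers that prefer Python over `curl`, a standard-library client might look like this. It assumes, as in the demo setup, that the `MODAL_PREDICT_URL` environment variable holds the full `/predict` URL; the function names and the timeout value are illustrative.

```python
import json
import os
import urllib.request


def build_request(predict_url: str, prompt: str) -> urllib.request.Request:
    """Build the POST request with the same JSON body as the curl example."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        predict_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(prompt: str, timeout: float = 60.0) -> dict:
    """Send the prompt to the endpoint and return the decoded intent."""
    request = build_request(os.environ["MODAL_PREDICT_URL"], prompt)
    # A generous timeout leaves room for a cold container to load the model.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `predict("text mom on whatsapp i am on my way")` should return a dict shaped like the JSON response above, provided the endpoint is deployed and reachable.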

### Gradio demo

The **Gradio demo** (`app.py`) is the hackathon submission UI, hosted on Hugging Face Spaces.

Flow:

1. User types a natural language prompt (or picks an example)
2. App POSTs to Modal `/predict` endpoint
3. Response is parsed: skill label, parameter tiles, confidence display
4. Skill router loads the matching trajectory from `trajectories/`
5. UI shows task description, app package, step count, and trajectory preview

Example prompts built into the demo:

- "play my workout playlist"
- "turn bluetooth on"
- "wake me up tomorrow morning"
- "send ri a message on whatsapp"
- "book an uber to the airport"

The Space doesn't ship model weights — inference stays on Modal. Only a `MODAL_PREDICT_URL` secret is needed.

### Local development

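Three steps take you from data generation to a running demo. Training and inference still run on Modal; only the scripts and the Gradio app run on your machine: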

```bash
# 1. Generate training data
python scripts/generate_intent_dataset.py

# 2. Train on Modal GPU
modal run modal_apps/train_modal.py --dataset train_intent.jsonl

# 3. Deploy inference + run demo
modal deploy modal_apps/predict_api.py
export MODAL_PREDICT_URL="https://..."
python app.py
```

Evaluation can run locally on CPU/MPS if you download the adapter:

```bash
modal volume get android-dataset-model adapter ./trained_model/adapter
python -m src.evaluate_intent
```

---

## Evaluation: how we measure generalization

### Metrics

Three metrics capture different levels of correctness:

| Metric | Definition | What it measures |
| --- | --- | --- |
| **Skill accuracy** | Predicted skill matches expected | App/action disambiguation |
| **Parameter accuracy** | All expected parameters match (normalized) | Slot-filling quality |
| **Exact JSON match** | Skill + all parameters match exactly | End-to-end correctness |

Parameter matching uses normalized lowercase comparison — `"Mom"` matches `"mom"`, extra whitespace is stripped.
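
The three metrics can be sketched as below. The case structure (`expected` / `predicted` dicts) is hypothetical, and the sketch assumes that an exact match means the skill is correct, every expected parameter matches after normalization, and no extra parameters were predicted; the project's own definition in `src/pocket_benchmark.py` may differ.

```python
from typing import Dict, List


def normalize(value: object) -> str:
    """Lowercase and collapse whitespace so "Mom " equals "mom"."""
    return " ".join(str(value).lower().split())


def params_match(expected: dict, predicted: dict) -> bool:
    """True when every expected parameter is present and equal after normalization."""
    return all(
        normalize(predicted.get(name, "")) == normalize(value)
        for name, value in expected.items()
    )


def score(cases: List[dict]) -> Dict[str, float]:
    """Compute skill accuracy, parameter accuracy, and exact match as fractions."""
    skill_hits = param_hits = exact_hits = 0
    for case in cases:
        expected, predicted = case["expected"], case["predicted"]
        predicted_params = predicted.get("parameters", {})
        skill_ok = expected["skill"] == predicted.get("skill")
        params_ok = params_match(expected["parameters"], predicted_params)
        no_extras = set(predicted_params) <= set(expected["parameters"])
        skill_hits += int(skill_ok)
        param_hits += int(params_ok)
        exact_hits += int(skill_ok and params_ok and no_extras)
    total = len(cases)
    return {
        "skill_accuracy": skill_hits / total,
        "parameter_accuracy": param_hits / total,
        "exact_match": exact_hits / total,
    }
```

Under these definitions exact match can never exceed either of the other two scores, which is a useful sanity check when reading a results table.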

### Pocket Automator benchmark results

Evaluation on **200 held-out prompts** with slang, typos, and conversational phrasing:

| Metric | Score |
| --- | --- |
| **Skill accuracy** | 99.0% |
| **Parameter accuracy** | 86.0% |
| **Exact JSON match** | 85.5% |

The model almost never picks the wrong app or action. Parameter extraction is harder: the model has to decide, for example, whether to preserve an informal time expression like `"6ish"` or normalize it to `"6 am"`. Even so, 86% is strong for a 3B model with no cloud fallback.

### Where errors happen

Parameter failures tend to cluster around:

- **Informal time expressions**: "6ish on monday" vs `"time": "6 am", "day": "monday"`
- **Abbreviated days**: "tmrw" vs "tomorrow morning"
- **Message truncation**: model drops filler words the benchmark expects verbatim
- **Contact nicknames**: "roomie" vs a full name

Skill errors (1%) mostly involve near-miss disambiguation — Spotify search-and-play vs play-playlist when the prompt is ambiguous.

### Evaluation commands

```bash
# On Modal GPU
modal run modal_apps/evaluate_intent_modal.py
modal run modal_apps/evaluate_pocket_benchmark_modal.py

# Locally
python -m src.evaluate_intent
python -m src.evaluate_pocket_benchmark
```

The pocket benchmark runner produces a confusion matrix, per-domain breakdown, and a failure report saved to `data/pocket_benchmark_report.txt`.

---

## Why this approach works

### 1. Local-first, privacy-preserving

A 3B model can run on-device (via llama.cpp, MLC, or similar) or on a small GPU. Your "text mom I'm running late" never needs to hit a frontier API. The entire inference stack fits in ~2GB of VRAM with 4-bit quantization.
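
The memory figure follows from simple arithmetic on the weights. The sketch below assumes roughly 3 billion parameters, all stored at 4 bits; it deliberately ignores quantization constants, any layers kept at higher precision, activations, and the KV cache, which is where the remaining headroom toward ~2GB would go.

```python
def weight_memory_gb(num_parameters: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1e9


four_bit_gb = weight_memory_gb(3e9, 4)     # 4-bit quantized weights
sixteen_bit_gb = weight_memory_gb(3e9, 16)  # the same weights at 16-bit
```

Under these assumptions the 4-bit weights take about 1.5 GB versus about 6 GB at 16-bit, which is why quantization is what makes the small-GPU and on-device scenarios plausible.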

### 2. Deterministic replay, not hallucinated taps

The model outputs a skill label and parameters. The trajectory is a fixed file recorded on a real device. No invented button coordinates, no drift between runs. If the model says `whatsapp_send_message`, you get the exact same tap sequence every time.

This is fundamentally different from vision-based agents that re-locate UI elements on every run and can click the wrong thing.

### 3. Cheap to extend

Adding a new skill is a repeatable pipeline:

1. Record one trajectory with Pocket Automator
2. Add parameter schema to `data/skill_schemas.json`
3. Add skill mapping to `src/skill_router.py`
4. Regenerate training data: `python scripts/generate_intent_dataset.py`
5. Fine-tune: `modal run modal_apps/train_modal.py --dataset train_intent.jsonl`

No prompt engineering session. No re-architecting the model. Just more data and another training run.

### 4. Separation of concerns

| Component | Responsibility | Swappable? |
| --- | --- | --- |
| Language model | Understand intent | Yes — any 3B instruct model |
| Skill router | Map intent → file | Yes — add skills without retraining |
| Pocket Automator | Execute UI steps | Yes — any accessibility replay engine |
| Trajectory JSON | Store ground truth | Yes — re-record when UI changes |

Each piece can be improved independently. Better model? Swap the adapter. UI changed? Re-record one trajectory. New app? Add a skill.

### 5. Designed for the "backyard"

This project targets **personal automation on hardware you own** — the Backyard AI track. It's not trying to automate every Android app in existence. It's trying to automate *your* apps, *your* flows, *your* phrasing, with a model small enough to run locally.

---

## Parameterized replay: classify → bind → replay

### The gap V2 closed

V2 extracts parameters at inference time:

```
"text mom on whatsapp i'm on my way"
→ {"contact": "mom", "message": "i'm on my way"}
```

Recorded trajectories still contain **fixed entities** — the WhatsApp export types `"Biraj"` and `"Hi"`. Without binding, replay ignores the model output.

### Slot-filling at replay time

**ParameterBinder** substitutes runtime values into trajectory steps before replay:

1. Load bindings from `data/skill_schemas.json` (which step maps to which parameter)
2. Rewrite `set_text` values and post-search `click` labels
3. Hand the bound trajectory to `ReplayPlanner` → `ReplayEngine`

This closes the loop:

```
Natural language → structured intent → parameterized replay on any device
```

**Validated end-to-end flow (WhatsApp on device):**

```
Modal /predict (or pasted JSON)
  → parameter dialog in Pocket Automator
  → ParameterBinder.apply(trajectory, parameters, bindings)
  → ReplayPlanner.plan → ReplayEngine.replay
  → WhatsApp taps with the extracted contact and message
```

The Gradio Space runs the same binding logic in Python (`src/parameter_binder.py`) so the trajectory JSON preview matches what replay will execute.
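
To illustrate the idea, here is a minimal binding sketch. The binding format (`step`, `parameter`, `field`) and the `text` field on actions are assumptions made for this example; the real format is defined in `data/skill_schemas.json` and implemented in `src/parameter_binder.py`.

```python
import copy
from typing import List


def apply_bindings(trajectory: dict, parameters: dict, bindings: List[dict]) -> dict:
    """Return a copy of the trajectory with runtime values substituted in."""
    bound = copy.deepcopy(trajectory)  # never mutate the recorded ground truth
    for binding in bindings:
        name = binding["parameter"]
        if name not in parameters:
            continue  # nothing extracted: keep the recorded value
        action = bound["steps"][binding["step"]]["action"]
        action[binding["field"]] = parameters[name]
    return bound


recorded = {
    "task": "message hi to biraj on WhatsApp",
    "steps": [
        {"action": {"type": "click", "contentDescription": "WhatsApp"}},
        {"action": {"type": "set_text", "text": "Biraj"}},  # search box
        {"action": {"type": "click", "text": "Biraj"}},     # post-search result
        {"action": {"type": "set_text", "text": "Hi"}},     # message box
    ],
}
bindings = [
    {"step": 1, "parameter": "contact", "field": "text"},
    {"step": 2, "parameter": "contact", "field": "text"},
    {"step": 3, "parameter": "message", "field": "text"},
]
bound = apply_bindings(recorded, {"contact": "mom", "message": "i'm on my way"}, bindings)
```

Working on a deep copy means one recording can serve any number of requests, and the original export stays available for diffing or re-binding.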

Bindings are defined per skill. **WhatsApp**, **Gmail**, and **YouTube** are supported in preview; Pocket Automator mirrors the schema on device.

### What's next

- **Self-contained exports** — embed `bindings` + `recordedParameters` in exported trajectory JSON (Phase B.8)
- **More skills** — Contacts, Calendar, Spotify search, etc.
- **On-device inference** — run the 3B model locally without Modal
- **Multi-step intents** — "set alarm and text mom I'll be late"
- **UI change detection** — alert when a trajectory needs re-recording

---

## Try it yourself

### Links

| Resource | URL |
| --- | --- |
| **Blog post** | [Hugging Face Blog — Android Skill Router](https://huggingface.co/blog/build-small-hackathon/android-skill-router) |
| **Live demo** | [android-skill-router on Hugging Face Spaces](https://huggingface.co/spaces/build-small-hackathon/android-skill-router) |
| **Demo video** | [YouTube Short](https://youtube.com/shorts/IQRHf7HfTDA) |
| **Pocket Automator** | [GitHub — Android recorder & replay](https://github.com/kriyanshii/pocket-automator) |
| **Social post** | [Twitter/X](https://x.com/kriyanshii/status/2066587828839141634) |

### Quick start

```bash
git clone https://github.com/kriyanshii/android-dataset.git
cd android-dataset

# Generate intent training data
python scripts/generate_intent_dataset.py

# Train on Modal (requires modal setup)
pip install modal && modal setup
modal run modal_apps/train_modal.py --dataset train_intent.jsonl

# Deploy inference API
modal deploy modal_apps/predict_api.py

# Run Gradio demo
pip install -r requirements.txt
export MODAL_PREDICT_URL="https://<your-modal-url>/predict"
python app.py
```

### Project layout

```
app.py                           # Gradio demo (hackathon submission UI)
data/
  skill_schemas.json             # Parameter definitions and trajectory bindings per skill
  skills.jsonl                   # Canonical skill ↔ task mapping
  train_intent.jsonl             # ~15k SFT examples (generated locally)
  eval_intent_prompts.json       # Held-out intent eval set
  pocket_benchmark_prompts.json  # 200 real-world messy prompts
src/
  skill_router.py                # Skill name → trajectory JSON
  parameter_binder.py            # Runtime parameter → trajectory step substitution
  skill_utils.py                 # JSON parsing, aliases, fallbacks
  classifier_prompt.py           # System prompts for V1 and V2
  evaluate_intent.py             # Local evaluation
  pocket_benchmark.py            # Benchmark metrics and reports
modal_apps/
  train_modal.py                 # QLoRA fine-tuning on Modal GPU
  predict_api.py                 # FastAPI inference endpoint
  evaluate_intent_modal.py       # GPU evaluation
  evaluate_pocket_benchmark_modal.py
scripts/
  generate_skill_dataset.py      # trajectories → skills.jsonl
  generate_intent_dataset.py     # schemas → train_intent.jsonl
  generate_pocket_benchmark.py
trajectories/                    # Pocket Automator exports (15 skills)
```

---

## TL;DR

**Android Skill Router** shows that personal phone automation doesn't require a 70B agent in the cloud.

1. **Record** UI flows once on your Android device with Pocket Automator
2. **Fine-tune** a 3B model to understand how you actually talk (slang, typos, and all)
3. **Route** to deterministic trajectories — no hallucinated taps
4. **Replay** through accessibility APIs on real hardware

Classify → route → replay. Small model, real hardware, backyard-scale AI that actually does something useful.

---

*Apache 2.0. Base model weights subject to [Qwen license](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct).*