---
title: UsTwo API
emoji: 💑
colorFrom: pink
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# UsTwo: your characters react to your calls

> **CMU 11-775 Large Scale Multimedia Analysis**, Team Project
> 3-person team: Juhyun (PO / Research) · Seungjae (ML) · Youngkyun (App)

UsTwo takes a recorded phone call between two people (couple, friends, family) and runs **multimodal analysis (audio + text)** to understand how each speaker felt. It then produces three things inside a **React Native + FastAPI** mobile app: a **character reaction scene**, a **growing emotion garden**, and an **LLM-written recap card**.

The goal isn't just emotion classification. It's to visualize *"what kind of moment did these two share on this call?"*

*Screenshots: Home · Recap card · History*

---

## At a glance

```
[call recording .wav/.m4a]
        │
        ▼
┌──────────────────────── Stage 1 (Seungjae) ────────────────────────┐
│ pyannote 4.x (3.1)   →  WhisperX large-v3-turbo INT8  →  ko/en     │
│ speaker diarization     ASR + forced alignment           LID       │
└────────────────────────────────────────────────────────────────────┘
        │  segments: [{speaker, start, end, text, lang}]
        ▼
┌──────────────────────── Stage 2 (Seungjae) ────────────────────────┐
│ emotion2vec LoRA (ONNX) + KcELECTRA LoRA (ko) / DistilRoBERTa      │
│ audio emotion (7-class)   text emotion (7-class ko / 7-class en)   │
│                                                                    │
│ fusion: per-language, per-class trained weights (v2)               │
└────────────────────────────────────────────────────────────────────┘
        │  per-speaker emotion distribution
        ▼
┌─────────────────────── Stage 3 (Youngkyun) ────────────────────────┐
│ character_mapping   + garden_logic       + recap_generator         │
│ 9 pair interactions   5 levels · 4 moods   Claude LLM              │
└────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌─────────────────────── Stage 4 (Youngkyun) ────────────────────────┐
│ FastAPI + SQLite    ↔  React Native (Expo)                         │
│ 6 endpoints, async     expo-router · SVG                           │
└────────────────────────────────────────────────────────────────────┘
```

**End-to-end latency:** a 1-minute call finishes in about 2 minutes on the HF Spaces CPU deployment.

**Live server:** `https://bbbakery-ustwo-api.hf.space`

---

## Demo: the garden grows

The emotion garden levels up as positive-ratio interactions accumulate. Level 1 shows just two seedlings. After many positive calls, Level 5 is a full bloom with flowers, trees, and creatures.

*Screenshots: Level 1 (seedling) · Level 5 (full bloom)*

| Level | Threshold (cumulative interactions) | Visual |
|-------|-------------------------------------|--------|
| 1 | 0–2 | Two seedlings |
| 2 | 3–7 | Grass + small flowers |
| 3 | 8–14 | Trees + many flowers |
| 4 | 15–24 | Lush garden + creatures |
| 5 | 25+ | Sunset sky + butterflies, rabbits + full bloom |

A **mood overlay** (`happy` · `neglected` · `recovering` · `conflict`) is layered on top, decided by the recent positive/negative ratio and the days since the last interaction. A healthy garden that's been ignored for 5+ days turns `neglected`; a recent call with heavy negative emotion pushes it to `conflict`.

---

## Stage 1–2: ML Pipeline (Seungjae)

### Stage 1: Diarization + ASR

| Component | Model | Config |
|-----------|-------|--------|
| VAD | Silero VAD | onset 0.5, min_speech 0.25s |
| Diarization | pyannote-audio 4.x (3.1) | HF token required |
| ASR | Faster-Whisper (large-v3-turbo) | INT8 quantized |
| Forced alignment | WhisperX wav2vec2 | per-word timestamps |
| Language ID | Whisper auto-detect + SenseVoice-Small | ko / en / unknown |

**Output:** `Stage1Output` with `segments: [{speaker, start, end, text, lang}]`, `processing_info`, and `models`. Entry point: `src/stage1/process.py`.

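
To make the segment shape concrete, here is a minimal sketch of how a consumer might hold Stage 1 segments and total each speaker's talk time. Only the five field names come from the output description above; the `Segment` dataclass, the `talk_time_by_speaker` helper, and the `SPEAKER_00` label style are illustrative assumptions, not the project's actual Pydantic schema in `src/common/`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One diarized, transcribed span (field names follow the Stage 1 output)."""
    speaker: str   # diarization label; the "SPEAKER_00" style is an assumption
    start: float   # seconds from the start of the recording
    end: float     # seconds from the start of the recording
    text: str      # ASR transcript for this span
    lang: str      # "ko", "en", or "unknown"


def talk_time_by_speaker(segments: list[Segment]) -> dict[str, float]:
    """Sum segment durations per speaker, ignoring malformed spans."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        duration = seg.end - seg.start
        if duration > 0:  # skip zero-length or inverted spans
            totals[seg.speaker] += duration
    return dict(totals)


segments = [
    Segment("SPEAKER_00", 0.0, 2.5, "How was your day?", "en"),
    Segment("SPEAKER_01", 2.5, 6.0, "Long, but good.", "en"),
    Segment("SPEAKER_00", 6.0, 7.0, "Glad to hear it.", "en"),
]
print(talk_time_by_speaker(segments))
```

A per-speaker view like this is the natural input for Stage 2, which reports one emotion distribution per speaker.
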
### Stage 2: Emotion Recognition (bilingual)

| Channel | Model | Classes |
|---------|-------|---------|
| Audio emotion | `emotion2vec_plus_base` + **LoRA fine-tuned (ONNX)** | 7: neutral, joy, sadness, anger, surprise, fear, disgust |
| Text emotion (Korean) | **KcELECTRA-base-v2022 + PEFT LoRA (ONNX)** | 7: neutral, joy, sadness, anger, surprise, fear, disgust |
| Text emotion (English) | DistilRoBERTa (j-hartmann/emotion-english, zero-shot) | 7: 6 Ekman emotions + neutral |
| Fusion | **Per-language, per-class weights trained via gradient descent** | EMOTION_FUSION_WEIGHTS_KO / _EN |

### Evaluation: English (RAVDESS, n=2,880)

| Condition | Accuracy | Macro F1 | Latency |
|-----------|----------|----------|---------|
| Clean (studio) | **93.2%** | **0.932** | 250 ms |
| Phone (PSTN simulation, 300–3400 Hz band-limit) | **71.5%** | **0.710** | 242 ms |
| Degradation | −21.7 pp | −0.222 | n/a |

- **Phone-robust emotions (F1 > 0.80):** anger (0.836), surprise (0.856). High-energy acoustic cues survive the band-limit.
- **Phone-degraded (largest drop):** joy (0.946 → 0.585, −0.361), sadness (0.868 → 0.559, −0.309). Subtler pitch/timbre cues degrade hardest.
- **Text emotion sanity check (DistilRoBERTa on j-hartmann's held-out set):** 95.2% accuracy.

Full report: [`docs/stage2/english-evaluation-report.md`](docs/stage2/english-evaluation-report.md)

### Evaluation: Korean (KcELECTRA fine-tuning)

- Fine-tuned on the AI Hub *감성 대화 말뭉치* (emotion dialogue corpus) on Colab GPU.
- Macro F1: **0.20 (base) → 0.65 (fine-tuned)**, a 3.25× improvement.
- Discovered an unlabeled "joy" cluster in the dataset and applied class weighting to recover minority classes.

### Fusion weight training (v2, 2026-04-19)

Version 2 replaces the greedy per-emotion grid search with **PyTorch gradient descent**: each class learns an audio weight `w_a = sigmoid(α)`, and all weights are optimized jointly with cross-entropy + L2 regularization on a held-out val split.

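
As a minimal sketch of what a per-class audio weight does at inference time, the snippet below fuses one audio and one text distribution. It assumes the fused score is the convex combination `w_a * p_audio + (1 - w_a) * p_text` followed by renormalization, which the description above does not spell out. The `fuse` function and the example `alpha` values are hypothetical, not the trained weights in `src/common/constants.py`.

```python
import math

CLASSES = ["neutral", "joy", "sadness", "anger", "surprise", "fear", "disgust"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse(p_audio: dict[str, float], p_text: dict[str, float],
         alpha: dict[str, float]) -> dict[str, float]:
    """Blend audio and text probabilities with one weight per class."""
    scores = {}
    for c in CLASSES:
        w_a = sigmoid(alpha[c])  # audio weight in (0, 1); text gets 1 - w_a
        scores[c] = w_a * p_audio[c] + (1.0 - w_a) * p_text[c]
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}  # renormalize to sum to 1


# ASSUMPTION: made-up parameters. alpha = 0 trusts both channels equally.
alpha = {c: 0.0 for c in CLASSES}
alpha["anger"] = 2.0   # lean on audio for anger
alpha["joy"] = -2.0    # lean on text for joy

p_audio = {"neutral": 0.10, "joy": 0.05, "sadness": 0.05, "anger": 0.60,
           "surprise": 0.10, "fear": 0.05, "disgust": 0.05}
p_text = {"neutral": 0.30, "joy": 0.40, "sadness": 0.05, "anger": 0.10,
          "surprise": 0.05, "fear": 0.05, "disgust": 0.05}

fused = fuse(p_audio, p_text, alpha)
top = max(fused, key=fused.get)
print(top, round(fused[top], 3))
```

During training, `alpha` would be the learnable parameter; the sigmoid keeps every weight strictly between 0 and 1 without an explicit constraint.
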
| Language | Training data | Samples | Val Macro F1 (v1 → v2) |
|----------|---------------|---------|------------------------|
| English | JL-Corpus + SAVEE + MELD + RAVDESS phone | 2,821 | 0.6295 → **0.7596 (+12.67%p)** |
| Korean | AI Hub 263 val | 1,294 | 0.8736 → **0.8748 (tie)** |

Biggest single win: English `fear` F1 rose from **0.04 → 0.67** (greedy's `audio_w=0.00` was a local trap).

Per-language weight tables: `src/common/constants.py` (`EMOTION_FUSION_WEIGHTS_KO` / `_EN`). Details: [`docs/stage2/fusion-weights-english-grid-search.md`](docs/stage2/fusion-weights-english-grid-search.md).

### End-to-end test: MELD (Friends TV dialogue)

We built 8 scenario WAVs from MELD (anger, joy, sadness, surprise, fear, bittersweet, annoyance, calm) and ran them through the deployed server.

| Metric | Result |
|--------|--------|
| Pipeline success | **8 / 8** files completed Stage 1 → 2 → 3 |
| Exact top-1 emotion label match | **7 / 7** (one tie) |
| Average processing time | ~2 min / file (HF Spaces CPU) |

### End-to-end test: 20Hours Korean demo (2026-04-20)

A companion Korean E2E set curated from the 20Hours Korean Conversational Speech dataset (M-F pairs only): 7 scenarios of about 1 minute each, prepared for the live demo.

| Metric | Result |
|--------|--------|
| Pipeline success | **7 / 7** files completed Stage 1 → 2 → 3 |
| Intended-emotion match (one speaker) | **4 / 7** |
| Source | `data/20hours_test/` + `scripts/test_20hours_e2e_server.py` |

Note: the 20Hours source is ASR-training data without emotion labels, so `ground_truth.json` lists *intended demo emotions* (for graph visualization), not ground-truth annotations.

---

## Stage 3–4: App + Server (Youngkyun)

### Stage 3: Reaction · Garden · Recap

`src/stage3/process.py` takes the Stage 2 output and runs three independent modules:

| Module | Role | Key logic |
|--------|------|-----------|
| `character_mapping.py` | Buckets each speaker's 7-class emotion into 4 moods (up / calm / down / tense) and looks up the pair cell in a 4×4 matrix → one of 9 pair states + a giver role | `joy → up`, `anger → tense`, `surprise/fear/disgust` resolved via residual distribution; `up × down → comforting (giver=A)`, `tense × tense → back_turned`, `calm × calm → idle`, ... |
| `character_mapping.py` (intensity) | Emits a `CharacterReaction.intensity` in `{1, 2, 3}` used by the app for healing-cycle thresholds | Default 2 → 3 cycles to heal; 1 → 4 cycles; 3 → 2 cycles |
| `garden_logic.py` | Computes growth delta (0–3) from call quality | `positive_ratio ≥ 0.5` → +3, `pos ≥ 0.3 && neg < 0.3` → +2, `neg ≥ 0.5` → 0 |
| `recap_generator.py` | Generates the narrative recap card via the Claude API | System prompt requires one concrete hook from the transcript (topic, decision, shared joke) + a light garden-voice framing; titles are call-specific, not combo templates. Rule-based template fallback when no API key. |

**Mood resolution:** `neg ≥ 0.5` → `conflict`; `neg ≥ 0.3 && pos < 0.3` → `recovering`; `level ≤ 1 && pos < 0.3` → `neglected`; otherwise `happy`. **Level only goes up:** tough calls shift the tint, never the count.

**Confidence gate:** a speaker whose top emotion probability is below 0.5 is demoted to `calm` for pair-state lookup. The 7-class mood chip on the Results screen can therefore differ from the pair-state particle: the chip shows the dominant label, while the particle shows the matrix cell after gating.

### Stage 4: FastAPI backend

- **Framework:** FastAPI + SQLAlchemy + SQLite (local) · 4 tables: `calls`, `analysis_results`, `checkins`, `garden_state`.
- **Async pipeline:** `POST /api/upload`, then `POST /api/analyze?call_id=X`, which returns 202 Accepted and starts a background thread; the client then polls `GET /api/calls/{id}`.
- **Mock path:** drop `data/samples/{call_id}_stage2.json` and the API skips Stage 1–2, running only Stage 3. This is useful for E2E testing without the heavy ML deps.
- **Endpoints (6):** `/api/upload`, `/api/analyze`, `/api/calls`, `/api/calls/{id}`, `/api/checkins`, `/api/garden`. `/api/calls` extracts `recap_card.title` from the stored Stage 3 JSON so the Home and History feeds can headline each card with a distinct title.
- **Tests:** 12 API tests on in-memory SQLite via pytest.

### React Native (Expo) app

**Router:** `expo-router` with 3 tabs (`Us`, `History`, `Settings`) plus modal routes (`checkin`, `results/[callId]`).

| Screen | Description |
|--------|-------------|
| **Onboarding** | Paper-deck intro ("A garden for two") that sets the metaphor before the first call lands |
| **Home (`Us`)** | Live character scene + garden, recent-call feed, mailbox entry point |
| **Seed bloom alert** | Level-up moment; fires when the interaction count crosses a garden threshold, offering `View` / `Later` |
| **Check-in** | 2-step prompt: my mood, then my guess for the partner's mood (empathic accuracy) |
| **Results: emotion analysis** | `My mood` / `Their mood` chips, suggestion line in the garden-voice, `Emotional Landscape` wave (uplifting → heavy), and `Moments that mattered` per-speaker slices |
| **Results: recap card** | Call-specific LLM title + narrative + highlights + per-call `Our Garden` delta, followed by the `Was this accurate?` thumbs-up/down feedback loop |
| **History** | `Our Emotional Flow` chart (Me vs Partner), garden delta summary, recent moment cards headlined by the call-specific recap title |
| **Settings** | Language toggle (ko/en), developer mode, dev tools |

**Character animation layers** (all built on `react-native-reanimated` + `react-native-svg`):

1. **Idle breathing:** per-mood rhythm profile (up / calm / down / tense), uniform `scaleAmp` 1.02, micro-bob, micro-sway (1.4s cycle).
2. **Eye blink:** 2.8–5.5s interval, 15% chance of a double-blink.
3. **Emotion transition:** joy = bounce + arms **raised**, sadness = sink + drooping arms, anger / disgust = crossed arms (no tilt), surprise = pop, fear = shrink. Body rotation is globally forbidden; emotion reads via face, pupil, body pose, and idle rhythm instead.
4. **Pair-state body pose:** the `comforting` giver moves 60% closer with no tilt; the receiver gets a gentle sag + `pupilOffset` (head-lowering effect on eyes).
5. **Pair-state particle signature:** one preset per 4×4 matrix cell:
   - `comforting` → staged echo: 1 heart (giver → receiver) + 1 bubble dot + 1 translucent heart (bezier engine, `useParticles`).
   - `dancing` → 1 music note rising above the heads.
   - `cheering` / `listening` / `sitting_together` / `defusing` → yellow sparkle cluster.
   - `tension` / `back_turned` → grey sigh cloud.
   - `idle` (calm × calm) → silent; the couple just wanders.
6. **Generalised healing:** every giver-present cell (comforting, listening, defusing, cheering) counts interaction cycles; when the intensity-scaled threshold is hit, the receiver transitions to neutral. If the giver's solo state was negative-caring (fear / anger / sadness / disgust), they heal too, so no one is left behind.
7. **Tap interaction:** spring bounce + floating hearts / sparkles (`FloatingHearts` component).
8. **Wander:** characters roam between pre-validated waypoints (3–8s move, 2–5s pause), with direction-aware facing (scaleX flip for left/right, back-view for upward movement).

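
The garden rules described under Stage 3 and the level table in the Demo section can be sketched in a few lines. This is an illustration rather than the contents of `garden_logic.py`: the function names are made up, the order in which the growth rules are checked is an assumption, adding the growth delta to the cumulative interaction count is an assumption, and the `+1` fallback for calls that match none of the listed growth rules is a guess, since the README does not state it.

```python
LEVEL_FLOORS = [(25, 5), (15, 4), (8, 3), (3, 2), (0, 1)]  # (min interactions, level)


def garden_level(interactions: int) -> int:
    """Map cumulative interactions to a level using the Demo table thresholds."""
    for floor, level in LEVEL_FLOORS:
        if interactions >= floor:
            return level
    raise ValueError("interactions must be non-negative")


def growth_delta(pos: float, neg: float) -> int:
    """Growth from one call, given positive and negative emotion ratios."""
    if pos >= 0.5:
        return 3
    if pos >= 0.3 and neg < 0.3:
        return 2
    if neg >= 0.5:
        return 0
    return 1  # ASSUMPTION: fallback value is not specified in the README


def resolve_mood(level: int, pos: float, neg: float) -> str:
    """Apply the mood rules in the order the README lists them."""
    if neg >= 0.5:
        return "conflict"
    if neg >= 0.3 and pos < 0.3:
        return "recovering"
    if level <= 1 and pos < 0.3:
        return "neglected"
    return "happy"


count = 6  # hypothetical starting point: a Level 2 garden
for pos, neg in [(0.6, 0.1), (0.1, 0.7), (0.35, 0.2)]:
    count += growth_delta(pos, neg)  # ASSUMPTION: delta adds to the count
    level = garden_level(count)
    print(count, level, resolve_mood(level, pos, neg))
```

Because the delta is never negative, the level can only rise in this sketch, while a tense call changes only the mood string. That matches the "level only goes up" rule.
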
**Garden rendering (SVG):** Sky, Ground, Trees (level-specific types and counts), Flowers (distributed across four quadrants), Creatures (butterflies, rabbits, birds; Level 4 and above).

**i18n:** a custom `LocaleContext` drives Korean/English switching with `@ustwo/locale` persisted to AsyncStorage.

### State persistence

| Store | Data | Key |
|-------|------|-----|
| SQLite (server) | calls, analysis_results, checkins, garden_state | n/a |
| AsyncStorage (app) | locale, dev_mode, garden (interactionCount + lastPositiveRatio + lastInteractionDate), checkins (local cache) | `@ustwo/*` |

---

## Repository map

```
src/stage1/           Speaker diarization + ASR (Seungjae)
src/stage2/           Audio + text emotion + fusion (Seungjae)
src/stage3/           Character, garden, recap (Youngkyun)
src/stage4/           FastAPI + SQLite + orchestration (Youngkyun)
src/common/           Pydantic schemas shared across stages
app/                  React Native (Expo) app (Youngkyun)
  src/routes/         expo-router (tabs, checkin, results, dev)
  src/components/     characters, garden, scenes, layout
  src/contexts/       Locale, Garden, DevMode
  src/hooks/          Animation hooks (idle, blink, emotion, tap, wander)
notebooks/            KcELECTRA fine-tuning (Colab)
tests/                pytest (73 Python tests) + jest (71 JS tests)
docs/stage{1,2,3,4}/  Per-stage technical docs
docs/images/          README screenshots
config.yaml           Global config (model paths, thresholds)
Dockerfile            HF Spaces deployment (Python 3.12 + torch CPU + ffmpeg)
```

---

## Getting started

### 1. Clone

```bash
git clone https://github.com/boolooppang/UsTwo.git
cd UsTwo
```

### 2. Backend (Python)

```bash
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=your_huggingface_token   # required by pyannote
uvicorn src.stage4.main:app --reload --port 8000
# → POST /api/upload, POST /api/analyze?call_id=X, GET /api/calls
```

### 3. App (React Native / Expo)

```bash
cd app && npm install
npx expo start --dev-client
# Press 'i' for iOS simulator, or scan the QR code with a real device
```

To point the app at a different server, edit `API_BASE` in `app/src/api/client.ts` (it currently defaults to the HF Spaces URL).

### 4. Tests

```bash
python -m pytest tests/ -v   # 73 Python tests
cd app && npx jest           # 71 JS tests (144 total)
```

---

## Deployment: HuggingFace Spaces (Docker)

The API server is deployed to HuggingFace Spaces as a Docker image, running the full pipeline (Stage 1 → 2 → 3).

- **Live URL:** `https://bbbakery-ustwo-api.hf.space`
- **Environment variables:** `HF_TOKEN`, `ANTHROPIC_API_KEY` (set in Spaces Settings).
- **Config files:** [`Dockerfile`](Dockerfile), [`railway.toml`](railway.toml), [`requirements-deploy.txt`](requirements-deploy.txt).

```bash
# Local Docker test
docker build -t ustwo .
docker run -p 7860:7860 -e HF_TOKEN=your_token ustwo
```

---

## Team & ownership

| Member | Role | Responsibility |
|--------|------|----------------|
| **Juhyun** | Product Owner / Researcher | Product design, UX research, empathic-accuracy literature review, emotion → character mapping spec, evaluation plan, final paper |
| **Seungjae** | ML Engineer | Stage 1–2 end to end: diarization, ASR, audio emotion, text emotion (ko/en), RAVDESS/MELD evaluation, KcELECTRA fine-tuning |
| **Youngkyun** | App Engineer | Stage 3–4 end to end: character mapping implementation, garden logic, LLM recap, FastAPI + SQLite server, React Native app, HF Spaces Docker deployment |

---

## Success criteria

**MVP (achieved):**

- ✅ Upload a call recording → character reaction + recap card within ~2 minutes.
- ✅ Happy vs tense calls produce visibly different character reactions (9 pair states).
- ✅ Cumulative positive calls grow the garden through 5 levels.
- ✅ Bilingual pipeline (Korean + English) with automatic language detection.
- ✅ MELD-based end-to-end test: 8/8 pipeline success, 7/7 exact emotion-label match.

---

## License

MIT