---
title: UsTwo API
emoji: 💑
colorFrom: pink
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# UsTwo: your characters react to your calls

> **CMU 11-775 Large Scale Multimedia Analysis**, Team Project
>
> 3-person team: Juhyun (PO / Research) · Seungjae (ML) · Youngkyun (App)

UsTwo takes a recorded phone call between two people (a couple, friends, or family) and runs **multimodal analysis (audio + text)** to estimate how each speaker felt. It then produces three things inside a **React Native + FastAPI** mobile app: a **character reaction scene**, a **growing emotion garden**, and an **LLM-written recap card**.

The goal isn't just emotion classification. It's to visualize *"what kind of moment did these two share on this call?"*

*Screenshots: Home · Recap card · History*

---

## At a glance

```
[call recording .wav/.m4a]
        │
        ▼
┌──────────────────────── Stage 1 (Seungjae) ────────────────────────┐
│ pyannote 4.x (3.1)  →  WhisperX large-v3-turbo INT8  →  ko/en      │
│ speaker diarization    ASR + forced alignment           LID        │
└────────────────────────────────────────────────────────────────────┘
        │ segments: [{speaker, start, end, text, lang}]
        ▼
┌──────────────────────── Stage 2 (Seungjae) ────────────────────────┐
│ emotion2vec LoRA (ONNX) + KcELECTRA LoRA (ko) / DistilRoBERTa      │
│ audio emotion (7-class)   text emotion (7-class ko / 7-class en)   │
│                                                                    │
│ fusion: per-language, per-class trained weights (v2)               │
└────────────────────────────────────────────────────────────────────┘
        │ per-speaker emotion distribution
        ▼
┌──────────────────────── Stage 3 (Youngkyun) ───────────────────────┐
│ character_mapping + garden_logic + recap_generator                 │
│ 9 pair interactions 5 levels · 4 moods  Claude LLM                 │
└────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌──────────────────────── Stage 4 (Youngkyun) ───────────────────────┐
│ FastAPI + SQLite  ↔  React Native (Expo)                           │
│ 6 endpoints, async   expo-router · SVG                             │
└────────────────────────────────────────────────────────────────────┘
```

**End-to-end latency:** a 1-minute call finishes in about 2 minutes on the HF Spaces CPU deployment.
**Live server:** `https://bbbakery-ustwo-api.hf.space`

---

## Demo: the garden grows

The emotion garden levels up as positive-ratio interactions accumulate. Level 1 shows just two seedlings. After many positive calls, Level 5 is a full bloom with flowers, trees, and creatures.

*Screenshots: Level 1 (seedling) · Level 5 (full bloom)*

| Level | Threshold (cumulative interactions) | Visual |
|-------|-------------------------------------|--------|
| 1 | 0–2 | Two seedlings |
| 2 | 3–7 | Grass + small flowers |
| 3 | 8–14 | Trees + many flowers |
| 4 | 15–24 | Lush garden + creatures |
| 5 | 25+ | Sunset sky + butterflies, rabbits + full bloom |

A **mood overlay** (`happy` · `neglected` · `recovering` · `conflict`) is layered on top, decided by the recent positive/negative ratio and the days since the last interaction. A healthy garden that's been ignored for 5+ days turns `neglected`; a recent call with heavy negative emotion pushes it to `conflict`.

---

## Stage 1–2: ML Pipeline (Seungjae)

### Stage 1: Diarization + ASR

| Component | Model | Config |
|-----------|-------|--------|
| VAD | Silero VAD | onset 0.5, min_speech 0.25s |
| Diarization | pyannote-audio 4.x (3.1) | HF token required |
| ASR | Faster-Whisper (large-v3-turbo) | INT8 quantized |
| Forced alignment | WhisperX wav2vec2 | per-word timestamps |
| Language ID | Whisper auto-detect + SenseVoice-Small | ko / en / unknown |

**Output:** `Stage1Output` with `segments: [{speaker, start, end, text, lang}]`, `processing_info`, and `models`. Entry point: `src/stage1/process.py`.
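
To make the Stage 1 output shape concrete, here is a minimal sketch using standard-library dataclasses. The field names come from the output description above; the field types, the `SPEAKER_00` label format, and the `talk_time_by_speaker` helper are assumptions for illustration, and the repository's own schemas are Pydantic models in `src/common/`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Segment:
    speaker: str  # diarization label; the "SPEAKER_00" format is assumed
    start: float  # segment start, assumed to be in seconds
    end: float    # segment end, assumed to be in seconds
    text: str     # ASR transcript for this segment
    lang: str     # "ko", "en", or "unknown"


@dataclass
class Stage1Output:
    segments: list[Segment]
    processing_info: dict[str, Any] = field(default_factory=dict)  # type assumed
    models: dict[str, str] = field(default_factory=dict)           # type assumed


def talk_time_by_speaker(output: Stage1Output) -> dict[str, float]:
    """Sum segment durations per speaker (hypothetical helper, not in the repo)."""
    totals: dict[str, float] = {}
    for seg in output.segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


example = Stage1Output(
    segments=[
        Segment("SPEAKER_00", 0.0, 2.5, "Hi, how was your day?", "en"),
        Segment("SPEAKER_01", 2.5, 4.0, "Pretty good!", "en"),
        Segment("SPEAKER_00", 4.0, 5.0, "Nice.", "en"),
    ]
)
print(talk_time_by_speaker(example))
```

Per-speaker aggregation like this is the kind of grouping Stage 2 needs before it can report a per-speaker emotion distribution.
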
### Stage 2: Emotion Recognition (bilingual)

| Channel | Model | Classes |
|---------|-------|---------|
| Audio emotion | `emotion2vec_plus_base` + **LoRA fine-tuned (ONNX)** | 7: neutral, joy, sadness, anger, surprise, fear, disgust |
| Text emotion (Korean) | **KcELECTRA-base-v2022 + PEFT LoRA (ONNX)** | 7: neutral, joy, sadness, anger, surprise, fear, disgust |
| Text emotion (English) | DistilRoBERTa (j-hartmann/emotion-english, zero-shot) | 7: 6 Ekman emotions + neutral |
| Fusion | **Per-language, per-class weights trained via gradient descent** | `EMOTION_FUSION_WEIGHTS_KO` / `_EN` |

### Evaluation: English (RAVDESS, n=2,880)

| Condition | Accuracy | Macro F1 | Latency |
|-----------|----------|----------|---------|
| Clean (studio) | **93.2%** | **0.932** | 250 ms |
| Phone (PSTN simulation, 300–3400 Hz band-limit) | **71.5%** | **0.710** | 242 ms |
| Degradation | −21.7 pp | −0.222 | n/a |

- **Phone-robust emotions (F1 > 0.80):** anger (0.836), surprise (0.856). High-energy acoustic cues survive the band-limit.
- **Phone-degraded (largest drop):** joy (0.946 → 0.585, −0.361), sadness (0.868 → 0.559, −0.309). Subtler pitch/timbre cues degrade hardest.
- **Text emotion sanity check (DistilRoBERTa on j-hartmann's held-out set):** 95.2% accuracy.

Full report: [`docs/stage2/english-evaluation-report.md`](docs/stage2/english-evaluation-report.md)

### Evaluation: Korean (KcELECTRA fine-tuning)

- Fine-tuned on the AI Hub *감성 대화 말뭉치* (emotion dialogue corpus) on a Colab GPU.
- Macro F1: **0.20 (base) → 0.65 (fine-tuned)**, a 3.25× improvement.
- Discovered an unlabeled "joy" cluster in the dataset and applied class weighting to recover minority classes.

### Fusion weight training (v2, 2026-04-19)

The v2 fusion replaces the greedy per-emotion grid search with **PyTorch gradient descent**: it learns one audio weight `w_a = sigmoid(α)` per class, with all classes optimized jointly under cross-entropy loss plus L2 regularization on a held-out validation split.
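
As a rough illustration of how per-class weights of this form might be applied at inference time, the sketch below blends audio and text probabilities class by class. The mixing rule `w_a * p_audio + (1 - w_a) * p_text` followed by renormalization is an assumption (the README only states that `w_a = sigmoid(α)` is learned per class), and the `fuse` function and the example `alpha` values are hypothetical, not the trained weights.

```python
import math

EMOTIONS = ["neutral", "joy", "sadness", "anger", "surprise", "fear", "disgust"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse(p_audio: list[float], p_text: list[float], alpha: list[float]) -> list[float]:
    """Blend two 7-class distributions with one audio weight per class.

    Assumed rule: mixed_c = w_c * audio_c + (1 - w_c) * text_c, then renormalise
    so the fused scores sum to 1.
    """
    w_audio = [sigmoid(a) for a in alpha]
    mixed = [w * pa + (1.0 - w) * pt for w, pa, pt in zip(w_audio, p_audio, p_text)]
    total = sum(mixed)
    return [m / total for m in mixed]


# Hypothetical inputs: alpha = 0 gives w_a = 0.5 for every class (plain average).
p_audio = [0.10, 0.60, 0.10, 0.05, 0.05, 0.05, 0.05]
p_text = [0.30, 0.20, 0.10, 0.10, 0.10, 0.10, 0.10]
alpha = [0.0] * len(EMOTIONS)

fused = fuse(p_audio, p_text, alpha)
top_label = EMOTIONS[fused.index(max(fused))]
print(top_label, [round(p, 3) for p in fused])
```

Because `sigmoid` keeps each weight strictly between 0 and 1, gradient descent on `α` never pins a class exactly to audio-only or text-only.
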

| Language | Training data | Samples | Val Macro F1 (v1 → v2) |
|----------|---------------|---------|------------------------|
| English | JL-Corpus + SAVEE + MELD + RAVDESS phone | 2,821 | 0.6295 → **0.7596 (+12.67%p)** |
| Korean | AI Hub 263 val | 1,294 | 0.8736 → **0.8748 (tie)** |

Biggest single win: English `fear` F1 rose from **0.04 → 0.67** (greedy's `audio_w=0.00` was a local trap).

Per-language weight tables: `src/common/constants.py` (`EMOTION_FUSION_WEIGHTS_KO` / `_EN`). Details: [`docs/stage2/fusion-weights-english-grid-search.md`](docs/stage2/fusion-weights-english-grid-search.md).

### End-to-end test: MELD (Friends TV dialogue)

We built 8 scenario WAVs from MELD (anger, joy, sadness, surprise, fear, bittersweet, annoyance, calm) and ran them through the deployed server.

| Metric | Result |
|--------|--------|
| Pipeline success | **8 / 8** files completed Stage 1 → 2 → 3 |
| Exact top-1 emotion label match | **7 / 7** (one tie) |
| Average processing time | ~2 min / file (HF Spaces CPU) |

### End-to-end test: 20Hours Korean demo (2026-04-20)

A companion Korean E2E set curated from the 20Hours Korean Conversational Speech dataset (M-F pairs only): 7 scenarios of about 1 minute each, prepared for the live demo.

| Metric | Result |
|--------|--------|
| Pipeline success | **7 / 7** files completed Stage 1 → 2 → 3 |
| Intended-emotion match (one speaker) | **4 / 7** |
| Source | `data/20hours_test/` + `scripts/test_20hours_e2e_server.py` |

Note: the 20Hours source is ASR-training data without emotion labels, so `ground_truth.json` lists *intended demo emotions* (for graph visualization), not ground-truth annotations.

---

## Stage 3–4: App + Server (Youngkyun)

### Stage 3: Reaction · Garden · Recap

`src/stage3/process.py` takes the Stage 2 output and runs three independent modules:

| Module | Role | Key logic |
|--------|------|-----------|
| `character_mapping.py` | Buckets each speaker's 7-class emotion into 4 moods (up / calm / down / tense) and looks up the pair cell in a 4×4 matrix → one of 9 pair states + a giver role | `joy → up`, `anger → tense`, `surprise/fear/disgust` resolved via residual distribution; `up × down → comforting (giver=A)`, `tense × tense → back_turned`, `calm × calm → idle`, ... |
| `character_mapping.py` (intensity) | Emits a `CharacterReaction.intensity` in `{1, 2, 3}` used by the app for healing-cycle thresholds | Default 2 → 3 cycles to heal; 1 → 4 cycles; 3 → 2 cycles |
| `garden_logic.py` | Computes growth delta (0–3) from call quality | `positive_ratio ≥ 0.5` → +3, `pos ≥ 0.3 && neg < 0.3` → +2, `neg ≥ 0.5` → 0 |
| `recap_generator.py` | Generates the narrative recap card via the Claude API | System prompt requires one concrete hook from the transcript (topic, decision, shared joke) plus a light garden-voice framing, so titles are call-specific rather than combo templates. Falls back to a rule-based template when no API key is set. |

**Mood resolution:** `neg ≥ 0.5` → `conflict`; `neg ≥ 0.3 && pos < 0.3` → `recovering`; `level ≤ 1 && pos < 0.3` → `neglected`; otherwise `happy`.

**Level only goes up:** tough calls shift the tint, never the count.

**Confidence gate:** a speaker whose top emotion probability is below 0.5 is demoted to `calm` for pair-state lookup. The 7-class mood chip on the Results screen can therefore differ from the pair-state particle: the chip shows the dominant label, while the particle shows the matrix cell after gating.

### Stage 4: FastAPI backend

- **Framework:** FastAPI + SQLAlchemy + SQLite (local), with 4 tables: `calls`, `analysis_results`, `checkins`, `garden_state`.
- **Async pipeline:** the client calls `POST /api/upload`, then `POST /api/analyze?call_id=X`, which returns 202 Accepted and starts a background thread. The client then polls `GET /api/calls/{id}` until the result is ready.
- **Mock path:** drop `data/samples/{call_id}_stage2.json` and the API skips Stage 1–2, running only Stage 3. This is useful for E2E testing without the heavy ML dependencies.
- **Endpoints (6):** `/api/upload`, `/api/analyze`, `/api/calls`, `/api/calls/{id}`, `/api/checkins`, `/api/garden`. `/api/calls` extracts `recap_card.title` from the stored Stage 3 JSON so the Home and History feeds can headline each card with a distinct title.
- **Tests:** 12 API tests on in-memory SQLite via pytest.

### React Native (Expo) app

**Router:** `expo-router` with 3 tabs (`Us`, `History`, `Settings`) plus modal routes (`checkin`, `results/[callId]`).

| Screen | Description |
|--------|-------------|
| **Onboarding** | Paper-deck intro ("A garden for two") that sets the metaphor before the first call lands |
| **Home (`Us`)** | Live character scene + garden, recent-call feed, mailbox entry point |
| **Seed bloom alert** | Level-up moment; fires when the interaction count crosses a garden threshold, offering `View` / `Later` |
| **Check-in** | 2-step prompt: my mood, then my guess for the partner's mood (empathic accuracy) |
| **Results: emotion analysis** | `My mood` / `Their mood` chips, suggestion line in the garden-voice, `Emotional Landscape` wave (uplifting → heavy), and `Moments that mattered` per-speaker slices |
| **Results: recap card** | Call-specific LLM title + narrative + highlights + per-call `Our Garden` delta, followed by the `Was this accurate?` thumbs-up/down feedback loop |
| **History** | `Our Emotional Flow` chart (Me vs Partner), garden delta summary, recent moment cards headlined by the call-specific recap title |
| **Settings** | Language toggle (ko/en), developer mode, dev tools |

**Character animation layers** (all built on `react-native-reanimated` + `react-native-svg`):

1. **Idle breathing:** per-mood rhythm profile (up / calm / down / tense), uniform `scaleAmp` 1.02, micro-bob, micro-sway (1.4s cycle).
2. **Eye blink:** 2.8–5.5s interval, 15% chance of a double-blink.
3. **Emotion transition:** joy = bounce + arms **raised**, sadness = sink + drooping arms, anger / disgust = crossed arms (no tilt), surprise = pop, fear = shrink. Body rotation is globally forbidden; emotion reads via face, pupil, body pose, and idle rhythm instead.
4. **Pair-state body pose:** the `comforting` giver moves 60% closer with no tilt; the receiver gets a gentle sag + `pupilOffset` (head-lowering effect on eyes).
5. **Pair-state particle signature:** one preset per 4×4 matrix cell:
   - `comforting` → staged echo: 1 heart (giver → receiver) + 1 bubble dot + 1 translucent heart (bezier engine, `useParticles`).
   - `dancing` → 1 music note rising above the heads.
   - `cheering` / `listening` / `sitting_together` / `defusing` → yellow sparkle cluster.
   - `tension` / `back_turned` → grey sigh cloud.
   - `idle` (calm × calm) → silent; the couple just wanders.
6. **Generalised healing:** every giver-present cell (comforting, listening, defusing, cheering) counts interaction cycles; when the intensity-scaled threshold is hit, the receiver transitions to neutral. If the giver's solo state was negative-caring (fear / anger / sadness / disgust), they heal too, so no one is left behind.
7. **Tap interaction:** spring bounce + floating hearts / sparkles (`FloatingHearts` component).
8. **Wander:** characters roam between pre-validated waypoints (3–8s move, 2–5s pause), with direction-aware facing (scaleX flip for left/right, back-view for upward movement).

**Garden rendering (SVG):** Sky, Ground, Trees (level-specific types and counts), Flowers (distributed across four quadrants), Creatures (butterflies, rabbits, birds; Level 4 and above).
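
The garden rules stated in this README (level thresholds, growth delta, mood resolution, and healing cycles) can be sketched in a few lines of Python. This is a minimal sketch rather than the repository's `garden_logic.py`: the function names are hypothetical, the order in which the rules are checked follows the order they are listed in, and the `+1` fallback in `growth_delta` is an assumption, since the README lists only the +3, +2, and 0 cases.

```python
from bisect import bisect_right

# Lower bounds of levels 2..5, from the level table (cumulative interactions).
LEVEL_THRESHOLDS = [3, 8, 15, 25]

# intensity -> cycles to heal, from the Stage 3 table.
HEALING_CYCLES = {1: 4, 2: 3, 3: 2}


def level_for_count(interaction_count: int) -> int:
    """Map a cumulative interaction count to a garden level (1-5)."""
    return 1 + bisect_right(LEVEL_THRESHOLDS, interaction_count)


def growth_delta(pos: float, neg: float) -> int:
    """Growth delta (0-3) from the positive and negative ratios of one call."""
    if pos >= 0.5:
        return 3
    if pos >= 0.3 and neg < 0.3:
        return 2
    if neg >= 0.5:
        return 0
    return 1  # assumption: the README does not state the remaining case


def resolve_mood(pos: float, neg: float, level: int) -> str:
    """Mood overlay, checked in the order the README lists the rules."""
    if neg >= 0.5:
        return "conflict"
    if neg >= 0.3 and pos < 0.3:
        return "recovering"
    if level <= 1 and pos < 0.3:
        return "neglected"
    return "happy"


for count in (0, 3, 8, 15, 25):
    print(count, "->", level_for_count(count))
print(growth_delta(0.6, 0.1), resolve_mood(0.1, 0.6, 3), HEALING_CYCLES[2])
```

Keeping the thresholds in one sorted list means a change to the level table touches a single constant.
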

**i18n:** a custom `LocaleContext` drives Korean/English switching, with the selected locale persisted to AsyncStorage under `@ustwo/locale`.

### State persistence

| Store | Data | Key |
|-------|------|-----|
| SQLite (server) | calls, analysis_results, checkins, garden_state | n/a |
| AsyncStorage (app) | locale, dev_mode, garden (interactionCount + lastPositiveRatio + lastInteractionDate), checkins (local cache) | `@ustwo/*` |

---

## Repository map

```
src/stage1/           Speaker diarization + ASR (Seungjae)
src/stage2/           Audio + text emotion + fusion (Seungjae)
src/stage3/           Character, garden, recap (Youngkyun)
src/stage4/           FastAPI + SQLite + orchestration (Youngkyun)
src/common/           Pydantic schemas shared across stages
app/                  React Native (Expo) app (Youngkyun)
  src/routes/         expo-router (tabs, checkin, results, dev)
  src/components/     characters, garden, scenes, layout
  src/contexts/       Locale, Garden, DevMode
  src/hooks/          Animation hooks (idle, blink, emotion, tap, wander)
notebooks/            KcELECTRA fine-tuning (Colab)
tests/                pytest (73 Python tests) + jest (71 JS tests)
docs/stage{1,2,3,4}/  Per-stage technical docs
docs/images/          README screenshots
config.yaml           Global config (model paths, thresholds)
Dockerfile            HF Spaces deployment (Python 3.12 + torch CPU + ffmpeg)
```

---

## Getting started

### 1. Clone

```bash
git clone https://github.com/boolooppang/UsTwo.git
cd UsTwo
```

### 2. Backend (Python)

```bash
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=your_huggingface_token   # required by pyannote
uvicorn src.stage4.main:app --reload --port 8000
# → POST /api/upload, POST /api/analyze?call_id=X, GET /api/calls
```

### 3. App (React Native / Expo)

```bash
cd app && npm install
npx expo start --dev-client
# Press 'i' for iOS simulator, or scan the QR code with a real device
```

To point the app at a different server, edit `API_BASE` in `app/src/api/client.ts` (it currently defaults to the HF Spaces URL).

### 4. Tests

```bash
python -m pytest tests/ -v   # 73 Python tests
cd app && npx jest           # 71 JS tests (144 total)
```

---

## Deployment: HuggingFace Spaces (Docker)

The API server is deployed to HuggingFace Spaces as a Docker image, running the full pipeline (Stage 1 → 2 → 3).

- **Live URL:** `https://bbbakery-ustwo-api.hf.space`
- **Environment variables:** `HF_TOKEN`, `ANTHROPIC_API_KEY` (set in Spaces Settings).
- **Config files:** [`Dockerfile`](Dockerfile), [`railway.toml`](railway.toml), [`requirements-deploy.txt`](requirements-deploy.txt).

```bash
# Local Docker test
docker build -t ustwo .
docker run -p 7860:7860 -e HF_TOKEN=your_token ustwo
```

---

## Team & ownership

| Member | Role | Responsibility |
|--------|------|----------------|
| **Juhyun** | Product Owner / Researcher | Product design, UX research, empathic-accuracy literature review, emotion → character mapping spec, evaluation plan, final paper |
| **Seungjae** | ML Engineer | Stage 1–2 end to end: diarization, ASR, audio emotion, text emotion (ko/en), RAVDESS/MELD evaluation, KcELECTRA fine-tuning |
| **Youngkyun** | App Engineer | Stage 3–4 end to end: character mapping implementation, garden logic, LLM recap, FastAPI + SQLite server, React Native app, HF Spaces Docker deployment |

---

## Success criteria

**MVP (achieved):**

- ✅ Upload a call recording → character reaction + recap card within ~2 minutes.
- ✅ Happy vs tense calls produce visibly different character reactions (9 pair states).
- ✅ Cumulative positive calls grow the garden through 5 levels.
- ✅ Bilingual pipeline (Korean + English) with automatic language detection.
- ✅ MELD-based end-to-end test: 8/8 pipeline success, 7/7 exact emotion-label match.

---

## License

MIT