---
title: UsTwo API
emoji: π
colorFrom: pink
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# UsTwo: your characters react to your calls
> **CMU 11-775 Large Scale Multimedia Analysis**, Team Project
> 3-person team: Juhyun (PO / Research) · Seungjae (ML) · Youngkyun (App)
UsTwo takes a recorded phone call between two people (couple, friends, family) and runs **multimodal analysis (audio + text)** to understand how each speaker felt. It then produces three things inside a **React Native + FastAPI** mobile app: a **character reaction scene**, a **growing emotion garden**, and an **LLM-written recap card**.
The goal isn't just emotion classification. It's to visualize *"what kind of moment did these two share on this call?"*
---
## At a glance
```
[call recording .wav/.m4a]
        │
        ▼
┌─────────────────────── Stage 1 (Seungjae) ───────────────────────┐
│ pyannote 4.x (3.1) → WhisperX large-v3-turbo INT8 → ko/en        │
│ speaker diarization   ASR + forced alignment         LID          │
└──────────────────────────────────────────────────────────────────┘
        │ segments: [{speaker, start, end, text, lang}]
        ▼
┌─────────────────────── Stage 2 (Seungjae) ───────────────────────┐
│ emotion2vec LoRA (ONNX) + KcELECTRA LoRA (ko) / DistilRoBERTa    │
│ audio emotion (7-class)   text emotion (7-class ko / 7-class en) │
│                                                                  │
│ fusion: per-language, per-class trained weights (v2)             │
└──────────────────────────────────────────────────────────────────┘
        │ per-speaker emotion distribution
        ▼
┌────────────────────── Stage 3 (Youngkyun) ───────────────────────┐
│ character_mapping   + garden_logic       + recap_generator       │
│ 9 pair interactions   5 levels · 4 moods   Claude LLM            │
└──────────────────────────────────────────────────────────────────┘
        │
        ▼
┌────────────────────── Stage 4 (Youngkyun) ───────────────────────┐
│ FastAPI + SQLite   → React Native (Expo)                         │
│ 6 endpoints, async   expo-router · SVG                           │
└──────────────────────────────────────────────────────────────────┘
```
**End-to-end latency:** a 1-minute call finishes in about 2 minutes on the HF Spaces CPU deployment.
**Live server:** `https://bbbakery-ustwo-api.hf.space`
---
## Demo: the garden grows
The emotion garden levels up as positive-ratio interactions accumulate. Level 1 shows just two seedlings. After many positive calls, Level 5 is a full bloom with flowers, trees, and creatures.
| Level | Threshold (cumulative interactions) | Visual |
|-------|-------------------------------------|--------|
| 1 | 0–2 | Two seedlings |
| 2 | 3–7 | Grass + small flowers |
| 3 | 8–14 | Trees + many flowers |
| 4 | 15–24 | Lush garden + creatures |
| 5 | 25+ | Sunset sky + butterflies, rabbits + full bloom |
A **mood overlay** (`happy` · `neglected` · `recovering` · `conflict`) is layered on top, decided by the recent positive/negative ratio and the days since the last interaction. A healthy garden that's been ignored for 5+ days turns `neglected`; a recent call with heavy negative emotion pushes it to `conflict`.
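
The level thresholds in the table above can be read as a simple lookup. The following is a minimal sketch, not the project's code: `LEVEL_LOWER_BOUNDS` and `garden_level` are hypothetical names, and only the numeric thresholds come from the table.

```python
from bisect import bisect_right

# Lower bound of each level (1..5), copied from the threshold table.
LEVEL_LOWER_BOUNDS = [0, 3, 8, 15, 25]


def garden_level(interaction_count: int) -> int:
    """Return the garden level (1-5) for a cumulative interaction count."""
    if interaction_count < 0:
        raise ValueError("interaction_count must be non-negative")
    # bisect_right counts how many lower bounds are <= the count,
    # which is exactly the level number.
    return bisect_right(LEVEL_LOWER_BOUNDS, interaction_count)
```

For example, `garden_level(7)` falls in the 3–7 band (Level 2) and `garden_level(8)` crosses into Level 3.
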
---
## Stage 1–2: ML Pipeline (Seungjae)
### Stage 1: Diarization + ASR
| Component | Model | Config |
|-----------|-------|--------|
| VAD | Silero VAD | onset 0.5, min_speech 0.25s |
| Diarization | pyannote-audio 4.x (3.1) | HF token required |
| ASR | Faster-Whisper (large-v3-turbo) | INT8 quantized |
| Forced alignment | WhisperX wav2vec2 | per-word timestamps |
| Language ID | Whisper auto-detect + SenseVoice-Small | ko / en / unknown |
**Output:** `Stage1Output`, with fields `segments: [{speaker, start, end, text, lang}]`, `processing_info`, and `models`. Entry point: `src/stage1/process.py`.
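
To make the output shape concrete, here is a rough sketch of the structure using standard-library dataclasses. The field names come from the description above; the field types, the contents of `processing_info` and `models`, and the helper `talk_time_by_speaker` are assumptions for illustration (the repository's real schemas are Pydantic models in `src/common/`).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Segment:
    speaker: str   # diarization label; the exact label format is an assumption
    start: float   # seconds from the start of the recording
    end: float     # seconds from the start of the recording
    text: str      # transcript for this segment
    lang: str      # "ko", "en", or "unknown"


@dataclass
class Stage1Output:
    segments: list[Segment]
    processing_info: dict = field(default_factory=dict)  # contents assumed
    models: dict = field(default_factory=dict)           # contents assumed


def talk_time_by_speaker(output: Stage1Output) -> dict[str, float]:
    """Hypothetical helper: total speaking time per speaker, in seconds."""
    totals: dict[str, float] = defaultdict(float)
    for seg in output.segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)
```

A downstream stage only needs `segments`; the helper shows how per-speaker statistics can be derived from the `speaker`, `start`, and `end` fields alone.
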
### Stage 2: Emotion Recognition (bilingual)
| Channel | Model | Classes |
|---------|-------|---------|
| Audio emotion | `emotion2vec_plus_base` + **LoRA fine-tuned (ONNX)** | 7: neutral, joy, sadness, anger, surprise, fear, disgust |
| Text emotion (Korean) | **KcELECTRA-base-v2022 + PEFT LoRA (ONNX)** | 7: neutral, joy, sadness, anger, surprise, fear, disgust |
| Text emotion (English) | DistilRoBERTa (j-hartmann/emotion-english, zero-shot) | 7: 6 Ekman emotions + neutral |
| Fusion | **Per-language, per-class weights trained via gradient descent** | EMOTION_FUSION_WEIGHTS_KO / _EN |
### Evaluation: English (RAVDESS, n=2,880)
| Condition | Accuracy | Macro F1 | Latency |
|-----------|----------|----------|---------|
| Clean (studio) | **93.2%** | **0.932** | 250 ms |
| Phone (PSTN simulation, 300–3400 Hz band-limit) | **71.5%** | **0.710** | 242 ms |
| Degradation | −21.7 pp | −0.222 | – |
- **Phone-robust emotions (F1 > 0.80):** anger (0.836), surprise (0.856). High-energy acoustic cues survive the band-limit.
- **Phone-degraded (largest drop):** joy (0.946 → 0.585, −0.361), sadness (0.868 → 0.559, −0.309). Subtler pitch/timbre cues degrade hardest.
- **Text emotion sanity check (DistilRoBERTa on j-hartmann's held-out set):** 95.2% accuracy.
Full report: [`docs/stage2/english-evaluation-report.md`](docs/stage2/english-evaluation-report.md)
### Evaluation: Korean (KcELECTRA fine-tuning)
- Fine-tuned on the AI Hub *감성 대화 말뭉치* (Emotional Dialogue Corpus) on Colab GPU.
- Macro F1: **0.20 (base) → 0.65 (fine-tuned)**, a 3.25× improvement.
- Discovered an unlabeled "joy" cluster in the dataset and applied class weighting to recover minority classes.
### Fusion weight training (v2, 2026-04-19)
The v2 weights replace the greedy per-emotion grid search with **PyTorch gradient descent**: each class learns an audio weight `w_a = sigmoid(α)`, and all weights are optimized jointly with a cross-entropy loss plus L2 regularization on a held-out validation split.
| Language | Training data | Samples | Val Macro F1 (v1 β v2) |
|----------|---------------|---------|------------------------|
| English | JL-Corpus + SAVEE + MELD + RAVDESS phone | 2,821 | 0.6295 → **0.7596 (+12.67%p)** |
| Korean | AI Hub 263 val | 1,294 | 0.8736 → **0.8748 (tie)** |
Biggest single win: English `fear` F1 rose from **0.04 to 0.67**; the greedy search had settled on `audio_w=0.00` for that class, which turned out to be a local trap. Per-language weight tables: `src/common/constants.py` (`EMOTION_FUSION_WEIGHTS_KO` / `_EN`). Details: [`docs/stage2/fusion-weights-english-grid-search.md`](docs/stage2/fusion-weights-english-grid-search.md).
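
As an illustration of what a per-class weight `w_a = sigmoid(α)` does at inference time, here is a minimal late-fusion sketch in plain Python. It assumes the text channel receives the complementary weight `1 - w_a` and that the fused scores are renormalized; neither detail is confirmed by this README, and `fuse` is a hypothetical function name. Training (the gradient descent over `α`) is not shown.

```python
import math

EMOTIONS = ["neutral", "joy", "sadness", "anger", "surprise", "fear", "disgust"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse(audio_probs: dict, text_probs: dict, alpha: dict) -> dict:
    """Blend audio and text distributions with one learned weight per class.

    Assumption: fused[c] = w_a[c] * audio[c] + (1 - w_a[c]) * text[c],
    followed by renormalization so the result sums to 1.
    """
    fused = {}
    for emo in EMOTIONS:
        w_a = sigmoid(alpha[emo])
        fused[emo] = w_a * audio_probs[emo] + (1.0 - w_a) * text_probs[emo]
    total = sum(fused.values())
    return {emo: score / total for emo, score in fused.items()}
```

With `α = 0` for a class the two channels are averaged; a strongly negative `α` pushes `w_a` toward 0 so the text channel decides that class, which is the behaviour the greedy search's `audio_w=0.00` hard-coded for `fear`.
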
### End-to-end test: MELD (Friends TV dialogue)
We built 8 scenario WAVs from MELD (anger, joy, sadness, surprise, fear, bittersweet, annoyance, calm) and ran them through the deployed server.
| Metric | Result |
|--------|--------|
| Pipeline success | **8 / 8** files completed Stage 1 → 2 → 3 |
| Exact top-1 emotion label match | **7 / 7** (one tie) |
| Average processing time | ~2 min / file (HF Spaces CPU) |
### End-to-end test: 20Hours Korean demo (2026-04-20)
A companion Korean E2E set curated from the 20Hours Korean Conversational Speech dataset (M-F pairs only): 7 scenarios of about 1 minute each, prepared for the live demo.
| Metric | Result |
|--------|--------|
| Pipeline success | **7 / 7** files completed Stage 1 → 2 → 3 |
| Intended-emotion match (one speaker) | **4 / 7** |
| Source | `data/20hours_test/` + `scripts/test_20hours_e2e_server.py` |
Note: the 20Hours source is ASR-training data without emotion labels, so `ground_truth.json` lists *intended demo emotions* (for graph visualization), not ground-truth annotations.
---
## Stage 3–4: App + Server (Youngkyun)
### Stage 3: Reaction · Garden · Recap
`src/stage3/process.py` takes the Stage 2 output and runs three independent modules:
| Module | Role | Key logic |
|--------|------|-----------|
| `character_mapping.py` | Buckets each speaker's 7-class emotion into 4 moods (up / calm / down / tense) and looks up the pair cell in a 4×4 matrix → one of 9 pair states + a giver role | `joy → up`, `anger → tense`, `surprise/fear/disgust` resolved via residual distribution; `up × down → comforting (giver=A)`, `tense × tense → back_turned`, `calm × calm → idle`, ... |
| `character_mapping.py` (intensity) | Emits a `CharacterReaction.intensity` in `{1, 2, 3}` used by the app for healing-cycle thresholds | Default 2 → 3 cycles to heal; 1 → 4 cycles; 3 → 2 cycles |
| `garden_logic.py` | Computes growth delta (0–3) from call quality | `positive_ratio ≥ 0.5` → +3, `pos ≥ 0.3 && neg < 0.3` → +2, `neg ≥ 0.5` → 0 |
| `recap_generator.py` | Generates the narrative recap card via the Claude API | System prompt requires one concrete hook from the transcript (topic, decision, shared joke) + a light garden-voice framing, so titles are call-specific rather than combo templates. Rule-based template fallback when no API key is set. |
**Mood resolution:** `neg ≥ 0.5` → `conflict`; `neg ≥ 0.3 && pos < 0.3` → `recovering`; `level ≤ 1 && pos < 0.3` → `neglected`; otherwise `happy`. **Level only goes up**: tough calls shift the tint, never the count.
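
The growth and mood rules above translate almost directly into code. This is a minimal sketch rather than the contents of `garden_logic.py`: the function names are hypothetical, the rules are assumed to be checked in the order listed, and the `+1` fallback for calls that match no listed growth rule is an assumption (the README only states that the delta ranges over 0–3).

```python
def growth_delta(pos: float, neg: float) -> int:
    """Growth contributed by one call, from its positive/negative ratios."""
    if pos >= 0.5:
        return 3
    if pos >= 0.3 and neg < 0.3:
        return 2
    if neg >= 0.5:
        return 0
    return 1  # assumption: remaining cases are not specified in the README


def resolve_mood(pos: float, neg: float, level: int) -> str:
    """Mood overlay, evaluated in the order the rules are listed."""
    if neg >= 0.5:
        return "conflict"
    if neg >= 0.3 and pos < 0.3:
        return "recovering"
    if level <= 1 and pos < 0.3:
        return "neglected"
    return "happy"
```

Because `growth_delta` never returns a negative number, a running interaction count built from it can only increase, which matches the "level only goes up" rule; a rough call changes the result of `resolve_mood` instead.
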
**Confidence gate:** a speaker whose top emotion probability is below 0.5 is demoted to `calm` for pair-state lookup. The 7-class mood chip on the Results screen can therefore differ from the pair-state particle: the chip shows the dominant label, the particle shows the matrix cell after gating.
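
The gate and the mood bucketing can be sketched as follows. Only `joy → up`, `anger → tense`, the 0.5 gate, and the three pair cells shown come from this README. The mappings `sadness → down` and `neutral → calm`, and the way `surprise` / `fear` / `disgust` fall back to the strongest directly mapped emotion, are assumptions about what "residual distribution" means; all names are hypothetical and the remaining matrix cells are deliberately left out.

```python
CONFIDENCE_GATE = 0.5

# joy/anger are documented; sadness/neutral are assumed for illustration.
DIRECT_MOOD = {"joy": "up", "anger": "tense", "sadness": "down", "neutral": "calm"}

# Only the cells spelled out in the README; the full 4x4 matrix is not shown.
KNOWN_PAIR_CELLS = {
    ("up", "down"): ("comforting", "A"),
    ("tense", "tense"): ("back_turned", None),
    ("calm", "calm"): ("idle", None),
}


def speaker_mood(probs: dict) -> str:
    """Bucket a 7-class distribution into up / calm / down / tense."""
    top = max(probs, key=probs.get)
    if probs[top] < CONFIDENCE_GATE:
        return "calm"  # low-confidence speakers are demoted to calm
    if top in DIRECT_MOOD:
        return DIRECT_MOOD[top]
    # Assumed reading of "residual distribution": ignore the ambiguous top
    # label and use the strongest directly mapped emotion instead.
    residual = max(DIRECT_MOOD, key=lambda emo: probs.get(emo, 0.0))
    return DIRECT_MOOD[residual]


def pair_state(probs_a: dict, probs_b: dict):
    """Return (state, giver) for a known cell, or None if the cell is not listed here."""
    return KNOWN_PAIR_CELLS.get((speaker_mood(probs_a), speaker_mood(probs_b)))
```

A speaker with `anger` at 0.45 would still show an `anger` chip on the Results screen, but `speaker_mood` gates them to `calm`, so the pair cell is looked up as if they were calm.
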
### Stage 4: FastAPI backend
- **Framework:** FastAPI + SQLAlchemy + SQLite (local) · 4 tables: `calls`, `analysis_results`, `checkins`, `garden_state`.
- **Async pipeline:** `POST /api/upload` → `POST /api/analyze?call_id=X`, which returns 202 Accepted and starts a background thread; the client then polls `GET /api/calls/{id}`.
- **Mock path:** drop `data/samples/{call_id}_stage2.json` and the API skips Stage 1–2, running only Stage 3, which is useful for E2E testing without the heavy ML deps.
- **Endpoints (6):** `/api/upload`, `/api/analyze`, `/api/calls`, `/api/calls/{id}`, `/api/checkins`, `/api/garden`. `/api/calls` extracts `recap_card.title` from the stored Stage 3 JSON so the Home and History feeds can headline each card with a distinct title.
- **Tests:** 12 API tests on in-memory SQLite via pytest.
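
On the client side, the upload → analyze → poll flow boils down to a polling loop. The sketch below keeps the HTTP call behind an injected `fetch_call` function so it runs without a server; the `status` field and its `"done"` / `"failed"` values are hypothetical, since the response schema of `GET /api/calls/{id}` is not documented here.

```python
import time
from typing import Callable


def wait_for_analysis(
    fetch_call: Callable[[str], dict],
    call_id: str,
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll a call record until the background analysis finishes.

    fetch_call stands in for GET /api/calls/{id}; the "status" key and its
    values are assumptions made for this sketch.
    """
    deadline = now() + timeout_s
    while True:
        record = fetch_call(call_id)
        status = record.get("status")
        if status == "done":
            return record
        if status == "failed":
            raise RuntimeError(f"analysis failed for call {call_id}")
        if now() >= deadline:
            raise TimeoutError(f"analysis of call {call_id} did not finish in time")
        sleep(interval_s)
```

Given the roughly 2-minute processing time on the HF Spaces CPU deployment, a polling interval of a few seconds and a timeout of several minutes are reasonable starting values.
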
### React Native (Expo) app
**Router:** `expo-router` with 3 tabs (`Us`, `History`, `Settings`) plus modal routes (`checkin`, `results/[callId]`).
| Screen | Description |
|--------|-------------|
| **Onboarding** | Paper-deck intro ("A garden for two") that sets the metaphor before the first call lands |
| **Home (`Us`)** | Live character scene + garden, recent-call feed, mailbox entry point |
| **Seed bloom alert** | Level-up moment; fires when the interaction count crosses a garden threshold, offering `View` / `Later` |
| **Check-in** | 2-step prompt: my mood, then my guess for the partner's mood (empathic accuracy) |
| **Results: emotion analysis** | `My mood` / `Their mood` chips, suggestion line in the garden voice, `Emotional Landscape` wave (uplifting to heavy), and `Moments that mattered` per-speaker slices |
| **Results: recap card** | Call-specific LLM title + narrative + highlights + per-call `Our Garden` delta, followed by the `Was this accurate?` thumbs-up/down feedback loop |
| **History** | `Our Emotional Flow` chart (Me vs Partner), garden delta summary, recent moment cards headlined by the call-specific recap title |
| **Settings** | Language toggle (ko/en), developer mode, dev tools |
**Character animation layers** (all built on `react-native-reanimated` + `react-native-svg`):
1. **Idle breathing:** per-mood rhythm profile (up / calm / down / tense), uniform `scaleAmp` 1.02, micro-bob, micro-sway (1.4 s cycle).
2. **Eye blink:** 2.8–5.5 s interval, 15% chance of a double-blink.
3. **Emotion transition:** joy = bounce + arms **raised**, sadness = sink + drooping arms, anger / disgust = crossed arms (no tilt), surprise = pop, fear = shrink. Body rotation is globally forbidden; emotion reads via face, pupil, body pose, and idle rhythm instead.
4. **Pair-state body pose:** the `comforting` giver moves 60% closer with no tilt; the receiver gets a gentle sag + `pupilOffset` (head-lowering effect on eyes).
5. **Pair-state particle signature:** one preset per 4×4 matrix cell:
   - `comforting` → staged echo: 1 heart (giver → receiver) + 1 bubble dot + 1 translucent heart (bezier engine, `useParticles`).
   - `dancing` → 1 music note rising above the heads.
   - `cheering` / `listening` / `sitting_together` / `defusing` → yellow sparkle cluster.
   - `tension` / `back_turned` → grey sigh cloud.
   - `idle` (calm × calm) → silent; the couple just wanders.
6. **Generalised healing:** every giver-present cell (comforting, listening, defusing, cheering) counts interaction cycles; when the intensity-scaled threshold is hit, the receiver transitions to neutral. If the giver's solo state was negative-caring (fear / anger / sadness / disgust), they heal too, so no one is left behind.
7. **Tap interaction:** spring bounce + floating hearts / sparkles (`FloatingHearts` component).
8. **Wander:** characters roam between pre-validated waypoints (3–8 s move, 2–5 s pause), with direction-aware facing (scaleX flip for left/right, back-view for upward movement).
**Garden rendering (SVG):** Sky, Ground, Trees (level-specific types and counts), Flowers (distributed across four quadrants), Creatures (butterflies, rabbits, birds; Level 4 and above).
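
The healing rule in item 6 can be sketched as a small counter. The app itself is React Native, so this Python version is only a language-neutral illustration under stated assumptions: the cycle thresholds per intensity come from the Stage 3 table (1 → 4 cycles, 2 → 3, 3 → 2), while `HealingTracker`, `record_cycle`, and `giver_also_heals` are hypothetical names.

```python
# Cycles needed to heal, keyed by CharacterReaction.intensity.
HEALING_CYCLES = {1: 4, 2: 3, 3: 2}

# Giver solo states that also heal once the receiver does.
GIVER_HEALS_FROM = {"fear", "anger", "sadness", "disgust"}


class HealingTracker:
    """Counts interaction cycles for one giver-present pair state."""

    def __init__(self, intensity: int = 2):
        self.threshold = HEALING_CYCLES[intensity]
        self.cycles = 0

    def record_cycle(self) -> bool:
        """Register one completed cycle; True once the receiver should heal."""
        self.cycles += 1
        return self.cycles >= self.threshold


def giver_also_heals(giver_solo_emotion: str) -> bool:
    """Whether the giver transitions to neutral along with the receiver."""
    return giver_solo_emotion in GIVER_HEALS_FROM
```

With the default intensity of 2, the third call to `record_cycle` is the first to return `True`.
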
**i18n:** a custom `LocaleContext` drives Korean/English switching, with the selected locale persisted to AsyncStorage under the `@ustwo/locale` key.
### State persistence
| Store | Data | Key |
|-------|------|-----|
| SQLite (server) | calls, analysis_results, checkins, garden_state | – |
| AsyncStorage (app) | locale, dev_mode, garden (interactionCount + lastPositiveRatio + lastInteractionDate), checkins (local cache) | `@ustwo/*` |
---
## Repository map
```
src/stage1/           Speaker diarization + ASR (Seungjae)
src/stage2/           Audio + text emotion + fusion (Seungjae)
src/stage3/           Character, garden, recap (Youngkyun)
src/stage4/           FastAPI + SQLite + orchestration (Youngkyun)
src/common/           Pydantic schemas shared across stages
app/                  React Native (Expo) app (Youngkyun)
  src/routes/         expo-router (tabs, checkin, results, dev)
  src/components/     characters, garden, scenes, layout
  src/contexts/       Locale, Garden, DevMode
  src/hooks/          Animation hooks (idle, blink, emotion, tap, wander)
notebooks/            KcELECTRA fine-tuning (Colab)
tests/                pytest (73 Python tests) + jest (71 JS tests)
docs/stage{1,2,3,4}/  Per-stage technical docs
docs/images/          README screenshots
config.yaml           Global config (model paths, thresholds)
Dockerfile            HF Spaces deployment (Python 3.12 + torch CPU + ffmpeg)
```
---
## Getting started
### 1. Clone
```bash
git clone https://github.com/boolooppang/UsTwo.git
cd UsTwo
```
### 2. Backend (Python)
```bash
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=your_huggingface_token # required by pyannote
uvicorn src.stage4.main:app --reload --port 8000
# Then call: POST /api/upload, POST /api/analyze?call_id=X, GET /api/calls
```
### 3. App (React Native / Expo)
```bash
cd app && npm install
npx expo start --dev-client
# Press 'i' for iOS simulator, or scan the QR code with a real device
```
To point the app at a different server, edit `API_BASE` in `app/src/api/client.ts` (it currently defaults to the HF Spaces URL).
### 4. Tests
```bash
python -m pytest tests/ -v # 73 Python tests
cd app && npx jest # 71 JS tests (144 total)
```
---
## Deployment: HuggingFace Spaces (Docker)
The API server is deployed to HuggingFace Spaces as a Docker image, running the full pipeline (Stage 1 → 2 → 3).
- **Live URL:** `https://bbbakery-ustwo-api.hf.space`
- **Environment variables:** `HF_TOKEN`, `ANTHROPIC_API_KEY` (set in Spaces Settings).
- **Config files:** [`Dockerfile`](Dockerfile), [`railway.toml`](railway.toml), [`requirements-deploy.txt`](requirements-deploy.txt).
```bash
# Local Docker test
docker build -t ustwo .
docker run -p 7860:7860 -e HF_TOKEN=your_token ustwo
```
---
## Team & ownership
| Member | Role | Responsibility |
|--------|------|----------------|
| **Juhyun** | Product Owner / Researcher | Product design, UX research, empathic-accuracy literature review, emotion → character mapping spec, evaluation plan, final paper |
| **Seungjae** | ML Engineer | Stage 1–2 end to end: diarization, ASR, audio emotion, text emotion (ko/en), RAVDESS/MELD evaluation, KcELECTRA fine-tuning |
| **Youngkyun** | App Engineer | Stage 3–4 end to end: character mapping implementation, garden logic, LLM recap, FastAPI + SQLite server, React Native app, HF Spaces Docker deployment |
---
## Success criteria
**MVP (achieved):**
- ✅ Upload a call recording → character reaction + recap card within ~2 minutes.
- ✅ Happy vs tense calls produce visibly different character reactions (9 pair states).
- ✅ Cumulative positive calls grow the garden through 5 levels.
- ✅ Bilingual pipeline (Korean + English) with automatic language detection.
- ✅ MELD-based end-to-end test: 8/8 pipeline success, 7/7 exact emotion-label match.
---
## License
MIT