title: UsTwo API
emoji: π
colorFrom: pink
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
UsTwo: your characters react to your calls
CMU 11-775 Large Scale Multimedia Analysis, Team Project. 3-person team: Juhyun (PO / Research) · Seungjae (ML) · Youngkyun (App)
UsTwo takes a recorded phone call between two people (couple, friends, family) and runs multimodal analysis (audio + text) to understand how each speaker felt. It then produces three things inside a React Native + FastAPI mobile app: a character reaction scene, a growing emotion garden, and an LLM-written recap card.
The goal isn't just emotion classification. It's to visualize "what kind of moment did these two share on this call?"
At a glance
[call recording .wav/.m4a]
        │
        ▼
┌─────────────────────── Stage 1 (Seungjae) ───────────────────────┐
│ pyannote 4.x (3.1)   →  WhisperX large-v3-turbo INT8  →  ko/en   │
│ speaker diarization     ASR + forced alignment           LID     │
└──────────────────────────────────────────────────────────────────┘
        │ segments: [{speaker, start, end, text, lang}]
        ▼
┌─────────────────────── Stage 2 (Seungjae) ───────────────────────┐
│ emotion2vec LoRA (ONNX) + KcELECTRA LoRA (ko) / DistilRoBERTa    │
│ audio emotion (7-class)   text emotion (7-class ko / 7-class en) │
│                                                                  │
│ fusion: per-language, per-class trained weights (v2)             │
└──────────────────────────────────────────────────────────────────┘
        │ per-speaker emotion distribution
        ▼
┌────────────────────── Stage 3 (Youngkyun) ───────────────────────┐
│ character_mapping   + garden_logic       + recap_generator       │
│ 9 pair interactions   5 levels · 4 moods   Claude LLM            │
└──────────────────────────────────────────────────────────────────┘
        │
        ▼
┌────────────────────── Stage 4 (Youngkyun) ───────────────────────┐
│ FastAPI + SQLite    →  React Native (Expo)                       │
│ 6 endpoints, async     expo-router · SVG                         │
└──────────────────────────────────────────────────────────────────┘
End-to-end latency: a 1-minute call finishes in about 2 minutes on the HF Spaces CPU deployment.
Live server: https://bbbakery-ustwo-api.hf.space
Demo: the garden grows
The emotion garden levels up as positive-ratio interactions accumulate. Level 1 shows just two seedlings. After many positive calls, Level 5 is a full bloom with flowers, trees, and creatures.
| Level | Threshold (cumulative interactions) | Visual |
|---|---|---|
| 1 | 0–2 | Two seedlings |
| 2 | 3–7 | Grass + small flowers |
| 3 | 8–14 | Trees + many flowers |
| 4 | 15–24 | Lush garden + creatures |
| 5 | 25+ | Sunset sky + butterflies, rabbits + full bloom |
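The threshold table above can be read as a simple lookup from the cumulative interaction count to a level. The following is a minimal sketch under that reading; the names `LEVEL_LOWER_BOUNDS` and `garden_level` are illustrative and not taken from `garden_logic.py`.

```python
from bisect import bisect_right

# Lower bounds of levels 2..5, copied from the threshold table.
# Level 1 covers everything below the first bound (0-2 interactions).
LEVEL_LOWER_BOUNDS = [3, 8, 15, 25]


def garden_level(interaction_count: int) -> int:
    """Map a cumulative interaction count to a garden level in 1..5."""
    if interaction_count < 0:
        raise ValueError("interaction_count must be non-negative")
    # bisect_right counts how many lower bounds have been reached.
    return 1 + bisect_right(LEVEL_LOWER_BOUNDS, interaction_count)


if __name__ == "__main__":
    for count in (0, 3, 14, 25):
        print(count, "->", garden_level(count))
```

Keeping the bounds in one sorted list means a new level only needs one more entry, with no chain of comparisons to update.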
A mood overlay (happy · neglected · recovering · conflict) is layered on top, decided by the recent positive/negative ratio and the days since the last interaction. A healthy garden that's been ignored for 5+ days turns neglected; a recent call with heavy negative emotion pushes it to conflict.
Stage 1–2: ML Pipeline (Seungjae)
Stage 1: Diarization + ASR
| Component | Model | Config |
|---|---|---|
| VAD | Silero VAD | onset 0.5, min_speech 0.25s |
| Diarization | pyannote-audio 4.x (3.1) | HF token required |
| ASR | Faster-Whisper (large-v3-turbo) | INT8 quantized |
| Forced alignment | WhisperX wav2vec2 | per-word timestamps |
| Language ID | Whisper auto-detect + SenseVoice-Small | ko / en / unknown |
Output: Stage1Output (segments: [{speaker, start, end, text, lang}], processing_info, models). Entry point: src/stage1/process.py.
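To make the shape of that output concrete, here is a rough sketch using standard-library dataclasses. The field names come from the line above; the field types, the `Segment` class name, and the `speaking_time` helper are assumptions for illustration (the project's real schemas are Pydantic models in src/common/ and may differ).

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Segment:
    speaker: str   # diarization label; the exact label format is an assumption
    start: float   # seconds from the start of the recording (assumed unit)
    end: float
    text: str
    lang: str      # "ko", "en", or "unknown"


@dataclass
class Stage1Output:
    segments: list[Segment]
    processing_info: dict[str, Any] = field(default_factory=dict)  # assumed type
    models: dict[str, str] = field(default_factory=dict)           # assumed type


def speaking_time(output: Stage1Output) -> dict[str, float]:
    """Hypothetical helper: total seconds of speech per speaker."""
    totals: dict[str, float] = {}
    for seg in output.segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals
```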
Stage 2: Emotion Recognition (bilingual)
| Channel | Model | Classes |
|---|---|---|
| Audio emotion | emotion2vec_plus_base + LoRA fine-tuned (ONNX) | 7: neutral, joy, sadness, anger, surprise, fear, disgust |
| Text emotion (Korean) | KcELECTRA-base-v2022 + PEFT LoRA (ONNX) | 7: neutral, joy, sadness, anger, surprise, fear, disgust |
| Text emotion (English) | DistilRoBERTa (j-hartmann/emotion-english, zero-shot) | 7: 6 Ekman emotions + neutral |
| Fusion | Per-language, per-class weights trained via gradient descent | EMOTION_FUSION_WEIGHTS_KO / _EN |
Evaluation: English (RAVDESS, n=2,880)
| Condition | Accuracy | Macro F1 | Latency |
|---|---|---|---|
| Clean (studio) | 93.2% | 0.932 | 250 ms |
| Phone (PSTN simulation, 300–3400 Hz band-limit) | 71.5% | 0.710 | 242 ms |
| Degradation | −21.7 pp | −0.222 | n/a |
- Phone-robust emotions (F1 > 0.80): anger (0.836), surprise (0.856). High-energy acoustic cues survive the band-limit.
- Phone-degraded (largest drop): joy (0.946 → 0.585, −0.361), sadness (0.868 → 0.559, −0.309). Subtler pitch/timbre cues degrade hardest.
- Text emotion sanity check (DistilRoBERTa on j-hartmann's held-out set): 95.2% accuracy.
Full report: docs/stage2/english-evaluation-report.md
Evaluation: Korean (KcELECTRA fine-tuning)
- Fine-tuned on the AI Hub 감성 대화 말뭉치 (Emotional Dialogue Corpus) on Colab GPU.
- Macro F1: 0.20 (base) → 0.65 (fine-tuned), a 3.25× improvement.
- Discovered an unlabeled "joy" cluster in the dataset and applied class weighting to recover minority classes.
Fusion weight training (v2, 2026-04-19)
Replaced the greedy per-emotion grid search with PyTorch gradient descent. Each class gets an audio weight w_a = sigmoid(α), and the weights are optimized jointly with cross-entropy + L2 regularization on a held-out val split.
| Language | Training data | Samples | Val Macro F1 (v1 β v2) |
|---|---|---|---|
| English | JL-Corpus + SAVEE + MELD + RAVDESS phone | 2,821 | 0.6295 → 0.7596 (+12.67%p) |
| Korean | AI Hub 263 val | 1,294 | 0.8736 → 0.8748 (tie) |
Biggest single win: English fear F1 rose from 0.04 to 0.67 (greedy's audio_w=0.00 was a local trap). Per-language weight tables: src/common/constants.py (EMOTION_FUSION_WEIGHTS_KO / _EN). Details: docs/stage2/fusion-weights-english-grid-search.md.
End-to-end test: MELD (Friends TV dialogue)
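As an illustration of how per-class weights of the form w_a = sigmoid(α) might be applied at inference time, the sketch below blends the audio and text distributions class by class. The blending formula (a convex combination followed by renormalization) and the function names are assumptions; the source only states that each class has a trained audio weight. The training loop (cross-entropy + L2) is not shown.

```python
import math

EMOTIONS = ["neutral", "joy", "sadness", "anger", "surprise", "fear", "disgust"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse(audio_probs: dict, text_probs: dict, alpha: dict) -> dict:
    """Blend audio and text probabilities with one audio weight per class.

    Assumed formula: score_c = w_a[c] * audio[c] + (1 - w_a[c]) * text[c],
    with w_a[c] = sigmoid(alpha[c]); scores are renormalized to sum to 1.
    """
    scores = {}
    for emo in EMOTIONS:
        w_a = sigmoid(alpha[emo])
        scores[emo] = w_a * audio_probs[emo] + (1.0 - w_a) * text_probs[emo]
    total = sum(scores.values())
    if total == 0.0:
        # Degenerate input: fall back to a uniform distribution.
        return {emo: 1.0 / len(EMOTIONS) for emo in EMOTIONS}
    return {emo: s / total for emo, s in scores.items()}
```

With every α at 0 the weights are all 0.5 and the fusion reduces to a plain average. A per-class α lets one channel dominate for a class where the other is unreliable, which is the kind of freedom a fixed audio_w=0.00 does not have.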
We built 8 scenario WAVs from MELD (anger, joy, sadness, surprise, fear, bittersweet, annoyance, calm) and ran them through the deployed server.
| Metric | Result |
|---|---|
| Pipeline success | 8 / 8 files completed Stage 1 → 2 → 3 |
| Exact top-1 emotion label match | 7 / 7 (one tie) |
| Average processing time | ~2 min / file (HF Spaces CPU) |
End-to-end test: 20Hours Korean demo (2026-04-20)
Companion Korean E2E set curated from the 20Hours Korean Conversational Speech dataset (M-F pairs only): 7 scenarios of about 1 minute each, used for the live demo.
| Metric | Result |
|---|---|
| Pipeline success | 7 / 7 files completed Stage 1 → 2 → 3 |
| Intended-emotion match (one speaker) | 4 / 7 |
| Source | data/20hours_test/ + scripts/test_20hours_e2e_server.py |
Note: the 20Hours source is ASR-training data without emotion labels; ground_truth.json lists intended demo emotions (for graph visualization), not ground-truth annotations.
Stage 3–4: App + Server (Youngkyun)
Stage 3: Reaction · Garden · Recap
src/stage3/process.py takes the Stage 2 output and runs three independent modules:
| Module | Role | Key logic |
|---|---|---|
| character_mapping.py | Buckets each speaker's 7-class emotion into 4 moods (up / calm / down / tense) and looks up the pair cell in a 4×4 matrix → one of 9 pair states + a giver role | joy → up, anger → tense, surprise/fear/disgust resolved via residual distribution; up × down → comforting (giver=A), tense × tense → back_turned, calm × calm → idle, ... |
| character_mapping.py (intensity) | Emits a CharacterReaction.intensity in {1, 2, 3} used by the app for healing-cycle thresholds | Default 2 → 3 cycles to heal; 1 → 4 cycles; 3 → 2 cycles |
| garden_logic.py | Computes growth delta (0–3) from call quality | positive_ratio ≥ 0.5 → +3, pos ≥ 0.3 && neg < 0.3 → +2, neg ≥ 0.5 → 0 |
| recap_generator.py | Generates the narrative recap card via the Claude API | System prompt requires one concrete hook from the transcript (topic, decision, shared joke) + a light garden-voice framing, so titles are call-specific, not combo templates. Rule-based template fallback when no API key. |
Mood resolution: neg ≥ 0.5 → conflict; neg ≥ 0.3 && pos < 0.3 → recovering; level ≤ 1 && pos < 0.3 → neglected; otherwise happy. Level only goes up: tough calls shift the tint, never the count.
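The growth-delta and mood rules above translate into two small functions. This is a sketch, not the contents of `garden_logic.py`: the function names are illustrative, the rules are assumed to be checked in the order listed, and the `+1` fallthrough for calls that match none of the stated growth rules is an assumption (the source gives the range 0–3 but lists only the +3, +2, and 0 cases).

```python
def growth_delta(pos: float, neg: float) -> int:
    """Growth delta for one call, from its positive and negative ratios."""
    if pos >= 0.5:
        return 3
    if pos >= 0.3 and neg < 0.3:
        return 2
    if neg >= 0.5:
        return 0
    return 1  # ASSUMPTION: not stated in the source; fills the 0-3 range


def resolve_mood(pos: float, neg: float, level: int) -> str:
    """Mood overlay, checked in the order the rules are listed."""
    if neg >= 0.5:
        return "conflict"
    if neg >= 0.3 and pos < 0.3:
        return "recovering"
    if level <= 1 and pos < 0.3:
        return "neglected"
    return "happy"


if __name__ == "__main__":
    print(growth_delta(0.6, 0.1), resolve_mood(0.6, 0.1, level=2))
    print(growth_delta(0.1, 0.6), resolve_mood(0.1, 0.6, level=3))
```

Because the delta is never negative, a tense call can change the mood overlay but cannot lower the level, matching the "level only goes up" rule.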
Confidence gate: a speaker whose top emotion probability is below 0.5 is demoted to calm for pair-state lookup. The 7-class mood chip on the Results screen can therefore differ from the pair-state particle: the chip shows the dominant label, the particle shows the matrix cell after gating.
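A small sketch of why the chip and the particle can disagree: the chip uses the dominant label directly, while the pair-state lookup first applies the 0.5 gate. Only the joy → up and anger → tense mappings come from the table above; the `bucket` function here is a hypothetical stand-in that sends every other label to calm, which is not how the real residual-distribution logic works.

```python
CONFIDENCE_GATE = 0.5

# Only these two direct mappings are stated in the source.
DIRECT_BUCKETS = {"joy": "up", "anger": "tense"}


def bucket(label: str) -> str:
    """Hypothetical placeholder for the 7-class -> 4-mood bucketing."""
    return DIRECT_BUCKETS.get(label, "calm")


def chip_and_gated_mood(dist: dict) -> tuple:
    """Return (chip label, mood used for pair-state lookup)."""
    label = max(dist, key=dist.get)   # dominant 7-class label (the chip)
    if dist[label] < CONFIDENCE_GATE:
        return label, "calm"          # low confidence: demoted to calm
    return label, bucket(label)
```

For a distribution such as joy 0.45 / neutral 0.30 / sadness 0.25, the chip reads joy while the pair-state lookup sees calm.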
Stage 4: FastAPI backend
- Framework: FastAPI + SQLAlchemy + SQLite (local) · 4 tables: calls, analysis_results, checkins, garden_state.
- Async pipeline: POST /api/upload → POST /api/analyze?call_id=X returns 202 Accepted and starts a background thread; the client then polls GET /api/calls/{id}.
- Mock path: drop data/samples/{call_id}_stage2.json and the API skips Stage 1–2, running only Stage 3. Useful for E2E testing without the heavy ML deps.
- Endpoints (6): /api/upload, /api/analyze, /api/calls, /api/calls/{id}, /api/checkins, /api/garden. /api/calls extracts recap_card.title from the stored Stage 3 JSON so the Home and History feeds can headline each card with a distinct title.
- Tests: 12 API tests on in-memory SQLite via pytest.
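The upload → analyze → poll flow described in the list above can be sketched from the client side as a polling loop. The response shape is an assumption: the `status` field and the `"done"` / `"failed"` values are hypothetical and should be checked against the actual GET /api/calls/{id} payload. The HTTP call is injected as a `fetch` callable so the loop itself stays testable without a server.

```python
import time
from typing import Callable

TERMINAL_STATES = ("done", "failed")  # ASSUMPTION: hypothetical status values


def poll_call(
    fetch: Callable[[str], dict],
    call_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch(call_id) until the record reaches a terminal state.

    fetch is expected to wrap GET /api/calls/{id} and return the parsed JSON.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        record = fetch(call_id)
        if record.get("status") in TERMINAL_STATES:
            return record
        if time.monotonic() >= deadline:
            raise TimeoutError(f"call {call_id} still processing after {timeout_s}s")
        sleep(interval_s)
```

A real `fetch` could wrap `urllib.request.urlopen` or any HTTP client; given the reported ~2 minutes per file on HF Spaces CPU, a timeout of several minutes is a reasonable starting point.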
React Native (Expo) app
Router: expo-router β 3 tabs (Us, History, Settings) plus modal routes (checkin, results/[callId]).
| Screen | Description |
|---|---|
| Onboarding | Paper-deck intro ("A garden for two") that sets the metaphor before the first call lands |
| Home (Us) | Live character scene + garden, recent-call feed, mailbox entry point |
| Seed bloom alert | Level-up moment; fires when the interaction count crosses a garden threshold, offering View / Later |
| Check-in | 2-step prompt: my mood, then my guess for the partner's mood (empathic accuracy) |
| Results: emotion analysis | My mood / Their mood chips, suggestion line in the garden-voice, Emotional Landscape wave (uplifting to heavy), and Moments that mattered per-speaker slices |
| Results: recap card | Call-specific LLM title + narrative + highlights + per-call Our Garden delta, followed by the Was this accurate? thumbs-up/down feedback loop |
| History | Our Emotional Flow chart (Me vs Partner), garden delta summary, recent moment cards headlined by the call-specific recap title |
| Settings | Language toggle (ko/en), developer mode, dev tools |
Character animation layers (all built on react-native-reanimated + react-native-svg):
- Idle breathing: per-mood rhythm profile (up / calm / down / tense), uniform scaleAmp 1.02, micro-bob, micro-sway (1.4s cycle).
- Eye blink: 2.8–5.5s interval, 15% chance of a double-blink.
- Emotion transition: joy = bounce + arms raised, sadness = sink + drooping arms, anger / disgust = crossed arms (no tilt), surprise = pop, fear = shrink. Body rotation is globally forbidden; emotion reads via face, pupil, body pose, and idle rhythm instead.
- Pair-state body pose: comforting giver moves 60% closer with no tilt; receiver gets a gentle sag + pupilOffset (head-lowering effect on eyes).
- Pair-state particle signature: one preset per 4×4 matrix cell:
  - comforting → staged echo: 1 heart (giver → receiver) + 1 bubble dot + 1 translucent heart (bezier engine, useParticles).
  - dancing → 1 music note rising above the heads.
  - cheering / listening / sitting_together / defusing → yellow sparkle cluster.
  - tension / back_turned → grey sigh cloud.
  - idle (calm × calm) → silent; the couple just wanders.
- Generalised healing: every giver-present cell (comforting, listening, defusing, cheering) counts interaction cycles; when the intensity-scaled threshold is hit the receiver transitions to neutral. If the giver's solo state was negative-caring (fear / anger / sadness / disgust), they heal too, so no one is left behind.
- Tap interaction: spring bounce + floating hearts / sparkles (FloatingHearts component).
- Wander: characters roam between pre-validated waypoints (3–8s move, 2–5s pause), with direction-aware facing (scaleX flip for left/right, back-view for upward movement).
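The generalised healing rule in the list above combines the intensity → cycle-count mapping from the Stage 3 table with the giver-present cells. The sketch below expresses that logic in Python for readability; the app itself is React Native, and the function name, argument names, and the treatment of an unknown intensity as the default 3 cycles are assumptions.

```python
CYCLES_TO_HEAL = {1: 4, 2: 3, 3: 2}  # intensity -> cycles, from the Stage 3 table
GIVER_CELLS = {"comforting", "listening", "defusing", "cheering"}
NEGATIVE_CARING = {"fear", "anger", "sadness", "disgust"}


def apply_healing(pair_state, intensity, cycles_done, receiver_emotion, giver_emotion):
    """Return (receiver_emotion, giver_emotion) after cycles_done cycles."""
    if pair_state not in GIVER_CELLS:
        return receiver_emotion, giver_emotion  # no giver, nothing heals
    threshold = CYCLES_TO_HEAL.get(intensity, 3)  # ASSUMPTION: default 3 cycles
    if cycles_done < threshold:
        return receiver_emotion, giver_emotion
    # Threshold reached: receiver settles to neutral; a giver whose own
    # state was negative-caring heals as well.
    healed_giver = "neutral" if giver_emotion in NEGATIVE_CARING else giver_emotion
    return "neutral", healed_giver
```

Higher intensity means fewer cycles are needed, so a strongly expressed comforting scene resolves faster than a faint one.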
Garden rendering (SVG): Sky, Ground, Trees (level-specific types and counts), Flowers (distributed across four quadrants), Creatures (butterflies, rabbits, birds; Level 4 and above).
i18n: a custom LocaleContext drives Korean/English switching with @ustwo/locale persisted to AsyncStorage.
State persistence
| Store | Data | Key |
|---|---|---|
| SQLite (server) | calls, analysis_results, checkins, garden_state | n/a |
| AsyncStorage (app) | locale, dev_mode, garden (interactionCount + lastPositiveRatio + lastInteractionDate), checkins (local cache) | @ustwo/* |
Repository map
src/stage1/            Speaker diarization + ASR (Seungjae)
src/stage2/            Audio + text emotion + fusion (Seungjae)
src/stage3/            Character, garden, recap (Youngkyun)
src/stage4/            FastAPI + SQLite + orchestration (Youngkyun)
src/common/            Pydantic schemas shared across stages
app/                   React Native (Expo) app (Youngkyun)
  src/routes/          expo-router (tabs, checkin, results, dev)
  src/components/      characters, garden, scenes, layout
  src/contexts/        Locale, Garden, DevMode
  src/hooks/           Animation hooks (idle, blink, emotion, tap, wander)
notebooks/             KcELECTRA fine-tuning (Colab)
tests/                 pytest (73 Python tests) + jest (71 JS tests)
docs/stage{1,2,3,4}/   Per-stage technical docs
docs/images/           README screenshots
config.yaml            Global config (model paths, thresholds)
Dockerfile             HF Spaces deployment (Python 3.12 + torch CPU + ffmpeg)
Getting started
1. Clone
git clone https://github.com/boolooppang/UsTwo.git
cd UsTwo
2. Backend (Python)
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=your_huggingface_token # required by pyannote
uvicorn src.stage4.main:app --reload --port 8000
# → POST /api/upload, POST /api/analyze?call_id=X, GET /api/calls
3. App (React Native / Expo)
cd app && npm install
npx expo start --dev-client
# Press 'i' for iOS simulator, or scan the QR code with a real device
To point the app at a different server, edit API_BASE in app/src/api/client.ts (it currently defaults to the HF Spaces URL).
4. Tests
python -m pytest tests/ -v # 73 Python tests
cd app && npx jest # 71 JS tests (144 total)
Deployment: HuggingFace Spaces (Docker)
The API server is deployed to HuggingFace Spaces as a Docker image, running the full pipeline (Stage 1 → 2 → 3).
- Live URL: https://bbbakery-ustwo-api.hf.space
- Environment variables: HF_TOKEN, ANTHROPIC_API_KEY (set in Spaces Settings).
- Config files: Dockerfile, railway.toml, requirements-deploy.txt.
# Local Docker test
docker build -t ustwo .
docker run -p 7860:7860 -e HF_TOKEN=your_token ustwo
Team & ownership
| Member | Role | Responsibility |
|---|---|---|
| Juhyun | Product Owner / Researcher | Product design, UX research, empathic-accuracy literature review, emotion → character mapping spec, evaluation plan, final paper |
| Seungjae | ML Engineer | Stage 1–2 end to end: diarization, ASR, audio emotion, text emotion (ko/en), RAVDESS/MELD evaluation, KcELECTRA fine-tuning |
| Youngkyun | App Engineer | Stage 3–4 end to end: character mapping implementation, garden logic, LLM recap, FastAPI + SQLite server, React Native app, HF Spaces Docker deployment |
Success criteria
MVP (achieved):
- ✅ Upload a call recording → character reaction + recap card within ~2 minutes.
- ✅ Happy vs tense calls produce visibly different character reactions (9 pair states).
- ✅ Cumulative positive calls grow the garden through 5 levels.
- ✅ Bilingual pipeline (Korean + English) with automatic language detection.
- ✅ MELD-based end-to-end test: 8/8 pipeline success, 7/7 exact emotion-label match.
License
MIT




