title: UsTwo API
emoji: 💡
colorFrom: pink
colorTo: purple
sdk: docker
app_port: 7860
pinned: false

UsTwo: your characters react to your calls

CMU 11-775 Large Scale Multimedia Analysis, Team Project. 3-person team: Juhyun (PO / Research) · Seungjae (ML) · Youngkyun (App)

UsTwo takes a recorded phone call between two people (couple, friends, family) and runs multimodal analysis (audio + text) to understand how each speaker felt. It then produces three things inside a React Native + FastAPI mobile app: a character reaction scene, a growing emotion garden, and an LLM-written recap card.

The goal isn't just emotion classification. It's to visualize "what kind of moment did these two share on this call?"

Screenshots: Home, Recap card, History


At a glance

[call recording .wav/.m4a]
        │
        ▼
┌───────────────────────── Stage 1 (Seungjae) ─────────────────────────┐
│  pyannote 4.x (3.1) → WhisperX large-v3-turbo INT8 → ko/en           │
│  speaker diarization  ASR + forced alignment         LID             │
└──────────────────────────────────────────────────────────────────────┘
        │  segments: [{speaker, start, end, text, lang}]
        ▼
┌───────────────────────── Stage 2 (Seungjae) ─────────────────────────┐
│  emotion2vec LoRA (ONNX)  +  KcELECTRA LoRA (ko) / DistilRoBERTa     │
│  audio emotion (7-class)     text emotion (7-class ko / 7-class en)  │
│                                                                      │
│    fusion:  per-language, per-class trained weights (v2)             │
└──────────────────────────────────────────────────────────────────────┘
        │  per-speaker emotion distribution
        ▼
┌───────────────────────── Stage 3 (Youngkyun) ────────────────────────┐
│  character_mapping  +  garden_logic  +  recap_generator              │
│  9 pair interactions    5 levels · 4 moods    Claude LLM             │
└──────────────────────────────────────────────────────────────────────┘
        │
        ▼
┌───────────────────────── Stage 4 (Youngkyun) ────────────────────────┐
│  FastAPI + SQLite          ↔          React Native (Expo)            │
│  6 endpoints, async                   expo-router · SVG              │
└──────────────────────────────────────────────────────────────────────┘

End-to-end latency: a 1-minute call finishes in about 2 minutes on the HF Spaces CPU deployment. Live server: https://bbbakery-ustwo-api.hf.space


Demo: the garden grows

The emotion garden levels up as positive-ratio interactions accumulate. Level 1 shows just two seedlings. After many positive calls, Level 5 is a full bloom with flowers, trees, and creatures.

Screenshots: Level 1 (seedling), Level 5 (full bloom)

| Level | Threshold (cumulative interactions) | Visual |
|---|---|---|
| 1 | 0–2 | Two seedlings |
| 2 | 3–7 | Grass + small flowers |
| 3 | 8–14 | Trees + many flowers |
| 4 | 15–24 | Lush garden + creatures |
| 5 | 25+ | Sunset sky + butterflies, rabbits + full bloom |

A mood overlay (happy · neglected · recovering · conflict) is layered on top, decided by the recent positive/negative ratio and the days since the last interaction. A healthy garden that's been ignored for 5+ days turns neglected; a recent call with heavy negative emotion pushes it to conflict.
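The level thresholds above can be read as a simple lookup. The following is a minimal sketch, not the project's actual code: the names `LEVEL_STARTS` and `garden_level` are hypothetical, and only the boundaries themselves come from the threshold table.

```python
from bisect import bisect_right

# Lower bounds of levels 2..5, taken from the threshold table.
# (Hypothetical constant name, used only for this sketch.)
LEVEL_STARTS = [3, 8, 15, 25]


def garden_level(interaction_count: int) -> int:
    """Map a cumulative interaction count to a garden level (1-5)."""
    if interaction_count < 0:
        raise ValueError("interaction_count must be non-negative")
    # bisect_right counts how many level boundaries have been reached.
    return 1 + bisect_right(LEVEL_STARTS, interaction_count)
```

For example, `garden_level(7)` stays at level 2, while `garden_level(8)` moves to level 3. Keeping the boundaries in one sorted list means a threshold change touches a single line.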


Stage 1–2: ML Pipeline (Seungjae)

Stage 1: Diarization + ASR

| Component | Model | Config |
|---|---|---|
| VAD | Silero VAD | onset 0.5, min_speech 0.25s |
| Diarization | pyannote-audio 4.x (3.1) | HF token required |
| ASR | Faster-Whisper (large-v3-turbo) | INT8 quantized |
| Forced alignment | WhisperX wav2vec2 | per-word timestamps |
| Language ID | Whisper auto-detect + SenseVoice-Small | ko / en / unknown |

Output: Stage1Output, containing segments: [{speaker, start, end, text, lang}], processing_info, and models. Entry point: src/stage1/process.py.
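To make the shape of that output concrete, here is a rough sketch using standard-library dataclasses as a stand-in for the project's Pydantic schemas. Only the field names come from the description above; the field types, the speaker label format, the time unit, and the `speaking_time` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    speaker: str  # diarization label (format assumed)
    start: float  # seconds from the start of the recording (unit assumed)
    end: float
    text: str
    lang: str     # "ko", "en", or "unknown"


@dataclass
class Stage1Output:
    segments: list[Segment]
    processing_info: dict = field(default_factory=dict)  # contents assumed
    models: dict = field(default_factory=dict)           # contents assumed


def speaking_time(output: Stage1Output) -> dict[str, float]:
    """Hypothetical helper: total speaking time per speaker, in seconds."""
    totals: dict[str, float] = {}
    for seg in output.segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals
```

A downstream stage could call `speaking_time(out)` to check that both speakers actually contributed before running emotion analysis.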

Stage 2: Emotion Recognition (bilingual)

| Channel | Model | Classes |
|---|---|---|
| Audio emotion | emotion2vec_plus_base + LoRA fine-tuned (ONNX) | 7: neutral, joy, sadness, anger, surprise, fear, disgust |
| Text emotion (Korean) | KcELECTRA-base-v2022 + PEFT LoRA (ONNX) | 7: neutral, joy, sadness, anger, surprise, fear, disgust |
| Text emotion (English) | DistilRoBERTa (j-hartmann/emotion-english, zero-shot) | 7: Ekman's 6 basic emotions + neutral |
| Fusion | Per-language, per-class weights trained via gradient descent | EMOTION_FUSION_WEIGHTS_KO / _EN |

Evaluation: English (RAVDESS, n=2,880)

| Condition | Accuracy | Macro F1 | Latency |
|---|---|---|---|
| Clean (studio) | 93.2% | 0.932 | 250 ms |
| Phone (PSTN simulation, 300–3400 Hz band-limit) | 71.5% | 0.710 | 242 ms |
| Degradation | −21.7 pp | −0.222 | n/a |

  • Phone-robust emotions (F1 > 0.80): anger (0.836), surprise (0.856). High-energy acoustic cues survive the band-limit.
  • Phone-degraded (largest drop): joy (0.946 → 0.585, −0.361), sadness (0.868 → 0.559, −0.309). Subtler pitch/timbre cues degrade hardest.
  • Text emotion sanity check (DistilRoBERTa on j-hartmann's held-out set): 95.2% accuracy.

Full report: docs/stage2/english-evaluation-report.md

Evaluation: Korean (KcELECTRA fine-tuning)

  • Fine-tuned on the AI Hub 감성 대화 말뭉치 (Emotion Dialogue Corpus) on Colab GPU.
  • Macro F1: 0.20 (base) → 0.65 (fine-tuned), a 3.25× improvement.
  • Discovered an unlabeled "joy" cluster in the dataset and applied class weighting to recover minority classes.

Fusion weight training (v2, 2026-04-19)

Replaced the greedy per-emotion grid search with PyTorch gradient descent. Each class gets an audio weight w_a = sigmoid(α), and all weights are optimized jointly with a cross-entropy loss plus L2 regularization on a held-out validation split.

| Language | Training data | Samples | Val Macro F1 (v1 → v2) |
|---|---|---|---|
| English | JL-Corpus + SAVEE + MELD + RAVDESS phone | 2,821 | 0.6295 → 0.7596 (+12.67%p) |
| Korean | AI Hub 263 val | 1,294 | 0.8736 → 0.8748 (tie) |

Biggest single win: English fear F1 rose from 0.04 to 0.67 (the greedy search's audio_w=0.00 was a local trap). Per-language weight tables: src/common/constants.py (EMOTION_FUSION_WEIGHTS_KO / _EN). Details: docs/stage2/fusion-weights-english-grid-search.md.
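One way to picture how a per-class audio weight could be applied at inference time is sketched below. This is an illustration under stated assumptions, not the project's fusion code: the README only gives w_a = sigmoid(α), so treating the text weight as 1 − w_a and renormalizing the result are assumptions, and the `fuse` function name is hypothetical.

```python
import math

EMOTIONS = ["neutral", "joy", "sadness", "anger", "surprise", "fear", "disgust"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse(p_audio: dict[str, float], p_text: dict[str, float],
         alpha: dict[str, float]) -> dict[str, float]:
    """Blend audio and text probabilities with a per-class audio weight.

    ASSUMPTION: the text weight is 1 - w_a and the scores are renormalized
    to sum to 1. The source does not spell out either detail.
    """
    scores = {}
    for emo in EMOTIONS:
        w_a = sigmoid(alpha[emo])  # per-class audio weight in (0, 1)
        scores[emo] = w_a * p_audio[emo] + (1.0 - w_a) * p_text[emo]
    total = sum(scores.values())
    return {emo: s / total for emo, s in scores.items()}
```

Because sigmoid(α) never reaches exactly 0 or 1, a learned weight can approach "audio off" for a class without being pinned there, which is one plausible reason a gradient-based search could escape the audio_w=0.00 trap mentioned above.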

End-to-end test: MELD (Friends TV dialogue)

We built 8 scenario WAVs from MELD (anger, joy, sadness, surprise, fear, bittersweet, annoyance, calm) and ran them through the deployed server.

| Metric | Result |
|---|---|
| Pipeline success | 8 / 8 files completed Stage 1 → 2 → 3 |
| Exact top-1 emotion label match | 7 / 7 (one tie) |
| Average processing time | ~2 min / file (HF Spaces CPU) |

End-to-end test: 20Hours Korean demo (2026-04-20)

A companion Korean E2E set, curated from the 20Hours Korean Conversational Speech dataset (M-F pairs only): 7 scenarios of about 1 minute each, built for the live demo.

| Metric | Result |
|---|---|
| Pipeline success | 7 / 7 files completed Stage 1 → 2 → 3 |
| Intended-emotion match (one speaker) | 4 / 7 |
| Source | data/20hours_test/ + scripts/test_20hours_e2e_server.py |

Note: the 20Hours source is ASR-training data without emotion labels, so ground_truth.json lists intended demo emotions (for graph visualization), not ground-truth annotations.


Stage 3–4: App + Server (Youngkyun)

Stage 3: Reaction · Garden · Recap

src/stage3/process.py takes the Stage 2 output and runs three independent modules:

| Module | Role | Key logic |
|---|---|---|
| character_mapping.py | Buckets each speaker's 7-class emotion into 4 moods (up / calm / down / tense) and looks up the pair cell in a 4×4 matrix → one of 9 pair states + a giver role | joy → up, anger → tense, surprise/fear/disgust resolved via residual distribution; up × down → comforting (giver=A), tense × tense → back_turned, calm × calm → idle, ... |
| character_mapping.py (intensity) | Emits a CharacterReaction.intensity in {1, 2, 3} used by the app for healing-cycle thresholds | Default 2 → 3 cycles to heal; 1 → 4 cycles; 3 → 2 cycles |
| garden_logic.py | Computes growth delta (0–3) from call quality | positive_ratio ≥ 0.5 → +3, pos ≥ 0.3 && neg < 0.3 → +2, neg ≥ 0.5 → 0 |
| recap_generator.py | Generates the narrative recap card via the Claude API | System prompt requires one concrete hook from the transcript (topic, decision, shared joke) + a light garden-voice framing; titles are call-specific, not combo templates. Rule-based template fallback when no API key. |

Mood resolution: neg ≥ 0.5 → conflict; neg ≥ 0.3 && pos < 0.3 → recovering; level ≤ 1 && pos < 0.3 → neglected; otherwise happy. Level only goes up: tough calls shift the tint, never the count.
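The growth and mood rules above translate fairly directly into two small functions. This is a sketch rather than the contents of garden_logic.py: the function names are hypothetical, the rules are assumed to be checked in the order they are listed, and the fallback growth value of +1 is an assumption, since the README does not state what happens when none of the three growth rules match.

```python
def growth_delta(pos: float, neg: float) -> int:
    """Growth delta (0-3) for one call, from positive/negative ratios."""
    if pos >= 0.5:
        return 3
    if pos >= 0.3 and neg < 0.3:
        return 2
    if neg >= 0.5:
        return 0
    return 1  # ASSUMPTION: the remaining case is not stated in the README


def resolve_mood(pos: float, neg: float, level: int) -> str:
    """Pick the garden mood overlay; rules are checked in listed order."""
    if neg >= 0.5:
        return "conflict"
    if neg >= 0.3 and pos < 0.3:
        return "recovering"
    if level <= 1 and pos < 0.3:
        return "neglected"
    return "happy"
```

Note that `resolve_mood` never changes the level, which matches the rule that tough calls shift the tint but not the count.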

Confidence gate: a speaker whose top emotion probability is below 0.5 is demoted to calm for pair-state lookup. The 7-class mood chip on the Results screen can therefore differ from the pair-state particle: the chip shows the dominant label, the particle shows the matrix cell after gating.
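A small sketch may help show why the chip and the particle can disagree. It is deliberately partial and hedged: only joy → up, anger → tense, the 0.5 gate, and the three pair cells below are taken from the README. The neutral → calm and sadness → down entries, the reading of "residual distribution" as "fall back to the strongest directly mapped label", and all names are assumptions for illustration.

```python
# joy -> up and anger -> tense come from the README; the other two
# entries are ASSUMPTIONS made for this sketch.
DIRECT_MOOD = {"joy": "up", "anger": "tense", "neutral": "calm", "sadness": "down"}

# Only the three cells quoted in the README; the real matrix is 4x4.
PAIR_STATES = {
    ("up", "down"): "comforting",
    ("tense", "tense"): "back_turned",
    ("calm", "calm"): "idle",
}


def gated_mood(dist: dict[str, float], threshold: float = 0.5) -> str:
    """Mood used for pair-state lookup, after the confidence gate."""
    top = max(dist, key=dist.get)
    if dist[top] < threshold:
        return "calm"  # low-confidence speakers are demoted to calm
    if top in DIRECT_MOOD:
        return DIRECT_MOOD[top]
    # ASSUMPTION: surprise/fear/disgust fall back to the strongest
    # directly mapped label among the remaining probabilities.
    rest = {k: v for k, v in dist.items() if k in DIRECT_MOOD}
    return DIRECT_MOOD[max(rest, key=rest.get)] if rest else "calm"


def pair_state(dist_a: dict[str, float], dist_b: dict[str, float]):
    """Look up the pair cell; returns None for cells not listed here."""
    return PAIR_STATES.get((gated_mood(dist_a), gated_mood(dist_b)))
```

With a distribution such as joy 0.4, sadness 0.35, neutral 0.25, the chip would still show joy as the dominant label, while `gated_mood` returns calm because the top probability is under 0.5.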

Stage 4: FastAPI backend

  • Framework: FastAPI + SQLAlchemy + SQLite (local) · 4 tables: calls, analysis_results, checkins, garden_state.
  • Async pipeline: the client calls POST /api/upload, then POST /api/analyze?call_id=X, which returns 202 Accepted and starts a background thread; the client then polls GET /api/calls/{id} until the analysis finishes.
  • Mock path: drop data/samples/{call_id}_stage2.json and the API skips Stage 1–2, running only Stage 3. This is useful for E2E testing without the heavy ML deps.
  • Endpoints (6): /api/upload, /api/analyze, /api/calls, /api/calls/{id}, /api/checkins, /api/garden. /api/calls extracts recap_card.title from the stored Stage 3 JSON so the Home and History feeds can headline each card with a distinct title.
  • Tests: 12 API tests on in-memory SQLite via pytest.
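On the client side, the upload → analyze → poll flow reduces to a loop like the one sketched here. The HTTP call is injected as a function so the sketch stays library-agnostic and testable offline. The `"status"` field and its `"done"` / `"failed"` values are assumptions about the response body, not something the README specifies, and `wait_for_analysis` is a hypothetical name.

```python
import time
from typing import Callable


def wait_for_analysis(fetch_call: Callable[[], dict],
                      interval_s: float = 5.0,
                      timeout_s: float = 600.0,
                      sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the call record reports a finished state.

    fetch_call should perform GET /api/calls/{id} and return parsed JSON.
    ASSUMPTION: the record carries a "status" field that becomes
    "done" or "failed" when the background thread finishes.
    """
    waited = 0.0
    while True:
        record = fetch_call()
        if record.get("status") in ("done", "failed"):
            return record
        if waited >= timeout_s:
            raise TimeoutError("analysis did not finish in time")
        sleep(interval_s)
        waited += interval_s
```

Given the roughly 2-minute processing time for a 1-minute call, a polling interval of a few seconds keeps request volume low without adding noticeable delay.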

React Native (Expo) app

Router: expo-router, with 3 tabs (Us, History, Settings) plus modal routes (checkin, results/[callId]).

| Screen | Description |
|---|---|
| Onboarding | Paper-deck intro ("A garden for two") that sets the metaphor before the first call lands |
| Home (Us) | Live character scene + garden, recent-call feed, mailbox entry point |
| Seed bloom alert | Level-up moment; fires when the interaction count crosses a garden threshold, offering View / Later |
| Check-in | 2-step prompt: my mood, then my guess for the partner's mood (empathic accuracy) |
| Results: emotion analysis | My mood / Their mood chips, suggestion line in the garden-voice, Emotional Landscape wave (uplifting → heavy), and Moments that mattered per-speaker slices |
| Results: recap card | Call-specific LLM title + narrative + highlights + per-call Our Garden delta, followed by the Was this accurate? thumbs-up/down feedback loop |
| History | Our Emotional Flow chart (Me vs Partner), garden delta summary, recent moment cards headlined by the call-specific recap title |
| Settings | Language toggle (ko/en), developer mode, dev tools |

Character animation layers (all built on react-native-reanimated + react-native-svg):

  1. Idle breathing: per-mood rhythm profile (up / calm / down / tense), uniform scaleAmp 1.02, micro-bob, micro-sway (1.4s cycle).
  2. Eye blink: 2.8–5.5s interval, 15% chance of a double-blink.
  3. Emotion transition: joy = bounce + arms raised, sadness = sink + drooping arms, anger / disgust = crossed arms (no tilt), surprise = pop, fear = shrink. Body rotation is globally forbidden; emotion reads via face, pupil, body pose, and idle rhythm instead.
  4. Pair-state body pose: comforting giver moves 60% closer with no tilt; receiver gets a gentle sag + pupilOffset (head-lowering effect on eyes).
  5. Pair-state particle signature: one preset per 4×4 matrix cell:
    • comforting → staged echo: 1 heart (giver → receiver) + 1 bubble dot + 1 translucent heart (bezier engine, useParticles).
    • dancing → 1 music note rising above the heads.
    • cheering / listening / sitting_together / defusing → yellow sparkle cluster.
    • tension / back_turned → grey sigh cloud.
    • idle (calm × calm) → silent; the couple just wanders.
  6. Generalised healing: every giver-present cell (comforting, listening, defusing, cheering) counts interaction cycles; when the intensity-scaled threshold is hit the receiver transitions to neutral. If the giver's solo state was negative-caring (fear / anger / sadness / disgust), they heal too, so no one is left behind.
  7. Tap interaction: spring bounce + floating hearts / sparkles (FloatingHearts component).
  8. Wander: characters roam between pre-validated waypoints (3–8s move, 2–5s pause), with direction-aware facing (scaleX flip for left/right, back-view for upward movement).
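The intensity-scaled healing threshold from layer 6 can be sketched as a small counter. The app itself is written in React Native, so this Python version is only a language-neutral illustration: the cycle counts (1 → 4, 2 → 3, 3 → 2) come from the CharacterReaction.intensity description, while the `HEAL_CYCLES` and `HealingTracker` names are hypothetical.

```python
# Intensity -> interaction cycles needed to heal, as listed for
# CharacterReaction.intensity (the default intensity is 2).
HEAL_CYCLES = {1: 4, 2: 3, 3: 2}


class HealingTracker:
    """Counts giver-present interaction cycles for one receiver."""

    def __init__(self, intensity: int = 2):
        self.threshold = HEAL_CYCLES[intensity]
        self.cycles = 0

    def tick(self) -> bool:
        """Register one cycle; return True once the receiver should heal."""
        self.cycles += 1
        return self.cycles >= self.threshold
```

With the default intensity, the third call to `tick()` is the first to return True, at which point the receiver would transition to neutral.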

Garden rendering (SVG): Sky, Ground, Trees (level-specific types and counts), Flowers (distributed across four quadrants), Creatures (butterflies, rabbits, birds; Level 4 and above).

i18n: a custom LocaleContext drives Korean/English switching with @ustwo/locale persisted to AsyncStorage.

State persistence

| Store | Data | Key |
|---|---|---|
| SQLite (server) | calls, analysis_results, checkins, garden_state | n/a |
| AsyncStorage (app) | locale, dev_mode, garden (interactionCount + lastPositiveRatio + lastInteractionDate), checkins (local cache) | @ustwo/* |

Repository map

src/stage1/          Speaker diarization + ASR            (Seungjae)
src/stage2/          Audio + text emotion + fusion        (Seungjae)
src/stage3/          Character, garden, recap             (Youngkyun)
src/stage4/          FastAPI + SQLite + orchestration     (Youngkyun)
src/common/          Pydantic schemas shared across stages
app/                 React Native (Expo) app              (Youngkyun)
  src/routes/        expo-router (tabs, checkin, results, dev)
  src/components/    characters, garden, scenes, layout
  src/contexts/      Locale, Garden, DevMode
  src/hooks/         Animation hooks (idle, blink, emotion, tap, wander)
notebooks/           KcELECTRA fine-tuning (Colab)
tests/               pytest (73 Python tests) + jest (71 JS tests)
docs/stage{1,2,3,4}/ Per-stage technical docs
docs/images/         README screenshots
config.yaml          Global config (model paths, thresholds)
Dockerfile           HF Spaces deployment (Python 3.12 + torch CPU + ffmpeg)

Getting started

1. Clone

git clone https://github.com/boolooppang/UsTwo.git
cd UsTwo

2. Backend (Python)

python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=your_huggingface_token   # required by pyannote

uvicorn src.stage4.main:app --reload --port 8000
# → POST /api/upload, POST /api/analyze?call_id=X, GET /api/calls

3. App (React Native / Expo)

cd app && npm install
npx expo start --dev-client
# Press 'i' for iOS simulator, or scan the QR code with a real device

To point the app at a different server, edit API_BASE in app/src/api/client.ts (it currently defaults to the HF Spaces URL).

4. Tests

python -m pytest tests/ -v   # 73 Python tests
cd app && npx jest            # 71 JS tests (144 total)

Deployment: HuggingFace Spaces (Docker)

The API server is deployed to HuggingFace Spaces as a Docker image, running the full pipeline (Stage 1 β†’ 2 β†’ 3).

# Local Docker test
docker build -t ustwo .
docker run -p 7860:7860 -e HF_TOKEN=your_token ustwo

Team & ownership

| Member | Role | Responsibility |
|---|---|---|
| Juhyun | Product Owner / Researcher | Product design, UX research, empathic-accuracy literature review, emotion → character mapping spec, evaluation plan, final paper |
| Seungjae | ML Engineer | Stage 1–2 end to end: diarization, ASR, audio emotion, text emotion (ko/en), RAVDESS/MELD evaluation, KcELECTRA fine-tuning |
| Youngkyun | App Engineer | Stage 3–4 end to end: character mapping implementation, garden logic, LLM recap, FastAPI + SQLite server, React Native app, HF Spaces Docker deployment |

Success criteria

MVP (achieved):

  • ✅ Upload a call recording → character reaction + recap card within ~2 minutes.
  • ✅ Happy vs tense calls produce visibly different character reactions (9 pair states).
  • ✅ Cumulative positive calls grow the garden through 5 levels.
  • ✅ Bilingual pipeline (Korean + English) with automatic language detection.
  • ✅ MELD-based end-to-end test: 8/8 pipeline success, 7/7 exact emotion-label match.

License

MIT