metadata

title: UsTwo API
emoji: 💑
colorFrom: pink
colorTo: purple
sdk: docker
app_port: 7860
pinned: false

UsTwo — your characters react to your calls

CMU 11-775 Large Scale Multimedia Analysis: Team Project
3-person team: Juhyun (PO / Research) · Seungjae (ML) · Youngkyun (App)

UsTwo takes a recorded phone call between two people (couple, friends, family) and runs multimodal analysis (audio + text) to understand how each speaker felt. It then produces three things inside a React Native + FastAPI mobile app: a character reaction scene, a growing emotion garden, and an LLM-written recap card.

The goal isn't just emotion classification. It's to visualize "what kind of moment did these two share on this call?"

[Screenshots: Home · Recap card · History]

At a glance

[call recording .wav/.m4a]
        │
        ▼
┌──────────────────── Stage 1 (Seungjae) ────────────────────┐
│  pyannote 4.x (3.1) → WhisperX large-v3-turbo INT8 → ko/en │
│  speaker diarization   ASR + forced alignment          LID  │
└──────────────────────────────────────────────────────────────┘
        │  segments: [{speaker, start, end, text, lang}]
        ▼
┌──────────────────── Stage 2 (Seungjae) ────────────────────┐
│  emotion2vec LoRA (ONNX)  +  KcELECTRA LoRA (ko) / DistilRoBERTa │
│  audio emotion (7-class)     text emotion (7-class ko / 7-class en) │
│                                                             │
│    fusion:  per-language, per-class trained weights (v2)   │
└─────────────────────────────────────────────────────────────┘
        │  per-speaker emotion distribution
        ▼
┌──────────────────── Stage 3 (Youngkyun) ───────────────────┐
│  character_mapping  +  garden_logic  +  recap_generator    │
│  9 pair interactions    5 levels · 4 moods    Claude LLM   │
└────────────────────────────────────────────────────────────┘
        │
        ▼
┌──────────────────── Stage 4 (Youngkyun) ───────────────────┐
│  FastAPI + SQLite          ↔          React Native (Expo)  │
│  6 endpoints, async                   expo-router · SVG    │
└────────────────────────────────────────────────────────────┘

End-to-end latency: a 1-minute call finishes in about 2 minutes on the HF Spaces CPU deployment.

Live server: https://bbbakery-ustwo-api.hf.space

Demo — the garden grows

The emotion garden levels up as positive-ratio interactions accumulate. Level 1 shows just two seedlings. After many positive calls, Level 5 is a full bloom with flowers, trees, and creatures.

[Screenshots: Level 1 (seedling) · Level 5 (full bloom)]

Level	Threshold (cumulative interactions)	Visual
1	0–2	Two seedlings
2	3–7	Grass + small flowers
3	8–14	Trees + many flowers
4	15–24	Lush garden + creatures
5	25+	Sunset sky + butterflies, rabbits + full bloom
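
The thresholds in the table above reduce to a lookup on the cumulative interaction count. The following is a minimal sketch of that lookup; the names `LEVEL_FLOORS` and `garden_level` are illustrative and are not taken from the repository.

```python
from bisect import bisect_right

# Lower bound of each level's cumulative-interaction range (levels 1..5),
# copied from the table above.
LEVEL_FLOORS = [0, 3, 8, 15, 25]


def garden_level(interaction_count: int) -> int:
    """Return the garden level (1-5) for a cumulative interaction count."""
    if interaction_count < 0:
        raise ValueError("interaction_count must be non-negative")
    # bisect_right counts how many floors are <= interaction_count,
    # which is exactly the level number.
    return bisect_right(LEVEL_FLOORS, interaction_count)


if __name__ == "__main__":
    for count in (0, 2, 3, 14, 15, 25, 40):
        print(count, "->", garden_level(count))
```

Keeping the floors in one sorted list means a threshold change only touches the data, not the lookup logic.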

A mood overlay (happy · neglected · recovering · conflict) is layered on top, decided by the recent positive/negative ratio and the days since the last interaction. A healthy garden that's been ignored for 5+ days turns neglected; a recent call with heavy negative emotion pushes it to conflict.

Stage 1–2 — ML Pipeline (Seungjae)

Stage 1: Diarization + ASR

Component	Model	Config
VAD	Silero VAD	onset 0.5, min_speech 0.25s
Diarization	pyannote-audio 4.x (3.1)	HF token required
ASR	Faster-Whisper (large-v3-turbo)	INT8 quantized
Forced alignment	WhisperX wav2vec2	per-word timestamps
Language ID	Whisper auto-detect + SenseVoice-Small	ko / en / unknown

Output: Stage1Output — segments: [{speaker, start, end, text, lang}], processing_info, models. Entry point: src/stage1/process.py.
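
To make the shape of the Stage 1 output concrete, here is a hedged sketch using standard-library dataclasses. The repository keeps its real schemas as Pydantic models in src/common/, so the field types, the `Segment` class name, the speaker labels, and the `speaking_time` helper below are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    speaker: str   # diarization label; the exact label format is an assumption
    start: float   # seconds from the start of the recording (assumed unit)
    end: float
    text: str
    lang: str      # "ko", "en", or "unknown"


@dataclass
class Stage1Output:
    segments: list[Segment]
    processing_info: dict = field(default_factory=dict)  # contents not specified here
    models: dict = field(default_factory=dict)           # contents not specified here


def speaking_time(output: Stage1Output) -> dict[str, float]:
    """Hypothetical helper: total seconds spoken per speaker."""
    totals: dict[str, float] = {}
    for seg in output.segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end - seg.start)
    return totals


if __name__ == "__main__":
    demo = Stage1Output(segments=[
        Segment("A", 0.0, 2.5, "안녕, 오늘 어땠어?", "ko"),
        Segment("B", 2.5, 4.0, "It was fine.", "en"),
        Segment("A", 4.0, 6.0, "다행이다.", "ko"),
    ])
    print(speaking_time(demo))
```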

Stage 2: Emotion Recognition (bilingual)

Channel	Model	Classes
Audio emotion	`emotion2vec_plus_base` + LoRA fine-tuned (ONNX)	7: neutral, joy, sadness, anger, surprise, fear, disgust
Text emotion — Korean	KcELECTRA-base-v2022 + PEFT LoRA (ONNX)	7: neutral, joy, sadness, anger, surprise, fear, disgust
Text emotion — English	DistilRoBERTa (j-hartmann/emotion-english, zero-shot)	7: 6 Ekman emotions + neutral
Fusion	Per-language, per-class weights trained via gradient descent	EMOTION_FUSION_WEIGHTS_KO / _EN

Evaluation — English (RAVDESS, n=2,880)

Condition	Accuracy	Macro F1	Latency
Clean (studio)	93.2%	0.932	250 ms
Phone (PSTN simulation, 300–3400 Hz band-limit)	71.5%	0.710	242 ms
Degradation	−21.7 pp	−0.222	—

Phone-robust emotions (F1 > 0.80): anger (0.836), surprise (0.856) — high-energy acoustic cues survive the band-limit.
Phone-degraded (largest drop): joy (0.946 → 0.585, −0.361), sadness (0.868 → 0.559, −0.309) — subtler pitch/timbre cues degrade hardest.
Text emotion sanity check (DistilRoBERTa on j-hartmann's held-out set): 95.2% accuracy.

Full report: docs/stage2/english-evaluation-report.md

Evaluation — Korean (KcELECTRA fine-tuning)

Fine-tuned on the AI Hub 감성 대화 말뭉치 (emotion dialogue corpus) on Colab GPU.
Macro F1: 0.20 (base) → 0.65 (fine-tuned), a 3.25× improvement.
Discovered an unlabeled "joy" cluster in the dataset and applied class weighting to recover minority classes.

Fusion weight training (v2, 2026-04-19)

v2 replaces the greedy per-emotion grid search with gradient descent in PyTorch. Each class gets an audio weight w_a = sigmoid(α), and all α values are optimized jointly with a cross-entropy loss plus L2 regularization, using a held-out validation split.

Language	Training data	Samples	Val Macro F1 (v1 → v2)
English	JL-Corpus + SAVEE + MELD + RAVDESS phone	2,821	0.6295 → 0.7596 (+13.01 pp)
Korean	AI Hub 263 val	1,294	0.8736 → 0.8748 (tie)

Biggest single win: English fear F1 rose from 0.04 to 0.67 (greedy's audio_w=0.00 was a local trap). The per-language weight tables live in src/common/constants.py (EMOTION_FUSION_WEIGHTS_KO / _EN). Details: docs/stage2/fusion-weights-english-grid-search.md.
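
For readers who want to see how a per-class audio weight w_a = sigmoid(α) could be applied at inference time, the sketch below fuses one audio and one text distribution. The convex combination w_a · p_audio + (1 − w_a) · p_text and the final renormalization are assumptions; this README only states that each class has a learned w_a. The α values in the example are made up and are not the trained weights.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse(audio_probs: dict, text_probs: dict, alpha: dict) -> dict:
    """Blend audio and text class probabilities with per-class audio weights.

    Assumed rule: fused[c] = w_a[c] * audio[c] + (1 - w_a[c]) * text[c],
    followed by renormalization so the result sums to 1.
    """
    fused = {}
    for label, p_audio in audio_probs.items():
        w_a = sigmoid(alpha[label])
        fused[label] = w_a * p_audio + (1.0 - w_a) * text_probs[label]
    total = sum(fused.values())
    if total == 0.0:
        raise ValueError("fused scores sum to zero")
    return {label: score / total for label, score in fused.items()}


if __name__ == "__main__":
    audio = {"joy": 0.8, "neutral": 0.2}
    text = {"joy": 0.4, "neutral": 0.6}
    # alpha = 0 gives w_a = 0.5 for every class, i.e. a plain average.
    print(fuse(audio, text, {"joy": 0.0, "neutral": 0.0}))
```

A strongly negative α pushes w_a toward 0 and lets the text channel decide that class, which is the same effect as the audio_w=0.00 setting mentioned above.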

End-to-end test — MELD (Friends TV dialogue)

We built 8 scenario WAVs from MELD (anger, joy, sadness, surprise, fear, bittersweet, annoyance, calm) and ran them through the deployed server.

Metric	Result
Pipeline success	8 / 8 files completed Stage 1 → 2 → 3
Exact top-1 emotion label match	7 / 7 (one tie)
Average processing time	~2 min / file (HF Spaces CPU)

End-to-end test — 20Hours Korean demo (2026-04-20)

A companion Korean E2E set, curated from the 20Hours Korean Conversational Speech dataset (M-F pairs only): 7 scenarios of about 1 minute each, used for the live demo.

Metric	Result
Pipeline success	7 / 7 files completed Stage 1 → 2 → 3
Intended-emotion match (one speaker)	4 / 7
Source	`data/20hours_test/` + `scripts/test_20hours_e2e_server.py`

Note: 20Hours source is ASR-training data without emotion labels — ground_truth.json lists intended demo emotions (for graph visualization), not ground-truth annotations.

Stage 3–4 — App + Server (Youngkyun)

Stage 3: Reaction · Garden · Recap

src/stage3/process.py takes the Stage 2 output and runs three independent modules:

Module	Role	Key logic
`character_mapping.py`	Buckets each speaker's 7-class emotion into 4 moods (up / calm / down / tense) and looks up the pair cell in a 4×4 matrix → one of 9 pair states + a giver role	`joy → up`, `anger → tense`, `surprise/fear/disgust` resolved via residual distribution; `up × down → comforting (giver=A)`, `tense × tense → back_turned`, `calm × calm → idle`, ...
`character_mapping.py` (intensity)	Emits a `CharacterReaction.intensity` in `{1, 2, 3}` used by the app for healing-cycle thresholds	Default 2 → 3 cycles to heal; 1 → 4 cycles; 3 → 2 cycles
`garden_logic.py`	Computes growth delta (0–3) from call quality	`positive_ratio ≥ 0.5` → +3, `pos ≥ 0.3 && neg < 0.3` → +2, `neg ≥ 0.5` → 0
`recap_generator.py`	Generates the narrative recap card via the Claude API	System prompt requires one concrete hook from the transcript (topic, decision, shared joke) + a light garden-voice framing — titles are call-specific, not combo templates. Rule-based template fallback when no API key.
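
The growth rules in the table can be read as a short decision chain. The sketch below checks them in the order listed. The table does not state which calls earn +1, so the `fallback` of 1 for every remaining case is an assumption, as are the function name and the rule order; none of this is copied from garden_logic.py.

```python
def growth_delta(positive_ratio: float, negative_ratio: float, fallback: int = 1) -> int:
    """Return a growth delta in 0..3 for one call.

    Rules are evaluated in the order the table lists them. `fallback`
    covers calls that match none of the listed rules (assumed to be +1).
    """
    if positive_ratio >= 0.5:
        return 3
    if positive_ratio >= 0.3 and negative_ratio < 0.3:
        return 2
    if negative_ratio >= 0.5:
        return 0
    return fallback


if __name__ == "__main__":
    print(growth_delta(0.6, 0.1))   # mostly positive call
    print(growth_delta(0.35, 0.2))  # mildly positive call
    print(growth_delta(0.1, 0.6))   # heavily negative call
```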

Mood resolution: neg ≥ 0.5 → conflict; neg ≥ 0.3 && pos < 0.3 → recovering; level ≤ 1 && pos < 0.3 → neglected; otherwise happy. Level only goes up — tough calls shift the tint, never the count.
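
The mood rules above translate directly into a first-match chain. This is a minimal sketch of that chain; the function name and argument names are illustrative.

```python
def resolve_mood(positive_ratio: float, negative_ratio: float, level: int) -> str:
    """Pick the garden mood overlay using the first rule that matches."""
    if negative_ratio >= 0.5:
        return "conflict"
    if negative_ratio >= 0.3 and positive_ratio < 0.3:
        return "recovering"
    if level <= 1 and positive_ratio < 0.3:
        return "neglected"
    return "happy"


if __name__ == "__main__":
    print(resolve_mood(0.1, 0.6, level=3))  # heavy negative emotion
    print(resolve_mood(0.2, 0.4, level=3))
    print(resolve_mood(0.1, 0.1, level=1))
    print(resolve_mood(0.7, 0.1, level=4))
```

Because the function returns only a mood string and never touches the level, a tough call can change the tint without lowering the garden's progress.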

Confidence gate: a speaker whose top emotion probability is below 0.5 is demoted to calm for pair-state lookup. The 7-class mood chip on the Results screen can therefore differ from the pair-state particle — the chip shows the dominant label, the particle shows the matrix cell after gating.
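
The sketch below shows how the chip label and the pair-state mood can diverge under the confidence gate. Only `joy → up` and `anger → tense` come from the mapping table; the `sadness → down` and `neutral → calm` entries, the `calm` placeholder for surprise / fear / disgust (which the real module resolves via the residual distribution), and all names are assumptions.

```python
CONFIDENCE_THRESHOLD = 0.5

# Partial, illustrative bucket table. The sadness and neutral entries are assumed.
SIMPLE_BUCKETS = {"joy": "up", "anger": "tense", "sadness": "down", "neutral": "calm"}


def chip_and_pair_mood(dist: dict, buckets: dict = SIMPLE_BUCKETS) -> tuple:
    """Return (chip_label, pair_mood) for one speaker's emotion distribution.

    chip_label is always the dominant emotion. pair_mood is forced to "calm"
    when the dominant probability is below the confidence threshold.
    """
    label = max(dist, key=dist.get)
    if dist[label] < CONFIDENCE_THRESHOLD:
        return label, "calm"
    # Placeholder: emotions missing from the table fall back to "calm" here.
    return label, buckets.get(label, "calm")


if __name__ == "__main__":
    print(chip_and_pair_mood({"anger": 0.45, "neutral": 0.35, "joy": 0.20}))
    print(chip_and_pair_mood({"anger": 0.70, "neutral": 0.20, "joy": 0.10}))
```

In the first call the chip would read anger while the pair lookup uses calm, which is the mismatch described above.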

Stage 4: FastAPI backend

Framework: FastAPI + SQLAlchemy + SQLite (local) · 4 tables: calls, analysis_results, checkins, garden_state.
Async pipeline: the client calls POST /api/upload, then POST /api/analyze?call_id=X, which returns 202 Accepted and starts the analysis in a background thread. The client then polls GET /api/calls/{id} until the result is ready.
Mock path: drop data/samples/{call_id}_stage2.json and the API skips Stage 1–2, running only Stage 3 — useful for E2E testing without the heavy ML deps.
Endpoints (6): /api/upload, /api/analyze, /api/calls, /api/calls/{id}, /api/checkins, /api/garden. /api/calls extracts recap_card.title from the stored Stage 3 JSON so the Home and History feeds can headline each card with a distinct title.
Tests: 12 API tests on in-memory SQLite via pytest.
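
The upload, analyze, poll flow above leaves the waiting loop to the client. Here is a hedged sketch of such a loop. The `fetch_call` callable stands in for a GET /api/calls/{id} request, and the `status` field with its `done` / `failed` values is hypothetical; check the actual response schema before relying on it.

```python
import time


def wait_for_analysis(fetch_call, call_id, interval_s=5.0, timeout_s=600.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the call record reports a terminal status or the timeout passes.

    fetch_call(call_id) must return a dict; the "status" key is an assumed field.
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        call = fetch_call(call_id)
        if call.get("status") in ("done", "failed"):
            return call
        if clock() >= deadline:
            raise TimeoutError(f"analysis of call {call_id} did not finish in {timeout_s}s")
        sleep(interval_s)


if __name__ == "__main__":
    # Fake server: two "processing" replies, then a finished record.
    replies = iter([{"status": "processing"}, {"status": "processing"},
                    {"status": "done", "id": "demo"}])
    print(wait_for_analysis(lambda _id: next(replies), "demo", sleep=lambda s: None))
```

With a roughly 2-minute processing time per 1-minute call, a polling interval of a few seconds keeps request volume low without making the result feel late.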

React Native (Expo) app

Router: expo-router — 3 tabs (Us, History, Settings) plus modal routes (checkin, results/[callId]).

Screen	Description
Onboarding	Paper-deck intro ("A garden for two") that sets the metaphor before the first call lands
Home (`Us`)	Live character scene + garden, recent-call feed, mailbox entry point
Seed bloom alert	Level-up moment; fires when the interaction count crosses a garden threshold, offering `View` / `Later`
Check-in	2-step prompt: my mood, then my guess for the partner's mood (empathic accuracy)
Results: emotion analysis	`My mood` / `Their mood` chips, suggestion line in the garden-voice, `Emotional Landscape` wave (uplifting → heavy), and `Moments that mattered` per-speaker slices
Results: recap card	Call-specific LLM title + narrative + highlights + per-call `Our Garden` delta, followed by the `Was this accurate?` thumbs-up/down feedback loop
History	`Our Emotional Flow` chart (Me vs Partner), garden delta summary, recent moment cards headlined by the call-specific recap title
Settings	Language toggle (ko/en), developer mode, dev tools

Character animation layers (all built on react-native-reanimated + react-native-svg):

Idle breathing — per-mood rhythm profile (up / calm / down / tense), uniform scaleAmp 1.02, micro-bob, micro-sway (1.4s cycle).
Eye blink — 2.8–5.5s interval, 15% chance of a double-blink.
Emotion transition — joy = bounce + arms raised, sadness = sink + drooping arms, anger / disgust = crossed arms (no tilt), surprise = pop, fear = shrink. Body rotation is globally forbidden — emotion reads via face, pupil, body pose, and idle rhythm instead.
Pair-state body pose — comforting giver moves 60% closer with no tilt; receiver gets a gentle sag + pupilOffset (head-lowering effect on eyes).
Pair-state particle signature, with one preset for each of the 9 pair states that the 4×4 matrix resolves to:
- comforting → staged echo: 1 heart (giver → receiver) + 1 bubble dot + 1 translucent heart (bezier engine, useParticles).
- dancing → 1 music note rising above the heads.
- cheering / listening / sitting_together / defusing → yellow sparkle cluster.
- tension / back_turned → grey sigh cloud.
- idle (calm × calm) → silent; the couple just wanders.
Generalised healing — every giver-present cell (comforting, listening, defusing, cheering) counts interaction cycles; when the intensity-scaled threshold is hit the receiver transitions to neutral. If the giver's solo state was negative-caring (fear / anger / sadness / disgust), they heal too — no one is left behind.
Tap interaction — spring bounce + floating hearts / sparkles (FloatingHearts component).
Wander — characters roam between pre-validated waypoints (3–8s move, 2–5s pause), with direction-aware facing (scaleX flip for left/right, back-view for upward movement).
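
The healing rule combines the giver-present pair states with the intensity-to-cycles mapping from the Stage 3 table (1 → 4 cycles, 2 → 3, 3 → 2). The app implements this in React Native; the sketch below restates only the counting logic in Python, and the class and constant names are hypothetical.

```python
# Interaction cycles needed to heal, keyed by CharacterReaction.intensity.
CYCLES_TO_HEAL = {1: 4, 2: 3, 3: 2}

# Pair states in which one character acts as the giver.
GIVER_STATES = {"comforting", "listening", "defusing", "cheering"}


class HealingTracker:
    """Counts interaction cycles and reports when the receiver should turn neutral."""

    def __init__(self, pair_state: str, intensity: int = 2):
        self.active = pair_state in GIVER_STATES
        self.threshold = CYCLES_TO_HEAL[intensity]
        self.cycles = 0

    def tick(self) -> bool:
        """Register one interaction cycle; return True once the threshold is reached."""
        if not self.active:
            return False
        self.cycles += 1
        return self.cycles >= self.threshold


if __name__ == "__main__":
    tracker = HealingTracker("comforting", intensity=2)
    print([tracker.tick() for _ in range(3)])
```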

Garden rendering (SVG): Sky, Ground, Trees (level-specific types and counts), Flowers (distributed across four quadrants), Creatures (butterflies, rabbits, birds — Level 4 and above).

i18n: a custom LocaleContext drives Korean/English switching with @ustwo/locale persisted to AsyncStorage.

State persistence

Store	Data	Key
SQLite (server)	calls, analysis_results, checkins, garden_state	—
AsyncStorage (app)	locale, dev_mode, garden (interactionCount + lastPositiveRatio + lastInteractionDate), checkins (local cache)	`@ustwo/*`

Repository map

src/stage1/          Speaker diarization + ASR            (Seungjae)
src/stage2/          Audio + text emotion + fusion        (Seungjae)
src/stage3/          Character, garden, recap             (Youngkyun)
src/stage4/          FastAPI + SQLite + orchestration     (Youngkyun)
src/common/          Pydantic schemas shared across stages
app/                 React Native (Expo) app              (Youngkyun)
  src/routes/        expo-router (tabs, checkin, results, dev)
  src/components/    characters, garden, scenes, layout
  src/contexts/      Locale, Garden, DevMode
  src/hooks/         Animation hooks (idle, blink, emotion, tap, wander)
notebooks/           KcELECTRA fine-tuning (Colab)
tests/               pytest (73 Python tests) + jest (71 JS tests)
docs/stage{1,2,3,4}/ Per-stage technical docs
docs/images/         README screenshots
config.yaml          Global config (model paths, thresholds)
Dockerfile           HF Spaces deployment (Python 3.12 + torch CPU + ffmpeg)

Getting started

1. Clone

git clone https://github.com/boolooppang/UsTwo.git
cd UsTwo

2. Backend (Python)

python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=your_huggingface_token   # required by pyannote

uvicorn src.stage4.main:app --reload --port 8000
# → POST /api/upload, POST /api/analyze?call_id=X, GET /api/calls

3. App (React Native / Expo)

cd app && npm install
npx expo start --dev-client
# Press 'i' for iOS simulator, or scan the QR code with a real device

To point the app at a different server, edit API_BASE in app/src/api/client.ts (it currently defaults to the HF Spaces URL).

4. Tests

python -m pytest tests/ -v   # 73 Python tests
cd app && npx jest            # 71 JS tests (144 total)

Deployment — HuggingFace Spaces (Docker)

The API server is deployed to HuggingFace Spaces as a Docker image, running the full pipeline (Stage 1 → 2 → 3).

Live URL: https://bbbakery-ustwo-api.hf.space
Environment variables: HF_TOKEN, ANTHROPIC_API_KEY (set in Spaces Settings).
Config files: Dockerfile, railway.toml, requirements-deploy.txt.

# Local Docker test
docker build -t ustwo .
docker run -p 7860:7860 -e HF_TOKEN=your_token ustwo

Team & ownership

Member	Role	Responsibility
Juhyun	Product Owner / Researcher	Product design, UX research, empathic-accuracy literature review, emotion → character mapping spec, evaluation plan, final paper
Seungjae	ML Engineer	Stage 1–2 end to end — diarization, ASR, audio emotion, text emotion (ko/en), RAVDESS/MELD evaluation, KcELECTRA fine-tuning
Youngkyun	App Engineer	Stage 3–4 end to end — character mapping implementation, garden logic, LLM recap, FastAPI + SQLite server, React Native app, HF Spaces Docker deployment

Success criteria

MVP — achieved:

✅ Upload a call recording → character reaction + recap card within ~2 minutes.
✅ Happy vs tense calls produce visibly different character reactions (9 pair states).
✅ Cumulative positive calls grow the garden through 5 levels.
✅ Bilingual pipeline (Korean + English) with automatic language detection.
✅ MELD-based end-to-end test: 8/8 pipeline success, 7/7 exact emotion-label match.

License

MIT