Data Collection Protocol
#1, opened by consome2
This thread finalizes the technical spec for data collection: sample rate and bit depth, per-speaker channels, background noise policy, file naming, and minimal metadata (region, accent, age band, topic tags).
Data Collection Protocol – Draft Proposal (v0.1)
Summary
Proposed spec for collecting spontaneous two-speaker conversations: audio parameters, per-speaker channeling, metadata, random topic prompts, and an in-recording self-redaction feature.
A. Audio & Channeling
- Format: PCM WAV, 48 kHz, 16-bit
- Channeling: Dual-mono (Speaker A / Speaker B) per session; optional mixed-down reference track
- Environment: Everyday ambient noise allowed; copyrighted audio (music/TV/podcasts) is not allowed
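As a rough illustration of the audio parameters above, an intake check could be sketched like this. It assumes each speaker is delivered as a separate mono file (the draft says dual-mono but does not fix the container layout), and the function name `check_speaker_wav` is hypothetical.

```python
import wave

EXPECTED_RATE_HZ = 48_000          # 48 kHz, from section A
EXPECTED_SAMPLE_WIDTH_BYTES = 2    # 16-bit, from section A
EXPECTED_CHANNELS = 1              # assumption: one mono file per speaker


def check_speaker_wav(source):
    """Return a list of problems; an empty list means the file conforms.

    `source` may be a file path or a binary file-like object.
    The stdlib `wave` module only reads uncompressed PCM and raises
    `wave.Error` for anything else, so the PCM requirement is enforced
    by opening the file at all.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE_HZ:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {EXPECTED_RATE_HZ}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width is {wav.getsampwidth() * 8}-bit, expected 16-bit")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"{wav.getnchannels()} channels found, expected {EXPECTED_CHANNELS}")
    return problems
```

If the final spec allows a single two-channel file per session instead, only `EXPECTED_CHANNELS` would need to change.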
B. Session Design
- Session length: Fixed at 25 minutes
- Chunking: Not used. If 5-minute chunking becomes operationally necessary, we will open a separate discussion before adopting it
- Starter prompts: Show 1–3 random topics at session start (e.g., favorite foods, countries visited, recent books)
- Task-oriented dialogs (e.g., trip planning, brainstorming, tongue-twisters, word-chain): Out of scope for now; may be added if community consensus emerges
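The starter-prompt behavior could be sketched as follows. The topic pool holds only the three examples from the draft, and `pick_starter_prompts` is a hypothetical helper; the real pool and selection rules are still to be decided.

```python
import random

# Example topics taken from the draft; the real pool would be larger.
TOPIC_POOL = ["favorite foods", "countries visited", "recent books"]


def pick_starter_prompts(pool, rng=None, min_count=1, max_count=3):
    """Pick between min_count and max_count distinct topics at random."""
    rng = rng or random.Random()
    upper = min(max_count, len(pool))    # never ask for more topics than exist
    count = rng.randint(min_count, upper)
    return rng.sample(pool, count)       # sample() guarantees no repeats


prompts = pick_starter_prompts(TOPIC_POOL)
```

Passing a seeded `random.Random` makes the selection reproducible, which may help when debugging a session.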
C. File Naming (TBD)
- Example (session-level files): YYYYMMDD_sessionID_speaker{A|B}.wav
- Companion JSON (session-level): YYYYMMDD_sessionID.meta.json
- Do not place any personally identifying information in file names
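A minimal sketch of building and parsing names in the pattern above. Since the convention is still TBD, the session ID format is an assumption here (letters and digits only), and all function names are hypothetical.

```python
import re
from datetime import date, datetime

# Assumption: session IDs are alphanumeric; the draft does not define them.
_SESSION_ID_RE = re.compile(r"[A-Za-z0-9]+")
_WAV_NAME_RE = re.compile(r"(\d{8})_([A-Za-z0-9]+)_speaker([AB])\.wav")


def _check_session_id(session_id):
    if not _SESSION_ID_RE.fullmatch(session_id):
        raise ValueError(f"session ID must be alphanumeric: {session_id!r}")


def speaker_wav_name(day, session_id, speaker):
    """Build YYYYMMDD_sessionID_speaker{A|B}.wav."""
    if speaker not in ("A", "B"):
        raise ValueError("speaker must be 'A' or 'B'")
    _check_session_id(session_id)
    return f"{day:%Y%m%d}_{session_id}_speaker{speaker}.wav"


def meta_json_name(day, session_id):
    """Build the session-level companion name YYYYMMDD_sessionID.meta.json."""
    _check_session_id(session_id)
    return f"{day:%Y%m%d}_{session_id}.meta.json"


def parse_speaker_wav_name(name):
    """Return (date, session_id, speaker); raise ValueError if malformed."""
    match = _WAV_NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid speaker file name: {name!r}")
    day = datetime.strptime(match.group(1), "%Y%m%d").date()  # rejects impossible dates
    return day, match.group(2), match.group(3)


speaker_wav_name(date(2025, 1, 31), "s0001", "A")  # -> "20250131_s0001_speakerA.wav"
```

Restricting the session ID to an opaque alphanumeric token is one way to keep personally identifying information out of file names.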
D. Metadata (per participant / session)
All profile fields are opt-out (per-field). Items may be masked or withheld at release time based on re-identification risk.
- speaker_id — stable pseudonymous public ID per participant (e.g., spk_6Z4G3Y9Q). A separate internal ID is maintained privately for withdrawal/compliance workflows
- age (e.g., 32)
- gender (self-identified)
- nationality
- birth_country and birth_state/prefecture
- accent (e.g., Japanese English)
- first_language, second_language
- education_level (e.g., bachelor’s degree)
- MBTI (optional)
- occupation (e.g., entrepreneur)
- residence_country
- interests (e.g., AI, Crypto)
- device/OS/microphone (technical metadata)
- network/latency logs (quality metadata)
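To make the per-field opt-out concrete, here is one possible sketch of assembling a participant record for the companion JSON. The key names `birth_state` and `mbti` are assumptions (the draft writes "birth_state/prefecture" and "MBTI"), and treating `speaker_id` as always present is also an assumption.

```python
import json

# Profile fields from section D; exact key spellings are partly assumed.
PROFILE_FIELDS = (
    "age", "gender", "nationality", "birth_country", "birth_state",
    "accent", "first_language", "second_language", "education_level",
    "mbti", "occupation", "residence_country", "interests",
)


def build_participant_record(speaker_id, profile, opted_out=()):
    """Return a dict with the pseudonymous ID plus every field not opted out.

    Fields the participant opted out of, or never supplied, are omitted
    entirely rather than stored as null.
    """
    record = {"speaker_id": speaker_id}
    for field in PROFILE_FIELDS:
        if field in opted_out or field not in profile:
            continue
        record[field] = profile[field]
    return record


record = build_participant_record(
    "spk_6Z4G3Y9Q",
    {"age": 32, "accent": "Japanese English", "occupation": "entrepreneur"},
    opted_out={"age"},
)
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Release-time masking for re-identification risk would be a separate step applied on top of this record.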
E. In-Recording Privacy Control
- Self-redaction: During recording, a speaker can delete the most recent 10 seconds of their own speech. The deleted audio is removed on-device and never uploaded
- UI hint: A “rewind 10s → delete” control with a confirmation dialog
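One way the on-device deletion could work is to hold each speaker's PCM in a local buffer and trim its tail before anything is uploaded. The class below is a hypothetical sketch: it interprets "the most recent 10 seconds of their own speech" as the last 10 seconds of that speaker's own channel, and it keeps the whole recording in memory for simplicity.

```python
class LocalRecordingBuffer:
    """Holds one speaker's PCM audio on-device until upload."""

    def __init__(self, sample_rate_hz=48_000, bytes_per_frame=2):
        self.sample_rate_hz = sample_rate_hz
        self.bytes_per_frame = bytes_per_frame   # 2 bytes = 16-bit mono
        self._pcm = bytearray()

    def _bytes_per_second(self):
        return self.sample_rate_hz * self.bytes_per_frame

    def append(self, pcm_bytes):
        """Add newly captured audio to the end of the buffer."""
        self._pcm.extend(pcm_bytes)

    def duration_seconds(self):
        return len(self._pcm) / self._bytes_per_second()

    def redact_last(self, seconds=10.0):
        """Delete up to `seconds` of audio from the tail; return seconds removed."""
        frames = int(seconds * self.sample_rate_hz)
        n_bytes = min(frames * self.bytes_per_frame, len(self._pcm))
        # Slice from an explicit start index: a negative slice would
        # wipe the whole buffer when n_bytes is 0.
        del self._pcm[len(self._pcm) - n_bytes:]
        return n_bytes / self._bytes_per_second()

    def bytes_for_upload(self):
        """Return only the audio that survived redaction."""
        return bytes(self._pcm)
```

A full 25-minute session at 48 kHz, 16-bit mono is roughly 144 MB, so a production client would more likely hold back only the most recent window and stream the rest. If the selectable 5/10/20 s option is adopted, the `seconds` argument already covers it.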
Open Questions
- Naming convention: Start with YYYYMMDD_sessionID_speakerX or include short codes for language/locale/topic?
- Chunking trigger: Under what operational conditions (if any) should we introduce 5-minute chunking (e.g., moderation backlog thresholds)?
- High-risk fields: For items like alma mater, should we adopt “collect-yes / public-by-request or withheld-by-default”?
- Self-redaction duration: Keep 10s fixed, or offer selectable 5/10/20s?
- Non-dual-mono submissions: Should we accept WebRTC-style separated streams as an alternative to dual-mono?