Data Collection Protocol

#1
by consome2 - opened

This thread finalizes the technical spec for data collection: sampling/bit depth, per-speaker channels, background noise policy, file naming, and minimal metadata (region, accent, age band, topic tags).

Data Collection Protocol – Draft Proposal (v0.1)

Summary
Proposed spec for collecting spontaneous two-speaker conversations: audio parameters, per-speaker channeling, metadata, random topic prompts, and an in-recording self-redaction feature.

A. Audio & Channeling

  • Format: PCM WAV, 48 kHz, 16-bit
  • Channeling: Dual-mono (Speaker A / Speaker B) per session; optional mixed-down reference track
  • Environment: Everyday ambient noise allowed; copyrighted audio (music/TV/podcasts) is not allowed
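
To illustrate how a submission could be checked against the audio parameters above, here is a minimal sketch using Python's standard `wave` module. The function name `check_wav` is hypothetical, and the one-channel expectation assumes that dual-mono means one mono file per speaker, which the draft does not state explicitly.

```python
import wave

# Values from section A. The mono expectation is an assumption:
# it presumes one file per speaker rather than one two-channel file.
EXPECTED_RATE_HZ = 48_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit
EXPECTED_CHANNELS = 1


def check_wav(path):
    """Return a list of problems found; an empty list means the file conforms."""
    problems = []
    try:
        with wave.open(path, "rb") as wf:
            rate = wf.getframerate()
            width = wf.getsampwidth()
            channels = wf.getnchannels()
    except wave.Error as exc:
        # Raised when the file is not a WAV that the wave module can read.
        return [f"unreadable WAV: {exc}"]

    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ}")
    if width != EXPECTED_SAMPLE_WIDTH_BYTES:
        problems.append(f"sample width is {width * 8}-bit, expected 16-bit")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

A check like this could run on-device before upload or server-side on ingest; where it belongs is left open here.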

B. Session Design

  • Session length: 25 minutes, fixed
  • Chunking: Not used. If 5-minute chunking becomes operationally necessary, we will open a separate discussion before adopting it
  • Starter prompts: Show 1–3 random topics at session start (e.g., favorite foods, countries visited, recent books)
  • Task-oriented dialogs: Out of scope for now; may be added if community consensus emerges (e.g., trip planning, brainstorming, tongue-twisters, word-chain)
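
The starter-prompt rule above (show 1 to 3 random topics) could be sketched as follows. The topic pool reuses the three examples from the bullet; the function name and the optional `rng` parameter are assumptions made for illustration.

```python
import random

# Example topics taken from the proposal; a real pool would be larger.
TOPICS = ["favorite foods", "countries visited", "recent books"]


def pick_starter_prompts(topics, rng=None):
    """Pick between 1 and 3 distinct topics at random."""
    if not topics:
        raise ValueError("topic pool is empty")
    rng = rng or random.Random()
    k = rng.randint(1, min(3, len(topics)))  # inclusive on both ends
    return rng.sample(topics, k)
```

Passing a seeded `random.Random` makes the selection reproducible, which may help when debugging a session.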

C. File Naming (TBD)

  • Example (session-level files):
    YYYYMMDD_sessionID_speaker{A|B}.wav
    Companion JSON (session-level): YYYYMMDD_sessionID.meta.json
    Do not place any personally identifying information in file names.
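
As a sketch of how the example naming scheme could be validated, the snippet below parses a per-speaker file name and derives the companion JSON name. The draft does not define the format of `sessionID`, so the alphanumeric pattern is an assumption, as are both function names.

```python
import re
from datetime import datetime

# Assumption: sessionID is alphanumeric. The proposal does not define it.
SPEAKER_FILE_RE = re.compile(
    r"(?P<date>\d{8})_(?P<session_id>[A-Za-z0-9]+)_speaker(?P<speaker>[AB])\.wav"
)


def parse_speaker_filename(name):
    """Return date, session_id and speaker as a dict, or None if invalid."""
    m = SPEAKER_FILE_RE.fullmatch(name)
    if m is None:
        return None
    try:
        # Reject impossible dates such as month 13.
        datetime.strptime(m["date"], "%Y%m%d")
    except ValueError:
        return None
    return m.groupdict()


def meta_filename(date, session_id):
    """Build the companion JSON name for a session."""
    return f"{date}_{session_id}.meta.json"
```

For example, `parse_speaker_filename("20250131_s0001_speakerA.wav")` yields the date, session ID, and speaker letter, while a name ending in `speakerC.wav` is rejected.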

D. Metadata (per participant / session)
Participants can opt out of any profile field individually. Fields may also be masked or withheld at release time, depending on re-identification risk.

  • speaker_id — stable pseudonymous public ID per participant (e.g., spk_6Z4G3Y9Q). A separate internal ID is maintained privately for withdrawal/compliance workflows
  • age (e.g., 32)
  • gender (self-identified)
  • nationality
  • birth_country and birth_state/prefecture
  • accent (e.g., Japanese English)
  • first_language, second_language
  • education_level (e.g., bachelor’s degree)
  • MBTI (optional)
  • occupation (e.g., entrepreneur)
  • residence_country
  • interests (e.g., AI, Crypto)
  • device/OS/microphone (technical metadata)
  • network/latency logs (quality metadata)
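
The per-field opt-out could be applied when assembling a participant record, roughly as below. The JSON keys mirror the field names listed above where the draft gives them; `birth_state_prefecture` and `mbti` are hypothetical spellings. Treating `speaker_id` as always present, and the technical and quality metadata as separate from the profile, are also assumptions.

```python
# Profile fields from section D. Key spellings not given in the draft
# (birth_state_prefecture, mbti) are hypothetical.
PROFILE_FIELDS = (
    "age", "gender", "nationality", "birth_country", "birth_state_prefecture",
    "accent", "first_language", "second_language", "education_level",
    "mbti", "occupation", "residence_country", "interests",
)


def build_profile(speaker_id, answers, opted_out):
    """Keep only known profile fields the participant has not opted out of."""
    record = {"speaker_id": speaker_id}  # assumed to be always included
    for field in PROFILE_FIELDS:
        if field in answers and field not in opted_out:
            record[field] = answers[field]
    return record
```

Unknown keys in `answers` are dropped rather than passed through, so a typo in a client cannot leak an unreviewed field into the record.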

E. In-Recording Privacy Control

  • Self-redaction: During recording, a speaker can delete the most recent 10 seconds of their own speech, which is removed on-device and not uploaded
  • UI hint: A “rewind 10s → delete” control with a confirmation dialog
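
One way to picture on-device self-redaction is a local buffer that holds a speaker's audio before upload and can discard its newest 10 seconds. This is only a sketch under stated assumptions: the class name is hypothetical, the track is assumed to be mono 48 kHz 16-bit PCM, and audio is assumed to stay in the buffer until the session is uploaded.

```python
SAMPLE_RATE_HZ = 48_000  # section A
BYTES_PER_SAMPLE = 2     # 16-bit PCM
BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE  # mono track assumed


class LocalTrackBuffer:
    """Holds one speaker's not-yet-uploaded PCM audio (hypothetical class)."""

    def __init__(self):
        self._pcm = bytearray()

    def append(self, chunk: bytes) -> None:
        """Add newly captured PCM bytes to the end of the buffer."""
        self._pcm.extend(chunk)

    def redact_last(self, seconds: int = 10) -> int:
        """Discard up to `seconds` of the newest audio; return bytes removed."""
        n = min(len(self._pcm), seconds * BYTES_PER_SECOND)
        del self._pcm[len(self._pcm) - n:]
        return n

    def pending_bytes(self) -> int:
        """Number of bytes still waiting to be uploaded."""
        return len(self._pcm)
```

Truncating shortens this speaker's track relative to the other speaker's. Whether the removed span should instead be replaced with silence to keep the two tracks aligned is not specified in the draft.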

Open Questions

  1. Naming convention: Start with YYYYMMDD_sessionID_speakerX or include short codes for language/locale/topic?
  2. Chunking trigger: Under what operational conditions (if any) should we introduce 5-minute chunking (e.g., moderation backlog thresholds)?
  3. High-risk fields: For items such as alma mater, should we collect them but withhold them by default, releasing them only on request?
  4. Self-redaction duration: Keep 10s fixed, or offer selectable 5/10/20s?
  5. Non-dual-mono submissions: Should we accept WebRTC-style separated streams as an alternative to dual-mono?
