| # TREA 2.0 - Technical Documentation | |
| Comprehensive technical documentation for the TREA 2.0 audio dataset generation pipeline. This document covers the complete implementation including algorithms, mathematical formulations, configuration parameters, preprocessing details, and capacity-aware balancing mechanisms. | |
| **For Quick Start Guide**: See [README.md](README.md) | |
| --- | |
| ## Table of Contents | |
| 1. [Pipeline Overview](#pipeline-overview) | |
| 2. [How Sample Durations Are Generated](#how-sample-durations-are-generated) | |
| 3. [Configuration Reference](#configuration-reference) | |
| 4. [ESC-50 Preprocessing](#esc-50-preprocessing-duration-task-only) | |
| 5. [Audio Utilities](#audio-utilities) | |
| 6. [Task: COUNT](#task-count) | |
| 7. [Task: DURATION](#task-duration) | |
| 8. [Task: ORDER](#task-order) | |
| 9. [Task: VOLUME](#task-volume) | |
| 10. [Deterministic Balancing Mechanisms](#deterministic-balancing-mechanisms) | |
| 11. [Rejection Logic and Retry Mechanisms](#rejection-logic-and-retry-mechanisms) | |
12. [Complete Task Creation Explanation](#complete-task-creation-explanation)
13. [Command-Line Arguments](#command-line-arguments)
14. [Summary](#summary)
| --- | |
| ## Pipeline Overview | |
| ### Architecture | |
| The pipeline generates four types of audio-based question-answering samples: | |
| | Task | Question Type | Example Question | | |
| |------|---------------|------------------| | |
| | **COUNT** | Counting unique sounds | "How many unique sounds do you hear?" | | |
| | **DURATION** | Temporal comparison | "Which sound plays for the longest duration?" | | |
| | **ORDER** | Temporal ordering | "Which sound plays first/last/after X?" | | |
| | **VOLUME** | Loudness comparison | "Which sound is the loudest/softest?" | | |
| ### Directory Structure | |
| ``` | |
pipeline/
├── main.py                 # Entry point - orchestrates all tasks
├── config.yaml             # All configuration parameters
├── tasks/
│   ├── task_count.py       # CountTaskGenerator class
│   ├── task_duration.py    # DurationTaskGenerator class
│   ├── task_order.py       # OrderTaskGenerator class
│   └── task_volume.py      # VolumeTaskGenerator class
├── utils/
│   ├── __init__.py         # Exports all utilities
│   ├── audio_utils.py      # Audio processing functions
│   ├── dataset_utils.py    # ESC50Dataset, PreprocessedESC50Dataset
│   ├── question_utils.py   # QuestionGenerator
│   ├── llm_utils.py        # LLMQuestionGenerator
│   └── logger.py           # setup_logger
└── output/                 # Generated outputs
| ``` | |
| ### Data Flow | |
| ``` | |
ESC-50 Dataset (2000 clips, 50 categories, 5s each)
        ↓
[DURATION TASK ONLY] Preprocessing Script (preprocess_esc50.py)
├── Detects sound regions using adaptive noise-floor thresholding
├── Trims leading/trailing silence (keeps internal structure)
└── Calculates effective durations
        ↓
ESC-50_preprocessed/
├── effective_durations.csv (metadata with effective durations)
└── trimmed_audio/*.wav (edge-trimmed clips)
        ↓
Pipeline (task-specific generation with balancing)
├── COUNT: Uses raw ESC-50 clips
├── DURATION: Uses preprocessed clips with effective durations
├── ORDER: Uses raw ESC-50 clips
└── VOLUME: Uses raw ESC-50 clips (normalized then volume-adjusted)
        ↓
output/{task}/
├── audios/*.wav (generated audio samples)
├── {task}_mcq.csv (multiple choice questions)
├── {task}_open_text.csv (open-ended questions)
└── {task}_metadata.csv (detailed metadata)
| ``` | |
| ### Entry Point: `main.py` | |
| The main orchestration happens via individual task runner functions: | |
| ```python | |
def run_count_task(config: dict, logger):
    generator = CountTaskGenerator(config, logger)
    generator.dataset.reset_category_usage()
    generator.generate_dataset()


def run_duration_task(config: dict, logger):
    generator = DurationTaskGenerator(config, logger)
    generator.dataset.reset_category_usage()
    generator.generate_dataset()


def run_order_task(config: dict, logger):
    generator = OrderTaskGenerator(config, logger)
    generator.dataset.reset_category_usage()
    generator.generate_dataset()


def run_volume_task(config: dict, logger):
    generator = VolumeTaskGenerator(config, logger)
    generator.dataset.reset_category_usage()
    generator.generate_dataset()
| ``` | |
| --- | |
| ## How Sample Durations Are Generated | |
**IMPORTANT**: Sample durations are generated upfront to fill the target task duration; the unfilled leftover is guaranteed to be smaller than one minimum clip duration.
| ### The Algorithm | |
| Located in `utils/audio_utils.py`: | |
| ```python | |
| def generate_sample_durations_for_task( | |
| task_duration_hours: float, | |
| min_clip_duration: float, | |
| max_clip_duration: float | |
| ) -> list: | |
| """ | |
| Generate sample durations that exactly fill the target task duration. | |
| """ | |
| task_duration_seconds = task_duration_hours * 3600 | |
| remaining = task_duration_seconds | |
| durations = [] | |
| while remaining >= min_clip_duration: | |
| # Cap max at remaining to avoid overshoot | |
| effective_max = min(max_clip_duration, remaining) | |
| # If remaining is less than min, we can't fit another sample | |
| if effective_max < min_clip_duration: | |
| break | |
| # Sample uniformly within valid range | |
| d = random.uniform(min_clip_duration, effective_max) | |
| durations.append(d) | |
| remaining -= d | |
| # Shuffle to randomize order | |
| random.shuffle(durations) | |
| return durations | |
| ``` | |
| 1. Start with `remaining = total_seconds` | |
| 2. While `remaining >= min_clip_duration`: | |
| - Sample `d ~ Uniform(min, min(max, remaining))` | |
| - Append `d` to durations list | |
| - Subtract `d` from remaining | |
| 3. Shuffle and return | |
| ### Mathematical Properties | |
| **Guarantee**: $\sum_{i=1}^{N} d_i \leq T$ and $T - \sum d_i < d_{\min}$ | |
| Where: | |
| - $T$ = total task duration | |
| - $d_i$ = duration of sample $i$ | |
| - $d_{\min}$ = minimum clip duration | |
| - $N$ = number of samples generated (variable, not fixed!) | |
| **Each duration**: $d_i \sim \text{Uniform}(d_{\min}, \min(d_{\max}, \text{remaining}_i))$ | |
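The guarantee can be verified empirically. The following minimal sketch assumes `generate_sample_durations_for_task` (shown above) is importable; the seed and loop bounds are illustrative:

```python
import random

random.seed(42)
for _ in range(100):
    T = random.uniform(0.5, 3.0) * 3600  # random task duration in seconds
    durations = generate_sample_durations_for_task(T / 3600, 20.0, 60.0)
    leftover = T - sum(durations)
    # sum(d_i) <= T and T - sum(d_i) < d_min
    assert 0 <= leftover < 20.0
    # every draw stays inside [d_min, d_max]
    assert all(20.0 <= d <= 60.0 for d in durations)
```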
| ### Example | |
| With `task_duration_size = 1.0` hours (3600s), `min = 20s`, `max = 60s`: | |
| ``` | |
remaining=3600.0 → d₁=45.2s → remaining=3554.8
remaining=3554.8 → d₂=28.7s → remaining=3526.1
remaining=3526.1 → d₃=52.1s → remaining=3474.0
...
remaining=35.2   → d₈₉ ~ Uniform(20, 35.2) (upper bound capped at remaining) → remaining < 20s, loop ends
| ``` | |
Result: 89 samples whose total is within one `min_clip_duration` (20s) of the full 3600s, instead of the naive estimate of 3600/40 = 90 samples.
| ### Where It's Called | |
| Each task's `generate_dataset()` method uses this: | |
| ```python | |
| def generate_dataset(self) -> tuple: | |
| # Generate all durations upfront | |
| sample_durations = generate_sample_durations_for_task( | |
| self.task_duration_hours, | |
| self.min_clip_duration, | |
| self.max_clip_duration | |
| ) | |
| num_samples = len(sample_durations) | |
| self.logger.info(f"Generating {num_samples} samples...") | |
| # Each sample uses its pre-assigned duration | |
| for i, target_duration in enumerate(sample_durations): | |
| metadata = self.generate_sample(i, target_duration=target_duration, ...) | |
| ``` | |
| ``` | |
| --- | |
| ## Configuration Reference | |
| All parameters are defined in `config.yaml`. | |
| ### Dataset Class Subset Configuration | |
| ```yaml | |
| dataset: | |
| use_class_subset: false # Enable to use only a subset of ESC-50 classes | |
| num_classes_subset: 40 # Number of classes for train/val/test (e.g., 40 of 50) | |
| subset_persist_path: "output/class_subset.json" # Path to save/load class subset | |
| subset_seed: 42 # Random seed for subset selection (persisted) | |
| ``` | |
| **Purpose**: Create in-distribution (ID) splits using a subset of classes, then optionally test on out-of-distribution (OOD) using all classes. | |
| **Workflow**: | |
| 1. Set `use_class_subset: true` and `num_classes_subset: 40` | |
| 2. Run pipeline - 40 classes randomly selected and saved to `class_subset.json` | |
| 3. Generate train/val/test splits - all use same 40 classes | |
| 4. For OOD test: Set `use_class_subset: false`, use different output path | |
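A minimal sketch of how such a subset could be selected and persisted (the helper name and exact JSON layout are assumptions, not the pipeline's actual code):

```python
import json
import os
import random

def load_or_create_class_subset(all_classes, num_classes, persist_path, seed):
    """Hypothetical helper mirroring the workflow above: select a class
    subset once, persist it, and reuse it for all later splits."""
    if os.path.exists(persist_path):
        with open(persist_path) as f:
            return json.load(f)
    # Seeded, order-independent selection so the subset is reproducible
    subset = sorted(random.Random(seed).sample(sorted(all_classes), num_classes))
    with open(persist_path, "w") as f:
        json.dump(subset, f, indent=2)
    return subset
```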
| ### Global Audio Parameters | |
| ```yaml | |
| audio: | |
| min_clip_duration: 20.0 # Minimum generated clip duration (seconds) | |
| max_clip_duration: 60.0 # Maximum generated clip duration (seconds) | |
| source_clip_duration: 5.0 # ESC-50 clip length (seconds) | |
| # Silence and crossfade parameters (applied to ALL tasks) | |
| min_silence_duration: 100 # Minimum silence ALWAYS between clips (ms) | |
| max_extra_silence_per_gap: 500 # Max extra silence per gap when distributing remainder (ms) | |
| crossfade_duration: 500 # Crossfade between audio-silence transitions (ms) for smooth joins | |
| crossfade_within_source: 50 # Small crossfade within same-source repetitions (ms) for COUNT task | |
| with_silence: true # Enable silence insertion between clips | |
| normalize: false | |
| normalize_target_dBFS: -20.0 | |
| ``` | |
| ### Task-Specific Parameters | |
| #### COUNT Task | |
| ```yaml | |
| count: | |
| enabled: true | |
| task_duration_size: 2.0 # Hours of total audio to generate | |
| max_clips_per_sample: 10 # Maximum unique sounds per sample (1 to 10) | |
| ordering_mode: "random" # "random" (shuffled clips) or "consecutive" (grouped by source) | |
| # CAPACITY-AWARE ANSWER BALANCING: | |
| # - Creates balanced distribution of answers from 1 to max_clips_per_sample | |
| # - Sorts samples by capacity (max_clips each can fit) | |
| # - Assigns higher targets to high-capacity samples | |
| # - Clamps targets to what actually fits (reduces excessive silence) | |
| ``` | |
| #### DURATION Task | |
| ```yaml | |
| duration: | |
| enabled: true | |
| task_duration_size: 2.0 | |
| preprocessed_data_path: "/home/debarpanb1/TREA_2.0/ESC-50_preprocessed" | |
| question_types: ["shortest", "longest"] | |
| num_unique_sources: 10 # Can be int or list (e.g., [2,3,4,5]) | |
| ordering_methods: ["consecutive"] # Only consecutive for duration task | |
| # Preprocessing parameters (adaptive noise-floor thresholding) | |
| threshold_strategy: "noise_floor" # Adaptive per-clip (recommended) | |
| noise_floor_percentile: 2.0 # Use 2nd percentile as noise floor | |
| noise_floor_delta_db: 5.0 # Threshold = noise_floor + 5dB | |
| min_sound_duration_ms: 25 # Filter transient spikes | |
| # Gap multipliers | |
  multiplier_longest: 1.5 # Target must be ≥ 1.5x max background
  multiplier_shortest: 0.75 # Target must be ≤ 0.75x min background (changed from 0.5)
| min_effective_duration_per_source: 1.0 # Minimum duration per source (seconds) | |
| reject_if_gap_not_met: true | |
| sample_different_clips_same_class: true | |
| ``` | |
| #### ORDER Task | |
| ```yaml | |
| order: | |
| enabled: true | |
| task_duration_size: 2.0 | |
| max_clips_per_sample: 10 # Cap for maximum clips to join | |
| question_types: ["first", "last", "second", "second_last", "after", "before"] | |
  min_clips_for_second_questions: 3 # "second" and "second_last" require ≥3 clips
| allow_source_repetition: false # Each clip from unique source | |
| # CAPACITY-AWARE QUESTION TYPE BALANCING: | |
| # - Each question type appears equally across samples | |
| # - Advanced types (second, second_last) assigned to high-capacity samples | |
| # - Basic types (first, last, after, before) for lower-capacity samples | |
| # - NO n_clips balancing: randomly samples from [max(2, max_clips-3), max_clips_per_sample] | |
| ``` | |
| #### VOLUME Task | |
| ```yaml | |
| volume: | |
| enabled: true | |
| task_duration_size: 2.0 | |
| max_clips_per_sample: 10 # Cap for maximum clips with different volumes | |
| question_types: ["max_loudness", "min_loudness"] | |
| # Normalization (CRITICAL for controlled volume comparison) | |
| normalize_to_baseline: true | |
| baseline_dBFS: -20.0 # All clips normalized to this level first | |
| use_lufs: false # DISABLED - LUFS makes everything same perceived loudness! | |
| baseline_lufs: -23.0 # EBU R128 standard (not used when use_lufs=false) | |
| # Volume gap constraints (multipliers) | |
  multiplier_max_loudness: 4.0 # Max must be ≥ 4x second-loudest (~12 dB)
  multiplier_min_loudness: 0.25 # Min must be ≤ 0.25x second-softest (~12 dB)
| reject_if_gap_not_met: true | |
| # Source clip options | |
| use_same_clip_different_volumes: false # Use different clips (not same clip repeated) | |
| repetitions_per_source: [2, 3, 4] # If same clip used, how many repetitions | |
| # QUESTION TYPE BALANCING: Each question type appears equally across samples | |
| # NO n_clips balancing: randomly samples from [max(2, max_clips-3), max_clips_per_sample] | |
| ``` | |
| --- | |
| ## ESC-50 Preprocessing (Duration Task Only) | |
| **File**: `preprocess_esc50.py` | |
| **Purpose**: Preprocess ESC-50 clips for duration task by detecting actual sound regions and trimming silence. | |
| ### Why Preprocessing? | |
| The DURATION task compares sound durations. Raw ESC-50 clips have variable amounts of leading/trailing silence, which would make duration comparisons ambiguous. Preprocessing: | |
| 1. **Detects actual sound regions** using adaptive amplitude thresholding | |
| 2. **Trims leading and trailing silence** (preserves internal structure) | |
| 3. **Calculates effective duration** (sum of all sound regions) | |
| 4. **Generates metadata CSV** with per-clip durations | |
| ### Preprocessing Pipeline | |
| ``` | |
| Raw ESC-50 clip (5s with silence) | |
        ↓
| 1. Load audio and convert to amplitude array | |
| 2. Compute RMS envelope (frame-by-frame energy) | |
| 3. Convert RMS to dB values | |
| 4. Apply adaptive threshold strategy | |
| 5. Detect contiguous sound regions | |
| 6. Trim edges (only if silence >= 100ms) | |
| 7. Calculate effective duration | |
| 8. Save trimmed audio + metadata | |
| ``` | |
| ### Adaptive Noise-Floor Thresholding | |
| The preprocessing uses an **adaptive per-clip threshold** strategy: | |
| ```python | |
| # Strategy: 'noise_floor' (adaptive, recommended) | |
| noise_floor_db = np.percentile(db_values, noise_floor_percentile) # e.g., 2nd percentile | |
| absolute_threshold = noise_floor_db + noise_floor_delta_db # e.g., +5 dB above noise floor | |
| ``` | |
| **Key Parameters** (from `config.yaml`): | |
| ```yaml | |
| duration: | |
| threshold_strategy: "noise_floor" # Adaptive per-clip (recommended) | |
| noise_floor_percentile: 2.0 # Use 2nd percentile as noise floor estimate | |
| noise_floor_delta_db: 5.0 # Threshold = noise_floor + 5 dB | |
| min_sound_duration_ms: 25 # Filter out transient spikes < 25ms | |
| ``` | |
| **Why Adaptive?** | |
| - Each clip has different background noise levels | |
| - Fixed threshold (e.g., -40 dB) works poorly across diverse sounds | |
| - Adaptive threshold adjusts per-clip based on its own noise floor | |
| **Alternative** (legacy): | |
| ```yaml | |
| threshold_strategy: "peak_relative" # threshold = peak_dB - 20 dB (fixed offset) | |
| amplitude_threshold_db: -20.0 | |
| ``` | |
| ### Edge Trimming Strategy | |
| **ADAPTIVE EDGE-ONLY TRIMMING** - preserves natural periodicity: | |
| ```python | |
| def extract_sound_with_edges_trimmed(audio, regions, min_silence_to_trim_ms=100, buffer_ratio=0.1): | |
| """ | |
| Trim ONLY leftmost and rightmost silence IF significant. | |
| Preserves ALL internal structure (perfect for periodic sounds). | |
| """ | |
| leading_silence_ms = regions[0][0] # Time before first sound | |
| trailing_silence_ms = len(audio) - regions[-1][1] # Time after last sound | |
| # Only trim if silence >= 100ms | |
| if leading_silence_ms >= min_silence_to_trim_ms: | |
        buffer_ms = max(200, int(leading_silence_ms * buffer_ratio))  # Buffer: 10% of the silence, at least 200ms
| trim_start_ms = max(0, regions[0][0] - buffer_ms) | |
| else: | |
| trim_start_ms = 0 # Keep from start | |
| # Similar for trailing silence | |
| ... | |
| return audio[trim_start_ms:trim_end_ms] | |
| ``` | |
| **Why Edge-Only?** | |
| - Clock ticks, footsteps, typing have periodic silence between sounds | |
| - Removing internal silences destroys natural rhythm | |
| - Edge trimming removes irrelevant silence while preserving periodicity | |
| ### Output Files | |
| ``` | |
ESC-50_preprocessed/
├── effective_durations.csv
│   ├── filename
│   ├── category
│   ├── raw_duration_s (original 5.0s)
│   ├── final_duration_s (after edge trimming)
│   ├── effective_duration_s (sum of sound regions)
│   ├── num_sound_regions
│   ├── peak_amplitude_db
│   ├── avg_rms_db
│   └── threshold_strategy, noise_floor_percentile, noise_floor_delta_db
└── trimmed_audio/
    ├── 1-100032-A-0.wav (edge-trimmed clips)
    └── ...
| ``` | |
| ### Running Preprocessing | |
| ```bash | |
| # Using config defaults | |
| python preprocess_esc50.py --config config.yaml | |
| # Override parameters | |
| python preprocess_esc50.py --config config.yaml \ | |
| --threshold-strategy noise_floor \ | |
| --noise-floor-percentile 2.0 \ | |
| --noise-floor-delta-db 5.0 \ | |
| --min-sound-ms 25 | |
| # Don't save trimmed audio (only CSV) | |
| python preprocess_esc50.py --config config.yaml --no-trimmed-audio | |
| ``` | |
| ### Preprocessing Statistics Example | |
| ``` | |
| ESC-50 Preprocessing Summary | |
| ============================================================ | |
| Total clips processed: 2000 | |
| Successfully processed: 2000 | |
| Raw duration statistics: | |
| Mean: 5.000s Std: 0.000s Min: 5.000s Max: 5.000s | |
| Final duration statistics (edges trimmed): | |
| Mean: 4.723s Std: 0.412s Min: 2.134s Max: 5.000s | |
| Effective duration statistics (sum of sound regions): | |
| Mean: 3.856s Std: 0.823s Min: 0.542s Max: 4.982s | |
| Comparison: | |
| Avg effective: 3.856s | |
| Avg final: 4.723s | |
| Difference: 0.867s (internal silences preserved) | |
| Average edge trimming reduction: 5.5% | |
| ``` | |
| ### How Duration Task Uses Preprocessed Data | |
| The `DurationTaskGenerator` loads preprocessed data: | |
| ```python | |
| self.preprocessed_dataset = PreprocessedESC50Dataset( | |
| metadata_csv=config['tasks']['duration']['preprocessed_data_path'] + '/effective_durations.csv', | |
| audio_dir=config['tasks']['duration']['preprocessed_data_path'] + '/trimmed_audio' | |
| ) | |
| # Calculate average effective duration for slot distribution | |
| effective_durations = self.preprocessed_dataset.metadata_df['effective_duration_s'] | |
| self.avg_effective_duration = effective_durations.mean() # ~3.856s | |
| ``` | |
| --- | |
| ## Audio Utilities | |
| Located in `utils/audio_utils.py`. | |
### `generate_single_clip_duration(min_duration, max_duration) → float`
| **Purpose**: Generate a random target clip duration using UNIFORM sampling. | |
| **Implementation**: | |
| ```python | |
| def generate_single_clip_duration(min_duration: float, max_duration: float) -> float: | |
| return random.uniform(min_duration, max_duration) | |
| ``` | |
| **Mathematical Formulation**: | |
| $$d \sim \text{Uniform}(d_{\min}, d_{\max})$$ | |
| With default values (20s, 60s): | |
| - Mean: $\mu = \frac{20 + 60}{2} = 40$ seconds | |
| - Standard Deviation: $\sigma = \frac{60 - 20}{\sqrt{12}} \approx 11.5$ seconds | |
| --- | |
### `get_max_clip_num_to_be_joined(target_duration_s, source_duration_s, min_silence_ms) → Tuple[int, float]`
| **Purpose**: Calculate maximum number of source clips that can fit in target duration. | |
| **Returns**: Tuple of (max_clips, remainder_seconds) | |
| **Implementation** (conceptual): | |
| ```python | |
def get_max_clip_num_to_be_joined(target_s, source_s, min_silence_ms):
    silence_s = min_silence_ms / 1000.0
    # Treat each clip as occupying (clip + gap); adding one gap to the target
    # compensates for there being no gap after the final clip
    effective_unit = source_s + silence_s
    max_clips = int((target_s + silence_s) / effective_unit)
    remainder = target_s - (max_clips * source_s + (max_clips - 1) * silence_s)
    return max_clips, remainder
| ``` | |
| **Mathematical Formula**: | |
| $$N_{\max} = \left\lfloor \frac{T + g}{S + g} \right\rfloor$$ | |
| Where: | |
| - $T$ = target duration (seconds) | |
| - $S$ = source clip duration (5.0s for ESC-50) | |
| - $g$ = minimum silence gap (seconds) | |
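Plugging in illustrative numbers, using the conceptual implementation above:

```python
# T = 45.0s, S = 5.0s, g = 0.1s (100ms)
# N_max = floor((45.0 + 0.1) / (5.0 + 0.1)) = floor(8.84...) = 8
# remainder = 45.0 - (8 * 5.0 + 7 * 0.1) = 4.3s of distributable extra time
max_clips, remainder = get_max_clip_num_to_be_joined(45.0, 5.0, 100)
assert max_clips == 8
assert abs(remainder - 4.3) < 1e-9
```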
| --- | |
| ### `build_count_task_audio(source_audios, source_categories, target_duration, ...)` | |
| **Purpose**: Build the final audio for COUNT task. | |
| **Parameters**: | |
| - `source_audios`: List of AudioSegment objects (one per category) | |
| - `source_categories`: List of category names | |
| - `target_duration`: Target total duration in seconds | |
| - `ordering_mode`: "random" or "consecutive" | |
| - `source_clip_duration_seconds`: Duration of each source clip | |
| - `min_silence_ms`, `max_extra_silence_per_gap_ms`: Silence parameters | |
| **Returns**: Tuple of (final_audio, clip_sequence, build_metadata) | |
| --- | |
| ### `build_duration_task_audio(...)` | |
| **Purpose**: Build audio for DURATION task with slot distribution. | |
| --- | |
| ### `build_clip_sequence_with_silences(clips, target_duration_s, min_silence_ms, max_extra_silence_per_gap_ms, crossfade_ms)` | |
| **Purpose**: Concatenate clips with random silence gaps and smooth crossfades. | |
| **Algorithm**: | |
| 1. Calculate total audio content duration | |
| 2. Calculate minimum required silence: `(n_clips - 1) Γ min_silence_ms` | |
| 3. Calculate available extra time: `target_duration - total_audio - min_silence` | |
| 4. Distribute extra time randomly across gaps (up to `max_extra_silence_per_gap_ms` per gap) | |
| 5. Build sequence with crossfades: | |
   - Audio → Silence: crossfade for smooth transition
   - Silence → Audio: no crossfade (preserves audio start)
| **Crossfade Benefits**: | |
| - Smooth transitions between audio and silence | |
| - Reduces clicks/pops at audio boundaries | |
| - Preserves natural sound attack (no crossfade at audio start) | |
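A minimal sketch of this joining scheme, assuming `pydub`; the function name and budget handling are illustrative. Note that pydub's `append` overlaps segments by the crossfade length, which the real implementation must account for when hitting an exact target duration:

```python
import random
from pydub import AudioSegment

def join_with_silences(clips, target_s, min_sil_ms=100, max_extra_ms=500, fade_ms=500):
    """Illustrative re-implementation of the algorithm above."""
    total_audio_ms = sum(len(c) for c in clips)
    n_gaps = len(clips) - 1
    # Extra time left after audio content and mandatory minimum gaps
    extra_budget = max(0, int(target_s * 1000) - total_audio_ms - n_gaps * min_sil_ms)
    out = clips[0]
    for clip in clips[1:]:
        extra = random.randint(0, min(max_extra_ms, extra_budget))
        extra_budget -= extra
        silence = AudioSegment.silent(duration=min_sil_ms + extra)
        # Audio → silence: crossfade (bounded by segment lengths to stay valid)
        out = out.append(silence, crossfade=min(fade_ms, len(silence), len(out)))
        # Silence → audio: hard join, preserving the clip's natural attack
        out = out + clip
    return out
```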
| --- | |
| ## Task: COUNT | |
| **File**: `tasks/task_count.py` | |
| **Class**: `CountTaskGenerator` | |
| ### Complete Flow | |
| ``` | |
CountTaskGenerator.__init__(config, logger)
        ↓
Initialize:
- ESC50Dataset (loads metadata, tracks category usage)
- AudioProcessor
- QuestionGenerator
- LLMQuestionGenerator (if enabled)
        ↓
generate_dataset()
        ↓
1. num_samples = calculate_num_samples_for_task(task_duration_hours, min, max)
2. Create balanced_answers list from num_clips_per_sample
3. Shuffle balanced_answers
4. For each sample:
     generate_sample(sample_id, target_unique_count=balanced_answers[i])
5. Save CSVs
| ``` | |
| ### Key Method: `generate_sample(sample_id, target_unique_count)` | |
| **Pipeline**: | |
| 1. Generate random target duration: `clip_duration_seconds = generate_single_clip_duration(min, max)` | |
| 2. Calculate max clips: `max_clips, remainder = get_max_clip_num_to_be_joined(...)` | |
| 3. Cap `n_unique_audios` at min(target_unique_count, max_clips, 50) | |
| 4. Select categories: `selected_categories = dataset.get_least_used_categories(n_unique_audios)` | |
| 5. Track usage: Increment `category_usage_counts` for each selected category | |
| 6. Sample one file per category: `dataset.sample_file_from_category(category)` | |
| 7. Load source audios | |
| 8. Build final audio: `build_count_task_audio(source_audios, categories, target_duration, ordering_mode, ...)` | |
| 9. Export audio file | |
| 10. Generate MCQ and open-text questions | |
| 11. Return metadata dict | |
| ### Balanced Answer Distribution (Updated with max_clips_per_sample) | |
| ```python | |
| # In generate_dataset() | |
| max_clips_per_sample = self.task_config.get('max_clips_per_sample', 10) # Single number: 10 | |
| possible_answers = list(range(1, max_clips_per_sample + 1)) # [1, 2, 3, ..., 10] | |
| samples_per_answer = num_samples // len(possible_answers) | |
| remainder = num_samples % len(possible_answers) | |
| balanced_answers = [] | |
| for answer in possible_answers: | |
| count = samples_per_answer + (1 if remainder > 0 else 0) | |
| balanced_answers.extend([answer] * count) | |
| remainder = max(0, remainder - 1) | |
| random.shuffle(balanced_answers) | |
| ``` | |
| **For 90 samples, max_clips_per_sample=10**: Each answer (1-10) appears exactly 9 times. | |
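A quick sanity check on the pool built above (assuming the snippet has just run with `num_samples = 90`):

```python
from collections import Counter

# Each answer 1..10 should appear exactly 90 // 10 = 9 times
assert Counter(balanced_answers) == Counter({a: 9 for a in range(1, 11)})
```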
| ### Silence Reduction Strategy (NEW) | |
| Each sample's target answer is capped at what actually fits in the duration: | |
| ```python | |
| # In generate_sample() | |
| max_clips, _ = get_max_clip_num_to_be_joined(clip_duration_seconds, source_clip_duration, min_silence_ms) | |
| if target_unique_count is not None: | |
| # Cap target at what actually fits (reduces silence) | |
| n_unique_audios = min(target_unique_count, max_clips, len(CATEGORIES)) | |
| ``` | |
| **Example**: | |
| - Target answer from balanced pool: **8 unique sounds** | |
| - Duration allows: **max_clips = 7** | |
- Actual n_unique_audios: **min(8, 7) = 7** ✓ (uses max possible, reduces silence)
| **Why?** Prevents excessive silence when target exceeds what fits in duration. | |
| --- | |
| ## Task: DURATION | |
| **File**: `tasks/task_duration.py` | |
| **Class**: `DurationTaskGenerator` | |
| ### Complete Flow | |
| ``` | |
DurationTaskGenerator.__init__(config, logger)
        ↓
Initialize:
- PreprocessedESC50Dataset (uses effective_durations.csv)
- Calculate avg_effective_duration from preprocessed data
- AudioProcessor, QuestionGenerator
- Load multiplier_longest, multiplier_shortest from config
        ↓
generate_dataset()
        ↓
1. num_samples = calculate_num_samples_for_task(...)
2. Create balanced question types: ["longest"] * 45 + ["shortest"] * 45
3. Shuffle balanced_types
4. While len(samples) < num_samples:
     generate_sample(sample_idx, question_type=balanced_types[idx])
     If returns None → increment rejection_count, continue
5. Save CSVs
| ``` | |
| ### Key Methods | |
| #### `_calculate_max_clips_and_sources(target_duration_s, question_type)` | |
| **Purpose**: Determine valid number of sources based on question type and duration. | |
| **For LONGEST**: | |
- Target needs ≥2 clips to beat backgrounds by 1.5x
- `min_valid_sources = 2`
- `max_valid_sources = max_clips - 2 + 1`
**For SHORTEST**:
- Target gets 1 clip
- Each background needs ≥2 clips so its total duration clearly exceeds the target's single clip
- `max_valid_sources = 1 + (max_clips - 1) // 2`
| ```python | |
| # Filter config values to valid range, then pick RANDOMLY | |
| valid_config_sources = [n for n in num_sources_config if min_valid <= n <= max_valid] | |
| n_sources = random.choice(valid_config_sources) | |
| ``` | |
| #### `_try_generate_sample(sample_id, question_type)` | |
| **Full Algorithm**: | |
| 1. Generate target duration: `generate_single_clip_duration(min, max)` | |
| 2. Calculate max_clips and n_sources: `_calculate_max_clips_and_sources(...)` | |
| 3. Select target category (least used) | |
| 4. Select background categories (from remaining least used) | |
| 5. Calculate slot distribution based on question_type | |
| 6. For each category, select source files and generate clip durations | |
| 7. Load and trim clips | |
| 8. Calculate total effective duration per category | |
| 9. Verify gap constraint | |
| 10. If gap not satisfied, try `_try_improve_slot_distribution()` | |
| 11. If still not satisfied, return None (triggers retry) | |
| 12. Build audio and generate questions | |
| 13. Return metadata | |
| #### `_try_improve_slot_distribution(slot_distribution, durations, question_type, max_clips)` | |
| **Purpose**: Redistribute slots to satisfy gap constraint. | |
| --- | |
| ## Task: ORDER | |
| **File**: `tasks/task_order.py` | |
| **Class**: `OrderTaskGenerator` | |
| ### Complete Flow | |
| ``` | |
OrderTaskGenerator.__init__(config, logger)
        ↓
Initialize ESC50Dataset, AudioProcessor, QuestionGenerator
        ↓
generate_dataset()
        ↓
1. Generate sample durations upfront (exact fill)
2. num_samples = len(sample_durations)
3. Create balanced question_types distribution
4. For each sample:
     generate_sample(sample_id, target_question_type=balanced_types[i])
     → n_clips randomly selected from [max(2, max_clips-3), min(max_clips, max_clips_per_sample)]
5. Save CSVs
| ``` | |
| ### Key Method: `_get_valid_question_types(n_clips)` | |
| Filters question types based on clip count: | |
- `second`, `second_last`: require `n_clips >= min_clips_for_second_questions` (default: 3)
| - `after`, `before`: require `n_clips >= 2` | |
| - `first`, `last`: always valid | |
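A sketch of what this filter might look like (the final intersection with the configured `question_types` is an assumption, and the config default `min_clips_for_second_questions = 3` is used):

```python
def _get_valid_question_types(self, n_clips: int) -> list:
    valid = ["first", "last"]                  # always valid
    if n_clips >= 2:
        valid += ["after", "before"]           # need a neighbor to reference
    if n_clips >= self.min_clips_for_second_questions:
        valid += ["second", "second_last"]     # need enough clips for a distinct answer
    return [q for q in valid if q in self.question_types]
```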
| ### Key Method: `generate_sample(sample_id, target_question_type, target_duration_seconds)` | |
| **Algorithm**: | |
| 1. Use pre-generated `target_duration_seconds` (from sample_durations) | |
| 2. Calculate max_clips from duration: `get_max_clip_num_to_be_joined(...)` | |
| 3. **Silence reduction - randomly select n_clips**: | |
| ```python | |
| min_clips = max(2, max_clips - 3) | |
| max_clips_allowed = min(max_clips, max_clips_per_sample, len(CATEGORIES)) | |
| if min_clips > max_clips_allowed: # Handle edge case | |
| min_clips = max_clips_allowed | |
| n_clips = random.randint(min_clips, max_clips_allowed) | |
| ``` | |
| 4. Get valid question types for n_clips | |
5. Select answer position based on question type (see the sketch below):
   - `first` → position 0
   - `last` → position n_clips - 1
   - `second` → position 1
   - `second_last` → position n_clips - 2
   - `after` → random position 1 to n-1
   - `before` → random position 0 to n-2
| 6. Select categories using least-used balancing (answer first, then others) | |
| 7. Build audio with `build_clip_sequence_with_silences` (includes crossfade) | |
| 8. Generate questions including sequence question | |
| 9. Return metadata | |
| **Silence Reduction**: Target n_clips is capped at `max_clips` to avoid excessive silence. | |
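The position rules in step 5 reduce to a small mapping; an illustrative sketch:

```python
import random

def answer_position(question_type: str, n_clips: int) -> int:
    if question_type == "first":
        return 0
    if question_type == "last":
        return n_clips - 1
    if question_type == "second":
        return 1
    if question_type == "second_last":
        return n_clips - 2
    if question_type == "after":
        return random.randint(1, n_clips - 1)   # answer follows some clip
    if question_type == "before":
        return random.randint(0, n_clips - 2)   # answer precedes some clip
    raise ValueError(f"unknown question type: {question_type}")
```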
| --- | |
| ## Task: VOLUME | |
| **File**: `tasks/task_volume.py` | |
| **Class**: `VolumeTaskGenerator` | |
| ### Complete Flow | |
| ``` | |
VolumeTaskGenerator.__init__(config, logger)
        ↓
Initialize ESC50Dataset, AudioProcessor, QuestionGenerator
Load multiplier_max_loudness, multiplier_min_loudness, baseline normalization settings
        ↓
generate_dataset()
        ↓
1. Generate sample durations upfront (exact fill)
2. num_samples = len(sample_durations)
3. Create balanced question_types: ["max_loudness"] * N/2 + ["min_loudness"] * N/2
4. Shuffle balanced_types
5. For each sample:
     generate_sample(sample_id, target_question_type=balanced_types[i])
     → n_clips randomly sampled from [max(2, max_clips-3), min(max_clips, max_clips_per_sample)]
     → Normalizes clips to baseline, applies volume adjustments
     → Verifies gap constraints (up to 10 attempts)
6. Save CSVs
| ``` | |
| ### Key Methods | |
| #### `_normalize_to_baseline(audio)` | |
| ```python | |
| def _normalize_to_baseline(self, audio): | |
| if not self.normalize_to_baseline: | |
| return audio | |
| change_in_dBFS = self.baseline_dBFS - audio.dBFS | |
| return audio.apply_gain(change_in_dBFS) | |
| ``` | |
| #### `_verify_loudness_gap(volume_levels, question_type)` | |
| **For MAX_LOUDNESS**: | |
| ```python | |
required_gap_dB = 20 * math.log10(self.multiplier_max_loudness)  # ≈ 12.04 dB for multiplier 4.0
| actual_gap_dB = max_level - second_max | |
| gap_satisfied = actual_gap_dB >= required_gap_dB | |
| ``` | |
| **For MIN_LOUDNESS**: | |
| ```python | |
required_gap_dB = abs(20 * math.log10(self.multiplier_min_loudness))  # ≈ 12.04 dB for multiplier 0.25
| actual_gap_dB = second_min - min_level | |
| gap_satisfied = actual_gap_dB >= required_gap_dB | |
| ``` | |
| #### Volume Level Generation | |
| Volume levels are generated to satisfy gap constraints: | |
| - For `max_loudness`: target gets +gap_dB above baseline, backgrounds at/below baseline | |
| - For `min_loudness`: target gets -gap_dB below baseline, backgrounds at/above baseline | |
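One way to realize this is sketched below; the function name, the target's placement at index 0, and the background jitter range are assumptions, not the generator's actual sampling strategy:

```python
import math
import random

def generate_volume_levels(n_clips: int, question_type: str,
                           multiplier: float = 4.0) -> list:
    """Return per-clip gains in dB relative to the baseline, with the
    target (index 0) guaranteed to satisfy the gap constraint."""
    gap_db = abs(20 * math.log10(multiplier))  # ≈ 12.04 dB for 4.0 or 0.25
    # Backgrounds jitter around the baseline
    background = [random.uniform(-3.0, 3.0) for _ in range(n_clips - 1)]
    if question_type == "max_loudness":
        target = max(background) + gap_db + random.uniform(0.0, 2.0)
    else:  # "min_loudness"
        target = min(background) - gap_db - random.uniform(0.0, 2.0)
    return [target] + background
```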
| --- | |
| ## Deterministic Balancing Mechanisms | |
| ### Overview | |
| The pipeline ensures balanced distributions across multiple dimensions with **capacity-aware assignment**. | |
| ### 1. Capacity-Aware Answer Balancing (COUNT Task) | |
| Each possible answer (1-10) appears equally often, but **higher targets are assigned to samples with higher capacity**. | |
| ```python | |
| # Calculate capacity for each sample | |
| for duration in sample_durations: | |
| max_clips, _ = get_max_clip_num_to_be_joined(duration, source_clip_duration, min_silence_ms) | |
| max_for_sample = min(max_clips, max_clips_per_sample, len(CATEGORIES)) | |
| sample_max_clips.append(max_for_sample) | |
| # Create balanced pool | |
| possible_answers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | |
| samples_per_answer = num_samples // len(possible_answers) | |
| remainder = num_samples % len(possible_answers) | |
| assignment_pool = [] | |
| for answer in possible_answers: | |
| count = samples_per_answer + (1 if remainder > 0 else 0) | |
| assignment_pool.extend([answer] * count) | |
| remainder = max(0, remainder - 1) | |
| # Sort samples by capacity (descending) | |
| sample_info.sort(key=lambda x: x[2], reverse=True) | |
| # Sort pool descending - assign high targets first | |
| assignment_pool.sort(reverse=True) | |
| # Assign targets, clamped to capacity | |
| for idx, (sample_idx, duration, capacity) in enumerate(sample_info): | |
| target = min(assignment_pool[idx], capacity) | |
| balanced_assignments[sample_idx] = target | |
| ``` | |
| **Guarantee**: Each answer value appears equally, and high targets go to samples that can fit them. | |
| ### 2. Capacity-Aware Question Type Balancing (ORDER Task) | |
| ORDER task uses **capacity-aware balancing** - advanced question types assigned to high-capacity samples. | |
| ```python | |
| # Separate question types by requirements | |
| basic_types = ['first', 'last', 'after', 'before'] # Need >= 2 clips | |
| advanced_types = ['second', 'second_last'] # Need >= min_clips_for_second (e.g., 3) | |
| # Sort samples by capacity (descending) | |
| sample_info.sort(key=lambda x: x[2], reverse=True) | |
| # Build assignment pool - advanced types first | |
| samples_per_type = num_samples // len(question_types) | |
| remainder = num_samples % len(question_types) | |
| assignment_pool = [] | |
| # Add advanced types first (for high-capacity samples) | |
| for qtype in advanced_types: | |
| count = samples_per_type + (1 if remainder > 0 else 0) | |
| assignment_pool.extend([qtype] * count) | |
| remainder = max(0, remainder - 1) | |
| # Then basic types | |
| for qtype in basic_types: | |
| count = samples_per_type + (1 if remainder > 0 else 0) | |
| assignment_pool.extend([qtype] * count) | |
| remainder = max(0, remainder - 1) | |
| # Assign with validation | |
| for idx, (sample_idx, duration, capacity) in enumerate(sample_info): | |
| target_qtype = assignment_pool[idx] | |
| valid_types = _get_valid_question_types(capacity) | |
| if target_qtype not in valid_types: | |
| # Downgrade to valid type | |
| target_qtype = random.choice(valid_types) | |
| balanced_assignments[sample_idx] = target_qtype | |
| ``` | |
| ### 3. Simple Question Type Balancing (DURATION, VOLUME Tasks) | |
| ```python | |
| # DURATION: 2 types β N/2 each | |
| # VOLUME: 2 types β N/2 each | |
| samples_per_type = num_samples // len(question_types) | |
| remainder = num_samples % len(question_types) | |
| balanced_types = [] | |
| for qtype in question_types: | |
| count = samples_per_type + (1 if remainder > 0 else 0) | |
| balanced_types.extend([qtype] * count) | |
| remainder = max(0, remainder - 1) | |
| random.shuffle(balanced_types) | |
| ``` | |
| ### 4. Category Usage Balancing | |
| All 50 ESC-50 categories are used equally via least-used selection: | |
| ```python | |
| def get_least_used_categories(self, n: int, exclude: List[str] = None) -> List[str]: | |
| # Sort categories by usage count | |
| sorted_cats = sorted( | |
| self.category_usage_counts.items(), | |
| key=lambda x: (x[1], x[0]) # Sort by count, then alphabetically for ties | |
| ) | |
| # Filter excluded and return first n | |
| available = [cat for cat, _ in sorted_cats if cat not in (exclude or [])] | |
| return available[:n] | |
| ``` | |
| Each task calls `reset_category_usage()` at the start to ensure independent balancing. | |
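Typical usage during sample generation might look like the following; this is a hypothetical call pattern, and the direct increment of `category_usage_counts` is illustrative (the real code may wrap it in a method):

```python
n_clips = 4  # e.g., clips needed for this sample

# Pick the answer category first, then the remaining least-used categories
answer_cat = dataset.get_least_used_categories(1)[0]
other_cats = dataset.get_least_used_categories(n_clips - 1, exclude=[answer_cat])
for cat in [answer_cat] + other_cats:
    dataset.category_usage_counts[cat] += 1  # record usage for future balancing
```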
| ### 5. N_Clips Selection Strategy | |
| **COUNT Task**: Uses capacity-aware answer balancing (see #1 above) | |
| **ORDER and VOLUME Tasks**: Use **silence reduction strategy** (NOT balanced): | |
| ```python | |
| # Randomly sample n_clips from valid range to minimize silence | |
| min_clips = max(2, max_clips - 3) | |
| max_clips_allowed = min(max_clips, max_clips_per_sample, len(CATEGORIES)) | |
| if min_clips > max_clips_allowed: | |
| min_clips = max_clips_allowed # Handle edge case | |
| n_clips = random.randint(min_clips, max_clips_allowed) | |
| ``` | |
| This maximizes clip usage within the allowed range, minimizing excessive silence. | |
| --- | |
| ## Rejection Logic and Retry Mechanisms | |
| ### When Samples Are Rejected | |
| Rejections occur only in tasks with gap constraints: | |
| 1. **DURATION Task**: Gap constraint not satisfied | |
   - LONGEST: target_duration < max_background × 1.5
   - SHORTEST: target_duration > min_background × 0.75
| 2. **VOLUME Task**: Gap constraint not satisfied | |
   - MAX_LOUDNESS: actual_gap_dB < required_gap_dB (≈12.04 dB for multiplier 4.0)
   - MIN_LOUDNESS: actual_gap_dB < required_gap_dB (≈12.04 dB for multiplier 0.25)
| ### DURATION Task Retry Logic | |
| ```python | |
| def generate_dataset(self): | |
| all_metadata = [] | |
| sample_idx = 0 | |
| type_idx = 0 | |
| while len(all_metadata) < num_samples and type_idx < len(balanced_types) * 2: | |
| question_type = balanced_types[type_idx % len(balanced_types)] | |
| metadata = self.generate_sample(sample_idx, question_type) | |
| if metadata is not None: | |
| all_metadata.append(metadata) | |
| sample_idx += 1 | |
| # If None, sample was rejected - just move to next | |
| type_idx += 1 | |
| ``` | |
| ### Rejection Rate Calculation | |
| $$\text{Rejection Rate} = \frac{\text{rejections}}{\text{rejections} + \text{successes}} \times 100\%$$ | |
| --- | |
| ## Complete Task Creation Explanation | |
| ### How Each Task Is Generated (Step-by-Step) | |
| #### COUNT TASK - "How many unique sounds?" | |
| **Goal**: Create audio with N unique sound sources, ask how many distinct sounds exist. | |
| **Process**: | |
| 1. **Preprocessing**: None (uses raw ESC-50 clips) | |
| 2. **Duration Generation**: `target_duration ~ Uniform(20s, 60s)` per sample | |
| 3. **Calculate Max Clips**: `max_clips = get_max_clip_num_to_be_joined(target_duration, 5s, 100ms)` | |
   - Example: 45s duration → ~8 clips of 5s each with 100ms silence between
| 4. **Balanced Answer Selection**: Pre-generated pool of answers [1,2,3,...,10] balanced equally | |
| - Target answer (e.g., 5 unique sounds) selected from pool | |
| 5. **Silence Reduction**: Cap target at `min(target_answer, max_clips)` | |
   - If target=8 but max_clips=6 → use 6 (prevents excessive silence)
| 6. **Category Selection**: Pick N least-used categories from ESC-50 (balancing) | |
| 7. **Audio Construction**: | |
| - Load one file per category | |
| - Calculate repetitions needed: `total_clips = max_clips` | |
| - Distribute repetitions across N sources | |
| - **Ordering mode**: | |
| - `random`: Shuffle clips (A B A C B...) - harder, tests recognition | |
| - `consecutive`: Group same-source (AAA BBB CCC) - easier | |
| 8. **Silence Insertion**: | |
| - Minimum 100ms silence between EVERY clip | |
| - Extra silence (up to 500ms per gap) distributed from remainder | |
| - **Crossfade**: 50ms within same-source, 500ms at audio-silence boundaries | |
| 9. **Question Generation**: MCQ + open-text asking "How many unique sounds?" | |
| 10. **Export**: Save audio WAV + metadata | |
| **Example**: | |
| - Target duration: 40s | |
- Max clips that fit: 7 clips (7×5s + 6×0.1s = 35.6s)
| - Target answer: 3 unique sounds | |
| - Actual: 3 unique sounds (7 total clips: 3+2+2 repetitions) | |
- Ordering: Random shuffle → [A B A C B A C]
| - Result: Audio with 3 distinct sounds, some repeated, with silences and crossfades | |
| #### DURATION TASK - "Which sound is longest/shortest?" | |
| **Goal**: Create audio where one sound has clearly longest/shortest duration compared to others. | |
| **Process**: | |
| 1. **Preprocessing** (preprocess_esc50.py - REQUIRED): | |
| - Load raw ESC-50 clips | |
| - Detect sound regions using adaptive noise-floor thresholding | |
| - Trim leading/trailing silence (preserve internal structure) | |
| - Calculate effective duration per clip | |
| - Save trimmed audio + effective_durations.csv | |
| 2. **Duration Generation**: `target_duration ~ Uniform(20s, 60s)` per sample | |
| 3. **Calculate Max Clips**: Based on average effective duration (~3.86s) | |
| 4. **Determine N Sources**: Based on question type and max_clips | |
   - **LONGEST**: Target needs ≥2 clips, backgrounds get 1 each → `n_sources ≤ max_clips - 1`
   - **SHORTEST**: Target gets 1 clip, backgrounds need ≥2 each → `n_sources ≤ 1 + (max_clips-1)//2`
| 5. **Category Selection**: Pick target + backgrounds from least-used categories | |
| 6. **Slot Distribution**: Allocate clips to each source | |
| - LONGEST: Give most clips to target, 1 to each background | |
| - SHORTEST: Give 1 to target, multiple to each background | |
| 7. **Clip Selection**: For each source, select clips from preprocessed dataset | |
| 8. **Gap Verification**: | |
   - LONGEST: `target_duration ≥ max_background × 1.5` ✓
   - SHORTEST: `target_duration ≤ min_background × 0.75` ✓
| - If gap not satisfied: Try redistributing slots, or reject sample | |
| 9. **Audio Construction**: | |
| - Load trimmed clips | |
| - Concatenate with consecutive ordering (preserve periodicity) | |
| - Insert silences with crossfades | |
| 10. **Question Generation**: "Which sound is longest/shortest?" | |
| 11. **Export**: Audio + metadata | |
| **Example**: | |
| - Question type: LONGEST | |
| - Target duration: 50s, max_clips: 12 | |
| - N sources: 4 (target + 3 backgrounds) | |
- Slot distribution: Target=6 clips (6×3.8s=22.8s), Backgrounds=2 clips each (2×3.8s=7.6s)
- Gap check: 22.8s ≥ 7.6s × 1.5 = 11.4s ✓
| - Result: Target sound clearly longest | |
| #### ORDER TASK - "Which sound is first/last/after X?" | |
| **Goal**: Create ordered sequence of sounds, ask about temporal relationships. | |
| **Process**: | |
| 1. **Preprocessing**: None (uses raw ESC-50) | |
| 2. **Duration Generation**: Pre-generated durations to exactly fill task duration | |
| 3. **Calculate Max Clips**: `get_max_clip_num_to_be_joined(target_duration, 5s, 100ms)` | |
4. **N_Clips Selection (Silence Reduction)**: n_clips sampled randomly from `[max(2, max_clips-3), min(max_clips, max_clips_per_sample)]`
   - Keeps the clip count close to what the duration can hold, minimizing filler silence
| 5. **Question Type Selection**: From balanced pool (first, last, second, after, before, second_last) | |
| 6. **Answer Position Determination**: Based on question type | |
   - `first` → position 0
   - `last` → position n_clips-1
   - `second` → position 1
   - `second_last` → position n_clips-2
   - `after`/`before` → random valid position
| 7. **Category Selection**: Answer category at determined position, others from least-used | |
| 8. **Audio Construction**: | |
| - Load one clip per position | |
| - Build sequence with silences (min 100ms + random extra up to 500ms per gap) | |
| - **Crossfade**: 500ms at audio-silence boundaries for smooth transitions | |
| 9. **Question Generation**: | |
| - MCQ: "Which sound is first?" with 4 options | |
| - Open-text: "What is the first sound?" + full sequence | |
| 10. **Export**: Audio + metadata | |
| **Example**: | |
- max_clips: 6 → n_clips sampled from [3, 6] → 4 clips used ✓
| - Question: "Which sound is second?" | |
| - Answer position: 1 (0-indexed) | |
| - Sequence: [dog, cat, bird, rain] β Answer: cat | |
| - Audio: 4 clips in order with silences and crossfades | |
| #### VOLUME TASK - "Which sound is loudest/softest?" | |
| **Goal**: Create audio with clips at different volume levels, ask about loudness comparison. | |
| **Process**: | |
| 1. **Preprocessing**: None (uses raw ESC-50) | |
| 2. **Duration Generation**: Pre-generated durations | |
| 3. **Calculate Max Clips**: `get_max_clip_num_to_be_joined(...)` | |
4. **N_Clips Selection (Silence Reduction)**: n_clips sampled randomly from `[max(2, max_clips-3), min(max_clips, max_clips_per_sample)]`
| 5. **Question Type Selection**: "max_loudness" or "min_loudness" (balanced 50/50) | |
| 6. **Volume Level Generation**: Create n_clips volume adjustments (in dB) | |
| - Ensure gap constraint (multiplier 4.0 for max, 0.25 for min) | |
   - Example: [+13dB, 0dB, -6dB] → max at +13dB clears the ≈12dB required gap over the second-loudest
| 7. **Gap Verification** (up to 10 attempts): | |
   - MAX: `max_level - second_max ≥ 20×log10(4.0) ≈ 12.04dB`
   - MIN: `second_min - min_level ≥ |20×log10(0.25)| ≈ 12.04dB`
| - If not satisfied: Regenerate levels or reject | |
| 8. **Category Selection**: Answer at determined position, others from least-used | |
| 9. **Audio Construction**: | |
| - Load clips | |
   - **CRITICAL: Normalize all to baseline (-20 dBFS)** → ensures controlled comparison
| - Apply volume adjustments to normalized clips | |
| - Concatenate with silences and crossfades | |
| 10. **Question Generation**: "Which sound has maximum/minimum loudness?" | |
| 11. **Export**: Audio + metadata with volume levels | |
| **Example**: | |
- max_clips: 6 → n_clips sampled from [3, 6] → 3 clips used ✓
| - Question: "max_loudness" | |
- Volume levels: [+13dB, 0dB, -6dB]
- Gap check: 13 - 0 = 13dB ≥ 12.04dB ✓
- Process: Normalize all clips to -20dBFS, then adjust to [-7dBFS, -20dBFS, -26dBFS]
| - Result: First sound clearly loudest | |
| ### Key Innovations | |
| 1. **Crossfade Everywhere**: Smooth transitions at audio-silence boundaries (500ms), small crossfade within same-source repetitions (50ms) | |
| 2. **Adaptive Preprocessing**: Noise-floor thresholding adapts per-clip (duration task) | |
| 3. **Silence Reduction**: ORDER/VOLUME tasks sample n_clips from [max_clips-3, max_clips_per_sample] to minimize silence | |
| 4. **Balanced Distribution**: | |
| - **COUNT**: Balances answers (1 to max_clips_per_sample) + question types | |
| - **ORDER/VOLUME**: Balances question types only (n_clips uses silence reduction) | |
| 5. **Category Balancing**: Least-used selection ensures all 50 ESC-50 categories used evenly | |
| 6. **Gap Constraints**: Mathematical guarantees for duration/volume comparisons | |
| 7. **Exact Duration Filling**: Pre-generate sample durations to exactly fill task duration (no wasted time) | |
| --- | |
| ## Command-Line Arguments | |
| ### Main Pipeline (`main.py`) | |
| ```bash | |
| python main.py [OPTIONS] | |
| Options: | |
| --config, -c PATH Path to config YAML (default: config.yaml) | |
| --tasks, -t TASKS Specific tasks to run (choices: count, duration, order, volume) | |
| --output, -o PATH Custom output directory (overrides config) | |
| Examples: | |
| # Run all enabled tasks with default config | |
| python main.py | |
| # Run specific tasks only | |
| python main.py --tasks count order | |
| # Use custom config and output | |
| python main.py --config my_config.yaml --output ./my_dataset | |
| ``` | |
| ### Preprocessing Script (`preprocess_esc50.py`) | |
| ```bash | |
| python preprocess_esc50.py [OPTIONS] | |
| Options: | |
| --config PATH Path to config YAML (default: config.yaml) | |
| --threshold-strategy STRATEGY "noise_floor" or "peak_relative" | |
| --threshold-db FLOAT Threshold in dB (for peak_relative) | |
| --noise-floor-percentile FLOAT Percentile for noise floor estimation | |
| --noise-floor-delta-db FLOAT Delta above noise floor in dB | |
| --min-sound-ms INT Minimum sound duration in ms | |
| --no-trimmed-audio Skip saving trimmed audio files | |
| --output-dir PATH Custom output directory | |
| Examples: | |
| # Use config defaults | |
| python preprocess_esc50.py --config config.yaml | |
| # Override threshold parameters | |
| python preprocess_esc50.py --config config.yaml \ | |
| --threshold-strategy noise_floor \ | |
| --noise-floor-percentile 2.0 \ | |
| --noise-floor-delta-db 5.0 \ | |
| --min-sound-ms 25 | |
| # Generate metadata only (no trimmed audio) | |
| python preprocess_esc50.py --config config.yaml --no-trimmed-audio | |
| ``` | |
| --- | |
| ## Summary | |
| The TREA 2.0 pipeline generates balanced, constraint-satisfying audio QA samples through: | |
| 1. **Preprocessing** (Duration only): Adaptive noise-floor thresholding + edge trimming | |
| 2. **Exact Duration Filling**: Pre-generate sample durations to sum exactly to task duration | |
| 3. **Capacity-Aware Balancing**: | |
| - **COUNT**: High answer targets β high-capacity samples | |
| - **ORDER**: Advanced question types β high-capacity samples | |
| 4. **Silence Reduction**: ORDER/VOLUME randomly sample n_clips from [max_clips-3, max_clips_per_sample] | |
| 5. **Crossfade Transitions**: Smooth audio-silence boundaries (500ms) + within-source (50ms) | |
| 6. **Category Balancing**: Least-used selection ensures even ESC-50 category distribution | |
| 7. **Gap Constraints**: Mathematical guarantees (1.5x for longest, 0.75x for shortest, 4.0x/0.25x for volume) | |
| 8. **Retry Mechanisms**: Failed samples rejected, pipeline continues until target count reached | |
| All randomness is seeded (`random_seed: 42`) for reproducibility. | |