TREA 2.0 - Technical Documentation
Comprehensive technical documentation for the TREA 2.0 audio dataset generation pipeline. This document covers the complete implementation including algorithms, mathematical formulations, configuration parameters, preprocessing details, and capacity-aware balancing mechanisms.
For Quick Start Guide: See README.md
Table of Contents
- Pipeline Overview
- How Sample Durations Are Generated
- Configuration Reference
- ESC-50 Preprocessing
- Audio Utilities
- Task: COUNT
- Task: DURATION
- Task: ORDER
- Task: VOLUME
- Deterministic Balancing Mechanisms
- Rejection Logic and Retry Mechanisms
- Complete Task Creation Explanation
- Command-Line Arguments
- Summary
Pipeline Overview
Architecture
The pipeline generates four types of audio-based question-answering samples:
| Task | Question Type | Example Question |
|---|---|---|
| COUNT | Counting unique sounds | "How many unique sounds do you hear?" |
| DURATION | Temporal comparison | "Which sound plays for the longest duration?" |
| ORDER | Temporal ordering | "Which sound plays first/last/after X?" |
| VOLUME | Loudness comparison | "Which sound is the loudest/softest?" |
Directory Structure
pipeline/
├── main.py              # Entry point - orchestrates all tasks
├── config.yaml          # All configuration parameters
├── tasks/
│   ├── task_count.py    # CountTaskGenerator class
│   ├── task_duration.py # DurationTaskGenerator class
│   ├── task_order.py    # OrderTaskGenerator class
│   └── task_volume.py   # VolumeTaskGenerator class
├── utils/
│   ├── __init__.py      # Exports all utilities
│   ├── audio_utils.py   # Audio processing functions
│   ├── dataset_utils.py # ESC50Dataset, PreprocessedESC50Dataset
│   ├── question_utils.py # QuestionGenerator
│   ├── llm_utils.py     # LLMQuestionGenerator
│   └── logger.py        # setup_logger
└── output/              # Generated outputs
Data Flow
ESC-50 Dataset (2000 clips, 50 categories, 5s each)
        ↓
[DURATION TASK ONLY] Preprocessing Script (preprocess_esc50.py)
├── Detects sound regions using adaptive noise-floor thresholding
├── Trims leading/trailing silence (keeps internal structure)
└── Calculates effective durations
        ↓
ESC-50_preprocessed/
├── effective_durations.csv (metadata with effective durations)
└── trimmed_audio/*.wav (edge-trimmed clips)
        ↓
Pipeline (task-specific generation with balancing)
├── COUNT: Uses raw ESC-50 clips
├── DURATION: Uses preprocessed clips with effective durations
├── ORDER: Uses raw ESC-50 clips
└── VOLUME: Uses raw ESC-50 clips (normalized then volume-adjusted)
        ↓
output/{task}/
├── audios/*.wav (generated audio samples)
├── {task}_mcq.csv (multiple choice questions)
├── {task}_open_text.csv (open-ended questions)
└── {task}_metadata.csv (detailed metadata)
Entry Point: main.py
The main orchestration happens via individual task runner functions:
def run_count_task(config: dict, logger):
    generator = CountTaskGenerator(config, logger)
    generator.dataset.reset_category_usage()
    generator.generate_dataset()

def run_duration_task(config: dict, logger):
    generator = DurationTaskGenerator(config, logger)
    generator.dataset.reset_category_usage()
    generator.generate_dataset()

def run_order_task(config: dict, logger):
    generator = OrderTaskGenerator(config, logger)
    generator.dataset.reset_category_usage()
    generator.generate_dataset()

def run_volume_task(config: dict, logger):
    generator = VolumeTaskGenerator(config, logger)
    generator.dataset.reset_category_usage()
    generator.generate_dataset()
How Sample Durations Are Generated
IMPORTANT: Sample durations are generated upfront to exactly fill the target task duration.
The Algorithm
Located in utils/audio_utils.py:
def generate_sample_durations_for_task(
    task_duration_hours: float,
    min_clip_duration: float,
    max_clip_duration: float
) -> list:
    """
    Generate sample durations that exactly fill the target task duration.
    """
    task_duration_seconds = task_duration_hours * 3600
    remaining = task_duration_seconds
    durations = []
    while remaining >= min_clip_duration:
        # Cap max at remaining to avoid overshoot
        effective_max = min(max_clip_duration, remaining)
        # If remaining is less than min, we can't fit another sample
        if effective_max < min_clip_duration:
            break
        # Sample uniformly within valid range
        d = random.uniform(min_clip_duration, effective_max)
        durations.append(d)
        remaining -= d
    # Shuffle to randomize order
    random.shuffle(durations)
    return durations
- Start with `remaining = total_seconds`
- While `remaining >= min_clip_duration`:
  - Sample `d ~ Uniform(min, min(max, remaining))`
  - Append `d` to the durations list
  - Subtract `d` from `remaining`
- Shuffle and return
Mathematical Properties
Guarantee: $\sum_{i=1}^{N} d_i \leq T$ and $T - \sum d_i < d_{\min}$
Where:
- $T$ = total task duration
- $d_i$ = duration of sample $i$
- $d_{\min}$ = minimum clip duration
- $N$ = number of samples generated (variable, not fixed!)
Each duration: $d_i \sim \text{Uniform}(d_{\min}, \min(d_{\max}, \text{remaining}_i))$
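Since each draw is uniform, a back-of-envelope estimate (not taken from the source) for the sample count follows from the mean draw length: $\mathbb{E}[N] \approx \frac{2T}{d_{\min} + d_{\max}}$. With $T = 3600$s and the default range $[20, 60]$s this gives $3600 / 40 = 90$, matching the "estimated 90" in the example below.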
Example
With task_duration_size = 1.0 hours (3600s), min = 20s, max = 60s:
remaining=3600.0 → d₁=45.2s → remaining=3554.8
remaining=3554.8 → d₂=28.7s → remaining=3526.1
remaining=3526.1 → d₃=52.1s → remaining=3474.0
...
remaining=35.2 → d₈₉=35.2s → remaining=0 (final draw capped at remaining)
Result: 89 samples totaling exactly 3600s (instead of the estimated 90)
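A quick sanity check of the guarantees above, assuming the function is importable from utils.audio_utils (the seed value here is arbitrary):

import random
from utils.audio_utils import generate_sample_durations_for_task

random.seed(42)
durations = generate_sample_durations_for_task(
    task_duration_hours=1.0, min_clip_duration=20.0, max_clip_duration=60.0)
total = sum(durations)
assert total <= 3600.0        # never overshoots T
assert 3600.0 - total < 20.0  # leftover is smaller than d_min
print(f"{len(durations)} samples totaling {total:.1f}s")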
Where It's Called
Each task's generate_dataset() method uses this:
def generate_dataset(self) -> tuple:
    # Generate all durations upfront
    sample_durations = generate_sample_durations_for_task(
        self.task_duration_hours,
        self.min_clip_duration,
        self.max_clip_duration
    )
    num_samples = len(sample_durations)
    self.logger.info(f"Generating {num_samples} samples...")
    # Each sample uses its pre-assigned duration
    for i, target_duration in enumerate(sample_durations):
        metadata = self.generate_sample(i, target_duration=target_duration, ...)
Configuration Reference
All parameters are defined in config.yaml.
Dataset Class Subset Configuration
dataset:
  use_class_subset: false                          # Enable to use only a subset of ESC-50 classes
  num_classes_subset: 40                           # Number of classes for train/val/test (e.g., 40 of 50)
  subset_persist_path: "output/class_subset.json"  # Path to save/load the class subset
  subset_seed: 42                                  # Random seed for subset selection (persisted)
Purpose: Create in-distribution (ID) splits using a subset of classes, then optionally test on out-of-distribution (OOD) using all classes.
Workflow:
1. Set `use_class_subset: true` and `num_classes_subset: 40`
2. Run the pipeline - 40 classes are randomly selected and saved to class_subset.json
3. Generate train/val/test splits - all use the same 40 classes
4. For the OOD test: set `use_class_subset: false` and use a different output path
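A minimal sketch of the persistence behavior implied by subset_persist_path and subset_seed; the helper name and JSON layout are assumptions, not the pipeline's actual code:

import json
import os
import random

def load_or_create_class_subset(all_classes, n_classes, path, seed=42):
    """Reuse a saved subset if present; otherwise sample one and persist it."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["classes"]
    rng = random.Random(seed)  # seeded -> the same subset is reproducible
    subset = sorted(rng.sample(sorted(all_classes), n_classes))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump({"classes": subset, "seed": seed}, f, indent=2)
    return subset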
Global Audio Parameters
audio:
  min_clip_duration: 20.0    # Minimum generated clip duration (seconds)
  max_clip_duration: 60.0    # Maximum generated clip duration (seconds)
  source_clip_duration: 5.0  # ESC-50 clip length (seconds)

  # Silence and crossfade parameters (applied to ALL tasks)
  min_silence_duration: 100        # Minimum silence ALWAYS inserted between clips (ms)
  max_extra_silence_per_gap: 500   # Max extra silence per gap when distributing remainder (ms)
  crossfade_duration: 500          # Crossfade at audio-silence transitions (ms) for smooth joins
  crossfade_within_source: 50      # Small crossfade within same-source repetitions (ms) for COUNT task

  with_silence: true               # Enable silence insertion between clips
  normalize: false
  normalize_target_dBFS: -20.0
Task-Specific Parameters
COUNT Task
count:
  enabled: true
  task_duration_size: 2.0    # Hours of total audio to generate
  max_clips_per_sample: 10   # Maximum unique sounds per sample (1 to 10)
  ordering_mode: "random"    # "random" (shuffled clips) or "consecutive" (grouped by source)
  # CAPACITY-AWARE ANSWER BALANCING:
  # - Creates a balanced distribution of answers from 1 to max_clips_per_sample
  # - Sorts samples by capacity (max clips each can fit)
  # - Assigns higher targets to high-capacity samples
  # - Clamps targets to what actually fits (reduces excessive silence)
DURATION Task
duration:
  enabled: true
  task_duration_size: 2.0
  preprocessed_data_path: "/home/debarpanb1/TREA_2.0/ESC-50_preprocessed"
  question_types: ["shortest", "longest"]
  num_unique_sources: 10             # Can be an int or a list (e.g., [2,3,4,5])
  ordering_methods: ["consecutive"]  # Only consecutive ordering for the duration task

  # Preprocessing parameters (adaptive noise-floor thresholding)
  threshold_strategy: "noise_floor"  # Adaptive per-clip (recommended)
  noise_floor_percentile: 2.0        # Use 2nd percentile as noise floor
  noise_floor_delta_db: 5.0          # Threshold = noise_floor + 5 dB
  min_sound_duration_ms: 25          # Filter transient spikes

  # Gap multipliers
  multiplier_longest: 1.5    # Target must be ≥ 1.5x max background
  multiplier_shortest: 0.75  # Target must be ≤ 0.75x min background (changed from 0.5)
  min_effective_duration_per_source: 1.0  # Minimum duration per source (seconds)
  reject_if_gap_not_met: true
  sample_different_clips_same_class: true
ORDER Task
order:
  enabled: true
  task_duration_size: 2.0
  max_clips_per_sample: 10  # Cap for maximum clips to join
  question_types: ["first", "last", "second", "second_last", "after", "before"]
  min_clips_for_second_questions: 3  # "second" and "second_last" require ≥3 clips
  allow_source_repetition: false     # Each clip comes from a unique source
  # CAPACITY-AWARE QUESTION TYPE BALANCING:
  # - Each question type appears equally often across samples
  # - Advanced types (second, second_last) are assigned to high-capacity samples
  # - Basic types (first, last, after, before) go to lower-capacity samples
  # - NO n_clips balancing: randomly samples from [max(2, max_clips-3), max_clips_per_sample]
VOLUME Task
volume:
  enabled: true
  task_duration_size: 2.0
  max_clips_per_sample: 10  # Cap for maximum clips with different volumes
  question_types: ["max_loudness", "min_loudness"]

  # Normalization (CRITICAL for controlled volume comparison)
  normalize_to_baseline: true
  baseline_dBFS: -20.0  # All clips normalized to this level first
  use_lufs: false       # DISABLED - LUFS would give everything the same perceived loudness!
  baseline_lufs: -23.0  # EBU R128 standard (unused when use_lufs=false)

  # Volume gap constraints (multipliers)
  multiplier_max_loudness: 4.0   # Max must be ≥ 4x the second-loudest (~12 dB)
  multiplier_min_loudness: 0.25  # Min must be ≤ 0.25x the second-softest (~12 dB)
  reject_if_gap_not_met: true

  # Source clip options
  use_same_clip_different_volumes: false  # Use different clips (not the same clip repeated)
  repetitions_per_source: [2, 3, 4]       # If the same clip is used, how many repetitions

  # QUESTION TYPE BALANCING: each question type appears equally often across samples
  # NO n_clips balancing: randomly samples from [max(2, max_clips-3), max_clips_per_sample]
ESC-50 Preprocessing (Duration Task Only)
File: preprocess_esc50.py
Purpose: Preprocess ESC-50 clips for duration task by detecting actual sound regions and trimming silence.
Why Preprocessing?
The DURATION task compares sound durations. Raw ESC-50 clips have variable amounts of leading/trailing silence, which would make duration comparisons ambiguous. Preprocessing:
- Detects actual sound regions using adaptive amplitude thresholding
- Trims leading and trailing silence (preserves internal structure)
- Calculates effective duration (sum of all sound regions)
- Generates metadata CSV with per-clip durations
Preprocessing Pipeline
Raw ESC-50 clip (5s with silence)
        ↓
1. Load audio and convert to amplitude array
2. Compute RMS envelope (frame-by-frame energy)
3. Convert RMS to dB values
4. Apply adaptive threshold strategy
5. Detect contiguous sound regions
6. Trim edges (only if silence >= 100ms)
7. Calculate effective duration
8. Save trimmed audio + metadata
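A condensed sketch of steps 1-5, assuming librosa for loading and RMS computation; the actual script's function names and framing parameters may differ:

import numpy as np
import librosa

def detect_sound_regions(path, percentile=2.0, delta_db=5.0,
                         min_sound_ms=25, frame_ms=10):
    """Return (start_ms, end_ms) regions whose energy exceeds the adaptive threshold."""
    y, sr = librosa.load(path, sr=None, mono=True)
    hop = max(1, int(sr * frame_ms / 1000))
    rms = librosa.feature.rms(y=y, frame_length=2 * hop, hop_length=hop)[0]
    db = 20 * np.log10(np.maximum(rms, 1e-10))            # RMS envelope in dB
    threshold = np.percentile(db, percentile) + delta_db  # noise floor + delta
    active = db > threshold
    regions, start = [], None
    for i, is_on in enumerate(active):                    # collect contiguous active runs
        if is_on and start is None:
            start = i
        elif not is_on and start is not None:
            regions.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None:
        regions.append((start * frame_ms, len(active) * frame_ms))
    return [(s, e) for s, e in regions if e - s >= min_sound_ms]  # drop transient spikes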
Adaptive Noise-Floor Thresholding
The preprocessing uses an adaptive per-clip threshold strategy:
# Strategy: 'noise_floor' (adaptive, recommended)
noise_floor_db = np.percentile(db_values, noise_floor_percentile) # e.g., 2nd percentile
absolute_threshold = noise_floor_db + noise_floor_delta_db # e.g., +5 dB above noise floor
Key Parameters (from config.yaml):
duration:
  threshold_strategy: "noise_floor"  # Adaptive per-clip (recommended)
  noise_floor_percentile: 2.0        # Use 2nd percentile as noise floor estimate
  noise_floor_delta_db: 5.0          # Threshold = noise_floor + 5 dB
  min_sound_duration_ms: 25          # Filter out transient spikes < 25ms
Why Adaptive?
- Each clip has different background noise levels
- Fixed threshold (e.g., -40 dB) works poorly across diverse sounds
- Adaptive threshold adjusts per-clip based on its own noise floor
Alternative (legacy):
threshold_strategy: "peak_relative" # threshold = peak_dB - 20 dB (fixed offset)
amplitude_threshold_db: -20.0
Edge Trimming Strategy
ADAPTIVE EDGE-ONLY TRIMMING - preserves natural periodicity:
def extract_sound_with_edges_trimmed(audio, regions, min_silence_to_trim_ms=100, buffer_ratio=0.1):
    """
    Trim ONLY the leftmost and rightmost silence IF significant.
    Preserves ALL internal structure (perfect for periodic sounds).
    """
    leading_silence_ms = regions[0][0]                 # Time before the first sound
    trailing_silence_ms = len(audio) - regions[-1][1]  # Time after the last sound
    # Only trim the leading edge if its silence >= 100ms
    if leading_silence_ms >= min_silence_to_trim_ms:
        buffer_ms = max(200, int(leading_silence_ms * buffer_ratio))  # Keep ~10% as buffer
        trim_start_ms = max(0, regions[0][0] - buffer_ms)
    else:
        trim_start_ms = 0  # Keep from the start
    # Same logic for the trailing edge
    if trailing_silence_ms >= min_silence_to_trim_ms:
        buffer_ms = max(200, int(trailing_silence_ms * buffer_ratio))
        trim_end_ms = min(len(audio), regions[-1][1] + buffer_ms)
    else:
        trim_end_ms = len(audio)  # Keep to the end
    return audio[trim_start_ms:trim_end_ms]
Why Edge-Only?
- Clock ticks, footsteps, typing have periodic silence between sounds
- Removing internal silences destroys natural rhythm
- Edge trimming removes irrelevant silence while preserving periodicity
Output Files
ESC-50_preprocessed/
├── effective_durations.csv
│   ├── filename
│   ├── category
│   ├── raw_duration_s (original 5.0s)
│   ├── final_duration_s (after edge trimming)
│   ├── effective_duration_s (sum of sound regions)
│   ├── num_sound_regions
│   ├── peak_amplitude_db
│   ├── avg_rms_db
│   └── threshold_strategy, noise_floor_percentile, noise_floor_delta_db
└── trimmed_audio/
    ├── 1-100032-A-0.wav (edge-trimmed clips)
    └── ...
Running Preprocessing
# Using config defaults
python preprocess_esc50.py --config config.yaml
# Override parameters
python preprocess_esc50.py --config config.yaml \
--threshold-strategy noise_floor \
--noise-floor-percentile 2.0 \
--noise-floor-delta-db 5.0 \
--min-sound-ms 25
# Don't save trimmed audio (only CSV)
python preprocess_esc50.py --config config.yaml --no-trimmed-audio
Preprocessing Statistics Example
ESC-50 Preprocessing Summary
============================================================
Total clips processed: 2000
Successfully processed: 2000
Raw duration statistics:
Mean: 5.000s Std: 0.000s Min: 5.000s Max: 5.000s
Final duration statistics (edges trimmed):
Mean: 4.723s Std: 0.412s Min: 2.134s Max: 5.000s
Effective duration statistics (sum of sound regions):
Mean: 3.856s Std: 0.823s Min: 0.542s Max: 4.982s
Comparison:
Avg effective: 3.856s
Avg final: 4.723s
Difference: 0.867s (internal silences preserved)
Average edge trimming reduction: 5.5%
How Duration Task Uses Preprocessed Data
The DurationTaskGenerator loads preprocessed data:
self.preprocessed_dataset = PreprocessedESC50Dataset(
    metadata_csv=config['tasks']['duration']['preprocessed_data_path'] + '/effective_durations.csv',
    audio_dir=config['tasks']['duration']['preprocessed_data_path'] + '/trimmed_audio'
)
# Calculate the average effective duration for slot distribution
effective_durations = self.preprocessed_dataset.metadata_df['effective_duration_s']
self.avg_effective_duration = effective_durations.mean()  # ~3.856s
Audio Utilities
Located in utils/audio_utils.py.
generate_single_clip_duration(min_duration, max_duration) → float
Purpose: Generate a random target clip duration using UNIFORM sampling.
Implementation:
def generate_single_clip_duration(min_duration: float, max_duration: float) -> float:
    return random.uniform(min_duration, max_duration)
Mathematical Formulation:
With default values (20s, 60s):
- Mean: $\mu = \frac{20 + 60}{2} = 40$ seconds
- Standard Deviation: $\sigma = \frac{60 - 20}{\sqrt{12}} \approx 11.5$ seconds
get_max_clip_num_to_be_joined(target_duration_s, source_duration_s, min_silence_ms) → Tuple[int, float]
Purpose: Calculate maximum number of source clips that can fit in target duration.
Returns: Tuple of (max_clips, remainder_seconds)
Implementation (conceptual):
def get_max_clip_num_to_be_joined(target_s, source_s, min_silence_ms):
    silence_s = min_silence_ms / 1000.0
    # Each clip occupies its own length plus one silence gap (except the last)
    effective_unit = source_s + silence_s
    max_clips = int((target_s + silence_s) / effective_unit)
    remainder = target_s - (max_clips * source_s + (max_clips - 1) * silence_s)
    return max_clips, remainder
Mathematical Formula:
$$N_{\max} = \left\lfloor \frac{T + g}{S + g} \right\rfloor \qquad r = T - \left(N_{\max} \cdot S + (N_{\max} - 1) \cdot g\right)$$
Where:
- $T$ = target duration (seconds)
- $S$ = source clip duration (5.0s for ESC-50)
- $g$ = minimum silence gap (seconds)
- $r$ = remainder (seconds)
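A worked example with the conceptual implementation above; the numbers match the 45-second COUNT example later in this document:

max_clips, remainder = get_max_clip_num_to_be_joined(45.0, 5.0, 100)
# max_clips = floor((45.0 + 0.1) / (5.0 + 0.1)) = floor(8.84) = 8
# remainder = 45.0 - (8 * 5.0 + 7 * 0.1) = 4.3 seconds of slack for extra silence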
build_count_task_audio(source_audios, source_categories, target_duration, ...)
Purpose: Build the final audio for COUNT task.
Parameters:
- source_audios: List of AudioSegment objects (one per category)
- source_categories: List of category names
- target_duration: Target total duration in seconds
- ordering_mode: "random" or "consecutive"
- source_clip_duration_seconds: Duration of each source clip
- min_silence_ms, max_extra_silence_per_gap_ms: Silence parameters
Returns: Tuple of (final_audio, clip_sequence, build_metadata)
build_duration_task_audio(...)
Purpose: Build audio for DURATION task with slot distribution.
build_clip_sequence_with_silences(clips, target_duration_s, min_silence_ms, max_extra_silence_per_gap_ms, crossfade_ms)
Purpose: Concatenate clips with random silence gaps and smooth crossfades.
Algorithm:
1. Calculate the total audio content duration
2. Calculate the minimum required silence: (n_clips - 1) × min_silence_ms
3. Calculate the available extra time: target_duration - total_audio - min_silence
4. Distribute the extra time randomly across gaps (up to max_extra_silence_per_gap_ms per gap)
5. Build the sequence with crossfades:
   - Audio → Silence: crossfade for a smooth transition
   - Silence → Audio: no crossfade (preserves the audio start)
Crossfade Benefits:
- Smooth transitions between audio and silence
- Reduces clicks/pops at audio boundaries
- Preserves natural sound attack (no crossfade at audio start)
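A minimal sketch of this algorithm using pydub (suggested by the AudioSegment objects elsewhere in this document); the function name and the exact gap-distribution policy are assumptions:

import random
from pydub import AudioSegment

def build_sequence_sketch(clips, target_duration_s, min_silence_ms=100,
                          max_extra_silence_ms=500, crossfade_ms=500):
    """Join clips with minimum silences, spreading the leftover time as extra silence."""
    n_gaps = len(clips) - 1
    total_audio_ms = sum(len(c) for c in clips)  # pydub lengths are in ms
    extra_ms = max(0, int(target_duration_s * 1000) - total_audio_ms - n_gaps * min_silence_ms)
    gap_lengths = []
    for _ in range(n_gaps):
        take = min(extra_ms, random.randint(0, max_extra_silence_ms))
        gap_lengths.append(min_silence_ms + take)
        extra_ms -= take
    out = clips[0]
    for clip, gap_ms in zip(clips[1:], gap_lengths):
        silence = AudioSegment.silent(duration=gap_ms, frame_rate=out.frame_rate)
        fade = min(crossfade_ms, gap_ms, len(out))
        out = out.append(silence, crossfade=fade)  # audio -> silence: smooth fade-out
        out = out.append(clip, crossfade=0)        # silence -> audio: keep the attack
    return out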
Task: COUNT
File: tasks/task_count.py
Class: CountTaskGenerator
Complete Flow
CountTaskGenerator.__init__(config, logger)
        ↓
Initialize:
  - ESC50Dataset (loads metadata, tracks category usage)
  - AudioProcessor
  - QuestionGenerator
  - LLMQuestionGenerator (if enabled)
        ↓
generate_dataset()
        ↓
1. num_samples = calculate_num_samples_for_task(task_duration_hours, min, max)
2. Create a balanced_answers list covering 1 to max_clips_per_sample
3. Shuffle balanced_answers
4. For each sample:
     generate_sample(sample_id, target_unique_count=balanced_answers[i])
5. Save CSVs
Key Method: generate_sample(sample_id, target_unique_count)
Pipeline:
1. Generate a random target duration: clip_duration_seconds = generate_single_clip_duration(min, max)
2. Calculate max clips: max_clips, remainder = get_max_clip_num_to_be_joined(...)
3. Cap n_unique_audios at min(target_unique_count, max_clips, 50)
4. Select categories: selected_categories = dataset.get_least_used_categories(n_unique_audios)
5. Track usage: increment category_usage_counts for each selected category
6. Sample one file per category: dataset.sample_file_from_category(category)
7. Load the source audios
8. Build the final audio: build_count_task_audio(source_audios, categories, target_duration, ordering_mode, ...)
9. Export the audio file
10. Generate MCQ and open-text questions
11. Return the metadata dict
Balanced Answer Distribution (Updated with max_clips_per_sample)
# In generate_dataset()
max_clips_per_sample = self.task_config.get('max_clips_per_sample', 10)  # Single number: 10
possible_answers = list(range(1, max_clips_per_sample + 1))  # [1, 2, 3, ..., 10]
samples_per_answer = num_samples // len(possible_answers)
remainder = num_samples % len(possible_answers)
balanced_answers = []
for answer in possible_answers:
    count = samples_per_answer + (1 if remainder > 0 else 0)
    balanced_answers.extend([answer] * count)
    remainder = max(0, remainder - 1)
random.shuffle(balanced_answers)
For 90 samples, max_clips_per_sample=10: Each answer (1-10) appears exactly 9 times.
Silence Reduction Strategy (NEW)
Each sample's target answer is capped at what actually fits in the duration:
# In generate_sample()
max_clips, _ = get_max_clip_num_to_be_joined(clip_duration_seconds, source_clip_duration, min_silence_ms)
if target_unique_count is not None:
    # Cap target at what actually fits (reduces silence)
    n_unique_audios = min(target_unique_count, max_clips, len(CATEGORIES))
Example:
- Target answer from balanced pool: 8 unique sounds
- Duration allows: max_clips = 7
- Actual n_unique_audios: min(8, 7) = 7 ✓ (uses the max possible, reducing silence)
Why? Prevents excessive silence when target exceeds what fits in duration.
Task: DURATION
File: tasks/task_duration.py
Class: DurationTaskGenerator
Complete Flow
DurationTaskGenerator.__init__(config, logger)
        ↓
Initialize:
  - PreprocessedESC50Dataset (uses effective_durations.csv)
  - Calculate avg_effective_duration from the preprocessed data
  - AudioProcessor, QuestionGenerator
  - Load multiplier_longest, multiplier_shortest from config
        ↓
generate_dataset()
        ↓
1. num_samples = calculate_num_samples_for_task(...)
2. Create balanced question types: ["longest"] * 45 + ["shortest"] * 45
3. Shuffle balanced_types
4. While len(samples) < num_samples:
     generate_sample(sample_idx, question_type=balanced_types[idx])
     If it returns None → increment rejection_count and continue
5. Save CSVs
Key Methods
_calculate_max_clips_and_sources(target_duration_s, question_type)
Purpose: Determine the valid number of sources based on question type and duration.
For LONGEST:
- The target needs ≥2 clips to beat the backgrounds by 1.5x
- min_valid_sources = 2
- max_valid_sources = max_clips - 2 + 1
For SHORTEST:
- The target gets 1 clip
- Each background needs ≥2 clips so its total duration comfortably exceeds the target
- max_valid_sources = 1 + (max_clips - 1) // 2

# Filter config values to the valid range, then pick RANDOMLY
valid_config_sources = [n for n in num_sources_config if min_valid <= n <= max_valid]
n_sources = random.choice(valid_config_sources)
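Pulling the bounds and the filter together, a minimal sketch (the lower bound of 2 for SHORTEST is an assumption):

import random

def pick_num_sources_sketch(max_clips, question_type, num_sources_config):
    """Sketch of the source-count selection described above."""
    if question_type == "longest":
        min_valid = 2                         # target + at least one background
        max_valid = max_clips - 2 + 1         # target uses 2 clips, backgrounds 1 each
    else:  # "shortest"
        min_valid = 2                         # assumed lower bound
        max_valid = 1 + (max_clips - 1) // 2  # backgrounds need >= 2 clips each
    valid = [n for n in num_sources_config if min_valid <= n <= max_valid]
    return random.choice(valid) if valid else None  # None -> sample cannot be built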
_try_generate_sample(sample_id, question_type)
Full Algorithm:
1. Generate a target duration: generate_single_clip_duration(min, max)
2. Calculate max_clips and n_sources: _calculate_max_clips_and_sources(...)
3. Select the target category (least used)
4. Select background categories (from the remaining least used)
5. Calculate the slot distribution based on question_type (see the sketch after this list)
6. For each category, select source files and generate clip durations
7. Load and trim the clips
8. Calculate the total effective duration per category
9. Verify the gap constraint
10. If the gap is not satisfied, try _try_improve_slot_distribution()
11. If still not satisfied, return None (triggers a retry)
12. Build the audio and generate questions
13. Return metadata
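One simple policy consistent with the slot-distribution description in step 5; the real generator may spread leftover slots differently (the worked example later gives each background two clips):

def slot_distribution_sketch(n_sources, max_clips, question_type):
    """Sketch: clip slots per source; index 0 is the target, the rest are backgrounds."""
    if question_type == "longest":
        # Backgrounds get one clip each; the target takes all remaining slots (>= 2)
        return [max_clips - (n_sources - 1)] + [1] * (n_sources - 1)
    # "shortest": the target gets one clip; backgrounds split the rest (>= 2 each)
    per_background = (max_clips - 1) // (n_sources - 1)
    return [1] + [per_background] * (n_sources - 1)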
_try_improve_slot_distribution(slot_distribution, durations, question_type, max_clips)
Purpose: Redistribute slots to satisfy gap constraint.
Task: ORDER
File: tasks/task_order.py
Class: OrderTaskGenerator
Complete Flow
OrderTaskGenerator.__init__(config, logger)
        ↓
Initialize ESC50Dataset, AudioProcessor, QuestionGenerator
        ↓
generate_dataset()
        ↓
1. Generate sample durations upfront (exact fill)
2. num_samples = len(sample_durations)
3. Create a balanced question_types distribution
4. For each sample:
     generate_sample(sample_id, target_question_type=balanced_types[i])
     → n_clips randomly selected from [max(2, max_clips-3), min(max_clips, max_clips_per_sample)]
5. Save CSVs
Key Method: _get_valid_question_types(n_clips)
Filters question types based on clip count:
- second, second_last: require n_clips >= min_clips_for_second_questions (3 in the config above)
- after, before: require n_clips >= 2
- first, last: always valid
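A sketch of this filter (the default here is taken from the ORDER config above; the actual function signature may differ):

def get_valid_question_types_sketch(n_clips, min_clips_for_second=3):
    """Return the question types a sample with n_clips can support."""
    valid = ["first", "last"]  # always valid
    if n_clips >= 2:
        valid += ["after", "before"]
    if n_clips >= min_clips_for_second:
        valid += ["second", "second_last"]
    return valid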
Key Method: generate_sample(sample_id, target_question_type, target_duration_seconds)
Algorithm:
1. Use the pre-generated target_duration_seconds (from sample_durations)
2. Calculate max_clips from the duration: get_max_clip_num_to_be_joined(...)
3. Silence reduction - randomly select n_clips:
   min_clips = max(2, max_clips - 3)
   max_clips_allowed = min(max_clips, max_clips_per_sample, len(CATEGORIES))
   if min_clips > max_clips_allowed:  # Handle edge case
       min_clips = max_clips_allowed
   n_clips = random.randint(min_clips, max_clips_allowed)
4. Get the valid question types for n_clips
5. Select the answer position based on question type (see the sketch after this list):
   - first → position 0
   - last → position n_clips - 1
   - second → position 1
   - second_last → position n_clips - 2
   - after → random position 1 to n_clips - 1
   - before → random position 0 to n_clips - 2
6. Select categories using least-used balancing (answer first, then others)
7. Build the audio with build_clip_sequence_with_silences (includes crossfade)
8. Generate questions, including the sequence question
9. Return metadata
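The position mapping from step 5 as a standalone sketch (the function name is assumed):

import random

def answer_position_sketch(question_type, n_clips):
    """Map a question type to the answer clip's 0-indexed position."""
    if question_type == "first":
        return 0
    if question_type == "last":
        return n_clips - 1
    if question_type == "second":
        return 1
    if question_type == "second_last":
        return n_clips - 2
    if question_type == "after":
        return random.randint(1, n_clips - 1)  # the clip that comes after some X
    if question_type == "before":
        return random.randint(0, n_clips - 2)  # the clip that comes before some X
    raise ValueError(f"unknown question type: {question_type}")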
Silence Reduction: Target n_clips is capped at max_clips to avoid excessive silence.
Task: VOLUME
File: tasks/task_volume.py
Class: VolumeTaskGenerator
Complete Flow
VolumeTaskGenerator.__init__(config, logger)
        ↓
Initialize ESC50Dataset, AudioProcessor, QuestionGenerator
Load multiplier_max_loudness, multiplier_min_loudness, baseline normalization settings
        ↓
generate_dataset()
        ↓
1. Generate sample durations upfront (exact fill)
2. num_samples = len(sample_durations)
3. Create a balanced clips_count_pool from 2 to max_clips_per_sample
4. Create balanced question_types: ["max_loudness"] * N/2 + ["min_loudness"] * N/2
5. Shuffle both pools
6. Store clips_count_pool as an instance variable
7. For each sample:
     generate_sample(sample_id, target_question_type=balanced_types[i])
     → Uses clips_count_pool.pop(0) internally, capped at the max clips that fit
     → Normalizes clips to the baseline, applies volume adjustments
     → Verifies gap constraints (up to 10 attempts)
8. Save CSVs
Key Methods
_normalize_to_baseline(audio)
def _normalize_to_baseline(self, audio):
if not self.normalize_to_baseline:
return audio
change_in_dBFS = self.baseline_dBFS - audio.dBFS
return audio.apply_gain(change_in_dBFS)
_verify_loudness_gap(volume_levels, question_type)
For MAX_LOUDNESS:
required_gap_dB = 20 * math.log10(self.multiplier_max_loudness)  # ≈ 12.04 dB for multiplier 4.0
actual_gap_dB = max_level - second_max
gap_satisfied = actual_gap_dB >= required_gap_dB
For MIN_LOUDNESS:
required_gap_dB = abs(20 * math.log10(self.multiplier_min_loudness))  # ≈ 12.04 dB for multiplier 0.25
actual_gap_dB = second_min - min_level
gap_satisfied = actual_gap_dB >= required_gap_dB
Volume Level Generation
Volume levels are generated to satisfy gap constraints:
- For max_loudness: the target gets +gap_dB above the baseline; backgrounds sit at or below the baseline
- For min_loudness: the target gets -gap_dB below the baseline; backgrounds sit at or above the baseline
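A minimal sketch of one gap-satisfying generation scheme; the function name, the background range, and the extra random margin are assumptions, not the pipeline's exact policy:

import math
import random

def generate_volume_levels_sketch(n_clips, question_type, multiplier=4.0):
    """Return (levels, answer_index): dB offsets from the baseline plus the target's slot."""
    gap_db = abs(20 * math.log10(multiplier))  # ~12.04 dB for multipliers 4.0 / 0.25
    backgrounds = [random.uniform(-3.0, 0.0) for _ in range(n_clips - 1)]
    if question_type == "max_loudness":
        # Target sits at least gap_db above the loudest background
        target = max(backgrounds) + gap_db + random.uniform(0.0, 2.0)
    else:
        # Target sits at least gap_db below the softest background
        target = min(backgrounds) - gap_db - random.uniform(0.0, 2.0)
    levels = backgrounds + [target]
    random.shuffle(levels)  # the answer's position in the sequence stays random
    return levels, levels.index(target)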
Deterministic Balancing Mechanisms
Overview
The pipeline ensures balanced distributions across multiple dimensions with capacity-aware assignment.
1. Capacity-Aware Answer Balancing (COUNT Task)
Each possible answer (1-10) appears equally often, but higher targets are assigned to samples with higher capacity.
# Calculate capacity for each sample
for duration in sample_durations:
    max_clips, _ = get_max_clip_num_to_be_joined(duration, source_clip_duration, min_silence_ms)
    max_for_sample = min(max_clips, max_clips_per_sample, len(CATEGORIES))
    sample_max_clips.append(max_for_sample)

# Create balanced pool
possible_answers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
samples_per_answer = num_samples // len(possible_answers)
remainder = num_samples % len(possible_answers)
assignment_pool = []
for answer in possible_answers:
    count = samples_per_answer + (1 if remainder > 0 else 0)
    assignment_pool.extend([answer] * count)
    remainder = max(0, remainder - 1)

# Sort samples by capacity (descending)
sample_info.sort(key=lambda x: x[2], reverse=True)
# Sort pool descending - assign high targets first
assignment_pool.sort(reverse=True)

# Assign targets, clamped to capacity
for idx, (sample_idx, duration, capacity) in enumerate(sample_info):
    target = min(assignment_pool[idx], capacity)
    balanced_assignments[sample_idx] = target
Guarantee: Each answer value appears equally, and high targets go to samples that can fit them.
2. Capacity-Aware Question Type Balancing (ORDER Task)
ORDER task uses capacity-aware balancing - advanced question types assigned to high-capacity samples.
# Separate question types by requirements
basic_types = ['first', 'last', 'after', 'before']  # Need >= 2 clips
advanced_types = ['second', 'second_last']          # Need >= min_clips_for_second (e.g., 3)

# Sort samples by capacity (descending)
sample_info.sort(key=lambda x: x[2], reverse=True)

# Build the assignment pool - advanced types first
samples_per_type = num_samples // len(question_types)
remainder = num_samples % len(question_types)
assignment_pool = []

# Add advanced types first (for high-capacity samples)
for qtype in advanced_types:
    count = samples_per_type + (1 if remainder > 0 else 0)
    assignment_pool.extend([qtype] * count)
    remainder = max(0, remainder - 1)

# Then basic types
for qtype in basic_types:
    count = samples_per_type + (1 if remainder > 0 else 0)
    assignment_pool.extend([qtype] * count)
    remainder = max(0, remainder - 1)

# Assign with validation
for idx, (sample_idx, duration, capacity) in enumerate(sample_info):
    target_qtype = assignment_pool[idx]
    valid_types = _get_valid_question_types(capacity)
    if target_qtype not in valid_types:
        # Downgrade to a valid type
        target_qtype = random.choice(valid_types)
    balanced_assignments[sample_idx] = target_qtype
3. Simple Question Type Balancing (DURATION, VOLUME Tasks)
# DURATION: 2 types → N/2 each
# VOLUME: 2 types → N/2 each
samples_per_type = num_samples // len(question_types)
remainder = num_samples % len(question_types)
balanced_types = []
for qtype in question_types:
    count = samples_per_type + (1 if remainder > 0 else 0)
    balanced_types.extend([qtype] * count)
    remainder = max(0, remainder - 1)
random.shuffle(balanced_types)
4. Category Usage Balancing
All 50 ESC-50 categories are used equally via least-used selection:
def get_least_used_categories(self, n: int, exclude: List[str] = None) -> List[str]:
    # Sort categories by usage count
    sorted_cats = sorted(
        self.category_usage_counts.items(),
        key=lambda x: (x[1], x[0])  # Sort by count, then alphabetically for ties
    )
    # Filter excluded categories and return the first n
    available = [cat for cat, _ in sorted_cats if cat not in (exclude or [])]
    return available[:n]
Each task calls reset_category_usage() at the start to ensure independent balancing.
5. N_Clips Selection Strategy
COUNT Task: Uses capacity-aware answer balancing (see #1 above)
ORDER and VOLUME Tasks: Use silence reduction strategy (NOT balanced):
# Randomly sample n_clips from the valid range to minimize silence
min_clips = max(2, max_clips - 3)
max_clips_allowed = min(max_clips, max_clips_per_sample, len(CATEGORIES))
if min_clips > max_clips_allowed:
    min_clips = max_clips_allowed  # Handle edge case
n_clips = random.randint(min_clips, max_clips_allowed)
This maximizes clip usage within the allowed range, minimizing excessive silence.
Rejection Logic and Retry Mechanisms
When Samples Are Rejected
Rejections occur only in tasks with gap constraints:
DURATION Task: Gap constraint not satisfied
- LONGEST: target_duration < max_background × 1.5
- SHORTEST: target_duration > min_background × 0.75
VOLUME Task: Gap constraint not satisfied
- MAX_LOUDNESS: actual_gap_dB < required_gap_dB (≈12 dB for multiplier 4.0)
- MIN_LOUDNESS: actual_gap_dB < required_gap_dB (≈12 dB for multiplier 0.25)
DURATION Task Retry Logic
def generate_dataset(self):
    all_metadata = []
    sample_idx = 0
    type_idx = 0
    while len(all_metadata) < num_samples and type_idx < len(balanced_types) * 2:
        question_type = balanced_types[type_idx % len(balanced_types)]
        metadata = self.generate_sample(sample_idx, question_type)
        if metadata is not None:
            all_metadata.append(metadata)
            sample_idx += 1
        # If None, the sample was rejected - just move on to the next
        type_idx += 1
Rejection Rate Calculation
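A natural definition, assumed here rather than taken from the pipeline code, uses the counters from the retry loop above:

# Assumed definition (variable names from the DURATION retry loop above)
rejection_rate = rejection_count / (len(all_metadata) + rejection_count)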
Complete Task Creation Explanation
How Each Task Is Generated (Step-by-Step)
COUNT TASK - "How many unique sounds?"
Goal: Create audio with N unique sound sources, ask how many distinct sounds exist.
Process:
1. Preprocessing: none (uses raw ESC-50 clips)
2. Duration Generation: target_duration ~ Uniform(20s, 60s) per sample
3. Calculate Max Clips: max_clips = get_max_clip_num_to_be_joined(target_duration, 5s, 100ms)
   - Example: a 45s duration → ~8 clips of 5s each with 100ms silence between
4. Balanced Answer Selection: pre-generated pool of answers [1,2,3,...,10], balanced equally
   - The target answer (e.g., 5 unique sounds) is drawn from the pool
5. Silence Reduction: cap the target at min(target_answer, max_clips)
   - If target=8 but max_clips=6 → use 6 (prevents excessive silence)
6. Category Selection: pick the N least-used categories from ESC-50 (balancing)
7. Audio Construction:
   - Load one file per category
   - Calculate the repetitions needed: total_clips = max_clips
   - Distribute repetitions across the N sources (see the sketch after the example below)
   - Ordering mode:
     - random: shuffle clips (A B A C B ...) - harder, tests recognition
     - consecutive: group same-source clips (AAA BBB CCC) - easier
8. Silence Insertion:
   - Minimum 100ms silence between EVERY clip
   - Extra silence (up to 500ms per gap) distributed from the remainder
   - Crossfade: 50ms within same-source repetitions, 500ms at audio-silence boundaries
9. Question Generation: MCQ + open-text asking "How many unique sounds?"
10. Export: save the audio WAV + metadata
Example:
- Target duration: 40s
- Max clips that fit: 7 (7×5s + 6×0.1s = 35.6s)
- Target answer: 3 unique sounds
- Actual: 3 unique sounds (7 total clips: 3+2+2 repetitions)
- Ordering: random shuffle → [A B A C B A C]
- Result: audio with 3 distinct sounds, some repeated, with silences and crossfades
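A sketch of the repetition spread and ordering from step 7 of the process above (the function name and the random spread policy are assumptions):

import random

def distribute_repetitions_sketch(n_unique, total_clips, ordering_mode="random"):
    """Spread total_clips over n_unique sources, then order the resulting sequence."""
    reps = [1] * n_unique                    # every source appears at least once
    for _ in range(total_clips - n_unique):  # e.g., 7 clips over 3 sources -> 3+2+2
        reps[random.randrange(n_unique)] += 1
    sequence = [src for src, r in enumerate(reps) for _ in range(r)]
    if ordering_mode == "random":
        random.shuffle(sequence)             # A B A C B ... (harder)
    # "consecutive" keeps same-source clips grouped: A A A B B C C (easier)
    return reps, sequence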
DURATION TASK - "Which sound is longest/shortest?"
Goal: Create audio where one sound has clearly longest/shortest duration compared to others.
Process:
1. Preprocessing (preprocess_esc50.py - REQUIRED):
   - Load raw ESC-50 clips
   - Detect sound regions using adaptive noise-floor thresholding
   - Trim leading/trailing silence (preserve internal structure)
   - Calculate the effective duration per clip
   - Save trimmed audio + effective_durations.csv
2. Duration Generation: target_duration ~ Uniform(20s, 60s) per sample
3. Calculate Max Clips: based on the average effective duration (~3.86s)
4. Determine N Sources: based on question type and max_clips
   - LONGEST: target needs ≥2 clips, backgrounds get 1 each → n_sources ≤ max_clips - 1
   - SHORTEST: target gets 1 clip, backgrounds need ≥2 each → n_sources ≤ 1 + (max_clips-1)//2
5. Category Selection: pick the target + backgrounds from the least-used categories
6. Slot Distribution: allocate clips to each source
   - LONGEST: give the most clips to the target, 1 to each background
   - SHORTEST: give 1 to the target, multiple to each background
7. Clip Selection: for each source, select clips from the preprocessed dataset
8. Gap Verification:
   - LONGEST: target_duration ≥ max_background × 1.5 ✓
   - SHORTEST: target_duration ≤ min_background × 0.75 ✓
   - If the gap is not satisfied: try redistributing slots, or reject the sample
9. Audio Construction:
   - Load trimmed clips
   - Concatenate with consecutive ordering (preserves periodicity)
   - Insert silences with crossfades
10. Question Generation: "Which sound is longest/shortest?"
11. Export: audio + metadata
Example:
- Question type: LONGEST
- Target duration: 50s, max_clips: 12
- N sources: 4 (target + 3 backgrounds)
- Slot distribution: target=6 clips (6×3.8s=22.8s), backgrounds=2 clips each (2×3.8s=7.6s)
- Gap check: 22.8s ≥ 7.6s × 1.5 = 11.4s ✓
- Result: the target sound is clearly longest
ORDER TASK - "Which sound is first/last/after X?"
Goal: Create ordered sequence of sounds, ask about temporal relationships.
Process:
1. Preprocessing: none (uses raw ESC-50)
2. Duration Generation: pre-generated durations to exactly fill the task duration
3. Calculate Max Clips: get_max_clip_num_to_be_joined(target_duration, 5s, 100ms)
4. N_Clips Selection: randomly sampled from [max(2, max_clips-3), min(max_clips, max_clips_per_sample)] (silence reduction, not balanced)
5. Question Type Selection: from the balanced pool (first, last, second, after, before, second_last)
6. Answer Position Determination: based on question type
   - first → position 0
   - last → position n_clips-1
   - second → position 1
   - second_last → position n_clips-2
   - after/before → random valid position
7. Category Selection: the answer category at the determined position, others from least-used
8. Audio Construction:
   - Load one clip per position
   - Build the sequence with silences (min 100ms + random extra up to 500ms per gap)
   - Crossfade: 500ms at audio-silence boundaries for smooth transitions
9. Question Generation:
   - MCQ: "Which sound is first?" with 4 options
   - Open-text: "What is the first sound?" + the full sequence
10. Export: audio + metadata
Example:
- Target n_clips: 4, max_clips: 8 → use 4 ✓
- Question: "Which sound is second?"
- Answer position: 1 (0-indexed)
- Sequence: [dog, cat, bird, rain] → Answer: cat
- Audio: 4 clips in order with silences and crossfades
VOLUME TASK - "Which sound is loudest/softest?"
Goal: Create audio with clips at different volume levels, ask about loudness comparison.
Process:
1. Preprocessing: none (uses raw ESC-50)
2. Duration Generation: pre-generated durations
3. Calculate Max Clips: get_max_clip_num_to_be_joined(...)
4. N_Clips Selection: from the balanced pool [2,3,...,10], capped at max_clips
5. Question Type Selection: "max_loudness" or "min_loudness" (balanced 50/50)
6. Volume Level Generation: create n_clips volume adjustments (in dB)
   - Ensure the gap constraint (multiplier 4.0 for max, 0.25 for min)
   - Example: [+12dB, 0dB, -6dB] → the max at +12dB has a ≥12dB gap from the second
7. Gap Verification (up to 10 attempts):
   - MAX: max_level - second_max ≥ 20×log10(4.0) ≈ 12dB
   - MIN: second_min - min_level ≥ |20×log10(0.25)| ≈ 12dB
   - If not satisfied: regenerate levels or reject
8. Category Selection: the answer at the determined position, others from least-used
9. Audio Construction:
   - Load clips
   - CRITICAL: normalize all to the baseline (-20 dBFS) → ensures a controlled comparison
   - Apply volume adjustments to the normalized clips
   - Concatenate with silences and crossfades
10. Question Generation: "Which sound has maximum/minimum loudness?"
11. Export: audio + metadata with volume levels
Example:
- Target n_clips: 3, max_clips: 6 → use 3 ✓
- Question: "max_loudness"
- Volume levels: [+12dB, 0dB, -6dB]
- Gap check: 12 - 0 = 12dB ≥ 12dB ✓
- Process: normalize all clips to -20dBFS, then adjust to [-8dBFS, -20dBFS, -26dBFS]
- Result: the first sound is clearly loudest
Key Innovations
- Crossfade Everywhere: Smooth transitions at audio-silence boundaries (500ms), small crossfade within same-source repetitions (50ms)
- Adaptive Preprocessing: Noise-floor thresholding adapts per-clip (duration task)
- Silence Reduction: ORDER/VOLUME tasks sample n_clips from [max_clips-3, max_clips_per_sample] to minimize silence
- Balanced Distribution:
- COUNT: Balances answers (1 to max_clips_per_sample) + question types
- ORDER/VOLUME: Balances question types only (n_clips uses silence reduction)
- Category Balancing: Least-used selection ensures all 50 ESC-50 categories used evenly
- Gap Constraints: Mathematical guarantees for duration/volume comparisons
- Exact Duration Filling: Pre-generate sample durations to exactly fill task duration (no wasted time)
Command-Line Arguments
Main Pipeline (main.py)
python main.py [OPTIONS]
Options:
--config, -c PATH Path to config YAML (default: config.yaml)
--tasks, -t TASKS Specific tasks to run (choices: count, duration, order, volume)
--output, -o PATH Custom output directory (overrides config)
Examples:
# Run all enabled tasks with default config
python main.py
# Run specific tasks only
python main.py --tasks count order
# Use custom config and output
python main.py --config my_config.yaml --output ./my_dataset
Preprocessing Script (preprocess_esc50.py)
python preprocess_esc50.py [OPTIONS]
Options:
--config PATH Path to config YAML (default: config.yaml)
--threshold-strategy STRATEGY "noise_floor" or "peak_relative"
--threshold-db FLOAT Threshold in dB (for peak_relative)
--noise-floor-percentile FLOAT Percentile for noise floor estimation
--noise-floor-delta-db FLOAT Delta above noise floor in dB
--min-sound-ms INT Minimum sound duration in ms
--no-trimmed-audio Skip saving trimmed audio files
--output-dir PATH Custom output directory
Examples:
# Use config defaults
python preprocess_esc50.py --config config.yaml
# Override threshold parameters
python preprocess_esc50.py --config config.yaml \
--threshold-strategy noise_floor \
--noise-floor-percentile 2.0 \
--noise-floor-delta-db 5.0 \
--min-sound-ms 25
# Generate metadata only (no trimmed audio)
python preprocess_esc50.py --config config.yaml --no-trimmed-audio
Summary
The TREA 2.0 pipeline generates balanced, constraint-satisfying audio QA samples through:
- Preprocessing (Duration only): Adaptive noise-floor thresholding + edge trimming
- Exact Duration Filling: Pre-generate sample durations to sum exactly to task duration
- Capacity-Aware Balancing:
- COUNT: High answer targets β high-capacity samples
- ORDER: Advanced question types β high-capacity samples
- Silence Reduction: ORDER/VOLUME randomly sample n_clips from [max_clips-3, max_clips_per_sample]
- Crossfade Transitions: Smooth audio-silence boundaries (500ms) + within-source (50ms)
- Category Balancing: Least-used selection ensures even ESC-50 category distribution
- Gap Constraints: Mathematical guarantees (1.5x for longest, 0.75x for shortest, 4.0x/0.25x for volume)
- Retry Mechanisms: Failed samples rejected, pipeline continues until target count reached
All randomness is seeded (random_seed: 42) for reproducibility.