Cosmos3-Super / BIAS.md

Bias

| Field | Response |
| --- | --- |
| Participation considerations from adversely impacted groups (protected classes) in model design and testing | None. |
| Measures taken to mitigate against unwanted bias | Training, evaluation, and testing data are curated before release to filter restricted content, including content relating to protected classes. Model behavior is evaluated across Physical AI domains (robotics, autonomous vehicles, human-centric scenes, common scenes, industry, miscellaneous, and physics-oriented benchmarks), with attention to coverage across diverse demographic and contextual characteristics that affect protected-class outcomes. |
| Which characteristic (feature) show(s) the greatest difference in performance? | The greatest performance differences are observed in tasks requiring long-horizon temporal consistency, fine-grained physical interactions, and embodiment-specific action generation. Performance is generally stronger on common visual reasoning and world-generation tasks than on complex multi-agent, robotics-control, or tightly synchronized multimodal generation scenarios. |
| Which feature(s) have the worst performance overall? | Performance is generally weakest in tasks requiring long-horizon temporal consistency, precise physical interactions, embodiment-specific action control, and strict audio-visual synchronization. |
| If using internal data, description of methods implemented in data acquisition or processing, if any, to address the prevalence of identifiable biases in the training, testing, and validation data | Bias-specific methods applied during data processing include person-presence screening, demographic-taxonomy classification (age, gender, ethnicity), embedding-based diversity analysis, and dataset balancing across sources. Internal analysis surfaced three patterns: non-person scenes are more prevalent than person-centric content; demographic-taxonomy outputs on person-present samples are most frequently "uncertain" across the age, gender, and ethnicity dimensions; and results vary by source type, with people-centric image and video datasets showing higher demographic signal than document-, object-, robotics-, or scene-focused datasets. (Quantitative details appear in the last row of this table.) Downstream deployments should add bias audits, fairness evaluation, red-teaming, demographically balanced fine-tuning, or counterfactual augmentation as mitigations. |
| Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models | Dataset analytics pipelines, metadata distribution analysis, heuristic quality checks, embedding-based clustering, model-assisted filtering systems, and benchmark evaluation suites are used to assess statistical imbalances and identify patterns that may introduce bias into model behavior. |
| Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models | The training datasets, such as OpenImages-derived detection-to-NLP datasets, visual grounding and VQA datasets, document/image understanding datasets, video/action understanding datasets, and NVIDIA-created or curated visual datasets, do not collectively or exhaustively represent all demographic groups, nor do they represent them proportionally. For instance, automated person-presence screening did not identify a person in approximately 58% of visual samples analyzed across approximately 400 datasets, while person-present signals were identified in approximately 42% of analyzed samples. In the subset where person-present signals were identified, representation is uneven across the measured visual taxonomies: age outputs were most frequently uncertain, followed by child and adult; gender outputs were most frequently uncertain, followed by male and female; and ethnicity outputs were most frequently uncertain, followed by Hispanic and White as the most frequent identified categories. Dataset-level results vary by source type, with people-centric image and video datasets containing higher person-present and demographic-taxonomy signals than document-, object-, robotics-, or scene-focused datasets. To mitigate these imbalances, we recommend evaluation techniques such as bias audits, task-specific fairness evaluation, and red-teaming, along with fine-tuning on demographically balanced datasets and counterfactual data augmentation to align with the desired model behavior. This evaluation used a baseline of 200 samples across all datasets, with larger subsets of up to 3,000 samples used for certain in-depth analyses; these sample sizes were identified as optimal thresholds for maximizing embedder accuracy. |
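
The person-presence and demographic-taxonomy summaries described above could be computed along the following lines. This is only a minimal sketch under stated assumptions: the record layout (`person_present`, `age`, `gender`, `ethnicity`), the function names, and the toy data are hypothetical and are not taken from the pipeline used for this model.

```python
from collections import Counter

# Hypothetical per-sample screening records (toy data, not real results).
SAMPLES = [
    {"person_present": False},
    {"person_present": False},
    {"person_present": False},
    {"person_present": True, "age": "uncertain", "gender": "uncertain", "ethnicity": "uncertain"},
    {"person_present": True, "age": "adult", "gender": "male", "ethnicity": "uncertain"},
]

# Taxonomy dimensions named in the text above.
DIMENSIONS = ("age", "gender", "ethnicity")


def person_present_share(samples):
    """Fraction of samples in which screening flagged a person."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s["person_present"]) / len(samples)


def taxonomy_distribution(samples, dimensions=DIMENSIONS):
    """Per-dimension label shares, computed over person-present samples only.

    A missing label is counted as "uncertain" (an assumption of this sketch).
    """
    present = [s for s in samples if s["person_present"]]
    result = {}
    for dim in dimensions:
        counts = Counter(s.get(dim, "uncertain") for s in present)
        total = sum(counts.values())
        result[dim] = (
            {label: n / total for label, n in counts.most_common()} if total else {}
        )
    return result


if __name__ == "__main__":
    print(f"person-present share: {person_present_share(SAMPLES):.0%}")
    for dim, shares in taxonomy_distribution(SAMPLES).items():
        print(dim, shares)
```

Reporting the "uncertain" share alongside the identified categories keeps the summary honest about how much of the person-present subset carries no usable demographic signal, which is the dominant outcome reported above.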