File size: 4,720 Bytes
## Bias

| Field | Response |
| :---- | :---- |
| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None. |
| Measures taken to mitigate against unwanted bias | Training, evaluation, and testing data are curated before release to filter restricted content, including content relating to protected classes. Model behavior is evaluated across Physical AI domains — robotics, autonomous vehicles, human-centric scenes, common scenes, industry, miscellaneous, and physics-oriented benchmarks — with attention to coverage across diverse demographic and contextual characteristics that affect protected-class outcomes. |
| Which characteristic (feature) show(s) the greatest difference in performance?: | Greatest performance differences are observed in tasks requiring long-horizon temporal consistency, fine-grained physical interactions, and embodiment-specific action generation. Performance is generally stronger on common visual reasoning and world-generation tasks than on complex multi-agent, robotics-control, or tightly synchronized multimodal generation scenarios. |
| Which feature(s) have the worst performance overall? | Performance is generally weakest in tasks requiring long-horizon temporal consistency, precise physical interactions, embodiment-specific action control, and strict audio-visual synchronization. |
| If using internal data, description of methods implemented in data acquisition or processing, if any, to address the prevalence of identifiable biases in the training, testing, and validation data: | Bias-specific methods applied during data processing include person-presence screening, demographic-taxonomy classification (age, gender, ethnicity), embedding-based diversity analysis, and dataset balancing across sources. Internal analysis surfaced: non-person scenes are more prevalent than person-centric content; demographic-taxonomy outputs on person-present samples are most frequently "uncertain" across age, gender, and ethnicity dimensions; and source-type variation, with people-centric image and video datasets showing higher demographic signal than document-, object-, robotics-, or scene-focused datasets. *(Quantitative details in the row below.)* Downstream deployments should add bias audits, fairness evaluation, red-teaming, demographically balanced fine-tuning, or counterfactual augmentation as mitigations. |
| Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | Dataset analytics pipelines, metadata distribution analysis, heuristic quality checks, embedding-based clustering, model-assisted filtering systems, and benchmark evaluation suites are used to assess statistical imbalances and identify patterns that may introduce bias into model behavior. |
| Tools used to assess statistical imbalances and highlight patterns that may introduce bias into AI models: | These datasets, such as OpenImages-derived detection-to-NLP datasets, visual grounding and VQA datasets, document/image understanding datasets, video/action understanding datasets, and NVIDIA-created or curated visual datasets, do not collectively or exhaustively represent all demographic groups (and proportionally therein). For instance, automated person-presence screening did not identify a person in approximately 58% of visual samples analyzed across approximately 400 datasets, while person-present signals were identified in approximately 42% of analyzed samples. In the subset where person-present signals were identified, these datasets contain uneven representation splits across the measured visual taxonomies: age outputs were most frequently uncertain, followed by child and adult; gender outputs were most frequently uncertain, followed by male and female; and ethnicity outputs were most frequently uncertain, followed by Hispanic and White as the most frequent identified categories. Dataset-level results vary by source type, with people-centric image and video datasets containing higher person-present and demographic-taxonomy signals than document-, object-, robotics-, or scene-focused datasets. To mitigate these imbalances, we recommend considering evaluation techniques such as bias audits, task-specific fairness evaluation, and red-teaming, along with fine-tuning with demographically balanced datasets and counterfactual data augmentation to align with the desired model behavior. This evaluation used a baseline of 200 samples across all datasets, with larger subsets of up to 3,000 samples utilized for certain in-depth analyses, identified as optimal thresholds for maximizing embedder accuracy. |