clarify dataset sources and add note about citation
#2 opened by egrace479
README.md CHANGED
@@ -129,7 +129,7 @@ results[0].plot()

 ### Training Data

-

 #### Dataset splitting strategy
 We applied a stratified 60/40 train-test split across species and locations to evaluate model generalizability. Data was collected from three distinct environments: Mpala Research Centre (location_1), Ol Pejeta Conservancy (location_2), and The Wilds Conservation Center (location_3). The dataset includes four target classes: Zebra, Giraffe, Onager, and African Wild Dog.
@@ -137,13 +137,13 @@ We applied a stratified 60/40 train-test split across species and locations to e
 To prevent overlap in individual animals or environmental conditions between training and testing, we split video sessions at the file level, ensuring that no frames from a given session appear in both train and test sets. This also allows consistent per-frame sampling at a fixed interval (every 10th frame).

 Training set includes:
-- Mpala (location_1): Multiple full sessions for Giraffes, Plains Zebras, and Grevy’s Zebras, including mixed-species scenes.
-- Ol Pejeta (location_2): Full sessions of Plains Zebras.
-- The Wilds (location_3): 70% of sessions for Painted Dogs, Giraffes, and Persian Onagers.

 Test set includes:
-- The Wilds (location_3): The remaining 30% of sessions, including additional Grevy’s Zebra sessions used exclusively for testing.
-- Mpala (location_1) and Ol Pejeta (location_2): Separate zebra and mixed-species sessions not used during training.

 This careful division by session and location ensures that the model is evaluated on unseen environments, individuals, and contexts, making it a robust benchmark for testing generalization across ecological and geographic domains.
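
The session-level split and fixed-interval frame sampling described above can be sketched in Python. This is a minimal illustration, not the contents of `prepare_yolo_dataset.py`: the session records, their field names (`id`, `location`, `species`), and the `split_sessions` and `sample_frame_indices` helpers are all hypothetical, and the actual split assigns specific sessions per location rather than shuffling them.

```python
import random
from collections import defaultdict


def split_sessions(sessions, train_fraction=0.6, seed=0):
    """Assign whole sessions to train or test, stratified by (location, species).

    `sessions` is a list of dicts with hypothetical keys: id, location, species.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for session in sessions:
        groups[(session["location"], session["species"])].append(session["id"])

    train, test = set(), set()
    for key in sorted(groups):
        ids = sorted(groups[key])
        rng.shuffle(ids)
        n_train = round(len(ids) * train_fraction)
        # A session goes entirely to one side, so its frames never straddle the split.
        train.update(ids[:n_train])
        test.update(ids[n_train:])
    return train, test


def sample_frame_indices(n_frames, step=10):
    """Return the indices of every `step`-th frame, starting at frame 0."""
    return list(range(0, n_frames, step))


# A 25-frame session sampled at every 10th frame keeps frames 0, 10, and 20.
print(sample_frame_indices(25))  # [0, 10, 20]
```

Because assignment happens per session rather than per frame, the train and test sets are disjoint at the session level by construction.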
@@ -189,7 +189,7 @@ results = model.train(

 #### Testing Data

-The model was evaluated on a held-out test set located at `images/test` containing:
 - 7658 test images with instances of Zebra, Giraffe, Onager, and Dog
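
A quick sanity check on a prepared test directory is to count its image files. The sketch below assumes the images sit under `images/test` with `.jpg`, `.jpeg`, or `.png` extensions; the extension set and the `count_images` helper are assumptions for illustration, not part of the data prep script.

```python
from pathlib import Path

# Assumed image extensions; adjust to match the prepared dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images(directory):
    """Count image files under `directory`, including subdirectories."""
    root = Path(directory)
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

Under these assumptions, `count_images("images/test")` would be expected to match the 7658 test images reported above.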
@@ -240,6 +240,8 @@ The model was evaluated using standard object detection metrics:

 ## Citation

 **BibTeX:**

```
@@ -129,7 +129,7 @@ results[0].plot()

 ### Training Data

+The three datasets are available in the [MMLA Data Collection](https://huggingface.co/collections/imageomics/mmla). See `prepare_yolo_dataset.py` for details on the train/test splits; the script generates the splits and runs on standard Python 3.10+ packages.

 #### Dataset splitting strategy
 We applied a stratified 60/40 train-test split across species and locations to evaluate model generalizability. Data was collected from three distinct environments: Mpala Research Centre (location_1), Ol Pejeta Conservancy (location_2), and The Wilds Conservation Center (location_3). The dataset includes four target classes: Zebra, Giraffe, Onager, and African Wild Dog.
@@ -137,13 +137,13 @@
 To prevent overlap in individual animals or environmental conditions between training and testing, we split video sessions at the file level, ensuring that no frames from a given session appear in both train and test sets. This also allows consistent per-frame sampling at a fixed interval (every 10th frame).

 Training set includes:
+- [Mpala](https://huggingface.co/datasets/imageomics/mmla_mpala) (location_1): Multiple full sessions for Giraffes, Plains Zebras, and Grevy’s Zebras, including mixed-species scenes.
+- [Ol Pejeta](https://huggingface.co/datasets/imageomics/mmla_opc) (location_2): Full sessions of Plains Zebras.
+- [The Wilds](https://huggingface.co/datasets/imageomics/mmla_wilds) (location_3): 70% of sessions for Painted Dogs, Giraffes, and Persian Onagers.

 Test set includes:
+- [The Wilds](https://huggingface.co/datasets/imageomics/mmla_wilds) (location_3): The remaining 30% of sessions, including additional Grevy’s Zebra sessions used exclusively for testing.
+- [Mpala](https://huggingface.co/datasets/imageomics/mmla_mpala) (location_1) and [Ol Pejeta](https://huggingface.co/datasets/imageomics/mmla_opc) (location_2): Separate zebra and mixed-species sessions not used during training.

 This careful division by session and location ensures that the model is evaluated on unseen environments, individuals, and contexts, making it a robust benchmark for testing generalization across ecological and geographic domains.
@@ -189,7 +189,7 @@ results = model.train(

 #### Testing Data

+The model was evaluated on a held-out test set located at `images/test` (created by running the [data prep script](https://huggingface.co/imageomics/mmla/blob/main/prepare_yolo_dataset.py)) containing:
 - 7658 test images with instances of Zebra, Giraffe, Onager, and Dog
@@ -240,6 +240,8 @@

 ## Citation

+If you use this model in your work, please cite both the model and our associated paper as described below.
+
 **BibTeX:**

```