ddecosmo committed on
Commit
324226e
·
verified ·
1 Parent(s): 121b884

Create README.md

  1. README.md +109 -0
README.md ADDED
@@ -0,0 +1,109 @@
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ pretty_name: Lanternfly Image Classifier Training Dataset
6
+ ---
7
+
8
+
9
+ # Dataset Card for Lanternfly Image Classifier Training Dataset
10
+
11
+ This is the training dataset for 24-679 Project 1: Lanternfly Tracker.
12
+ It is composed of 360 original lanternfly photos, 150 original photos with no lanternflies, and 800 photos
13
+ drawn from the nature, urban, and other insect datasets listed below.
14
+
15
+ These images were augmented 50x, yielding 65.1k augmented images.
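
The size arithmetic above can be sketched in a few lines. This is only a rough check, assuming the per-source counts quoted in this card are exact; the dictionary keys are hypothetical names chosen for illustration. The card reports 65.1k augmented images, so the quoted counts are presumably rounded.

```python
# Per-source counts as quoted in this card (assumed exact for this sketch).
ORIGINAL_COUNTS = {
    "lanternfly_original": 360,      # original lanternfly photos
    "no_lanternfly_original": 150,   # original photos with no lanternflies
    "third_party": 800,              # photos from the external datasets
}
AUGMENTATION_FACTOR = 50  # "augmented 50x"

# Total number of source images before augmentation.
original_total = sum(ORIGINAL_COUNTS.values())

# Each source image yields AUGMENTATION_FACTOR augmented images.
augmented_total = original_total * AUGMENTATION_FACTOR

print(f"original images:  {original_total}")   # 1310, i.e. about 1.3k
print(f"augmented images: {augmented_total}")  # 65500, close to the reported 65.1k
```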
16
+
17
+
18
+ ## Dataset Details
19
+
20
+ ### Dataset Description
21
+
22
+ - **Curated by:** Carnegie Mellon University: 24-679
23
+ - **Shared by:** Devin DeCosmo
24
+ - **Language(s) (NLP):** English
25
+ - **License:** MIT
26
+
27
+ ### Dataset Sources
28
+
29
+ Original Lanternfly Datasets:
30
+ - rlogh/lanternfly-data: Original lanternfly dataset, 229 unmarked photos
31
+ - rlogh/lanternfly_swatter_training: Dataset with geolocation data, 165 photos
32
+
33
+ Original Negative Datasets:
34
+ - rlogh/negativesirl: Negatives dataset with images of outdoor environments and people with no lanternflies, 107 photos
35
+
36
+
37
+ Total: 501 original images
38
+
39
+ Imported Datasets
40
+
41
+
42
+
43
+ ## Uses
44
+
45
+ These images were used to train the EfficientNetB1 model, ddecosmo/lanternfly_classifier, to classify images
46
+ as containing or not containing lanternflies.
47
+
48
+
49
+ ### Direct Use
50
+
51
+ The direct use is identifying photographs that contain lanternflies, so that sightings can be used for tracking purposes.
52
+
53
+ ### Out-of-Scope Use
54
+
55
+ In the future, this model could be adapted to identify the other types of insects within this dataset.
56
+
57
+
58
+ ## Dataset Structure
59
+
60
+ This dataset consists of two splits:
61
+ - An original split with 1.3k photos
62
+ - An artificial split with 65.1k photos
63
+
64
+ The images fall into 3 categories based on the subject pictured:
65
+ 1. Lanternflies, all original photos
66
+ 2. Other Insect, all 3rd party datasets
67
+ 3. No insect, original photos and 3rd party datasets
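
The three categories above, and the binary lanternfly question the classifier answers, can be sketched as follows. The class ids and names here are hypothetical; this card does not state the actual label encoding used in the dataset.

```python
from enum import IntEnum


class InsectClass(IntEnum):
    """Hypothetical class ids for the three categories listed above."""
    LANTERNFLY = 0     # all original photos
    OTHER_INSECT = 1   # all third-party datasets
    NO_INSECT = 2      # original photos and third-party datasets


def contains_lanternfly(label: InsectClass) -> bool:
    """Collapse the three categories into the binary lanternfly / no-lanternfly question."""
    return label == InsectClass.LANTERNFLY


for label in InsectClass:
    print(label.name, contains_lanternfly(label))
```

Keeping the third-party insects in their own class, rather than merging them into the negatives, lets a model separate "some insect" from "lanternfly" before the labels are collapsed.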
68
+
69
+ ## Dataset Creation
70
+
71
+ ### Source Data
72
+
73
+ All original photos were taken by the creators, Devin and Rumi.
74
+
75
+ Additional datasets can be found here:
76
+ - uoft-cs/cifar100
77
+ - AI-Lab-Makerere/beans
78
+ - Francesco/insects-mytwu
79
+
80
+ #### Data Collection and Processing
81
+
82
+ Original datasets were collected using the mobile phones of the authors.
83
+
84
+ Additional datasets were recommended by Gemini AI and then validated as fitting the purpose, type, and scope of this project.
85
+ - uoft-cs/cifar100: A general-purpose image classification dataset, used for the no-insect class to improve generalizability. CIFAR-100 does include several insect classes (bee, beetle, butterfly, caterpillar, cockroach), so those images would need to be excluded for this label to hold.
86
+ - AI-Lab-Makerere/beans: A dataset focused on vegetation with and without disease, used to train the model to recognize
87
+   vegetation without insects or lanternflies.
88
+ - Francesco/insects-mytwu: An object detection dataset for identifying insects as subjects, not including lanternflies.
89
+   We use it to train a separate non-lanternfly insect class.
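
The role each external dataset plays can be written down as a simple lookup. This is a minimal sketch: the class name strings and the `class_for_source` helper are hypothetical, and the mapping only restates the descriptions above rather than the authors' actual pipeline.

```python
# Which class each third-party dataset contributes to, per the descriptions above.
# The class name strings are hypothetical.
SOURCE_TO_CLASS = {
    "uoft-cs/cifar100": "no_insect",            # general images, for generalizability
    "AI-Lab-Makerere/beans": "no_insect",       # vegetation without insects
    "Francesco/insects-mytwu": "other_insect",  # insects other than lanternflies
}


def class_for_source(repo_id: str) -> str:
    """Return the class assigned to images from the given source dataset."""
    try:
        return SOURCE_TO_CLASS[repo_id]
    except KeyError:
        raise ValueError(f"no class assigned for source {repo_id!r}") from None


print(class_for_source("AI-Lab-Makerere/beans"))
```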
90
+
91
+ #### Who are the source data producers?
92
+
93
+ Original data was produced by the authors.
94
+
95
+ Additional datasets were produced by:
96
+ - uoft-cs/cifar100: Created by University of Toronto Computer Science
97
+ - AI-Lab-Makerere/beans: Created by AI Lab Makerere
98
+ - Francesco/insects-mytwu: Created by Francesco Sovrano
99
+
100
+ ## Bias, Risks, and Limitations
101
+
102
+ The main risk of this dataset is the lanternfly split. It contains only images of single lanternflies on the ground,
103
+ normally on concrete or asphalt. This severely limits the range of environments in which these insects appear.
104
+ Incorporating blob detection or YOLO into future models could mitigate this by focusing on the subject.
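
One way to "focus on the subject" is to crop each image around a detected bounding box before classification, so the background matters less. The sketch below assumes a detector (blob detection or YOLO, not shown) has already produced a pixel box; `padded_crop_box` and its `pad_frac` parameter are hypothetical names for illustration, not part of any existing pipeline.

```python
def padded_crop_box(bbox, image_size, pad_frac=0.2):
    """Expand a (left, top, right, bottom) box by pad_frac per side and clamp it to the image.

    bbox       -- pixel box from a hypothetical detector
    image_size -- (width, height) of the image in pixels
    pad_frac   -- fraction of the box size added as context on each side
    """
    left, top, right, bottom = bbox
    width, height = image_size
    pad_x = round((right - left) * pad_frac)
    pad_y = round((bottom - top) * pad_frac)
    return (
        max(0, left - pad_x),
        max(0, top - pad_y),
        min(width, right + pad_x),
        min(height, bottom + pad_y),
    )


# A 100x100 detection inside a 640x480 image, padded by 20 pixels per side.
print(padded_crop_box((100, 100, 200, 200), (640, 480)))
```

The resulting box can be passed to an image library's crop routine; the padding keeps a little context around the insect while discarding most of the concrete or asphalt background.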
105
+
106
+ ### Recommendations
107
+
108
+ This is a large dataset, and models trained on it have been shown to classify lanternflies accurately, but there are many edge cases in which they do not work correctly.
109
+ To account for this, newer models with subject detection could make use of the many images while improving accuracy.