---
title: Remote audit
emoji: 📚
colorFrom: blue
colorTo: yellow
sdk: docker
pinned: false
license: openrail
short_description: Remote auditor
---


# Remote Audit App - Setup Instructions

This Hugging Face Space runs design-based tests of randomization integrity using pre-treatment satellite imagery. It implements the conditional randomization test from the paper listed under Citation.

## Quick Start

### Option 1: Use Pre-computed Satellite Data (Recommended for HF)

The app expects satellite features (NDVI, EVI, VIIRS) to be pre-computed. To replicate the Begum et al. 2022 audit:

1. Download the pre-processed dataset with satellite features
2. Place `Islam2019_WithGeocodesAndSatData.Rdata` in the app directory
3. Select "Use Example (Islam 2019)" in the app

### Option 2: Upload Your Own CSV

Your CSV should include:
- Treatment assignment column (e.g., `begum_treat` with values 1=control, 2=treatment)
- Satellite features: `ndvi_median`, `viirs_median` (or similar)
- Any other columns for reference

Example CSV structure:
```
id,begum_treat,ndvi_median,viirs_median
1,1,0.45,2.3
2,2,0.52,3.1
...
```
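
Before uploading, it can help to confirm that the file has the expected columns. The following is a minimal sketch, not part of the app: the helper name `check_audit_csv` is hypothetical, and the default column names are simply taken from the example above.

```python
import csv
import io

# Assumed defaults taken from the example CSV; adjust to your own column names.
REQUIRED_COLUMNS = ("begum_treat", "ndvi_median", "viirs_median")


def check_audit_csv(csv_text, required=REQUIRED_COLUMNS):
    """Return (missing_columns, treatment_values) for a CSV given as text.

    The first entry of `required` is treated as the treatment column.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in required if col not in header]
    treat_col = required[0]
    values = set()
    if treat_col in header:
        # Collect the distinct non-empty codes used in the treatment column.
        values = {row[treat_col] for row in reader if row.get(treat_col)}
    return missing, values


example = "id,begum_treat,ndvi_median,viirs_median\n1,1,0.45,2.3\n2,2,0.52,3.1\n"
missing, values = check_audit_csv(example)
print("Missing columns:", missing)
print("Treatment values:", sorted(values))
```

An empty list of missing columns and exactly two treatment values suggest the file matches the layout shown above.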

## Setting Up GEE for New Data

The app uses **pre-computed** satellite features. To add GEE capabilities for computing features on-the-fly:

### Prerequisites
1. Google Earth Engine account (free): https://earthengine.google.com/signup/
2. Python environment with `earthengine-api`

### Installation Steps

```bash
# Install Earth Engine Python API
pip install earthengine-api

# Authenticate (first time only)
earthengine authenticate

# Verify the setup; in your own Python script, call `import ee` and `ee.Initialize()`
python -c "import ee; ee.Initialize()"
```

### Computing Satellite Features

Use the GEE code from `RemoteAuditOfBrokenRCT.R` to compute features:

```python
import ee  # call ee.Initialize() once before using the function below

def compute_satellite_features(lat, lon, start_date, end_date):
    """
    Compute NDVI, EVI, and VIIRS features for a location
    
    Args:
        lat, lon: Coordinates
        start_date, end_date: Date range (YYYY-MM-DD)
    
    Returns:
        dict with ndvi_median, viirs_median, etc.
    """
    point = ee.Geometry.Point([lon, lat])
    
    # MODIS vegetation indices
    modis = (ee.ImageCollection('MODIS/061/MOD13Q1')
             .filterDate(start_date, end_date)
             .select(['NDVI', 'EVI'])
             .map(lambda img: img.multiply(0.0001)))
    
    ndvi_median = modis.select('NDVI').median()
    evi_median = modis.select('EVI').median()
    
    # VIIRS nighttime lights
    viirs = (ee.ImageCollection('NOAA/VIIRS/DNB/MONTHLY_V1/VCMSLCFG')
             .filterDate(start_date, end_date)
             .select(['avg_rad']))
    
    viirs_median = viirs.median()
    
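    # Sample at location (scale in meters; MOD13Q1 has 250 m resolution)
    sample = (ndvi_median.addBands([evi_median, viirs_median])
              .sample(point, 250)
              .first()
              .getInfo())

    # getInfo() returns None when no valid pixel was sampled (e.g., masked data)
    properties = sample['properties'] if sample else {}

    return {
        'ndvi_median': properties.get('NDVI'),
        'evi_median': properties.get('EVI'),
        'viirs_median': properties.get('avg_rad')
    }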
```
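
To build an upload-ready CSV for many units, a per-location function such as `compute_satellite_features` can be applied row by row. This is a sketch under stated assumptions: the helper `build_feature_csv` and the input keys `id`, `treat`, `lat`, and `lon` are hypothetical, and the feature function is passed in as an argument so the loop can be tried out without Earth Engine access.

```python
import csv


def build_feature_csv(units, feature_fn, start_date, end_date, out_path):
    """Write one CSV row per unit with its id, assignment, and features.

    units: iterable of dicts with hypothetical keys 'id', 'treat', 'lat', 'lon'.
    feature_fn: callable(lat, lon, start_date, end_date) -> dict of features.
    Returns the number of rows written.
    """
    rows = []
    for unit in units:
        features = feature_fn(unit["lat"], unit["lon"], start_date, end_date)
        rows.append({"id": unit["id"], "begum_treat": unit["treat"], **features})

    # Column order follows the first row; every row is assumed to share its keys.
    fieldnames = list(rows[0].keys()) if rows else ["id", "begum_treat"]
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # None values are written as empty cells
    return len(rows)
```

With Earth Engine initialized, a call would look like `build_feature_csv(units, compute_satellite_features, start_date, end_date, "features.csv")`. Each unit triggers its own `getInfo()` request, so large samples may be slow.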

### Integration with R (via reticulate)

To integrate GEE in your R workflow:

```r
library(reticulate)

# Set Python environment
Sys.setenv(RETICULATE_PYTHON = "/path/to/python")

# Import Earth Engine
ee <- import("ee")
ee$Initialize()

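# Define a Python function and call it from R.
# py_run_string() returns the Python main module, so functions defined in the
# string are available as its members.
py_env <- py_run_string("
def get_features(lat, lon):
    # Your GEE code here
    return {'ndvi_median': ..., 'viirs_median': ...}
")

# lat and lon are numeric coordinates defined elsewhere in your R session
features <- py_env$get_features(lat, lon)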
```

## App Configuration on Hugging Face

### Required Files
- `app.R` - Main Shiny application
- `Dockerfile` - Container configuration
- `Islam2019_WithGeocodesAndSatData.Rdata` - Example dataset (optional)

### Environment Variables
None required for basic functionality.

### Secrets (if adding GEE)
If you want to enable on-the-fly GEE queries:
1. Add `GOOGLE_APPLICATION_CREDENTIALS` secret in HF Space settings
2. Upload service account JSON
3. Modify app to call GEE API
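
On the Python side, step 3 might look like the sketch below. It assumes that `GOOGLE_APPLICATION_CREDENTIALS` holds the path to the uploaded service account JSON and that the service account has Earth Engine access. The helper names are hypothetical; `ee.ServiceAccountCredentials` and `ee.Initialize` are the calls provided by `earthengine-api`.

```python
import json
import os


def read_service_account_email(key_path):
    """Return the 'client_email' field of a service account JSON key file."""
    with open(key_path) as handle:
        return json.load(handle)["client_email"]


def init_earth_engine_from_env(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Initialize Earth Engine with the key file whose path is stored in env_var."""
    import ee  # imported lazily so the helper above works without earthengine-api

    key_path = os.environ[env_var]  # assumed to be a file path, not the JSON itself
    email = read_service_account_email(key_path)
    credentials = ee.ServiceAccountCredentials(email, key_path)
    ee.Initialize(credentials)
```

From the Shiny app, `init_earth_engine_from_env()` could be called once at startup through reticulate before any feature queries are made.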

## Usage Guide

### Running a Randomization Audit

1. **Load Data**: Upload CSV or use example
2. **Configure Audit**:
   - Audit Type: "Randomization"
   - Treatment Column: Select column (e.g., `begum_treat`)
   - Control Value: 1
   - Treatment Value: 2
3. **Select Features**: Check `ndvi_median` and `viirs_median`
4. **Choose Learner**: Logistic (fast) or XGBoost (more flexible)
5. **Set Parameters**:
   - K-Folds: 5-10 (more folds train each model on more data but take longer)
   - Resamples: 1000-2000 (more resamples reduce Monte Carlo error in the p-value)
6. **Run Audit**: Click "Run Audit" button

### Running a Missingness Audit

Same steps but:
- Audit Type: "Missingness"
- Select variable to check for missing data patterns

### Interpreting Results

- **p < 0.05**: Assignment is MORE predictable from satellite features than expected → potential deviation from stated randomization
- **p ≥ 0.05**: No evidence of deviation detected (but doesn't prove perfect randomization)
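
The logic behind the p-value can be illustrated with a small resampling sketch. This is not the app's implementation: it swaps in a nearest-class-mean classifier for the Logistic or XGBoost learner, uses cross-validated accuracy as the test statistic, and assumes complete randomization with fixed group sizes, so that re-drawing assignments amounts to shuffling labels. All function names are hypothetical.

```python
import random


def cv_accuracy(features, labels, k_folds):
    """K-fold accuracy of a nearest-class-mean classifier (a stand-in learner)."""
    n = len(labels)
    dims = len(features[0])
    correct = 0
    for fold in range(k_folds):
        test_idx = range(fold, n, k_folds)  # interleaved folds
        held_out = set(test_idx)
        means = {}
        for label in set(labels):
            rows = [features[i] for i in range(n)
                    if i not in held_out and labels[i] == label]
            if rows:
                means[label] = [sum(r[d] for r in rows) / len(rows)
                                for d in range(dims)]
        for i in test_idx:
            # Predict the class whose training mean is closest to this unit.
            guess = min(means, key=lambda lab: sum(
                (features[i][d] - means[lab][d]) ** 2 for d in range(dims)))
            correct += guess == labels[i]
    return correct / n


def randomization_p_value(features, labels, k_folds=5, resamples=1000, seed=0):
    """Share of re-drawn assignments at least as predictable as the observed one."""
    rng = random.Random(seed)
    observed = cv_accuracy(features, labels, k_folds)
    exceed = 0
    for _ in range(resamples):
        redrawn = labels[:]
        rng.shuffle(redrawn)  # re-draw under the assumed design
        if cv_accuracy(features, redrawn, k_folds) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + resamples)


# Toy data in which one feature almost perfectly reveals assignment.
rng = random.Random(1)
labels = [1] * 20 + [2] * 20
leaky = [[(lab - 1) + rng.gauss(0, 0.1)] for lab in labels]
print(randomization_p_value(leaky, labels, resamples=199))
```

A small p-value here means that few re-drawn assignments were as predictable from the features as the observed one, which is the pattern the first bullet above describes.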

## Technical Notes

### Computation Time
- Logistic: ~1-3 minutes for 500 units, 1000 resamples
- XGBoost: ~3-10 minutes (depends on tree settings)

### Memory Requirements
- Small datasets (<1000 units): 2GB RAM sufficient
- Large datasets (>5000 units): Consider 4GB+ RAM

### Handling Missing Satellite Data

If your CSV has missing satellite features, the app drops the affected rows. To keep those units:
- Impute the missing values before upload, or
- Use GEE to compute features for the missing locations

## Troubleshooting

### "Feature not found" error
- Check that your CSV has columns named exactly: `ndvi_median`, `viirs_median`
- Column names are case-sensitive

### "Too few complete cases" error
- Ensure at least 10 units have both valid treatment assignment and satellite features
- Check for NA values in your data
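
A quick pre-upload count of complete cases can show whether this error is likely. The sketch below is an assumption about what "complete" means (a non-empty treatment value and finite numeric features); the app's own rule may differ, and the helper names are hypothetical.

```python
import csv
import io
import math

MIN_COMPLETE_CASES = 10  # threshold quoted in the note above


def is_number(text):
    """True if text parses as a finite number ('NA', '', and 'nan' do not)."""
    try:
        return math.isfinite(float(text))
    except (TypeError, ValueError):
        return False


def count_complete_cases(csv_text, treat_col, feature_cols):
    """Count rows with a non-missing treatment value and numeric features."""
    reader = csv.DictReader(io.StringIO(csv_text))
    complete = 0
    for row in reader:
        has_treat = row.get(treat_col) not in (None, "", "NA")
        if has_treat and all(is_number(row.get(col)) for col in feature_cols):
            complete += 1
    return complete


sample_csv = (
    "id,begum_treat,ndvi_median,viirs_median\n"
    "1,1,0.45,2.3\n"
    "2,2,NA,3.1\n"
    "3,2,0.52,3.1\n"
)
n_complete = count_complete_cases(
    sample_csv, "begum_treat", ["ndvi_median", "viirs_median"])
print(n_complete, "complete cases; enough:", n_complete >= MIN_COMPLETE_CASES)
```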

### GEE authentication issues
```bash
# Re-authenticate
earthengine authenticate

# Check credentials
python -c "import ee; ee.Initialize(); print('Success!')"
```

### Dockerfile build fails
```bash
# Test locally
docker build -t remote-audit .
docker run -p 7860:7860 remote-audit
```

## Citation

If you use this app, please cite:

```
Jerzak, C. T., & Daoud, A. (2025). Remote Auditing: Design-Based Tests 
of Randomization, Selection, and Missingness with Broadly Accessible 
Satellite Imagery.
```

## Support

For issues or questions:
- Check the paper's technical appendix
- Review example code in `RemoteAuditOfBrokenRCT.R`
- Contact: [your contact info]