---
title: Remote audit
emoji: 📚
colorFrom: blue
colorTo: yellow
sdk: docker
pinned: false
license: openrail
short_description: Remote auditor
---

# Remote Audit App - Setup Instructions

This Hugging Face Space performs design-based tests of randomization integrity using pre-treatment satellite imagery. It implements the conditional randomization test described in the paper listed under Citation.

## Quick Start

### Option 1: Use Pre-computed Satellite Data (Recommended for HF)

The app expects satellite features (NDVI, EVI, VIIRS) to be pre-computed. To replicate the Begum et al. 2022 audit:

1. Download the pre-processed dataset with satellite features
2. Place `Islam2019_WithGeocodesAndSatData.Rdata` in the app directory
3. Select "Use Example (Islam 2019)" in the app

### Option 2: Upload Your Own CSV

Your CSV should include:
- Treatment assignment column (e.g., `begum_treat` with values 1=control, 2=treatment)
- Satellite features: `ndvi_median`, `viirs_median` (or similar)
- Any other columns for reference

Example CSV structure:
```
id,begum_treat,ndvi_median,viirs_median
1,1,0.45,2.3
2,2,0.52,3.1
...
```
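
Before uploading, it can help to confirm that a file has the expected shape. The sketch below uses only the Python standard library. The function name `check_audit_csv` and its default column names are illustrative assumptions based on the example above; they are not part of the app.

```python
import csv
import io

def check_audit_csv(handle, treatment_col="begum_treat",
                    feature_cols=("ndvi_median", "viirs_median")):
    """Return a list of problems found in an audit CSV (empty list if none)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    problems = [f"missing column: {col}"
                for col in (treatment_col, *feature_cols) if col not in header]
    if problems:
        return problems
    # The treatment column should contain exactly two distinct non-empty values.
    values = {row[treatment_col] for row in reader if row[treatment_col]}
    if len(values) != 2:
        problems.append(f"expected 2 treatment values, found {sorted(values)}")
    return problems

example = io.StringIO(
    "id,begum_treat,ndvi_median,viirs_median\n"
    "1,1,0.45,2.3\n"
    "2,2,0.52,3.1\n"
)
print(check_audit_csv(example))
```

An empty list means the required columns are present and the treatment column takes exactly two values.
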

## Setting Up GEE for New Data

The app uses **pre-computed** satellite features. To add Google Earth Engine (GEE) capabilities for computing features on the fly:

### Prerequisites
1. Google Earth Engine account (free): https://earthengine.google.com/signup/
2. Python environment with `earthengine-api`

### Installation Steps

```bash
# Install the Earth Engine Python API
pip install earthengine-api

# Authenticate (first time only)
earthengine authenticate

# Check that initialization works; in your own script,
# run `import ee` followed by `ee.Initialize()`
python -c "import ee; ee.Initialize()"
```

### Computing Satellite Features

Use the GEE code from `RemoteAuditOfBrokenRCT.R` to compute features:

```python
import ee


def compute_satellite_features(lat, lon, start_date, end_date):
    """
    Compute NDVI, EVI, and VIIRS features for a location

    Args:
        lat, lon: Coordinates
        start_date, end_date: Date range (YYYY-MM-DD)

    Returns:
        dict with ndvi_median, evi_median, viirs_median
    """
    point = ee.Geometry.Point([lon, lat])

    # MODIS vegetation indices (scale factor 0.0001)
    modis = (ee.ImageCollection('MODIS/061/MOD13Q1')
             .filterDate(start_date, end_date)
             .select(['NDVI', 'EVI'])
             .map(lambda img: img.multiply(0.0001)))

    ndvi_median = modis.select('NDVI').median()
    evi_median = modis.select('EVI').median()

    # VIIRS nighttime lights
    viirs = (ee.ImageCollection('NOAA/VIIRS/DNB/MONTHLY_V1/VCMSLCFG')
             .filterDate(start_date, end_date)
             .select(['avg_rad']))

    viirs_median = viirs.median()

    # Sample at location (250 m scale)
    sample = (ndvi_median.addBands(evi_median).addBands(viirs_median)
              .sample(point, 250)
              .first()
              .getInfo())

    # No sample is returned when the pixel is masked at this location
    props = (sample or {}).get('properties', {})

    return {
        'ndvi_median': props.get('NDVI'),
        'evi_median': props.get('EVI'),
        'viirs_median': props.get('avg_rad')
    }
```
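
To build the columns the app expects, the per-location function would typically be applied to every unit. A minimal sketch follows, assuming each unit is a dict with `lat` and `lon` keys (hypothetical names) and that the per-location function is passed in as `feature_fn`, so the loop can be tried with a stub before making live Earth Engine calls.

```python
def add_satellite_features(units, feature_fn, start_date, end_date):
    """Return copies of `units` with the features from `feature_fn` merged in."""
    enriched = []
    for unit in units:
        features = feature_fn(unit["lat"], unit["lon"], start_date, end_date)
        enriched.append({**unit, **features})
    return enriched

def stub_features(lat, lon, start_date, end_date):
    # Stand-in for compute_satellite_features; returns fixed values, no GEE call.
    return {"ndvi_median": 0.5, "evi_median": 0.3, "viirs_median": 1.0}

units = [{"id": 1, "lat": 23.8, "lon": 90.4}]
rows = add_satellite_features(units, stub_features, "2018-01-01", "2018-12-31")
print(rows[0]["ndvi_median"])
```

With Earth Engine initialized, pass `compute_satellite_features` in place of `stub_features`. Each call issues a blocking `getInfo()` request, so large datasets may be slow to process this way.
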

### Integration with R (via reticulate)

To integrate GEE in your R workflow:

```r
library(reticulate)

# Set Python environment
Sys.setenv(RETICULATE_PYTHON = "/path/to/python")

# Import Earth Engine
ee <- import("ee")
ee$Initialize()

# Call Python function from R
compute_features <- py_run_string("
def get_features(lat, lon):
    # Your GEE code here
    return {'ndvi_median': ..., 'viirs_median': ...}
")

features <- compute_features$get_features(lat, lon)
```

## App Configuration on Hugging Face

### Required Files
- `app.R` - Main Shiny application
- `Dockerfile` - Container configuration
- `Islam2019_WithGeocodesAndSatData.Rdata` - Example dataset (optional)

### Environment Variables
None required for basic functionality.

### Secrets (if adding GEE)
If you want to enable on-the-fly GEE queries:
1. Add `GOOGLE_APPLICATION_CREDENTIALS` secret in HF Space settings
2. Upload service account JSON
3. Modify app to call GEE API
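
One way the Python side of step 3 might look is sketched below. It assumes that `GOOGLE_APPLICATION_CREDENTIALS` holds the path to the uploaded service account JSON file and that the file contains the usual `client_email` field. The helper names are hypothetical, and the app itself does not ship this code.

```python
import json
import os

def read_service_account_email(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return (email, key_path) from the JSON key file named by env_var."""
    key_path = os.environ[env_var]
    with open(key_path) as fh:
        key = json.load(fh)
    return key["client_email"], key_path

def init_earth_engine():
    """Initialize Earth Engine with service-account credentials."""
    import ee  # imported lazily so the helper above works without earthengine-api
    email, key_path = read_service_account_email()
    credentials = ee.ServiceAccountCredentials(email, key_path)
    ee.Initialize(credentials)
```

Calling `init_earth_engine()` once at startup replaces the interactive `earthengine authenticate` step, which is not available inside a container.
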

## Usage Guide

### Running a Randomization Audit

1. **Load Data**: Upload CSV or use example
2. **Configure Audit**:
   - Audit Type: "Randomization"
   - Treatment Column: Select column (e.g., `begum_treat`)
   - Control Value: 1
   - Treatment Value: 2
3. **Select Features**: Check `ndvi_median` and `viirs_median`
4. **Choose Learner**: Logistic (fast) or XGBoost (more flexible)
5. **Set Parameters**:
   - K-Folds: 5-10 (higher = more robust)
   - Resamples: 1000-2000 (higher = more precise p-value)
6. **Run Audit**: Click "Run Audit" button
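
The trade-off behind the Resamples setting can be made concrete. For a Monte Carlo p-value estimated from R resamples, the standard error is approximately sqrt(p(1 - p) / R). The sketch below evaluates this textbook approximation. It is not taken from the app's code, and `mc_standard_error` is a hypothetical helper name.

```python
import math

def mc_standard_error(p, n_resamples):
    """Approximate standard error of a Monte Carlo p-value estimate."""
    return math.sqrt(p * (1.0 - p) / n_resamples)

for n in (1000, 2000):
    print(n, round(mc_standard_error(0.05, n), 4))
```

Near p = 0.05, moving from 1000 to 2000 resamples lowers the standard error from about 0.007 to about 0.005, so the extra resamples matter most when the p-value is close to the threshold.
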

### Running a Missingness Audit

Follow the same steps as for a randomization audit, with two changes:
- Audit Type: "Missingness"
- Select the variable to check for missing data patterns

### Interpreting Results

- **p < 0.05**: Assignment is MORE predictable from satellite features than expected under the stated randomization → potential deviation from stated randomization
- **p ≥ 0.05**: No evidence of deviation detected (this does not prove that randomization was perfect)
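
As a rough illustration of where such a p-value comes from, a resampling test compares the observed predictability statistic with the same statistic recomputed under re-drawn assignments. The sketch assumes a common convention, p = (1 + number of null statistics at least as large as the observed one) / (1 + R). The app's exact statistic and formula are not documented here and may differ.

```python
def resampling_p_value(observed, null_stats):
    """One-sided Monte Carlo p-value with the add-one correction."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (1 + exceed) / (1 + len(null_stats))

# Toy numbers: 999 null statistics, 9 of which reach the observed value.
null_stats = [0.70] * 9 + [0.50] * 990
print(resampling_p_value(0.65, null_stats))
```

With the add-one correction the p-value can never be exactly zero; its smallest possible value is 1 / (1 + R).
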

## Technical Notes

### Computation Time
- Logistic: ~1-3 minutes for 500 units, 1000 resamples
- XGBoost: ~3-10 minutes (depends on tree settings)

### Memory Requirements
- Small datasets (<1000 units): 2GB RAM sufficient
- Large datasets (>5000 units): Consider 4GB+ RAM

### Handling Missing Satellite Data

If your CSV has missing satellite features:
- The app will drop rows with missing values
- To keep those rows, consider imputation before upload, or
- Use GEE to compute features for the missing locations

## Troubleshooting

### "Feature not found" error
- Check that your CSV has columns named exactly: `ndvi_median`, `viirs_median`
- Column names are case-sensitive

### "Too few complete cases" error
- Ensure at least 10 units have both valid treatment assignment and satellite features
- Check for NA values in your data
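
One way to check this before uploading is to count complete cases locally. The sketch below uses the standard library and treats empty strings, `NA`, and `NaN` as missing. That convention, the function name, and the column names are assumptions rather than the app's documented behavior. The threshold of 10 comes from the error description above.

```python
import csv
import io

MISSING = {"", "NA", "NaN"}

def count_complete_cases(handle, columns):
    """Count rows in which every listed column has a non-missing value."""
    reader = csv.DictReader(handle)
    return sum(
        all((row.get(col) or "").strip() not in MISSING for col in columns)
        for row in reader
    )

data = io.StringIO(
    "id,begum_treat,ndvi_median,viirs_median\n"
    "1,1,0.45,2.3\n"
    "2,2,NA,3.1\n"
    "3,1,0.40,\n"
)
n = count_complete_cases(data, ["begum_treat", "ndvi_median", "viirs_median"])
print(n, "complete cases;", "enough" if n >= 10 else "fewer than 10")
```
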

### GEE authentication issues
```bash
# Re-authenticate
earthengine authenticate

# Check credentials
python -c "import ee; ee.Initialize(); print('Success!')"
```

### Dockerfile build fails
```bash
# Test locally
docker build -t remote-audit .
docker run -p 7860:7860 remote-audit
```

## Citation

If you use this app, please cite:

```
Jerzak, C. T., & Daoud, A. (2025). Remote Auditing: Design-Based Tests
of Randomization, Selection, and Missingness with Broadly Accessible
Satellite Imagery.
```

## Support

For issues or questions:
- Check the paper's technical appendix
- Review example code in `RemoteAuditOfBrokenRCT.R`
- Contact: [your contact info]