# Dataset Upload & Download - Implementation Complete

Dataset upload and download functionality has been implemented for ARKit datasets.

## ✅ Implemented Features

### 1. Dataset Upload (`ylff/utils/dataset_upload.py`)

**Functions:**

- ✅ `validate_arkit_zip()` - Validate that a zip file contains valid ARKit video-metadata pairs
- ✅ `extract_arkit_zip()` - Extract and organize an ARKit zip file into sequence directories
- ✅ `process_uploaded_dataset()` - Complete upload processing pipeline

**Features:**

- Validates zip file format
- Checks for matching video-metadata pairs (same base name)
- Validates JSON metadata format
- Organizes files into sequence directories
- Reports validation errors and statistics

### 2. Dataset Download (`ylff/utils/dataset_download.py`)

**S3DatasetDownloader Class:**

- ✅ S3 client initialization with credentials
- ✅ `list_datasets()` - List available datasets in an S3 bucket
- ✅ `download_dataset()` - Download a dataset from S3 with progress reporting
- ✅ `download_and_extract()` - Download and extract a dataset

**Features:**

- AWS credentials support (access key or credentials chain)
- Progress bar for downloads
- Automatic extraction (zip, tar.gz, tar)
- Error handling and reporting
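The sections below document the CLI and HTTP interfaces; for completeness, here is a minimal sketch of calling the two utility modules directly from Python. The function and constructor signatures are not spelled out in this document, so the keyword arguments shown (`zip_path`, `output_dir`, `validate`, `bucket_name`, `s3_key`, `region_name`) are assumptions modeled on the CLI options and request fields described below and may differ from the actual code.

```python
# Minimal sketch of direct (non-CLI) usage of the utility modules.
# NOTE: keyword arguments are assumed from the CLI/API parameters and
# may not match the real signatures in ylff.utils.
from ylff.utils.dataset_upload import process_uploaded_dataset
from ylff.utils.dataset_download import S3DatasetDownloader

# Upload path: validate an ARKit zip and organize it into sequence directories.
report = process_uploaded_dataset(
    zip_path="arkit_dataset.zip",
    output_dir="data/uploaded_datasets",
    validate=True,
)
print(report)  # validation report: file counts, valid/invalid pairs, sequences

# Download path: fetch a dataset archive from S3 and extract it.
downloader = S3DatasetDownloader(region_name="us-east-1")  # credentials via boto3 chain
downloader.download_and_extract(
    bucket_name="my-datasets-bucket",
    s3_key="datasets/arkit_sequences.zip",
    output_dir="data/downloaded_datasets",
)
```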
## 📋 API Endpoints

### `/api/v1/dataset/upload` (POST)

**Request**: Multipart form data

- `file`: Zip file containing ARKit video and metadata pairs
- `output_dir`: Directory to extract dataset (default: "data/uploaded_datasets")
- `validate`: Validate ARKit pairs before extraction (default: true)

**Response**: `JobResponse` (async job)

**Example:**

```bash
curl -X POST "http://localhost:8000/api/v1/dataset/upload" \
  -F "file=@arkit_dataset.zip" \
  -F "output_dir=data/uploaded_datasets" \
  -F "validate=true"
```

### `/api/v1/dataset/download` (POST)

**Request Model**: `DownloadDatasetRequest`

```json
{
  "bucket_name": "my-datasets-bucket",
  "s3_key": "datasets/arkit_sequences.zip",
  "output_dir": "data/downloaded_datasets",
  "extract": true,
  "aws_access_key_id": null,
  "aws_secret_access_key": null,
  "region_name": "us-east-1"
}
```

**Response**: `DownloadDatasetResponse`

- `success`: Boolean
- `output_path`: Path to downloaded file (if not extracted)
- `output_dir`: Directory where dataset was extracted (if extracted)
- `file_size`: Size of downloaded file in bytes
- `error`: Error message if download failed

## 🔧 CLI Commands

### `ylff dataset upload`

```bash
ylff dataset upload arkit_dataset.zip \
  --output-dir data/uploaded_datasets \
  --validate
```

**Options:**

- `zip_path`: Path to zip file (required)
- `--output-dir`: Directory to extract dataset (default: "data/uploaded_datasets")
- `--validate`: Validate ARKit pairs before extraction (default: true)

### `ylff dataset download`

```bash
ylff dataset download my-bucket datasets/arkit.zip \
  --output-dir data/downloaded_datasets \
  --extract \
  --region-name us-east-1
```

**Options:**

- `bucket_name`: S3 bucket name (required)
- `s3_key`: S3 object key (required)
- `--output-dir`: Directory to save dataset (default: "data/downloaded_datasets")
- `--extract`: Extract downloaded archive (default: true)
- `--aws-access-key-id`: AWS access key ID (optional)
- `--aws-secret-access-key`: AWS secret access key (optional)
- `--region-name`: AWS region name (default: "us-east-1")

## 📦 Requirements

### Upload

- No additional dependencies (uses standard library)

### Download

- `boto3` - AWS SDK for Python

```bash
pip install boto3
```

## 🔄 Usage Examples

### Upload ARKit Dataset

**CLI:**

```bash
ylff dataset upload my_arkit_data.zip --output-dir data/sequences
```

**API:**

```python
import requests

with open("my_arkit_data.zip", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/v1/dataset/upload",
        files={"file": f},
        data={"output_dir": "data/sequences", "validate": "true"},
    )
job_id = response.json()["job_id"]
```

### Download from S3

**CLI:**

```bash
ylff dataset download my-bucket datasets/v1.zip \
  --output-dir data/downloaded \
  --extract
```

**API:**

```python
import requests

response = requests.post(
    "http://localhost:8000/api/v1/dataset/download",
    json={
        "bucket_name": "my-bucket",
        "s3_key": "datasets/v1.zip",
        "output_dir": "data/downloaded",
        "extract": True,
    },
)
result = response.json()
```

## 📊 Validation

The upload process validates:

- ✅ Zip file format
- ✅ Matching video-metadata pairs (same base name)
- ✅ Valid JSON metadata format
- ✅ File organization

**Validation Report:**

- Total files in zip
- Video files count
- Metadata files count
- Valid pairs count
- Invalid pairs list
- Organized sequences count

## 🔐 AWS Credentials

The download functionality supports multiple credential methods:

1. **Explicit credentials** (via API/CLI parameters)
2. **Environment variables** (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
3. **IAM role** (when running on EC2/ECS)
4. **Credentials file** (`~/.aws/credentials`)

All methods are supported via boto3's default credentials chain (see the sketch at the end of this document).

## 🚀 Next Steps

1. **S3 Upload** - Add ability to upload datasets to S3
2. **Dataset Listing** - API endpoint to list available datasets in S3
3. **Incremental Downloads** - Support for partial dataset downloads
4. **Compression Options** - Configurable compression for uploads
5. **Metadata Validation** - Enhanced ARKit metadata schema validation

All core functionality is implemented and ready to use! 🎉
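As an addendum to the AWS Credentials section above, the sketch below contrasts explicit credentials with boto3's default chain and uses `list_datasets()` to browse a bucket before downloading. The constructor and method parameters are assumptions based on the API/CLI fields documented above, not confirmed signatures.

```python
# Addendum sketch: credential handling for S3 downloads.
# NOTE: parameter names are assumed from the documented API/CLI options.
from ylff.utils.dataset_download import S3DatasetDownloader

# 1. Explicit credentials, mirroring the API/CLI parameters.
explicit = S3DatasetDownloader(
    aws_access_key_id="AKIA...",       # placeholder value
    aws_secret_access_key="...",       # placeholder value
    region_name="us-east-1",
)

# 2. Default chain: with no keys passed, boto3 resolves credentials from
#    environment variables, ~/.aws/credentials, or an attached IAM role.
default_chain = S3DatasetDownloader(region_name="us-east-1")

# Browse the bucket before downloading (hypothetical return shape: a list of keys).
for key in default_chain.list_datasets(bucket_name="my-datasets-bucket"):
    print(key)
```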