# Dataset Upload & Download - Implementation Complete
Upload and download functionality has been implemented for ARKit datasets.
## βœ… Implemented Features
### 1. Dataset Upload (`ylff/utils/dataset_upload.py`)
**Functions:**
- βœ… `validate_arkit_zip()` - Validate that a zip file contains valid ARKit video-metadata pairs
- βœ… `extract_arkit_zip()` - Extract an ARKit zip file and organize it into sequence directories
- βœ… `process_uploaded_dataset()` - Run the complete upload-processing pipeline
**Features:**
- Validates zip file format
- Checks for matching video-metadata pairs (same base name)
- Validates JSON metadata format
- Organizes files into sequence directories
- Reports validation errors and statistics
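A minimal usage sketch of the functions above (the exact signatures are assumptions; see `ylff/utils/dataset_upload.py` for the real interface):
```python
from pathlib import Path

from ylff.utils.dataset_upload import (
    process_uploaded_dataset,
    validate_arkit_zip,
)

zip_path = Path("arkit_dataset.zip")

# Validate first: checks the zip format and video-metadata pairing.
report = validate_arkit_zip(zip_path)

# Or run the complete pipeline: validate, extract, and organize into
# per-sequence directories (keyword argument names are assumptions).
result = process_uploaded_dataset(
    zip_path,
    output_dir=Path("data/uploaded_datasets"),
    validate=True,
)
```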
### 2. Dataset Download (`ylff/utils/dataset_download.py`)
**S3DatasetDownloader Class:**
- βœ… S3 client initialization with credentials
- βœ… `list_datasets()` - List the datasets available in an S3 bucket
- βœ… `download_dataset()` - Download a dataset from S3 with a progress bar
- βœ… `download_and_extract()` - Download and extract a dataset
**Features:**
- AWS credentials support (access key or credentials chain)
- Progress bar for downloads
- Automatic extraction (zip, tar.gz, tar)
- Error handling and reporting
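A minimal usage sketch (constructor and method signatures are assumptions inferred from the list above; see `ylff/utils/dataset_download.py`):
```python
from ylff.utils.dataset_download import S3DatasetDownloader

# Credentials are omitted here; see the AWS Credentials section below.
downloader = S3DatasetDownloader(region_name="us-east-1")

# List the datasets available in a bucket.
keys = downloader.list_datasets("my-datasets-bucket")

# Download and extract a dataset in one step.
downloader.download_and_extract(
    "my-datasets-bucket",
    "datasets/arkit_sequences.zip",
    output_dir="data/downloaded_datasets",
)
```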
## πŸ“‹ API Endpoints
### `/api/v1/dataset/upload` (POST)
**Request**: Multipart form data
- `file`: Zip file containing ARKit video-metadata pairs
- `output_dir`: Directory to extract dataset (default: "data/uploaded_datasets")
- `validate`: Validate ARKit pairs before extraction (default: true)
**Response**: `JobResponse` (async job)
**Example:**
```bash
curl -X POST "http://localhost:8000/api/v1/dataset/upload" \
-F "file=@arkit_dataset.zip" \
-F "output_dir=data/uploaded_datasets" \
-F "validate=true"
```
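An illustrative `JobResponse` payload (only `job_id` is confirmed by the Python example further down; the remaining fields are assumptions):
```json
{
  "job_id": "3f9a1b2c",
  "status": "pending"
}
```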
### `/api/v1/dataset/download` (POST)
**Request Model**: `DownloadDatasetRequest`
```json
{
"bucket_name": "my-datasets-bucket",
"s3_key": "datasets/arkit_sequences.zip",
"output_dir": "data/downloaded_datasets",
"extract": true,
"aws_access_key_id": null,
"aws_secret_access_key": null,
"region_name": "us-east-1"
}
```
**Response**: `DownloadDatasetResponse`
- `success`: Boolean
- `output_path`: Path to downloaded file (if not extracted)
- `output_dir`: Directory where dataset was extracted (if extracted)
- `file_size`: Size of downloaded file in bytes
- `error`: Error message if download failed
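An illustrative successful response using the fields above (values are examples only):
```json
{
  "success": true,
  "output_path": null,
  "output_dir": "data/downloaded_datasets",
  "file_size": 104857600,
  "error": null
}
```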
## πŸ”§ CLI Commands
### `ylff dataset upload`
```bash
ylff dataset upload arkit_dataset.zip \
  --output-dir data/uploaded_datasets \
  --validate
```
**Options:**
- `zip_path`: Path to zip file (required)
- `--output-dir`: Directory to extract dataset (default: "data/uploaded_datasets")
- `--validate`: Validate ARKit pairs before extraction (default: true)
### `ylff dataset download`
```bash
ylff dataset download my-bucket datasets/arkit.zip \
  --output-dir data/downloaded_datasets \
  --extract \
  --region-name us-east-1
```
**Options:**
- `bucket_name`: S3 bucket name (required)
- `s3_key`: S3 object key (required)
- `--output-dir`: Directory to save dataset (default: "data/downloaded_datasets")
- `--extract`: Extract downloaded archive (default: true)
- `--aws-access-key-id`: AWS access key ID (optional)
- `--aws-secret-access-key`: AWS secret access key (optional)
- `--region-name`: AWS region name (default: "us-east-1")
## πŸ“¦ Requirements
### Upload
- No additional dependencies (uses standard library)
### Download
- `boto3` - AWS SDK for Python
```bash
pip install boto3
```
## πŸ”„ Usage Examples
### Upload ARKit Dataset
**CLI:**
```bash
ylff dataset upload my_arkit_data.zip --output-dir data/sequences
```
**API:**
```python
import requests
with open("my_arkit_data.zip", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/v1/dataset/upload",
        files={"file": f},
        data={"output_dir": "data/sequences", "validate": "true"},
    )

job_id = response.json()["job_id"]
```
### Download from S3
**CLI:**
```bash
ylff dataset download my-bucket datasets/v1.zip \
  --output-dir data/downloaded \
  --extract
```
**API:**
```python
import requests
response = requests.post(
    "http://localhost:8000/api/v1/dataset/download",
    json={
        "bucket_name": "my-bucket",
        "s3_key": "datasets/v1.zip",
        "output_dir": "data/downloaded",
        "extract": True,
    },
)
result = response.json()
```
## πŸ“Š Validation
The upload process validates:
- βœ… Zip file format
- βœ… Matching video-metadata pairs (same base name)
- βœ… Valid JSON metadata format
- βœ… File organization
**Validation Report:**
- Total files in zip
- Video files count
- Metadata files count
- Valid pairs count
- Invalid pairs list
- Organized sequences count
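An illustrative report covering these fields (the exact key names are assumptions):
```json
{
  "total_files": 20,
  "video_files": 10,
  "metadata_files": 10,
  "valid_pairs": 9,
  "invalid_pairs": ["seq_07"],
  "organized_sequences": 9
}
```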
## πŸ” AWS Credentials
The download functionality supports multiple credential methods:
1. **Explicit credentials** (via API/CLI parameters)
2. **Environment variables** (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
3. **IAM role** (when running on EC2/ECS)
4. **Credentials file** (`~/.aws/credentials`)
Explicit credentials are passed directly to the S3 client; the remaining methods are resolved automatically by boto3's default credentials chain, as sketched below.
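A sketch of both styles (the `S3DatasetDownloader` constructor parameters are assumptions mirroring the API request fields above; the key values are placeholders):
```python
from ylff.utils.dataset_download import S3DatasetDownloader

# Method 1: explicit credentials (placeholder values).
downloader = S3DatasetDownloader(
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
    region_name="us-east-1",
)

# Methods 2-4: omit credentials and let boto3 resolve them via
# environment variables, ~/.aws/credentials, or an attached IAM role.
downloader = S3DatasetDownloader(region_name="us-east-1")
```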
## πŸš€ Next Steps
1. **S3 Upload** - Add ability to upload datasets to S3
2. **Dataset Listing** - API endpoint to list available datasets in S3
3. **Incremental Downloads** - Support for partial dataset downloads
4. **Compression Options** - Configurable compression for uploads
5. **Metadata Validation** - Enhanced ARKit metadata schema validation
All core functionality is implemented and ready to use! πŸŽ‰