whats2000 committed on
Commit cee344c · 1 Parent(s): 874e4c6

feat(recovery): add corrupted file redownload script and documentation

Files changed (2):
  1. README.md +54 -0
  2. scripts/redownload_corrupted.py +244 -0
README.md CHANGED
@@ -41,6 +41,12 @@ This creates `output/cache/enhanced_metadata.parquet` with:
 
 Cache is incremental - only new/changed files are rescanned. Use `--force-rescan` to rebuild.
 
+**Handling corrupted files**: If some files fail during scanning (status='failed' or 'corrupted'), you can:
+1. Retry them with a more robust strategy: `uv run python scripts/retry_failed_cache.py --cache output/cache/enhanced_metadata.parquet`
+2. Re-download corrupted files from CELLxGENE: `uv run python scripts/redownload_corrupted.py --config configs/eda_optimized.yaml`
+
+See [Troubleshooting](#troubleshooting) for details.
+
 ### 3) Run EDA pipeline
 
 Single command to run everything:
@@ -221,6 +227,54 @@ The notebook provides:
 
 ## Troubleshooting
 
+### Corrupted or failed datasets
+
+If the metadata cache builder reports failed or corrupted files:
+
+**Step 1: Retry with robust strategy**
+
+Some files may fail due to transient issues or need special handling:
+
+```bash
+uv run python scripts/retry_failed_cache.py --cache output/cache/enhanced_metadata.parquet
+```
+
+This script:
+- Retries failed datasets with progressively safer strategies (anndata backed mode → h5py direct)
+- Categorizes truly corrupted files (truncated/damaged HDF5 structure)
+- Merges retry results back into the cache
+- Reports final statistics (successful recoveries vs truly corrupted)
+
+**Step 2: Re-download corrupted files**
+
+For files that are truly corrupted (status='corrupted'), re-download fresh copies from CELLxGENE:
+
+```bash
+uv run python scripts/redownload_corrupted.py --config configs/eda_optimized.yaml
+```
+
+This script:
+- Identifies corrupted files from the metadata cache
+- Looks up dataset IDs and download URLs from CELLxGENE metadata CSVs
+- Downloads files to `output/temp/` for safety
+- Verifies each downloaded file is valid HDF5
+- Moves verified files to replace corrupted originals
+- Keeps failed downloads in temp for inspection
+
+After re-downloading, rebuild the metadata cache to update the status:
+
+```bash
+uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml --force-rescan
+```
+
+**Typical corruption causes:**
+- Interrupted downloads during dataset collection
+- HDF5 file not properly closed/finalized during creation
+- Storage/filesystem errors
+- Network transfer errors from original source
+
+**Note:** Files marked as 'corrupted' have HDF5 structural issues (truncated superblock, missing data blocks) and cannot be repaired - they must be re-downloaded from the source.
+
 ### Metadata cache not found
 
 ```bash
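The corrupted-file lookup performed by `scripts/redownload_corrupted.py` can be reproduced interactively to preview what would be re-downloaded. A minimal sketch, assuming the cache columns the script reads (`status`, `dataset_file`, `file_size_gib`):

```python
import pandas as pd

def corrupted_entries(cache_df: pd.DataFrame) -> pd.DataFrame:
    """Rows the redownload script would target (status == 'corrupted')."""
    return cache_df.loc[cache_df["status"] == "corrupted",
                        ["dataset_file", "file_size_gib"]]

# Preview against the real cache:
# corrupted_entries(pd.read_parquet("output/cache/enhanced_metadata.parquet"))
```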
scripts/redownload_corrupted.py ADDED
@@ -0,0 +1,244 @@
+#!/usr/bin/env python3
+"""Re-download corrupted files based on metadata."""
+
+import argparse
+import yaml
+import pandas as pd
+import requests
+import h5py
+from pathlib import Path
+from tqdm import tqdm
+
+
+def load_config(config_path: Path) -> dict:
+    """Load YAML configuration."""
+    with open(config_path) as f:
+        return yaml.safe_load(f)
+
+
+def download_file(url: str, output_path: Path, temp_dir: Path, chunk_size: int = 8192) -> bool:
+    """Download a file with progress bar and return success status."""
+    try:
+        print(f"\n  Downloading {output_path.name}...")
+
+        # Download to project temp directory first
+        temp_dir.mkdir(parents=True, exist_ok=True)
+        temp_path = temp_dir / output_path.name
+
+        response = requests.get(url, stream=True, timeout=60)
+        response.raise_for_status()
+
+        total_size = int(response.headers.get('content-length', 0))
+
+        with open(temp_path, 'wb') as f, tqdm(
+            total=total_size,
+            unit='B',
+            unit_scale=True,
+            unit_divisor=1024,
+            desc="  Progress",
+        ) as pbar:
+            for chunk in response.iter_content(chunk_size=chunk_size):
+                if chunk:
+                    f.write(chunk)
+                    pbar.update(len(chunk))
+
+        # Verify it's a valid HDF5 file
+        print("  Verifying HDF5 integrity...")
+        try:
+            with h5py.File(temp_path, 'r') as h5:
+                # Try to access basic structure
+                _ = list(h5.keys())
+            print("  ✓ Valid HDF5 file")
+
+            # Move to final destination, replacing the corrupted file
+            print(f"  Moving to {output_path}...")
+            if output_path.exists():
+                output_path.unlink()
+            temp_path.rename(output_path)
+            print("  ✓ Successfully replaced corrupted file")
+            return True
+
+        except Exception as e:
+            print(f"  ✗ Downloaded file is corrupted: {e}")
+            print(f"  Keeping temp file for inspection: {temp_path}")
+            return False
+
+    except requests.exceptions.RequestException as e:
+        print(f"  ✗ Download failed: {e}")
+        return False
+    except Exception as e:
+        print(f"  ✗ Error: {e}")
+        return False
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Re-download corrupted datasets from CELLxGENE"
+    )
+    parser.add_argument(
+        "--config",
+        type=Path,
+        default=Path("configs/eda_optimized.yaml"),
+        help="Path to YAML configuration file",
+    )
+
+    args = parser.parse_args()
+
+    # Load configuration
+    print(f"Loading configuration from {args.config}...")
+    config = load_config(args.config)
+
+    # Extract paths from config
+    cache_path = Path(config["paths"]["enhanced_metadata_cache"])
+    metadata_csvs = [Path(p) for p in config["paths"]["metadata_csvs"]]
+    input_dirs = [Path(p) for p in config["paths"]["input_dirs"]]
+
+    print(f"Cache: {cache_path}")
+    print(f"Metadata CSVs: {len(metadata_csvs)} files")
+    print(f"Input dirs: {len(input_dirs)} directories\n")
+
+    # Load corrupted files from cache
+    print("Loading metadata cache...")
+    cache_df = pd.read_parquet(cache_path)
+
+    # Get corrupted files
+    corrupted = cache_df[cache_df["status"] == "corrupted"].copy()
+    print(f"Found {len(corrupted)} corrupted files\n")
+
+    # Load CELLxGENE metadata
+    print("Loading CELLxGENE metadata...")
+    metadata_dfs = []
+    for csv_path in metadata_csvs:
+        if csv_path.exists():
+            df = pd.read_csv(csv_path)
+            metadata_dfs.append(df)
+            print(f"  Loaded {len(df)} records from {csv_path.name}")
+        else:
+            print(f"  ⚠ Not found: {csv_path}")
+
+    if not metadata_dfs:
+        print("No metadata CSVs could be loaded; aborting.")
+        return
+
+    metadata = pd.concat(metadata_dfs, ignore_index=True)
+
+    # Create organism to directory mapping
+    organism_to_dir = {}
+    for input_dir in input_dirs:
+        if "homo_sapiens" in str(input_dir).lower():
+            organism_to_dir["Homo sapiens"] = input_dir
+        elif "mus_musculus" in str(input_dir).lower():
+            organism_to_dir["Mus musculus"] = input_dir
+
+    # Extract dataset IDs from filenames
+    print("\nMatching corrupted files with metadata...\n")
+    results = []
+
+    for _, row in corrupted.iterrows():
+        filename = row["dataset_file"]
+        # Extract dataset_id from filename (format: {dataset_id}__{title}.h5ad)
+        dataset_id = filename.split("__")[0]
+
+        # Find in metadata
+        match = metadata[metadata["dataset_id"] == dataset_id]
+
+        if len(match) > 0:
+            record = match.iloc[0]
+            dataset_version_id = record["dataset_version_id"]
+            title = record["dataset_title"]
+            organism = record["organism"]
+
+            # CELLxGENE download URL format
+            download_url = f"https://datasets.cellxgene.cziscience.com/{dataset_version_id}.h5ad"
+
+            # Output path based on organism
+            output_dir = organism_to_dir.get(organism)
+            if not output_dir:
+                print(f"  ⚠ Unknown organism: {organism}, skipping")
+                continue
+
+            output_path = output_dir / filename
+
+            results.append({
+                "dataset_id": dataset_id,
+                "version_id": dataset_version_id,
+                "title": title,
+                "organism": organism,
+                "filename": filename,
+                "size_gb": row["file_size_gib"],
+                "download_url": download_url,
+                "output_path": str(output_path),
+            })
+
+            print(f"✓ {dataset_id}")
+            print(f"  Title: {title}")
+            print(f"  Size: {row['file_size_gib']:.2f} GB")
+            print(f"  URL: {download_url}")
+            print()
+        else:
+            print(f"✗ {dataset_id} - NOT FOUND in metadata")
+            print()
+
+    if not results:
+        print("No files to re-download.")
+        return
+
+    # Summary
+    print("\n" + "=" * 80)
+    print("DOWNLOAD SUMMARY")
+    print("=" * 80)
+    total_size = sum(r['size_gb'] for r in results)
+    print(f"\nFound {len(results)} corrupted files to re-download")
+    print(f"Total download size: {total_size:.2f} GB\n")
+
+    for i, r in enumerate(results, 1):
+        print(f"{i}. {r['title']} ({r['size_gb']:.2f} GB)")
+
+    # Save CSV for reference
+    results_df = pd.DataFrame(results)
+    info_csv_path = Path("output/corrupted_files_redownload_info.csv")
+    info_csv_path.parent.mkdir(parents=True, exist_ok=True)
+    results_df.to_csv(info_csv_path, index=False)
+    print(f"\nDetails saved to: {info_csv_path}")
+
+    # Download files
+    print("\n" + "=" * 80)
+    print("DOWNLOADING FILES")
+    print("=" * 80)
+
+    # Use project temp directory
+    temp_dir = Path("output/temp")
+
+    success_count = 0
+    failed_files = []
+
+    for i, r in enumerate(results, 1):
+        print(f"\n[{i}/{len(results)}] {r['title']} ({r['size_gb']:.2f} GB)")
+
+        output_path = Path(r['output_path'])
+        success = download_file(r['download_url'], output_path, temp_dir)
+
+        if success:
+            success_count += 1
+        else:
+            failed_files.append(r['filename'])
+
+    # Final summary
+    print("\n" + "=" * 80)
+    print("FINAL RESULTS")
+    print("=" * 80)
+    print(f"\nSuccessfully downloaded: {success_count}/{len(results)}")
+
+    if failed_files:
+        print("\nFailed downloads:")
+        for fname in failed_files:
+            print(f"  - {fname}")
+        print("\nYou can retry failed downloads by running this script again.")
+    else:
+        print("\n✓ All files downloaded successfully!")
+    print("\nNext steps:")
+    print("  1. Re-run the metadata cache builder to update the cache:")
+    print("     uv run python scripts/build_metadata_cache.py --config configs/eda_optimized.yaml")
+    print("  2. Or re-run the retry script to update just these files:")
+    print("     uv run python scripts/retry_failed_cache.py --cache output/cache/enhanced_metadata.parquet")
+
+
+if __name__ == "__main__":
+    main()
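The naming and URL conventions the script relies on can be isolated into two small helpers, which makes them easy to unit-test. A sketch, assuming the `{dataset_id}__{title}.h5ad` filename format and the per-version CELLxGENE URL pattern hard-coded in the script:

```python
def dataset_id_from_filename(filename: str) -> str:
    """Extract the dataset_id prefix from a '{dataset_id}__{title}.h5ad' name."""
    return filename.split("__")[0]

def cellxgene_download_url(dataset_version_id: str) -> str:
    """Per-version CELLxGENE download URL, matching the script's pattern."""
    return f"https://datasets.cellxgene.cziscience.com/{dataset_version_id}.h5ad"
```

Keeping these as pure functions would let a future refactor of the matching loop validate filename parsing without touching the cache or the network.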