# Quick Start: Download Dataset

## Script Works! (Tested Successfully)

The download script has been tested and runs successfully. There are four ways to use it:

---
## Method 1: Use the Script (Easiest)

```bash
# Download the default dataset
python scripts/download_kagglehub.py

# Or specify a different dataset
python scripts/download_kagglehub.py --dataset shamimhasan8/ai-vs-human-text-dataset
```

**Output:** Dataset saved to `data/ai_vs_human_text.csv`
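
The script's source is not reproduced in this guide. As a rough sketch only, a `--dataset` flag with a default could be wired up with `argparse` as follows. `build_parser` and `DEFAULT_DATASET` are hypothetical names, and the default value is an assumption based on the dataset named elsewhere in this guide.

```python
import argparse

# Assumed default: the dataset handle this guide refers to throughout.
DEFAULT_DATASET = "shamimhasan8/ai-vs-human-text-dataset"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the --dataset flag used above."""
    parser = argparse.ArgumentParser(description="Download a Kaggle dataset.")
    parser.add_argument(
        "--dataset",
        default=DEFAULT_DATASET,
        help="Kaggle dataset handle in the form owner/name",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Would download: {args.dataset}")
```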

---

## Method 2: Direct in Your Code (Simple)

Just copy-paste this into your Python script:

```python
import kagglehub
import pandas as pd
from pathlib import Path

# Download dataset (public datasets need no API token)
path = kagglehub.dataset_download("shamimhasan8/ai-vs-human-text-dataset")
print("Path to dataset files:", path)

# Load the first CSV found in the download directory
csv_files = sorted(Path(path).glob("*.csv"))
if not csv_files:
    raise FileNotFoundError(f"No CSV files found in {path}")
df = pd.read_csv(csv_files[0])

# Save to your data directory, creating it if needed
Path("data").mkdir(parents=True, exist_ok=True)
df.to_csv("data/dataset.csv", index=False)
```

**See:** `examples/simple_download.py` for a complete example
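
Taking the first `*.csv` match works when the download contains a single file. If a dataset ships several CSVs, or nests them in subdirectories, a small selection helper makes the choice explicit. The sketch below is illustrative: `pick_csv` is a hypothetical helper, and falling back to the largest file assumes the main table is the biggest one.

```python
from pathlib import Path


def pick_csv(download_dir, preferred_name=None):
    """Return one CSV path from download_dir (hypothetical helper)."""
    # rglob also finds CSVs in subdirectories; sorted() makes the order stable.
    csv_files = sorted(Path(download_dir).rglob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files under {download_dir}")
    if preferred_name is not None:
        for candidate in csv_files:
            if candidate.name == preferred_name:
                return candidate
        raise FileNotFoundError(f"{preferred_name} not found under {download_dir}")
    # Assumption: with no preference given, the largest file is the main table.
    return max(csv_files, key=lambda p: p.stat().st_size)
```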

---

## Method 3: Use the Integrated Function

```python
from ai_text_detector.config import load_config
from ai_text_detector.download_data import download_ai_vs_human_dataset

# Download and get the path
csv_path = download_ai_vs_human_dataset()
print(f"Dataset at: {csv_path}")

# Now use it in your training
cfg = load_config("configs/default.yaml")
cfg.data_path = csv_path
```

**See:** `examples/download_and_train.py` for a complete training example

---

## Method 4: Download Any Dataset

```python
from ai_text_detector.download_data import download_kaggle_dataset

# Download any Kaggle dataset
csv_path = download_kaggle_dataset(
    "shamimhasan8/ai-vs-human-text-dataset",
    output_path="data/my_dataset.csv",
)
```

---

## What Was Downloaded

- **Dataset:** `shamimhasan8/ai-vs-human-text-dataset`
- **Size:** 1,000 samples
- **Columns:** `id`, `text`, `label`, `prompt`, `model`, `date`
- **Labels:** "AI-generated" or "Human-written"
- **Saved to:** `data/ai_vs_human_text.csv`

---

## Next Steps

1. **Dataset is ready!** It's at `data/ai_vs_human_text.csv`
2. **Config updated!** `configs/default.yaml` already points to it
3. **Train your model:**
   ```bash
   python scripts/run_train.py
   ```

---

## Tips

- **Small dataset (1k samples):** Good for quick testing
- **Want more data?** Look for larger datasets on Kaggle
- **Already downloaded?** The script won't re-download (uses cache)
- **No API token needed!** `kagglehub` downloads public datasets without credentials
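
The "won't re-download" behaviour can also be reproduced in your own code by checking for the output file before fetching. This is a minimal sketch of that pattern, not the script's actual logic: `ensure_local_copy` is a hypothetical helper, and `fetch` stands for any zero-argument callable that returns the path of a downloaded CSV (for example, a wrapper around `kagglehub.dataset_download`).

```python
import shutil
from pathlib import Path


def ensure_local_copy(output_path, fetch):
    """Copy a dataset to output_path unless a non-empty file already exists there."""
    output = Path(output_path)
    if output.is_file() and output.stat().st_size > 0:
        return output  # already present, so fetch() is never called
    source = Path(fetch())
    output.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, output)
    return output
```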

---

## Verify It Works

```bash
# Check the dataset
head -5 data/ai_vs_human_text.csv

# Or in Python
python - <<'EOF'
import pandas as pd
df = pd.read_csv("data/ai_vs_human_text.csv")
print(f"Rows: {len(df):,}")
print(df.head())
EOF
```
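
For a stricter check than eyeballing the first rows, the file can be validated against the columns and labels listed under "What Was Downloaded". The sketch below uses only the standard library. `check_dataset` is a hypothetical helper, and treating an empty `text` field as a problem is an assumption.

```python
import csv

EXPECTED_COLUMNS = ["id", "text", "label", "prompt", "model", "date"]
EXPECTED_LABELS = {"AI-generated", "Human-written"}


def check_dataset(csv_path):
    """Return a list of problems found in the CSV (empty if none)."""
    problems = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [col for col in EXPECTED_COLUMNS if col not in header]
        if missing:
            # Without the expected columns, the per-record checks cannot run.
            return [f"missing columns: {missing}"]
        for record_no, row in enumerate(reader, start=1):
            if row["label"] not in EXPECTED_LABELS:
                problems.append(f"record {record_no}: unexpected label {row['label']!r}")
            if not (row["text"] or "").strip():
                problems.append(f"record {record_no}: empty text")
    return problems


if __name__ == "__main__":
    for problem in check_dataset("data/ai_vs_human_text.csv"):
        print(problem)
```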