AITextDetector / QUICK_START_DOWNLOAD.md
ChauHPham's picture
Upload folder using huggingface_hub
25faba3 verified
# πŸš€ Quick Start: Download Dataset
## βœ… Script Works! (Tested Successfully)
The download script works perfectly! Here are all the ways to use it:
---
## Method 1: Use the Script (Easiest) ⭐
```bash
# Download the default dataset
python scripts/download_kagglehub.py
# Or specify a different dataset
python scripts/download_kagglehub.py --dataset shamimhasan8/ai-vs-human-text-dataset
```
**Output:** Dataset saved to `data/ai_vs_human_text.csv`
---
## Method 2: Direct in Your Code (Simple)
Just copy-paste this into your Python script:
```python
import kagglehub
import pandas as pd
from pathlib import Path
# Download dataset (no API token needed!)
path = kagglehub.dataset_download("shamimhasan8/ai-vs-human-text-dataset")
print("Path to dataset files:", path)
# Load the CSV
csv_files = list(Path(path).glob("*.csv"))
df = pd.read_csv(csv_files[0])
# Save to your data directory
df.to_csv("data/dataset.csv", index=False)
```
**See:** `examples/simple_download.py` for a complete example
---
## Method 3: Use the Integrated Function
```python
from ai_text_detector.download_data import download_ai_vs_human_dataset
# Download and get the path
csv_path = download_ai_vs_human_dataset()
print(f"Dataset at: {csv_path}")
# Now use it in your training
from ai_text_detector.config import load_config
cfg = load_config("configs/default.yaml")
cfg.data_path = csv_path
```
**See:** `examples/download_and_train.py` for a complete training example
---
## Method 4: Download Any Dataset
```python
from ai_text_detector.download_data import download_kaggle_dataset
# Download any Kaggle dataset
csv_path = download_kaggle_dataset(
"shamimhasan8/ai-vs-human-text-dataset",
output_path="data/my_dataset.csv"
)
```
---
## πŸ“Š What Was Downloaded
- **Dataset:** `shamimhasan8/ai-vs-human-text-dataset`
- **Size:** 1,000 samples
- **Columns:** `id`, `text`, `label`, `prompt`, `model`, `date`
- **Labels:** "AI-generated" or "Human-written"
- **Saved to:** `data/ai_vs_human_text.csv`
---
## 🎯 Next Steps
1. **Dataset is ready!** It's at `data/ai_vs_human_text.csv`
2. **Config updated!** `configs/default.yaml` already points to it
3. **Train your model:**
```bash
python scripts/run_train.py
```
---
## πŸ’‘ Tips
- **Small dataset (1k samples):** Good for quick testing
- **Want more data?** Look for larger datasets on Kaggle
- **Already downloaded?** The script won't re-download (uses cache)
- **No API token needed!** `kagglehub` handles everything
---
## πŸ” Verify It Works
```bash
# Check the dataset
head -5 data/ai_vs_human_text.csv
# Or in Python
import pandas as pd
df = pd.read_csv("data/ai_vs_human_text.csv")
print(f"Rows: {len(df):,}")
print(df.head())
```