# 🚀 Quick Start: Download Dataset

## ✅ Script Works! (Tested Successfully)

The download script has been tested and works. There are four ways to use it:

---

## Method 1: Use the Script (Easiest) ⭐

```bash
# Download the default dataset
python scripts/download_kagglehub.py

# Or specify a different dataset
python scripts/download_kagglehub.py --dataset shamimhasan8/ai-vs-human-text-dataset
```

**Output:** Dataset saved to `data/ai_vs_human_text.csv`

---

## Method 2: Direct in Your Code (Simple)

Just copy-paste this into your Python script:

```python
import kagglehub
import pandas as pd
from pathlib import Path

# Download the dataset (public datasets need no API token)
path = kagglehub.dataset_download("shamimhasan8/ai-vs-human-text-dataset")
print("Path to dataset files:", path)

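# Load the first CSV in the download directory
csv_files = sorted(Path(path).glob("*.csv"))
if not csv_files:
    raise FileNotFoundError(f"No CSV files found in {path}")
df = pd.read_csv(csv_files[0])

# Save to your data directory (create it if it doesn't exist)
Path("data").mkdir(exist_ok=True)
df.to_csv("data/dataset.csv", index=False)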
```

**See:** `examples/simple_download.py` for a complete example
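
If a download directory ever contains several CSV files, or nests them in subfolders, taking the first match may pick the wrong one. Below is a minimal sketch of a more defensive picker. The helper name `pick_csv` and the "largest file wins" rule are assumptions made for illustration, not part of this project or of `kagglehub`:

```python
import tempfile
from pathlib import Path


def pick_csv(directory):
    """Return the largest CSV under `directory`, searching subfolders too."""
    candidates = sorted(Path(directory).rglob("*.csv"))
    if not candidates:
        raise FileNotFoundError(f"No CSV files found under {directory}")
    # Heuristic only: assume the largest file is the main table.
    return max(candidates, key=lambda p: p.stat().st_size)


# Demo on a throwaway directory so the sketch runs without a download.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "readme.csv").write_text("a\n1\n")
    (root / "nested").mkdir()
    (root / "nested" / "main.csv").write_text("a,b\n" + "1,2\n" * 50)
    chosen_name = pick_csv(root).name
    print(chosen_name)
```

In the snippet above, `pd.read_csv(pick_csv(path))` could stand in for `pd.read_csv(csv_files[0])`.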

---

## Method 3: Use the Integrated Function

```python
from ai_text_detector.download_data import download_ai_vs_human_dataset

# Download and get the path
csv_path = download_ai_vs_human_dataset()
print(f"Dataset at: {csv_path}")

# Now use it in your training
from ai_text_detector.config import load_config
cfg = load_config("configs/default.yaml")
cfg.data_path = csv_path
```

**See:** `examples/download_and_train.py` for a complete training example

---

## Method 4: Download Any Dataset

```python
from ai_text_detector.download_data import download_kaggle_dataset

# Download any Kaggle dataset
csv_path = download_kaggle_dataset(
    "shamimhasan8/ai-vs-human-text-dataset",
    output_path="data/my_dataset.csv"
)
```

---

## 📊 What Was Downloaded

- **Dataset:** `shamimhasan8/ai-vs-human-text-dataset`
- **Size:** 1,000 samples
- **Columns:** `id`, `text`, `label`, `prompt`, `model`, `date`
- **Labels:** "AI-generated" or "Human-written"
- **Saved to:** `data/ai_vs_human_text.csv`
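
To confirm that a downloaded file matches the schema listed above, a quick check like the following may help. This is a sketch that uses only the standard library. The function name `summarize_labels` and the sample rows are made up for illustration, and the expected columns and labels are copied from the list above:

```python
import csv
import io
from collections import Counter

# Schema as listed above; adjust if the dataset changes upstream.
EXPECTED_COLUMNS = ["id", "text", "label", "prompt", "model", "date"]
EXPECTED_LABELS = {"AI-generated", "Human-written"}


def summarize_labels(csv_file):
    """Check the header and label values, then return per-label row counts."""
    reader = csv.DictReader(csv_file)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")

    counts = Counter()
    for row_no, row in enumerate(reader, start=1):
        label = row["label"]
        if label not in EXPECTED_LABELS:
            raise ValueError(f"Unexpected label {label!r} in data row {row_no}")
        counts[label] += 1
    return counts


# Tiny in-memory sample (made-up rows) so the sketch runs without the download.
sample = io.StringIO(
    "id,text,label,prompt,model,date\n"
    "1,Hello there,Human-written,,,2024-01-01\n"
    "2,As an AI model,AI-generated,Say hi,some-model,2024-01-02\n"
    "3,Good morning,Human-written,,,2024-01-03\n"
)
label_counts = summarize_labels(sample)
print(dict(label_counts))
```

For the real file, open it with `open("data/ai_vs_human_text.csv", newline="", encoding="utf-8")` and pass the file object to `summarize_labels`.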

---

## 🎯 Next Steps

1. **Dataset is ready!** It's at `data/ai_vs_human_text.csv`
2. **Config updated!** `configs/default.yaml` already points to it
3. **Train your model:**
   ```bash
   python scripts/run_train.py
   ```

---

## 💡 Tips

- **Small dataset (1k samples):** Good for quick testing
- **Want more data?** Look for larger datasets on Kaggle
- **Already downloaded?** The script won't re-download (uses cache)
- **No API token needed!** `kagglehub` downloads public datasets without credentials

---

## πŸ” Verify It Works

```bash
# Check the dataset
head -5 data/ai_vs_human_text.csv

# Or in Python (run from the shell via a heredoc)
python - <<'EOF'
import pandas as pd
df = pd.read_csv("data/ai_vs_human_text.csv")
print(f"Rows: {len(df):,}")
print(df.head())
EOF
```