	# 🚀 Quick Start: Download Dataset

	## ✅ Script Works! (Tested Successfully)

	The download script has been tested and runs successfully. You can fetch the dataset in any of the following ways:

	---

	## Method 1: Use the Script (Easiest) ⭐

	```bash
	# Download the default dataset
	python scripts/download_kagglehub.py

	# Or specify a different dataset
	python scripts/download_kagglehub.py --dataset shamimhasan8/ai-vs-human-text-dataset
	```

	Output: Dataset saved to `data/ai_vs_human_text.csv`

	---

	## Method 2: Direct in Your Code (Simple)

	Just copy-paste this into your Python script:

	```python
	import kagglehub
	import pandas as pd
	from pathlib import Path

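	# Download dataset (public datasets need no API token)
	path = kagglehub.dataset_download("shamimhasan8/ai-vs-human-text-dataset")
	print("Path to dataset files:", path)

	# Load the first CSV found in the download directory
	csv_files = sorted(Path(path).glob("*.csv"))
	if not csv_files:
	    raise FileNotFoundError(f"No CSV files found in {path}")
	df = pd.read_csv(csv_files[0])

	# Save to your data directory (created if it does not exist)
	Path("data").mkdir(exist_ok=True)
	df.to_csv("data/dataset.csv", index=False)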
	```

	See: `examples/simple_download.py` for a complete example
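
If a dataset ships more than one CSV, or nests its files in subfolders, taking the first match from `glob("*.csv")` may pick the wrong file. The sketch below shows one possible heuristic: search recursively and take the largest CSV. `pick_csv` is a hypothetical helper, not part of this project, and "largest file" is an assumption that may not suit every dataset.

```python
from pathlib import Path


def pick_csv(download_dir: str) -> Path:
    """Return the largest CSV under download_dir (searched recursively).

    Hypothetical helper: choosing the largest file is only a heuristic
    for datasets that ship several CSVs.
    """
    candidates = sorted(
        Path(download_dir).rglob("*.csv"),
        key=lambda p: p.stat().st_size,
        reverse=True,
    )
    if not candidates:
        raise FileNotFoundError(f"No CSV files found under {download_dir}")
    return candidates[0]
```

With Method 2, this would replace the `csv_files[0]` lookup: `df = pd.read_csv(pick_csv(path))`.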

	---

	## Method 3: Use the Integrated Function

	```python
	from ai_text_detector.download_data import download_ai_vs_human_dataset

	# Download and get the path
	csv_path = download_ai_vs_human_dataset()
	print(f"Dataset at: {csv_path}")

	# Now use it in your training
	from ai_text_detector.config import load_config
	cfg = load_config("configs/default.yaml")
	cfg.data_path = csv_path
	```

	See: `examples/download_and_train.py` for a complete training example

	---

	## Method 4: Download Any Dataset

	```python
	from ai_text_detector.download_data import download_kaggle_dataset

	# Download any Kaggle dataset
	csv_path = download_kaggle_dataset(
	    "shamimhasan8/ai-vs-human-text-dataset",
	    output_path="data/my_dataset.csv",
	)
	```

	---

	## 📊 What Was Downloaded

	- Dataset: `shamimhasan8/ai-vs-human-text-dataset`
	- Size: 1,000 samples
	- Columns: `id`, `text`, `label`, `prompt`, `model`, `date`
	- Labels: "AI-generated" or "Human-written"
	- Saved to: `data/ai_vs_human_text.csv`
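
Before training, it can help to confirm that the file really has the columns and labels listed above. The following is a minimal sketch using only the standard library. The function name `load_labeled_texts`, the 0/1 label encoding, and the sample rows are all assumptions made for illustration. They are not taken from the project or the dataset.

```python
import csv
import io

EXPECTED_COLUMNS = ["id", "text", "label", "prompt", "model", "date"]
# Assumed encoding for illustration; the project may map labels differently.
LABEL_TO_INT = {"Human-written": 0, "AI-generated": 1}


def load_labeled_texts(csv_file):
    """Read rows from an open CSV file and return (text, int_label) pairs."""
    reader = csv.DictReader(csv_file)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    pairs = []
    for row in reader:
        label = row["label"].strip()
        if label not in LABEL_TO_INT:
            raise ValueError(f"Unexpected label: {label!r}")
        pairs.append((row["text"], LABEL_TO_INT[label]))
    return pairs


# Made-up rows that only mimic the column layout described above.
sample = io.StringIO(
    "id,text,label,prompt,model,date\n"
    "1,Hello there,Human-written,,,2024-01-01\n"
    "2,As an AI model,AI-generated,Say hi,gpt,2024-01-02\n"
)
print(load_labeled_texts(sample))
```

To run the same check on the real file, open it with `open("data/ai_vs_human_text.csv", newline="", encoding="utf-8")` and pass the file object to `load_labeled_texts`.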

	---

	## 🎯 Next Steps

	1. Dataset is ready! It's at `data/ai_vs_human_text.csv`
	2. Config updated! `configs/default.yaml` already points to it
	3. Train your model:
	```bash
	python scripts/run_train.py
	```

	---

	## 💡 Tips

	- Small dataset (1k samples): Good for quick testing
	- Want more data? Look for larger datasets on Kaggle
	- Already downloaded? The script won't re-download (uses cache)
	- No API token needed for public datasets: `kagglehub` downloads them without credentials
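
The cache tip above refers to `kagglehub`'s own download cache. If you also want to skip the copy into `data/` when the file is already there, a project-level check could look like the sketch below. `ensure_dataset` and its `fetch` callback are hypothetical names introduced here. The callback stands in for whichever download method you use.

```python
from pathlib import Path
from typing import Callable


def ensure_dataset(target: str, fetch: Callable[[Path], None]) -> Path:
    """Call fetch(target_path) only if the target file does not exist yet."""
    path = Path(target)
    if path.exists():
        return path  # already present, nothing to do
    path.parent.mkdir(parents=True, exist_ok=True)
    fetch(path)  # the callback is expected to write the file at `path`
    return path
```

Passing the download step in as a callback keeps the existence check independent of any particular download function, so it can be tried out without network access.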

	---

	## 🔍 Verify It Works

	```bash
	# Check the dataset
	head -5 data/ai_vs_human_text.csv

	# Or in Python (run from the shell via a heredoc)
	python - <<'EOF'
	import pandas as pd
	df = pd.read_csv("data/ai_vs_human_text.csv")
	print(f"Rows: {len(df):,}")
	print(df.head())
	EOF
	```