# Deepfake Adapter Dataset Processing - Quick Start Guide
## Overview
This pipeline processes the `real_person_adapters.csv` dataset to identify and annotate the real people used in deepfake LoRA models, using three LLM options: **Qwen**, **Llama**, and **Mistral**.
## Quick Start
### 1. Prerequisites
```bash
# Install required packages
pip install pandas numpy emoji requests tqdm spacy

# Download the spaCy English model (for NER)
python -m spacy download en_core_web_sm
```
**Note**: If the spaCy model is not already installed, it will be downloaded automatically when you run the notebook.
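If you want the same safety net in your own scripts, the auto-download check can be sketched like this (`load_spacy_model` is a hypothetical helper; the notebook's actual logic may differ):

```python
import importlib
import subprocess
import sys

def load_spacy_model(name="en_core_web_sm"):
    """Load a spaCy model, downloading it first if it is missing."""
    spacy = importlib.import_module("spacy")
    try:
        return spacy.load(name)
    except OSError:
        # Model not installed yet: download via the spaCy CLI, then retry.
        subprocess.run([sys.executable, "-m", "spacy", "download", name],
                       check=True)
        return spacy.load(name)
```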
### 2. Set Up API Keys
Choose at least ONE LLM provider and get an API key:

| Provider | Model | Sign-Up Link | Est. Cost (10k entries) |
|----------|-------|--------------|-------------------------|
| **Qwen** | Qwen-Max | https://dashscope.aliyun.com/ | Varies |
| **Llama** | Llama-3.1-70B | https://www.together.ai/ | ~$5-10 |
| **Mistral** | Mistral Large | https://mistral.ai/ | ~$40-80 |

Create your API key file in `misc/credentials/`:
```bash
# For Qwen
echo "your-api-key-here" > misc/credentials/qwen_api_key.txt

# For Llama (via Together AI)
echo "your-api-key-here" > misc/credentials/together_api_key.txt

# For Mistral
echo "your-api-key-here" > misc/credentials/mistral_api_key.txt
```
### 3. Run the Notebook
Open `Section_2-3-4_Figure_8_deepfake_adapters.ipynb` and:
1. **Run all cells sequentially** from top to bottom
2. The default configuration uses Qwen in test mode (10 samples)
3. Review the test results
4. To process the full dataset, change this in the LLM annotation cell:
```python
TEST_MODE = False
```
## Pipeline Stages
### Stage 1: NER & Name Cleaning
- **Input**: `data/CSV/real_person_adapters.csv`
- **Output**: `data/CSV/NER_POI_step01_pre_annotation.csv`
- **Function**: Cleans adapter names to extract real person names
  - Removes emoji, "lora", version tags such as "v1", and special characters
  - Example: "IU LoRA v2 🤖" → "IU"
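A minimal sketch of this cleaning step (the regexes below are illustrative; the notebook's actual rules are more extensive, and the character filter here drops emoji as a side effect):

```python
import re

# Hypothetical noise pattern: 'lora'-style keywords and version tags like v1, v2.1
NOISE = re.compile(r"\b(lora|locon|v\d+(\.\d+)?)\b", re.IGNORECASE)

def clean_adapter_name(raw: str) -> str:
    """Strip version tags, 'lora' keywords, emoji, and special characters."""
    text = NOISE.sub("", raw)                      # drop 'lora', 'v1', 'v2.1', ...
    text = re.sub(r"[^A-Za-z\s\-'.]", " ", text)   # drop emoji and special chars
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace
```

For example, `clean_adapter_name("IU LoRA v2 🤖")` returns `"IU"`, while hyphens and apostrophes in names like "Lee Ji-eun" are preserved.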
### Stage 2: Country/Nationality Mapping
- **Input**: Step 1 output + `misc/lists/countries.csv`
- **Output**: `data/CSV/NER_POI_step02_annotated.csv`
- **Function**: Maps tags to standardized country names
- Example: "korean" → "South Korea"
- Excludes uninhabited territories
### Stage 3: LLM Profession Annotation
- **Input**: Step 2 output + `misc/lists/professions.csv`
- **Output**: `data/CSV/{llm}_annotated_POI_test.csv` (test) or `{llm}_annotated_POI.csv` (full)
- **Function**: Uses the LLM to identify:
  - Full name
  - Gender
  - Up to 3 professions (from the profession list)
  - Country
- **Progress**: Automatically saves every 10 rows
- **Resumable**: Continues from the last saved progress if interrupted
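The save/resume behavior can be sketched as follows (hypothetical function and file names, modeled on the outputs this guide describes):

```python
import os

import pandas as pd

def annotate_resumable(df, annotate_row, out_csv, index_file, save_interval=10):
    """Annotate rows one by one, checkpointing progress every save_interval rows."""
    # Resume from the last saved index if a previous run was interrupted.
    start = 0
    if os.path.exists(index_file):
        start = int(open(index_file).read().strip())
    for i in range(start, len(df)):
        df.loc[i, "profession_llm"] = annotate_row(df.loc[i])
        if (i + 1) % save_interval == 0:
            df.to_csv(out_csv, index=False)         # partial results
            with open(index_file, "w") as f:
                f.write(str(i + 1))                 # last processed index
    df.to_csv(out_csv, index=False)
    return df
```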
## Configuration Options
In the LLM annotation cell, you can configure:
```python
# Choose LLM provider
SELECTED_LLM = 'qwen'  # Options: 'qwen', 'llama', 'mistral'

# Test mode (recommended for a first run)
TEST_MODE = True   # True = test on a small sample
TEST_SIZE = 10     # Number of rows for testing

# Processing limits
MAX_ROWS = 20000   # Maximum rows to process (None = all)
SAVE_INTERVAL = 10 # Save progress every N rows
```
## Expected Output Format
The final dataset will include all original columns plus:

| Column | Description | Example |
|--------|-------------|---------|
| `real_name` | Cleaned name | "IU" |
| `full_name` | Full name from LLM | "Lee Ji-eun (IU)" |
| `gender` | Gender from LLM | "Female" |
| `profession_llm` | Up to 3 professions | "singer, actor, celebrity" |
| `country` | Country from LLM | "South Korea" |
| `likely_country` | Country from tags | "South Korea" |
| `likely_nationality` | Nationality from tags | "South Korean" |
| `tags` | Combined tags | "['korean', 'celebrity', 'singer']" |
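Note that `tags` is stored as the string form of a Python list, so when loading the CSV back into pandas you need to parse it; `ast.literal_eval` is the safe standard-library way to do that (a sketch; `parse_tags` is a hypothetical helper):

```python
import ast

def parse_tags(cell: str) -> list[str]:
    """Convert a stringified list like "['korean', 'singer']" back to a list."""
    try:
        value = ast.literal_eval(cell)
        return value if isinstance(value, list) else []
    except (ValueError, SyntaxError):
        # Malformed cells (e.g. truncated strings) fall back to an empty list.
        return []
```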
## Troubleshooting
### API Key Errors
```
Warning: No API key for qwen
```
**Solution**: Ensure your API key file exists and contains only the key (no extra whitespace).

### Rate Limiting
```
Qwen API error (attempt 1/3): 429 Too Many Requests
```
**Solution**: The code automatically retries with exponential backoff. You can also:
- Increase the `time.sleep(0.5)` delay between requests
- Process in smaller batches
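The retry behavior described above follows the usual exponential-backoff pattern, sketched here (the notebook's actual constants and error handling may differ):

```python
import time

def call_with_backoff(request_fn, max_attempts=3, base_delay=1.0):
    """Retry request_fn with exponentially growing delays between attempts."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"API error (attempt {attempt + 1}/{max_attempts}): {exc}")
            time.sleep(delay)
```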
### Progress Lost
**Solution**: The pipeline saves progress automatically. Check:
- `data/CSV/{llm}_annotated_POI_test.csv` - your partial results
- `misc/{llm}_query_index.txt` - the last processed index

Simply re-run the cell; it will resume from the last saved progress.

### JSON Parse Errors from the LLM
```
Qwen API error: JSONDecodeError
```
**Solution**: This is usually temporary. The code:
- Returns "Unknown" for failed queries
- Continues processing
- Lets you manually review and reprocess failed entries later
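The fallback behavior can be sketched as follows (hypothetical field names, matching the output columns described earlier):

```python
import json

# Placeholder row used when the LLM's reply cannot be parsed.
FALLBACK = {"full_name": "Unknown", "gender": "Unknown",
            "profession_llm": "Unknown", "country": "Unknown"}

def parse_llm_response(raw: str) -> dict:
    """Parse the LLM's JSON reply; on failure, return 'Unknown' placeholders."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Failed queries get placeholders so processing continues; rows whose
        # fields are all "Unknown" can be filtered and reprocessed later.
        return dict(FALLBACK)
```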
## Cost Management
### Estimate Costs Before Processing
For a dataset with N entries:
- **Qwen**: Contact Alibaba Cloud for pricing
- **Llama**: ~N × $0.0005 = ~$5 per 10k entries
- **Mistral**: ~N × $0.004 = ~$40 per 10k entries
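A quick back-of-the-envelope estimator using the approximate per-entry rates above (these rates are rough and change over time; always check your provider's current pricing):

```python
# Approximate per-entry costs in USD, taken from the estimates above.
COST_PER_ENTRY = {"llama": 0.0005, "mistral": 0.004}

def estimate_cost(n_entries: int, provider: str) -> float:
    """Rough USD cost of annotating n_entries with the given provider."""
    return n_entries * COST_PER_ENTRY[provider]
```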
### Best Practices
1. **Always test first**: Run with `TEST_MODE = True` on 10 samples
2. **Monitor API usage**: Check your API provider's dashboard
3. **Use cheaper models first**: Try Llama before Mistral
4. **Process in batches**: Set `MAX_ROWS` to process incrementally
5. **Save intermediate results**: The automatic saving feature helps prevent data loss
## Comparing Multiple LLMs
To compare results from different LLMs:
1. Run the pipeline with `SELECTED_LLM = 'qwen'`
2. Change to `SELECTED_LLM = 'llama'` and run again
3. Change to `SELECTED_LLM = 'mistral'` and run again
4. Compare the three output files:
   - `qwen_annotated_POI.csv`
   - `llama_annotated_POI.csv`
   - `mistral_annotated_POI.csv`
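The three output files can then be lined up side by side, for example with a pandas merge (a sketch; `compare_annotations` is a hypothetical helper and the column names follow the output format described earlier):

```python
import pandas as pd

def compare_annotations(paths: dict[str, str], key: str = "real_name") -> pd.DataFrame:
    """Merge per-LLM annotation CSVs on `key`, suffixing columns by provider."""
    merged = None
    for llm, path in paths.items():
        df = pd.read_csv(path)[[key, "gender", "profession_llm", "country"]]
        # Suffix every non-key column so e.g. 'gender' becomes 'gender_qwen'.
        df = df.rename(columns={c: f"{c}_{llm}" for c in df.columns if c != key})
        merged = df if merged is None else merged.merge(df, on=key, how="outer")
    return merged
```

Rows where the providers disagree (e.g. `gender_qwen != gender_llama`) are natural candidates for manual review.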
## Files Created
The pipeline creates these files:
```
data/CSV/
├── NER_POI_step01_pre_annotation.csv   # After name cleaning
├── NER_POI_step02_annotated.csv        # After country mapping
├── qwen_annotated_POI_test.csv         # Test results (Qwen)
├── qwen_annotated_POI.csv              # Full results (Qwen)
├── llama_annotated_POI.csv             # Full results (Llama)
└── mistral_annotated_POI.csv           # Full results (Mistral)
misc/
├── qwen_query_index.txt                # Progress tracking
├── llama_query_index.txt               # Progress tracking
└── mistral_query_index.txt             # Progress tracking
```
## Support
For issues or questions:
1. Check this guide for common problems
2. Review `misc/credentials/README.md` for API setup
3. Read the notebook documentation (first cell)
4. Check your API provider's documentation for service-specific issues
## Ethical Considerations
This research documents ethical problems with AI deepfake models. The dataset and analysis help:
- Understand the scope of unauthorized use of personal likenesses
- Document the professions and demographics most affected
- Inform policy and technical solutions
- Raise awareness of deepfake technology misuse

Use this data responsibly and respect individual privacy and consent.