
Deepfake Adapter Dataset Processing - Quick Start Guide

Overview

This pipeline processes the real_person_adapters.csv dataset to identify and annotate the real people depicted in deepfake LoRA models, using one of three LLM providers: Qwen, Llama, or Mistral.

Quick Start

1. Prerequisites

# Install required packages
pip install pandas numpy emoji requests tqdm spacy

# Download spaCy English model (for NER)
python -m spacy download en_core_web_sm

Note: The spaCy model will be automatically downloaded when you run the notebook if not already installed.

2. Set Up API Keys

Choose at least ONE LLM provider and get an API key:

Provider  Model          Sign Up Link                   Est. Cost (10k entries)
Qwen      Qwen-Max       https://dashscope.aliyun.com/  Varies
Llama     Llama-3.1-70B  https://www.together.ai/       ~$5-10
Mistral   Mistral Large  https://mistral.ai/            ~$40-80

Create your API key file in misc/credentials/:

# For Qwen
echo "your-api-key-here" > misc/credentials/qwen_api_key.txt

# For Llama (via Together AI)
echo "your-api-key-here" > misc/credentials/together_api_key.txt

# For Mistral
echo "your-api-key-here" > misc/credentials/mistral_api_key.txt
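The notebook reads these files at run time. A minimal sketch of that lookup (the `load_api_key` helper and its exact behavior are assumptions for illustration, not the notebook's actual code):

```python
from pathlib import Path

def load_api_key(provider, cred_dir="misc/credentials"):
    """Read the API key for a provider, stripping stray whitespace.

    File naming follows the guide's convention, e.g. qwen_api_key.txt
    (assumed helper; the notebook may differ in detail).
    """
    path = Path(cred_dir) / f"{provider}_api_key.txt"
    if not path.exists():
        print(f"Warning: No API key for {provider}")
        return None
    # .strip() guards against trailing newlines from `echo`
    return path.read_text().strip()
```

Stripping whitespace here is what prevents the "No API key" and authentication errors described under Troubleshooting below.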

3. Run the Notebook

Open Section_2-3-4_Figure_8_deepfake_adapters.ipynb and:

  1. Run all cells sequentially from top to bottom
  2. The default configuration uses Qwen in test mode (10 samples)
  3. Review the test results
  4. To process the full dataset, set the following in the LLM annotation cell:
    TEST_MODE = False

Pipeline Stages

Stage 1: NER & Name Cleaning

  • Input: data/CSV/real_person_adapters.csv
  • Output: data/CSV/NER_POI_step01_pre_annotation.csv
  • Function: Cleans adapter names to extract real person names
    • Removes: emoji, "lora", "v1", special characters
    • Example: "IU LoRA v2 🎀" → "IU"
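The cleaning step can be sketched as a couple of regex passes. This is a simplified, stdlib-only sketch: the actual pipeline uses the emoji package and a richer noise-token list, so the patterns below are illustrative assumptions.

```python
import re

# Rough emoji codepoint ranges (the pipeline itself uses the `emoji`
# package; this stdlib-only regex is only an approximation)
EMOJI = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F]")
# Noise tokens stripped before NER (illustrative, not the exact list)
NOISE = re.compile(r"\b(lora|v\d+(\.\d+)?)\b", re.IGNORECASE)

def clean_adapter_name(raw):
    """Strip emoji, 'lora'/version tags, and special characters."""
    text = EMOJI.sub("", raw)
    text = NOISE.sub("", text)
    text = re.sub(r"[^\w\s\-']", " ", text)   # drop special characters
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

With these patterns, `clean_adapter_name("IU LoRA v2 🎀")` returns `"IU"`, matching the example above.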

Stage 2: Country/Nationality Mapping

  • Input: Step 1 output + misc/lists/countries.csv
  • Output: data/CSV/NER_POI_step02_annotated.csv
  • Function: Maps tags to standardized countries
    • Example: "korean" → "South Korea"
    • Excludes uninhabited territories
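The mapping itself is a lookup from demonym tags to standardized country names. A minimal sketch, where the three-entry dictionary stands in for the real misc/lists/countries.csv:

```python
# Stand-in for misc/lists/countries.csv (nationality -> country);
# the real list also excludes uninhabited territories
DEMONYMS = {
    "korean": "South Korea",
    "japanese": "Japan",
    "french": "France",
}

def map_tags_to_country(tags):
    """Return the first standardized country a tag maps to, else 'Unknown'."""
    for tag in tags:
        country = DEMONYMS.get(tag.lower())
        if country:
            return country
    return "Unknown"
```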

Stage 3: LLM Profession Annotation

  • Input: Step 2 output + misc/lists/professions.csv
  • Output: data/CSV/{llm}_annotated_POI_test.csv (test) or {llm}_annotated_POI.csv (full)
  • Function: Uses LLM to identify:
    • Full name
    • Gender
    • Up to 3 professions (from profession list)
    • Country
  • Progress: Automatically saves every 10 rows
  • Resumable: Can continue from last saved progress if interrupted
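The save-and-resume behavior follows a common checkpointing pattern: append completed rows every N iterations and record the last processed index in a side file. A sketch of that pattern (function and file handling here are illustrative, not the notebook's actual code):

```python
import csv
import os

def annotate_resumable(rows, annotate_fn, out_csv, index_file, save_interval=10):
    """Annotate rows, checkpointing every `save_interval` rows.

    Resumes from the row index recorded in `index_file`, so an
    interrupted run picks up where it left off.
    """
    start = 0
    if os.path.exists(index_file):
        start = int(open(index_file).read().strip())
    pending = []
    for i, row in enumerate(rows[start:], start=start):
        pending.append(annotate_fn(row))
        # Flush on every save_interval-th row and on the final row
        if (i + 1) % save_interval == 0 or i == len(rows) - 1:
            with open(out_csv, "a", newline="") as f:
                csv.writer(f).writerows(pending)
            pending = []
            with open(index_file, "w") as f:
                f.write(str(i + 1))
```

Because results are appended and the index file is updated only after a successful write, re-running after an interruption never duplicates or skips rows.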

Configuration Options

In the LLM annotation cell, you can configure:

# Choose LLM provider
SELECTED_LLM = 'qwen'  # Options: 'qwen', 'llama', 'mistral'

# Test mode (recommended for first run)
TEST_MODE = True       # True = test on small sample
TEST_SIZE = 10         # Number of rows for testing

# Processing limits
MAX_ROWS = 20000      # Maximum rows to process (None = all)
SAVE_INTERVAL = 10    # Save progress every N rows

Expected Output Format

The final dataset will include all original columns plus:

Column              Description            Example
real_name           Cleaned name           "IU"
full_name           Full name from LLM     "Lee Ji-eun (IU)"
gender              Gender from LLM        "Female"
profession_llm      Up to 3 professions    "singer, actor, celebrity"
country             Country from LLM       "South Korea"
likely_country      Country from tags      "South Korea"
likely_nationality  Nationality from tags  "South Korean"
tags                Combined tags          "['korean', 'celebrity', 'singer']"

Troubleshooting

API Key Errors

Warning: No API key for qwen

Solution: Ensure your API key file exists and contains only the key (no extra whitespace)

Rate Limiting

Qwen API error (attempt 1/3): 429 Too Many Requests

Solution: The code automatically retries with exponential backoff. You can also:

  • Increase the delay in time.sleep(0.5) between requests
  • Process in smaller batches
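The retry-with-backoff behavior can be sketched as follows (illustrative; the notebook's actual attempt count and delays may differ):

```python
import time

def call_with_retry(request_fn, max_attempts=3, base_delay=1.0):
    """Retry a request with exponential backoff (1s, 2s, 4s, ...).

    Re-raises the last error once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception as err:
            print(f"API error (attempt {attempt + 1}/{max_attempts}): {err}")
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```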

Progress Lost

Solution: The pipeline saves progress automatically. Check:

  • data/CSV/{llm}_annotated_POI_test.csv - your partial results
  • misc/{llm}_query_index.txt - last processed index

Then re-run the cell and it will resume from the last saved progress.

JSON Parse Errors from LLM

Qwen API error: JSONDecodeError

Solution: This is usually temporary. The code:

  • Returns "Unknown" for failed queries
  • Continues processing
  • You can manually review/reprocess failed entries later
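A sketch of that fallback, with field names assumed from the output table above:

```python
import json

# Placeholder record returned when the LLM reply cannot be parsed
# (field names assumed from the expected output format)
UNKNOWN = {"full_name": "Unknown", "gender": "Unknown",
           "profession_llm": "Unknown", "country": "Unknown"}

def parse_llm_reply(text):
    """Parse the LLM's JSON reply; fall back to 'Unknown' fields
    on malformed output so the run keeps going."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return dict(UNKNOWN)
```

Rows filled with "Unknown" are easy to filter out afterwards for manual review or reprocessing.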

Cost Management

Estimate Costs Before Processing

For a dataset with N entries:

  • Qwen: Contact Alibaba Cloud for pricing
  • Llama: ~N × $0.0005 = ~$5 per 10k entries
  • Mistral: ~N × $0.004 = ~$40 per 10k entries
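These estimates are simple per-entry multiplications; as a sanity check:

```python
# Approximate per-entry rates (USD) from the table above; Qwen pricing varies
RATES = {"llama": 0.0005, "mistral": 0.004}

def estimate_cost(n_entries, provider):
    """Rough USD cost estimate for annotating n_entries rows."""
    return n_entries * RATES[provider]
```

For example, `estimate_cost(10_000, "llama")` gives 5.0 and `estimate_cost(10_000, "mistral")` gives 40.0, matching the figures above.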

Best Practices

  1. Always test first: Run with TEST_MODE = True on 10 samples
  2. Monitor API usage: Check your API provider's dashboard
  3. Use cheaper models first: Try Llama before Mistral
  4. Process in batches: Set MAX_ROWS to process incrementally
  5. Save intermediate results: The automatic saving feature helps prevent data loss

Comparing Multiple LLMs

To compare results from different LLMs:

  1. Run the pipeline with SELECTED_LLM = 'qwen'
  2. Change to SELECTED_LLM = 'llama' and run again
  3. Change to SELECTED_LLM = 'mistral' and run again
  4. Compare the three output files:
    • qwen_annotated_POI.csv
    • llama_annotated_POI.csv
    • mistral_annotated_POI.csv
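Once the output files exist, a quick agreement check between any two of them might look like this (a hypothetical comparison helper, not part of the pipeline; column names taken from the output table above):

```python
import csv

def column_agreement(file_a, file_b, key="real_name", col="gender"):
    """Fraction of shared rows where two output files agree on a column,
    joined on `key`. Illustrative helper for comparing LLM outputs."""
    def load(path):
        with open(path, newline="") as f:
            return {row[key]: row[col] for row in csv.DictReader(f)}
    a, b = load(file_a), load(file_b)
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(a[k] == b[k] for k in shared) / len(shared)
```

Running this per column (gender, profession_llm, country) gives a simple picture of where the three LLMs disagree and which entries deserve manual review.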

Files Created

The pipeline creates these files:

data/CSV/
├── NER_POI_step01_pre_annotation.csv     # After name cleaning
├── NER_POI_step02_annotated.csv          # After country mapping
├── qwen_annotated_POI_test.csv           # Test results (Qwen)
├── qwen_annotated_POI.csv                # Full results (Qwen)
├── llama_annotated_POI.csv               # Full results (Llama)
└── mistral_annotated_POI.csv             # Full results (Mistral)

misc/
├── qwen_query_index.txt                  # Progress tracking
├── llama_query_index.txt                 # Progress tracking
└── mistral_query_index.txt               # Progress tracking

Support

For issues or questions:

  1. Check this guide for common problems
  2. Review misc/credentials/README.md for API setup
  3. Read the notebook documentation (first cell)
  4. Check API provider documentation for service-specific issues

Ethical Considerations

This research documents ethical problems with AI deepfake models. The dataset and analysis help:

  • Understand the scope of unauthorized person likeness usage
  • Document professions/demographics most affected
  • Inform policy and technical solutions
  • Raise awareness about deepfake technology misuse

Use this data responsibly and respect individual privacy and consent.