# Deepfake Adapter Dataset Processing - Quick Start Guide

## Overview
This pipeline processes the `real_person_adapters.csv` dataset to identify and annotate the real people used in deepfake LoRA models, using one of three LLM options: Qwen, Llama, or Mistral.
## Quick Start

### 1. Prerequisites
```bash
# Install required packages
pip install pandas numpy emoji requests tqdm spacy

# Download spaCy English model (for NER)
python -m spacy download en_core_web_sm
```
Note: The spaCy model will be automatically downloaded when you run the notebook if not already installed.
### 2. Set Up API Keys

Choose at least ONE LLM provider and get an API key:
| Provider | Model | Sign Up Link | Est. Cost (10k entries) |
|---|---|---|---|
| Qwen | Qwen-Max | https://dashscope.aliyun.com/ | Varies |
| Llama | Llama-3.1-70B | https://www.together.ai/ | ~$5-10 |
| Mistral | Mistral Large | https://mistral.ai/ | ~$40-80 |
Create your API key file in `misc/credentials/`:
```bash
# For Qwen
echo "your-api-key-here" > misc/credentials/qwen_api_key.txt

# For Llama (via Together AI)
echo "your-api-key-here" > misc/credentials/together_api_key.txt

# For Mistral
echo "your-api-key-here" > misc/credentials/mistral_api_key.txt
```
### 3. Run the Notebook

Open `Section_2-3-4_Figure_8_deepfake_adapters.ipynb` and:
- Run all cells sequentially from top to bottom
- The default configuration uses Qwen in test mode (10 samples)
- Review the test results
- To process the full dataset, change in the LLM annotation cell:

```python
TEST_MODE = False
```
## Pipeline Stages

### Stage 1: NER & Name Cleaning
- Input: `data/CSV/real_person_adapters.csv`
- Output: `data/CSV/NER_POI_step01_pre_annotation.csv`
- Function: Cleans adapter names to extract real person names
  - Removes: emoji, "lora", "v1", special characters
  - Example: "IU LoRA v2 ❤️" → "IU"
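A simplified sketch of the kind of cleaning Stage 1 performs. This regex-only version is illustrative; the actual pipeline also uses spaCy NER and the `emoji` package, and its exact rules may differ:

```python
import re

# Noise tokens stripped from adapter names ("lora" markers and version tags).
NOISE = re.compile(r"\b(lora|v\d+)\b", re.IGNORECASE)

def clean_adapter_name(raw: str) -> str:
    """Strip emoji, version tags, and 'lora' markers from an adapter name."""
    # Drop anything outside basic Latin letters/digits/spaces/hyphens;
    # this removes emoji and special characters in one pass.
    text = re.sub(r"[^A-Za-z0-9\s\-']", " ", raw)
    text = NOISE.sub(" ", text)            # remove "lora", "v1", "v2", ...
    return re.sub(r"\s+", " ", text).strip()
```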
### Stage 2: Country/Nationality Mapping

- Input: Step 1 output + `misc/lists/countries.csv`
- Output: `data/CSV/NER_POI_step02_annotated.csv`
- Function: Maps tags to standardized countries
  - Example: "korean" → "South Korea"
  - Excludes uninhabited territories
### Stage 3: LLM Profession Annotation

- Input: Step 2 output + `misc/lists/professions.csv`
- Output: `data/CSV/{llm}_annotated_POI_test.csv` (test) or `data/CSV/{llm}_annotated_POI.csv` (full)
- Function: Uses the LLM to identify:
  - Full name
  - Gender
  - Up to 3 professions (from the profession list)
  - Country
- Progress: Automatically saves every 10 rows
- Resumable: Can continue from the last saved progress if interrupted
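The save-and-resume behaviour can be sketched as a checkpointed loop. This is a structural sketch only: the function name, the `annotate` callable, and the plain-text index file mirror, but are not, the notebook's exact API:

```python
import os

def annotate_resumable(rows, annotate, index_file, save_interval=10):
    """Run `annotate` over rows, checkpointing every `save_interval` rows.

    On restart, processing resumes after the index stored in `index_file`.
    """
    start = 0
    if os.path.exists(index_file):
        with open(index_file) as f:
            start = int(f.read().strip()) + 1  # resume after last saved row
    results = []
    for i, row in enumerate(rows):
        if i < start:
            continue  # already handled in a previous run
        results.append(annotate(row))
        if (i + 1) % save_interval == 0 or i == len(rows) - 1:
            with open(index_file, "w") as f:
                f.write(str(i))  # checkpoint: last processed index
            # (the real pipeline also appends the annotated rows to its CSV here)
    return results
```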
## Configuration Options
In the LLM annotation cell, you can configure:
```python
# Choose LLM provider
SELECTED_LLM = 'qwen'   # Options: 'qwen', 'llama', 'mistral'

# Test mode (recommended for first run)
TEST_MODE = True        # True = test on a small sample
TEST_SIZE = 10          # Number of rows for testing

# Processing limits
MAX_ROWS = 20000        # Maximum rows to process (None = all)
SAVE_INTERVAL = 10      # Save progress every N rows
```
## Expected Output Format

The final dataset will include all original columns plus:

| Column | Description | Example |
|---|---|---|
| `real_name` | Cleaned name | "IU" |
| `full_name` | Full name from LLM | "Lee Ji-eun (IU)" |
| `gender` | Gender from LLM | "Female" |
| `profession_llm` | Up to 3 professions | "singer, actor, celebrity" |
| `country` | Country from LLM | "South Korea" |
| `likely_country` | Country from tags | "South Korea" |
| `likely_nationality` | Nationality from tags | "South Korean" |
| `tags` | Combined tags | "['korean', 'celebrity', 'singer']" |
## Troubleshooting

### API Key Errors

```
Warning: No API key for qwen
```

Solution: Ensure your API key file exists and contains only the key (no extra whitespace).
### Rate Limiting

```
Qwen API error (attempt 1/3): 429 Too Many Requests
```

Solution: The code automatically retries with exponential backoff. You can also:
- Increase `time.sleep(0.5)` to a higher value
- Process in smaller batches
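The retry-with-backoff pattern the code uses can be sketched as follows (a generic version; `call_with_backoff` and its parameters are assumptions, not the notebook's exact names):

```python
import time

def call_with_backoff(request_fn, max_attempts=3, base_delay=1.0):
    """Retry a zero-argument API call, doubling the delay after each failure.

    Delays grow as base_delay * 2**attempt (e.g. 1s, 2s, 4s); the last
    failure is re-raised so the caller can record it.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error
            print(f"API error (attempt {attempt + 1}/{max_attempts}): {exc}")
            time.sleep(base_delay * (2 ** attempt))
```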
### Progress Lost

Solution: The pipeline saves progress automatically. Check:
- `data/CSV/{llm}_annotated_POI_test.csv` - your partial results
- `misc/{llm}_query_index.txt` - last processed index

Just re-run the cell and it will resume from the last saved progress.
### JSON Parse Errors from LLM

```
Qwen API error: JSONDecodeError
```

Solution: This is usually temporary. The code:
- Returns "Unknown" for failed queries
- Continues processing
- Lets you manually review/reprocess failed entries later
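A defensive parse along these lines keeps a batch moving when a reply is not valid JSON (a sketch; the field names match the output table above, but `parse_llm_reply` and the fence-stripping step are assumptions about the notebook's internals):

```python
import json

# Fallback record used when a reply cannot be parsed.
FALLBACK = {"full_name": "Unknown", "gender": "Unknown",
            "profession_llm": "Unknown", "country": "Unknown"}

def parse_llm_reply(text: str) -> dict:
    """Parse an LLM JSON reply, falling back to 'Unknown' fields on failure."""
    try:
        # Models sometimes wrap JSON in markdown fences; strip them first.
        cleaned = text.strip().removeprefix("```json").removesuffix("```").strip()
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return dict(FALLBACK)  # row is kept, flagged for later review
```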
## Cost Management

### Estimate Costs Before Processing

For a dataset with N entries:
- Qwen: Contact Alibaba Cloud for pricing
- Llama: ~N × $0.0005 = ~$5 per 10k entries
- Mistral: ~N × $0.004 = ~$40 per 10k entries
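The arithmetic above can be wrapped in a tiny helper for budgeting before a run (per-entry rates copied from the estimates above; Qwen is absent because its pricing must be confirmed with Alibaba Cloud):

```python
# Rough per-entry cost assumptions (USD), from the estimates above.
PER_ENTRY_COST = {"llama": 0.0005, "mistral": 0.004}

def estimate_cost(provider: str, n_entries: int) -> float:
    """Estimate total annotation cost in USD for `n_entries` rows."""
    return PER_ENTRY_COST[provider] * n_entries
```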
## Best Practices

- Always test first: Run with `TEST_MODE = True` on 10 samples
- Monitor API usage: Check your API provider's dashboard
- Use cheaper models first: Try Llama before Mistral
- Process in batches: Set `MAX_ROWS` to process incrementally
- Save intermediate results: The automatic saving feature helps prevent data loss
## Comparing Multiple LLMs

To compare results from different LLMs:

1. Run the pipeline with `SELECTED_LLM = 'qwen'`
2. Change to `SELECTED_LLM = 'llama'` and run again
3. Change to `SELECTED_LLM = 'mistral'` and run again
4. Compare the three output files:
   - `qwen_annotated_POI.csv`
   - `llama_annotated_POI.csv`
   - `mistral_annotated_POI.csv`
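Once the output files exist, a quick way to measure how often two models agree on a column (a pure-stdlib sketch; `agreement` is a hypothetical helper, and it assumes both files cover the same rows in the same order):

```python
import csv

def agreement(path_a: str, path_b: str, column: str) -> float:
    """Fraction of rows where two annotated CSVs agree on `column`."""
    with open(path_a, newline="") as fa, open(path_b, newline="") as fb:
        rows_a = list(csv.DictReader(fa))
        rows_b = list(csv.DictReader(fb))
    matches = sum(a[column] == b[column] for a, b in zip(rows_a, rows_b))
    return matches / len(rows_a)
```

For example, `agreement("qwen_annotated_POI.csv", "llama_annotated_POI.csv", "gender")` gives a rough inter-model consistency score for the gender column.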
## Files Created

The pipeline creates these files:

```
data/CSV/
├── NER_POI_step01_pre_annotation.csv   # After name cleaning
├── NER_POI_step02_annotated.csv        # After country mapping
├── qwen_annotated_POI_test.csv         # Test results (Qwen)
├── qwen_annotated_POI.csv              # Full results (Qwen)
├── llama_annotated_POI.csv             # Full results (Llama)
└── mistral_annotated_POI.csv           # Full results (Mistral)

misc/
├── qwen_query_index.txt                # Progress tracking
├── llama_query_index.txt               # Progress tracking
└── mistral_query_index.txt             # Progress tracking
```
## Support

For issues or questions:
- Check this guide for common problems
- Review `misc/credentials/README.md` for API setup
- Read the notebook documentation (first cell)
- Check the API provider documentation for service-specific issues
## Ethical Considerations
This research documents ethical problems with AI deepfake models. The dataset and analysis help:
- Understand the scope of unauthorized person likeness usage
- Document professions/demographics most affected
- Inform policy and technical solutions
- Raise awareness about deepfake technology misuse
Use this data responsibly and respect individual privacy and consent.