
Deepfake Adapter Dataset Processing - Quick Start Guide

Overview

This pipeline processes the real_person_adapters.csv dataset to identify and annotate the real people depicted in deepfake LoRA models, using one of three LLM providers: Qwen, Llama, or Mistral.

Quick Start

1. Prerequisites

# Install required packages
pip install pandas numpy emoji requests tqdm spacy

# Download spaCy English model (for NER)
python -m spacy download en_core_web_sm

Note: The spaCy model will be automatically downloaded when you run the notebook if not already installed.

2. Set Up API Keys

Choose at least ONE LLM provider and get an API key:

Provider  Model          Sign Up Link                   Est. Cost (10k entries)
Qwen      Qwen-Max       https://dashscope.aliyun.com/  Varies
Llama     Llama-3.1-70B  https://www.together.ai/       ~$5-10
Mistral   Mistral Large  https://mistral.ai/            ~$40-80

Create your API key file in misc/credentials/:

# For Qwen
echo "your-api-key-here" > misc/credentials/qwen_api_key.txt

# For Llama (via Together AI)
echo "your-api-key-here" > misc/credentials/together_api_key.txt

# For Mistral
echo "your-api-key-here" > misc/credentials/mistral_api_key.txt
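The notebook reads these files at run time. A minimal sketch of that lookup (the `load_api_key` helper and its exact behavior are assumptions for illustration, not the notebook's actual code):

```python
from pathlib import Path

def load_api_key(provider, cred_dir="misc/credentials"):
    """Read the API key for a provider, stripping stray whitespace.

    File naming follows the guide's convention, e.g. qwen_api_key.txt
    (assumed helper; the notebook may differ in detail).
    """
    path = Path(cred_dir) / f"{provider}_api_key.txt"
    if not path.exists():
        print(f"Warning: No API key for {provider}")
        return None
    # .strip() guards against trailing newlines from `echo`
    return path.read_text().strip()
```

Stripping whitespace here is what prevents the "No API key" and authentication errors described under Troubleshooting below.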

3. Run the Notebook

Open Section_2-3-4_Figure_8_deepfake_adapters.ipynb and:

  1. Run all cells sequentially from top to bottom
  2. The default configuration uses Qwen in test mode (10 samples)
  3. Review the test results
  4. To process the full dataset, set the following in the LLM annotation cell:
    TEST_MODE = False

Pipeline Stages

Stage 1: NER & Name Cleaning

  • Input: data/CSV/real_person_adapters.csv
  • Output: data/CSV/NER_POI_step01_pre_annotation.csv
  • Function: Cleans adapter names to extract real person names
    • Removes: emoji, "lora", "v1", special characters
    • Example: "IU LoRA v2 🎀" → "IU"
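The cleaning step can be sketched as a couple of regex passes. This is a simplified, stdlib-only sketch: the actual pipeline uses the emoji package and a richer noise-token list, so the patterns below are illustrative assumptions.

```python
import re

# Rough emoji codepoint ranges (the pipeline itself uses the `emoji`
# package; this stdlib-only regex is only an approximation)
EMOJI = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F]")
# Noise tokens stripped before NER (illustrative, not the exact list)
NOISE = re.compile(r"\b(lora|v\d+(\.\d+)?)\b", re.IGNORECASE)

def clean_adapter_name(raw):
    """Strip emoji, 'lora'/version tags, and special characters."""
    text = EMOJI.sub("", raw)
    text = NOISE.sub("", text)
    text = re.sub(r"[^\w\s\-']", " ", text)   # drop special characters
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace
```

With these patterns, `clean_adapter_name("IU LoRA v2 🎀")` returns `"IU"`, matching the example above.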

Stage 2: Country/Nationality Mapping

  • Input: Step 1 output + misc/lists/countries.csv
  • Output: data/CSV/NER_POI_step02_annotated.csv
  • Function: Maps tags to standardized countries
    • Example: "korean" → "South Korea"
    • Excludes uninhabited territories
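The mapping itself is a lookup from demonym tags to standardized country names. A minimal sketch, where the three-entry dictionary stands in for the real misc/lists/countries.csv:

```python
# Stand-in for misc/lists/countries.csv (nationality -> country);
# the real list also excludes uninhabited territories
DEMONYMS = {
    "korean": "South Korea",
    "japanese": "Japan",
    "french": "France",
}

def map_tags_to_country(tags):
    """Return the first standardized country a tag maps to, else 'Unknown'."""
    for tag in tags:
        country = DEMONYMS.get(tag.lower())
        if country:
            return country
    return "Unknown"
```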

Stage 3: LLM Profession Annotation

  • Input: Step 2 output + misc/lists/professions.csv
  • Output: data/CSV/{llm}_annotated_POI_test.csv (test) or {llm}_annotated_POI.csv (full)
  • Function: Uses LLM to identify:
    • Full name
    • Gender
    • Up to 3 professions (from profession list)
    • Country
  • Progress: Automatically saves every 10 rows
  • Resumable: Can continue from last saved progress if interrupted
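The save-and-resume behavior follows a common checkpointing pattern: append completed rows every N iterations and record the last processed index in a side file. A sketch of that pattern (function and file handling here are illustrative, not the notebook's actual code):

```python
import csv
import os

def annotate_resumable(rows, annotate_fn, out_csv, index_file, save_interval=10):
    """Annotate rows, checkpointing every `save_interval` rows.

    Resumes from the row index recorded in `index_file`, so an
    interrupted run picks up where it left off.
    """
    start = 0
    if os.path.exists(index_file):
        start = int(open(index_file).read().strip())
    pending = []
    for i, row in enumerate(rows[start:], start=start):
        pending.append(annotate_fn(row))
        # Flush on every save_interval-th row and on the final row
        if (i + 1) % save_interval == 0 or i == len(rows) - 1:
            with open(out_csv, "a", newline="") as f:
                csv.writer(f).writerows(pending)
            pending = []
            with open(index_file, "w") as f:
                f.write(str(i + 1))
```

Because results are appended and the index file is updated only after a successful write, re-running after an interruption never duplicates or skips rows.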

Configuration Options

In the LLM annotation cell, you can configure:

# Choose LLM provider
SELECTED_LLM = 'qwen'  # Options: 'qwen', 'llama', 'mistral'

# Test mode (recommended for first run)
TEST_MODE = True       # True = test on small sample
TEST_SIZE = 10         # Number of rows for testing

# Processing limits
MAX_ROWS = 20000      # Maximum rows to process (None = all)
SAVE_INTERVAL = 10    # Save progress every N rows

Expected Output Format

The final dataset will include all original columns plus:

Column              Description            Example
real_name           Cleaned name           "IU"
full_name           Full name from LLM     "Lee Ji-eun (IU)"
gender              Gender from LLM        "Female"
profession_llm      Up to 3 professions    "singer, actor, celebrity"
country             Country from LLM       "South Korea"
likely_country      Country from tags      "South Korea"
likely_nationality  Nationality from tags  "South Korean"
tags                Combined tags          "['korean', 'celebrity', 'singer']"

Troubleshooting

API Key Errors

Warning: No API key for qwen

Solution: Ensure your API key file exists and contains only the key (no extra whitespace)

Rate Limiting

Qwen API error (attempt 1/3): 429 Too Many Requests

Solution: The code automatically retries with exponential backoff. You can also:

  • Increase the delay in time.sleep(0.5) between requests
  • Process in smaller batches
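The retry-with-backoff behavior can be sketched as follows (illustrative; the notebook's actual attempt count and delays may differ):

```python
import time

def call_with_retry(request_fn, max_attempts=3, base_delay=1.0):
    """Retry a request with exponential backoff (1s, 2s, 4s, ...).

    Re-raises the last error once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception as err:
            print(f"API error (attempt {attempt + 1}/{max_attempts}): {err}")
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```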

Progress Lost

Solution: The pipeline saves progress automatically. Check:

  • data/CSV/{llm}_annotated_POI_test.csv - your partial results
  • misc/{llm}_query_index.txt - last processed index

Then re-run the cell and it will resume from the last saved progress.

JSON Parse Errors from LLM

Qwen API error: JSONDecodeError

Solution: This is usually temporary. The code:

  • Returns "Unknown" for failed queries
  • Continues processing
  • You can manually review/reprocess failed entries later
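A sketch of that fallback, with field names assumed from the output table above:

```python
import json

# Placeholder record returned when the LLM reply cannot be parsed
# (field names assumed from the expected output format)
UNKNOWN = {"full_name": "Unknown", "gender": "Unknown",
           "profession_llm": "Unknown", "country": "Unknown"}

def parse_llm_reply(text):
    """Parse the LLM's JSON reply; fall back to 'Unknown' fields
    on malformed output so the run keeps going."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return dict(UNKNOWN)
```

Rows filled with "Unknown" are easy to filter out afterwards for manual review or reprocessing.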

Cost Management

Estimate Costs Before Processing

For a dataset with N entries:

  • Qwen: Contact Alibaba Cloud for pricing
  • Llama: ~N × $0.0005 = ~$5 per 10k entries
  • Mistral: ~N × $0.004 = ~$40 per 10k entries
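These estimates are simple per-entry multiplications; as a sanity check:

```python
# Approximate per-entry rates (USD) from the table above; Qwen pricing varies
RATES = {"llama": 0.0005, "mistral": 0.004}

def estimate_cost(n_entries, provider):
    """Rough USD cost estimate for annotating n_entries rows."""
    return n_entries * RATES[provider]
```

For example, `estimate_cost(10_000, "llama")` gives 5.0 and `estimate_cost(10_000, "mistral")` gives 40.0, matching the figures above.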

Best Practices

  1. Always test first: Run with TEST_MODE = True on 10 samples
  2. Monitor API usage: Check your API provider's dashboard
  3. Use cheaper models first: Try Llama before Mistral
  4. Process in batches: Set MAX_ROWS to process incrementally
  5. Save intermediate results: The automatic saving feature helps prevent data loss

Comparing Multiple LLMs

To compare results from different LLMs:

  1. Run the pipeline with SELECTED_LLM = 'qwen'
  2. Change to SELECTED_LLM = 'llama' and run again
  3. Change to SELECTED_LLM = 'mistral' and run again
  4. Compare the three output files:
    • qwen_annotated_POI.csv
    • llama_annotated_POI.csv
    • mistral_annotated_POI.csv
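Once the output files exist, a quick agreement check between any two of them might look like this (a hypothetical comparison helper, not part of the pipeline; column names taken from the output table above):

```python
import csv

def column_agreement(file_a, file_b, key="real_name", col="gender"):
    """Fraction of shared rows where two output files agree on a column,
    joined on `key`. Illustrative helper for comparing LLM outputs."""
    def load(path):
        with open(path, newline="") as f:
            return {row[key]: row[col] for row in csv.DictReader(f)}
    a, b = load(file_a), load(file_b)
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(a[k] == b[k] for k in shared) / len(shared)
```

Running this per column (gender, profession_llm, country) gives a simple picture of where the three LLMs disagree and which entries deserve manual review.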

Files Created

The pipeline creates these files:

data/CSV/
├── NER_POI_step01_pre_annotation.csv     # After name cleaning
├── NER_POI_step02_annotated.csv          # After country mapping
├── qwen_annotated_POI_test.csv           # Test results (Qwen)
├── qwen_annotated_POI.csv                # Full results (Qwen)
├── llama_annotated_POI.csv               # Full results (Llama)
└── mistral_annotated_POI.csv             # Full results (Mistral)

misc/
├── qwen_query_index.txt                  # Progress tracking
├── llama_query_index.txt                 # Progress tracking
└── mistral_query_index.txt               # Progress tracking

Support

For issues or questions:

  1. Check this guide for common problems
  2. Review misc/credentials/README.md for API setup
  3. Read the notebook documentation (first cell)
  4. Check API provider documentation for service-specific issues

Ethical Considerations

This research documents ethical problems with AI deepfake models. The dataset and analysis help:

  • Understand the scope of unauthorized person likeness usage
  • Document professions/demographics most affected
  • Inform policy and technical solutions
  • Raise awareness about deepfake technology misuse

Use this data responsibly and respect individual privacy and consent.