# Deepfake Adapter Dataset Processing - Quick Start Guide
## Overview
This pipeline processes the `real_person_adapters.csv` dataset to identify and annotate the real people used in deepfake LoRA models, using three LLM options: **Qwen**, **Llama**, and **Mistral**.
## Quick Start
### 1. Prerequisites
```bash
# Install required packages
pip install pandas numpy emoji requests tqdm spacy

# Download the spaCy English model (for NER)
python -m spacy download en_core_web_sm
```
**Note**: If the spaCy model is not already installed, it will be downloaded automatically when you run the notebook.
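If you want the same safety net in your own scripts, the auto-download check can be sketched like this (`load_spacy_model` is a hypothetical helper; the notebook's actual logic may differ):

```python
import importlib
import subprocess
import sys

def load_spacy_model(name="en_core_web_sm"):
    """Load a spaCy model, downloading it first if it is missing."""
    spacy = importlib.import_module("spacy")
    try:
        return spacy.load(name)
    except OSError:
        # Model not installed yet: download via the spaCy CLI, then retry.
        subprocess.run([sys.executable, "-m", "spacy", "download", name],
                       check=True)
        return spacy.load(name)
```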
### 2. Set Up API Keys
Choose at least ONE LLM provider and get an API key:

| Provider | Model | Sign-Up Link | Est. Cost (10k entries) |
|----------|-------|--------------|-------------------------|
| **Qwen** | Qwen-Max | https://dashscope.aliyun.com/ | Varies |
| **Llama** | Llama-3.1-70B | https://www.together.ai/ | ~$5-10 |
| **Mistral** | Mistral Large | https://mistral.ai/ | ~$40-80 |

Create your API key file in `misc/credentials/`:
```bash
# For Qwen
echo "your-api-key-here" > misc/credentials/qwen_api_key.txt

# For Llama (via Together AI)
echo "your-api-key-here" > misc/credentials/together_api_key.txt

# For Mistral
echo "your-api-key-here" > misc/credentials/mistral_api_key.txt
```
### 3. Run the Notebook
Open `Section_2-3-4_Figure_8_deepfake_adapters.ipynb` and:
1. **Run all cells sequentially** from top to bottom
2. The default configuration uses Qwen in test mode (10 samples)
3. Review the test results
4. To process the full dataset, change this in the LLM annotation cell:
```python
TEST_MODE = False
```
## Pipeline Stages
### Stage 1: NER & Name Cleaning
- **Input**: `data/CSV/real_person_adapters.csv`
- **Output**: `data/CSV/NER_POI_step01_pre_annotation.csv`
- **Function**: Cleans adapter names to extract real person names
  - Removes emoji, "lora", version tags such as "v1", and special characters
  - Example: "IU LoRA v2 🤖" → "IU"
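A minimal sketch of this cleaning step (the regexes below are illustrative; the notebook's actual rules are more extensive, and the character filter here drops emoji as a side effect):

```python
import re

# Hypothetical noise pattern: 'lora'-style keywords and version tags like v1, v2.1
NOISE = re.compile(r"\b(lora|locon|v\d+(\.\d+)?)\b", re.IGNORECASE)

def clean_adapter_name(raw: str) -> str:
    """Strip version tags, 'lora' keywords, emoji, and special characters."""
    text = NOISE.sub("", raw)                      # drop 'lora', 'v1', 'v2.1', ...
    text = re.sub(r"[^A-Za-z\s\-'.]", " ", text)   # drop emoji and special chars
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace
```

For example, `clean_adapter_name("IU LoRA v2 🤖")` returns `"IU"`, while hyphens and apostrophes in names like "Lee Ji-eun" are preserved.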
### Stage 2: Country/Nationality Mapping
- **Input**: Step 1 output + `misc/lists/countries.csv`
- **Output**: `data/CSV/NER_POI_step02_annotated.csv`
- **Function**: Maps tags to standardized country names
- Example: "korean" → "South Korea"
- Excludes uninhabited territories
### Stage 3: LLM Profession Annotation
- **Input**: Step 2 output + `misc/lists/professions.csv`
- **Output**: `data/CSV/{llm}_annotated_POI_test.csv` (test) or `{llm}_annotated_POI.csv` (full)
- **Function**: Uses the LLM to identify:
  - Full name
  - Gender
  - Up to 3 professions (from the profession list)
  - Country
- **Progress**: Automatically saves every 10 rows
- **Resumable**: Continues from the last saved progress if interrupted
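The save/resume behavior can be sketched as follows (hypothetical function and file names, modeled on the outputs this guide describes):

```python
import os

import pandas as pd

def annotate_resumable(df, annotate_row, out_csv, index_file, save_interval=10):
    """Annotate rows one by one, checkpointing progress every save_interval rows."""
    # Resume from the last saved index if a previous run was interrupted.
    start = 0
    if os.path.exists(index_file):
        start = int(open(index_file).read().strip())
    for i in range(start, len(df)):
        df.loc[i, "profession_llm"] = annotate_row(df.loc[i])
        if (i + 1) % save_interval == 0:
            df.to_csv(out_csv, index=False)         # partial results
            with open(index_file, "w") as f:
                f.write(str(i + 1))                 # last processed index
    df.to_csv(out_csv, index=False)
    return df
```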
## Configuration Options
In the LLM annotation cell, you can configure:
```python
# Choose LLM provider
SELECTED_LLM = 'qwen'  # Options: 'qwen', 'llama', 'mistral'

# Test mode (recommended for a first run)
TEST_MODE = True   # True = test on a small sample
TEST_SIZE = 10     # Number of rows for testing

# Processing limits
MAX_ROWS = 20000   # Maximum rows to process (None = all)
SAVE_INTERVAL = 10 # Save progress every N rows
```
## Expected Output Format
The final dataset will include all original columns plus:

| Column | Description | Example |
|--------|-------------|---------|
| `real_name` | Cleaned name | "IU" |
| `full_name` | Full name from LLM | "Lee Ji-eun (IU)" |
| `gender` | Gender from LLM | "Female" |
| `profession_llm` | Up to 3 professions | "singer, actor, celebrity" |
| `country` | Country from LLM | "South Korea" |
| `likely_country` | Country from tags | "South Korea" |
| `likely_nationality` | Nationality from tags | "South Korean" |
| `tags` | Combined tags | "['korean', 'celebrity', 'singer']" |
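Note that `tags` is stored as the string form of a Python list, so when loading the CSV back into pandas you need to parse it; `ast.literal_eval` is the safe standard-library way to do that (a sketch; `parse_tags` is a hypothetical helper):

```python
import ast

def parse_tags(cell: str) -> list[str]:
    """Convert a stringified list like "['korean', 'singer']" back to a list."""
    try:
        value = ast.literal_eval(cell)
        return value if isinstance(value, list) else []
    except (ValueError, SyntaxError):
        # Malformed cells (e.g. truncated strings) fall back to an empty list.
        return []
```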
## Troubleshooting
### API Key Errors
```
Warning: No API key for qwen
```
**Solution**: Ensure your API key file exists and contains only the key (no extra whitespace).

### Rate Limiting
```
Qwen API error (attempt 1/3): 429 Too Many Requests
```
**Solution**: The code automatically retries with exponential backoff. You can also:
- Increase the `time.sleep(0.5)` delay between requests
- Process in smaller batches
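The retry behavior described above follows the usual exponential-backoff pattern, sketched here (the notebook's actual constants and error handling may differ):

```python
import time

def call_with_backoff(request_fn, max_attempts=3, base_delay=1.0):
    """Retry request_fn with exponentially growing delays between attempts."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"API error (attempt {attempt + 1}/{max_attempts}): {exc}")
            time.sleep(delay)
```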
### Progress Lost
**Solution**: The pipeline saves progress automatically. Check:
- `data/CSV/{llm}_annotated_POI_test.csv` - your partial results
- `misc/{llm}_query_index.txt` - the last processed index

Simply re-run the cell; it will resume from the last saved progress.

### JSON Parse Errors from the LLM
```
Qwen API error: JSONDecodeError
```
**Solution**: This is usually temporary. The code:
- Returns "Unknown" for failed queries
- Continues processing
- Lets you manually review and reprocess failed entries later
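The fallback behavior can be sketched as follows (hypothetical field names, matching the output columns described earlier):

```python
import json

# Placeholder row used when the LLM's reply cannot be parsed.
FALLBACK = {"full_name": "Unknown", "gender": "Unknown",
            "profession_llm": "Unknown", "country": "Unknown"}

def parse_llm_response(raw: str) -> dict:
    """Parse the LLM's JSON reply; on failure, return 'Unknown' placeholders."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Failed queries get placeholders so processing continues; rows whose
        # fields are all "Unknown" can be filtered and reprocessed later.
        return dict(FALLBACK)
```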
## Cost Management
### Estimate Costs Before Processing
For a dataset with N entries:
- **Qwen**: Contact Alibaba Cloud for pricing
- **Llama**: ~N × $0.0005 = ~$5 per 10k entries
- **Mistral**: ~N × $0.004 = ~$40 per 10k entries
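A quick back-of-the-envelope estimator using the approximate per-entry rates above (these rates are rough and change over time; always check your provider's current pricing):

```python
# Approximate per-entry costs in USD, taken from the estimates above.
COST_PER_ENTRY = {"llama": 0.0005, "mistral": 0.004}

def estimate_cost(n_entries: int, provider: str) -> float:
    """Rough USD cost of annotating n_entries with the given provider."""
    return n_entries * COST_PER_ENTRY[provider]
```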
### Best Practices
1. **Always test first**: Run with `TEST_MODE = True` on 10 samples
2. **Monitor API usage**: Check your API provider's dashboard
3. **Use cheaper models first**: Try Llama before Mistral
4. **Process in batches**: Set `MAX_ROWS` to process incrementally
5. **Save intermediate results**: The automatic saving feature helps prevent data loss
## Comparing Multiple LLMs
To compare results from different LLMs:
1. Run the pipeline with `SELECTED_LLM = 'qwen'`
2. Change to `SELECTED_LLM = 'llama'` and run again
3. Change to `SELECTED_LLM = 'mistral'` and run again
4. Compare the three output files:
   - `qwen_annotated_POI.csv`
   - `llama_annotated_POI.csv`
   - `mistral_annotated_POI.csv`
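The three output files can then be lined up side by side, for example with a pandas merge (a sketch; `compare_annotations` is a hypothetical helper and the column names follow the output format described earlier):

```python
import pandas as pd

def compare_annotations(paths: dict[str, str], key: str = "real_name") -> pd.DataFrame:
    """Merge per-LLM annotation CSVs on `key`, suffixing columns by provider."""
    merged = None
    for llm, path in paths.items():
        df = pd.read_csv(path)[[key, "gender", "profession_llm", "country"]]
        # Suffix every non-key column so e.g. 'gender' becomes 'gender_qwen'.
        df = df.rename(columns={c: f"{c}_{llm}" for c in df.columns if c != key})
        merged = df if merged is None else merged.merge(df, on=key, how="outer")
    return merged
```

Rows where the providers disagree (e.g. `gender_qwen != gender_llama`) are natural candidates for manual review.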
## Files Created
The pipeline creates these files:
```
data/CSV/
├── NER_POI_step01_pre_annotation.csv   # After name cleaning
├── NER_POI_step02_annotated.csv        # After country mapping
├── qwen_annotated_POI_test.csv         # Test results (Qwen)
├── qwen_annotated_POI.csv              # Full results (Qwen)
├── llama_annotated_POI.csv             # Full results (Llama)
└── mistral_annotated_POI.csv           # Full results (Mistral)
misc/
├── qwen_query_index.txt                # Progress tracking
├── llama_query_index.txt               # Progress tracking
└── mistral_query_index.txt             # Progress tracking
```
## Support
For issues or questions:
1. Check this guide for common problems
2. Review `misc/credentials/README.md` for API setup
3. Read the notebook documentation (first cell)
4. Check your API provider's documentation for service-specific issues
## Ethical Considerations
This research documents ethical problems with AI deepfake models. The dataset and analysis help:
- Understand the scope of unauthorized use of personal likenesses
- Document the professions and demographics most affected
- Inform policy and technical solutions
- Raise awareness of deepfake technology misuse

Use this data responsibly and respect individual privacy and consent.