# Deepfake Adapter Dataset Processing - Quick Start Guide
## Overview
This pipeline processes the `real_person_adapters.csv` dataset to identify and annotate real people used in deepfake LoRA models using three LLM options: **Qwen**, **Llama**, and **Mistral**.
## Quick Start
### 1. Prerequisites
```bash
# Install required packages
pip install pandas numpy emoji requests tqdm spacy
# Download spaCy English model (for NER)
python -m spacy download en_core_web_sm
```
**Note**: The spaCy model will be automatically downloaded when you run the notebook if not already installed.
### 2. Set Up API Keys
Choose at least ONE LLM provider and get an API key:
| Provider | Model | Sign Up Link | Est. Cost (10k entries) |
|----------|-------|--------------|-------------------------|
| **Qwen** | Qwen-Max | https://dashscope.aliyun.com/ | Varies |
| **Llama** | Llama-3.1-70B | https://www.together.ai/ | ~$5-10 |
| **Mistral** | Mistral Large | https://mistral.ai/ | ~$40-80 |
Create your API key file in `misc/credentials/`:
```bash
# For Qwen
echo "your-api-key-here" > misc/credentials/qwen_api_key.txt
# For Llama (via Together AI)
echo "your-api-key-here" > misc/credentials/together_api_key.txt
# For Mistral
echo "your-api-key-here" > misc/credentials/mistral_api_key.txt
```
### 3. Run the Notebook
Open `Section_2-3-4_Figure_8_deepfake_adapters.ipynb` and:
1. **Run all cells sequentially** from top to bottom
2. The default configuration uses Qwen in test mode (10 samples)
3. Review the test results
4. To process the full dataset, change in the LLM annotation cell:
```python
TEST_MODE = False
```
## Pipeline Stages
### Stage 1: NER & Name Cleaning
- **Input**: `data/CSV/real_person_adapters.csv`
- **Output**: `data/CSV/NER_POI_step01_pre_annotation.csv`
- **Function**: Cleans adapter names to extract real person names
- Removes emoji, "lora", version tokens (e.g. "v1"), and special characters
- Example: "IU LoRA v2 🤗" → "IU"
### Stage 2: Country/Nationality Mapping
- **Input**: Step 1 output + `misc/lists/countries.csv`
- **Output**: `data/CSV/NER_POI_step02_annotated.csv`
- **Function**: Maps tags to standardized countries
- Example: "korean" β "South Korea"
- Excludes uninhabited territories
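The mapping logic amounts to a tag-to-country lookup. A minimal sketch, with a hypothetical inline dictionary standing in for the mapping that the notebook loads from `misc/lists/countries.csv`:

```python
# Hypothetical tag-to-country lookup; in the notebook this mapping is
# built from misc/lists/countries.csv (uninhabited territories excluded).
TAG_TO_COUNTRY = {
    "korean": "South Korea",
    "japanese": "Japan",
    "american": "United States",
}

def likely_country(tags):
    """Return the country for the first recognized tag, else 'Unknown'."""
    for tag in tags:
        country = TAG_TO_COUNTRY.get(tag.lower())
        if country:
            return country
    return "Unknown"
```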
### Stage 3: LLM Profession Annotation
- **Input**: Step 2 output + `misc/lists/professions.csv`
- **Output**: `data/CSV/{llm}_annotated_POI_test.csv` (test) or `{llm}_annotated_POI.csv` (full)
- **Function**: Uses LLM to identify:
- Full name
- Gender
- Up to 3 professions (from profession list)
- Country
- **Progress**: Automatically saves every 10 rows
- **Resumable**: Can continue from last saved progress if interrupted
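The save/resume mechanism works by persisting the last processed row index to a small text file. A sketch of that behavior, assuming the `misc/{llm}_query_index.txt` files described in this guide (the helper names are illustrative):

```python
import os

def load_start_index(index_file: str) -> int:
    """Resume from the last saved row index, or start at 0."""
    if os.path.exists(index_file):
        with open(index_file) as f:
            return int(f.read().strip())
    return 0

def save_index(index_file: str, i: int) -> None:
    """Record progress so an interrupted run can pick up where it left off."""
    with open(index_file, "w") as f:
        f.write(str(i))

# Sketch of the annotation loop (annotate_row is a hypothetical per-row LLM call):
# for i in range(load_start_index("misc/qwen_query_index.txt"), len(df)):
#     annotate_row(df.iloc[i])
#     if (i + 1) % SAVE_INTERVAL == 0:
#         df.to_csv(output_file, index=False)
#         save_index("misc/qwen_query_index.txt", i + 1)
```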
## Configuration Options
In the LLM annotation cell, you can configure:
```python
# Choose LLM provider
SELECTED_LLM = 'qwen' # Options: 'qwen', 'llama', 'mistral'
# Test mode (recommended for first run)
TEST_MODE = True # True = test on small sample
TEST_SIZE = 10 # Number of rows for testing
# Processing limits
MAX_ROWS = 20000 # Maximum rows to process (None = all)
SAVE_INTERVAL = 10 # Save progress every N rows
```
## Expected Output Format
The final dataset will include all original columns plus:
| Column | Description | Example |
|--------|-------------|---------|
| `real_name` | Cleaned name | "IU" |
| `full_name` | Full name from LLM | "Lee Ji-eun (IU)" |
| `gender` | Gender from LLM | "Female" |
| `profession_llm` | Up to 3 professions | "singer, actor, celebrity" |
| `country` | Country from LLM | "South Korea" |
| `likely_country` | Country from tags | "South Korea" |
| `likely_nationality` | Nationality from tags | "South Korean" |
| `tags` | Combined tags | "['korean', 'celebrity', 'singer']" |
## Troubleshooting
### API Key Errors
```
Warning: No API key for qwen
```
**Solution**: Ensure your API key file exists and contains only the key (no extra whitespace)
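Stray whitespace or a trailing newline in the key file is the usual culprit, so stripping on read avoids it. A minimal sketch (the helper name is illustrative):

```python
from pathlib import Path

def load_api_key(path: str) -> str:
    """Read an API key file, stripping whitespace/newlines that would
    otherwise corrupt the key sent in request headers."""
    return Path(path).read_text().strip()
```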
### Rate Limiting
```
Qwen API error (attempt 1/3): 429 Too Many Requests
```
**Solution**: The code automatically retries with exponential backoff. You can also:
- Increase the delay in `time.sleep(0.5)` to a higher value
- Process in smaller batches
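The retry behavior described above can be sketched as a generic wrapper with exponential backoff (this is an illustration of the pattern, not the notebook's exact implementation):

```python
import time

def query_with_retry(query_fn, max_attempts=3, base_delay=1.0):
    """Call a flaky API function, retrying with exponential backoff.
    query_fn is any zero-argument callable that may raise on failure."""
    for attempt in range(max_attempts):
        try:
            return query_fn()
        except Exception as err:
            if attempt == max_attempts - 1:
                raise                                  # give up after last attempt
            delay = base_delay * (2 ** attempt)        # 1s, 2s, 4s, ...
            print(f"API error (attempt {attempt + 1}/{max_attempts}): {err}")
            time.sleep(delay)
```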
### Progress Lost
**Solution**: The pipeline saves progress automatically. Check:
- `data/CSV/{llm}_annotated_POI_test.csv` - your partial results
- `misc/{llm}_query_index.txt` - last processed index
- Just re-run the cell and it will resume from the last saved progress
### JSON Parse Errors from LLM
```
Qwen API error: JSONDecodeError
```
**Solution**: This is usually temporary. The code:
- Returns "Unknown" for failed queries
- Continues processing
- You can manually review/reprocess failed entries later
## Cost Management
### Estimate Costs Before Processing
For a dataset with N entries:
- **Qwen**: Contact Alibaba Cloud for pricing
- **Llama**: ~N × $0.0005 = ~$5 per 10k entries
- **Mistral**: ~N × $0.004 = ~$40 per 10k entries
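A back-of-envelope calculation using the approximate per-entry rates above:

```python
# Approximate per-entry rates from this guide (USD); actual pricing
# varies by provider and model version.
RATE_PER_ENTRY = {"llama": 0.0005, "mistral": 0.004}

def estimate_cost(n_entries: int, provider: str) -> float:
    """Rough cost estimate: entries times per-entry rate."""
    return n_entries * RATE_PER_ENTRY[provider]
```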
### Best Practices
1. **Always test first**: Run with `TEST_MODE = True` on 10 samples
2. **Monitor API usage**: Check your API provider's dashboard
3. **Use cheaper models first**: Try Llama before Mistral
4. **Process in batches**: Set `MAX_ROWS` to process incrementally
5. **Save intermediate results**: The automatic saving feature helps prevent data loss
## Comparing Multiple LLMs
To compare results from different LLMs:
1. Run the pipeline with `SELECTED_LLM = 'qwen'`
2. Change to `SELECTED_LLM = 'llama'` and run again
3. Change to `SELECTED_LLM = 'mistral'` and run again
4. Compare the three output files:
- `qwen_annotated_POI.csv`
- `llama_annotated_POI.csv`
- `mistral_annotated_POI.csv`
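Once the three output files exist, a simple comparison is to measure how often two models agree on a given column. A minimal sketch on inline sample data (in practice you would load the columns from the three CSVs, e.g. with pandas):

```python
def agreement_rate(col_a, col_b):
    """Fraction of rows where two LLMs produced the same label."""
    matches = sum(a == b for a, b in zip(col_a, col_b))
    return matches / len(col_a)

# Hypothetical sample: 'gender' labels from two runs.
qwen_gender  = ["Female", "Male", "Female", "Unknown"]
llama_gender = ["Female", "Male", "Male", "Unknown"]
```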
## Files Created
The pipeline creates these files:
```
data/CSV/
├── NER_POI_step01_pre_annotation.csv   # After name cleaning
├── NER_POI_step02_annotated.csv        # After country mapping
├── qwen_annotated_POI_test.csv         # Test results (Qwen)
├── qwen_annotated_POI.csv              # Full results (Qwen)
├── llama_annotated_POI.csv             # Full results (Llama)
└── mistral_annotated_POI.csv           # Full results (Mistral)
misc/
├── qwen_query_index.txt                # Progress tracking
├── llama_query_index.txt               # Progress tracking
└── mistral_query_index.txt             # Progress tracking
```
## Support
For issues or questions:
1. Check this guide for common problems
2. Review `misc/credentials/README.md` for API setup
3. Read the notebook documentation (first cell)
4. Check API provider documentation for service-specific issues
## Ethical Considerations
This research documents ethical problems with AI deepfake models. The dataset and analysis help:
- Understand the scope of unauthorized person likeness usage
- Document professions/demographics most affected
- Inform policy and technical solutions
- Raise awareness about deepfake technology misuse
Use this data responsibly and respect individual privacy and consent.