# Quick Testing Instructions

## Start Here! 🚀

You mentioned you have Deepseek credits, so **test with Deepseek first** before trying the other LLMs.

## Step-by-Step Testing

### 1. Make sure your Deepseek API key is in place

Check if this file exists:
```bash
cat misc/credentials/deepseek_api_key.txt
```

If not, create it:
```bash
echo "your-deepseek-api-key" > misc/credentials/deepseek_api_key.txt
```
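
For reference, here is a minimal sketch of how a notebook cell might read that key file. The helper name `load_api_key` is hypothetical and not taken from the notebook; only the file path comes from the steps above. Note that `echo` appends a trailing newline, so the key should be stripped before use.

```python
from pathlib import Path


def load_api_key(path="misc/credentials/deepseek_api_key.txt"):
    """Return the API key stored in `path`, without surrounding whitespace.

    Hypothetical helper for illustration; the notebook may load the key differently.
    """
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"{path} not found; create it with your API key")
    # `echo "key" > file` writes a trailing newline, so strip it off.
    key = key_file.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty")
    return key
```

If the file is missing or empty, the sketch fails early with a clear message instead of sending an invalid key to the API.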

### 2. Open the notebook

```bash
jupyter notebook jupyter_notebooks/Section_2-3-4_Figure_8_deepfake_adapters.ipynb
```

### 3. Run the cells in order

1. **Cells 0-4**: Introduction and setup (just markdown, no execution needed)
2. **Cell 5**: NER & Name Cleaning (processes `real_person_adapters.csv`)
3. **Cell 7**: Country/Nationality Mapping
4. **Cell 10**: 🌟 **DEEPSEEK ANNOTATION** (TEST THIS FIRST!)
   - Default: `TEST_MODE = True` (10 samples)
   - Will create: `data/CSV/deepseek_annotated_POI_test.csv`
5. **Cell 12**: Qwen/Llama/Mistral (run later after Deepseek works)
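
To make the `TEST_MODE` switch concrete, here is a rough sketch of how such a flag could pick the sample and the output file. The helper names `select_rows` and `output_path` are hypothetical, and taking the first 10 rows is an assumption; the notebook may sample differently. The file names follow the ones listed in this guide.

```python
TEST_MODE = True
TEST_SAMPLE_SIZE = 10  # the 10-sample default mentioned for Cell 10


def select_rows(rows, test_mode=TEST_MODE, sample_size=TEST_SAMPLE_SIZE):
    """Return only the first `sample_size` rows in test mode, else all rows."""
    return rows[:sample_size] if test_mode else rows


def output_path(llm="deepseek", test_mode=TEST_MODE):
    """Build the output CSV path; test runs get a `_test` suffix."""
    suffix = "_test" if test_mode else ""
    return f"data/CSV/{llm}_annotated_POI{suffix}.csv"
```

With this layout, a test run and a full run write to different files, so a quick test does not overwrite full results.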

### 4. Review Deepseek Results

After Cell 10 completes, check:
- Console output shows summary statistics
- Output file: `data/CSV/deepseek_annotated_POI_test.csv`

Example output should look like:
```
✅ Progress saved after 10 rows
✅ Done! Final results saved to data/CSV/deepseek_annotated_POI_test.csv

=== Summary Statistics ===
Total processed: 10

Gender distribution:
Female    8
Male      2
...
```
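
If you want to double-check the summary yourself, a small standard-library sketch like the one below can recount a column from the output file. The function name `column_distribution` is hypothetical, and the column name you pass in (for example `"gender"`) is an assumption; check the CSV header for the actual name.

```python
import csv
from collections import Counter


def column_distribution(csv_path, column):
    """Count how often each value appears in `column` of a CSV with a header row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Raises KeyError if `column` is not in the header.
        return Counter(row[column] for row in reader)
```

For example, `column_distribution("data/CSV/deepseek_annotated_POI_test.csv", "gender")` would return the per-value counts, assuming that column exists.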

### 5. If Deepseek Works Well

Once you're satisfied with the Deepseek results:

**Option A: Process full dataset with Deepseek**
```python
# In Cell 10, change:
TEST_MODE = False
```

**Option B: Try other LLMs for comparison**
1. Set up API keys for Qwen/Llama/Mistral (see `misc/credentials/README.md`)
2. Run Cell 12 with your chosen LLM:
   ```python
   SELECTED_LLM = 'qwen'  # or 'llama' or 'mistral'
   TEST_MODE = True       # Test first!
   ```

## Expected Cost (Deepseek)

- **10 samples** (test): ~$0.01 or less
- **1,000 entries**: ~$0.10-0.20
- **10,000 entries**: ~$1-2

Much cheaper than the other options, making it perfect for testing!
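
As a rough planning aid, the figures above imply a per-entry cost of about $0.0001 to $0.0002. The sketch below simply scales that range linearly; the constant and function names are hypothetical, and real costs depend on prompt length and current pricing.

```python
# Per-entry cost range implied by the "1,000 entries: ~$0.10-0.20" estimate.
COST_PER_ENTRY_LOW = 0.10 / 1000
COST_PER_ENTRY_HIGH = 0.20 / 1000


def estimate_cost(n_entries):
    """Return a (low, high) cost estimate in USD for `n_entries` rows."""
    return (n_entries * COST_PER_ENTRY_LOW, n_entries * COST_PER_ENTRY_HIGH)


low, high = estimate_cost(10_000)
print(f"10,000 entries: ${low:.2f} to ${high:.2f}")
```

This is only a linear extrapolation of the estimates listed here, not a quote from the provider.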

## Troubleshooting

### "deepseek_api_key.txt not found"
```bash
# Create the file with your key
echo "your-api-key" > misc/credentials/deepseek_api_key.txt
```

### "File does not exist: real_person_adapters.csv"
Make sure the input file exists:
```bash
ls -lh data/CSV/real_person_adapters.csv
```

### API Rate Limiting
The code includes automatic rate limiting (`time.sleep(1)` between requests). If you still get rate limited:
- Increase the sleep time in Cell 10: change `time.sleep(1)` to `time.sleep(2)`
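
If a longer fixed sleep is not enough, one alternative is to retry with exponential backoff. This is a different technique from the fixed `time.sleep(1)` the notebook uses, and the helper below is a hypothetical sketch: `request_fn` stands in for whatever function sends one API request, and in real use you would catch only the client's rate-limit error rather than every exception.

```python
import time


def call_with_backoff(request_fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `request_fn`, retrying on failure with delays of 1s, 2s, 4s, ...

    Re-raises the last exception once `max_retries` attempts have failed.
    """
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception:  # assumption: narrow this to the rate-limit error in real use
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists so the delay can be swapped out in tests; in the notebook you would leave it at the default.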

### Pipeline Interrupted
No problem! The code saves progress every 10 rows. Just re-run the cell and it will resume from where it left off.
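
The resume behavior can be sketched roughly as follows. This is an illustration, not the notebook's code: it assumes the index file holds a single integer (the number of rows already processed), and `read_index`, `process_with_resume`, and `annotate` are hypothetical names. The real pipeline also saves the annotated rows to CSV at each checkpoint, which this sketch leaves out.

```python
from pathlib import Path


def read_index(index_path):
    """Return the saved row index, or 0 if no progress file exists yet."""
    p = Path(index_path)
    if not p.is_file():
        return 0
    text = p.read_text(encoding="utf-8").strip()
    return int(text) if text else 0


def process_with_resume(rows, annotate, index_path, save_every=10):
    """Annotate rows from the saved index onward, checkpointing every `save_every` rows."""
    start = read_index(index_path)
    results = []
    for i in range(start, len(rows)):
        results.append(annotate(rows[i]))
        done = i + 1
        # Record progress at each checkpoint and once more at the very end.
        if done % save_every == 0 or done == len(rows):
            Path(index_path).write_text(str(done), encoding="utf-8")
    return results
```

Under this scheme an interruption at row 17 restarts from row 10, the last checkpoint, so at most `save_every - 1` rows are repeated.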

## What's Next?

After testing with Deepseek:

1. **If results look good**: Scale up to full dataset with Deepseek
2. **Compare LLMs**: Test Qwen/Llama/Mistral on the same sample to see which gives best results
3. **Production run**: Choose your preferred LLM and process the full dataset

## File Outputs

The pipeline creates these files:

```
data/CSV/
├── NER_POI_step01_pre_annotation.csv       # After Cell 5 (name cleaning)
├── NER_POI_step02_annotated.csv            # After Cell 7 (country mapping)
├── deepseek_annotated_POI_test.csv         # After Cell 10 (test mode)
├── deepseek_annotated_POI.csv              # After Cell 10 (full mode)
├── qwen_annotated_POI_test.csv             # After Cell 12 (if using Qwen)
└── ...

misc/
├── deepseek_query_index.txt                # Progress tracking
└── ...
```

## Quick Commands

```bash
# View first few results
head -20 data/CSV/deepseek_annotated_POI_test.csv

# Count lines in the output (includes the header row, if the CSV has one)
wc -l data/CSV/deepseek_annotated_POI_test.csv

# Check progress
cat misc/deepseek_query_index.txt

# Reset progress (start from scratch)
rm misc/deepseek_query_index.txt
```

---

**Ready to start?** Open the notebook and run Cell 5 → Cell 7 → Cell 10! 🎉