Quick Testing Instructions

Start Here! 🚀

Since you have Deepseek credits, test with Deepseek first, then try the other LLMs.

Step-by-Step Testing

1. Make sure your Deepseek API key is in place

Check whether this file exists (note that cat prints the key to your terminal):

cat misc/credentials/deepseek_api_key.txt

If not, create it:

echo "your-deepseek-api-key" > misc/credentials/deepseek_api_key.txt
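The same check can be done from Python before any request is sent. This is a minimal sketch, not the notebook's own loader: the function name `read_api_key` is hypothetical, and it assumes the file holds only the key on a single line.

```python
from pathlib import Path


def read_api_key(path="misc/credentials/deepseek_api_key.txt"):
    """Return the stored key, or raise a clear error if it is missing or unusable.

    Hypothetical helper: the notebook may load the key differently.
    """
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"{path} not found; create it before running Cell 10")
    # Strip the trailing newline that `echo ... > file` adds.
    key = key_file.read_text(encoding="utf-8").strip()
    if not key or key == "your-deepseek-api-key":
        raise ValueError(f"{path} is empty or still contains the placeholder text")
    return key
```

Failing early here avoids a less obvious authentication error partway through the annotation run.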

2. Open the notebook

jupyter notebook jupyter_notebooks/Section_2-3-4_Figure_8_deepfake_adapters.ipynb

3. Run the cells in order

Cells 0-4: Introduction and setup (markdown only, nothing to execute)
Cell 5: NER & Name Cleaning (processes real_person_adapters.csv)
Cell 7: Country/Nationality Mapping
Cell 10: 🌟 DEEPSEEK ANNOTATION (TEST THIS FIRST!)
- Default: TEST_MODE = True (10 samples)
- Will create: data/CSV/deepseek_annotated_POI_test.csv
Cell 12: Qwen/Llama/Mistral (run later after Deepseek works)
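To make the effect of TEST_MODE concrete, here is a rough sketch of how such a flag can pick both the sample and the output file. The names `TEST_SAMPLE_SIZE`, `select_rows`, and `output_path` are hypothetical, and taking the first 10 rows (rather than a random sample) is an assumption; the notebook may do either.

```python
TEST_MODE = True        # same flag name as in Cell 10
TEST_SAMPLE_SIZE = 10   # hypothetical constant; the notebook may hard-code this


def select_rows(rows, test_mode=TEST_MODE, sample_size=TEST_SAMPLE_SIZE):
    """Return only the first `sample_size` rows in test mode, else all rows."""
    return rows[:sample_size] if test_mode else rows


def output_path(llm="deepseek", test_mode=TEST_MODE):
    """Build the output file name, adding a `_test` suffix in test mode."""
    suffix = "_test" if test_mode else ""
    return f"data/CSV/{llm}_annotated_POI{suffix}.csv"
```

Writing test results to a separate `_test` file means a trial run never overwrites a full-dataset result.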

4. Review Deepseek Results

After Cell 10 completes, check:

Console output shows summary statistics
Output file: data/CSV/deepseek_annotated_POI_test.csv

The output should look similar to this (your counts will differ):

✅ Progress saved after 10 rows
✅ Done! Final results saved to data/CSV/deepseek_annotated_POI_test.csv

=== Summary Statistics ===
Total processed: 10

Gender distribution:
Female    8
Male      2
...
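If you want to recompute a distribution like the one above from the saved file, a small standard-library sketch is enough. The helper name `column_counts` is hypothetical, and the column name you pass (for example "Gender") is an assumption; check the CSV header for the actual name.

```python
import csv
from collections import Counter


def column_counts(csv_path, column):
    """Count how often each value appears in one column of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)  # uses the first line as the header
        return Counter(row[column] for row in reader)
```

For example, `column_counts("data/CSV/deepseek_annotated_POI_test.csv", "Gender")` would return a `Counter` whose `most_common()` lists the values from most to least frequent, assuming that column exists.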

5. If Deepseek Works Well

Once you're satisfied with the Deepseek results:

Option A: Process full dataset with Deepseek

# In Cell 10, change:
TEST_MODE = False

Option B: Try other LLMs for comparison

Set up API keys for Qwen/Llama/Mistral (see misc/credentials/README.md)

Run Cell 12 with your chosen LLM:

SELECTED_LLM = 'qwen'  # or 'llama' or 'mistral'
TEST_MODE = True       # Test first!

Expected Cost (Deepseek)

10 samples (test): ~$0.01 or less
1,000 entries: ~$0.10-0.20
10,000 entries: ~$1-2

These are rough estimates. Deepseek should be much cheaper than the other options, which makes it a good choice for testing!
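The figures above scale linearly, so you can estimate the cost for any dataset size. This sketch assumes the per-entry range implied by the 1,000-entry estimate ($0.10-0.20, i.e. $0.0001-0.0002 per entry); actual cost depends on prompt length and current pricing, and `estimate_cost` is a hypothetical helper.

```python
# Per-entry cost range in USD, derived from "1,000 entries: ~$0.10-0.20".
COST_PER_ENTRY_USD = (0.0001, 0.0002)


def estimate_cost(n_entries, per_entry=COST_PER_ENTRY_USD):
    """Return a (low, high) cost estimate in USD for n_entries rows."""
    low, high = per_entry
    return n_entries * low, n_entries * high


def format_estimate(n_entries):
    """Format the estimate as a short human-readable string."""
    low, high = estimate_cost(n_entries)
    return f"{n_entries} entries: ~${low:.2f}-{high:.2f}"
```

Calling `format_estimate(len(df))` on your own dataset before setting TEST_MODE = False gives a quick sanity check on the expected spend (`df` here stands for whatever table you loaded).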

Troubleshooting

"deepseek_api_key.txt not found"

# Create the file with your key
echo "your-api-key" > misc/credentials/deepseek_api_key.txt

"File does not exist: real_person_adapters.csv"

Make sure the input file exists:

ls -lh data/CSV/real_person_adapters.csv

API Rate Limiting

The code includes automatic rate limiting (time.sleep(1) between requests). If you are still rate limited, increase the sleep time in Cell 10 by changing time.sleep(1) to time.sleep(2).
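If a fixed delay is not enough, one alternative is to retry a failed request with a growing delay. Note that this is exponential backoff, which is not what the notebook does (it uses a fixed sleep). The sketch below is illustrative only: `call_with_backoff` and `request_fn` are hypothetical names, and catching every `Exception` is a simplification; in practice you would catch only the rate-limit error raised by your API client.

```python
import time


def call_with_backoff(request_fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call request_fn(); on failure wait base_delay, 2x, 4x, ... and retry.

    `sleep` is injectable so the waiting can be stubbed out in tests.
    """
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:  # assumption: narrow this to the client's rate-limit error
            if attempt == max_retries:
                raise  # out of retries, surface the original error
            sleep(base_delay * 2 ** attempt)
```

With the defaults, a request that keeps failing is attempted four times, with waits of 1, 2, and 4 seconds in between.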

Pipeline Interrupted

No problem! The code saves progress every 10 rows. Just re-run the cell and it will resume from the last saved checkpoint.
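For readers curious how checkpoint-and-resume of this kind can work, here is a minimal sketch. It assumes the index file (such as misc/deepseek_query_index.txt) stores a single integer, the number of rows already processed; the real file format and the function names below are assumptions, not the notebook's code.

```python
import csv
from pathlib import Path


def load_index(index_path):
    """Return the number of rows already processed (0 if no index file exists)."""
    p = Path(index_path)
    return int(p.read_text().strip()) if p.is_file() else 0


def process_with_resume(rows, annotate, out_path, index_path, save_every=10):
    """Annotate rows from the last checkpoint onward, saving every `save_every` rows.

    `annotate` is a hypothetical callable that returns one output row (a list).
    """
    start = load_index(index_path)
    pending = []  # annotated rows not yet written to disk
    for i in range(start, len(rows)):
        pending.append(annotate(rows[i]))
        done = i + 1
        if done % save_every == 0 or done == len(rows):
            # Append results first, then advance the index, so a crash between
            # the two steps can duplicate rows but never lose them.
            with open(out_path, "a", newline="", encoding="utf-8") as f:
                csv.writer(f).writerows(pending)
            Path(index_path).write_text(str(done))
            pending = []
    return load_index(index_path)
```

In this design, rows processed after the last checkpoint are redone on restart, so an interruption costs at most `save_every - 1` repeated requests.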

What's Next?

After testing with Deepseek:

If results look good: Scale up to the full dataset with Deepseek
Compare LLMs: Test Qwen/Llama/Mistral on the same sample to see which gives the best results
Production run: Choose your preferred LLM and process the full dataset

File Outputs

The pipeline creates these files:

data/CSV/
├── NER_POI_step01_pre_annotation.csv       # After Cell 5 (name cleaning)
├── NER_POI_step02_annotated.csv            # After Cell 7 (country mapping)
├── deepseek_annotated_POI_test.csv         # After Cell 10 (test mode)
├── deepseek_annotated_POI.csv              # After Cell 10 (full mode)
├── qwen_annotated_POI_test.csv             # After Cell 12 (if using Qwen)
└── ...

misc/
├── deepseek_query_index.txt                # Progress tracking
└── ...

Quick Commands

# View first few results
head -20 data/CSV/deepseek_annotated_POI_test.csv

# Count lines in the output (subtract 1 for the header row, if present)
wc -l data/CSV/deepseek_annotated_POI_test.csv

# Check progress
cat misc/deepseek_query_index.txt

# Reset progress (start from scratch)
rm misc/deepseek_query_index.txt

Ready to start? Open the notebook and run Cell 5 → Cell 7 → Cell 10! 🎉