Laura Wagner

Quick Testing Instructions

Start Here! 🚀

You mentioned you have Deepseek credits, so test with Deepseek before trying the other LLMs.

Step-by-Step Testing

1. Make sure your Deepseek API key is in place

Check if this file exists:

cat misc/credentials/deepseek_api_key.txt

If not, create it:

echo "your-deepseek-api-key" > misc/credentials/deepseek_api_key.txt
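
For reference, here is a minimal sketch of how a notebook cell might read that key. The path comes from the step above; the helper name `load_api_key` is hypothetical and may not match what the notebook actually does. Note that `echo` appends a trailing newline, which is why the sketch strips whitespace.

```python
from pathlib import Path

# Path taken from the step above; the helper itself is a hypothetical illustration.
KEY_PATH = Path("misc/credentials/deepseek_api_key.txt")


def load_api_key(path: Path = KEY_PATH) -> str:
    """Return the key stored in `path` with surrounding whitespace removed."""
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found")
    # `echo "..." > file` leaves a trailing newline, so strip it.
    key = path.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty")
    return key
```

Failing early with a clear message is easier to debug than an authentication error returned by the API later on.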

2. Open the notebook

jupyter notebook jupyter_notebooks/Section_2-3-4_Figure_8_deepfake_adapters.ipynb

3. Run the cells in order

  1. Cells 0-4: Introduction and setup (markdown only, no execution needed)
  2. Cell 5: NER & Name Cleaning (processes real_person_adapters.csv)
  3. Cell 7: Country/Nationality Mapping
  4. Cell 10: 🌟 DEEPSEEK ANNOTATION (TEST THIS FIRST!)
    • Default: TEST_MODE = True (10 samples)
    • Will create: data/CSV/deepseek_annotated_POI_test.csv
  5. Cell 12: Qwen/Llama/Mistral (run later after Deepseek works)
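
The TEST_MODE switch in the list above could be sketched roughly as follows. The file names follow the pattern used in this guide; the helper names are hypothetical, and taking the first 10 rows is an assumption (the notebook may select its samples differently).

```python
TEST_MODE = True
TEST_SAMPLE_SIZE = 10  # matches the "10 samples" default described above


def output_path(llm: str, test_mode: bool) -> str:
    """Build the output CSV path; test runs get a `_test` suffix."""
    suffix = "_test" if test_mode else ""
    return f"data/CSV/{llm}_annotated_POI{suffix}.csv"


def select_rows(rows: list, test_mode: bool, n: int = TEST_SAMPLE_SIZE) -> list:
    """Assumption: test mode keeps only the first `n` rows."""
    return rows[:n] if test_mode else rows


if __name__ == "__main__":
    print(output_path("deepseek", TEST_MODE))
```

Keeping test and full outputs in separate files means a test run never overwrites results from a full run.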

4. Review Deepseek Results

After Cell 10 completes, check:

  • Console output shows summary statistics
  • Output file: data/CSV/deepseek_annotated_POI_test.csv

Example output should look like:

✅ Progress saved after 10 rows
✅ Done! Final results saved to data/CSV/deepseek_annotated_POI_test.csv

=== Summary Statistics ===
Total processed: 10

Gender distribution:
Female    8
Male      2
...
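
To recompute a distribution like the one above outside the notebook, a small standard-library sketch could look like this. The column name passed in (for example `gender`) is an assumption; check the header of the output CSV for the real name.

```python
import csv
from collections import Counter


def column_distribution(csv_path: str, column: str) -> Counter:
    """Count how often each value occurs in `column` of a CSV with a header row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row[column] for row in csv.DictReader(f))


if __name__ == "__main__":
    # "gender" is a hypothetical column name used for illustration.
    counts = column_distribution("data/CSV/deepseek_annotated_POI_test.csv", "gender")
    for value, n in counts.most_common():
        print(f"{value:<10}{n}")
```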

5. If Deepseek Works Well

Once you're satisfied with the Deepseek results:

Option A: Process full dataset with Deepseek

# In Cell 10, change:
TEST_MODE = False

Option B: Try other LLMs for comparison

  1. Set up API keys for Qwen/Llama/Mistral (see misc/credentials/README.md)
  2. Run Cell 12 with your chosen LLM:
    SELECTED_LLM = 'qwen'  # or 'llama' or 'mistral'
    TEST_MODE = True       # Test first!
    

Expected Cost (Deepseek)

  • 10 samples (test): ~$0.01 or less
  • 1,000 entries: ~$0.10-0.20
  • 10,000 entries: ~$1-2

Much cheaper than the other options, making it perfect for testing!
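
The figures above imply roughly $0.0001-0.0002 per entry. As a rough sketch, that range can be extrapolated to any dataset size. It assumes cost scales linearly with the number of entries, and actual pricing depends on prompt length and the current Deepseek rates.

```python
# Per-entry range implied by the "1,000 entries: ~$0.10-0.20" estimate above.
COST_PER_ENTRY_USD = (0.10 / 1000, 0.20 / 1000)


def estimate_cost(n_entries: int) -> tuple[float, float]:
    """Return a (low, high) cost estimate in USD, assuming linear scaling."""
    low, high = COST_PER_ENTRY_USD
    return n_entries * low, n_entries * high


if __name__ == "__main__":
    for n in (10, 1_000, 10_000):
        low, high = estimate_cost(n)
        print(f"{n:>6} entries: ${low:.3f}-${high:.3f}")
```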

Troubleshooting

"deepseek_api_key.txt not found"

# Create the file with your key
echo "your-api-key" > misc/credentials/deepseek_api_key.txt

"File does not exist: real_person_adapters.csv"

Make sure the input file exists:

ls -lh data/CSV/real_person_adapters.csv

API Rate Limiting

The code includes automatic rate limiting (time.sleep(1) between requests). If you still get rate limited:

  • Increase the sleep time in Cell 10: change time.sleep(1) to time.sleep(2)
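
If a fixed sleep is not enough, one alternative is to retry with exponential backoff. This is a swapped-in technique, not what the notebook does (it uses a fixed `time.sleep(1)`). In the sketch below, `call_with_backoff` is a hypothetical helper and `RuntimeError` stands in for whatever rate-limit exception the API client actually raises.

```python
import time


def call_with_backoff(request_fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call request_fn(); after each failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except RuntimeError:  # placeholder for the client's rate-limit error
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.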

Pipeline Interrupted

No problem! The code saves progress every 10 rows. Just re-run the cell and it will resume from where it left off.
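
The resume behaviour can be pictured with the sketch below. It is an illustration under assumptions, not the notebook's code: the function names are hypothetical, the index file is assumed to hold a single integer (the number of rows already processed), and a real run would also write the partial results to the CSV at each checkpoint.

```python
from pathlib import Path


def read_index(path: Path) -> int:
    """Return the saved row index, or 0 if the file is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0


def process_rows(rows, annotate, index_path: Path, save_every: int = 10):
    """Annotate rows starting at the saved index; checkpoint every `save_every` rows."""
    start = read_index(index_path)
    results = []
    for i in range(start, len(rows)):
        results.append(annotate(rows[i]))
        done = i + 1
        if done % save_every == 0 or done == len(rows):
            index_path.write_text(str(done))  # record progress for a later resume
    return results
```

Deleting the index file makes `read_index` return 0, so the next run starts from the first row.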

What's Next?

After testing with Deepseek:

  1. If results look good: Scale up to full dataset with Deepseek
  2. Compare LLMs: Test Qwen/Llama/Mistral on the same sample to see which gives best results
  3. Production run: Choose your preferred LLM and process the full dataset

File Outputs

The pipeline creates these files:

data/CSV/
├── NER_POI_step01_pre_annotation.csv       # After Cell 5 (name cleaning)
├── NER_POI_step02_annotated.csv            # After Cell 7 (country mapping)
├── deepseek_annotated_POI_test.csv         # After Cell 10 (test mode)
├── deepseek_annotated_POI.csv              # After Cell 10 (full mode)
├── qwen_annotated_POI_test.csv             # After Cell 12 (if using Qwen)
└── ...

misc/
├── deepseek_query_index.txt                # Progress tracking
└── ...

Quick Commands

# View first few results
head -20 data/CSV/deepseek_annotated_POI_test.csv

# Count lines in the output (the total includes the header line, if present)
wc -l data/CSV/deepseek_annotated_POI_test.csv

# Check progress
cat misc/deepseek_query_index.txt

# Reset progress (start from scratch)
rm misc/deepseek_query_index.txt

Ready to start? Open the notebook and run Cell 5 → Cell 7 → Cell 10! 🎉