Laura Wagner committed on
Commit 5f5806d · 1 Parent(s): e15d7a4

to commit or not commit that is the question

This view is limited to 50 files because the commit contains too many changes.
Files changed (50):
  1. .gitignore +12 -0
  2. README.md +73 -24
  3. gitlab-ci.yml +11 -0
  4. jupyter_notebooks/.ipynb_checkpoints/GEMMA_3-checkpoint.ipynb +400 -0
  5. jupyter_notebooks/.ipynb_checkpoints/MISTRAL-checkpoint.ipynb +6 -0
  6. jupyter_notebooks/.ipynb_checkpoints/QWEN-checkpoint.ipynb +645 -0
  7. jupyter_notebooks/.ipynb_checkpoints/Section_2-2-2_Figure_4_Demographic_patterns_in_gen_images-checkpoint.ipynb +1795 -0
  8. jupyter_notebooks/.ipynb_checkpoints/Section_2-3-1_Tag_occurences-checkpoint.ipynb +6 -0
  9. jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Bloomz_query-checkpoint.ipynb +370 -0
  10. jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_Step_1_LLM_annotation-checkpoint.ipynb +1941 -0
  11. jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_Step_2_response_comparison_and_consensus_extraction-checkpoint.ipynb +0 -0
  12. jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_deepfake_adapters-Copy1-checkpoint.ipynb +0 -0
  13. jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_deepfake_adapters-checkpoint.ipynb +0 -0
  14. jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4__Figure_8_Deepfake_victims-checkpoint.ipynb +668 -0
  15. jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4__Figure_8b_sunburst_profession-checkpoint.ipynb +123 -0
  16. jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_compare-models-checkpoint.ipynb +237 -0
  17. jupyter_notebooks/.ipynb_checkpoints/Section_2-3_Figure_5_co-occurence_promotional_tags-checkpoint.ipynb +314 -0
  18. jupyter_notebooks/.ipynb_checkpoints/Section_2-4_Figure_9_ectract_LoRA_metadata_v2-checkpoint.ipynb +400 -0
  19. jupyter_notebooks/0_Scraping_image_metadata.ipynb +1345 -0
  20. jupyter_notebooks/0_Scraping_model_metadata.ipynb +643 -0
  21. jupyter_notebooks/Section_1_Figure_1_image_grid.ipynb +417 -0
  22. jupyter_notebooks/Section_2-2-2_Figure_3_histogram_monthly_images_nsfw_levels.ipynb +0 -0
  23. jupyter_notebooks/Section_2-2-2_Figure_4_Demographic_patterns_in_gen_images.ipynb +1795 -0
  24. jupyter_notebooks/Section_2-3-1_Tag_occurences.ipynb +801 -0
  25. jupyter_notebooks/Section_2-3-2_top_10_most_popular_checkpoints.ipynb +210 -0
  26. jupyter_notebooks/Section_2-3-3_Figure_7_top_30_adapters.ipynb +0 -0
  27. jupyter_notebooks/Section_2-3-4_Figure_8_Step_1_LLM_annotation.ipynb +1941 -0
  28. jupyter_notebooks/Section_2-3-4_Figure_8_Step_2_response_comparison_and_consensus_extraction.ipynb +0 -0
  29. jupyter_notebooks/Section_2-3-4__Figure_8a_sunburst_gender.ipynb +129 -0
  30. jupyter_notebooks/Section_2-3-4__Figure_8b_sunburst_profession.ipynb +332 -0
  31. jupyter_notebooks/Section_2-3_Figure_5_co-occurence_promotional_tags.ipynb +314 -0
  32. jupyter_notebooks/Section_2-4_Figure_9_Training_tags_Sankey.ipynb +203 -0
  33. jupyter_notebooks/Section_2-4_Figure_9_ectract_LoRA_metadata_v2.ipynb +414 -0
  34. jupyter_notebooks/Section_2-4_Figure_9_extract_LoRA_metadata.ipynb +557 -0
  35. jupyter_notebooks/SuppM_Figure_13_Danbooru_categories.ipynb +141 -0
  36. jupyter_notebooks/SuppM_Figure_S12_asset_types.ipynb +129 -0
  37. jupyter_notebooks/SuppM_Figure_S13_Danbooru_Taxonomy.ipynb +1848 -0
  38. jupyter_notebooks/SuppM_Figure_S14_co-occurence_training_data.ipynb +152 -0
  39. md/DEEPFAKE_PIPELINE_GUIDE.md +210 -0
  40. md/LLM_MODELS_COMPARISON.md +326 -0
  41. md/QUICK_START_LOCAL.md +171 -0
  42. md/QWEN_LOCAL_SETUP.md +321 -0
  43. md/SPACY_NER_EXPLANATION.md +316 -0
  44. md/TESTING_INSTRUCTIONS.md +148 -0
  45. md/UPDATES_AND_FIXES.md +235 -0
  46. misc/assets/fonts/DejaVuSans.ttf +3 -0
  47. misc/assets/fonts/Noto_Sans.zip +3 -0
  48. misc/assets/fonts/Noto_Sans/NotoSans-Italic-VariableFont_wdth,wght.ttf +3 -0
  49. misc/assets/fonts/Noto_Sans/NotoSans-VariableFont_wdth,wght.ttf +3 -0
  50. misc/assets/fonts/Noto_Sans/OFL.txt +3 -0
.gitignore ADDED
@@ -0,0 +1,12 @@
+ data
+ ext
+ ARCHIVE
+ misc/api_keys.txt
+ misc/training_tags_categories
+ misc/credentials/*
+ misc/credentials
+ scripts/ARCHIVE
+ scripts/CEMETARY
+ cemetary
+ .venv
+ logs
README.md CHANGED
@@ -1,29 +1,78 @@
- # Repository for "Perpetuating Misogyny with Generative AI: How Model Personalization Normalizes Gendered Harm"
- 
- This repository contains the code for [Perpetuating Misogyny with Generative AI: How Model Personalization Normalizes Gendered Harm](https://arxiv.org/abs/2505.04600).
- 
- ## Related Datasets
- 
- This project uses three datasets, all hosted on Hugging Face:
- - Dataset 1: [dataset-name-1](https://huggingface.co/datasets/username/dataset-1)
- - Dataset 2: [dataset-name-2](https://huggingface.co/datasets/username/dataset-2)
- - Dataset 3: [dataset-name-3](https://huggingface.co/datasets/username/dataset-3)
- 
- ## Installation
- ```python
- pip install -r requirements.txt
+ # Code for the paper ["Perpetual Misogyny: How Gendered Tropes Shape Text-To-Image AI Personalization"](http://arxiv.org/pdf/2505.04600)
+ 
+ ### Repository Structure
+ 
+ ```
+ CIVITAI_VISUALIZATIONS/
+ ├── .virtual_documents/ # Temporary files from Jupyter
+ ├── data/ # Final curated datasets
+ │ ├── subset1/ # Specific data splits or versions
+ │ ├── subset2/
+ │ └── ...
+ ├── jupyter_notebooks/ # Main analysis notebooks
+ │ ├── 0_Scraping_image_metadata.ipynb # Scrapes CivitAI metadata via API
+ │ ├── Section_1_Figure_1_image_grid.ipynb # Grid of sample images
+ │ ├── Section_3-2-1_Figure_3_histogram.ipynb # Histogram of upload trends
+ │ ├── Section_3-2-1_Figure_4_Mivolo.ipynb # Model activity plot (Mivolo-focused)
+ │ ├── Section_3-3-1_Figure_5_tags.ipynb # Tag frequency and usage visualizations
+ │ ├── Section_3-3-3_download_popular_models.ipynb # Download models for analysis
+ │ ├── Section_3-3-3_Figure_6.ipynb # Promotional tag usage patterns
+ │ ├── Section_3-3-4_Figure_8a.ipynb # Ranking of popular models
+ │ ├── Section_3-3-4_Figure_8b.ipynb # Continuation of model rankings
+ │ ├── Section_3-3-4_Figure_9_Sankey.ipynb # Sankey diagram: user-model contributions
+ │ ├── Section_3-3-4_LLM_annotation.ipynb # Annotations using large language models
+ │ ├── Section_3-4_extract_LoRA_metadata.ipynb # LoRA metadata extraction
+ │ ├── SuppM_Figure_12_Danbooru_Taxonomy.ipynb # Danbooru tag taxonomy: visualization
+ │ ├── SuppM_Figure_13_Danbooru_taxonomy.ipynb # Tag grouping and structure
+ │ └── SuppM_Figure_13.ipynb # Supplementary figure generation
+ ├── misc/ # Utility scripts and API credentials (excluded from versioning)
+ │ └── credentials/ # API keys and sensitive config (excluded from versioning)
+ ├── plots/ # Output plots and figures used in the paper
+ ├── public/ # Optional public-facing files
+ ├── .gitignore
+ ├── .gitmodules
+ ├── README.md # This file
+ └── requirements.txt # Python dependencies
  ```
  
- ## Quick Start
- ```python
- from datasets import load_dataset
- 
- # Load all three datasets
- dataset1 = load_dataset("username/dataset-1")
- dataset2 = load_dataset("username/dataset-2")
- dataset3 = load_dataset("username/dataset-3")
- 
- # Run your pipeline
- from src.pipeline import run_full_pipeline
- results = run_full_pipeline(dataset1, dataset2, dataset3)
- ```
+ # Project Setup
+ 
+ ## Requirements
+ This project requires Python 3.8 or higher. Make sure it is installed before proceeding.
+ 
+ ## Installation
+ 
+ 1. **Clone the Repository**
+ ```sh
+ git clone https://gitlab.uzh.ch/latent-canon/pm-paper.git
+ cd pm-paper
+ ```
+ 
+ 2. **Create a Virtual Environment** (recommended to avoid dependency conflicts)
+ ```sh
+ python -m venv venv
+ source venv/bin/activate # On macOS/Linux
+ venv\Scripts\activate # On Windows
+ ```
+ 
+ 3. **Install Dependencies**
+ ```sh
+ pip install -r requirements.txt
+ ```
+ 
+ 4. **Jupyter Notebook Setup (Optional)**
+ If running the Jupyter notebooks, make sure the environment is registered as a kernel:
+ ```sh
+ python -m ipykernel install --user --name=venv --display-name "Python (venv)"
+ ```
+ 
+ ## Notes
+ - Make sure the necessary system dependencies are installed (e.g., `opencv` may require additional system libraries).
+ - If you encounter issues, check that the correct Python environment is active (the `venv` you created).
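As a quick post-install check, the sketch below verifies that the activated environment can import the core packages the notebooks rely on. The package list is an assumption inferred from the notebook imports in this commit, not from the pinned requirements.txt:

```python
# Hypothetical post-install sanity check: confirm the core notebook
# dependencies resolve inside the activated venv. Package list inferred
# from the notebooks' imports; adjust it to match requirements.txt.
import importlib

for pkg in ["pandas", "torch", "transformers", "tqdm"]:
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{pkg}: missing -- install it inside the venv")
```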
gitlab-ci.yml ADDED
@@ -0,0 +1,11 @@
+ 
+ # .gitlab-ci.yml
+ pages:
+   stage: deploy
+   script:
+     - echo "Nothing to build, serving public/"
+   artifacts:
+     paths:
+       - public
+   only:
+     - main
jupyter_notebooks/.ipynb_checkpoints/GEMMA_3-checkpoint.ipynb ADDED
@@ -0,0 +1,400 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "471e0cef-678e-4403-8eca-f8e1991d86de",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import json\n",
+ "import time\n",
+ "import re\n",
+ "from pathlib import Path\n",
+ "from tqdm import tqdm\n",
+ "import torch\n",
+ "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
+ "\n",
+ "current_dir = Path.cwd()\n",
+ "input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
+ "professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
+ "professions_mapped_file = current_dir.parent / \"misc/lists/professions_mapped.csv\"\n",
+ "# === PROCESS DATA ===\n",
+ "\n",
+ "\n",
+ "# === CONFIGURATION ===\n",
+ "TEST_MODE = False\n",
+ "TEST_SIZE = 100\n",
+ "MAX_ROWS = 50862\n",
+ "SAVE_INTERVAL = 10\n",
+ "\n",
+ "output_file = current_dir.parent / f\"data/CSV/gemma_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
+ "index_file = current_dir.parent / \"misc/query_indicies/gemma_local_query_index.txt\"\n",
+ "\n",
+ "# Model settings\n",
+ "MODEL_NAME = \"google/gemma-3-27b-it\"\n",
+ "#MODEL_NAME = \"mistralai/Mistral-7B-Instruct-v0.3\"\n",
+ "CACHE_DIR = current_dir.parent / \"data/models\"\n",
+ "CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
+ "\n",
+ "# Define the SPECIFIC profession categories\n",
+ "PROFESSION_CATEGORIES = [\n",
+ " \"actor\",\n",
+ " \"adult performer\",\n",
+ " \"singer/musician\",\n",
+ " \"model\",\n",
+ " \"online personality\",\n",
+ " \"public figure\",\n",
+ " \"voice actor/ASMR\",\n",
+ " \"sports professional\",\n",
+ " \"tv personality\"\n",
+ "]\n",
+ "\n",
+ "# === LOAD MODEL ===\n",
+ "print(f\"Loading model: {MODEL_NAME}\")\n",
+ "print(f\"Cache directory: {CACHE_DIR}\")\n",
+ "print(\"This may take a while on first run (~65GB download)...\\n\")\n",
+ "\n",
+ "# Check GPU availability\n",
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+ "print(f\"Device: {device}\")\n",
+ "\n",
+ "if device == \"cpu\":\n",
+ " print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
+ " print(\" Consider using a GPU or reducing model size.\")\n",
+ "\n",
+ "# Load tokenizer\n",
+ "print(\"Loading tokenizer...\")\n",
+ "try:\n",
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
+ " MODEL_NAME,\n",
+ " cache_dir=str(CACHE_DIR),\n",
+ " use_fast=True\n",
+ " )\n",
+ "except Exception as e:\n",
+ " print(f\"Failed with use_fast=True ({e}), trying use_fast=False...\")\n",
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
+ " MODEL_NAME,\n",
+ " cache_dir=str(CACHE_DIR),\n",
+ " use_fast=False\n",
+ " )\n",
+ "\n",
+ "# Ensure pad token is set\n",
+ "if tokenizer.pad_token is None:\n",
+ " tokenizer.pad_token = tokenizer.eos_token\n",
+ "\n",
+ "print(\"✅ Tokenizer loaded\")\n",
+ "\n",
+ "# Load model with optimizations\n",
+ "print(\"Loading model (this may take several minutes)...\")\n",
+ "model = AutoModelForCausalLM.from_pretrained(\n",
+ " MODEL_NAME,\n",
+ " cache_dir=str(CACHE_DIR),\n",
+ " torch_dtype=torch.bfloat16,\n",
+ " device_map=\"auto\",\n",
+ " trust_remote_code=False\n",
+ ")\n",
+ "model.eval()\n",
+ "print(\"✅ Model loaded\")\n",
+ "\n",
+ "# Check VRAM usage\n",
+ "if torch.cuda.is_available():\n",
+ " vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
+ " print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
+ "\n",
+ "# === LOAD DATA ===\n",
+ "print(\"Loading raw input CSV...\")\n",
+ "df = pd.read_csv(input_file) # ALWAYS load the full input\n",
+ "print(f\"Loaded {len(df)} rows from input file\")\n",
+ "\n",
+ "# If we have previous annotations, merge them\n",
+ "if output_file.exists():\n",
+ " print(\"Found existing annotations, merging...\")\n",
+ " existing_df = pd.read_csv(output_file)\n",
+ " print(f\"Existing annotations file has {len(existing_df)} rows\")\n",
+ " \n",
+ " # Update df with existing annotations\n",
+ " # Only update the columns that were annotated\n",
+ " annotation_cols = ['full_name', 'aliases', 'gender', 'profession_llm', 'country']\n",
+ " for col in annotation_cols:\n",
+ " if col in existing_df.columns:\n",
+ " df[col] = existing_df[col][:len(df)] # Make sure we don't exceed df length\n",
+ " \n",
+ " print(f\"Merged annotations, continuing with {len(df)} total rows\")\n",
+ "\n",
+ "\n",
+ "# Try to load profession mapping files\n",
+ "try:\n",
+ " professions_df = pd.read_csv(professions_file)\n",
+ " print(\"✅ Loaded professions.csv\")\n",
+ "except Exception:\n",
+ " print(\"⚠️ Warning: professions.csv not found\")\n",
+ "\n",
+ "try:\n",
+ " prof_mapped_df = pd.read_csv(professions_mapped_file)\n",
+ " print(f\"✅ Loaded profession mapping with {len(prof_mapped_df)} categories\")\n",
+ "except Exception:\n",
+ " print(\"⚠️ Warning: professions_mapped.csv not found, using default categories\")\n",
+ "\n",
+ "profession_str = \", \".join(PROFESSION_CATEGORIES)\n",
+ "\n",
+ "print(f\"Loaded {len(df)} rows\")\n",
+ "print(f\"\\nProfession categories ({len(PROFESSION_CATEGORIES)}):\")\n",
+ "for cat in PROFESSION_CATEGORIES:\n",
+ " print(f\" - {cat}\")\n",
+ "\n",
+ "if TEST_MODE:\n",
+ " print(f\"\\nRunning in TEST MODE with {TEST_SIZE} samples\")\n",
+ " df = df.head(TEST_SIZE).copy()\n",
+ "elif MAX_ROWS:\n",
+ " df = df.head(MAX_ROWS).copy()\n",
+ "\n",
+ "# === CREATE PROMPTS ===\n",
+ "def create_prompt(row):\n",
+ " \"\"\"Create prompt for Gemma annotation with specific profession categories.\"\"\"\n",
+ " name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
+ " \n",
+ " # Gather hints\n",
+ " hints = []\n",
+ " if pd.notna(row.get('likely_profession')):\n",
+ " hints.append(str(row['likely_profession']))\n",
+ " if pd.notna(row.get('likely_nationality')):\n",
+ " hints.append(str(row['likely_nationality']))\n",
+ " if pd.notna(row.get('likely_country')):\n",
+ " hints.append(str(row['likely_country']))\n",
+ " \n",
+ " # Add tags if we don't have enough hints\n",
+ " if len(hints) < 3:\n",
+ " for i in range(1, 8):\n",
+ " tag_col = f'tag_{i}'\n",
+ " if tag_col in row and pd.notna(row[tag_col]):\n",
+ " tag_val = str(row[tag_col])\n",
+ " if tag_val not in hints:\n",
+ " hints.append(tag_val)\n",
+ " if len(hints) >= 5:\n",
+ " break\n",
+ " \n",
+ " hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
+ " \n",
+ " return f\"\"\"Given '{name}' ({hint_text}), provide:\n",
+ "1. Full legal name (Western order if non-latin script)\n",
+ "2. Any stage names/aliases (comma separated)\n",
+ "3. Gender (Male/Female/Other/Unknown)\n",
+ "4. Top 3 most likely professions from ONLY these categories:\n",
+ " - actor\n",
+ " - adult performer\n",
+ " - singer/musician\n",
+ " - model\n",
+ " - online personality (includes streamers, cosplayers, influencers)\n",
+ " - public figure (includes politicians, activists, journalists, authors)\n",
+ " - voice actor/ASMR\n",
+ " - sports professional\n",
+ " - tv personality (includes hosts, presenters, reality TV)\n",
+ "\n",
+ "5. Primary associated country\n",
+ "\n",
+ "IMPORTANT:\n",
+ "- Choose professions ONLY from the 9 categories above\n",
+ "- Provide up to 3 professions, comma-separated, ordered by relevance\n",
+ "- Be SPECIFIC: choose the most accurate category for each role\n",
+ "- \"online personality\" includes: streamers, cosplayers, YouTubers, influencers, content creators\n",
+ "- Use 'Unknown' when uncertain or for fictional characters/places\n",
+ "- For multi-role people, list all relevant categories (e.g., \"actor, singer/musician, online personality\")\n",
+ "- For country respond with one word only, for example China or Colombia\n",
+ "- actress = actor\n",
+ "\n",
+ "Respond with exactly 5 numbered lines.\"\"\"\n",
+ "\n",
+ "# Create prompts\n",
+ "print(\"\\nCreating prompts...\")\n",
+ "df['prompt'] = df.apply(create_prompt, axis=1)\n",
+ "print(\"✅ Prompts created\")\n",
+ "\n",
+ "# === QUERY GEMMA LOCALLY ===\n",
+ "def query_gemma_local(prompt: str) -> str | None:\n",
+ " \"\"\"Query Gemma locally via transformers.\"\"\"\n",
+ " try:\n",
+ " # Format as chat message for GEMMA\n",
+ " messages = [\n",
+ " {\"role\": \"system\", \"content\": \"You are an assistant that extracts key data about a person based on their name. Respond with exactly 5 numbered lines. For professions, choose ONLY from these categories: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality.\"},\n",
+ " {\"role\": \"user\", \"content\": prompt}\n",
+ " ]\n",
+ " \n",
+ " # Tokenize\n",
+ " if hasattr(tokenizer, 'apply_chat_template'):\n",
+ " text = tokenizer.apply_chat_template(\n",
+ " messages,\n",
+ " tokenize=False,\n",
+ " add_generation_prompt=True\n",
+ " )\n",
+ " else:\n",
+ " # Fallback for older tokenizers\n",
+ " text = f\"[INST] {prompt} [/INST]\"\n",
+ " \n",
+ " inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
+ " \n",
+ " # Generate\n",
+ " with torch.no_grad():\n",
+ " outputs = model.generate(\n",
+ " **inputs,\n",
+ " max_new_tokens=512,\n",
+ " temperature=0.1,\n",
+ " do_sample=True,\n",
+ " top_p=0.9,\n",
+ " pad_token_id=tokenizer.eos_token_id\n",
+ " )\n",
+ " \n",
+ " # Decode\n",
+ " generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
+ " response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
+ " \n",
+ " return response.strip()\n",
+ " \n",
+ " except Exception as e:\n",
+ " print(f\"Generation error: {e}\")\n",
+ " return None\n",
+ "\n",
+ "output_file = current_dir.parent / f\"data/CSV/gemma_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
+ "index_file = current_dir.parent / \"misc/query_indicies/gemma_local_query_index.txt\"\n",
+ "\n",
+ "\n",
+ "# === PARSE RESPONSE ===\n",
+ "def parse_response(response):\n",
+ " \"\"\"Parse Gemma response into structured fields.\"\"\"\n",
+ " if not response:\n",
+ " return {\n",
+ " 'full_name': 'Unknown',\n",
+ " 'aliases': 'Unknown',\n",
+ " 'gender': 'Unknown',\n",
+ " 'profession_llm': 'Unknown',\n",
+ " 'country': 'Unknown'\n",
+ " }\n",
+ " \n",
+ " # Split into lines and clean\n",
+ " lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
+ " \n",
+ " # Initialize with Unknown values\n",
+ " fields = {\n",
+ " 'full_name': 'Unknown',\n",
+ " 'aliases': 'Unknown',\n",
+ " 'gender': 'Unknown',\n",
+ " 'profession_llm': 'Unknown',\n",
+ " 'country': 'Unknown'\n",
+ " }\n",
+ " \n",
+ " # Extract information from each numbered line\n",
+ " for line in lines:\n",
+ " if line.startswith('1.'):\n",
+ " fields['full_name'] = line[2:].strip()\n",
+ " elif line.startswith('2.'):\n",
+ " fields['aliases'] = line[2:].strip()\n",
+ " elif line.startswith('3.'):\n",
+ " fields['gender'] = line[2:].strip()\n",
+ " elif line.startswith('4.'):\n",
+ " fields['profession_llm'] = line[2:].strip()\n",
+ " elif line.startswith('5.'):\n",
+ " fields['country'] = line[2:].strip()\n",
+ " \n",
+ " return fields\n",
+ "\n",
+ "\n",
+ "# === PROCESS DATA ===\n",
+ "index_file.parent.mkdir(parents=True, exist_ok=True)\n",
+ "\n",
+ "# Load index\n",
+ "current_index = 0\n",
+ "if index_file.exists():\n",
+ " try:\n",
+ " current_index = int(index_file.read_text().strip())\n",
+ " except Exception:\n",
+ " current_index = 0\n",
+ "\n",
+ "print(f\"Resuming from index {current_index}\")\n",
+ "\n",
+ "start_time = time.time()\n",
+ "\n",
+ "for i in tqdm(range(current_index, len(df)), desc=\"Gemma Local\"):\n",
+ "\n",
+ " prompt = df.at[i, \"prompt\"]\n",
+ "\n",
+ " # -------- MODEL QUERY WITH RETRIES --------\n",
+ " response = None\n",
+ " for attempt in range(3):\n",
+ " response = query_gemma_local(prompt)\n",
+ " \n",
+ " # Valid response?\n",
+ " if response and len(response.strip()) > 10:\n",
+ " break\n",
+ " \n",
+ " print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
+ " time.sleep(0.5)\n",
+ "\n",
+ " # If still invalid → DO NOT overwrite previous data\n",
+ " if not response or len(response.strip()) <= 10:\n",
+ " print(f\"❌ Row {i}: failed after retries, not writing, not advancing index\")\n",
+ " continue\n",
+ "\n",
+ " parsed = parse_response(response)\n",
+ "\n",
+ " # Additional safety: skip rows that parsed as all 'Unknown'\n",
+ " if all(v == \"Unknown\" for v in parsed.values()):\n",
+ " print(f\"❌ Row {i}: parsed as all Unknown (likely model crash); skipping.\")\n",
+ " continue\n",
+ "\n",
+ " # -------- WRITE PARSED FIELDS SAFELY --------\n",
+ " for key, value in parsed.items():\n",
+ " df.at[i, key] = value\n",
+ "\n",
+ " # Advance progress ONLY after successful write\n",
+ " current_index = i + 1\n",
+ "\n",
+ " # -------- GPU MEMORY CLEANUP --------\n",
+ " if torch.cuda.is_available():\n",
+ " torch.cuda.empty_cache()\n",
+ " torch.cuda.synchronize()\n",
+ "\n",
+ " # -------- PERIODIC SAVE --------\n",
+ " if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
+ " df.to_csv(output_file, index=False)\n",
+ " with open(index_file, \"w\") as f:\n",
+ " f.write(str(current_index))\n",
+ " print(f\"💾 Progress saved after row {i+1}\")\n",
+ "\n",
+ "# Final save\n",
+ "df.to_csv(output_file, index=False)\n",
+ "index_file.write_text(str(current_index))\n",
+ "print(\"✅ Finished full dataset.\")\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7a1be7c0-ce54-4445-8534-bf3ab5e70197",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "pm-paper",
+ "language": "python",
+ "name": "pm-paper"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.13"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
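For readers skimming the diff: the cell above asks Gemma for exactly five numbered lines and maps them back to dataframe columns. Below is a minimal, self-contained sketch of that parsing contract; the helper mirrors the notebook's `parse_response`, and the sample reply is invented for illustration:

```python
# Sketch of the notebook's response contract: five numbered lines in,
# five annotation fields out. The sample reply is invented for illustration.
FIELD_BY_PREFIX = {'1.': 'full_name', '2.': 'aliases', '3.': 'gender',
                   '4.': 'profession_llm', '5.': 'country'}

def parse_response(response):
    # Mirrors the notebook: default every field to 'Unknown', then fill
    # each field from the matching numbered line.
    fields = {key: 'Unknown' for key in FIELD_BY_PREFIX.values()}
    for line in (l.strip() for l in response.split('\n') if l.strip()):
        for prefix, key in FIELD_BY_PREFIX.items():
            if line.startswith(prefix):
                fields[key] = line[len(prefix):].strip()
    return fields

sample_reply = """1. Jane Doe
2. JD
3. Female
4. actor, model
5. USA"""

print(parse_response(sample_reply))
# -> {'full_name': 'Jane Doe', 'aliases': 'JD', 'gender': 'Female',
#     'profession_llm': 'actor, model', 'country': 'USA'}
```

Note that the run loop above treats an all-'Unknown' parse as a failed generation and skips the row without advancing the resume index, so a later restart retries it.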
jupyter_notebooks/.ipynb_checkpoints/MISTRAL-checkpoint.ipynb ADDED
@@ -0,0 +1,6 @@
+ {
+ "cells": [],
+ "metadata": {},
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
jupyter_notebooks/.ipynb_checkpoints/QWEN-checkpoint.ipynb ADDED
@@ -0,0 +1,645 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "0543d9c4-055b-49a3-a8d0-9dfb622b2b8c",
+ "metadata": {},
+ "source": [
+ "# QWEN 2.5-32B Local Inference\n",
+ "\n",
+ "## Hardware Requirements\n",
+ "- **GPU**: NVIDIA A100 (40GB or 80GB recommended)\n",
+ "- **VRAM Usage**: \n",
+ " - 8-bit quantization: ~32GB\n",
+ " - 4-bit quantization: ~16-20GB\n",
+ " - bfloat16 (no quantization): ~64GB\n",
+ "- **System RAM**: 32GB minimum, 64GB recommended\n",
+ "- **Storage**: ~65GB for model download\n",
+ "\n",
+ "## Configuration\n",
+ "This notebook uses **8-bit quantization** via `bitsandbytes` for optimal performance on A100 GPUs:\n",
+ "- Reduces VRAM usage from 64GB to ~32GB\n",
+ "- Minimal quality degradation\n",
+ "- Faster inference than bfloat16\n",
+ "\n",
+ "## Model Details\n",
+ "- **Model**: Qwen/Qwen2.5-32B-Instruct\n",
+ "- **Task**: Entity annotation and profession classification\n",
+ "- **Quantization**: LLM.int8() (8-bit)\n",
+ "- **Device**: CUDA (auto device mapping)\n",
+ "\n",
+ "## Dependencies\n",
+ "Make sure to install:\n",
+ "```bash\n",
+ "pip install transformers>=4.35.0 bitsandbytes>=0.41.0 accelerate torch pandas tqdm\n",
+ "```"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "fe6ba282-896b-4272-b82b-ef24810732fb",
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-11-29T20:14:34.671159Z",
+ "iopub.status.busy": "2025-11-29T20:14:34.671015Z",
+ "iopub.status.idle": "2025-11-29T20:26:40.952267Z",
+ "shell.execute_reply": "2025-11-29T20:26:40.951486Z",
+ "shell.execute_reply.started": "2025-11-29T20:14:34.671146Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+ " from .autonotebook import tqdm as notebook_tqdm\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Loading model: Qwen/Qwen2.5-32B-Instruct\n",
+ "Cache directory: /shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/data/models\n",
+ "This may take a while on first run (~65GB download)...\n",
+ "\n",
+ "Device: cuda\n",
+ "Loading tokenizer...\n",
+ "✅ Tokenizer loaded\n",
+ "Configuring 8-bit quantization...\n",
+ "Loading model with 8-bit quantization (this may take several minutes)...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Loading checkpoint shards: 100%|██████████| 17/17 [03:49<00:00, 13.50s/it]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "✅ Model loaded with 8-bit quantization\n",
+ "VRAM used: 32.72 GB\n",
+ "\n",
+ "Loading raw input CSV...\n",
+ "✅ Loaded professions.csv\n",
+ "✅ Loaded profession mapping with 9 categories\n",
+ "Loaded 50861 rows\n",
+ "\n",
+ "Profession categories (9):\n",
+ " - actor\n",
+ " - adult performer\n",
+ " - singer/musician\n",
+ " - model\n",
+ " - online personality\n",
+ " - public figure\n",
+ " - voice actor/ASMR\n",
+ " - sports professional\n",
+ " - tv personality\n",
+ "\n",
+ "Creating prompts...\n",
+ "✅ Prompts created\n",
+ "Resuming from index 0\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Qwen Local: 0%| | 0/50861 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n",
+ "Qwen Local: 0%| | 10/50861 [01:35<115:51:08, 8.20s/it]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "💾 Progress saved after row 10\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Qwen Local: 0%| | 20/50861 [03:01<143:54:04, 10.19s/it]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "💾 Progress saved after row 20\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Qwen Local: 0%| | 24/50861 [03:52<151:28:05, 10.73s/it]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "❌ Row 23: parsed as all Unknown (likely model crash); skipping.\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Qwen Local: 0%| | 30/50861 [04:36<116:38:03, 8.26s/it]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "💾 Progress saved after row 30\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Qwen Local: 0%| | 40/50861 [05:55<142:16:40, 10.08s/it]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "💾 Progress saved after row 40\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "Qwen Local: 0%| | 45/50861 [06:45<127:20:20, 9.02s/it]\n"
+ ]
+ },
+ {
+ "ename": "KeyboardInterrupt",
+ "evalue": "",
+ "output_type": "error",
+ "traceback": [
+ "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
+ "\u001b[31mKeyboardInterrupt\u001b[39m Traceback (most recent call last)",
+ "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[1]\u001b[39m\u001b[32m, line 330\u001b[39m\n\u001b[32m 328\u001b[39m response = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 329\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m attempt \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mrange\u001b[39m(\u001b[32m3\u001b[39m):\n\u001b[32m--> \u001b[39m\u001b[32m330\u001b[39m response = \u001b[43mquery_qwen_local\u001b[49m\u001b[43m(\u001b[49m\u001b[43mprompt\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 332\u001b[39m \u001b[38;5;66;03m# Valid response?\u001b[39;00m\n\u001b[32m 333\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m response \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(response.strip()) > \u001b[32m10\u001b[39m:\n",
+ "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[1]\u001b[39m\u001b[32m, line 245\u001b[39m, in \u001b[36mquery_qwen_local\u001b[39m\u001b[34m(prompt)\u001b[39m\n\u001b[32m 243\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m timeout(\u001b[32m60\u001b[39m):\n\u001b[32m 244\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m torch.no_grad():\n\u001b[32m--> \u001b[39m\u001b[32m245\u001b[39m outputs = \u001b[43mmodel\u001b[49m\u001b[43m.\u001b[49m\u001b[43mgenerate\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 246\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43minputs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 247\u001b[39m \u001b[43m \u001b[49m\u001b[43mmax_new_tokens\u001b[49m\u001b[43m=\u001b[49m\u001b[32;43m100\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 248\u001b[39m \u001b[43m \u001b[49m\u001b[43mtemperature\u001b[49m\u001b[43m=\u001b[49m\u001b[32;43m0.1\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 249\u001b[39m \u001b[43m \u001b[49m\u001b[43mdo_sample\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 250\u001b[39m \u001b[43m \u001b[49m\u001b[43mpad_token_id\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtokenizer\u001b[49m\u001b[43m.\u001b[49m\u001b[43meos_token_id\u001b[49m\n\u001b[32m 251\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 253\u001b[39m \u001b[38;5;66;03m# Decode\u001b[39;00m\n\u001b[32m 254\u001b[39m generated_ids = outputs[\u001b[32m0\u001b[39m][inputs[\u001b[33m'\u001b[39m\u001b[33minput_ids\u001b[39m\u001b[33m'\u001b[39m].shape[\u001b[32m1\u001b[39m]:]\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py:120\u001b[39m, in \u001b[36mcontext_decorator.<locals>.decorate_context\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 117\u001b[39m \u001b[38;5;129m@functools\u001b[39m.wraps(func)\n\u001b[32m 118\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mdecorate_context\u001b[39m(*args, **kwargs):\n\u001b[32m 119\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m ctx_factory():\n\u001b[32m--> \u001b[39m\u001b[32m120\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/generation/utils.py:2564\u001b[39m, in \u001b[36mGenerationMixin.generate\u001b[39m\u001b[34m(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, use_model_defaults, custom_generate, **kwargs)\u001b[39m\n\u001b[32m 2561\u001b[39m model_kwargs[\u001b[33m\"\u001b[39m\u001b[33muse_cache\u001b[39m\u001b[33m\"\u001b[39m] = generation_config.use_cache\n\u001b[32m 2563\u001b[39m \u001b[38;5;66;03m# 9. Call generation mode\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m2564\u001b[39m result = \u001b[43mdecoding_method\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 2565\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 2566\u001b[39m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 2567\u001b[39m \u001b[43m \u001b[49m\u001b[43mlogits_processor\u001b[49m\u001b[43m=\u001b[49m\u001b[43mprepared_logits_processor\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 2568\u001b[39m \u001b[43m \u001b[49m\u001b[43mstopping_criteria\u001b[49m\u001b[43m=\u001b[49m\u001b[43mprepared_stopping_criteria\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 2569\u001b[39m \u001b[43m \u001b[49m\u001b[43mgeneration_config\u001b[49m\u001b[43m=\u001b[49m\u001b[43mgeneration_config\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 2570\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mgeneration_mode_kwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 2571\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mmodel_kwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 2572\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 2574\u001b[39m \u001b[38;5;66;03m# Convert to legacy cache format if requested\u001b[39;00m\n\u001b[32m 2575\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m (\n\u001b[32m 2576\u001b[39m generation_config.return_legacy_cache \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[32m 2577\u001b[39m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(result, \u001b[33m\"\u001b[39m\u001b[33mpast_key_values\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 2578\u001b[39m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mgetattr\u001b[39m(result.past_key_values, \u001b[33m\"\u001b[39m\u001b[33mto_legacy_cache\u001b[39m\u001b[33m\"\u001b[39m) \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 2579\u001b[39m ):\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/generation/utils.py:2787\u001b[39m, in \u001b[36mGenerationMixin._sample\u001b[39m\u001b[34m(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)\u001b[39m\n\u001b[32m 2785\u001b[39m is_prefill = \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[32m 2786\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m2787\u001b[39m outputs = \u001b[43mmodel_forward\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mmodel_inputs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mreturn_dict\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m)\u001b[49m\n\u001b[32m 2789\u001b[39m \u001b[38;5;66;03m# synced_gpus: don't waste resources running the code we don't need; kwargs must be updated before skipping\u001b[39;00m\n\u001b[32m 2790\u001b[39m model_kwargs = \u001b[38;5;28mself\u001b[39m._update_model_kwargs_for_generation(\n\u001b[32m 2791\u001b[39m outputs,\n\u001b[32m 2792\u001b[39m model_kwargs,\n\u001b[32m 2793\u001b[39m is_encoder_decoder=\u001b[38;5;28mself\u001b[39m.config.is_encoder_decoder,\n\u001b[32m 2794\u001b[39m )\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1775\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1773\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1774\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1775\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1786\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1781\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1782\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1783\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1784\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1785\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1786\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1788\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1789\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/utils/generic.py:918\u001b[39m, in \u001b[36mcan_return_tuple.<locals>.wrapper\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 916\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m return_dict_passed \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 917\u001b[39m return_dict = return_dict_passed\n\u001b[32m--> \u001b[39m\u001b[32m918\u001b[39m output = \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 919\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m return_dict \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(output, \u001b[38;5;28mtuple\u001b[39m):\n\u001b[32m 920\u001b[39m output = output.to_tuple()\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py:449\u001b[39m, in \u001b[36mQwen2ForCausalLM.forward\u001b[39m\u001b[34m(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, cache_position, logits_to_keep, **kwargs)\u001b[39m\n\u001b[32m 417\u001b[39m \u001b[38;5;129m@can_return_tuple\u001b[39m\n\u001b[32m 418\u001b[39m \u001b[38;5;129m@auto_docstring\u001b[39m\n\u001b[32m 419\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mforward\u001b[39m(\n\u001b[32m (...)\u001b[39m\u001b[32m 430\u001b[39m **kwargs: Unpack[TransformersKwargs],\n\u001b[32m 431\u001b[39m ) -> CausalLMOutputWithPast:\n\u001b[32m 432\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33mr\u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 433\u001b[39m \u001b[33;03m Example:\u001b[39;00m\n\u001b[32m 434\u001b[39m \n\u001b[32m (...)\u001b[39m\u001b[32m 447\u001b[39m \u001b[33;03m \"Hey, are you conscious? Can you talk to me?\\nI'm not conscious, but I can talk to you.\"\u001b[39;00m\n\u001b[32m 448\u001b[39m \u001b[33;03m ```\"\"\"\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m449\u001b[39m outputs: BaseModelOutputWithPast = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 450\u001b[39m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 451\u001b[39m \u001b[43m \u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m=\u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 452\u001b[39m \u001b[43m \u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 453\u001b[39m \u001b[43m \u001b[49m\u001b[43mpast_key_values\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpast_key_values\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 454\u001b[39m \u001b[43m \u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m=\u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 455\u001b[39m \u001b[43m \u001b[49m\u001b[43muse_cache\u001b[49m\u001b[43m=\u001b[49m\u001b[43muse_cache\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 456\u001b[39m \u001b[43m \u001b[49m\u001b[43mcache_position\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcache_position\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 457\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 458\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 460\u001b[39m hidden_states = outputs.last_hidden_state\n\u001b[32m 461\u001b[39m \u001b[38;5;66;03m# Only compute necessary logits, and do not upcast them to float if we are not computing the loss\u001b[39;00m\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1775\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1773\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1774\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1775\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1786\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1781\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1782\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1783\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1784\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1785\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1786\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1788\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1789\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/utils/generic.py:1064\u001b[39m, in \u001b[36mcheck_model_inputs.<locals>.wrapper\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1061\u001b[39m monkey_patched_layers.append((module, original_forward))\n\u001b[32m 1063\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1064\u001b[39m outputs = \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1065\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m original_exception:\n\u001b[32m 1066\u001b[39m \u001b[38;5;66;03m# If we get a TypeError, it's possible that the model is not receiving the recordable kwargs correctly.\u001b[39;00m\n\u001b[32m 1067\u001b[39m \u001b[38;5;66;03m# Get a TypeError even after removing the recordable kwargs -> re-raise the original exception\u001b[39;00m\n\u001b[32m 1068\u001b[39m \u001b[38;5;66;03m# Otherwise -> we're probably missing `**kwargs` in the decorated function\u001b[39;00m\n\u001b[32m 1069\u001b[39m kwargs_without_recordable = {k: v \u001b[38;5;28;01mfor\u001b[39;00m k, v \u001b[38;5;129;01min\u001b[39;00m kwargs.items() \u001b[38;5;28;01mif\u001b[39;00m k \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m recordable_keys}\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py:384\u001b[39m, in \u001b[36mQwen2Model.forward\u001b[39m\u001b[34m(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, cache_position, **kwargs)\u001b[39m\n\u001b[32m 381\u001b[39m position_embeddings = \u001b[38;5;28mself\u001b[39m.rotary_emb(hidden_states, position_ids)\n\u001b[32m 383\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m decoder_layer \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.layers[: \u001b[38;5;28mself\u001b[39m.config.num_hidden_layers]:\n\u001b[32m--> \u001b[39m\u001b[32m384\u001b[39m hidden_states = \u001b[43mdecoder_layer\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 385\u001b[39m \u001b[43m \u001b[49m\u001b[43mhidden_states\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 386\u001b[39m \u001b[43m \u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcausal_mask_mapping\u001b[49m\u001b[43m[\u001b[49m\u001b[43mdecoder_layer\u001b[49m\u001b[43m.\u001b[49m\u001b[43mattention_type\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 387\u001b[39m \u001b[43m \u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 388\u001b[39m \u001b[43m \u001b[49m\u001b[43mpast_key_values\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpast_key_values\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 389\u001b[39m \u001b[43m \u001b[49m\u001b[43muse_cache\u001b[49m\u001b[43m=\u001b[49m\u001b[43muse_cache\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 390\u001b[39m \u001b[43m \u001b[49m\u001b[43mcache_position\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcache_position\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 391\u001b[39m \u001b[43m \u001b[49m\u001b[43mposition_embeddings\u001b[49m\u001b[43m=\u001b[49m\u001b[43mposition_embeddings\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 392\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 393\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 395\u001b[39m hidden_states = \u001b[38;5;28mself\u001b[39m.norm(hidden_states)\n\u001b[32m 396\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m BaseModelOutputWithPast(\n\u001b[32m 397\u001b[39m last_hidden_state=hidden_states,\n\u001b[32m 398\u001b[39m past_key_values=past_key_values \u001b[38;5;28;01mif\u001b[39;00m use_cache \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[32m 399\u001b[39m )\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/modeling_layers.py:94\u001b[39m, in \u001b[36mGradientCheckpointingLayer.__call__\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 91\u001b[39m logger.warning_once(message)\n\u001b[32m 93\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._gradient_checkpointing_func(partial(\u001b[38;5;28msuper\u001b[39m().\u001b[34m__call__\u001b[39m, **kwargs), *args)\n\u001b[32m---> \u001b[39m\u001b[32m94\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m.\u001b[49m\u001b[34;43m__call__\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1775\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1773\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1774\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1775\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1786\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1781\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1782\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1783\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1784\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1785\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1786\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1788\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1789\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
211
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py:172\u001b[39m, in \u001b[36mdeprecate_kwarg.<locals>.wrapper.<locals>.wrapped_func\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 168\u001b[39m \u001b[38;5;28;01melif\u001b[39;00m minimum_action \u001b[38;5;129;01min\u001b[39;00m (Action.NOTIFY, Action.NOTIFY_ALWAYS) \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m is_torchdynamo_compiling():\n\u001b[32m 169\u001b[39m \u001b[38;5;66;03m# DeprecationWarning is ignored by default, so we use FutureWarning instead\u001b[39;00m\n\u001b[32m 170\u001b[39m warnings.warn(message, \u001b[38;5;167;01mFutureWarning\u001b[39;00m, stacklevel=\u001b[32m2\u001b[39m)\n\u001b[32m--> \u001b[39m\u001b[32m172\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
212
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py:249\u001b[39m, in \u001b[36mQwen2DecoderLayer.forward\u001b[39m\u001b[34m(self, hidden_states, attention_mask, position_ids, past_key_values, use_cache, cache_position, position_embeddings, **kwargs)\u001b[39m\n\u001b[32m 247\u001b[39m residual = hidden_states\n\u001b[32m 248\u001b[39m hidden_states = \u001b[38;5;28mself\u001b[39m.post_attention_layernorm(hidden_states)\n\u001b[32m--> \u001b[39m\u001b[32m249\u001b[39m hidden_states = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmlp\u001b[49m\u001b[43m(\u001b[49m\u001b[43mhidden_states\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 250\u001b[39m hidden_states = residual + hidden_states\n\u001b[32m 251\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m hidden_states\n",
213
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1775\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1773\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1774\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1775\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
214
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1786\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1781\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1782\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1783\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1784\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1785\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1786\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1788\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1789\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
215
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py:46\u001b[39m, in \u001b[36mQwen2MLP.forward\u001b[39m\u001b[34m(self, x)\u001b[39m\n\u001b[32m 45\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mforward\u001b[39m(\u001b[38;5;28mself\u001b[39m, x):\n\u001b[32m---> \u001b[39m\u001b[32m46\u001b[39m down_proj = \u001b[38;5;28mself\u001b[39m.down_proj(\u001b[38;5;28mself\u001b[39m.act_fn(\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mgate_proj\u001b[49m\u001b[43m(\u001b[49m\u001b[43mx\u001b[49m\u001b[43m)\u001b[49m) * \u001b[38;5;28mself\u001b[39m.up_proj(x))\n\u001b[32m 47\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m down_proj\n",
216
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1775\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1773\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1774\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1775\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
217
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1786\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1781\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1782\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1783\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1784\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1785\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1786\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1788\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1789\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
218
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/nn/modules.py:1071\u001b[39m, in \u001b[36mLinear8bitLt.forward\u001b[39m\u001b[34m(self, x)\u001b[39m\n\u001b[32m 1068\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m.bias \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mself\u001b[39m.bias.dtype != x.dtype:\n\u001b[32m 1069\u001b[39m \u001b[38;5;28mself\u001b[39m.bias.data = \u001b[38;5;28mself\u001b[39m.bias.data.to(x.dtype)\n\u001b[32m-> \u001b[39m\u001b[32m1071\u001b[39m out = \u001b[43mbnb\u001b[49m\u001b[43m.\u001b[49m\u001b[43mmatmul\u001b[49m\u001b[43m(\u001b[49m\u001b[43mx\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mweight\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mbias\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mbias\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mstate\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mstate\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1073\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mself\u001b[39m.state.has_fp16_weights \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mself\u001b[39m.state.CB \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 1074\u001b[39m \u001b[38;5;28mself\u001b[39m.weight.data = \u001b[38;5;28mself\u001b[39m.state.CB\n",
219
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py:424\u001b[39m, in \u001b[36mmatmul\u001b[39m\u001b[34m(A, B, out, state, threshold, bias)\u001b[39m\n\u001b[32m 422\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m A.device.type \u001b[38;5;129;01min\u001b[39;00m (\u001b[33m\"\u001b[39m\u001b[33mcpu\u001b[39m\u001b[33m\"\u001b[39m, \u001b[33m\"\u001b[39m\u001b[33mxpu\u001b[39m\u001b[33m\"\u001b[39m):\n\u001b[32m 423\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m MatMul8bitFp.apply(A, B, out, bias, state)\n\u001b[32m--> \u001b[39m\u001b[32m424\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mMatMul8bitLt\u001b[49m\u001b[43m.\u001b[49m\u001b[43mapply\u001b[49m\u001b[43m(\u001b[49m\u001b[43mA\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mB\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mout\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mbias\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mstate\u001b[49m\u001b[43m)\u001b[49m\n",
220
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/autograd/function.py:581\u001b[39m, in \u001b[36mFunction.apply\u001b[39m\u001b[34m(cls, *args, **kwargs)\u001b[39m\n\u001b[32m 578\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m torch._C._are_functorch_transforms_active():\n\u001b[32m 579\u001b[39m \u001b[38;5;66;03m# See NOTE: [functorch vjp and autograd interaction]\u001b[39;00m\n\u001b[32m 580\u001b[39m args = _functorch.utils.unwrap_dead_wrappers(args)\n\u001b[32m--> \u001b[39m\u001b[32m581\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m.\u001b[49m\u001b[43mapply\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 583\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m is_setup_ctx_defined:\n\u001b[32m 584\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\n\u001b[32m 585\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mIn order to use an autograd.Function with functorch transforms \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 586\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m(vmap, grad, jvp, jacrev, ...), it must override the setup_context \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 587\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mstaticmethod. For more details, please see \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 588\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mhttps://pytorch.org/docs/main/notes/extending.func.html\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 589\u001b[39m )\n",
221
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py:192\u001b[39m, in \u001b[36mMatMul8bitLt.forward\u001b[39m\u001b[34m(ctx, A, B, out, bias, state)\u001b[39m\n\u001b[32m 189\u001b[39m CA, CAt, SCA, SCAt, outlier_cols = F.int8_double_quant(A.to(torch.float16), threshold=state.threshold)\n\u001b[32m 190\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m 191\u001b[39m \u001b[38;5;66;03m# Fast path\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m192\u001b[39m CA, SCA, outlier_cols = \u001b[43mF\u001b[49m\u001b[43m.\u001b[49m\u001b[43mint8_vectorwise_quant\u001b[49m\u001b[43m(\u001b[49m\u001b[43mA\u001b[49m\u001b[43m.\u001b[49m\u001b[43mto\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtorch\u001b[49m\u001b[43m.\u001b[49m\u001b[43mfloat16\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mthreshold\u001b[49m\u001b[43m=\u001b[49m\u001b[43mstate\u001b[49m\u001b[43m.\u001b[49m\u001b[43mthreshold\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 193\u001b[39m CAt = SCAt = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 195\u001b[39m has_grad = \u001b[38;5;28;01mFalse\u001b[39;00m\n",
222
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/functional.py:2058\u001b[39m, in \u001b[36mint8_vectorwise_quant\u001b[39m\u001b[34m(A, threshold)\u001b[39m\n\u001b[32m 2040\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mint8_vectorwise_quant\u001b[39m(A: torch.Tensor, threshold=\u001b[32m0.0\u001b[39m):\n\u001b[32m 2041\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33;03m\"\"\"Quantizes a tensor with dtype `torch.float16` to `torch.int8` in accordance to the `LLM.int8()` algorithm.\u001b[39;00m\n\u001b[32m 2042\u001b[39m \n\u001b[32m 2043\u001b[39m \u001b[33;03m For more information, see the [LLM.int8() paper](https://arxiv.org/abs/2208.07339).\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 2056\u001b[39m \u001b[33;03m - `torch.Tensor` with dtype `torch.int32`, *optional*: A list of column indices which contain outlier features.\u001b[39;00m\n\u001b[32m 2057\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m2058\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mtorch\u001b[49m\u001b[43m.\u001b[49m\u001b[43mops\u001b[49m\u001b[43m.\u001b[49m\u001b[43mbitsandbytes\u001b[49m\u001b[43m.\u001b[49m\u001b[43mint8_vectorwise_quant\u001b[49m\u001b[43m.\u001b[49m\u001b[43mdefault\u001b[49m\u001b[43m(\u001b[49m\u001b[43mA\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mthreshold\u001b[49m\u001b[43m)\u001b[49m\n",
223
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/_ops.py:841\u001b[39m, in \u001b[36mOpOverload.__call__\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 840\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m__call__\u001b[39m(\u001b[38;5;28mself\u001b[39m, /, *args: _P.args, **kwargs: _P.kwargs) -> _T:\n\u001b[32m--> \u001b[39m\u001b[32m841\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_op\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
224
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/_compile.py:53\u001b[39m, in \u001b[36m_disable_dynamo.<locals>.inner\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 50\u001b[39m disable_fn = torch._dynamo.disable(fn, recursive, wrapping=\u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[32m 51\u001b[39m fn.__dynamo_disable = disable_fn \u001b[38;5;66;03m# type: ignore[attr-defined]\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m53\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mdisable_fn\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
225
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:1044\u001b[39m, in \u001b[36mDisableContext.__call__.<locals>._fn\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 1042\u001b[39m _maybe_set_eval_frame(_callback_from_stance(\u001b[38;5;28mself\u001b[39m.callback))\n\u001b[32m 1043\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1044\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfn\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1045\u001b[39m \u001b[38;5;28;01mfinally\u001b[39;00m:\n\u001b[32m 1046\u001b[39m set_eval_frame(\u001b[38;5;28;01mNone\u001b[39;00m)\n",
226
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/library.py:732\u001b[39m, in \u001b[36m_impl.<locals>.register_.<locals>.func_no_dynamo\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 730\u001b[39m \u001b[38;5;129m@torch\u001b[39m._disable_dynamo\n\u001b[32m 731\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mfunc_no_dynamo\u001b[39m(*args, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m732\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
227
+ "\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/backends/cuda/ops.py:148\u001b[39m, in \u001b[36m_\u001b[39m\u001b[34m(A, threshold)\u001b[39m\n\u001b[32m 145\u001b[39m outlier_cols = torch.argwhere(outliers.any(dim=\u001b[32m0\u001b[39m)).view(-\u001b[32m1\u001b[39m)\n\u001b[32m 146\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m 147\u001b[39m \u001b[38;5;66;03m# Needed for torch.compile support.\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m148\u001b[39m outlier_cols = \u001b[43mtorch\u001b[49m\u001b[43m.\u001b[49m\u001b[43mempty\u001b[49m\u001b[43m(\u001b[49m\u001b[32;43m0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdevice\u001b[49m\u001b[43m=\u001b[49m\u001b[43mA\u001b[49m\u001b[43m.\u001b[49m\u001b[43mdevice\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtorch\u001b[49m\u001b[43m.\u001b[49m\u001b[43mint64\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 150\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m _cuda_device_of(A):\n\u001b[32m 151\u001b[39m lib.cint8_vector_quant(\n\u001b[32m 152\u001b[39m get_ptr(A),\n\u001b[32m 153\u001b[39m get_ptr(out_row),\n\u001b[32m (...)\u001b[39m\u001b[32m 158\u001b[39m _get_tensor_stream(A),\n\u001b[32m 159\u001b[39m )\n",
228
+ "\u001b[31mKeyboardInterrupt\u001b[39m: "
229
+ ]
230
+ }
231
+ ],
232
+ "source": [
233
+ "import pandas as pd\n",
234
+ "import json\n",
235
+ "import time\n",
236
+ "import re\n",
237
+ "from pathlib import Path\n",
238
+ "from tqdm import tqdm\n",
239
+ "import torch\n",
240
+ "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
241
+ "import signal\n",
242
+ "from contextlib import contextmanager\n",
243
+ "\n",
244
+ "current_dir = Path.cwd()\n",
245
+ "input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
246
+ "professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
247
+ "professions_mapped_file = current_dir.parent / \"misc/lists/professions_mapped.csv\"\n",
248
+ "# === PROCESS DATA ===\n",
249
+ "\n",
250
+ "\n",
251
+ "# === CONFIGURATION ===\n",
252
+ "TEST_MODE = False\n",
253
+ "TEST_SIZE = 100\n",
254
+ "MAX_ROWS = 50862\n",
255
+ "SAVE_INTERVAL = 10\n",
256
+ "\n",
257
+ "\n",
258
+ "index_file = current_dir.parent / \"misc/query_indicies/qwen_local_query_index.txt\"\n",
259
+ "output_file = current_dir.parent / f\"data/CSV/qwen_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
260
+ "\n",
261
+ "# Model settings\n",
262
+ "MODEL_NAME = \"Qwen/Qwen2.5-32B-Instruct\"\n",
263
+ "#MODEL_NAME = \"Qwen/Qwen2.5-14B-Instruct\"\n",
264
+ "#MODEL_NAME = \"Qwen/Qwen3-235B-A22B-Instruct-2507-FP8\"\n",
265
+ "#MODEL_NAME = \"mistralai/Mistral-7B-Instruct-v0.3\"\n",
266
+ "CACHE_DIR = current_dir.parent / \"data/models\"\n",
267
+ "CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
268
+ "\n",
269
+ "# Define the SPECIFIC profession categories\n",
270
+ "PROFESSION_CATEGORIES = [\n",
271
+ " \"actor\",\n",
272
+ " \"adult performer\",\n",
273
+ " \"singer/musician\",\n",
274
+ " \"model\",\n",
275
+ " \"online personality\",\n",
276
+ " \"public figure\",\n",
277
+ " \"voice actor/ASMR\",\n",
278
+ " \"sports professional\",\n",
279
+ " \"tv personality\"\n",
280
+ "]\n",
281
+ "\n",
282
+ "# === LOAD MODEL ===\n",
283
+ "print(f\"Loading model: {MODEL_NAME}\")\n",
284
+ "print(f\"Cache directory: {CACHE_DIR}\")\n",
285
+ "print(f\"This may take a while on first run (~65GB download)...\\n\")\n",
286
+ "\n",
287
+ "# Check GPU availability\n",
288
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
289
+ "print(f\"Device: {device}\")\n",
290
+ "\n",
291
+ "if device == \"cpu\":\n",
292
+ " print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
293
+ " print(\" Consider using a GPU or reducing model size.\")\n",
294
+ "\n",
295
+ "# Load tokenizer\n",
296
+ "print(\"Loading tokenizer...\")\n",
297
+ "try:\n",
298
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
299
+ " MODEL_NAME,\n",
300
+ " cache_dir=str(CACHE_DIR),\n",
301
+ " use_fast=True\n",
302
+ " )\n",
303
+ "except Exception as e:\n",
304
+ " print(f\"Failed with use_fast=True, trying use_fast=False...\")\n",
305
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
306
+ " MODEL_NAME,\n",
307
+ " cache_dir=str(CACHE_DIR),\n",
308
+ " use_fast=False\n",
309
+ " )\n",
310
+ "\n",
311
+ "# Ensure pad token is set\n",
312
+ "if tokenizer.pad_token is None:\n",
313
+ " tokenizer.pad_token = tokenizer.eos_token\n",
314
+ "\n",
315
+ "print(\"✅ Tokenizer loaded\")\n",
316
+ "\n",
317
+ "# Configure 8-bit quantization for A100\n",
318
+ "print(\"Configuring 8-bit quantization...\")\n",
319
+ "quantization_config = BitsAndBytesConfig(\n",
320
+ " load_in_8bit=True,\n",
321
+ " llm_int8_threshold=6.0,\n",
322
+ " llm_int8_has_fp16_weight=False\n",
323
+ ")\n",
324
+ "\n",
325
+ "# Load model with 8-bit quantization\n",
326
+ "print(\"Loading model with 8-bit quantization (this may take several minutes)...\")\n",
327
+ "model = AutoModelForCausalLM.from_pretrained(\n",
328
+ " MODEL_NAME,\n",
329
+ " cache_dir=str(CACHE_DIR),\n",
330
+ " quantization_config=quantization_config,\n",
331
+ " device_map=\"auto\",\n",
332
+ " trust_remote_code=False\n",
333
+ ")\n",
334
+ "model.eval()\n",
335
+ "print(\"✅ Model loaded with 8-bit quantization\")\n",
336
+ "\n",
337
+ "# Check VRAM usage\n",
338
+ "if torch.cuda.is_available():\n",
339
+ " vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
340
+ " print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
341
+ "\n",
342
+ "# === LOAD DATA ===\n",
343
+ "if output_file.exists():\n",
344
+ " print(\"Loading annotated CSV...\")\n",
345
+ " df = pd.read_csv(output_file)\n",
346
+ "else:\n",
347
+ " print(\"Loading raw input CSV...\")\n",
348
+ " df = pd.read_csv(input_file)\n",
349
+ "\n",
350
+ "\n",
351
+ "# Try to load profession mapping files\n",
352
+ "try:\n",
353
+ " professions_df = pd.read_csv(professions_file)\n",
354
+ " print(f\"✅ Loaded professions.csv\")\n",
355
+ "except:\n",
356
+ " print(\"⚠️ Warning: professions.csv not found\")\n",
357
+ "\n",
358
+ "try:\n",
359
+ " prof_mapped_df = pd.read_csv(professions_mapped_file)\n",
360
+ " print(f\"✅ Loaded profession mapping with {len(prof_mapped_df)} categories\")\n",
361
+ "except:\n",
362
+ " print(\"⚠️ Warning: professions_mapped.csv not found, using default categories\")\n",
363
+ "\n",
364
+ "profession_str = \", \".join(PROFESSION_CATEGORIES)\n",
365
+ "\n",
366
+ "print(f\"Loaded {len(df)} rows\")\n",
367
+ "print(f\"\\nProfession categories ({len(PROFESSION_CATEGORIES)}):\")\n",
368
+ "for cat in PROFESSION_CATEGORIES:\n",
369
+ " print(f\" - {cat}\")\n",
370
+ "\n",
371
+ "if TEST_MODE:\n",
372
+ " print(f\"\\nRunning in TEST MODE with {TEST_SIZE} samples\")\n",
373
+ " df = df.head(TEST_SIZE).copy()\n",
374
+ "elif MAX_ROWS:\n",
375
+ " df = df.head(MAX_ROWS).copy()\n",
376
+ "\n",
377
+ "# === CREATE PROMPTS (OPTIMIZED FOR CLEAN OUTPUTS) ===\n",
378
+ "def create_prompt(row):\n",
379
+ " \"\"\"Create prompt for Qwen annotation with strict formatting requirements.\"\"\"\n",
380
+ " name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
381
+ " \n",
382
+ " # Gather hints\n",
383
+ " hints = []\n",
384
+ " if pd.notna(row.get('likely_profession')):\n",
385
+ " hints.append(str(row['likely_profession']))\n",
386
+ " if pd.notna(row.get('likely_nationality')):\n",
387
+ " hints.append(str(row['likely_nationality']))\n",
388
+ " if pd.notna(row.get('likely_country')):\n",
389
+ " hints.append(str(row['likely_country']))\n",
390
+ " \n",
391
+ " # Add tags if we don't have enough hints\n",
392
+ " if len(hints) < 3:\n",
393
+ " for i in range(1, 8):\n",
394
+ " tag_col = f'tag_{i}'\n",
395
+ " if tag_col in row and pd.notna(row[tag_col]):\n",
396
+ " tag_val = str(row[tag_col])\n",
397
+ " if tag_val not in hints:\n",
398
+ " hints.append(tag_val)\n",
399
+ " if len(hints) >= 5:\n",
400
+ " break\n",
401
+ " \n",
402
+ " hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
403
+ " \n",
404
+ " return f\"\"\"Extract information about '{name}' ({hint_text}).\n",
405
+ "\n",
406
+ "Respond with EXACTLY 5 numbered lines. Each line must contain ONLY the value, no labels or extra text.\n",
407
+ "\n",
408
+ "FORMAT REQUIREMENTS:\n",
409
+ "1. Full legal name in Western order (first last). VALUE ONLY.\n",
410
+ "2. Stage names/aliases, comma-separated. If none, write \"None\". VALUE ONLY.\n",
411
+ "3. Gender: MUST be exactly one word: Male, Female, Other, or Unknown. VALUE ONLY.\n",
412
+ "4. Professions: Choose up to 3 from this list ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality. Comma-separated. VALUE ONLY.\n",
413
+ "5. Primary country: Country name only (e.g., \"China\", \"United States\", \"Colombia\"). VALUE ONLY.\n",
414
+ "\n",
415
+ "RULES:\n",
416
+ "- Professions MUST match the exact categories listed (actress = actor)\n",
417
+ "- \"online personality\" includes streamers, cosplayers, YouTubers, influencers\n",
418
+ "- \"public figure\" includes politicians, activists, journalists, authors\n",
419
+ "- Use \"Unknown\" when uncertain or for fictional characters\n",
420
+ "- NO explanatory text, NO labels like \"Gender:\", NO prefixes\n",
421
+ "- For multi-role people, list up to 3 categories by relevance\n",
422
+ "\n",
423
+ "EXAMPLE FORMAT:\n",
424
+ "1. Taylor Swift\n",
425
+ "2. None\n",
426
+ "3. Female\n",
427
+ "4. singer/musician, public figure\n",
428
+ "5. United States\"\"\"\n",
429
+ "\n",
430
+ "# Create prompts\n",
431
+ "print(\"\\nCreating prompts...\")\n",
432
+ "df['prompt'] = df.apply(create_prompt, axis=1)\n",
433
+ "print(\"✅ Prompts created\")\n",
434
+ "\n",
435
+ "@contextmanager\n",
436
+ "def timeout(duration):\n",
437
+ " def handler(signum, frame):\n",
438
+ " raise TimeoutError(\"Operation timed out\")\n",
439
+ " \n",
440
+ " # Set the signal handler and alarm\n",
441
+ " signal.signal(signal.SIGALRM, handler)\n",
442
+ " signal.alarm(duration)\n",
443
+ " try:\n",
444
+ " yield\n",
445
+ " finally:\n",
446
+ " signal.alarm(0) # Disable the alarm\n",
447
+ "\n",
448
+ "\n",
449
+ "def query_qwen_local(prompt: str) -> str:\n",
450
+ " \"\"\"Query Qwen locally via transformers.\"\"\"\n",
451
+ " try:\n",
452
+ " # Format as chat message for Qwen with strict system prompt\n",
453
+ " messages = [\n",
454
+ " {\"role\": \"system\", \"content\": \"You are a data extraction assistant. Respond with exactly 5 numbered lines containing ONLY values. No labels, no explanations, no prefixes. Follow the format precisely.\"},\n",
455
+ " {\"role\": \"user\", \"content\": prompt}\n",
456
+ " ]\n",
457
+ " \n",
458
+ " # Tokenize\n",
459
+ " if hasattr(tokenizer, 'apply_chat_template'):\n",
460
+ " text = tokenizer.apply_chat_template(\n",
461
+ " messages,\n",
462
+ " tokenize=False,\n",
463
+ " add_generation_prompt=True\n",
464
+ " )\n",
465
+ " else:\n",
466
+ " # Fallback for older tokenizers\n",
467
+ " text = f\"[INST] {prompt} [/INST]\"\n",
468
+ " \n",
469
+ " inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
470
+ " \n",
471
+ " # Generate with timeout\n",
472
+ " with timeout(60):\n",
473
+ " with torch.no_grad():\n",
474
+ " outputs = model.generate(\n",
475
+ " **inputs,\n",
476
+ " max_new_tokens=100,\n",
477
+ " temperature=0.1,\n",
478
+ " do_sample=False,\n",
479
+ " pad_token_id=tokenizer.eos_token_id\n",
480
+ " )\n",
481
+ " \n",
482
+ " # Decode\n",
483
+ " generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
484
+ " response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
485
+ " \n",
486
+ " return response.strip()\n",
487
+ " \n",
488
+ " except TimeoutError:\n",
489
+ " print(f\"[ERROR] Generation timed out after 60 seconds\")\n",
490
+ " return None\n",
491
+ " except Exception as e:\n",
492
+ " print(f\"Generation error: {e}\")\n",
493
+ " import traceback\n",
494
+ " traceback.print_exc()\n",
495
+ " return None\n",
496
+ "\n",
497
+ " \n",
498
+ "# === PARSE RESPONSE WITH CLEANING ===\n",
499
+ "def parse_response(response):\n",
500
+ " \"\"\"Parse Qwen response into structured fields with cleaning.\"\"\"\n",
501
+ " if not response:\n",
502
+ " return {\n",
503
+ " 'full_name': 'Unknown',\n",
504
+ " 'aliases': 'Unknown',\n",
505
+ " 'gender': 'Unknown',\n",
506
+ " 'profession_llm': 'Unknown',\n",
507
+ " 'country': 'Unknown'\n",
508
+ " }\n",
509
+ " \n",
510
+ " # Split into lines and clean\n",
511
+ " lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
512
+ " \n",
513
+ " # Initialize with Unknown values\n",
514
+ " fields = {\n",
515
+ " 'full_name': 'Unknown',\n",
516
+ " 'aliases': 'Unknown',\n",
517
+ " 'gender': 'Unknown',\n",
518
+ " 'profession_llm': 'Unknown',\n",
519
+ " 'country': 'Unknown'\n",
520
+ " }\n",
521
+ " \n",
522
+ " # Extract information from each numbered line\n",
523
+ " for line in lines:\n",
524
+ " if line.startswith('1.'):\n",
525
+ " fields['full_name'] = line[2:].strip()\n",
526
+ " elif line.startswith('2.'):\n",
527
+ " fields['aliases'] = line[2:].strip()\n",
528
+ " elif line.startswith('3.'):\n",
529
+ " # Clean gender field - remove any labels\n",
530
+ " gender_raw = line[2:].strip()\n",
531
+ " # Remove common prefixes\n",
532
+ " gender_raw = re.sub(r'^(Gender:|gender:)\\s*', '', gender_raw, flags=re.IGNORECASE)\n",
533
+ " # Extract just the gender word\n",
534
+ " gender_match = re.search(r'\\b(Male|Female|Other|Unknown)\\b', gender_raw, re.IGNORECASE)\n",
535
+ " fields['gender'] = gender_match.group(1).capitalize() if gender_match else gender_raw\n",
536
+ " elif line.startswith('4.'):\n",
537
+ " fields['profession_llm'] = line[2:].strip()\n",
538
+ " elif line.startswith('5.'):\n",
539
+ " # Clean country field - remove any labels\n",
540
+ " country_raw = line[2:].strip()\n",
541
+ " # Remove common prefixes like \"Primary country:\", \"Country:\", etc.\n",
542
+ " country_raw = re.sub(r'^(Primary\\s+)?(associated\\s+)?country:\\s*', '', country_raw, flags=re.IGNORECASE)\n",
543
+ " fields['country'] = country_raw\n",
544
+ " \n",
545
+ " return fields\n",
546
+ "\n",
547
+ "# === PROCESS DATA ===\n",
548
+ "index_file.parent.mkdir(parents=True, exist_ok=True)\n",
549
+ "\n",
550
+ "# Load index\n",
551
+ "current_index = 0\n",
552
+ "if index_file.exists():\n",
553
+ " try:\n",
554
+ " current_index = int(index_file.read_text().strip())\n",
555
+ " except:\n",
556
+ " current_index = 0\n",
557
+ "\n",
558
+ "print(f\"Resuming from index {current_index}\")\n",
559
+ "\n",
560
+ "start_time = time.time()\n",
561
+ "\n",
562
+ "for i in tqdm(range(current_index, len(df)), desc=\"Qwen Local\"):\n",
563
+ "\n",
564
+ " prompt = df.at[i, \"prompt\"]\n",
565
+ "\n",
566
+ " # -------- MODEL QUERY WITH RETRIES --------\n",
567
+ " response = None\n",
568
+ " for attempt in range(3):\n",
569
+ " response = query_qwen_local(prompt)\n",
570
+ " \n",
571
+ " # Valid response?\n",
572
+ " if response and len(response.strip()) > 10:\n",
573
+ " break\n",
574
+ " \n",
575
+ " print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
576
+ " time.sleep(0.5)\n",
577
+ "\n",
578
+ " # If still invalid → DO NOT overwrite previous data\n",
579
+ " if not response or len(response.strip()) <= 10:\n",
580
+ " print(f\"❌ Row {i}: failed after retries, not writing, not advancing index\")\n",
581
+ " continue\n",
582
+ "\n",
583
+ " parsed = parse_response(response)\n",
584
+ "\n",
585
+ " # Additional safety: skip rows that parsed as all 'Unknown'\n",
586
+ " if all(v == \"Unknown\" for v in parsed.values()):\n",
587
+ " print(f\"❌ Row {i}: parsed as all Unknown (likely model crash); skipping.\")\n",
588
+ " continue\n",
589
+ "\n",
590
+ " # -------- WRITE PARSED FIELDS SAFELY --------\n",
591
+ " for key, value in parsed.items():\n",
592
+ " df.at[i, key] = value\n",
593
+ "\n",
594
+ " # Advance progress ONLY after successful write\n",
595
+ " current_index = i + 1\n",
596
+ "\n",
597
+ " # -------- GPU MEMORY CLEANUP --------\n",
598
+ " if torch.cuda.is_available():\n",
599
+ " torch.cuda.empty_cache()\n",
600
+ " torch.cuda.synchronize()\n",
601
+ "\n",
602
+ " # -------- SAVE LIKE YOUR DEEPSEEK VERSION --------\n",
603
+ " if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
604
+ " df.to_csv(output_file, index=False)\n",
605
+ " with open(index_file, \"w\") as f:\n",
606
+ " f.write(str(current_index))\n",
607
+ " print(f\"💾 Progress saved after row {i+1}\")\n",
608
+ "\n",
609
+ "# Final save\n",
610
+ "df.to_csv(output_file, index=False)\n",
611
+ "index_file.write_text(str(current_index))\n",
612
+ "print(\"✅ Finished full dataset.\")"
613
+ ]
614
+ },
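A minimal, self-contained sketch (illustration only, not part of the notebook) of how the strict 5-line response format maps onto the parsed fields; parse_response in the cell above does the same, plus regex label-stripping for the gender and country lines:

    # Hypothetical sample reply in the format the prompt enforces
    sample = """1. Taylor Swift
    2. None
    3. Female
    4. singer/musician, public figure
    5. United States"""

    keys = ["full_name", "aliases", "gender", "profession_llm", "country"]
    fields = {k: "Unknown" for k in keys}  # defaults match parse_response
    for line in sample.splitlines():
        line = line.strip()
        for n, key in enumerate(keys, start=1):
            prefix = f"{n}."
            if line.startswith(prefix):
                fields[key] = line[len(prefix):].strip()

    print(fields["gender"], "|", fields["country"])  # Female | United States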
615
+ {
616
+ "cell_type": "code",
617
+ "execution_count": null,
618
+ "id": "d9c7deb9-847a-472d-8055-f93dbfa6aa2e",
619
+ "metadata": {},
620
+ "outputs": [],
621
+ "source": []
622
+ }
623
+ ],
624
+ "metadata": {
625
+ "kernelspec": {
626
+ "display_name": "pm-paper",
627
+ "language": "python",
628
+ "name": "pm-paper"
629
+ },
630
+ "language_info": {
631
+ "codemirror_mode": {
632
+ "name": "ipython",
633
+ "version": 3
634
+ },
635
+ "file_extension": ".py",
636
+ "mimetype": "text/x-python",
637
+ "name": "python",
638
+ "nbconvert_exporter": "python",
639
+ "pygments_lexer": "ipython3",
640
+ "version": "3.11.13"
641
+ }
642
+ },
643
+ "nbformat": 4,
644
+ "nbformat_minor": 5
645
+ }
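A back-of-envelope check of the memory figures assumed in the cell above (a hedged estimate, not a measurement): at 8-bit quantization, weights take roughly one byte per parameter, so Qwen2.5-32B needs on the order of 30 GB of VRAM for weights alone, before KV cache and activations; the ~65 GB download plausibly corresponds to the bf16 checkpoint at ~2 bytes per parameter.

    params = 32e9                  # nominal parameter count, Qwen2.5-32B
    weights_gb = params / 1024**3  # ~1 byte per parameter at int8
    print(f"~{weights_gb:.0f} GB of int8 weights")  # ~30 GB, weights only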
jupyter_notebooks/.ipynb_checkpoints/Section_2-2-2_Figure_4_Demographic_patterns_in_gen_images-checkpoint.ipynb ADDED
@@ -0,0 +1,1795 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "d8161f06-4fd2-436d-9e59-3f68b5a67f2c",
6
+ "metadata": {
7
+ "execution": {
8
+ "iopub.execute_input": "2025-02-06T18:30:16.974712Z",
9
+ "iopub.status.busy": "2025-02-06T18:30:16.974296Z",
10
+ "iopub.status.idle": "2025-02-06T18:30:16.976909Z",
11
+ "shell.execute_reply": "2025-02-06T18:30:16.976526Z",
12
+ "shell.execute_reply.started": "2025-02-06T18:30:16.974692Z"
13
+ }
14
+ },
15
+ "source": [
16
+ "# Section 6.2: Age and Gender Estimation using MiVOLO"
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "markdown",
21
+ "id": "64aaedec-ef56-4a62-b61e-12de2675a1ae",
22
+ "metadata": {
23
+ "execution": {
24
+ "iopub.execute_input": "2025-02-06T19:52:51.171282Z",
25
+ "iopub.status.busy": "2025-02-06T19:52:51.170711Z",
26
+ "iopub.status.idle": "2025-02-06T19:52:55.405039Z",
27
+ "shell.execute_reply": "2025-02-06T19:52:55.404308Z",
28
+ "shell.execute_reply.started": "2025-02-06T19:52:51.171245Z"
29
+ }
30
+ },
31
+ "source": [
32
+ "![Alt text](../plots/mivolo.svg)"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "code",
37
+ "execution_count": 1,
38
+ "id": "4293f307-44fd-455e-90fe-6e6928be9af5",
39
+ "metadata": {
40
+ "execution": {
41
+ "iopub.execute_input": "2025-02-08T21:59:21.970807Z",
42
+ "iopub.status.busy": "2025-02-08T21:59:21.969931Z",
43
+ "iopub.status.idle": "2025-02-08T22:00:09.724295Z",
44
+ "shell.execute_reply": "2025-02-08T22:00:09.723583Z",
45
+ "shell.execute_reply.started": "2025-02-08T21:59:21.970784Z"
46
+ }
47
+ },
48
+ "outputs": [
49
+ {
50
+ "name": "stderr",
51
+ "output_type": "stream",
52
+ "text": [
53
+ "/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
54
+ " from .autonotebook import tqdm as notebook_tqdm\n"
55
+ ]
56
+ }
57
+ ],
58
+ "source": [
59
+ "import csv\n",
60
+ "from pathlib import Path\n",
61
+ "import logging\n",
62
+ "import os\n",
63
+ "import pandas as pd\n",
64
+ "import requests\n",
65
+ "import numpy as np\n",
66
+ "import torch\n",
67
+ "import cv2\n",
68
+ "from io import BytesIO\n",
69
+ "from PIL import Image, UnidentifiedImageError\n",
70
+ "from datetime import datetime, timedelta\n",
71
+ "from dateutil.relativedelta import relativedelta\n",
72
+ "from mivolo.predictor import Predictor\n",
73
+ "import matplotlib.pyplot as plt\n",
74
+ "import matplotlib.patches as mpatches"
75
+ ]
76
+ },
77
+ {
78
+ "cell_type": "code",
79
+ "execution_count": 2,
80
+ "id": "63a54f5f-900c-48dd-8932-2632e56c5670",
81
+ "metadata": {
82
+ "execution": {
83
+ "iopub.execute_input": "2025-02-08T22:00:09.726069Z",
84
+ "iopub.status.busy": "2025-02-08T22:00:09.725699Z",
85
+ "iopub.status.idle": "2025-02-08T22:00:09.730626Z",
86
+ "shell.execute_reply": "2025-02-08T22:00:09.730099Z",
87
+ "shell.execute_reply.started": "2025-02-08T22:00:09.726050Z"
88
+ }
89
+ },
90
+ "outputs": [],
91
+ "source": [
92
+ "current_dir = Path.cwd()\n",
93
+ "mini = current_dir.parent / 'data/CSV/image_subsets/Civiverse-mini.csv'\n",
94
+ "mivolo_in = current_dir.parent / 'data/CSV/image_subsets/Civiverse-mini-by-month/'\n",
95
+ "(current_dir.parent / 'data/CSV/image_subsets/Civiverse-mini-by-month/').mkdir(parents=True, exist_ok=True)"
96
+ ]
97
+ },
98
+ {
99
+ "cell_type": "code",
100
+ "execution_count": 3,
101
+ "id": "1fdfb89a-6094-4382-8755-fae213221ea5",
102
+ "metadata": {
103
+ "execution": {
104
+ "iopub.execute_input": "2025-02-08T22:00:09.731540Z",
105
+ "iopub.status.busy": "2025-02-08T22:00:09.731359Z",
106
+ "iopub.status.idle": "2025-02-08T22:00:09.825738Z",
107
+ "shell.execute_reply": "2025-02-08T22:00:09.825258Z",
108
+ "shell.execute_reply.started": "2025-02-08T22:00:09.731524Z"
109
+ }
110
+ },
111
+ "outputs": [],
112
+ "source": [
113
+ "def split_by_month(input_path, output_dir):\n",
114
+ " # Load the dataset\n",
115
+ " df = pd.read_csv(input_path)\n",
116
+ " \n",
117
+ " # Convert the 'createdAt' column to datetime\n",
118
+ " df['createdAt'] = pd.to_datetime(df['createdAt'], errors='coerce')\n",
119
+ " \n",
120
+ " # Extract year and month\n",
121
+ " df['year_month'] = df['createdAt'].dt.to_period('M')\n",
122
+ " \n",
123
+ " # Group the data by year and month and save each group as a CSV file\n",
124
+ " unique_months = df['year_month'].unique()\n",
125
+ "\n",
126
+ " for month in unique_months:\n",
127
+ " # Filter data for the specific month\n",
128
+ " df_month = df[df['year_month'] == month]\n",
129
+ " \n",
130
+ " # Define the file name based on the year and month\n",
131
+ " file_name = f'{output_dir}/Civiverse-{month}.csv'\n",
132
+ " \n",
133
+ " # Save the file\n",
134
+ " df_month.to_csv(file_name, index=False)\n",
135
+ "\n",
136
+ " print(f\"Data has been split and saved to {output_dir}\")"
137
+ ]
138
+ },
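A self-contained miniature (illustrative dates, not from the dataset) of the grouping that split_by_month performs; the output files then follow the Civiverse-YYYY-MM.csv pattern seen in the logs below:

    import pandas as pd

    df = pd.DataFrame({"createdAt": ["2022-11-03", "2022-11-20", "2023-01-05"]})
    df["createdAt"] = pd.to_datetime(df["createdAt"], errors="coerce")
    df["year_month"] = df["createdAt"].dt.to_period("M")
    for month, df_month in df.groupby("year_month"):
        print(f"Civiverse-{month}.csv -> {len(df_month)} rows")
    # Civiverse-2022-11.csv -> 2 rows
    # Civiverse-2023-01.csv -> 1 rows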
139
+ {
140
+ "cell_type": "code",
141
+ "execution_count": 4,
142
+ "id": "2c909d7c-7d16-4dc7-8364-7f1c0784414c",
143
+ "metadata": {
144
+ "execution": {
145
+ "iopub.execute_input": "2025-02-08T22:00:09.827095Z",
146
+ "iopub.status.busy": "2025-02-08T22:00:09.826919Z",
147
+ "iopub.status.idle": "2025-02-08T22:00:10.479484Z",
148
+ "shell.execute_reply": "2025-02-08T22:00:10.478777Z",
149
+ "shell.execute_reply.started": "2025-02-08T22:00:09.827079Z"
150
+ }
151
+ },
152
+ "outputs": [
153
+ {
154
+ "name": "stderr",
155
+ "output_type": "stream",
156
+ "text": [
157
+ "/sctmp/lauwag/ipykernel_1497673/1825509207.py:9: UserWarning: Converting to PeriodArray/Index representation will drop timezone information.\n",
158
+ " df['year_month'] = df['createdAt'].dt.to_period('M')\n"
159
+ ]
160
+ },
161
+ {
162
+ "name": "stdout",
163
+ "output_type": "stream",
164
+ "text": [
165
+ "Data has been split and saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month\n"
166
+ ]
167
+ }
168
+ ],
169
+ "source": [
170
+ "split_by_month(mini, mivolo_in)"
171
+ ]
172
+ },
173
+ {
174
+ "cell_type": "code",
175
+ "execution_count": 5,
176
+ "id": "4c543306-ffc8-4b9c-a3df-b49b2271caa9",
177
+ "metadata": {
178
+ "execution": {
179
+ "iopub.execute_input": "2025-02-08T22:00:10.480505Z",
180
+ "iopub.status.busy": "2025-02-08T22:00:10.480310Z",
181
+ "iopub.status.idle": "2025-02-08T22:00:10.483961Z",
182
+ "shell.execute_reply": "2025-02-08T22:00:10.483400Z",
183
+ "shell.execute_reply.started": "2025-02-08T22:00:10.480486Z"
184
+ }
185
+ },
186
+ "outputs": [],
187
+ "source": [
188
+ "mivolo_out = current_dir.parent / 'data/CSV/image_subsets/MiVOLO-results/'\n",
189
+ "mivolo_out.mkdir(parents=True, exist_ok=True) # Create the output directory if it doesn't exist"
190
+ ]
191
+ },
192
+ {
193
+ "cell_type": "markdown",
194
+ "id": "ffb7dd23",
195
+ "metadata": {},
196
+ "source": [
197
+ "## MiVOLO gender and age inference"
198
+ ]
199
+ },
200
+ {
201
+ "cell_type": "code",
202
+ "execution_count": null,
203
+ "id": "304ed12f-c7b6-4129-b24d-7ccc793a62c7",
204
+ "metadata": {
205
+ "execution": {
206
+ "iopub.execute_input": "2025-02-08T22:00:10.484802Z",
207
+ "iopub.status.busy": "2025-02-08T22:00:10.484639Z"
208
+ }
209
+ },
210
+ "outputs": [
211
+ {
212
+ "name": "stderr",
213
+ "output_type": "stream",
214
+ "text": [
215
+ "/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/ultralytics/nn/tasks.py:634: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
216
+ " return torch.load(file, map_location=\"cpu\"), file # load\n"
217
+ ]
218
+ },
219
+ {
220
+ "name": "stdout",
221
+ "output_type": "stream",
222
+ "text": [
223
+ "Model summary (fused): 268 layers, 68125494 parameters, 0 gradients, 257.4 GFLOPs\n"
224
+ ]
225
+ },
226
+ {
227
+ "name": "stderr",
228
+ "output_type": "stream",
229
+ "text": [
230
+ "[W208 23:00:15.738708520 NNPACK.cpp:61] Could not initialize NNPACK! Reason: Unsupported hardware.\n",
231
+ "/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/mivolo/model/mi_volo.py:33: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
232
+ " state = torch.load(ckpt_path, map_location=\"cpu\")\n",
233
+ "INFO:MiVOLO:Model meta:\n",
234
+ "min_age: 1, max_age: 95, avg_age: 48.0, num_classes: 3, in_chans: 6, with_persons_model: True, disable_faces: False, use_persons: True, only_age: False, num_classes_gender: 2, input_size: 224, use_person_crops: True, use_face_crops: True\n",
235
+ "/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/timm/models/_helpers.py:39: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
236
+ " checkpoint = torch.load(checkpoint_path, map_location='cpu')\n",
237
+ "INFO:timm.models._helpers:Loaded state_dict from checkpoint '/shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/ext/MiVOLO/models/mivolo_imbd.pth.tar'\n",
238
+ "INFO:MiVOLO:Model mivolo_d1_224 created, param count: 27432414\n",
239
+ "INFO:timm.data.config:Data processing configuration for current model + dataset:\n",
240
+ "INFO:timm.data.config:\tinput_size: (3, 224, 224)\n",
241
+ "INFO:timm.data.config:\tinterpolation: bicubic\n",
242
+ "INFO:timm.data.config:\tmean: (0.485, 0.456, 0.406)\n",
243
+ "INFO:timm.data.config:\tstd: (0.229, 0.224, 0.225)\n",
244
+ "INFO:timm.data.config:\tcrop_pct: 0.96\n",
245
+ "INFO:timm.data.config:\tcrop_mode: center\n"
246
+ ]
247
+ },
248
+ {
249
+ "name": "stdout",
250
+ "output_type": "stream",
251
+ "text": [
252
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2022-11.csv\n",
253
+ "\n",
254
+ "0: 640x640 (no detections), 723.9ms\n",
255
+ "Speed: 12.1ms preprocess, 723.9ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 640)\n",
256
+ "Processed and saved 1 images so far.\n",
257
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2022-11.csv\n",
258
+ "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2022-12.csv\n",
259
+ "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-01.csv\n",
260
+ "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-02.csv\n",
261
+ "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-03.csv\n",
262
+ "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-04.csv\n",
263
+ "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-05.csv\n",
264
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-06.csv\n",
265
+ "\n",
266
+ "0: 416x640 1 person, 455.1ms\n",
267
+ "Speed: 3.5ms preprocess, 455.1ms inference, 33.5ms postprocess per image at shape (1, 3, 416, 640)\n"
268
+ ]
269
+ },
270
+ {
271
+ "name": "stderr",
272
+ "output_type": "stream",
273
+ "text": [
274
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
275
+ "INFO:MiVOLO:\tage: 32.89\n",
276
+ "INFO:MiVOLO:\tgender: male [99%]\n"
277
+ ]
278
+ },
279
+ {
280
+ "name": "stdout",
281
+ "output_type": "stream",
282
+ "text": [
283
+ "Processed and saved 1 images so far.\n",
284
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-06.csv\n",
285
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-07.csv\n",
286
+ "\n",
287
+ "0: 640x320 1 person, 395.7ms\n",
288
+ "Speed: 2.9ms preprocess, 395.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 320)\n"
289
+ ]
290
+ },
291
+ {
292
+ "name": "stderr",
293
+ "output_type": "stream",
294
+ "text": [
295
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
296
+ "INFO:MiVOLO:\tage: 33.49\n",
297
+ "INFO:MiVOLO:\tgender: female [99%]\n"
298
+ ]
299
+ },
300
+ {
301
+ "name": "stdout",
302
+ "output_type": "stream",
303
+ "text": [
304
+ "Processed and saved 1 images so far.\n",
305
+ "\n",
306
+ "0: 640x448 1 person, 1 face, 478.5ms\n",
307
+ "Speed: 1.9ms preprocess, 478.5ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
308
+ ]
309
+ },
310
+ {
311
+ "name": "stderr",
312
+ "output_type": "stream",
313
+ "text": [
314
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
315
+ "INFO:MiVOLO:\tage: 17.81\n",
316
+ "INFO:MiVOLO:\tgender: female [99%]\n"
317
+ ]
318
+ },
319
+ {
320
+ "name": "stdout",
321
+ "output_type": "stream",
322
+ "text": [
323
+ "Processed and saved 2 images so far.\n",
324
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-07.csv\n",
325
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-08.csv\n",
326
+ "\n",
327
+ "0: 640x448 1 person, 478.0ms\n",
328
+ "Speed: 2.9ms preprocess, 478.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
329
+ ]
330
+ },
331
+ {
332
+ "name": "stderr",
333
+ "output_type": "stream",
334
+ "text": [
335
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
336
+ "INFO:MiVOLO:\tage: 40.62\n",
337
+ "INFO:MiVOLO:\tgender: male [99%]\n"
338
+ ]
339
+ },
340
+ {
341
+ "name": "stdout",
342
+ "output_type": "stream",
343
+ "text": [
344
+ "Processed and saved 1 images so far.\n",
345
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-08.csv\n",
346
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-09.csv\n",
347
+ "\n",
348
+ "0: 416x640 (no detections), 567.1ms\n",
349
+ "Speed: 2.4ms preprocess, 567.1ms inference, 0.4ms postprocess per image at shape (1, 3, 416, 640)\n",
350
+ "Processed and saved 1 images so far.\n",
351
+ "\n",
352
+ "0: 320x640 (no detections), 393.6ms\n",
353
+ "Speed: 1.7ms preprocess, 393.6ms inference, 0.4ms postprocess per image at shape (1, 3, 320, 640)\n",
354
+ "\n",
355
+ "0: 640x640 (no detections), 711.9ms\n",
356
+ "Speed: 3.4ms preprocess, 711.9ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
357
+ "\n",
358
+ "0: 640x640 (no detections), 699.8ms\n",
359
+ "Speed: 2.3ms preprocess, 699.8ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
360
+ "\n",
361
+ "0: 640x576 1 person, 629.6ms\n",
362
+ "Speed: 2.4ms preprocess, 629.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 576)\n"
363
+ ]
364
+ },
365
+ {
366
+ "name": "stderr",
367
+ "output_type": "stream",
368
+ "text": [
369
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
370
+ "INFO:MiVOLO:\tage: 28.65\n",
371
+ "INFO:MiVOLO:\tgender: female [99%]\n"
372
+ ]
373
+ },
374
+ {
375
+ "name": "stdout",
376
+ "output_type": "stream",
377
+ "text": [
378
+ "\n",
379
+ "0: 640x448 1 person, 1 face, 598.3ms\n",
380
+ "Speed: 2.1ms preprocess, 598.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
381
+ ]
382
+ },
383
+ {
384
+ "name": "stderr",
385
+ "output_type": "stream",
386
+ "text": [
387
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
388
+ "INFO:MiVOLO:\tage: 25.85\n",
389
+ "INFO:MiVOLO:\tgender: female [99%]\n",
390
+ "ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/98848c97-3d1e-4b52-9967-aeeca354a30e/width=656/98848c97-3d1e-4b52-9967-aeeca354a30e.mp4: cannot identify image file <_io.BytesIO object at 0x14cb00133740>\n"
391
+ ]
392
+ },
393
+ {
394
+ "name": "stdout",
395
+ "output_type": "stream",
396
+ "text": [
397
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-09.csv\n",
398
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-10.csv\n"
399
+ ]
400
+ },
401
+ {
402
+ "name": "stderr",
403
+ "output_type": "stream",
404
+ "text": [
405
+ "ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/e6469288-b487-4a06-99c1-59e7ac22fa77/width=1024/e6469288-b487-4a06-99c1-59e7ac22fa77.mp4: cannot identify image file <_io.BytesIO object at 0x14cb00ecbdd0>\n"
406
+ ]
407
+ },
408
+ {
409
+ "name": "stdout",
410
+ "output_type": "stream",
411
+ "text": [
412
+ "\n",
413
+ "0: 448x640 (no detections), 536.6ms\n",
414
+ "Speed: 10.1ms preprocess, 536.6ms inference, 0.4ms postprocess per image at shape (1, 3, 448, 640)\n",
415
+ "Processed and saved 2 images so far.\n",
416
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-10.csv\n",
417
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-11.csv\n",
418
+ "\n",
419
+ "0: 640x448 1 person, 1 face, 662.9ms\n",
420
+ "Speed: 2.6ms preprocess, 662.9ms inference, 0.9ms postprocess per image at shape (1, 3, 640, 448)\n"
421
+ ]
422
+ },
423
+ {
424
+ "name": "stderr",
425
+ "output_type": "stream",
426
+ "text": [
427
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
428
+ "INFO:MiVOLO:\tage: 17.0\n",
429
+ "INFO:MiVOLO:\tgender: female [99%]\n"
430
+ ]
431
+ },
432
+ {
433
+ "name": "stdout",
434
+ "output_type": "stream",
435
+ "text": [
436
+ "Processed and saved 1 images so far.\n",
437
+ "\n",
438
+ "0: 640x384 1 person, 895.9ms\n",
439
+ "Speed: 2.0ms preprocess, 895.9ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 384)\n"
440
+ ]
441
+ },
442
+ {
443
+ "name": "stderr",
444
+ "output_type": "stream",
445
+ "text": [
446
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
447
+ "INFO:MiVOLO:\tage: 43.33\n",
448
+ "INFO:MiVOLO:\tgender: male [99%]\n"
449
+ ]
450
+ },
451
+ {
452
+ "name": "stdout",
453
+ "output_type": "stream",
454
+ "text": [
455
+ "\n",
456
+ "0: 640x448 (no detections), 529.4ms\n",
457
+ "Speed: 2.6ms preprocess, 529.4ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
458
+ "\n",
459
+ "0: 640x448 1 person, 539.3ms\n",
460
+ "Speed: 2.8ms preprocess, 539.3ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
461
+ ]
462
+ },
463
+ {
464
+ "name": "stderr",
465
+ "output_type": "stream",
466
+ "text": [
467
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
468
+ "INFO:MiVOLO:\tage: 39.15\n",
469
+ "INFO:MiVOLO:\tgender: male [99%]\n"
470
+ ]
471
+ },
472
+ {
473
+ "name": "stdout",
474
+ "output_type": "stream",
475
+ "text": [
476
+ "\n",
477
+ "0: 640x448 1 person, 1 face, 708.6ms\n",
478
+ "Speed: 2.5ms preprocess, 708.6ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
479
+ ]
480
+ },
481
+ {
482
+ "name": "stderr",
483
+ "output_type": "stream",
484
+ "text": [
485
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
486
+ "INFO:MiVOLO:\tage: 29.64\n",
487
+ "INFO:MiVOLO:\tgender: female [99%]\n",
488
+ "ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/ad4b5481-0a58-4196-a3e8-fca2fe22a3cc/width=1080/ad4b5481-0a58-4196-a3e8-fca2fe22a3cc.mp4: cannot identify image file <_io.BytesIO object at 0x14cb010c24d0>\n"
489
+ ]
490
+ },
491
+ {
492
+ "name": "stdout",
493
+ "output_type": "stream",
494
+ "text": [
495
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-11.csv\n",
496
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-12.csv\n",
497
+ "\n",
498
+ "0: 640x384 1 person, 1 face, 461.0ms\n",
499
+ "Speed: 2.4ms preprocess, 461.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 384)\n"
500
+ ]
501
+ },
502
+ {
503
+ "name": "stderr",
504
+ "output_type": "stream",
505
+ "text": [
506
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
507
+ "INFO:MiVOLO:\tage: 19.61\n",
508
+ "INFO:MiVOLO:\tgender: female [99%]\n"
509
+ ]
510
+ },
511
+ {
512
+ "name": "stdout",
513
+ "output_type": "stream",
514
+ "text": [
515
+ "Processed and saved 1 images so far.\n",
516
+ "\n",
517
+ "0: 640x448 1 person, 1 face, 501.3ms\n",
518
+ "Speed: 3.1ms preprocess, 501.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
519
+ ]
520
+ },
521
+ {
522
+ "name": "stderr",
523
+ "output_type": "stream",
524
+ "text": [
525
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
526
+ "INFO:MiVOLO:\tage: 22.58\n",
527
+ "INFO:MiVOLO:\tgender: female [99%]\n",
528
+ "ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3004b5fa-af81-4de7-829d-1d809d70b878/width=512/3004b5fa-af81-4de7-829d-1d809d70b878.mp4: cannot identify image file <_io.BytesIO object at 0x14cb0020ad40>\n"
529
+ ]
530
+ },
531
+ {
532
+ "name": "stdout",
533
+ "output_type": "stream",
534
+ "text": [
535
+ "\n",
536
+ "0: 640x640 (no detections), 842.5ms\n",
537
+ "Speed: 4.5ms preprocess, 842.5ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 640)\n",
538
+ "\n",
539
+ "0: 640x416 (no detections), 446.8ms\n",
540
+ "Speed: 2.5ms preprocess, 446.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 416)\n",
541
+ "Processed and saved 5 images so far.\n",
542
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-12.csv\n",
543
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-01.csv\n",
544
+ "\n",
545
+ "0: 640x448 (no detections), 638.5ms\n",
546
+ "Speed: 2.3ms preprocess, 638.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
547
+ "Processed and saved 1 images so far.\n",
548
+ "\n",
549
+ "0: 640x416 (no detections), 441.7ms\n",
550
+ "Speed: 2.5ms preprocess, 441.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 416)\n",
551
+ "\n",
552
+ "0: 640x448 (no detections), 470.3ms\n",
553
+ "Speed: 2.3ms preprocess, 470.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
554
+ "\n",
555
+ "0: 640x448 (no detections), 693.9ms\n",
556
+ "Speed: 2.5ms preprocess, 693.9ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
557
+ "\n",
558
+ "0: 640x512 1 person, 1 face, 808.6ms\n",
559
+ "Speed: 3.2ms preprocess, 808.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
560
+ ]
561
+ },
562
+ {
563
+ "name": "stderr",
564
+ "output_type": "stream",
565
+ "text": [
566
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
567
+ "INFO:MiVOLO:\tage: 15.01\n",
568
+ "INFO:MiVOLO:\tgender: female [99%]\n"
569
+ ]
570
+ },
571
+ {
572
+ "name": "stdout",
573
+ "output_type": "stream",
574
+ "text": [
575
+ "\n",
576
+ "0: 640x320 1 person, 1 face, 345.6ms\n",
577
+ "Speed: 2.0ms preprocess, 345.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 320)\n"
578
+ ]
579
+ },
580
+ {
581
+ "name": "stderr",
582
+ "output_type": "stream",
583
+ "text": [
584
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
585
+ "INFO:MiVOLO:\tage: 20.86\n",
586
+ "INFO:MiVOLO:\tgender: female [99%]\n"
587
+ ]
588
+ },
589
+ {
590
+ "name": "stdout",
591
+ "output_type": "stream",
592
+ "text": [
593
+ "Processed and saved 6 images so far.\n",
594
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-01.csv\n",
595
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-02.csv\n",
596
+ "\n",
597
+ "0: 640x384 1 person, 1 face, 387.8ms\n",
598
+ "Speed: 1.9ms preprocess, 387.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 384)\n"
599
+ ]
600
+ },
601
+ {
602
+ "name": "stderr",
603
+ "output_type": "stream",
604
+ "text": [
605
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
606
+ "INFO:MiVOLO:\tage: 17.31\n",
607
+ "INFO:MiVOLO:\tgender: female [99%]\n"
608
+ ]
609
+ },
610
+ {
611
+ "name": "stdout",
612
+ "output_type": "stream",
613
+ "text": [
614
+ "Processed and saved 1 images so far.\n",
615
+ "\n",
616
+ "0: 640x480 1 person, 1 face, 540.4ms\n",
617
+ "Speed: 2.5ms preprocess, 540.4ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 480)\n"
618
+ ]
619
+ },
620
+ {
621
+ "name": "stderr",
622
+ "output_type": "stream",
623
+ "text": [
624
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
625
+ "INFO:MiVOLO:\tage: 17.47\n",
626
+ "INFO:MiVOLO:\tgender: female [99%]\n"
627
+ ]
628
+ },
629
+ {
630
+ "name": "stdout",
631
+ "output_type": "stream",
632
+ "text": [
633
+ "\n",
634
+ "0: 640x640 1 person, 1 face, 713.1ms\n",
635
+ "Speed: 3.8ms preprocess, 713.1ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 640)\n"
636
+ ]
637
+ },
638
+ {
639
+ "name": "stderr",
640
+ "output_type": "stream",
641
+ "text": [
642
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
643
+ "INFO:MiVOLO:\tage: 17.85\n",
644
+ "INFO:MiVOLO:\tgender: female [99%]\n"
645
+ ]
646
+ },
647
+ {
648
+ "name": "stdout",
649
+ "output_type": "stream",
650
+ "text": [
651
+ "\n",
652
+ "0: 640x640 (no detections), 778.8ms\n",
653
+ "Speed: 28.7ms preprocess, 778.8ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
654
+ "\n",
655
+ "0: 640x448 1 person, 1 face, 528.2ms\n",
656
+ "Speed: 2.3ms preprocess, 528.2ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
657
+ ]
658
+ },
659
+ {
660
+ "name": "stderr",
661
+ "output_type": "stream",
662
+ "text": [
663
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
664
+ "INFO:MiVOLO:\tage: 21.63\n",
665
+ "INFO:MiVOLO:\tgender: female [99%]\n"
666
+ ]
667
+ },
668
+ {
669
+ "name": "stdout",
670
+ "output_type": "stream",
671
+ "text": [
672
+ "\n",
673
+ "0: 640x448 1 person, 1 face, 518.4ms\n",
674
+ "Speed: 3.9ms preprocess, 518.4ms inference, 0.9ms postprocess per image at shape (1, 3, 640, 448)\n"
675
+ ]
676
+ },
677
+ {
678
+ "name": "stderr",
679
+ "output_type": "stream",
680
+ "text": [
681
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
682
+ "INFO:MiVOLO:\tage: 18.25\n",
683
+ "INFO:MiVOLO:\tgender: female [99%]\n"
684
+ ]
685
+ },
686
+ {
687
+ "name": "stdout",
688
+ "output_type": "stream",
689
+ "text": [
690
+ "\n",
691
+ "0: 640x448 1 person, 1 face, 470.7ms\n",
692
+ "Speed: 2.5ms preprocess, 470.7ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
693
+ ]
694
+ },
695
+ {
696
+ "name": "stderr",
697
+ "output_type": "stream",
698
+ "text": [
699
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
700
+ "INFO:MiVOLO:\tage: 20.51\n",
701
+ "INFO:MiVOLO:\tgender: female [99%]\n"
702
+ ]
703
+ },
704
+ {
705
+ "name": "stdout",
706
+ "output_type": "stream",
707
+ "text": [
708
+ "\n",
709
+ "0: 640x480 1 person, 1 face, 647.1ms\n",
710
+ "Speed: 2.4ms preprocess, 647.1ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 480)\n"
711
+ ]
712
+ },
713
+ {
714
+ "name": "stderr",
715
+ "output_type": "stream",
716
+ "text": [
717
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
718
+ "INFO:MiVOLO:\tage: 58.87\n",
719
+ "INFO:MiVOLO:\tgender: male [99%]\n"
720
+ ]
721
+ },
722
+ {
723
+ "name": "stdout",
724
+ "output_type": "stream",
725
+ "text": [
726
+ "\n",
727
+ "0: 640x448 (no detections), 469.8ms\n",
728
+ "Speed: 2.6ms preprocess, 469.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
729
+ "\n",
730
+ "0: 640x448 1 person, 1 face, 477.5ms\n",
731
+ "Speed: 2.3ms preprocess, 477.5ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
732
+ ]
733
+ },
734
+ {
735
+ "name": "stderr",
736
+ "output_type": "stream",
737
+ "text": [
738
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
739
+ "INFO:MiVOLO:\tage: 23.79\n",
740
+ "INFO:MiVOLO:\tgender: female [99%]\n"
741
+ ]
742
+ },
743
+ {
744
+ "name": "stdout",
745
+ "output_type": "stream",
746
+ "text": [
747
+ "Processed and saved 10 images so far.\n",
748
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-02.csv\n",
749
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-03.csv\n",
750
+ "\n",
751
+ "0: 640x448 1 face, 511.4ms\n",
752
+ "Speed: 2.5ms preprocess, 511.4ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
753
+ ]
754
+ },
755
+ {
756
+ "name": "stderr",
757
+ "output_type": "stream",
758
+ "text": [
759
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
760
+ "INFO:MiVOLO:\tage: 24.87\n",
761
+ "INFO:MiVOLO:\tgender: female [99%]\n"
762
+ ]
763
+ },
764
+ {
765
+ "name": "stdout",
766
+ "output_type": "stream",
767
+ "text": [
768
+ "Processed and saved 1 images so far.\n",
769
+ "\n",
770
+ "0: 640x544 (no detections), 576.5ms\n",
771
+ "Speed: 2.9ms preprocess, 576.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 544)\n",
772
+ "\n",
773
+ "0: 640x448 1 person, 1 face, 687.1ms\n",
774
+ "Speed: 9.9ms preprocess, 687.1ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
775
+ ]
776
+ },
777
+ {
778
+ "name": "stderr",
779
+ "output_type": "stream",
780
+ "text": [
781
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
782
+ "INFO:MiVOLO:\tage: 25.76\n",
783
+ "INFO:MiVOLO:\tgender: female [99%]\n"
784
+ ]
785
+ },
786
+ {
787
+ "name": "stdout",
788
+ "output_type": "stream",
789
+ "text": [
790
+ "\n",
791
+ "0: 640x448 (no detections), 498.3ms\n",
792
+ "Speed: 2.3ms preprocess, 498.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
793
+ "\n",
794
+ "0: 640x512 (no detections), 573.2ms\n",
795
+ "Speed: 3.0ms preprocess, 573.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
796
+ "Processed and saved 5 images so far.\n",
797
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-03.csv\n",
798
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-04.csv\n",
799
+ "\n",
800
+ "0: 640x384 (no detections), 518.2ms\n",
801
+ "Speed: 2.7ms preprocess, 518.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 384)\n",
802
+ "Processed and saved 1 images so far.\n",
803
+ "\n",
804
+ "0: 640x512 (no detections), 707.7ms\n",
805
+ "Speed: 3.6ms preprocess, 707.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
806
+ "\n",
807
+ "0: 640x416 (no detections), 453.7ms\n",
808
+ "Speed: 2.4ms preprocess, 453.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 416)\n",
809
+ "\n",
810
+ "0: 640x384 (no detections), 391.0ms\n",
811
+ "Speed: 2.0ms preprocess, 391.0ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 384)\n",
812
+ "\n",
813
+ "0: 640x448 1 person, 1 face, 449.8ms\n",
814
+ "Speed: 2.3ms preprocess, 449.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
815
+ ]
816
+ },
817
+ {
818
+ "name": "stderr",
819
+ "output_type": "stream",
820
+ "text": [
821
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
822
+ "INFO:MiVOLO:\tage: 22.39\n",
823
+ "INFO:MiVOLO:\tgender: female [99%]\n"
824
+ ]
825
+ },
826
+ {
827
+ "name": "stdout",
828
+ "output_type": "stream",
829
+ "text": [
830
+ "\n",
831
+ "0: 640x448 (no detections), 618.4ms\n",
832
+ "Speed: 2.3ms preprocess, 618.4ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
833
+ "\n",
834
+ "0: 640x448 1 person, 1 face, 631.0ms\n",
835
+ "Speed: 2.2ms preprocess, 631.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
836
+ ]
837
+ },
838
+ {
839
+ "name": "stderr",
840
+ "output_type": "stream",
841
+ "text": [
842
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
843
+ "INFO:MiVOLO:\tage: 24.05\n",
844
+ "INFO:MiVOLO:\tgender: female [99%]\n"
845
+ ]
846
+ },
847
+ {
848
+ "name": "stdout",
849
+ "output_type": "stream",
850
+ "text": [
851
+ "\n",
852
+ "0: 640x512 1 person, 496.4ms\n",
853
+ "Speed: 2.6ms preprocess, 496.4ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
854
+ ]
855
+ },
856
+ {
857
+ "name": "stderr",
858
+ "output_type": "stream",
859
+ "text": [
860
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
861
+ "INFO:MiVOLO:\tage: 22.81\n",
862
+ "INFO:MiVOLO:\tgender: male [99%]\n"
863
+ ]
864
+ },
865
+ {
866
+ "name": "stdout",
867
+ "output_type": "stream",
868
+ "text": [
869
+ "\n",
870
+ "0: 640x448 (no detections), 442.8ms\n",
871
+ "Speed: 2.3ms preprocess, 442.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
872
+ "\n",
873
+ "0: 640x448 1 person, 1 face, 477.7ms\n",
874
+ "Speed: 2.4ms preprocess, 477.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
875
+ ]
876
+ },
877
+ {
878
+ "name": "stderr",
879
+ "output_type": "stream",
880
+ "text": [
881
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
882
+ "INFO:MiVOLO:\tage: 21.62\n",
883
+ "INFO:MiVOLO:\tgender: female [99%]\n"
884
+ ]
885
+ },
886
+ {
887
+ "name": "stdout",
888
+ "output_type": "stream",
889
+ "text": [
890
+ "\n",
891
+ "0: 640x448 1 person, 1 face, 447.0ms\n",
892
+ "Speed: 2.2ms preprocess, 447.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
893
+ ]
894
+ },
895
+ {
896
+ "name": "stderr",
897
+ "output_type": "stream",
898
+ "text": [
899
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
900
+ "INFO:MiVOLO:\tage: 54.31\n",
901
+ "INFO:MiVOLO:\tgender: male [99%]\n"
902
+ ]
903
+ },
904
+ {
905
+ "name": "stdout",
906
+ "output_type": "stream",
907
+ "text": [
908
+ "\n",
909
+ "0: 640x640 (no detections), 819.0ms\n",
910
+ "Speed: 3.6ms preprocess, 819.0ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
911
+ "\n",
912
+ "0: 640x448 1 person, 1 face, 478.2ms\n",
913
+ "Speed: 1.8ms preprocess, 478.2ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
914
+ ]
915
+ },
916
+ {
917
+ "name": "stderr",
918
+ "output_type": "stream",
919
+ "text": [
920
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
921
+ "INFO:MiVOLO:\tage: 20.56\n",
922
+ "INFO:MiVOLO:\tgender: female [99%]\n"
923
+ ]
924
+ },
925
+ {
926
+ "name": "stdout",
927
+ "output_type": "stream",
928
+ "text": [
929
+ "\n",
930
+ "0: 640x448 1 person, 1 face, 471.2ms\n",
931
+ "Speed: 2.7ms preprocess, 471.2ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
932
+ ]
933
+ },
934
+ {
935
+ "name": "stderr",
936
+ "output_type": "stream",
937
+ "text": [
938
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
939
+ "INFO:MiVOLO:\tage: 21.31\n",
940
+ "INFO:MiVOLO:\tgender: female [99%]\n"
941
+ ]
942
+ },
943
+ {
944
+ "name": "stdout",
945
+ "output_type": "stream",
946
+ "text": [
947
+ "\n",
948
+ "0: 640x448 (no detections), 484.0ms\n",
949
+ "Speed: 2.2ms preprocess, 484.0ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
950
+ "\n",
951
+ "0: 640x640 (no detections), 832.6ms\n",
952
+ "Speed: 3.0ms preprocess, 832.6ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 640)\n",
953
+ "\n",
954
+ "0: 640x448 1 person, 1 face, 508.9ms\n",
955
+ "Speed: 2.5ms preprocess, 508.9ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
956
+ ]
957
+ },
958
+ {
959
+ "name": "stderr",
960
+ "output_type": "stream",
961
+ "text": [
962
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
963
+ "INFO:MiVOLO:\tage: 27.19\n",
964
+ "INFO:MiVOLO:\tgender: female [99%]\n",
965
+ "ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/6879c7b9-5cb3-42db-b409-30b4e2f71945/width=1080/6879c7b9-5cb3-42db-b409-30b4e2f71945.mp4: cannot identify image file <_io.BytesIO object at 0x14cb0020ad40>\n"
966
+ ]
967
+ },
968
+ {
969
+ "name": "stdout",
970
+ "output_type": "stream",
971
+ "text": [
972
+ "\n",
973
+ "0: 640x448 9 persons, 461.8ms\n",
974
+ "Speed: 2.4ms preprocess, 461.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
975
+ ]
976
+ },
977
+ {
978
+ "name": "stderr",
979
+ "output_type": "stream",
980
+ "text": [
981
+ "INFO:MiVOLO:faces_input: torch.Size([9, 3, 224, 224]), person_input: torch.Size([9, 3, 224, 224])\n",
982
+ "INFO:MiVOLO:\tage: 30.4\n",
983
+ "INFO:MiVOLO:\tgender: male [55%]\n",
984
+ "INFO:MiVOLO:\tage: 28.89\n",
985
+ "INFO:MiVOLO:\tgender: female [63%]\n",
986
+ "INFO:MiVOLO:\tage: 30.31\n",
987
+ "INFO:MiVOLO:\tgender: female [68%]\n",
988
+ "INFO:MiVOLO:\tage: 31.62\n",
989
+ "INFO:MiVOLO:\tgender: female [53%]\n",
990
+ "INFO:MiVOLO:\tage: 35.17\n",
991
+ "INFO:MiVOLO:\tgender: male [53%]\n",
992
+ "INFO:MiVOLO:\tage: 33.02\n",
993
+ "INFO:MiVOLO:\tgender: male [95%]\n",
994
+ "INFO:MiVOLO:\tage: 35.17\n",
995
+ "INFO:MiVOLO:\tgender: male [53%]\n",
996
+ "INFO:MiVOLO:\tage: 35.17\n",
997
+ "INFO:MiVOLO:\tgender: male [53%]\n",
998
+ "INFO:MiVOLO:\tage: 35.17\n",
999
+ "INFO:MiVOLO:\tgender: male [53%]\n"
1000
+ ]
1001
+ },
1002
+ {
1003
+ "name": "stdout",
1004
+ "output_type": "stream",
1005
+ "text": [
1006
+ "Processed and saved 19 images so far.\n",
1007
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-04.csv\n",
1008
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-05.csv\n",
1009
+ "\n",
1010
+ "0: 640x448 1 person, 455.5ms\n",
1011
+ "Speed: 2.2ms preprocess, 455.5ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
1012
+ ]
1013
+ },
1014
+ {
1015
+ "name": "stderr",
1016
+ "output_type": "stream",
1017
+ "text": [
1018
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1019
+ "INFO:MiVOLO:\tage: 37.57\n",
1020
+ "INFO:MiVOLO:\tgender: male [95%]\n"
1021
+ ]
1022
+ },
1023
+ {
1024
+ "name": "stdout",
1025
+ "output_type": "stream",
1026
+ "text": [
1027
+ "Processed and saved 1 images so far.\n",
1028
+ "\n",
1029
+ "0: 640x448 1 person, 1 face, 438.7ms\n",
1030
+ "Speed: 2.2ms preprocess, 438.7ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
1031
+ ]
1032
+ },
1033
+ {
1034
+ "name": "stderr",
1035
+ "output_type": "stream",
1036
+ "text": [
1037
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1038
+ "INFO:MiVOLO:\tage: 15.62\n",
1039
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1040
+ ]
1041
+ },
1042
+ {
1043
+ "name": "stdout",
1044
+ "output_type": "stream",
1045
+ "text": [
1046
+ "\n",
1047
+ "0: 640x448 (no detections), 444.8ms\n",
1048
+ "Speed: 2.3ms preprocess, 444.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n"
1049
+ ]
1050
+ },
1051
+ {
1052
+ "name": "stderr",
1053
+ "output_type": "stream",
1054
+ "text": [
1055
+ "ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/6032bd70-6d53-4007-9e89-e69d4748efb5/width=528/6032bd70-6d53-4007-9e89-e69d4748efb5.mp4: cannot identify image file <_io.BytesIO object at 0x14cb00ed0950>\n"
1056
+ ]
1057
+ },
1058
+ {
1059
+ "name": "stdout",
1060
+ "output_type": "stream",
1061
+ "text": [
1062
+ "\n",
1063
+ "0: 640x448 (no detections), 453.9ms\n",
1064
+ "Speed: 2.3ms preprocess, 453.9ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
1065
+ "\n",
1066
+ "0: 640x448 1 person, 1 face, 475.0ms\n",
1067
+ "Speed: 1.6ms preprocess, 475.0ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
1068
+ ]
1069
+ },
1070
+ {
1071
+ "name": "stderr",
1072
+ "output_type": "stream",
1073
+ "text": [
1074
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1075
+ "INFO:MiVOLO:\tage: 22.5\n",
1076
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1077
+ ]
1078
+ },
1079
+ {
1080
+ "name": "stdout",
1081
+ "output_type": "stream",
1082
+ "text": [
1083
+ "\n",
1084
+ "0: 640x448 1 person, 1 face, 447.6ms\n",
1085
+ "Speed: 2.5ms preprocess, 447.6ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
1086
+ ]
1087
+ },
1088
+ {
1089
+ "name": "stderr",
1090
+ "output_type": "stream",
1091
+ "text": [
1092
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1093
+ "INFO:MiVOLO:\tage: 23.46\n",
1094
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1095
+ ]
1096
+ },
1097
+ {
1098
+ "name": "stdout",
1099
+ "output_type": "stream",
1100
+ "text": [
1101
+ "\n",
1102
+ "0: 640x512 (no detections), 528.5ms\n",
1103
+ "Speed: 3.2ms preprocess, 528.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
1104
+ "\n",
1105
+ "0: 640x448 1 person, 1 face, 449.8ms\n",
1106
+ "Speed: 2.3ms preprocess, 449.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
1107
+ ]
1108
+ },
1109
+ {
1110
+ "name": "stderr",
1111
+ "output_type": "stream",
1112
+ "text": [
1113
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1114
+ "INFO:MiVOLO:\tage: 29.32\n",
1115
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1116
+ ]
1117
+ },
1118
+ {
1119
+ "name": "stdout",
1120
+ "output_type": "stream",
1121
+ "text": [
1122
+ "\n",
1123
+ "0: 640x448 1 person, 1 face, 617.7ms\n",
1124
+ "Speed: 2.4ms preprocess, 617.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
1125
+ ]
1126
+ },
1127
+ {
1128
+ "name": "stderr",
1129
+ "output_type": "stream",
1130
+ "text": [
1131
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1132
+ "INFO:MiVOLO:\tage: 21.32\n",
1133
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1134
+ ]
1135
+ },
1136
+ {
1137
+ "name": "stdout",
1138
+ "output_type": "stream",
1139
+ "text": [
1140
+ "\n",
1141
+ "0: 640x448 (no detections), 609.1ms\n",
1142
+ "Speed: 2.3ms preprocess, 609.1ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
1143
+ "\n",
1144
+ "0: 640x448 (no detections), 436.2ms\n",
1145
+ "Speed: 2.5ms preprocess, 436.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1146
+ "\n",
1147
+ "0: 640x512 1 person, 1 face, 585.6ms\n",
1148
+ "Speed: 3.1ms preprocess, 585.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
1149
+ ]
1150
+ },
1151
+ {
1152
+ "name": "stderr",
1153
+ "output_type": "stream",
1154
+ "text": [
1155
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1156
+ "INFO:MiVOLO:\tage: 20.5\n",
1157
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1158
+ ]
1159
+ },
1160
+ {
1161
+ "name": "stdout",
1162
+ "output_type": "stream",
1163
+ "text": [
1164
+ "\n",
1165
+ "0: 640x448 1 person, 457.3ms\n",
1166
+ "Speed: 2.1ms preprocess, 457.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
1167
+ ]
1168
+ },
1169
+ {
1170
+ "name": "stderr",
1171
+ "output_type": "stream",
1172
+ "text": [
1173
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1174
+ "INFO:MiVOLO:\tage: 25.19\n",
1175
+ "INFO:MiVOLO:\tgender: male [81%]\n"
1176
+ ]
1177
+ },
1178
+ {
1179
+ "name": "stdout",
1180
+ "output_type": "stream",
1181
+ "text": [
1182
+ "Processed and saved 14 images so far.\n",
1183
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-05.csv\n",
1184
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-06.csv\n",
1185
+ "\n",
1186
+ "0: 640x448 (no detections), 484.5ms\n",
1187
+ "Speed: 2.8ms preprocess, 484.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1188
+ "Processed and saved 1 images so far.\n",
1189
+ "\n",
1190
+ "0: 640x512 (no detections), 524.8ms\n",
1191
+ "Speed: 2.9ms preprocess, 524.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
1192
+ "\n",
1193
+ "0: 640x480 1 person, 478.0ms\n",
1194
+ "Speed: 2.6ms preprocess, 478.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 480)\n"
1195
+ ]
1196
+ },
1197
+ {
1198
+ "name": "stderr",
1199
+ "output_type": "stream",
1200
+ "text": [
1201
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1202
+ "INFO:MiVOLO:\tage: 39.4\n",
1203
+ "INFO:MiVOLO:\tgender: male [99%]\n"
1204
+ ]
1205
+ },
1206
+ {
1207
+ "name": "stdout",
1208
+ "output_type": "stream",
1209
+ "text": [
1210
+ "\n",
1211
+ "0: 640x512 1 person, 1 face, 539.8ms\n",
1212
+ "Speed: 2.6ms preprocess, 539.8ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 512)\n"
1213
+ ]
1214
+ },
1215
+ {
1216
+ "name": "stderr",
1217
+ "output_type": "stream",
1218
+ "text": [
1219
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1220
+ "INFO:MiVOLO:\tage: 21.33\n",
1221
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1222
+ ]
1223
+ },
1224
+ {
1225
+ "name": "stdout",
1226
+ "output_type": "stream",
1227
+ "text": [
1228
+ "\n",
1229
+ "0: 640x448 1 person, 2 faces, 446.7ms\n",
1230
+ "Speed: 2.4ms preprocess, 446.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
1231
+ ]
1232
+ },
1233
+ {
1234
+ "name": "stderr",
1235
+ "output_type": "stream",
1236
+ "text": [
1237
+ "INFO:MiVOLO:faces_input: torch.Size([2, 3, 224, 224]), person_input: torch.Size([2, 3, 224, 224])\n",
1238
+ "INFO:MiVOLO:\tage: 20.65\n",
1239
+ "INFO:MiVOLO:\tgender: female [99%]\n",
1240
+ "INFO:MiVOLO:\tage: 20.53\n",
1241
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1242
+ ]
1243
+ },
1244
+ {
1245
+ "name": "stdout",
1246
+ "output_type": "stream",
1247
+ "text": [
1248
+ "\n",
1249
+ "0: 640x640 1 person, 1 face, 655.0ms\n",
1250
+ "Speed: 3.3ms preprocess, 655.0ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 640)\n"
1251
+ ]
1252
+ },
1253
+ {
1254
+ "name": "stderr",
1255
+ "output_type": "stream",
1256
+ "text": [
1257
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1258
+ "INFO:MiVOLO:\tage: 26.34\n",
1259
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1260
+ ]
1261
+ },
1262
+ {
1263
+ "name": "stdout",
1264
+ "output_type": "stream",
1265
+ "text": [
1266
+ "\n",
1267
+ "0: 640x384 (no detections), 400.6ms\n",
1268
+ "Speed: 2.1ms preprocess, 400.6ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 384)\n",
1269
+ "\n",
1270
+ "0: 640x448 1 person, 587.9ms\n",
1271
+ "Speed: 2.2ms preprocess, 587.9ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
1272
+ ]
1273
+ },
1274
+ {
1275
+ "name": "stderr",
1276
+ "output_type": "stream",
1277
+ "text": [
1278
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1279
+ "INFO:MiVOLO:\tage: 30.4\n",
1280
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1281
+ ]
1282
+ },
1283
+ {
1284
+ "name": "stdout",
1285
+ "output_type": "stream",
1286
+ "text": [
1287
+ "\n",
1288
+ "0: 640x448 (no detections), 610.3ms\n",
1289
+ "Speed: 2.3ms preprocess, 610.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1290
+ "\n",
1291
+ "0: 640x448 (no detections), 453.6ms\n",
1292
+ "Speed: 2.3ms preprocess, 453.6ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1293
+ "\n",
1294
+ "0: 640x512 1 person, 1 face, 511.3ms\n",
1295
+ "Speed: 2.8ms preprocess, 511.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
1296
+ ]
1297
+ },
1298
+ {
1299
+ "name": "stderr",
1300
+ "output_type": "stream",
1301
+ "text": [
1302
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1303
+ "INFO:MiVOLO:\tage: 34.28\n",
1304
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1305
+ ]
1306
+ },
1307
+ {
1308
+ "name": "stdout",
1309
+ "output_type": "stream",
1310
+ "text": [
1311
+ "\n",
1312
+ "0: 640x448 (no detections), 441.2ms\n",
1313
+ "Speed: 2.3ms preprocess, 441.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1314
+ "\n",
1315
+ "0: 640x448 (no detections), 586.3ms\n",
1316
+ "Speed: 2.3ms preprocess, 586.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1317
+ "\n",
1318
+ "0: 640x448 (no detections), 437.5ms\n",
1319
+ "Speed: 2.4ms preprocess, 437.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1320
+ "\n",
1321
+ "0: 448x640 1 person, 1 face, 437.4ms\n",
1322
+ "Speed: 2.4ms preprocess, 437.4ms inference, 0.7ms postprocess per image at shape (1, 3, 448, 640)\n"
1323
+ ]
1324
+ },
1325
+ {
1326
+ "name": "stderr",
1327
+ "output_type": "stream",
1328
+ "text": [
1329
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1330
+ "INFO:MiVOLO:\tage: 22.81\n",
1331
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1332
+ ]
1333
+ },
1334
+ {
1335
+ "name": "stdout",
1336
+ "output_type": "stream",
1337
+ "text": [
1338
+ "\n",
1339
+ "0: 640x448 (no detections), 436.8ms\n",
1340
+ "Speed: 2.6ms preprocess, 436.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1341
+ "\n",
1342
+ "0: 448x640 (no detections), 433.0ms\n",
1343
+ "Speed: 1.9ms preprocess, 433.0ms inference, 0.4ms postprocess per image at shape (1, 3, 448, 640)\n",
1344
+ "\n",
1345
+ "0: 640x448 (no detections), 599.7ms\n",
1346
+ "Speed: 2.5ms preprocess, 599.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1347
+ "\n"
1348
+ ]
1349
+ }
1350
+ ],
1351
+ "source": [
1352
+ "# Set up logging\n",
1353
+ "detector_weights = current_dir.parent / 'ext/MiVOLO/models/yolov8x_person_face.pt'\n",
1354
+ "checkpoint = current_dir.parent / 'ext/MiVOLO/models/mivolo_imbd.pth.tar'\n",
1355
+ "\n",
1356
+ "_logger = logging.getLogger(\"inference\")\n",
1357
+ "logging.basicConfig(level=logging.INFO)\n",
1358
+ "\n",
1359
+ "# Placeholder configuration and predictor initialization for MiVOLO\n",
1360
+ "class Config:\n",
1361
+ " def __init__(self, detector_weights, checkpoint, device, with_persons=True, disable_faces=False, draw=False):\n",
1362
+ " self.detector_weights = detector_weights\n",
1363
+ " self.checkpoint = checkpoint\n",
1364
+ " self.device = device\n",
1365
+ " self.with_persons = with_persons\n",
1366
+ " self.disable_faces = disable_faces\n",
1367
+ " self.draw = draw\n",
1368
+ "\n",
1369
+ "\n",
1370
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
1371
+ "config = Config(detector_weights=detector_weights, checkpoint=checkpoint, device=device)\n",
1372
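+ "# Predictor (from MiVOLO) combines the YOLOv8 person/face detector with the age/gender estimator\n",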
+ "predictor = Predictor(config, verbose=True)\n",
1373
+ "\n",
1374
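+ "# Fetch an image by URL; returns None on network errors or non-image payloads (e.g. the .mp4 URLs in the logs above)\n",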
+ "def download_image(url):\n",
1375
+ " try:\n",
1376
+ " response = requests.get(url)\n",
1377
+ " response.raise_for_status()\n",
1378
+ " return Image.open(BytesIO(response.content)).convert(\"RGB\")\n",
1379
+ " except requests.RequestException as e:\n",
1380
+ " _logger.error(f\"Failed to download image from {url}: {e}\")\n",
1381
+ " return None\n",
1382
+ " except UnidentifiedImageError as e:\n",
1383
+ " _logger.error(f\"Unidentified image error for URL {url}: {e}\")\n",
1384
+ " return None\n",
1385
+ "\n",
1386
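+ "# Run detection on every image URL in the input frame, appending one row per detection (or a single N/A row) to output_file\n",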
+ "def process_images_with_progress(data, predictor, output_file, start_idx=0):\n",
1387
+ " results = []\n",
1388
+ " total_images = len(data)\n",
1389
+ "\n",
1390
+ " for idx, row in data.iterrows():\n",
1391
+ " if idx < start_idx:\n",
1392
+ " continue\n",
1393
+ "\n",
1394
+ " img_url = row[\"url\"]\n",
1395
+ " pil_image = download_image(img_url)\n",
1396
+ " if pil_image is None:\n",
1397
+ " continue\n",
1398
+ "\n",
1399
+ " np_image = np.array(pil_image)\n",
1400
+ " np_image = cv2.cvtColor(np_image, cv2.COLOR_RGB2BGR)\n",
1401
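+ " # recognize() runs person/face detection plus age/gender estimation; its second return value is unused here\n",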
+ " detected_objects, _ = predictor.recognize(np_image)\n",
1402
+ "\n",
1403
+ " row_result = row.to_dict() # Start with the original row's data\n",
1404
+ "\n",
1405
+ " if detected_objects and detected_objects.ages:\n",
1406
+ " for i in range(len(detected_objects.ages)):\n",
1407
+ " age = detected_objects.ages[i]\n",
1408
+ " gender = detected_objects.genders[i]\n",
1409
+ " gender_confidence = detected_objects.gender_scores[i]\n",
1410
+ "\n",
1411
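+ " # Keep the demographic read only when gender confidence clears the 0.83 threshold; otherwise record N/A\n",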
+ " if gender_confidence >= 0.83:\n",
1412
+ " detection = {\n",
1413
+ " \"detection_type\": 'face' if i in detected_objects.face_to_person_map else 'person',\n",
1414
+ " \"gender\": gender,\n",
1415
+ " \"gender_confidence\": gender_confidence,\n",
1416
+ " \"age\": age,\n",
1417
+ " \"n_persons\": detected_objects.n_persons,\n",
1418
+ " \"n_faces\": detected_objects.n_faces,\n",
1419
+ " \"detected\": True\n",
1420
+ " }\n",
1421
+ " else:\n",
1422
+ " detection = {\n",
1423
+ " \"detection_type\": \"N/A\",\n",
1424
+ " \"gender\": \"N/A\",\n",
1425
+ " \"gender_confidence\": 0,\n",
1426
+ " \"age\": 0,\n",
1427
+ " \"n_persons\": detected_objects.n_persons if detected_objects else 0,\n",
1428
+ " \"n_faces\": detected_objects.n_faces if detected_objects else 0,\n",
1429
+ " \"detected\": False\n",
1430
+ " }\n",
1431
+ "\n",
1432
+ " results.append({**row_result, **detection})\n",
1433
+ " else:\n",
1434
+ " detection = {\n",
1435
+ " \"detection_type\": \"N/A\",\n",
1436
+ " \"gender\": \"N/A\",\n",
1437
+ " \"gender_confidence\": 0,\n",
1438
+ " \"age\": 0,\n",
1439
+ " \"n_persons\": detected_objects.n_persons if detected_objects else 0,\n",
1440
+ " \"n_faces\": detected_objects.n_faces if detected_objects else 0,\n",
1441
+ " \"detected\": False\n",
1442
+ " }\n",
1443
+ " results.append({**row_result, **detection})\n",
1444
+ "\n",
1445
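+ " # Flush buffered rows to CSV every 100 images (and on the final row) so progress survives interruptions\n",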
+ " if idx % 100 == 0 or idx == total_images - 1:\n",
1446
+ " df = pd.DataFrame(results)\n",
1447
+ " if os.path.exists(output_file):\n",
1448
+ " df.to_csv(output_file, mode='a', header=False, index=False)\n",
1449
+ " else:\n",
1450
+ " df.to_csv(output_file, mode='w', header=True, index=False)\n",
1451
+ " results = []\n",
1452
+ " print(f\"Processed and saved {idx + 1} images so far.\")\n",
1453
+ "\n",
1454
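+ "# Yield 'YYYY-MM' strings from start through end, inclusive\n",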
+ "def generate_months(start, end):\n",
1455
+ " start_date = datetime.strptime(start, '%Y-%m')\n",
1456
+ " end_date = datetime.strptime(end, '%Y-%m')\n",
1457
+ " while start_date <= end_date:\n",
1458
+ " yield start_date.strftime('%Y-%m')\n",
1459
+ " start_date += relativedelta(months=1) # Increment by calendar months\n",
1460
+ "\n",
1461
+ "\n",
1462
+ "start_month = '2022-11'\n",
1463
+ "end_month = '2024-12'\n",
1464
+ "\n",
1465
+ "for month in generate_months(start_month, end_month):\n",
1466
+ " input_file_path = mivolo_in / f'Civiverse-{month}.csv'\n",
1467
+ " output_file_path = mivolo_out / f'{month}.csv'\n",
1468
+ "\n",
1469
+ " if input_file_path.exists():\n",
1470
+ " print(f\"Processing: {input_file_path}\")\n",
1471
+ "\n",
1472
+ " data = pd.read_csv(input_file_path)\n",
1473
+ " start_index = 0\n",
1474
+ " process_images_with_progress(data, predictor, output_file_path, start_idx=start_index)\n",
1475
+ "\n",
1476
+ " print(f\"Processed and saved to: {output_file_path}\")\n",
1477
+ " else:\n",
1478
+ " print(f\"File not found: {input_file_path}\")"
1479
+ ]
1480
+ },
1481
+ {
1482
+ "cell_type": "markdown",
1483
+ "id": "26aeeef7",
1484
+ "metadata": {},
1485
+ "source": [
1486
+ "## Visualization code"
1487
+ ]
1488
+ },
1489
+ {
1490
+ "cell_type": "code",
1491
+ "execution_count": null,
1492
+ "id": "88ec896a-bf9b-4cc6-a787-c1343f8acb41",
1493
+ "metadata": {},
1494
+ "outputs": [],
1495
+ "source": [
1496
+ "import matplotlib.pyplot as plt\n",
1497
+ "import matplotlib.patches as mpatches\n",
1498
+ "\n",
1499
+ "input_dir = current_dir.parent / 'data/CSV/image_subsets/MiVOLO-results/'\n",
1500
+ "plot_dir = current_dir.parent / 'plots/'\n",
1501
+ "\n",
1502
+ "\n",
1503
+ "all_data = pd.DataFrame()\n",
1504
+ "for file_path in input_dir.glob('*.csv'): # Reads all CSV files in the folder\n",
1505
+ " #print(f\"Loading: {file_path}\")\n",
1506
+ " data = pd.read_csv(file_path)\n",
1507
+ " all_data = pd.concat([all_data, data], ignore_index=True)\n",
1508
+ "\n",
1509
+ "# Filter rows where detection_type equals \"person\"\n",
1510
+ "person_data = all_data[all_data['detection_type'] == 'person']\n",
1511
+ "\n",
1512
+ "# Count unique images and categorize by persons detected\n",
1513
+ "n_images = all_data['id'].nunique()\n",
1514
+ "images_with_zero_persons = all_data[all_data['n_persons'] == 0]['id'].nunique()\n",
1515
+ "images_with_one_or_more_persons = all_data[all_data['n_persons'] > 0]['id'].nunique()\n",
1516
+ "\n",
1517
+ "n_persons_detected = person_data['id'].nunique() # Unique images with at least one detected person\n",
1518
+ "total_persons_detected = person_data.shape[0] # Total number of persons detected\n",
1519
+ "\n",
1520
+ "\n",
1521
+ "\n",
1522
+ "n_total_female = person_data[person_data['gender'] == 'female']['id'].nunique()\n",
1523
+ "n_total_male = person_data[person_data['gender'] == 'male']['id'].nunique()\n",
1524
+ "\n",
1525
+ "# Filter the data further for non-missing age and gender\n",
1526
+ "filtered_data = person_data.dropna(subset=['age', 'gender'])\n",
1527
+ "\n",
1528
+ "# Round ages for consistent plotting\n",
1529
+ "filtered_data['rounded_age'] = np.round(filtered_data['age'] * 4) / 4\n",
1530
+ "\n",
1531
+ "# Map browsingLevel to colors\n",
1532
+ "def get_browsing_color(browsing_level):\n",
1533
+ " color_mapping = {\n",
1534
+ " 1: 'silver',\n",
1535
+ " 2: 'rosybrown',\n",
1536
+ " 4: 'coral',\n",
1537
+ " 8: 'crimson',\n",
1538
+ " 16: 'blueviolet'\n",
1539
+ " }\n",
1540
+ " return color_mapping.get(browsing_level, 'black') # Default to black for unknown values\n",
1541
+ "\n",
1542
+ "filtered_data['color'] = filtered_data['browsingLevel'].apply(get_browsing_color)\n",
1543
+ "\n",
1544
+ "# Aggregate data for plotting\n",
1545
+ "aggregated_data = (\n",
1546
+ " filtered_data.groupby(['rounded_age', 'gender', 'color'])\n",
1547
+ " .size()\n",
1548
+ " .unstack(fill_value=0)\n",
1549
+ ")\n",
1550
+ "\n",
1551
+ "# Define NSFW colors\n",
1552
+ "nsfw_colors = ['blueviolet', 'crimson', 'coral', 'rosybrown', 'silver']\n",
1553
+ "\n",
1554
+ "# Plotting function\n",
1555
+ "def plot_gender_data(ax, data, gender_label):\n",
1556
+ " ages = data.index\n",
1557
+ " bottom = np.zeros(len(ages))\n",
1558
+ " \n",
1559
+ " for color in nsfw_colors:\n",
1560
+ " counts = data[color] if color in data.columns else np.zeros(len(ages))\n",
1561
+ " ax.bar(\n",
1562
+ " ages,\n",
1563
+ " counts,\n",
1564
+ " color=color,\n",
1565
+ " edgecolor=color,\n",
1566
+ " linewidth=1,\n",
1567
+ " width=0.2,\n",
1568
+ " bottom=bottom,\n",
1569
+ " alpha=0.5\n",
1570
+ " )\n",
1571
+ " bottom += counts\n",
1572
+ "\n",
1573
+ " x_min = 5\n",
1574
+ " x_max = filtered_data['rounded_age'].max()\n",
1575
+ " ax.set_xticks(np.arange(x_min, x_max + 0.5, 5))\n",
1576
+ " ax.set_xticklabels([f'{int(age)}' for age in np.arange(x_min, x_max + 0.5, 5)], fontsize=12, fontweight='bold')\n",
1577
+ " ax.set_xticks(np.arange(x_min, x_max + 0.5, 0.5), minor=True)\n",
1578
+ "\n",
1579
+ " y_min = 0\n",
1580
+ " y_max = bottom.max() + 100\n",
1581
+ " y_ticks = np.arange(y_min, y_max + 1, 100) # Fine-grained steps of 100\n",
1582
+ " ax.set_yticks(y_ticks)\n",
1583
+ " ax.set_yticklabels([str(int(y)) for y in y_ticks], fontsize=12, fontweight='bold')\n",
1584
+ "\n",
1585
+ " ax.grid(True, which='major', color='lightgrey', linestyle='-', linewidth=0.5)\n",
1586
+ " ax.grid(True, which='minor', color='lightgrey', linestyle=':', linewidth=0.5)\n",
1587
+ "\n",
1588
+ " ax.spines['top'].set_visible(False)\n",
1589
+ " ax.spines['right'].set_visible(False)\n",
1590
+ " ax.spines['left'].set_visible(False)\n",
1591
+ " ax.spines['bottom'].set_visible(False)\n",
1592
+ " \n",
1593
+ " ax.set_xlabel('Age', fontsize=12, fontweight='bold')\n",
1594
+ " if gender_label == 'Female':\n",
1595
+ " ax.set_ylabel('Number of Subjects', fontsize=14, fontweight='bold')\n",
1596
+ " ax.set_title(f'{gender_label} Read', fontsize=14, fontweight='bold')\n",
1597
+ "\n",
1598
+ "# Set up the subplots\n",
1599
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 6.5), sharey=True)\n",
1600
+ "\n",
1601
+ "plot_gender_data(axes[0], aggregated_data.xs('male', level='gender'), 'Male')\n",
1602
+ "plot_gender_data(axes[1], aggregated_data.xs('female', level='gender'), 'Female')\n",
1603
+ "\n",
1604
+ "legend_patches = [\n",
1605
+ " mpatches.Patch(facecolor='blueviolet', edgecolor='blueviolet', linewidth=2, label='Level 16: XXX'),\n",
1606
+ " mpatches.Patch(facecolor='crimson', edgecolor='crimson', linewidth=2, label='Level 8: X'),\n",
1607
+ " mpatches.Patch(facecolor='coral', edgecolor='coral', linewidth=2, label='Level 4: Mature'),\n",
1608
+ " mpatches.Patch(facecolor='rosybrown', edgecolor='rosybrown', linewidth=2, label='Level 2: Soft'),\n",
1609
+ " mpatches.Patch(facecolor='silver', edgecolor='silver', linewidth=2, label='Level 1: SFW'),\n",
1610
+ " mpatches.Patch(facecolor='none', edgecolor='none', label=f'n images: {n_images}', alpha=0),\n",
1611
+ " mpatches.Patch(facecolor='none', edgecolor='none', label=f'Total persons detected: {total_persons_detected}', alpha=0),\n",
1612
+ " mpatches.Patch(facecolor='none', edgecolor='none', label=f'Unique images containing persons: {n_persons_detected}', alpha=0),\n",
1613
+ "]\n",
1614
+ "\n",
1615
+ "axes[0].legend(handles=legend_patches, title=\"Browsing Levels\", loc='upper left', fontsize=12, title_fontsize=12, frameon=True)\n",
1616
+ "plt.savefig(f'{plot_dir}/mivolo.svg', format='svg', bbox_inches='tight')\n",
1617
+ "plt.tight_layout()\n",
1618
+ "\n",
1619
+ "# Count images with at least one person\n",
1620
+ "images_with_one_or_more_persons = all_data[all_data['n_persons'] > 0]['id'].nunique()\n",
1621
+ "\n",
1622
+ "# Count unique images in `person_data`\n",
1623
+ "n_persons_detected = person_data['id'].nunique()\n",
1624
+ "\n",
1625
+ "# Count total persons detected\n",
1626
+ "total_persons = person_data.shape[0] # This counts all detected persons\n",
1627
+ "\n",
1628
+ "# Display potential inconsistencies\n",
1629
+ "print(f\"Total images: {n_images}\")\n",
1630
+ "print(f\"Images with at least one person: {images_with_one_or_more_persons}\")\n",
1631
+ "print(f\"Unique images in `person_data`: {n_persons_detected}\")\n",
1632
+ "print(f\"Total number of persons detected: {total_persons}\")\n",
1633
+ "\n",
1634
+ "\n",
1635
+ "\n",
1636
+ "plt.show()"
1637
+ ]
1638
+ },
1639
+ {
1640
+ "cell_type": "markdown",
1641
+ "id": "42b2a557-b8f4-4d0f-8907-98e3012a1b34",
1642
+ "metadata": {
1643
+ "execution": {
1644
+ "iopub.execute_input": "2025-02-06T20:01:54.848400Z",
1645
+ "iopub.status.busy": "2025-02-06T20:01:54.847713Z",
1646
+ "iopub.status.idle": "2025-02-06T20:01:54.851533Z",
1647
+ "shell.execute_reply": "2025-02-06T20:01:54.851102Z",
1648
+ "shell.execute_reply.started": "2025-02-06T20:01:54.848376Z"
1649
+ }
1650
+ },
1651
+ "source": [
1652
+ "### Latex Table"
1653
+ ]
1654
+ },
1655
+ {
1656
+ "cell_type": "code",
1657
+ "execution_count": null,
1658
+ "id": "3e506c41-6497-4ece-99f3-73f09fe1129e",
1659
+ "metadata": {},
1660
+ "outputs": [],
1661
+ "source": [
1662
+ "import os\n",
1663
+ "import pandas as pd\n",
1664
+ "\n",
1665
+ "# Define the directory containing CSV files\n",
1666
+ "directory_path = current_dir.parent / 'data/CSV/image_subsets/MiVOLO-results/' # Update this path with the actual directory path\n",
1667
+ "\n",
1668
+ "# Prepare data for LaTeX table\n",
1669
+ "table_rows = []\n",
1670
+ "\n",
1671
+ "# Loop through each file in the directory\n",
1672
+ "for file_name in os.listdir(directory_path):\n",
1673
+ " if file_name.endswith('.csv'):\n",
1674
+ " file_path = os.path.join(directory_path, file_name)\n",
1675
+ " print(f\"Processing file: {file_name}\")\n",
1676
+ "\n",
1677
+ " # Load the data\n",
1678
+ " data = pd.read_csv(file_path)\n",
1679
+ "\n",
1680
+ " # Total images analyzed\n",
1681
+ " total_images = data['id'].nunique()\n",
1682
+ "\n",
1683
+ " # Count of images with no persons detected\n",
1684
+ " images_no_persons = data[data['n_persons'] == 0]['id'].nunique()\n",
1685
+ "\n",
1686
+ " # Total persons detected (only using \"person\" detection type)\n",
1687
+ " total_persons_count = data[data['detection_type'] == 'person'].shape[0]\n",
1688
+ "\n",
1689
+ " # Average age and standard deviation for male and female individuals\n",
1690
+ " male_age_stats = data[data['gender'] == 'male']['age'].agg(['mean', 'std']).fillna(0)\n",
1691
+ " female_age_stats = data[data['gender'] == 'female']['age'].agg(['mean', 'std']).fillna(0)\n",
1692
+ "\n",
1693
+ " # Count of female and male subjects\n",
1694
+ " female_images_count = data[data['gender'] == 'female']['id'].nunique()\n",
1695
+ " male_images_count = data[data['gender'] == 'male']['id'].nunique()\n",
1696
+ "\n",
1697
+ " # Female to male ratio\n",
1698
+ " female_to_male_ratio = female_images_count / male_images_count if male_images_count else None\n",
1699
+ "\n",
1700
+ " # Browsing level analysis for females\n",
1701
+ " female_browsing_level_1 = data[(data['gender'] == 'female') & (data['browsingLevel'] == 1)]['id'].nunique()\n",
1702
+ " female_browsing_level_2_16 = data[(data['gender'] == 'female') & (data['browsingLevel'].isin([2, 4, 8, 16]))]['id'].nunique()\n",
1703
+ " \n",
1704
+ " female_browsing_level_1_percentage = (female_browsing_level_1 / female_images_count * 100) if female_images_count else 0\n",
1705
+ " female_browsing_level_2_16_percentage = (female_browsing_level_2_16 / female_images_count * 100) if female_images_count else 0\n",
1706
+ "\n",
1707
+ " # Browsing level analysis for males\n",
1708
+ " male_browsing_level_1 = data[(data['gender'] == 'male') & (data['browsingLevel'] == 1)]['id'].nunique()\n",
1709
+ " male_browsing_level_2_16 = data[(data['gender'] == 'male') & (data['browsingLevel'].isin([2, 4, 8, 16]))]['id'].nunique()\n",
1710
+ "\n",
1711
+ " male_browsing_level_1_percentage = (male_browsing_level_1 / male_images_count * 100) if male_images_count else 0\n",
1712
+ " male_browsing_level_2_16_percentage = (male_browsing_level_2_16 / male_images_count * 100) if male_images_count else 0\n",
1713
+ "\n",
1714
+ " # Add row to table data\n",
1715
+ " table_rows.append([\n",
1716
+ " file_name.replace('.csv', ''), # Remove file extension\n",
1717
+ " total_images,\n",
1718
+ " total_persons_count,\n",
1719
+ " images_no_persons,\n",
1720
+ " f\"{female_browsing_level_1_percentage:.2f}\",\n",
1721
+ " f\"{female_browsing_level_2_16_percentage:.2f}\",\n",
1722
+ " f\"{male_browsing_level_1_percentage:.2f}\",\n",
1723
+ " f\"{male_browsing_level_2_16_percentage:.2f}\",\n",
1724
+ " f\"{female_to_male_ratio:.2f}\" if female_to_male_ratio is not None else \"N/A\",\n",
1725
+ " f\"{female_age_stats['mean']:.2f} ({female_age_stats['std']:.2f})\",\n",
1726
+ " f\"{male_age_stats['mean']:.2f} ({male_age_stats['std']:.2f})\"\n",
1727
+ " ])\n",
1728
+ "\n",
1729
+ "# Sort table rows by the filename (assumes filenames are formatted with sortable dates)\n",
1730
+ "table_rows = sorted(table_rows, key=lambda x: x[0])\n",
1731
+ "\n",
1732
+ "# Generate LaTeX table\n",
1733
+ "latex_table = r\"\"\"\n",
1734
+ "\\begin{table}[H]\n",
1735
+ "\\centering\n",
1736
+ "\\scriptsize\n",
1737
+ "\\renewcommand{\\arraystretch}{0.9}\n",
1738
+ "\\caption{Summary of Image Classification for 2023-2024}\n",
1739
+ "\\label{table:image_classification_2023_2024}\n",
1740
+ "\\begin{tabular}{lrrrrrrrrrr}\n",
1741
+ "\\toprule\n",
1742
+ "File Name & Total Images & Total Persons & No Persons & \\multicolumn{2}{c}{Female (\\%)} & \\multicolumn{2}{c}{Male (\\%)} & Female:Male & Female Age (Mean ± SD) & Male Age (Mean ± SD) \\\\\n",
1743
+ " & & & & L1 & L2-16 & L1 & L2-16 & & & \\\\\n",
1744
+ "\\midrule\n",
1745
+ "\"\"\"\n",
1746
+ "for row in table_rows:\n",
1747
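+ " # Each LaTeX table row ends with a double-backslash line break plus a real newline\n",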
+ " latex_table += \" & \".join(map(str, row)) + r\" \\\\\\\\\\n\"\n",
1748
+ "\n",
1749
+ "latex_table += r\"\"\"\n",
1750
+ "\\bottomrule\n",
1751
+ "\\end{tabular}\n",
1752
+ "\\vspace{1em}\n",
1753
+ "\\noindent\n",
1754
+ "\\textbf{Disclaimer:} \\(\\female\\) and \\(\\male\\) refer to female-read and male-read classifications as determined by the MiVOLO system's weights. \n",
1755
+ "We acknowledge the complexities of gender presentations and stress that these terms do not necessarily correspond to biological sex.\n",
1756
+ "\\end{table}\n",
1757
+ "\"\"\"\n",
1758
+ "\n",
1759
+ "# Output LaTeX table\n",
1760
+ "print(\"\\nGenerated LaTeX Table:\")\n",
1761
+ "print(latex_table)\n",
1762
+ "\n"
1763
+ ]
1764
+ },
1765
+ {
1766
+ "cell_type": "code",
1767
+ "execution_count": null,
1768
+ "id": "3ef428be-856b-4c4c-b1b0-a052d181d03b",
1769
+ "metadata": {},
1770
+ "outputs": [],
1771
+ "source": []
1772
+ }
1773
+ ],
1774
+ "metadata": {
1775
+ "kernelspec": {
1776
+ "display_name": "HORDE",
1777
+ "language": "python",
1778
+ "name": "horde"
1779
+ },
1780
+ "language_info": {
1781
+ "codemirror_mode": {
1782
+ "name": "ipython",
1783
+ "version": 3
1784
+ },
1785
+ "file_extension": ".py",
1786
+ "mimetype": "text/x-python",
1787
+ "name": "python",
1788
+ "nbconvert_exporter": "python",
1789
+ "pygments_lexer": "ipython3",
1790
+ "version": "3.12.4"
1791
+ }
1792
+ },
1793
+ "nbformat": 4,
1794
+ "nbformat_minor": 5
1795
+ }
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-1_Tag_occurences-checkpoint.ipynb ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "cells": [],
3
+ "metadata": {},
4
+ "nbformat": 4,
5
+ "nbformat_minor": 5
6
+ }
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Bloomz_query-checkpoint.ipynb ADDED
@@ -0,0 +1,370 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": null,
6
+ "id": "3b87c378-241e-41ab-be6e-84222594f22f",
7
+ "metadata": {},
8
+ "outputs": [],
9
+ "source": [
10
+ "import pandas as pd\n",
11
+ "import json\n",
12
+ "import time\n",
13
+ "import re\n",
14
+ "from pathlib import Path\n",
15
+ "from tqdm import tqdm\n",
16
+ "import torch\n",
17
+ "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
18
+ "\n",
19
+ "# Import is used for pd.notna() and pd.isna() checks\n",
20
+ "\n",
21
+ "current_dir = Path.cwd()\n",
22
+ "input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
23
+ "\n",
24
+ "# === CONFIGURATION ===\n",
25
+ "TEST_MODE = True\n",
26
+ "TEST_SIZE = 10\n",
27
+ "MAX_ROWS = 20000\n",
28
+ "SAVE_INTERVAL = 10\n",
29
+ "\n",
30
+ "# Model settings - BLOOMZ (BigScience - European consortium)\n",
31
+ "MODEL_NAME = \"bigscience/bloomz-7b1\" # Largest instruction-tuned BLOOM model\n",
32
+ "CACHE_DIR = current_dir.parent / \"data/models\"\n",
33
+ "CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
34
+ "\n",
35
+ "PROFESSION_CATEGORIES = [\n",
36
+ " \"actor\", \"adult performer\", \"singer/musician\", \"model\",\n",
37
+ " \"online personality\", \"public figure\", \"voice actor/ASMR\",\n",
38
+ " \"sports professional\", \"tv personality\"\n",
39
+ "]\n",
40
+ "\n",
41
+ "# === LOAD MODEL ===\n",
42
+ "print(f\"Loading model: {MODEL_NAME}\")\n",
43
+ "print(f\"Cache directory: {CACHE_DIR}\")\n",
44
+ "print(f\"This may take a while on first run (~14GB download)...\\n\")\n",
45
+ "\n",
46
+ "# Check GPU availability\n",
47
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
48
+ "print(f\"Device: {device}\")\n",
49
+ "\n",
50
+ "if device == \"cpu\":\n",
51
+ " print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
52
+ " print(\" Consider using a GPU or reducing model size.\")\n",
53
+ "\n",
54
+ "# Load tokenizer\n",
55
+ "print(\"Loading tokenizer...\")\n",
56
+ "try:\n",
57
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
58
+ " MODEL_NAME,\n",
59
+ " cache_dir=str(CACHE_DIR)\n",
60
+ " )\n",
61
+ " print(\"✅ Tokenizer loaded\")\n",
62
+ "except Exception as e:\n",
63
+ " print(f\"❌ Error loading tokenizer: {e}\")\n",
64
+ " raise\n",
65
+ "\n",
66
+ "# Ensure pad token is set\n",
67
+ "if tokenizer.pad_token is None:\n",
68
+ " tokenizer.pad_token = tokenizer.eos_token\n",
69
+ " print(f\"Set pad_token to eos_token: {tokenizer.eos_token}\")\n",
70
+ "\n",
71
+ "# Load model with optimizations\n",
72
+ "print(\"Loading model (this may take several minutes)...\")\n",
73
+ "try:\n",
74
+ " model = AutoModelForCausalLM.from_pretrained(\n",
75
+ " MODEL_NAME,\n",
76
+ " cache_dir=str(CACHE_DIR),\n",
77
+ " torch_dtype=torch.bfloat16, # Use BF16 for efficiency\n",
78
+ " device_map=\"auto\", # Automatically distribute across GPUs\n",
79
+ " low_cpu_mem_usage=True # Optimize memory usage\n",
80
+ " )\n",
81
+ " model.eval() # Set to evaluation mode\n",
82
+ " print(\"✅ Model loaded\")\n",
83
+ "except Exception as e:\n",
84
+ " print(f\"❌ Error loading model: {e}\")\n",
85
+ " raise\n",
86
+ "\n",
87
+ "# Check VRAM usage\n",
88
+ "if torch.cuda.is_available():\n",
89
+ " vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
90
+ " print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
91
+ "\n",
92
+ "# === LOAD DATA ===\n",
93
+ "df = pd.read_csv(input_file)\n",
94
+ "print(f\"Loaded {len(df)} rows\")\n",
95
+ "\n",
96
+ "if TEST_MODE:\n",
97
+ " print(f\"Running in TEST MODE with {TEST_SIZE} samples\")\n",
98
+ " df = df.head(TEST_SIZE).copy()\n",
99
+ "elif MAX_ROWS:\n",
100
+ " df = df.head(MAX_ROWS).copy()\n",
101
+ "\n",
102
+ "# === CREATE PROMPT (Exact DeepSeek style) ===\n",
103
+ "def create_prompt(row):\n",
104
+ " \"\"\"Create prompt.\"\"\"\n",
105
+ " name = row.get('real_name', row.get('name', ''))\n",
106
+ " if pd.isna(name):\n",
107
+ " name = row.get('name', '')\n",
108
+ " \n",
109
+ " # Gather hints exactly like DeepSeek version\n",
110
+ " hints = []\n",
111
+ " if pd.notna(row.get('likely_profession')):\n",
112
+ " hints.append(str(row['likely_profession']))\n",
113
+ " if pd.notna(row.get('likely_nationality')):\n",
114
+ " hints.append(str(row['likely_nationality']))\n",
115
+ " if pd.notna(row.get('likely_country')):\n",
116
+ " hints.append(str(row['likely_country']))\n",
117
+ " \n",
118
+ " # Add tags if we don't have enough hints\n",
119
+ " if len(hints) < 3:\n",
120
+ " for i in range(1, 8):\n",
121
+ " tag_col = f'tag_{i}'\n",
122
+ " if tag_col in row and pd.notna(row[tag_col]):\n",
123
+ " tag_val = str(row[tag_col])\n",
124
+ " if tag_val not in hints:\n",
125
+ " hints.append(tag_val)\n",
126
+ " if len(hints) >= 5:\n",
127
+ " break\n",
128
+ " \n",
129
+ " hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
130
+ " \n",
131
+ " return f\"\"\"Given '{name}' ({hint_text}), provide:\n",
132
+ "1. Full legal name (Western order if non-latin script)\n",
133
+ "2. Any stage names/aliases (comma separated)\n",
134
+ "3. Gender (Male/Female/Other/Unknown)\n",
135
+ "4. Top 3 most likely professions from ONLY these categories:\n",
136
+ " - actor\n",
137
+ " - adult performer\n",
138
+ " - singer/musician\n",
139
+ " - model\n",
140
+ " - online personality (includes streamers, cosplayers, influencers)\n",
141
+ " - public figure (includes politicians, activists, journalists, authors)\n",
142
+ " - voice actor/ASMR\n",
143
+ " - sports professional\n",
144
+ " - tv personality (includes hosts, presenters, reality TV)\n",
145
+ "\n",
146
+ "5. Primary country associated\n",
147
+ "\n",
148
+ "IMPORTANT:\n",
149
+ "- Choose professions ONLY from the 9 categories above\n",
150
+ "- Provide up to 3 professions, comma-separated, ordered by relevance\n",
151
+ "- Be SPECIFIC: choose the most accurate category for each role\n",
152
+ "- \"online personality\" includes: streamers, cosplayers, YouTubers, influencers, content creators\n",
153
+ "- Use 'Unknown' when uncertain or for fictional characters/places\n",
154
+ "- For multi-role people, list all relevant categories (e.g., \"actor, singer/musician, online personality\")\n",
155
+ "\n",
156
+ "Respond with exactly 5 numbered lines.\"\"\"\n",
157
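+ "\n",
+ "# Worked example (hypothetical row): with real_name='Jane Doe' and\n",
+ "# likely_profession='actor', the prompt opens with \"Given 'Jane Doe' (actor), provide:\"\n",
+ "# followed by the five numbered requests above.\n",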
+ "\n",
158
+ "df['prompt'] = df.apply(create_prompt, axis=1)\n",
159
+ "\n",
160
+ "# === QUERY BLOOMZ LOCAL ===\n",
161
+ "def query_bloomz_local(prompt: str) -> str:\n",
162
+ " \"\"\"Query BLOOMZ-7B1 locally via transformers, return raw response string.\"\"\"\n",
163
+ " try:\n",
164
+ " # BLOOMZ works better with instruction-response format\n",
165
+ " full_prompt = f\"\"\"Instruction: Extract key data on a person based on the name and hints.\n",
166
+ "You must respond with exactly 5 numbered lines in this format:\n",
167
+ "1. Full legal name\n",
168
+ "2. Stage names/aliases \n",
169
+ "3. Gender\n",
170
+ "4. Professions (comma-separated, choose ONLY from: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality)\n",
171
+ "5. Country\n",
172
+ "\n",
173
+ "{prompt}\n",
174
+ "\n",
175
+ "Response:\"\"\"\n",
176
+ " \n",
177
+ " inputs = tokenizer(\n",
178
+ " full_prompt, \n",
179
+ " return_tensors=\"pt\", \n",
180
+ " truncation=True,\n",
181
+ " max_length=2048\n",
182
+ " ).to(device)\n",
183
+ " \n",
184
+ " # Generate with adjusted parameters for BLOOMZ\n",
185
+ " with torch.no_grad():\n",
186
+ " outputs = model.generate(\n",
187
+ " **inputs,\n",
188
+ " max_new_tokens=256,\n",
189
+ " temperature=0.3, # Increased for more variability\n",
190
+ " do_sample=True,\n",
191
+ " top_p=0.9,\n",
192
+ " top_k=40,\n",
193
+ " repetition_penalty=1.1,\n",
194
+ " pad_token_id=tokenizer.eos_token_id, # Use EOS as pad token\n",
195
+ " eos_token_id=tokenizer.eos_token_id,\n",
196
+ " early_stopping=True\n",
197
+ " )\n",
198
+ " \n",
199
+ " # Decode the entire output to see what's happening\n",
200
+ " full_output = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
201
+ " \n",
202
+ " # Extract only the generated part (after the prompt)\n",
203
+ " generated_text = full_output[len(tokenizer.decode(inputs['input_ids'][0], skip_special_tokens=True)):]\n",
204
+ " \n",
205
+ " # Debug output\n",
206
+ " if not hasattr(query_bloomz_local, 'debug_count'):\n",
207
+ " query_bloomz_local.debug_count = 0\n",
208
+ " \n",
209
+ " if query_bloomz_local.debug_count < 3:\n",
210
+ " print(f\"\\n📝 BLOOMZ Debug #{query_bloomz_local.debug_count + 1}:\")\n",
211
+ " print(f\"Prompt: {full_prompt[:200]}...\")\n",
212
+ " print(f\"Full output: {full_output[:500]}...\")\n",
213
+ " print(f\"Generated text: {generated_text}\")\n",
214
+ " print(f\"{'='*60}\\n\")\n",
215
+ " query_bloomz_local.debug_count += 1\n",
216
+ " \n",
217
+ " return generated_text.strip()\n",
218
+ " \n",
219
+ " except Exception as e:\n",
220
+ " print(f\"Error querying BLOOMZ: {e}\")\n",
221
+ " return None\n",
222
+ "\n",
223
+ "# === PARSE RESPONSE (Exact DeepSeek format) ===\n",
224
+ "def parse_response(response):\n",
225
+ " \"\"\"Parse numbered response into structured fields.\"\"\"\n",
226
+ " if not response:\n",
227
+ " return {\n",
228
+ " 'full_name': 'Unknown',\n",
229
+ " 'aliases': 'Unknown',\n",
230
+ " 'gender': 'Unknown',\n",
231
+ " 'profession_llm': 'Unknown',\n",
232
+ " 'country': 'Unknown'\n",
233
+ " }\n",
234
+ " \n",
235
+ " # Split into lines and clean\n",
236
+ " lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
237
+ " \n",
238
+ " # Initialize with Unknown values\n",
239
+ " fields = {\n",
240
+ " 'full_name': 'Unknown',\n",
241
+ " 'aliases': 'Unknown',\n",
242
+ " 'gender': 'Unknown',\n",
243
+ " 'profession_llm': 'Unknown',\n",
244
+ " 'country': 'Unknown'\n",
245
+ " }\n",
246
+ " \n",
247
+ " # Extract information from each numbered line\n",
248
+ " for line in lines:\n",
249
+ " if line.startswith('1.') or line.startswith('1)'):\n",
250
+ " fields['full_name'] = line[2:].strip()\n",
251
+ " elif line.startswith('2.') or line.startswith('2)'):\n",
252
+ " fields['aliases'] = line[2:].strip()\n",
253
+ " elif line.startswith('3.') or line.startswith('3)'):\n",
254
+ " fields['gender'] = line[2:].strip()\n",
255
+ " elif line.startswith('4.') or line.startswith('4)'):\n",
256
+ " fields['profession_llm'] = line[2:].strip()\n",
257
+ " elif line.startswith('5.') or line.startswith('5)'):\n",
258
+ " fields['country'] = line[2:].strip()\n",
259
+ " \n",
260
+ " return fields\n",
261
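+ "\n",
+ "# Worked example (hypothetical input):\n",
+ "# parse_response(\"1. Jane Doe\\n2. None\\n3. Female\\n4. actor\\n5. Japan\")\n",
+ "# -> {'full_name': 'Jane Doe', 'aliases': 'None', 'gender': 'Female',\n",
+ "# 'profession_llm': 'actor', 'country': 'Japan'}\n",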
+ "\n",
262
+ "# === PROCESS ===\n",
263
+ "output_file = current_dir.parent / f\"data/CSV/bloomz_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
264
+ "index_file = current_dir.parent / \"misc/bloomz_query_index.txt\"\n",
265
+ "\n",
266
+ "current_index = 0\n",
267
+ "if index_file.exists():\n",
268
+ " with open(index_file) as f:\n",
269
+ " current_index = int(f.read().strip())\n",
270
+ " print(f\"Resuming from index {current_index}\")\n",
271
+ "\n",
272
+ "# Initialize columns (same as DeepSeek)\n",
273
+ "for col in ['full_name', 'gender', 'profession_llm', 'country', 'aliases']:\n",
274
+ " if col not in df.columns:\n",
275
+ " df[col] = 'Unknown'\n",
276
+ "\n",
277
+ "# Create prompts for all rows (same as DeepSeek)\n",
278
+ "print(\"Creating prompts...\")\n",
279
+ "df['prompt'] = df.apply(create_prompt, axis=1)\n",
280
+ "\n",
281
+ "print(f\"\\nAnnotating with BLOOMZ-7B1 LOCAL - rows {current_index} to {len(df)}...\")\n",
282
+ "print(f\"Model: {MODEL_NAME}\")\n",
283
+ "print(f\"This may take a while...\\n\")\n",
284
+ "\n",
285
+ "try:\n",
286
+ " start_time = time.time()\n",
287
+ " \n",
288
+ " for i in tqdm(range(current_index, len(df)), desc=\"Annotating\"):\n",
289
+ " row = df.iloc[i]\n",
290
+ " \n",
291
+ " # Query BLOOMZ (equivalent to DeepSeek query)\n",
292
+ " response = query_bloomz_local(row['prompt'])\n",
293
+ " parsed_data = parse_response(response)\n",
294
+ " \n",
295
+ " # Update dataframe\n",
296
+ " for key, value in parsed_data.items():\n",
297
+ " df.at[i, key] = value\n",
298
+ " \n",
299
+ " current_index = i + 1\n",
300
+ " \n",
301
+ " # Save progress at intervals\n",
302
+ " if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
303
+ " df.to_csv(output_file, index=False)\n",
304
+ " with open(index_file, 'w') as f:\n",
305
+ " f.write(str(current_index))\n",
306
+ " print(f\"✅ Progress saved after {i+1} rows\")\n",
307
+ " \n",
308
+ " # Optional: Add small delay to prevent overheating (not needed for rate limiting like DeepSeek)\n",
309
+ " # time.sleep(0.1)\n",
310
+ " \n",
311
+ " elapsed_total = time.time() - start_time\n",
312
+ " print(f\"\\n✅ Done! Final results saved to {output_file}\")\n",
313
+ " \n",
314
+ " # Summary statistics (same as DeepSeek)\n",
315
+ " print(\"\\n=== Summary Statistics ===\")\n",
316
+ " print(f\"Total processed: {len(df)}\")\n",
317
+ " print(f\"\\nGender distribution:\")\n",
318
+ " print(df['gender'].value_counts())\n",
319
+ " print(f\"\\nTop 10 profession combinations:\")\n",
320
+ " print(df['profession_llm'].value_counts().head(10))\n",
321
+ " print(f\"\\nTop 10 countries:\")\n",
322
+ " print(df['country'].value_counts().head(10))\n",
323
+ " \n",
324
+ " # Sample results\n",
325
+ " print(\"\\n=== Sample Results ===\")\n",
326
+ " display_cols = ['real_name', 'full_name', 'gender', 'profession_llm', 'country']\n",
327
+ " available_cols = [col for col in display_cols if col in df.columns]\n",
328
+ " print(df[available_cols].head(10).to_string(index=False))\n",
329
+ " \n",
330
+ " # Additional info for local model\n",
331
+ " print(f\"\\nTotal time: {elapsed_total/60:.1f} minutes\")\n",
332
+ " print(f\"Average speed: {len(df)/(elapsed_total/3600):.1f} samples/hour\")\n",
333
+ " if torch.cuda.is_available():\n",
334
+ " print(f\"Final VRAM usage: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB\")\n",
335
+ "\n",
336
+ "except Exception as e:\n",
337
+ " print(f\"⚠️ Error encountered: {e}\")\n",
338
+ " print(f\"⚠️ Last processed index: {current_index}\")\n",
339
+ " \n",
340
+ " # Save progress before exiting\n",
341
+ " df.to_csv(output_file, index=False)\n",
342
+ " with open(index_file, 'w') as f:\n",
343
+ " f.write(str(current_index))\n",
344
+ " \n",
345
+ " print(f\"⚠️ Progress saved up to row {current_index}\")"
346
+ ]
347
+ }
348
+ ],
349
+ "metadata": {
350
+ "kernelspec": {
351
+ "display_name": "pm-paper",
352
+ "language": "python",
353
+ "name": "pm-paper"
354
+ },
355
+ "language_info": {
356
+ "codemirror_mode": {
357
+ "name": "ipython",
358
+ "version": 3
359
+ },
360
+ "file_extension": ".py",
361
+ "mimetype": "text/x-python",
362
+ "name": "python",
363
+ "nbconvert_exporter": "python",
364
+ "pygments_lexer": "ipython3",
365
+ "version": "3.11.13"
366
+ }
367
+ },
368
+ "nbformat": 4,
369
+ "nbformat_minor": 5
370
+ }
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_Step_1_LLM_annotation-checkpoint.ipynb ADDED
@@ -0,0 +1,1941 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "23d0ae58",
6
+ "metadata": {},
7
+ "source": [
8
+ "# Deepfake Adapter Dataset - LLM Annotation Pipeline"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "markdown",
13
+ "id": "e4407358",
14
+ "metadata": {},
15
+ "source": [
16
+ "### Unified Model Loading & Inference\n",
17
+ "Code for querying Mistral, Gemma, and Qwen models."
18
+ ]
19
+ },
20
+ {
21
+ "cell_type": "markdown",
22
+ "id": "1a1b9d0e",
23
+ "metadata": {},
24
+ "source": [
25
+ "## CLEANING & PREPROCESSING"
26
+ ]
27
+ },
28
+ {
29
+ "cell_type": "markdown",
30
+ "id": "3df42c46",
31
+ "metadata": {},
32
+ "source": [
33
+ "#### Named Entity Recognitition (NER) using SpaCy "
34
+ ]
35
+ },
36
+ {
37
+ "cell_type": "code",
38
+ "execution_count": null,
39
+ "id": "a287eef4",
40
+ "metadata": {},
41
+ "outputs": [],
42
+ "source": [
43
+ "import pandas as pd\n",
44
+ "import re\n",
45
+ "from pathlib import Path\n",
46
+ "import emoji\n",
47
+ "import spacy\n",
48
+ "\n",
49
+ "# Load spaCy model\n",
50
+ "# You may need to download it first: python -m spacy download en_core_web_sm\n",
51
+ "try:\n",
52
+ " nlp = spacy.load(\"en_core_web_sm\")\n",
53
+ " print(\"✅ spaCy model loaded: en_core_web_sm\")\n",
54
+ "except OSError:\n",
55
+ " print(\"❌ spaCy model not found. Downloading...\")\n",
56
+ " import subprocess\n",
57
+ " subprocess.run([\"python\", \"-m\", \"spacy\", \"download\", \"en_core_web_sm\"])\n",
58
+ " nlp = spacy.load(\"en_core_web_sm\")\n",
59
+ " print(\"✅ spaCy model downloaded and loaded\")\n",
60
+ "\n",
61
+ "# Set up paths\n",
62
+ "current_dir = Path.cwd()\n",
63
+ "#input_file = current_dir.parent / \"data/CSV/real_person_adapters.csv\"\n",
64
+ "input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter.csv\"\n",
65
+ "\n",
66
+ "# Load dataset\n",
67
+ "df = pd.read_csv(input_file)\n",
68
+ "print(f\"Loaded {len(df)} rows\")\n",
69
+ "\n",
70
+ "def translate_leetspeak(text: str) -> str:\n",
71
+ " \"\"\"\n",
72
+ " Translate common leetspeak patterns to normal letters.\n",
73
+ " Examples: 4kira -> Akira, 3mma -> Emma, 1rene -> Irene\n",
74
+ " \"\"\"\n",
75
+ " if not text:\n",
76
+ " return text\n",
77
+ " \n",
78
+ " # Common leetspeak mappings (order matters!)\n",
79
+ " leetspeak_map = {\n",
80
+ " '4': 'a',\n",
81
+ " '3': 'e', \n",
82
+ " '1': 'i',\n",
83
+ " '0': 'o',\n",
84
+ " '7': 't',\n",
85
+ " '5': 's',\n",
86
+ " '8': 'b',\n",
87
+ " '9': 'g',\n",
88
+ " '@': 'a',\n",
89
+ " '$': 's',\n",
90
+ " '!': 'i',\n",
91
+ " }\n",
92
+ " \n",
93
+ " result = text\n",
94
+ " # Apply mappings at word boundaries or start of string\n",
95
+ " for leet, normal in leetspeak_map.items():\n",
96
+ " # Replace at start of word\n",
97
+ " result = re.sub(rf'\\b{re.escape(leet)}', normal, result, flags=re.IGNORECASE)\n",
98
+ " # Replace standalone numbers that look like letters in context\n",
99
+ " result = re.sub(rf'(?<=[a-z]){re.escape(leet)}(?=[a-z])', normal, result, flags=re.IGNORECASE)\n",
100
+ " \n",
101
+ " return result\n",
102
+ "\n",
103
+ "def preprocess_for_ner(name: str) -> str:\n",
104
+ " \"\"\"\n",
105
+ " Preprocess the name before spaCy NER.\n",
106
+ " Remove noise but keep the actual name parts.\n",
107
+ " \"\"\"\n",
108
+ " if pd.isna(name):\n",
109
+ " return \"\"\n",
110
+ " \n",
111
+ " name = str(name)\n",
112
+ " \n",
113
+ " # FIRST: Translate leetspeak\n",
114
+ " name = translate_leetspeak(name)\n",
115
+ " \n",
116
+ " # Remove emoji\n",
117
+ " name = emoji.replace_emoji(name, replace=' ')\n",
118
+ " \n",
119
+ " # Remove version indicators (v1, v2, v1.0, etc.)\n",
120
+ " name = re.sub(r'\\s*[vV]\\d+(\\.\\d+)?\\s*', ' ', name)\n",
121
+ " \n",
122
+ " # Remove LoRA-related terms (case insensitive)\n",
123
+ " lora_terms = ['lora', 'loha', 'lycoris', 'controlnet', 'textual inversion', \n",
124
+ " 'embedding', 'ti', 'checkpoint', 'model', 'adapter', 'pony', 'sdxl', 'flux', 'illustrious', 'sd14', 'sd14', 'sd2', 'sd3', 'diffusion', 'stable', 'hunyuan']\n",
125
+ " for term in lora_terms:\n",
126
+ " name = re.sub(rf'\\b{term}\\b', '', name, flags=re.IGNORECASE)\n",
127
+ " \n",
128
+ " # Remove content in parentheses or brackets (often metadata)\n",
129
+ " name = re.sub(r'\\([^)]*\\)', '', name)\n",
130
+ " name = re.sub(r'\\[[^\\]]*\\]', '', name)\n",
131
+ " \n",
132
+ " # Remove special characters like 「」\n",
133
+ " name = re.sub(r'[「」『』【】〈〉《》]', '', name)\n",
134
+ " \n",
135
+ " # Handle pipe - keep first part\n",
136
+ " if '|' in name:\n",
137
+ " name = name.split('|')[0]\n",
138
+ " \n",
139
+ " # Handle forward slash - keep first part\n",
140
+ " if '/' in name:\n",
141
+ " name = name.split('/')[0]\n",
142
+ " \n",
143
+ " # Replace underscores with spaces\n",
144
+ " name = name.replace('_', ' ')\n",
145
+ " \n",
146
+ " # Remove multiple spaces\n",
147
+ " name = re.sub(r'\\s+', ' ', name)\n",
148
+ " \n",
149
+ " # Strip\n",
150
+ " name = name.strip()\n",
151
+ " \n",
152
+ " return name\n",
153
+ "\n",
154
+ "def extract_person_name(text: str) -> str:\n",
155
+ " \"\"\"\n",
156
+ " Use spaCy NER to extract person names from text.\n",
157
+ " Falls back to cleaned text if no PERSON entity found.\n",
158
+ " \"\"\"\n",
159
+ " if not text:\n",
160
+ " return \"\"\n",
161
+ " \n",
162
+ " # Run spaCy NER\n",
163
+ " doc = nlp(text)\n",
164
+ " \n",
165
+ " # Extract PERSON entities\n",
166
+ " person_entities = [ent.text for ent in doc.ents if ent.label_ == \"PERSON\"]\n",
167
+ " \n",
168
+ " if person_entities:\n",
169
+ " # Return the first (usually longest/best) person name\n",
170
+ " return person_entities[0].strip()\n",
171
+ " \n",
172
+ " # If no PERSON entity found, try to extract capitalized words (likely names)\n",
173
+ " # This helps with names spaCy might miss\n",
174
+ " words = text.split()\n",
175
+ " capitalized_words = [w for w in words if w and w[0].isupper() and len(w) > 1]\n",
176
+ " \n",
177
+ " if capitalized_words:\n",
178
+ " # Join first few capitalized words (likely the name)\n",
179
+ " return ' '.join(capitalized_words[:3]).strip()\n",
180
+ " \n",
181
+ " # Last resort: return cleaned text\n",
182
+ " return text.strip()\n",
183
+ "\n",
184
+ "def clean_name_with_spacy(name: str) -> str:\n",
185
+ " \"\"\"\n",
186
+ " Complete name cleaning pipeline with spaCy NER.\n",
187
+ " \n",
188
+ " Pipeline:\n",
189
+ " 1. Translate leetspeak (4→a, 3→e, 1→i, etc.)\n",
190
+ " 2. Remove noise (emoji, version tags, LoRA terms)\n",
191
+ " 3. Use spaCy to extract PERSON entities\n",
192
+ " 4. Fallback to capitalized words or cleaned text\n",
193
+ " \"\"\"\n",
194
+ " # Step 1 & 2: Preprocess (leetspeak + noise removal)\n",
195
+ " preprocessed = preprocess_for_ner(name)\n",
196
+ " \n",
197
+ " if not preprocessed:\n",
198
+ " return \"\"\n",
199
+ " \n",
200
+ " # Step 3: Extract person name using spaCy NER\n",
201
+ " person_name = extract_person_name(preprocessed)\n",
202
+ " \n",
203
+ " return person_name\n",
204
+ "\n",
205
+ "# Apply name cleaning with spaCy\n",
206
+ "print(\"\\n🔄 Processing names with spaCy NER...\")\n",
207
+ "df['real_name'] = df['name'].apply(clean_name_with_spacy)\n",
208
+ "\n",
209
+ "# Show examples with detailed comparison\n",
210
+ "print(\"\\n📊 Name cleaning examples (with spaCy NER):\")\n",
211
+ "print(\"=\" * 100)\n",
212
+ "print(f\"{'Original Name':<50} | {'Cleaned Name':<30}\")\n",
213
+ "print(\"=\" * 100)\n",
214
+ "\n",
215
+ "examples = df[['name', 'real_name']].head(30)\n",
216
+ "shown = 0\n",
217
+ "for idx, row in examples.iterrows():\n",
218
+ " if row['name'] != row['real_name'] and shown < 20:\n",
219
+ " print(f\"{row['name']:<50} | {row['real_name']:<30}\")\n",
220
+ " shown += 1\n",
221
+ "\n",
222
+ "print(\"=\" * 100)\n",
223
+ "\n",
224
+ "# Show specific test cases\n",
225
+ "print(\"\\n🧪 Leetspeak translation examples:\")\n",
226
+ "test_names = ['4kira LoRA', '3mma Watson v2', '1rene LORA', 'L3vi Ackerman']\n",
227
+ "for test in test_names:\n",
228
+ " result = clean_name_with_spacy(test)\n",
229
+ " print(f\" {test:<30} -> {result}\")\n",
230
+ "\n",
231
+ "# Statistics\n",
232
+ "print(f\"\\n📈 Statistics:\")\n",
233
+ "print(f\" Total rows: {len(df)}\")\n",
234
+ "print(f\" Non-empty names: {(df['real_name'] != '').sum()}\")\n",
235
+ "print(f\" Empty names: {(df['real_name'] == '').sum()}\")\n",
236
+ "\n",
237
+ "# Show some examples of what spaCy identified\n",
238
+ "print(\"\\n🎯 Sample spaCy NER results:\")\n",
239
+ "sample_names = df['real_name'].head(20).tolist()\n",
240
+ "for i, name in enumerate(sample_names[:10], 1):\n",
241
+ " if name:\n",
242
+ " print(f\" {i}. {name}\")\n",
243
+ "\n",
244
+ "print(f\"\\n✅ Cleaned {len(df)} names using spaCy NER\")\n",
245
+ "\n",
246
+ "# Save intermediate result\n",
247
+ "output_step1 = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_01_NER.csv\"\n",
248
+ "df.to_csv(output_step1, index=False)\n",
249
+ "print(f\"💾 Saved to {output_step1}\")\n"
250
+ ]
251
+ },
252
+ {
253
+ "cell_type": "markdown",
254
+ "id": "64687c72",
255
+ "metadata": {},
256
+ "source": [
257
+ "#### STEP 02: Nationality tag to Country hint\n",
258
+ "here tags related to nationality gets converted to the country equivalent."
259
+ ]
260
+ },
261
+ {
262
+ "cell_type": "code",
263
+ "execution_count": null,
264
+ "id": "d6eaef5b",
265
+ "metadata": {},
266
+ "outputs": [],
267
+ "source": [
268
+ "import pandas as pd\n",
269
+ "from pathlib import Path\n",
270
+ "\n",
271
+ "# Set up paths\n",
272
+ "current_dir = Path.cwd()\n",
273
+ "countries_file = current_dir.parent / \"misc/lists/countries.csv\"\n",
274
+ "professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
275
+ "input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_01_NER.csv\"\n",
276
+ "output_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
277
+ "\n",
278
+ "# Load datasets\n",
279
+ "poi_df = pd.read_csv(input_file)\n",
280
+ "countries_df = pd.read_csv(countries_file)\n",
281
+ "professions_df = pd.read_csv(professions_file)\n",
282
+ "\n",
283
+ "# Define uninhabited or non-relevant territories to exclude\n",
284
+ "excluded_territories = {\n",
285
+ " 'isle of man', 'bouvet island', 'heard island and mcdonald islands',\n",
286
+ " 'french southern territories', 'south georgia and the south sandwich islands',\n",
287
+ " 'svalbard and jan mayen', 'british indian ocean territory', 'antarctica',\n",
288
+ " 'christmas island', 'cocos (keeling) islands', 'norfolk island',\n",
289
+ " 'pitcairn', 'tokelau', 'united states minor outlying islands',\n",
290
+ " 'wallis and futuna', 'western sahara'\n",
291
+ "}\n",
292
+ "\n",
293
+ "# Step 1: Combine tags into one lowercase list\n",
294
+ "def combine_tags(row):\n",
295
+ " return [str(row[f\"tag_{i}\"]).strip().lower() for i in range(1, 8) if pd.notna(row.get(f\"tag_{i}\"))]\n",
296
+ "\n",
297
+ "poi_df[\"tags\"] = poi_df.apply(combine_tags, axis=1)\n",
298
+ "\n",
299
+ "# Step 2: Build tag → (country, nationality) mapping with PRIORITIES\n",
300
+ "tag_to_country_nationality = {}\n",
301
+ "# We'll use a priority score: direct country name = 3, nationality = 2, word parts = 1\n",
302
+ "\n",
303
+ "for _, row in countries_df.iterrows():\n",
304
+ " country = str(row[\"en_short_name\"]).strip()\n",
305
+ " nationality = str(row[\"nationality\"]).strip()\n",
306
+ " \n",
307
+ " # Skip excluded territories\n",
308
+ " if country.lower() in excluded_territories:\n",
309
+ " continue\n",
310
+ "\n",
311
+ " country_lc = country.lower()\n",
312
+ " nationality_lc = nationality.lower()\n",
313
+ "\n",
314
+ " # Store as (country, nationality, priority)\n",
315
+ " # Exact country name match = highest priority\n",
316
+ " if country_lc not in tag_to_country_nationality:\n",
317
+ " tag_to_country_nationality[country_lc] = (country, \"\", 3)\n",
318
+ " \n",
319
+ " # Exact nationality match = medium priority \n",
320
+ " if nationality_lc not in tag_to_country_nationality:\n",
321
+ " tag_to_country_nationality[nationality_lc] = (\"\", nationality, 2)\n",
322
+ " \n",
323
+ " # No-space versions\n",
324
+ " country_no_space = country_lc.replace(\" \", \"\")\n",
325
+ " nationality_no_space = nationality_lc.replace(\" \", \"\")\n",
326
+ " \n",
327
+ " if country_no_space not in tag_to_country_nationality:\n",
328
+ " tag_to_country_nationality[country_no_space] = (country, \"\", 3)\n",
329
+ " if nationality_no_space not in tag_to_country_nationality:\n",
330
+ " tag_to_country_nationality[nationality_no_space] = (\"\", nationality, 2)\n",
331
+ "\n",
332
+ " # Word parts = lowest priority (only for longer words to avoid false matches)\n",
333
+ " for part in country_lc.split():\n",
334
+ " if len(part) > 4: # Only words longer than 4 chars\n",
335
+ " if part not in tag_to_country_nationality:\n",
336
+ " tag_to_country_nationality[part] = (country, \"\", 1)\n",
337
+ " for part in nationality_lc.split():\n",
338
+ " if len(part) > 4:\n",
339
+ " if part not in tag_to_country_nationality:\n",
340
+ " tag_to_country_nationality[part] = (\"\", nationality, 1)\n",
341
+ "\n",
342
+ "print(f\"Built country/nationality mapping with {len(tag_to_country_nationality)} entries\")\n",
343
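+ "\n",
+ "# Example entries (assuming 'Japan'/'Japanese' appear in countries.csv):\n",
+ "# 'japan' -> ('Japan', '', 3), 'japanese' -> ('', 'Japanese', 2); words longer\n",
+ "# than 4 characters from multi-word names map with priority 1.\n",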
+ "\n",
344
+ "# Step 3: Infer likely_country and likely_nationality by checking ALL tags\n",
345
+ "def infer_country_and_nationality(tags):\n",
346
+ " \"\"\"\n",
347
+ " Check ALL tags and return the best match based on priority.\n",
348
+ " Priority: exact country name > nationality > word parts\n",
349
+ " \"\"\"\n",
350
+ " best_match = None\n",
351
+ " best_priority = 0\n",
352
+ " \n",
353
+ " for tag in tags:\n",
354
+ " # Try cleaned version (no spaces)\n",
355
+ " cleaned = tag.replace(\" \", \"\").lower()\n",
356
+ " \n",
357
+ " # Check cleaned version\n",
358
+ " if cleaned in tag_to_country_nationality:\n",
359
+ " country, nationality, priority = tag_to_country_nationality[cleaned]\n",
360
+ " if priority > best_priority and country and country.lower() not in excluded_territories:\n",
361
+ " best_match = (country, nationality)\n",
362
+ " best_priority = priority\n",
363
+ " \n",
364
+ " # Also check original tag\n",
365
+ " if tag in tag_to_country_nationality:\n",
366
+ " country, nationality, priority = tag_to_country_nationality[tag]\n",
367
+ " if priority > best_priority and country and country.lower() not in excluded_territories:\n",
368
+ " best_match = (country, nationality)\n",
369
+ " best_priority = priority\n",
370
+ " \n",
371
+ " if best_match:\n",
372
+ " return pd.Series(best_match)\n",
373
+ " return pd.Series([\"\", \"\"])\n",
374
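+ "\n",
+ "# Worked example (assuming 'Japan' is in countries.csv):\n",
+ "# infer_country_and_nationality(['tokyo', 'japan', 'singer']) returns a Series\n",
+ "# ('Japan', '') because the exact country name matches at priority 3.\n",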
+ "\n",
375
+ "poi_df[[\"likely_country\", \"likely_nationality\"]] = poi_df[\"tags\"].apply(infer_country_and_nationality)\n",
376
+ "\n",
377
+ "# Step 4: Build tag → profession mapping\n",
378
+ "profession_alias_map = {}\n",
379
+ "\n",
380
+ "for _, row in professions_df.iterrows():\n",
381
+ " canonical = str(row['profession']).strip().lower()\n",
382
+ " profession_alias_map[canonical] = canonical\n",
383
+ " for alias_col in ['alias_1', 'alias_2', 'alias_3']:\n",
384
+ " alias = row.get(alias_col)\n",
385
+ " if pd.notna(alias):\n",
386
+ " profession_alias_map[str(alias).strip().lower()] = canonical\n",
387
+ "\n",
388
+ "# Step 5: Infer likely profession from tags\n",
389
+ "def infer_profession_from_tags(tags):\n",
390
+ " matched = []\n",
391
+ " for tag in tags:\n",
392
+ " cleaned = tag.strip().lower()\n",
393
+ " if cleaned in profession_alias_map:\n",
394
+ " matched.append(profession_alias_map[cleaned])\n",
395
+ "\n",
396
+ " if not matched:\n",
397
+ " return \"\"\n",
398
+ " if \"celebrity\" in matched and len(set(matched)) > 1:\n",
399
+ " # Drop 'celebrity' if other professions are present\n",
400
+ " matched = [m for m in matched if m != \"celebrity\"]\n",
401
+ "\n",
402
+ " return matched[0] # Return the first specific match\n",
403
+ "\n",
404
+ "\n",
405
+ "poi_df[\"likely_profession\"] = poi_df[\"tags\"].apply(infer_profession_from_tags)\n",
406
+ "\n",
407
+ "# Step 6: Save enriched dataset\n",
408
+ "poi_df.to_csv(output_file, index=False)\n",
409
+ "\n",
410
+ "# Preview results\n",
411
+ "print(f\"\\nProcessed {len(poi_df)} rows\")\n",
412
+ "print(f\"Rows with country: {(poi_df['likely_country'] != '').sum()}\")\n",
413
+ "print(f\"Rows with nationality: {(poi_df['likely_nationality'] != '').sum()}\")\n",
414
+ "print(f\"Rows with profession: {(poi_df['likely_profession'] != '').sum()}\")\n",
415
+ "\n",
416
+ "print(f\"\\nTop 10 countries:\")\n",
417
+ "print(poi_df[poi_df['likely_country'] != '']['likely_country'].value_counts().head(10))\n"
418
+ ]
419
+ },
420
+ {
421
+ "cell_type": "markdown",
422
+ "id": "4a4a58b3",
423
+ "metadata": {},
424
+ "source": [
425
+ "## LLM ANNOTATION"
426
+ ]
427
+ },
428
+ {
429
+ "cell_type": "markdown",
430
+ "id": "b298844d",
431
+ "metadata": {},
432
+ "source": [
433
+ "#### Model Configurations"
434
+ ]
435
+ },
436
+ {
437
+ "cell_type": "code",
438
+ "execution_count": null,
439
+ "id": "39f3d65e",
440
+ "metadata": {},
441
+ "outputs": [],
442
+ "source": [
443
+ "import pandas as pd\n",
444
+ "import json\n",
445
+ "import time\n",
446
+ "import re\n",
447
+ "from pathlib import Path\n",
448
+ "from tqdm import tqdm\n",
449
+ "import torch\n",
450
+ "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
451
+ "import signal\n",
452
+ "from contextlib import contextmanager\n",
453
+ "\n",
454
+ "# Configuration\n",
455
+ "current_dir = Path.cwd()\n",
456
+ "CACHE_DIR = current_dir.parent / \"data/models\"\n",
457
+ "CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
458
+ "\n",
459
+ "# Model configurations\n",
460
+ "MODEL_CONFIGS = {\n",
461
+ " 'mistral': {\n",
462
+ " 'name': 'mistralai/Mistral-7B-Instruct-v0.3',\n",
463
+ " 'dtype': torch.bfloat16,\n",
464
+ " 'quantization': None,\n",
465
+ " 'generation_params': {\n",
466
+ " 'max_new_tokens': 512,\n",
467
+ " 'temperature': 0.05,\n",
468
+ " 'do_sample': True,\n",
469
+ " 'top_p': 1.0,\n",
470
+ " }\n",
471
+ " },\n",
472
+ " 'gemma': {\n",
473
+ " 'name': 'google/gemma-3-27b-it',\n",
474
+ " 'dtype': torch.bfloat16,\n",
475
+ " 'quantization': None,\n",
476
+ " 'generation_params': {\n",
477
+ " 'max_new_tokens': 512,\n",
478
+ " 'temperature': 0.1,\n",
479
+ " 'do_sample': True,\n",
480
+ " 'top_p': 1.0,\n",
481
+ " }\n",
482
+ " },\n",
483
+ " 'qwen': {\n",
484
+ " 'name': 'Qwen/Qwen2.5-32B-Instruct',\n",
485
+ " 'dtype': None, # Will use quantization\n",
486
+ " 'quantization': BitsAndBytesConfig(\n",
487
+ " load_in_8bit=True,\n",
488
+ " llm_int8_threshold=6.0,\n",
489
+ " llm_int8_has_fp16_weight=False\n",
490
+ " ),\n",
491
+ " 'generation_params': {\n",
492
+ " 'max_new_tokens': 512,\n",
493
+ " 'temperature': 0.1,\n",
494
+ " 'do_sample': False,\n",
495
+ " }\n",
496
+ " }\n",
497
+ "}\n",
498
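+ "\n",
+ "# Note: the 8-bit Qwen config relies on the bitsandbytes package; loading the\n",
+ "# model will raise an ImportError if bitsandbytes is not installed.\n",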
+ "\n",
499
+ "PROFESSION_CATEGORIES = [\n",
500
+ " \"actor\",\n",
501
+ " \"adult performer\",\n",
502
+ " \"singer/musician\",\n",
503
+ " \"model\",\n",
504
+ " \"online personality\",\n",
505
+ " \"public figure\",\n",
506
+ " \"voice actor/ASMR\",\n",
507
+ " \"sports professional\",\n",
508
+ " \"tv personality\"\n",
509
+ "]\n"
510
+ ]
511
+ },
512
+ {
513
+ "cell_type": "markdown",
514
+ "id": "c215b38c",
515
+ "metadata": {},
516
+ "source": [
517
+ "#### Load Model Function"
518
+ ]
519
+ },
520
+ {
521
+ "cell_type": "code",
522
+ "execution_count": null,
523
+ "id": "cfb5b13e",
524
+ "metadata": {},
525
+ "outputs": [],
526
+ "source": [
527
+ "def load_model(model_type='mistral'):\n",
528
+ " \"\"\"\n",
529
+ " Load model and tokenizer based on type.\n",
530
+ " \n",
531
+ " Args:\n",
532
+ " model_type: 'mistral', 'gemma', or 'qwen'\n",
533
+ " \n",
534
+ " Returns:\n",
535
+ " tuple: (model, tokenizer, config)\n",
536
+ " \"\"\"\n",
537
+ " if model_type not in MODEL_CONFIGS:\n",
538
+ " raise ValueError(f\"Unknown model type: {model_type}. Choose from {list(MODEL_CONFIGS.keys())}\")\n",
539
+ " \n",
540
+ " config = MODEL_CONFIGS[model_type]\n",
541
+ " model_name = config['name']\n",
542
+ " \n",
543
+ " device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
544
+ " print(f\"Loading model: {model_name}\")\n",
545
+ " print(f\"Cache directory: {CACHE_DIR}\")\n",
546
+ " print(f\"Device: {device}\\n\")\n",
547
+ " \n",
548
+ " if device == \"cpu\":\n",
549
+ " print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
550
+ " \n",
551
+ " # Load tokenizer\n",
552
+ " try:\n",
553
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
554
+ " model_name,\n",
555
+ " cache_dir=str(CACHE_DIR),\n",
556
+ " use_fast=True\n",
557
+ " )\n",
558
+ " except:\n",
559
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
560
+ " model_name,\n",
561
+ " cache_dir=str(CACHE_DIR),\n",
562
+ " use_fast=False\n",
563
+ " )\n",
564
+ " \n",
565
+ " if tokenizer.pad_token is None:\n",
566
+ " tokenizer.pad_token = tokenizer.eos_token\n",
567
+ " \n",
568
+ " # Load model\n",
569
+ " model_kwargs = {\n",
570
+ " 'cache_dir': str(CACHE_DIR),\n",
571
+ " 'device_map': 'auto',\n",
572
+ " 'trust_remote_code': False\n",
573
+ " }\n",
574
+ " \n",
575
+ " if config['quantization']:\n",
576
+ " model_kwargs['quantization_config'] = config['quantization']\n",
577
+ " else:\n",
578
+ " model_kwargs['torch_dtype'] = config['dtype']\n",
579
+ " \n",
580
+ " model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)\n",
581
+ " model.eval()\n",
582
+ " \n",
583
+ " # Check VRAM\n",
584
+ " if torch.cuda.is_available():\n",
585
+ " vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
586
+ " print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
587
+ " \n",
588
+ " return model, tokenizer, config\n"
589
+ ]
590
+ },
591
+ {
592
+ "cell_type": "markdown",
593
+ "id": "11b2221a",
594
+ "metadata": {},
595
+ "source": [
596
+ "#### Inference Code"
597
+ ]
598
+ },
599
+ {
600
+ "cell_type": "code",
601
+ "execution_count": null,
602
+ "id": "229f96bd",
603
+ "metadata": {},
604
+ "outputs": [],
605
+ "source": [
606
+ "@contextmanager\n",
607
+ "def timeout(duration):\n",
608
+ " \"\"\"Context manager for timeout.\"\"\"\n",
609
+ " def handler(signum, frame):\n",
610
+ " raise TimeoutError(\"Operation timed out\")\n",
611
+ " \n",
612
+ " signal.signal(signal.SIGALRM, handler)\n",
613
+ " signal.alarm(duration)\n",
614
+ " try:\n",
615
+ " yield\n",
616
+ " finally:\n",
617
+ " signal.alarm(0)\n",
618
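+ "\n",
+ "# Note: signal.SIGALRM is POSIX-only, so this timeout guard works on Linux/macOS\n",
+ "# (main thread only) but not on Windows.\n",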
+ "\n",
619
+ "def query_model(prompt, model, tokenizer, config, use_timeout=False):\n",
620
+ " \"\"\"\n",
621
+ " Query model with given prompt.\n",
622
+ " \n",
623
+ " Args:\n",
624
+ " prompt: Input prompt string\n",
625
+ " model: Loaded model\n",
626
+ " tokenizer: Loaded tokenizer\n",
627
+ " config: Model configuration dict\n",
628
+ " use_timeout: Whether to use 60s timeout (for Qwen)\n",
629
+ " \n",
630
+ " Returns:\n",
631
+ " str: Model response or None on error\n",
632
+ " \"\"\"\n",
633
+ " try:\n",
634
+ " device = next(model.parameters()).device\n",
635
+ " \n",
636
+ " # Format as chat message\n",
637
+ " messages = [\n",
638
+ " {\"role\": \"system\", \"content\": \"You are a data extraction assistant. Respond with exactly 5 numbered lines containing ONLY values. No labels, no explanations, no prefixes. Follow the format precisely.\"},\n",
639
+ " {\"role\": \"user\", \"content\": prompt}\n",
640
+ " ]\n",
641
+ " \n",
642
+ " # Tokenize\n",
643
+ " if hasattr(tokenizer, 'apply_chat_template'):\n",
644
+ " text = tokenizer.apply_chat_template(\n",
645
+ " messages,\n",
646
+ " tokenize=False,\n",
647
+ " add_generation_prompt=True\n",
648
+ " )\n",
649
+ " else:\n",
650
+ " text = f\"[INST] {prompt} [/INST]\"\n",
651
+ " \n",
652
+ " inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
653
+ " \n",
654
+ " # Generation parameters\n",
655
+ " gen_kwargs = config['generation_params'].copy()\n",
656
+ " gen_kwargs['pad_token_id'] = tokenizer.eos_token_id\n",
657
+ " \n",
658
+ " # Generate\n",
659
+ " generation_fn = lambda: model.generate(**inputs, **gen_kwargs)\n",
660
+ " \n",
661
+ " if use_timeout:\n",
662
+ " with timeout(60):\n",
663
+ " with torch.no_grad():\n",
664
+ " outputs = generation_fn()\n",
665
+ " else:\n",
666
+ " with torch.no_grad():\n",
667
+ " outputs = generation_fn()\n",
668
+ " \n",
669
+ " # Decode\n",
670
+ " generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
671
+ " response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
672
+ " \n",
673
+ " return response.strip()\n",
674
+ " \n",
675
+ " except TimeoutError:\n",
676
+ " print(f\"[ERROR] Generation timed out after 60 seconds\")\n",
677
+ " return None\n",
678
+ " except Exception as e:\n",
679
+ " print(f\"[ERROR] Generation failed: {e}\")\n",
680
+ " return None\n"
681
+ ]
682
+ },
683
+ {
684
+ "cell_type": "markdown",
685
+ "id": "88f005f8",
686
+ "metadata": {},
687
+ "source": [
688
+ "#### Prompt creation"
689
+ ]
690
+ },
691
+ {
692
+ "cell_type": "code",
693
+ "execution_count": null,
694
+ "id": "dfe05463",
695
+ "metadata": {},
696
+ "outputs": [],
697
+ "source": [
698
+ "def create_prompt(row):\n",
699
+ " \"\"\"Create annotation prompt from row data.\"\"\"\n",
700
+ " name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
701
+ " \n",
702
+ " # Gather hints\n",
703
+ " hints = []\n",
704
+ " if pd.notna(row.get('likely_profession')):\n",
705
+ " hints.append(str(row['likely_profession']))\n",
706
+ " if pd.notna(row.get('likely_nationality')):\n",
707
+ " hints.append(str(row['likely_nationality']))\n",
708
+ " if pd.notna(row.get('likely_country')):\n",
709
+ " hints.append(str(row['likely_country']))\n",
710
+ " \n",
711
+ " # Add tags if needed\n",
712
+ " if len(hints) < 3:\n",
713
+ " for i in range(1, 8):\n",
714
+ " tag_col = f'tag_{i}'\n",
715
+ " if tag_col in row and pd.notna(row[tag_col]):\n",
716
+ " tag_val = str(row[tag_col])\n",
717
+ " if tag_val not in hints:\n",
718
+ " hints.append(tag_val)\n",
719
+ " if len(hints) >= 5:\n",
720
+ " break\n",
721
+ " \n",
722
+ " hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
723
+ " \n",
724
+ " return f\"\"\"Extract information about '{name}' ({hint_text}).\n",
725
+ "\n",
726
+ "Respond with EXACTLY 5 numbered lines. Each line must contain ONLY the value, no labels or extra text.\n",
727
+ "\n",
728
+ "FORMAT REQUIREMENTS:\n",
729
+ "1. Full legal name in Western order (first last). VALUE ONLY.\n",
730
+ "2. Stage names/aliases, comma-separated. If none, write \"None\". VALUE ONLY.\n",
731
+ "3. Gender: MUST be exactly one word: Male, Female, Other, or Unknown. VALUE ONLY.\n",
732
+ "4. Professions: Choose up to 3 from this list ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality. Comma-separated. VALUE ONLY.\n",
733
+ "5. Primary country: Country name only (e.g., \"China\", \"United States\", \"Colombia\"). VALUE ONLY.\n",
734
+ "\n",
735
+ "RULES:\n",
736
+ "- Professions MUST match the exact categories listed (actress = actor)\n",
737
+ "- \"online personality\" includes streamers, cosplayers, YouTubers, influencers\n",
738
+ "- \"public figure\" includes politicians, activists, journalists, authors\n",
739
+ "- Use \"Unknown\" when uncertain or for fictional characters\n",
740
+ "- NO explanatory text, NO labels like \"Gender:\", NO prefixes\n",
741
+ "- For multi-role people, list up to 3 categories by relevance\n",
742
+ "\n",
743
+ "EXAMPLE FORMAT:\n",
744
+ "1. Taylor Swift\n",
745
+ "2. None\n",
746
+ "3. Female\n",
747
+ "4. singer/musician, public figure\n",
748
+ "5. United States\"\"\"\n"
749
+ ]
750
+ },
751
+ {
752
+ "cell_type": "markdown",
753
+ "id": "854fa668",
754
+ "metadata": {},
755
+ "source": [
756
+ "#### Response parsing code"
757
+ ]
758
+ },
759
+ {
760
+ "cell_type": "code",
761
+ "execution_count": null,
762
+ "id": "1a4be2ee",
763
+ "metadata": {},
764
+ "outputs": [],
765
+ "source": [
766
+ "def parse_response(response):\n",
767
+ " \"\"\"Parse model response into structured fields.\"\"\"\n",
768
+ " if not response:\n",
769
+ " return {\n",
770
+ " 'full_name': 'Unknown',\n",
771
+ " 'aliases': 'Unknown',\n",
772
+ " 'gender': 'Unknown',\n",
773
+ " 'profession_llm': 'Unknown',\n",
774
+ " 'country': 'Unknown'\n",
775
+ " }\n",
776
+ " \n",
777
+ " lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
778
+ " \n",
779
+ " fields = {\n",
780
+ " 'full_name': 'Unknown',\n",
781
+ " 'aliases': 'Unknown',\n",
782
+ " 'gender': 'Unknown',\n",
783
+ " 'profession_llm': 'Unknown',\n",
784
+ " 'country': 'Unknown'\n",
785
+ " }\n",
786
+ " \n",
787
+ " for line in lines:\n",
788
+ " if line.startswith('1.'):\n",
789
+ " fields['full_name'] = line[2:].strip()\n",
790
+ " elif line.startswith('2.'):\n",
791
+ " fields['aliases'] = line[2:].strip()\n",
792
+ " elif line.startswith('3.'):\n",
793
+ " gender_raw = line[2:].strip()\n",
794
+ " gender_raw = re.sub(r'^(Gender:|gender:)\\s*', '', gender_raw, flags=re.IGNORECASE)\n",
795
+ " gender_match = re.search(r'\\b(Male|Female|Other|Unknown)\\b', gender_raw, re.IGNORECASE)\n",
796
+ " fields['gender'] = gender_match.group(1).capitalize() if gender_match else gender_raw\n",
797
+ " elif line.startswith('4.'):\n",
798
+ " fields['profession_llm'] = line[2:].strip()\n",
799
+ " elif line.startswith('5.'):\n",
800
+ " country_raw = line[2:].strip()\n",
801
+ " country_raw = re.sub(r'^(Primary\\s+)?(associated\\s+)?country:\\s*', '', country_raw, flags=re.IGNORECASE)\n",
802
+ " fields['country'] = country_raw\n",
803
+ " \n",
804
+ " return fields\n"
805
+ ]
806
+ },
807
+ {
808
+ "cell_type": "markdown",
809
+ "id": "7e2f7a86",
810
+ "metadata": {},
811
+ "source": [
812
+ "#### CSV annotation"
813
+ ]
814
+ },
815
+ {
816
+ "cell_type": "code",
817
+ "execution_count": null,
818
+ "id": "5f3dd5d6",
819
+ "metadata": {},
820
+ "outputs": [],
821
+ "source": [
822
+ "def annotate_dataset(model_type='mistral', test_mode=False, test_size=100, max_rows=50862, save_interval=10):\n",
823
+ " \"\"\"\n",
824
+ " Annotate dataset using specified model.\n",
825
+ " \n",
826
+ " Args:\n",
827
+ " model_type: 'mistral', 'gemma', or 'qwen'\n",
828
+ " test_mode: If True, only process test_size rows\n",
829
+ " test_size: Number of rows to process in test mode\n",
830
+ " max_rows: Maximum rows to process\n",
831
+ " save_interval: Save progress every N rows\n",
832
+ " \"\"\"\n",
833
+ " # Setup paths\n",
834
+ " input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
835
+ " output_file = current_dir.parent / f\"data/CSV/{model_type}_local_annotated_POI{'_test' if test_mode else ''}.csv\"\n",
836
+ " index_file = current_dir.parent / f\"misc/query_indicies/{model_type}_local_query_index.txt\"\n",
837
+ " index_file.parent.mkdir(parents=True, exist_ok=True)\n",
838
+ " \n",
839
+ " # Load model\n",
840
+ " model, tokenizer, config = load_model(model_type)\n",
841
+ " \n",
842
+ " # Load data\n",
843
+ " print(f\"Loaded {len(df)} rows from input file\")\n",
844
+ " df = pd.read_csv(input_file)\n",
845
+ " \n",
846
+ " # Merge existing annotations if available\n",
847
+ " if output_file.exists():\n",
848
+ " existing_df = pd.read_csv(output_file)\n",
849
+ " annotation_cols = ['full_name', 'aliases', 'gender', 'profession_llm', 'country']\n",
850
+ " for col in annotation_cols:\n",
851
+ " if col in existing_df.columns:\n",
852
+ " df[col] = existing_df[col][:len(df)]\n",
853
+ " \n",
854
+ " # Apply limits\n",
855
+ " if test_mode:\n",
856
+ " df = df.head(test_size).copy()\n",
857
+ " elif max_rows:\n",
858
+ " df = df.head(max_rows).copy()\n",
859
+ " \n",
860
+ " # Create prompts\n",
861
+ " df['prompt'] = df.apply(create_prompt, axis=1)\n",
862
+ " \n",
863
+ " # Load progress index\n",
864
+ " current_index = 0\n",
865
+ " if index_file.exists():\n",
866
+ " try:\n",
867
+ " current_index = int(index_file.read_text().strip())\n",
868
+ " except:\n",
869
+ " current_index = 0\n",
870
+ " \n",
871
+ " print(f\"Resuming from index {current_index}\")\n",
872
+ " \n",
873
+ " # Process rows\n",
874
+ " use_timeout = (model_type == 'qwen')\n",
875
+ " \n",
876
+ " for i in tqdm(range(current_index, len(df)), desc=f\"{model_type.capitalize()} Annotation\"):\n",
877
+ " prompt = df.at[i, \"prompt\"]\n",
878
+ " \n",
879
+ " # Query with retries\n",
880
+ " response = None\n",
881
+ " for attempt in range(3):\n",
882
+ " response = query_model(prompt, model, tokenizer, config, use_timeout)\n",
883
+ " \n",
884
+ " if response and len(response.strip()) > 10:\n",
885
+ " break\n",
886
+ " \n",
887
+ " print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
888
+ " time.sleep(0.5)\n",
889
+ " \n",
890
+ " # Skip if invalid\n",
891
+ " if not response or len(response.strip()) <= 10:\n",
892
+ " print(f\"❌ Row {i}: failed after retries, skipping\")\n",
893
+ " continue\n",
894
+ " \n",
895
+ " # Parse and validate\n",
896
+ " parsed = parse_response(response)\n",
897
+ " \n",
898
+ " if all(v == \"Unknown\" for v in parsed.values()):\n",
899
+ " print(f\"❌ Row {i}: parsed as all Unknown, skipping\")\n",
900
+ " continue\n",
901
+ " \n",
902
+ " # Write fields\n",
903
+ " for key, value in parsed.items():\n",
904
+ " df.at[i, key] = value\n",
905
+ " \n",
906
+ " current_index = i + 1\n",
907
+ " \n",
908
+ " # GPU cleanup\n",
909
+ " if torch.cuda.is_available():\n",
910
+ " torch.cuda.empty_cache()\n",
911
+ " torch.cuda.synchronize()\n",
912
+ " \n",
913
+ " # Save progress\n",
914
+ " if (i + 1) % save_interval == 0 or (i + 1) == len(df):\n",
915
+ " df.to_csv(output_file, index=False)\n",
916
+ " index_file.write_text(str(current_index))\n",
917
+ " print(f\"💾 Progress saved after row {i+1}\")\n",
918
+ " \n",
919
+ " # Final save\n",
920
+ " df.to_csv(output_file, index=False)\n",
921
+ " index_file.write_text(str(current_index))\n",
922
+ " print(f\"✓ Finished annotation with {model_type}\")\n"
923
+ ]
924
+ },
925
+ {
926
+ "cell_type": "markdown",
927
+ "id": "55da2f4c",
928
+ "metadata": {},
929
+ "source": [
930
+ "### Usage Examples\n",
931
+ "Run annotation with your chosen model."
932
+ ]
933
+ },
934
+ {
935
+ "cell_type": "code",
936
+ "execution_count": null,
937
+ "id": "351ea40c",
938
+ "metadata": {},
939
+ "outputs": [],
940
+ "source": [
941
+ "# Example 1: Annotate with Mistral (13.5 GB VRAM)\n",
942
+ "# annotate_dataset(model_type='mistral', test_mode=False)\n",
943
+ "\n",
944
+ "# Example 2: Annotate with Gemma (56.3 GB VRAM)\n",
945
+ "# annotate_dataset(model_type='gemma', test_mode=False)\n",
946
+ "\n",
947
+ "# Example 3: Annotate with Qwen (32.7 GB VRAM, 8-bit)\n",
948
+ "# annotate_dataset(model_type='qwen', test_mode=False)\n",
949
+ "\n",
950
+ "# Test mode (first 100 rows)\n",
951
+ "# annotate_dataset(model_type='mistral', test_mode=True, test_size=100)\n"
952
+ ]
953
+ },
954
+ {
955
+ "cell_type": "markdown",
956
+ "id": "6431d347-d80c-4e8b-83a7-531e5df95a72",
957
+ "metadata": {},
958
+ "source": [
959
+ "## EuroLLM-9B-Instruct"
960
+ ]
961
+ },
962
+ {
963
+ "cell_type": "code",
964
+ "execution_count": null,
965
+ "id": "e8203abc-e7c3-4cb6-aaeb-fdc6933981fc",
966
+ "metadata": {},
967
+ "outputs": [],
968
+ "source": [
969
+ "import pandas as pd\n",
970
+ "import json\n",
971
+ "import time\n",
972
+ "import re\n",
973
+ "from pathlib import Path\n",
974
+ "from tqdm import tqdm\n",
975
+ "import torch\n",
976
+ "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
977
+ "import signal\n",
978
+ "from contextlib import contextmanager\n",
979
+ "\n",
980
+ "current_dir = Path.cwd()\n",
981
+ "input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
982
+ "professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
983
+ "professions_mapped_file = current_dir.parent / \"misc/lists/professions_mapped.csv\"\n",
984
+ "# === PROCESS DATA ===\n",
985
+ "\n",
986
+ "\n",
987
+ "# === CONFIGURATION ===\n",
988
+ "TEST_MODE = False\n",
989
+ "TEST_SIZE = 100\n",
990
+ "MAX_ROWS = 50862\n",
991
+ "SAVE_INTERVAL = 10\n",
992
+ "\n",
993
+ "\n",
994
+ "index_file = current_dir.parent / \"misc/query_indicies/eurollm_local_query_index.txt\"\n",
995
+ "output_file = current_dir.parent / f\"data/CSV/eurollm_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
996
+ "\n",
997
+ "# Model settings\n",
998
+ "MODEL_NAME = \"utter-project/EuroLLM-9B-Instruct\"\n",
999
+ "#MODEL_NAME = \"Qwen/Qwen2.5-32B-Instruct\"\n",
1000
+ "#MODEL_NAME = \"Qwen/Qwen2.5-14B-Instruct\"\n",
1001
+ "#MODEL_NAME = \"Qwen/Qwen3-235B-A22B-Instruct-2507-FP8\"\n",
1002
+ "#MODEL_NAME = \"mistralai/Mistral-7B-Instruct-v0.3\"\n",
1003
+ "CACHE_DIR = current_dir.parent / \"data/models\"\n",
1004
+ "CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
1005
+ "\n",
1006
+ "# Define the SPECIFIC profession categories\n",
1007
+ "PROFESSION_CATEGORIES = [\n",
1008
+ " \"actor\",\n",
1009
+ " \"adult performer\",\n",
1010
+ " \"singer/musician\",\n",
1011
+ " \"model\",\n",
1012
+ " \"online personality\",\n",
1013
+ " \"public figure\",\n",
1014
+ " \"voice actor/ASMR\",\n",
1015
+ " \"sports professional\",\n",
1016
+ " \"tv personality\"\n",
1017
+ "]\n",
1018
+ "\n",
1019
+ "# === LOAD MODEL ===\n",
1020
+ "print(f\"Loading model: {MODEL_NAME}\")\n",
1021
+ "print(f\"Cache directory: {CACHE_DIR}\")\n",
1022
+ "print(f\"This may take a while on first run...\\n\")\n",
1023
+ "\n",
1024
+ "# Check GPU availability\n",
1025
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
1026
+ "print(f\"Device: {device}\")\n",
1027
+ "\n",
1028
+ "if device == \"cpu\":\n",
1029
+ " print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
1030
+ " print(\" Consider using a GPU or reducing model size.\")\n",
1031
+ "\n",
1032
+ "# Get HF token from credentials file\n",
1033
+ "import os\n",
1034
+ "credentials_dir = current_dir.parent / \"misc/credentials\"\n",
1035
+ "hf_token_file = credentials_dir / \"hf_token.txt\"\n",
1036
+ "\n",
1037
+ "HF_TOKEN = None\n",
1038
+ "if hf_token_file.exists():\n",
1039
+ " HF_TOKEN = hf_token_file.read_text().strip()\n",
1040
+ " print(\"✅ HF token loaded from credentials file\")\n",
1041
+ "else:\n",
1042
+ " print(\"⚠️ HF token file not found at:\", hf_token_file)\n",
1043
+ " print(\" The script will try to use cached credentials from 'huggingface-cli login'\")\n",
1044
+ " print(\" Or create the file: misc/credentials/hf_token.txt with your token\")\n",
1045
+ " HF_TOKEN = None # Will use cached token if available\n",
1046
+ "\n",
1047
+ "# Load tokenizer\n",
1048
+ "print(\"Loading tokenizer...\")\n",
1049
+ "try:\n",
1050
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
1051
+ " MODEL_NAME,\n",
1052
+ " cache_dir=str(CACHE_DIR),\n",
1053
+ " use_fast=True,\n",
1054
+ " token=HF_TOKEN\n",
1055
+ " )\n",
1056
+ "except Exception as e:\n",
1057
+ " print(f\"Failed with use_fast=True, trying use_fast=False...\")\n",
1058
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
1059
+ " MODEL_NAME,\n",
1060
+ " cache_dir=str(CACHE_DIR),\n",
1061
+ " use_fast=False,\n",
1062
+ " token=HF_TOKEN\n",
1063
+ " )\n",
1064
+ "\n",
1065
+ "# Ensure pad token is set\n",
1066
+ "if tokenizer.pad_token is None:\n",
1067
+ " tokenizer.pad_token = tokenizer.eos_token\n",
1068
+ "\n",
1069
+ "print(\"✅ Tokenizer loaded\")\n",
1070
+ "\n",
1071
+ "# Configure 8-bit quantization for A100\n",
1072
+ "print(\"Configuring 8-bit quantization...\")\n",
1073
+ "quantization_config = BitsAndBytesConfig(\n",
1074
+ " load_in_8bit=True,\n",
1075
+ " llm_int8_threshold=6.0,\n",
1076
+ " llm_int8_has_fp16_weight=False\n",
1077
+ ")\n",
1078
+ "\n",
1079
+ "# Load model with 8-bit quantization\n",
1080
+ "print(\"Loading model with 8-bit quantization (this may take several minutes)...\")\n",
1081
+ "model = AutoModelForCausalLM.from_pretrained(\n",
1082
+ " MODEL_NAME,\n",
1083
+ " cache_dir=str(CACHE_DIR),\n",
1084
+ " quantization_config=quantization_config,\n",
1085
+ " device_map=\"auto\",\n",
1086
+ " trust_remote_code=False,\n",
1087
+ " token=HF_TOKEN\n",
1088
+ ")\n",
1089
+ "model.eval()\n",
1090
+ "print(\"✅ Model loaded with 8-bit quantization\")\n",
1091
+ "\n",
1092
+ "# Check VRAM usage\n",
1093
+ "if torch.cuda.is_available():\n",
1094
+ " vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
1095
+ " print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
1096
+ "\n",
1097
+ "# === LOAD DATA ===\n",
1098
+ "if output_file.exists():\n",
1099
+ " print(\"Loading annotated CSV...\")\n",
1100
+ " df = pd.read_csv(output_file)\n",
1101
+ "else:\n",
1102
+ " print(\"Loading raw input CSV...\")\n",
1103
+ " df = pd.read_csv(input_file)\n",
1104
+ "\n",
1105
+ "\n",
1106
+ "# Try to load profession mapping files\n",
1107
+ "try:\n",
1108
+ " professions_df = pd.read_csv(professions_file)\n",
1109
+ " print(f\"✅ Loaded professions.csv\")\n",
1110
+ "except:\n",
1111
+ " print(\"⚠️ Warning: professions.csv not found\")\n",
1112
+ "\n",
1113
+ "try:\n",
1114
+ " prof_mapped_df = pd.read_csv(professions_mapped_file)\n",
1115
+ " print(f\"✅ Loaded profession mapping with {len(prof_mapped_df)} categories\")\n",
1116
+ "except:\n",
1117
+ " print(\"⚠️ Warning: professions_mapped.csv not found, using default categories\")\n",
1118
+ "\n",
1119
+ "profession_str = \", \".join(PROFESSION_CATEGORIES)\n",
1120
+ "\n",
1121
+ "print(f\"Loaded {len(df)} rows\")\n",
1122
+ "print(f\"\\nProfession categories ({len(PROFESSION_CATEGORIES)}):\")\n",
1123
+ "for cat in PROFESSION_CATEGORIES:\n",
1124
+ " print(f\" - {cat}\")\n",
1125
+ "\n",
1126
+ "if TEST_MODE:\n",
1127
+ " print(f\"\\nRunning in TEST MODE with {TEST_SIZE} samples\")\n",
1128
+ " df = df.head(TEST_SIZE).copy()\n",
1129
+ "elif MAX_ROWS:\n",
1130
+ " df = df.head(MAX_ROWS).copy()\n",
1131
+ "\n",
1132
+ "# === CREATE PROMPTS (OPTIMIZED FOR CLEAN OUTPUTS) ===\n",
1133
+ "def create_prompt(row):\n",
1134
+ " \"\"\"Create prompt for EuroLLM annotation with strict formatting requirements.\"\"\"\n",
1135
+ " name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
1136
+ " \n",
1137
+ " # Gather hints\n",
1138
+ " hints = []\n",
1139
+ " if pd.notna(row.get('likely_profession')):\n",
1140
+ " hints.append(str(row['likely_profession']))\n",
1141
+ " if pd.notna(row.get('likely_nationality')):\n",
1142
+ " hints.append(str(row['likely_nationality']))\n",
1143
+ " if pd.notna(row.get('likely_country')):\n",
1144
+ " hints.append(str(row['likely_country']))\n",
1145
+ " \n",
1146
+ " # Add tags if we don't have enough hints\n",
1147
+ " if len(hints) < 3:\n",
1148
+ " for i in range(1, 8):\n",
1149
+ " tag_col = f'tag_{i}'\n",
1150
+ " if tag_col in row and pd.notna(row[tag_col]):\n",
1151
+ " tag_val = str(row[tag_col])\n",
1152
+ " if tag_val not in hints:\n",
1153
+ " hints.append(tag_val)\n",
1154
+ " if len(hints) >= 5:\n",
1155
+ " break\n",
1156
+ " \n",
1157
+ " hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
1158
+ " \n",
1159
+ " return f\"\"\"Extract information about '{name}'. \n",
1160
+ "Context hints (DO NOT copy these as professions): {hint_text}\n",
1161
+ "\n",
1162
+ "Respond with EXACTLY 5 numbered lines. Each line must contain ONLY the value, no labels or extra text.\n",
1163
+ "\n",
1164
+ "FORMAT REQUIREMENTS:\n",
1165
+ "1. Full legal name in Western order (first last). VALUE ONLY.\n",
1166
+ "2. Stage names/aliases, comma-separated. If none, write \"None\". VALUE ONLY.\n",
1167
+ "3. Gender: MUST be exactly one word: Male, Female, Other, or Unknown. VALUE ONLY.\n",
1168
+ "4. Professions: Choose up to 3 from this list ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality. Comma-separated. VALUE ONLY.\n",
1169
+ "5. Primary country: Country name only (e.g., \"China\", \"United States\", \"Colombia\"). VALUE ONLY.\n",
1170
+ "\n",
1171
+ "CRITICAL RULES FOR PROFESSIONS (Line 4):\n",
1172
+ "- ONLY use the exact profession categories listed above\n",
1173
+ "- DO NOT use descriptive words like \"sexy\", \"photorealistic\", \"celebrity\"\n",
1174
+ "- DO NOT copy the hint words as professions\n",
1175
+ "- If uncertain about profession, write \"Unknown\"\n",
1176
+ "- Valid professions are ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality\n",
1177
+ "- Actress = actor, streamer = online personality, YouTuber = online personality\n",
1178
+ "\n",
1179
+ "OTHER RULES:\n",
1180
+ "- Use \"Unknown\" when uncertain or for fictional characters\n",
1181
+ "- NO explanatory text, NO labels like \"Gender:\", NO prefixes\n",
1182
+ "- For multi-role people, list up to 3 categories by relevance\"\"\"\n",
1183
+ "\n",
1184
+ "# Create prompts\n",
1185
+ "print(\"\\nCreating prompts...\")\n",
1186
+ "df['prompt'] = df.apply(create_prompt, axis=1)\n",
1187
+ "print(\"✅ Prompts created\")\n",
1188
+ "\n",
1189
+ "@contextmanager\n",
1190
+ "def timeout(duration):\n",
1191
+ " def handler(signum, frame):\n",
1192
+ " raise TimeoutError(\"Operation timed out\")\n",
1193
+ " \n",
1194
+ " # Set the signal handler and alarm\n",
1195
+ " signal.signal(signal.SIGALRM, handler)\n",
1196
+ " signal.alarm(duration)\n",
1197
+ " try:\n",
1198
+ " yield\n",
1199
+ " finally:\n",
1200
+ " signal.alarm(0) # Disable the alarm\n",
1201
+ "\n",
1202
+ "\n",
1203
+ "def query_eurollm_local(prompt: str) -> str:\n",
1204
+ " \"\"\"Query EuroLLM locally via transformers with very low temperature.\"\"\"\n",
1205
+ " try:\n",
1206
+ " # Format as chat message for EuroLLM with strict system prompt\n",
1207
+ " messages = [\n",
1208
+ " {\"role\": \"system\", \"content\": \"You are a data extraction assistant. Respond with exactly 5 numbered lines containing ONLY values. No labels, no explanations, no prefixes. Follow the format precisely.\"},\n",
1209
+ " {\"role\": \"user\", \"content\": prompt}\n",
1210
+ " ]\n",
1211
+ " \n",
1212
+ " # Tokenize\n",
1213
+ " if hasattr(tokenizer, 'apply_chat_template') and tokenizer.chat_template is not None:\n",
1214
+ " text = tokenizer.apply_chat_template(\n",
1215
+ " messages,\n",
1216
+ " tokenize=False,\n",
1217
+ " add_generation_prompt=True\n",
1218
+ " )\n",
1219
+ " else:\n",
1220
+ " # Fallback for models without chat template\n",
1221
+ " text = f\"[INST] {prompt} [/INST]\"\n",
1222
+ " \n",
1223
+ " inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
1224
+ " \n",
1225
+ " # Generate with timeout and very low temperature\n",
1226
+ " with timeout(60):\n",
1227
+ " with torch.no_grad():\n",
1228
+ " outputs = model.generate(\n",
1229
+ " **inputs,\n",
1230
+ " max_new_tokens=100,\n",
1231
+ " temperature=0.01, # Very low temperature for more deterministic outputs\n",
1232
+ " do_sample=True, # Must be True when temperature is set\n",
1233
+ " pad_token_id=tokenizer.eos_token_id\n",
1234
+ " )\n",
1235
+ " \n",
1236
+ " # Decode\n",
1237
+ " generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
1238
+ " response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
1239
+ " \n",
1240
+ " return response.strip()\n",
1241
+ " \n",
1242
+ " except TimeoutError:\n",
1243
+ " print(f\"[ERROR] Generation timed out after 60 seconds\")\n",
1244
+ " return None\n",
1245
+ " except Exception as e:\n",
1246
+ " print(f\"Generation error: {e}\")\n",
1247
+ " import traceback\n",
1248
+ " traceback.print_exc()\n",
1249
+ " return None\n",
1250
+ "\n",
1251
+ " \n",
1252
+ "# === PARSE RESPONSE WITH CLEANING ===\n",
1253
+ "def parse_response(response):\n",
1254
+ " \"\"\"Parse EuroLLM response into structured fields with cleaning.\"\"\"\n",
1255
+ " if not response:\n",
1256
+ " return {\n",
1257
+ " 'full_name': 'Unknown',\n",
1258
+ " 'aliases': 'Unknown',\n",
1259
+ " 'gender': 'Unknown',\n",
1260
+ " 'profession_llm': 'Unknown',\n",
1261
+ " 'country': 'Unknown'\n",
1262
+ " }\n",
1263
+ " \n",
1264
+ " # Valid profession categories\n",
1265
+ " VALID_PROFESSIONS = {\n",
1266
+ " \"actor\", \"adult performer\", \"singer/musician\", \"model\", \n",
1267
+ " \"online personality\", \"public figure\", \"voice actor/asmr\", \n",
1268
+ " \"sports professional\", \"tv personality\"\n",
1269
+ " }\n",
1270
+ " \n",
1271
+ " # Split into lines and clean\n",
1272
+ " lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
1273
+ " \n",
1274
+ " # Initialize with Unknown values\n",
1275
+ " fields = {\n",
1276
+ " 'full_name': 'Unknown',\n",
1277
+ " 'aliases': 'Unknown',\n",
1278
+ " 'gender': 'Unknown',\n",
1279
+ " 'profession_llm': 'Unknown',\n",
1280
+ " 'country': 'Unknown'\n",
1281
+ " }\n",
1282
+ " \n",
1283
+ " # Extract information from each numbered line\n",
1284
+ " for line in lines:\n",
1285
+ " if line.startswith('1.'):\n",
1286
+ " fields['full_name'] = line[2:].strip()\n",
1287
+ " elif line.startswith('2.'):\n",
1288
+ " fields['aliases'] = line[2:].strip()\n",
1289
+ " elif line.startswith('3.'):\n",
1290
+ " # Clean gender field - remove any labels\n",
1291
+ " gender_raw = line[2:].strip()\n",
1292
+ " # Remove common prefixes\n",
1293
+ " gender_raw = re.sub(r'^(Gender:|gender:)\\s*', '', gender_raw, flags=re.IGNORECASE)\n",
1294
+ " # Extract just the gender word\n",
1295
+ " gender_match = re.search(r'\\b(Male|Female|Other|Unknown)\\b', gender_raw, re.IGNORECASE)\n",
1296
+ " fields['gender'] = gender_match.group(1).capitalize() if gender_match else gender_raw\n",
1297
+ " elif line.startswith('4.'):\n",
1298
+ " # Clean and validate profession field\n",
1299
+ " profession_raw = line[2:].strip()\n",
1300
+ " \n",
1301
+ " # Split by comma and validate each profession\n",
1302
+ " professions = [p.strip().lower() for p in profession_raw.split(',')]\n",
1303
+ " valid_profs = []\n",
1304
+ " \n",
1305
+ " for prof in professions:\n",
1306
+ " # Check if it's a valid profession\n",
1307
+ " if prof in VALID_PROFESSIONS:\n",
1308
+ " valid_profs.append(prof)\n",
1309
+ " # Check for common invalid entries\n",
1310
+ " elif prof in ['unknown', '']:\n",
1311
+ " continue\n",
1312
+ " # Reject descriptive words that aren't professions\n",
1313
+ " elif prof in ['sexy', 'photorealistic', 'celebrity', 'famous', 'popular', \n",
1314
+ " 'beautiful', 'attractive', 'hot', 'gorgeous']:\n",
1315
+ " continue\n",
1316
+ " # If it looks like it might be close to a valid profession, keep it\n",
1317
+ " elif any(valid in prof for valid in VALID_PROFESSIONS):\n",
1318
+ " # Try to extract the valid part\n",
1319
+ " for valid in VALID_PROFESSIONS:\n",
1320
+ " if valid in prof:\n",
1321
+ " valid_profs.append(valid)\n",
1322
+ " break\n",
1323
+ " \n",
1324
+ " # Set the cleaned professions or Unknown if none are valid\n",
1325
+ " if valid_profs:\n",
1326
+ " fields['profession_llm'] = ', '.join(valid_profs)\n",
1327
+ " else:\n",
1328
+ " fields['profession_llm'] = 'Unknown'\n",
1329
+ " \n",
1330
+ " elif line.startswith('5.'):\n",
1331
+ " # Clean country field - remove any labels\n",
1332
+ " country_raw = line[2:].strip()\n",
1333
+ " # Remove common prefixes like \"Primary country:\", \"Country:\", etc.\n",
1334
+ " country_raw = re.sub(r'^(Primary\\s+)?(associated\\s+)?country:\\s*', '', country_raw, flags=re.IGNORECASE)\n",
1335
+ " fields['country'] = country_raw\n",
1336
+ " \n",
1337
+ " return fields\n",
1338
+ "\n",
1339
+ "# === PROCESS DATA ===\n",
1340
+ "index_file.parent.mkdir(parents=True, exist_ok=True)\n",
1341
+ "\n",
1342
+ "# Load index\n",
1343
+ "current_index = 0\n",
1344
+ "if index_file.exists():\n",
1345
+ " try:\n",
1346
+ " current_index = int(index_file.read_text().strip())\n",
1347
+ " except:\n",
1348
+ " current_index = 0\n",
1349
+ "\n",
1350
+ "print(f\"Resuming from index {current_index}\")\n",
1351
+ "\n",
1352
+ "start_time = time.time()\n",
1353
+ "\n",
1354
+ "for i in tqdm(range(current_index, len(df)), desc=\"EuroLLM Local\"):\n",
1355
+ "\n",
1356
+ " prompt = df.at[i, \"prompt\"]\n",
1357
+ "\n",
1358
+ " # -------- MODEL QUERY WITH RETRIES --------\n",
1359
+ " response = None\n",
1360
+ " for attempt in range(3):\n",
1361
+ " response = query_eurollm_local(prompt)\n",
1362
+ " \n",
1363
+ " # DEBUG: Print first few responses to see what's happening\n",
1364
+ " if i < 5:\n",
1365
+ " print(f\"\\n=== DEBUG Row {i}, Attempt {attempt+1} ===\")\n",
1366
+ " print(f\"Response length: {len(response) if response else 0}\")\n",
1367
+ " print(f\"Response: {response[:500] if response else 'None'}\")\n",
1368
+ " print(\"=\" * 50)\n",
1369
+ " \n",
1370
+ " # Valid response?\n",
1371
+ " if response and len(response.strip()) > 10:\n",
1372
+ " break\n",
1373
+ " \n",
1374
+ " print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
1375
+ " time.sleep(0.5)\n",
1376
+ "\n",
1377
+ " # If still invalid → DO NOT overwrite previous data\n",
1378
+ " if not response or len(response.strip()) <= 10:\n",
1379
+ " print(f\"❌ Row {i}: failed after retries, not writing, not advancing index\")\n",
1380
+ " continue\n",
1381
+ "\n",
1382
+ " parsed = parse_response(response)\n",
1383
+ "\n",
1384
+ " # DEBUG: Print first few parsed results\n",
1385
+ " if i < 5:\n",
1386
+ " print(f\"\\n=== PARSED Row {i} ===\")\n",
1387
+ " for key, value in parsed.items():\n",
1388
+ " print(f\" {key}: {value}\")\n",
1389
+ " print(\"=\" * 50)\n",
1390
+ "\n",
1391
+ " # Additional safety: skip rows that parsed as all 'Unknown'\n",
1392
+ " if all(v == \"Unknown\" for v in parsed.values()):\n",
1393
+ " print(f\"❌ Row {i}: parsed as all Unknown (likely model crash); skipping.\")\n",
1394
+ " continue\n",
1395
+ "\n",
1396
+ " # -------- WRITE PARSED FIELDS SAFELY --------\n",
1397
+ " for key, value in parsed.items():\n",
1398
+ " df.at[i, key] = value\n",
1399
+ "\n",
1400
+ " # Advance progress ONLY after successful write\n",
1401
+ " current_index = i + 1\n",
1402
+ "\n",
1403
+ " # -------- GPU MEMORY CLEANUP --------\n",
1404
+ " if torch.cuda.is_available():\n",
1405
+ " torch.cuda.empty_cache()\n",
1406
+ " torch.cuda.synchronize()\n",
1407
+ "\n",
1408
+ " # -------- SAVE LIKE YOUR DEEPSEEK VERSION --------\n",
1409
+ " if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
1410
+ " df.to_csv(output_file, index=False)\n",
1411
+ " with open(index_file, \"w\") as f:\n",
1412
+ " f.write(str(current_index))\n",
1413
+ " print(f\"💾 Progress saved after row {i+1}\")\n",
1414
+ "\n",
1415
+ "# Final save\n",
1416
+ "df.to_csv(output_file, index=False)\n",
1417
+ "index_file.write_text(str(current_index))\n",
1418
+ "print(\"✅ Finished full dataset.\")"
1419
+ ]
1420
+ },
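+ {
+ "cell_type": "markdown",
+ "id": "d1e2f3a4",
+ "metadata": {},
+ "source": [
+ "Quick sanity check for `parse_response` (illustrative only; the sample response string below is hypothetical, not real model output)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e5f6a7b8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Hypothetical 5-line response in the expected numbered format.\n",
+ "sample = \"1. Jane Doe\\n2. None\\n3. Female\\n4. actor, model\\n5. United States\"\n",
+ "for key, value in parse_response(sample).items():\n",
+ "    print(f\"{key}: {value}\")\n"
+ ]
+ },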
1421
+ {
1422
+ "cell_type": "markdown",
1423
+ "id": "472e5ac2-ec04-4bfa-8a67-116277238c15",
1424
+ "metadata": {},
1425
+ "source": [
1426
+ "## Mistral 24b instruct"
1427
+ ]
1428
+ },
1429
+ {
1430
+ "cell_type": "code",
1431
+ "execution_count": null,
1432
+ "id": "a55a5e30-83f3-4f7c-a537-b1216d4e8a07",
1433
+ "metadata": {
1434
+ "execution": {
1435
+ "iopub.execute_input": "2025-12-09T22:16:21.002786Z",
1436
+ "iopub.status.busy": "2025-12-09T22:16:21.002337Z"
1437
+ }
1438
+ },
1439
+ "outputs": [
1440
+ {
1441
+ "name": "stderr",
1442
+ "output_type": "stream",
1443
+ "text": [
1444
+ "/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
1445
+ " from .autonotebook import tqdm as notebook_tqdm\n"
1446
+ ]
1447
+ },
1448
+ {
1449
+ "name": "stdout",
1450
+ "output_type": "stream",
1451
+ "text": [
1452
+ "Loading model: mistralai/Mistral-Small-Instruct-2409\n",
1453
+ "Cache directory: /shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/data/models\n",
1454
+ "This may take a while on first run (~65GB download)...\n",
1455
+ "\n",
1456
+ "Device: cuda\n",
1457
+ "Loading tokenizer...\n",
1458
+ "✅ Tokenizer loaded\n",
1459
+ "Loading model (this may take several minutes)...\n"
1460
+ ]
1461
+ },
1462
+ {
1463
+ "name": "stderr",
1464
+ "output_type": "stream",
1465
+ "text": [
1466
+ "`torch_dtype` is deprecated! Use `dtype` instead!\n",
1467
+ "Loading checkpoint shards: 100%|██████████| 9/9 [02:42<00:00, 18.06s/it]\n"
1468
+ ]
1469
+ },
1470
+ {
1471
+ "name": "stdout",
1472
+ "output_type": "stream",
1473
+ "text": [
1474
+ "✅ Model loaded\n",
1475
+ "VRAM used: 21.40 GB\n",
1476
+ "\n",
1477
+ "Loading raw input CSV...\n",
1478
+ "Loaded 50861 rows from input file\n",
1479
+ "Found existing annotations, merging...\n"
1480
+ ]
1481
+ },
1482
+ {
1483
+ "name": "stderr",
1484
+ "output_type": "stream",
1485
+ "text": [
1486
+ "/tmp/ipykernel_3104208/1997558719.py:113: DtypeWarning: Columns (52,53,54,55,56) have mixed types. Specify dtype option on import or set low_memory=False.\n",
1487
+ " existing_df = pd.read_csv(output_file)\n"
1488
+ ]
1489
+ },
1490
+ {
1491
+ "name": "stdout",
1492
+ "output_type": "stream",
1493
+ "text": [
1494
+ "Existing annotations has 50861 rows\n",
1495
+ "Merged annotations, continuing with 50861 total rows\n",
1496
+ "✅ Loaded professions.csv\n",
1497
+ "✅ Loaded profession mapping with 9 categories\n",
1498
+ "Loaded 50861 rows\n",
1499
+ "\n",
1500
+ "Profession categories (9):\n",
1501
+ " - actor\n",
1502
+ " - adult performer\n",
1503
+ " - singer/musician\n",
1504
+ " - model\n",
1505
+ " - online personality\n",
1506
+ " - public figure\n",
1507
+ " - voice actor/ASMR\n",
1508
+ " - sports professional\n",
1509
+ " - tv personality\n",
1510
+ "\n",
1511
+ "Creating prompts...\n",
1512
+ "✅ Prompts created\n",
1513
+ "Resuming from index 8810\n"
1514
+ ]
1515
+ },
1516
+ {
1517
+ "name": "stderr",
1518
+ "output_type": "stream",
1519
+ "text": [
1520
+ "Mistral Local: 0%| | 0/42051 [00:00<?, ?it/s]/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py:181: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization\n",
1521
+ " warnings.warn(f\"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization\")\n",
1522
+ "Mistral Local: 0%| | 7/42051 [00:57<93:01:03, 7.96s/it] "
1523
+ ]
1524
+ }
1525
+ ],
1526
+ "source": [
1527
+ "import pandas as pd\n",
1528
+ "import json\n",
1529
+ "import time\n",
1530
+ "import re\n",
1531
+ "from pathlib import Path\n",
1532
+ "from tqdm import tqdm\n",
1533
+ "import torch\n",
1534
+ "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
1535
+ "\n",
1536
+ "current_dir = Path.cwd()\n",
1537
+ "input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
1538
+ "professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
1539
+ "professions_mapped_file = current_dir.parent / \"misc/lists/professions_mapped.csv\"\n",
1540
+ "# === PROCESS DATA ===\n",
1541
+ "\n",
1542
+ "\n",
1543
+ "# === CONFIGURATION ===\n",
1544
+ "TEST_MODE = False\n",
1545
+ "TEST_SIZE = 100\n",
1546
+ "MAX_ROWS = 50862\n",
1547
+ "SAVE_INTERVAL = 10\n",
1548
+ "\n",
1549
+ "output_file = current_dir.parent / f\"data/CSV/mistral24_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
1550
+ "index_file = current_dir.parent / \"misc/query_indicies/mistral24_local_query_index.txt\"\n",
1551
+ "\n",
1552
+ "\n",
1553
+ "# Model settings\n",
1554
+ "#MODEL_NAME = \"mistralai/Mistral-Small-3.1-24B-Instruct-2503\"\n",
1555
+ "MODEL_NAME = \"mistralai/Mistral-Small-Instruct-2409\"\n",
1556
+ "#MODEL_NAME = \"mistralai/Mistral-7B-Instruct-v0.3\"\n",
1557
+ "CACHE_DIR = current_dir.parent / \"data/models\"\n",
1558
+ "CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
1559
+ "\n",
1560
+ "# Define the SPECIFIC profession categories\n",
1561
+ "PROFESSION_CATEGORIES = [\n",
1562
+ " \"actor\",\n",
1563
+ " \"adult performer\",\n",
1564
+ " \"singer/musician\",\n",
1565
+ " \"model\",\n",
1566
+ " \"online personality\",\n",
1567
+ " \"public figure\",\n",
1568
+ " \"voice actor/ASMR\",\n",
1569
+ " \"sports professional\",\n",
1570
+ " \"tv personality\"\n",
1571
+ "]\n",
1572
+ "\n",
1573
+ "# === LOAD MODEL ===\n",
1574
+ "print(f\"Loading model: {MODEL_NAME}\")\n",
1575
+ "print(f\"Cache directory: {CACHE_DIR}\")\n",
1576
+ "print(f\"This may take a while on first run (~65GB download)...\\n\")\n",
1577
+ "\n",
1578
+ "# Check GPU availability\n",
1579
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
1580
+ "print(f\"Device: {device}\")\n",
1581
+ "\n",
1582
+ "if device == \"cpu\":\n",
1583
+ " print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
1584
+ " print(\" Consider using a GPU or reducing model size.\")\n",
1585
+ "\n",
1586
+ "# Load tokenizer\n",
1587
+ "print(\"Loading tokenizer...\")\n",
1588
+ "try:\n",
1589
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
1590
+ " MODEL_NAME,\n",
1591
+ " cache_dir=str(CACHE_DIR),\n",
1592
+ " use_fast=True\n",
1593
+ " )\n",
1594
+ "except Exception as e:\n",
1595
+ " print(f\"Failed with use_fast=True, trying use_fast=False...\")\n",
1596
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
1597
+ " MODEL_NAME,\n",
1598
+ " cache_dir=str(CACHE_DIR),\n",
1599
+ " use_fast=False\n",
1600
+ " )\n",
1601
+ "\n",
1602
+ "# Ensure pad token is set\n",
1603
+ "if tokenizer.pad_token is None:\n",
1604
+ " tokenizer.pad_token = tokenizer.eos_token\n",
1605
+ "\n",
1606
+ "print(\"✅ Tokenizer loaded\")\n",
1607
+ "\n",
1608
+ "quantization_config = BitsAndBytesConfig(\n",
1609
+ " load_in_8bit=True\n",
1610
+ ")\n",
1611
+ "\n",
1612
+ "\n",
1613
+ "# Load model with optimizations\n",
1614
+ "print(\"Loading model (this may take several minutes)...\")\n",
1615
+ "model = AutoModelForCausalLM.from_pretrained(\n",
1616
+ " MODEL_NAME,\n",
1617
+ " cache_dir=str(CACHE_DIR),\n",
1618
+ " torch_dtype=torch.bfloat16,\n",
1619
+ " quantization_config=quantization_config,\n",
1620
+ " device_map=\"auto\",\n",
1621
+ " trust_remote_code=False\n",
1622
+ ")\n",
1623
+ "model.eval()\n",
1624
+ "print(\"✅ Model loaded\")\n",
1625
+ "\n",
1626
+ "# Check VRAM usage\n",
1627
+ "if torch.cuda.is_available():\n",
1628
+ " vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
1629
+ " print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
1630
+ "\n",
1631
+ "# === LOAD DATA ===\n",
1632
+ "print(\"Loading raw input CSV...\")\n",
1633
+ "df = pd.read_csv(input_file) # ALWAYS load the full input\n",
1634
+ "print(f\"Loaded {len(df)} rows from input file\")\n",
1635
+ "\n",
1636
+ "# If we have previous annotations, merge them\n",
1637
+ "if output_file.exists():\n",
1638
+ " print(\"Found existing annotations, merging...\")\n",
1639
+ " existing_df = pd.read_csv(output_file)\n",
1640
+ " print(f\"Existing annotations has {len(existing_df)} rows\")\n",
1641
+ " \n",
1642
+ " # Update df with existing annotations\n",
1643
+ " # Only update the columns that were annotated\n",
1644
+ " annotation_cols = ['full_name', 'aliases', 'gender', 'profession_llm', 'country']\n",
1645
+ " for col in annotation_cols:\n",
1646
+ " if col in existing_df.columns:\n",
1647
+ " df[col] = existing_df[col][:len(df)] # Make sure we don't exceed df length\n",
1648
+ " \n",
1649
+ " print(f\"Merged annotations, continuing with {len(df)} total rows\")\n",
1650
+ "\n",
1651
+ "\n",
1652
+ "# Try to load profession mapping files\n",
1653
+ "try:\n",
1654
+ " professions_df = pd.read_csv(professions_file)\n",
1655
+ " print(f\"✅ Loaded professions.csv\")\n",
1656
+ "except:\n",
1657
+ " print(\"⚠️ Warning: professions.csv not found\")\n",
1658
+ "\n",
1659
+ "try:\n",
1660
+ " prof_mapped_df = pd.read_csv(professions_mapped_file)\n",
1661
+ " print(f\"✅ Loaded profession mapping with {len(prof_mapped_df)} categories\")\n",
1662
+ "except:\n",
1663
+ " print(\"⚠️ Warning: professions_mapped.csv not found, using default categories\")\n",
1664
+ "\n",
1665
+ "profession_str = \", \".join(PROFESSION_CATEGORIES)\n",
1666
+ "\n",
1667
+ "print(f\"Loaded {len(df)} rows\")\n",
1668
+ "print(f\"\\nProfession categories ({len(PROFESSION_CATEGORIES)}):\")\n",
1669
+ "for cat in PROFESSION_CATEGORIES:\n",
1670
+ " print(f\" - {cat}\")\n",
1671
+ "\n",
1672
+ "if TEST_MODE:\n",
1673
+ " print(f\"\\nRunning in TEST MODE with {TEST_SIZE} samples\")\n",
1674
+ " df = df.head(TEST_SIZE).copy()\n",
1675
+ "elif MAX_ROWS:\n",
1676
+ " df = df.head(MAX_ROWS).copy()\n",
1677
+ "\n",
1678
+ "# === CREATE PROMPTS (DEEPSEEK STYLE) ===\n",
1679
+ "def create_prompt(row):\n",
1680
+ " \"\"\"Create prompt for Mistral annotation with specific profession categories.\"\"\"\n",
1681
+ " name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
1682
+ " \n",
1683
+ " # Gather hints\n",
1684
+ " hints = []\n",
1685
+ " if pd.notna(row.get('likely_profession')):\n",
1686
+ " hints.append(str(row['likely_profession']))\n",
1687
+ " if pd.notna(row.get('likely_nationality')):\n",
1688
+ " hints.append(str(row['likely_nationality']))\n",
1689
+ " if pd.notna(row.get('likely_country')):\n",
1690
+ " hints.append(str(row['likely_country']))\n",
1691
+ " \n",
1692
+ " # Add tags if we don't have enough hints\n",
1693
+ " if len(hints) < 3:\n",
1694
+ " for i in range(1, 8):\n",
1695
+ " tag_col = f'tag_{i}'\n",
1696
+ " if tag_col in row and pd.notna(row[tag_col]):\n",
1697
+ " tag_val = str(row[tag_col])\n",
1698
+ " if tag_val not in hints:\n",
1699
+ " hints.append(tag_val)\n",
1700
+ " if len(hints) >= 5:\n",
1701
+ " break\n",
1702
+ " \n",
1703
+ " hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
1704
+ " \n",
1705
+ " return f\"\"\"Given '{name}' ({hint_text}), provide:\n",
1706
+ "1. Full legal name (Western order if non-latin script)\n",
1707
+ "2. Any stage names/aliases (comma separated)\n",
1708
+ "3. Gender (Male/Female/Other/Unknown)\n",
1709
+ "4. Top 3 most likely professions from ONLY these categories:\n",
1710
+ " - actor\n",
1711
+ " - adult performer\n",
1712
+ " - singer/musician\n",
1713
+ " - model\n",
1714
+ " - online personality (includes streamers, cosplayers, influencers)\n",
1715
+ " - public figure (includes politicians, activists, journalists, authors)\n",
1716
+ " - voice actor/ASMR\n",
1717
+ " - sports professional\n",
1718
+ " - tv personality (includes hosts, presenters, reality TV)\n",
1719
+ "\n",
1720
+ "5. Primary country associated\n",
1721
+ "\n",
1722
+ "IMPORTANT:\n",
1723
+ "- Choose professions ONLY from the 9 categories above\n",
1724
+ "- Provide up to 3 professions, comma-separated, ordered by relevance\n",
1725
+ "- Be SPECIFIC: choose the most accurate category for each role\n",
1726
+ "- \"online personality\" includes: streamers, cosplayers, YouTubers, influencers, content creators\n",
1727
+ "- Use 'Unknown' when uncertain or for fictional characters/places\n",
1728
+ "- For multi-role people, list all relevant categories (e.g., \"actor, singer/musician, online personality\")\n",
1729
+ "- For country respond with one word only, for example China or Columbia\n",
1730
+ "- actress = actor\n",
1731
+ "\n",
1732
+ "Respond with exactly 5 numbered lines.\"\"\"\n",
1733
+ "\n",
1734
+ "# Create prompts\n",
1735
+ "print(\"\\nCreating prompts...\")\n",
1736
+ "df['prompt'] = df.apply(create_prompt, axis=1)\n",
1737
+ "print(\"✅ Prompts created\")\n",
1738
+ "\n",
1739
+ "# === QUERY MISTRAL LOCAL ===\n",
1740
+ "def query_mistral_local(prompt: str) -> str:\n",
1741
+ " \"\"\"Query Mistral locally via transformers.\"\"\"\n",
1742
+ " try:\n",
1743
+ " # Format as chat message for Mistral\n",
1744
+ " messages = [\n",
1745
+ " {\"role\": \"system\", \"content\": \"You are an assistant that extracts key data on a person based on the name. Respond with exactly 5 numbered lines. For professions, choose ONLY from these categories: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality.\"},\n",
1746
+ " {\"role\": \"user\", \"content\": prompt}\n",
1747
+ " ]\n",
1748
+ " \n",
1749
+ " # Tokenize\n",
1750
+ " if hasattr(tokenizer, 'apply_chat_template'):\n",
1751
+ " text = tokenizer.apply_chat_template(\n",
1752
+ " messages,\n",
1753
+ " tokenize=False,\n",
1754
+ " add_generation_prompt=True\n",
1755
+ " )\n",
1756
+ " else:\n",
1757
+ " # Fallback for older tokenizers\n",
1758
+ " text = f\"[INST] {prompt} [/INST]\"\n",
1759
+ " \n",
1760
+ " inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
1761
+ " \n",
1762
+ " # Generate\n",
1763
+ " with torch.no_grad():\n",
1764
+ " outputs = model.generate(\n",
1765
+ " **inputs,\n",
1766
+ " max_new_tokens=512,\n",
1767
+ " temperature=0.05,\n",
1768
+ " do_sample=True,\n",
1769
+ " top_p=0.8,\n",
1770
+ " pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id else tokenizer.eos_token_id\n",
1771
+ " )\n",
1772
+ " \n",
1773
+ " # Decode\n",
1774
+ " generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
1775
+ " response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
1776
+ " \n",
1777
+ " return response.strip()\n",
1778
+ " \n",
1779
+ " except Exception as e:\n",
1780
+ " print(f\"Generation error: {e}\")\n",
1781
+ " return None\n",
1782
+ "\n",
1783
+ "# === PARSE RESPONSE (DEEPSEEK STYLE) ===\n",
1784
+ "def parse_response(response):\n",
1785
+ " \"\"\"Parse Mistral response into structured fields.\"\"\"\n",
1786
+ " if not response:\n",
1787
+ " return {\n",
1788
+ " 'full_name': 'Unknown',\n",
1789
+ " 'aliases': 'Unknown',\n",
1790
+ " 'gender': 'Unknown',\n",
1791
+ " 'profession_llm': 'Unknown',\n",
1792
+ " 'country': 'Unknown'\n",
1793
+ " }\n",
1794
+ " \n",
1795
+ " # Split into lines and clean\n",
1796
+ " lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
1797
+ " \n",
1798
+ " # Initialize with Unknown values\n",
1799
+ " fields = {\n",
1800
+ " 'full_name': 'Unknown',\n",
1801
+ " 'aliases': 'Unknown',\n",
1802
+ " 'gender': 'Unknown',\n",
1803
+ " 'profession_llm': 'Unknown',\n",
1804
+ " 'country': 'Unknown'\n",
1805
+ " }\n",
1806
+ " \n",
1807
+ " # Extract information from each numbered line\n",
1808
+ " for line in lines:\n",
1809
+ " if line.startswith('1.'):\n",
1810
+ " fields['full_name'] = line[2:].strip()\n",
1811
+ " elif line.startswith('2.'):\n",
1812
+ " fields['aliases'] = line[2:].strip()\n",
1813
+ " elif line.startswith('3.'):\n",
1814
+ " fields['gender'] = line[2:].strip()\n",
1815
+ " elif line.startswith('4.'):\n",
1816
+ " fields['profession_llm'] = line[2:].strip()\n",
1817
+ " elif line.startswith('5.'):\n",
1818
+ " fields['country'] = line[2:].strip()\n",
1819
+ " \n",
1820
+ " return fields\n",
1821
+ "\n",
1822
+ "# === PROCESS DATA ===\n",
1823
+ "output_file = current_dir.parent / f\"data/CSV/mistral24_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
1824
+ "index_file = current_dir.parent / \"misc/query_indicies/mistral24_local_query_index.txt\"\n",
1825
+ "\n",
1826
+ "index_file.parent.mkdir(parents=True, exist_ok=True)\n",
1827
+ "\n",
1828
+ "# Load index\n",
1829
+ "current_index = 0\n",
1830
+ "if index_file.exists():\n",
1831
+ " try:\n",
1832
+ " current_index = int(index_file.read_text().strip())\n",
1833
+ " except:\n",
1834
+ " current_index = 0\n",
1835
+ "\n",
1836
+ "print(f\"Resuming from index {current_index}\")\n",
1837
+ "\n",
1838
+ "start_time = time.time()\n",
1839
+ "\n",
1840
+ "for i in tqdm(range(current_index, len(df)), desc=\"Mistral Local\"):\n",
1841
+ "\n",
1842
+ " prompt = df.at[i, \"prompt\"]\n",
1843
+ "\n",
1844
+ " # -------- MODEL QUERY WITH RETRIES --------\n",
1845
+ " response = None\n",
1846
+ " for attempt in range(3):\n",
1847
+ " response = query_mistral_local(prompt)\n",
1848
+ " \n",
1849
+ " # Valid response?\n",
1850
+ " if response and len(response.strip()) > 10:\n",
1851
+ " break\n",
1852
+ " \n",
1853
+ " print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
1854
+ " time.sleep(0.5)\n",
1855
+ "\n",
1856
+ " # If still invalid → DO NOT overwrite previous data\n",
1857
+ " if not response or len(response.strip()) <= 10:\n",
1858
+ " print(f\"❌ Row {i}: failed after retries, not writing, not advancing index\")\n",
1859
+ " continue\n",
1860
+ "\n",
1861
+ " parsed = parse_response(response)\n",
1862
+ "\n",
1863
+ " # Additional safety: skip rows that parsed as all 'Unknown'\n",
1864
+ " if all(v == \"Unknown\" for v in parsed.values()):\n",
1865
+ " print(f\"❌ Row {i}: parsed as all Unknown (likely model crash); skipping.\")\n",
1866
+ " continue\n",
1867
+ "\n",
1868
+ " # -------- WRITE PARSED FIELDS SAFELY --------\n",
1869
+ " for key, value in parsed.items():\n",
1870
+ " df.at[i, key] = value\n",
1871
+ "\n",
1872
+ " # Advance progress ONLY after successful write\n",
1873
+ " current_index = i + 1\n",
1874
+ "\n",
1875
+ " # -------- GPU MEMORY CLEANUP --------\n",
1876
+ " if torch.cuda.is_available():\n",
1877
+ " torch.cuda.empty_cache()\n",
1878
+ " torch.cuda.synchronize()\n",
1879
+ "\n",
1880
+ " # -------- SAVE LIKE YOUR DEEPSEEK VERSION --------\n",
1881
+ " if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
1882
+ " df.to_csv(output_file, index=False)\n",
1883
+ " with open(index_file, \"w\") as f:\n",
1884
+ " f.write(str(current_index))\n",
1885
+ " print(f\"💾 Progress saved after row {i+1}\")\n",
1886
+ "\n",
1887
+ "# Final save\n",
1888
+ "df.to_csv(output_file, index=False)\n",
1889
+ "index_file.write_text(str(current_index))\n",
1890
+ "print(\"✅ Finished full dataset.\")\n"
1891
+ ]
1892
+ },
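+ {
+ "cell_type": "markdown",
+ "id": "f1a2b3c4",
+ "metadata": {},
+ "source": [
+ "If a run is interrupted, the index file controls where the loop resumes. A minimal helper to inspect (and optionally reset) it; resetting is illustrative, not part of the original pipeline."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a9b8c7d6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Inspect the resume index used by the annotation loop above.\n",
+ "if index_file.exists():\n",
+ "    print(f\"Current resume index: {index_file.read_text().strip()}\")\n",
+ "else:\n",
+ "    print(\"No index file yet; annotation will start at row 0\")\n",
+ "# index_file.write_text(\"0\")  # uncomment to restart from scratch\n"
+ ]
+ },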
1893
+ {
1894
+ "cell_type": "code",
1895
+ "execution_count": null,
1896
+ "id": "d7212e75-0ff6-45a0-8695-c4a3d3e02818",
1897
+ "metadata": {},
1898
+ "outputs": [],
1899
+ "source": [
1900
+ "import transformers\n",
1901
+ "print(f\"Transformers version: {transformers.__version__}\")\n",
1902
+ "\n",
1903
+ "# Check if Mistral3 is available\n",
1904
+ "try:\n",
1905
+ " from transformers import Mistral3ForCausalLM\n",
1906
+ " print(\"✅ Mistral3 is available\")\n",
1907
+ "except ImportError:\n",
1908
+ " print(\"❌ Mistral3 not available in this transformers version\")"
1909
+ ]
1910
+ },
1911
+ {
1912
+ "cell_type": "code",
1913
+ "execution_count": null,
1914
+ "id": "a6ab032e-246e-4c4e-9776-ff0bfbf6fd9c",
1915
+ "metadata": {},
1916
+ "outputs": [],
1917
+ "source": []
1918
+ }
1919
+ ],
1920
+ "metadata": {
1921
+ "kernelspec": {
1922
+ "display_name": "pm-paper",
1923
+ "language": "python",
1924
+ "name": "pm-paper"
1925
+ },
1926
+ "language_info": {
1927
+ "codemirror_mode": {
1928
+ "name": "ipython",
1929
+ "version": 3
1930
+ },
1931
+ "file_extension": ".py",
1932
+ "mimetype": "text/x-python",
1933
+ "name": "python",
1934
+ "nbconvert_exporter": "python",
1935
+ "pygments_lexer": "ipython3",
1936
+ "version": "3.11.13"
1937
+ }
1938
+ },
1939
+ "nbformat": 4,
1940
+ "nbformat_minor": 5
1941
+ }
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_Step_2_response_comparison_and_consensus_extraction-checkpoint.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_deepfake_adapters-Copy1-checkpoint.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_deepfake_adapters-checkpoint.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4__Figure_8_Deepfake_victims-checkpoint.ipynb ADDED
@@ -0,0 +1,668 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "06763fde",
6
+ "metadata": {},
7
+ "source": [
8
+ "# LLM annotation of Deepfake adapters"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "markdown",
13
+ "id": "b773045a",
14
+ "metadata": {},
15
+ "source": [
16
+ "## Step 01 Data cleaning NER using spaCy"
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "markdown",
21
+ "id": "234a55e5",
22
+ "metadata": {},
23
+ "source": [
24
+ "#### Here we clean leetspeak and architectre specifiers from the names as a preprossesing step for the named entity recognition NER below"
25
+ ]
26
+ },
27
+ {
28
+ "cell_type": "code",
29
+ "execution_count": 1,
30
+ "id": "f177df11",
31
+ "metadata": {},
32
+ "outputs": [],
33
+ "source": [
34
+ "import pandas as pd\n",
35
+ "import spacy\n",
36
+ "import re\n",
37
+ "import torch\n",
38
+ "from pathlib import Path\n",
39
+ "import unicodedata\n"
40
+ ]
41
+ },
42
+ {
43
+ "cell_type": "code",
44
+ "execution_count": 2,
45
+ "id": "a2383f32",
46
+ "metadata": {},
47
+ "outputs": [],
48
+ "source": [
49
+ "current_dir = Path.cwd()\n",
50
+ "poi_models_dir = current_dir.parent / \"data/CSV/model_subsets/POI_models.csv\" ### POI models dataset\n",
51
+ "output = current_dir.parent / \"data/CSV/model_subsets/NER_poi_step_01.csv\" ### Output file"
52
+ ]
53
+ },
54
+ {
55
+ "cell_type": "code",
56
+ "execution_count": 3,
57
+ "id": "66fc691f",
58
+ "metadata": {},
59
+ "outputs": [
60
+ {
61
+ "name": "stderr",
62
+ "output_type": "stream",
63
+ "text": [
64
+ "/home/lauhp/anaconda3/envs/latm/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
65
+ " from .autonotebook import tqdm as notebook_tqdm\n"
66
+ ]
67
+ }
68
+ ],
69
+ "source": [
70
+ "nlp = spacy.load(\"en_core_web_sm\") # or another model of your choice"
71
+ ]
72
+ },
73
+ {
74
+ "cell_type": "code",
75
+ "execution_count": 4,
76
+ "id": "8348c4a4",
77
+ "metadata": {},
78
+ "outputs": [
79
+ {
80
+ "name": "stdout",
81
+ "output_type": "stream",
82
+ "text": [
83
+ "Done! Saved to NER_poi_step_01.csv\n"
84
+ ]
85
+ }
86
+ ],
87
+ "source": [
88
+ "def preprocess_name(name):\n",
89
+ " name = str(name)\n",
90
+ "\n",
91
+ " # Normalize unicode characters (e.g., fancy fonts)\n",
92
+ " name = unicodedata.normalize(\"NFKD\", name)\n",
93
+ "\n",
94
+ " # Lowercase everything\n",
95
+ " name = name.lower()\n",
96
+ "\n",
97
+ " # Remove special keywords and patterns\n",
98
+ " junk_words = [\n",
99
+ " 'jav', 'jp', 'lora', 'locon', 'lycoris', 'requested', 'japanese', 'model',\n",
100
+ " 'flux', 'flux1.d', 'pony', 'realistic'\n",
101
+ " ]\n",
102
+ " for word in junk_words:\n",
103
+ " name = re.sub(rf'\\b{re.escape(word)}\\b', '', name, flags=re.IGNORECASE)\n",
104
+ "\n",
105
+ " # Remove versions like v1, v2.0, etc.\n",
106
+ " name = re.sub(r'v\\.?\\d+(\\.\\d+)?', '', name)\n",
107
+ "\n",
108
+ " # Remove 'not' followed by a word\n",
109
+ " name = re.sub(r'\\bnot\\s+\\w+', '', name)\n",
110
+ "\n",
111
+ " # Replace underscores and pipes with spaces\n",
112
+ " name = re.sub(r'[_|]', ' ', name)\n",
113
+ "\n",
114
+ " # Remove parentheses and content within\n",
115
+ " name = re.sub(r'\\(.*?\\)', '', name)\n",
116
+ "\n",
117
+ " # Remove excess whitespace\n",
118
+ " name = re.sub(r'\\s+', ' ', name).strip()\n",
119
+ "\n",
120
+ " return name\n",
121
+ "\n",
122
+ "# -------------------------\n",
123
+ "# Fallback Extractor\n",
124
+ "# -------------------------\n",
125
+ "def fallback_extract(text):\n",
126
+ " words = text.split()\n",
127
+ " capitalized = [w for w in words if w and w[0].isalpha()]\n",
128
+ " if len(capitalized) >= 2:\n",
129
+ " return \" \".join(capitalized[:2])\n",
130
+ " elif capitalized:\n",
131
+ " return capitalized[0]\n",
132
+ " return None\n",
133
+ "\n",
134
+ "# -------------------------\n",
135
+ "# Full Extraction Logic\n",
136
+ "# -------------------------\n",
137
+ "def extract_real_name(raw_name):\n",
138
+ " cleaned = preprocess_name(raw_name)\n",
139
+ " doc = nlp(cleaned)\n",
140
+ " persons = [ent.text for ent in doc.ents if ent.label_ == \"PERSON\"]\n",
141
+ " if persons:\n",
142
+ " return persons[0]\n",
143
+ " return fallback_extract(cleaned)\n",
144
+ "\n",
145
+ "# -------------------------\n",
146
+ "# Load Data and Apply\n",
147
+ "# -------------------------\n",
148
+ "df = pd.read_csv(poi_models_dir) # Or your own path\n",
149
+ "\n",
150
+ "# Apply the full extractor\n",
151
+ "texts = df['name'].astype(str).tolist()\n",
152
+ "docs = list(nlp.pipe([preprocess_name(t) for t in texts], batch_size=32))\n",
153
+ "\n",
154
+ "def extract_from_doc(doc, raw_text):\n",
155
+ " persons = [ent.text for ent in doc.ents if ent.label_ == \"PERSON\"]\n",
156
+ " if persons:\n",
157
+ " return persons[0]\n",
158
+ " return fallback_extract(preprocess_name(raw_text))\n",
159
+ "\n",
160
+ "df['real_name'] = [extract_from_doc(doc, raw_text) for doc, raw_text in zip(docs, texts)]\n",
161
+ "\n",
162
+ "# Save the result\n",
163
+ "df.to_csv(output, index=False)\n",
164
+ "print(\"Done! Saved to NER_poi_step_01.csv\")\n"
165
+ ]
166
+ },
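+ {
+ "cell_type": "markdown",
+ "id": "b2c3d4e5",
+ "metadata": {},
+ "source": [
+ "A quick illustration of the cleaning rules (the adapter name below is made up):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c4d5e6f7",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Made-up adapter name: junk word, version tag, underscores and parentheses are stripped.\n",
+ "print(preprocess_name(\"Sofia_Vergara LoRA v2.0 (flux)\"))  # -> 'sofia vergara'\n"
+ ]
+ },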
167
+ {
168
+ "cell_type": "markdown",
169
+ "id": "a40b17fe",
170
+ "metadata": {},
171
+ "source": [
172
+ "## Step 2: compare country and profession with lists"
173
+ ]
174
+ },
175
+ {
176
+ "cell_type": "code",
177
+ "execution_count": 5,
178
+ "id": "414954ed",
179
+ "metadata": {},
180
+ "outputs": [
181
+ {
182
+ "name": "stdout",
183
+ "output_type": "stream",
184
+ "text": [
185
+ " real_name likely_country likely_nationality likely_profession\n",
186
+ "0 ronnie alonte model\n",
187
+ "1 zh elena kamperi twitch streamer\n",
188
+ "2 sofia vergara model\n",
189
+ "3 安然 anran celebrity\n",
190
+ "4 zoe kravitz celebrity\n"
191
+ ]
192
+ }
193
+ ],
194
+ "source": [
195
+ "import pandas as pd\n",
196
+ "from pathlib import Path\n",
197
+ "\n",
198
+ "# Set up paths\n",
199
+ "current_dir = Path.cwd()\n",
200
+ "countries = current_dir.parent / \"misc/lists/countries.csv\"\n",
201
+ "professions = current_dir.parent / \"misc/lists/professions.csv\"\n",
202
+ "inputNER = current_dir.parent / \"data/CSV/model_subsets/NER_poi_step_01.csv\"\n",
203
+ "outfile = current_dir.parent / \"data/CSV/model_subsets/NER_poi_step_02.csv\"\n",
204
+ "\n",
205
+ "# Load datasets\n",
206
+ "poi_df = pd.read_csv(inputNER)\n",
207
+ "countries_df = pd.read_csv(countries)\n",
208
+ "professions_df = pd.read_csv(professions)\n",
209
+ "\n",
210
+ "# Step 1: Combine tags into one lowercase list\n",
211
+ "def combine_tags(row):\n",
212
+ " return [str(row[f\"tag_{i}\"]).strip().lower() for i in range(1, 8) if pd.notna(row.get(f\"tag_{i}\"))]\n",
213
+ "\n",
214
+ "poi_df[\"tags\"] = poi_df.apply(combine_tags, axis=1)\n",
215
+ "\n",
216
+ "# Step 2: Build tag → (country, nationality) mapping\n",
217
+ "tag_to_country_nationality = {}\n",
218
+ "\n",
219
+ "for _, row in countries_df.iterrows():\n",
220
+ " country = str(row[\"en_short_name\"]).strip()\n",
221
+ " nationality = str(row[\"nationality\"]).strip()\n",
222
+ "\n",
223
+ " country_lc = country.lower()\n",
224
+ " nationality_lc = nationality.lower()\n",
225
+ "\n",
226
+ " # Add variations to the mapping\n",
227
+ " tag_to_country_nationality[country_lc] = (country, \"\")\n",
228
+ " tag_to_country_nationality[nationality_lc] = (\"\", nationality)\n",
229
+ " tag_to_country_nationality[country_lc.replace(\" \", \"\")] = (country, \"\")\n",
230
+ " tag_to_country_nationality[nationality_lc.replace(\" \", \"\")] = (\"\", nationality)\n",
231
+ "\n",
232
+ " for part in country_lc.split():\n",
233
+ " tag_to_country_nationality[part] = (country, \"\")\n",
234
+ " for part in nationality_lc.split():\n",
235
+ " tag_to_country_nationality[part] = (\"\", nationality)\n",
236
+ "\n",
237
+ "# Step 3: Infer likely_country and likely_nationality\n",
239
+ "def infer_country_and_nationality(tags):\n",
240
+ " for tag in tags:\n",
241
+ " cleaned = tag.replace(\" \", \"\").lower()\n",
242
+ " if cleaned in tag_to_country_nationality:\n",
243
+ " country, nationality = tag_to_country_nationality[cleaned]\n",
244
+ " # Special case: skip if country is \"Isle of Man\"\n",
245
+ " if country == \"Isle of Man\":\n",
246
+ " country = \"\"\n",
247
+ " return pd.Series([country, nationality])\n",
248
+ " return pd.Series([\"\", \"\"])\n",
249
+ "\n",
250
+ "\n",
251
+ "poi_df[[\"likely_country\", \"likely_nationality\"]] = poi_df[\"tags\"].apply(infer_country_and_nationality)\n",
252
+ "\n",
253
+ "# Step 4: Build tag → profession mapping\n",
254
+ "profession_alias_map = {}\n",
255
+ "\n",
256
+ "for _, row in professions_df.iterrows():\n",
257
+ " canonical = str(row['profession']).strip().lower()\n",
258
+ " profession_alias_map[canonical] = canonical\n",
259
+ " for alias_col in ['alias_1', 'alias_2', 'alias_3']:\n",
260
+ " alias = row.get(alias_col)\n",
261
+ " if pd.notna(alias):\n",
262
+ " profession_alias_map[str(alias).strip().lower()] = canonical\n",
263
+ "\n",
264
+ "# Step 5: Infer likely profession from tags\n",
265
+ "def infer_profession_from_tags(tags):\n",
266
+ " matched = []\n",
267
+ " for tag in tags:\n",
268
+ " cleaned = tag.strip().lower()\n",
269
+ " if cleaned in profession_alias_map:\n",
270
+ " matched.append(profession_alias_map[cleaned])\n",
271
+ "\n",
272
+ " if not matched:\n",
273
+ " return \"\"\n",
274
+ " if \"celebrity\" in matched and len(set(matched)) > 1:\n",
275
+ " # Drop 'celebrity' if other professions are present\n",
276
+ " matched = [m for m in matched if m != \"celebrity\"]\n",
277
+ "\n",
278
+ " return matched[0] # Return the first specific match\n",
279
+ "\n",
280
+ "\n",
281
+ "poi_df[\"likely_profession\"] = poi_df[\"tags\"].apply(infer_profession_from_tags)\n",
282
+ "\n",
283
+ "# Step 6: Save enriched dataset\n",
284
+ "poi_df.to_csv(outfile, index=False)\n",
285
+ "\n",
286
+ "# Optional: Preview\n",
287
+ "print(poi_df[[\"real_name\", \"likely_country\", \"likely_nationality\", \"likely_profession\"]].head())\n"
288
+ ]
289
+ },
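+ {
+ "cell_type": "markdown",
+ "id": "d6e7f8a9",
+ "metadata": {},
+ "source": [
+ "A quick coverage check of the tag-based hints (illustrative sketch; the columns are those created above):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e8f9a0b1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Fraction of rows where each tag-derived hint was filled in.\n",
+ "for col in [\"likely_country\", \"likely_nationality\", \"likely_profession\"]:\n",
+ "    filled = (poi_df[col].astype(str).str.strip() != \"\").mean()\n",
+ "    print(f\"{col}: {filled:.1%} filled\")\n"
+ ]
+ },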
290
+ {
291
+ "cell_type": "code",
292
+ "execution_count": 6,
293
+ "id": "054f230b",
294
+ "metadata": {},
295
+ "outputs": [],
296
+ "source": [
297
+ "#!pip install transformers torch\n",
298
+ "#!python -m spacy download en_core_web_trf\n",
299
+ "#pip install openai"
300
+ ]
301
+ },
302
+ {
303
+ "cell_type": "markdown",
304
+ "id": "59185461",
305
+ "metadata": {},
306
+ "source": [
307
+ "## Step 3: Query Deepseek-v3 with NAME and HINTS"
308
+ ]
309
+ },
310
+ {
311
+ "cell_type": "code",
312
+ "execution_count": 7,
313
+ "id": "504b970f",
314
+ "metadata": {},
315
+ "outputs": [],
316
+ "source": [
317
+ "#!pip install openpyxl"
318
+ ]
319
+ },
320
+ {
321
+ "cell_type": "code",
322
+ "execution_count": 10,
323
+ "id": "7c209115",
324
+ "metadata": {},
325
+ "outputs": [
326
+ {
327
+ "name": "stdout",
328
+ "output_type": "stream",
329
+ "text": [
330
+ "Row 1/4...\n",
331
+ "Saved up to row 1\n",
332
+ "Row 2/4...\n",
333
+ "Saved up to row 2\n",
334
+ "Row 3/4...\n",
335
+ "Saved up to row 3\n",
336
+ "Row 4/4...\n",
337
+ "Saved up to row 4\n",
338
+ "All done! Files: /home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/data/CSV/Deepseek_annotated_POI.csv /home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/data/CSV/Deepseek_annotated_POI.xlsx\n"
339
+ ]
340
+ }
341
+ ],
342
+ "source": [
343
+ "import pandas as pd\n",
344
+ "import openai\n",
345
+ "import time\n",
346
+ "import os\n",
347
+ "from pathlib import Path\n",
348
+ "from openai import OpenAI # Add this import\n",
349
+ "\n",
350
+ "# === PATHS & CONFIG ===\n",
351
+ "current_dir = Path.cwd()\n",
352
+ "inputCSV = current_dir.parent / \"data/CSV/model_subsets/NER_poi_step_02.csv\"\n",
353
+ "api_key_file = current_dir.parent / \"misc/credentials/deepseek_api_key.txt\" #store your API key under misc/credentials/deepseek_api_key.txt\n",
354
+ "\n",
355
+ "# Output both CSV and Excel for compatibility\n",
356
+ "OUTPUT_CSV = current_dir.parent / \"data/CSV/Deepseek_annotated_POI.csv\"\n",
357
+ "OUTPUT_XLSX = current_dir.parent / \"data/CSV/Deepseek_annotated_POI.xlsx\"\n",
358
+ "INDEX_FILE = current_dir.parent / \"misc/deepseek_query_index.txt\"\n",
359
+ "SAVE_INTERVAL = 1 # Save every N rows\n",
360
+ "START_ROW = 1 # Row index to start from (0-based)\n",
361
+ "END_ROW = 5 # Row index to end (exclusive)\n",
362
+ "\n",
363
+ "# === LOAD API KEY & CLIENT ===\n",
364
+ "with open(api_key_file) as f:\n",
365
+ " api_key = f.read().strip()\n",
366
+ "\n",
367
+ "client = OpenAI(\n",
368
+ " api_key=api_key,\n",
369
+ " base_url=\"https://api.deepseek.com/v1\"\n",
370
+ ")\n",
371
+ "\n",
372
+ "# === LOAD DATA ===\n",
373
+ "df = pd.read_csv(inputCSV)\n",
374
+ "df = df.iloc[START_ROW:END_ROW].reset_index(drop=True)\n",
375
+ "\n",
376
+ "# === PREPARE PROMPTS ===\n",
377
+ "def create_prompt(row):\n",
378
+ " name = row['real_name'] if pd.notna(row['real_name']) else row['name']\n",
379
+ " hints = []\n",
380
+ " for col in ('likely_profession','likely_nationality','likely_country'):\n",
381
+ " if pd.notna(row.get(col, None)):\n",
382
+ " hints.append(row[col])\n",
383
+ " if not hints:\n",
384
+ " tags = [row[f'tag_{i}'] for i in range(1,8) if pd.notna(row.get(f'tag_{i}', None))]\n",
385
+ " hints.extend(tags[:3])\n",
386
+ " hint_text = \", \".join(hints[:5])\n",
387
+ " return (\n",
388
+ " f\"Given '{name}' ({hint_text}), provide:\\n\"\n",
389
+ " \"1. Full legal name (Western order if non-latin script)\\n\"\n",
390
+ " \"2. Any stage names/aliases (comma separated)\\n\"\n",
391
+ " \"3. Gender\\n\"\n",
392
+ " \"4. Top 3 most specific, factual professions (use industry-standard terms, no euphemisms)\\n\"\n",
393
+ " \"5. Primary country associated\\n\"\n",
394
+ " \"Use 'Unknown' when uncertain or you encounter a fictional character or place. \"\n",
395
+ " \"For entertainment fields, specify sub-genres when known (kpop, adult industry, etc.).\"\n",
396
+ " )\n",
397
+ "\n",
398
+ "prompts = df.apply(create_prompt, axis=1).tolist()\n",
399
+ "\n",
400
+ "# === CHECK FOR EXISTING OUTPUT ===\n",
401
+ "if os.path.exists(INDEX_FILE):\n",
402
+ " with open(INDEX_FILE, 'r') as f:\n",
403
+ " current_index = int(f.read().strip())\n",
404
+ "else:\n",
405
+ " current_index = 0\n",
406
+ "\n",
407
+ "results = []\n",
408
+ "if os.path.exists(OUTPUT_XLSX):\n",
409
+ " existing = pd.read_excel(OUTPUT_XLSX)\n",
410
+ " if {'full_name','aliases','gender','profession_llm','country'}.issubset(existing.columns):\n",
411
+ " results = existing[['full_name','aliases','gender','profession_llm','country']].values.tolist()\n",
412
+ "\n",
413
+ "# === QUERY & PARSE ===\n",
414
+ "def query_deepseek(prompt):\n",
415
+ " try:\n",
416
+ " resp = client.chat.completions.create(\n",
417
+ " model=\"deepseek-chat\",\n",
418
+ " messages=[\n",
419
+ " {\"role\":\"system\",\"content\":\"You extract key data on a person; respond with exactly 5 numbered lines.\"},\n",
420
+ " {\"role\":\"user\",\"content\":prompt}\n",
421
+ " ],\n",
422
+ " temperature=0.05, top_p=0.8\n",
423
+ " )\n",
424
+ " return resp.choices[0].message.content.strip()\n",
425
+ " except Exception as e:\n",
426
+ " print(\"API error:\", e)\n",
427
+ " return \"\"\n",
428
+ "\n",
429
+ "def parse_response(resp):\n",
430
+ " lines = [l.strip() for l in resp.split('\\n') if l.strip()]\n",
431
+ " out = [\"Unknown\"]*5\n",
432
+ " for l in lines:\n",
433
+ " if l.startswith('1.'): out[0] = l[2:].strip()\n",
434
+ " elif l.startswith('2.'): out[1] = l[2:].strip()\n",
435
+ " elif l.startswith('3.'): out[2] = l[2:].strip()\n",
436
+ " elif l.startswith('4.'): out[3] = l[2:].strip()\n",
437
+ " elif l.startswith('5.'): out[4] = l[2:].strip()\n",
438
+ " return out\n",
439
+ "\n",
440
+ "# === PROCESS & SAVE ===\n",
441
+ "for i in range(current_index, len(df)):\n",
442
+ " print(f\"Row {i+1}/{len(df)}...\")\n",
443
+ " data = parse_response(query_deepseek(prompts[i]))\n",
444
+ " if i < len(results): results[i] = data\n",
445
+ " else: results.append(data)\n",
446
+ " current_index = i+1\n",
447
+ "\n",
448
+ " if current_index % SAVE_INTERVAL == 0 or current_index == len(df):\n",
449
+ " out_df = df.iloc[:current_index].copy()\n",
450
+ " out_df[['full_name','aliases','gender','profession_llm','country']] = pd.DataFrame(results[:current_index])\n",
451
+ " # CSV\n",
452
+ " out_df.to_csv(OUTPUT_CSV, index=False)\n",
453
+ " # Excel\n",
454
+ " out_df.to_excel(OUTPUT_XLSX, index=False)\n",
455
+ " with open(INDEX_FILE, 'w') as f:\n",
456
+ " f.write(str(current_index))\n",
457
+ " print(\"Saved up to row\", current_index)\n",
458
+ " time.sleep(1)\n",
459
+ "\n",
460
+ "print(\"All done! Files:\", OUTPUT_CSV, OUTPUT_XLSX)\n"
461
+ ]
462
+ },
463
+ {
464
+ "cell_type": "markdown",
465
+ "id": "7910e574",
466
+ "metadata": {},
467
+ "source": []
468
+ },
469
+ {
470
+ "cell_type": "markdown",
471
+ "id": "9d377005",
472
+ "metadata": {},
473
+ "source": [
474
+ "# Aggregate by individual names"
475
+ ]
476
+ },
477
+ {
478
+ "cell_type": "markdown",
479
+ "id": "d2e75354",
480
+ "metadata": {},
481
+ "source": [
482
+ " ##### e.g. Emma Watson [model1, model2, model3] etc."
483
+ ]
484
+ },
485
+ {
486
+ "cell_type": "code",
487
+ "execution_count": 11,
488
+ "id": "747c3a2f",
489
+ "metadata": {},
490
+ "outputs": [],
491
+ "source": [
492
+ "import pandas as pd\n",
493
+ "import re\n",
494
+ "from pathlib import Path\n",
495
+ "current_dir = Path.cwd()\n",
496
+ "\n",
497
+ "profession_map = current_dir.parent / \"misc/lists/mapped_professions.csv\"\n",
498
+ "\n",
499
+ "poi_df = current_dir.parent / \"data/CSV/Deepseek_annotated_POI.csv\"\n",
500
+ "\n",
501
+ "output = current_dir.parent / \"data/CSV/Deepseek_annotated_POI_aggregated.csv\"\n",
502
+ "\n",
503
+ "countries_csv = current_dir.parent / \"misc/lists/countries.csv\"\n",
504
+ "countries_df = pd.read_csv(countries_csv)\n",
505
+ "\n",
506
+ "# Extract valid country names (strip whitespace)\n",
507
+ "valid_countries = set(countries_df['en_short_name'].str.strip())\n",
508
+ "\n",
509
+ "\n",
510
+ "# Load the dataset\n",
511
+ "df = pd.read_csv(poi_df) # Update path if needed\n",
512
+ "\n",
513
+ "# Step 1: Group by 'full_name' and aggregate required information\n",
514
+ "grouped_df = df.groupby('full_name').agg(\n",
515
+ " No_of_models=('id', 'count'),\n",
516
+ " modelIDs=('id', lambda x: list(x)),\n",
517
+ " combinedDownloadCount=('downloadCount', 'sum')\n",
518
+ ").reset_index()\n",
519
+ "\n",
520
+ "# Step 2: Keep representative info for each person\n",
521
+ "# Keep representative info (including aliases)\n",
522
+ "additional_columns = df.groupby('full_name').agg(\n",
523
+ " country=('country', 'first'),\n",
524
+ " profession_llm=('profession_llm', 'first'),\n",
525
+ " gender=('gender', 'first'),\n",
526
+ " aliases=('aliases', 'first') # ✅ Add this line\n",
527
+ ").reset_index()\n",
528
+ "\n",
529
+ "\n",
530
+ "\n",
531
+ "def standardize_country(country):\n",
532
+ " if not isinstance(country, str):\n",
533
+ " return \"Unknown\"\n",
534
+ "\n",
535
+ " country_clean = country.strip()\n",
536
+ " lowered = country_clean.lower()\n",
537
+ "\n",
538
+ " # Handle fictional or fantasy countries\n",
539
+ " fictional_keywords = [\"fictional\", \"westeros\", \"asgard\", \"middle-earth\", \"naboo\", \"middle earth\", \"latveria\"]\n",
540
+ " if any(keyword in lowered for keyword in fictional_keywords):\n",
541
+ " return \"Unknown\"\n",
542
+ "\n",
543
+ " # Handle known region-based adjustments\n",
544
+ " if \"macau\" in lowered:\n",
545
+ " return \"Macau\"\n",
546
+ " elif \"hong kong\" in lowered:\n",
547
+ " return \"Hong Kong\"\n",
548
+ " elif \"taiwan\" in lowered:\n",
549
+ " return \"Taiwan\"\n",
550
+ "\n",
551
+ " # Normalize complex or alternate country names\n",
552
+ " lowered = lowered.replace(\"United Kingdom of Great Britain and Northern Ireland\", \"united kingdom\")\n",
553
+ " lowered = lowered.replace(\"england\", \"united kingdom\")\n",
554
+ " lowered = lowered.replace(\"united states of america\", \"united states\")\n",
555
+ "\n",
556
+ " # Remove anything in brackets and after commas\n",
557
+ " country_clean = re.sub(r\"\\(.*?\\)\", \"\", country_clean)\n",
558
+ " country_clean = country_clean.split(',')[0].strip().lower()\n",
559
+ "\n",
560
+ " # Manual overrides\n",
561
+ " replacements = {\n",
562
+ " \"united kingdom\": \"UK\",\n",
563
+ " \"united kingdom of great britain and northern ireland\": \"UK\",\n",
564
+ " \"french southern territories\": \"Other\",\n",
565
+ " \"united states\": \"US\",\n",
566
+ " \"united states of america\": \"US\",\n",
567
+ " \"turkey\": \"Türkiye\",\n",
568
+ " \"czech republic\": \"Czechia\"\n",
569
+ " }\n",
570
+ "\n",
571
+ " if country_clean in replacements:\n",
572
+ " return replacements[country_clean]\n",
573
+ "\n",
574
+ " # Final check against valid country list (case-insensitive)\n",
575
+ " for valid in valid_countries:\n",
576
+ " if country_clean == valid.lower():\n",
577
+ " return valid\n",
578
+ "\n",
579
+ " return \"Unknown\"\n",
580
+ "\n",
581
+ "\n",
582
+ "# Updated function to fully remove anything in brackets (complete or not)\n",
583
+ "def get_profession_short(profession):\n",
584
+ " if isinstance(profession, str):\n",
585
+ " # Get first part before comma\n",
586
+ " first_prof = profession.split(',')[0].strip()\n",
587
+ " # Remove all bracketed content, even malformed\n",
588
+ " first_prof = re.sub(r\"[\\[].∗?[\\[].*?[\\]]\", \"\", first_prof) # removes properly closed\n",
589
+ " first_prof = re.sub(r\"[\\(\\[].*\", \"\", first_prof) # removes malformed\n",
590
+ " cleaned = first_prof.strip()\n",
591
+ " # Normalize 'Actress' to 'Actor'\n",
592
+ " if cleaned.lower() == \"actress\":\n",
593
+ " return \"Actor\"\n",
594
+ " return cleaned\n",
595
+ " return None\n",
596
+ "\n",
597
+ "# Load your mapping file\n",
598
+ "mapping_df = pd.read_csv(profession_map, on_bad_lines='skip') # or 'warn'\n",
599
+ "\n",
600
+ "\n",
601
+ "# Ensure the mapping columns are named correctly\n",
602
+ "# (Assuming columns are: 'profession_llm' or 'profession_short', and 'category' or 'mapped_profession')\n",
603
+ "# Adjust these as needed\n",
604
+ "mapping_df.columns = [col.strip().lower() for col in mapping_df.columns]\n",
605
+ "\n",
606
+ "# Rename for clarity and consistency\n",
607
+ "if 'profession_llm' in mapping_df.columns:\n",
608
+ " mapping_df = mapping_df.rename(columns={'profession_llm': 'profession_short'})\n",
609
+ "if 'category' in mapping_df.columns:\n",
610
+ " mapping_df = mapping_df.rename(columns={'category': 'mapped_profession'})\n",
611
+ "\n",
612
+ "# Merge the mapped profession into final_df\n",
613
+ "#final_df = final_df.merge(mapping_df[['profession_short', 'mapped_profession']], on='profession_short', how='left')\n",
614
+ "\n",
615
+ "\n",
616
+ "additional_columns = df.groupby('full_name').agg(\n",
617
+ " country=('country', 'first'),\n",
618
+ " profession_llm=('profession_llm', 'first'),\n",
619
+ " gender=('gender', 'first'),\n",
620
+ " aliases=('aliases', 'first') # <-- Added this line\n",
621
+ ").reset_index()\n",
622
+ "\n",
623
+ "\n",
624
+ "# Step 3: Merge the aggregated info with the representative info\n",
625
+ "final_df = pd.merge(grouped_df, additional_columns, on='full_name', how='left')\n",
626
+ "\n",
627
+ "# Step 4: Clean and transform columns\n",
628
+ "final_df['profession_short'] = final_df['profession_llm'].apply(get_profession_short)\n",
629
+ "final_df['country'] = final_df['country'].apply(standardize_country)\n",
630
+ "\n",
631
+ "# Step 5: Merge with profession mapping\n",
632
+ "final_df = final_df.merge(mapping_df[['profession_short', 'mapped_profession']], on='profession_short', how='left')\n",
633
+ "\n",
634
+ "# Optional: Save the result to a CSV file\n",
635
+ "final_df.to_csv(output, index=False)\n"
636
+ ]
637
+ },
638
+ {
639
+ "cell_type": "code",
640
+ "execution_count": null,
641
+ "id": "704e5246",
642
+ "metadata": {},
643
+ "outputs": [],
644
+ "source": []
645
+ }
646
+ ],
647
+ "metadata": {
648
+ "kernelspec": {
649
+ "display_name": "latm",
650
+ "language": "python",
651
+ "name": "python3"
652
+ },
653
+ "language_info": {
654
+ "codemirror_mode": {
655
+ "name": "ipython",
656
+ "version": 3
657
+ },
658
+ "file_extension": ".py",
659
+ "mimetype": "text/x-python",
660
+ "name": "python",
661
+ "nbconvert_exporter": "python",
662
+ "pygments_lexer": "ipython3",
663
+ "version": "3.10.15"
664
+ }
665
+ },
666
+ "nbformat": 4,
667
+ "nbformat_minor": 5
668
+ }
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4__Figure_8b_sunburst_profession-checkpoint.ipynb ADDED
@@ -0,0 +1,123 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Prepare *.json for Figure 8b"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": 5,
13
+ "metadata": {},
14
+ "outputs": [
15
+ {
16
+ "name": "stdout",
17
+ "output_type": "stream",
18
+ "text": [
19
+ "✅ Sunburst data saved to sunburst_data.json\n"
20
+ ]
21
+ }
22
+ ],
23
+ "source": [
24
+ "import pandas as pd\n",
25
+ "from collections import defaultdict\n",
26
+ "import json\n",
27
+ "from pathlib import Path\n",
28
+ "\n",
29
+ "current_dir = Path.cwd()\n",
30
+ "\n",
31
+ "sunburst_path = current_dir.parent / \"public/json/sunburst_countries_A.json\"\n",
32
+ "\n",
33
+ "\n",
34
+ "aggregated_poi = current_dir.parent / \"data/CSV/Deepseek_annotated_POI_aggregated.csv\"\n",
35
+ "\n",
36
+ "df = pd.read_csv(aggregated_poi)\n",
37
+ "df['country_cleaned'] = df['country'].apply(lambda x: x if x not in ['Unknown', '', None] else 'Other')\n",
38
+ "\n",
39
+ "# Now get top countries excluding what was forced into 'Other'\n",
40
+ "top_countries = df['country_cleaned'].value_counts().nlargest(15).index.tolist()\n",
41
+ "\n",
42
+ "# Final limited country column\n",
43
+ "df['country_limited'] = df['country_cleaned'].apply(lambda x: x if x in top_countries else 'Other')\n",
44
+ "\n",
45
+ "# ---- Step 2: Limit to top 7 professions and combine Unknown and Sports Professional with Other ----\n",
46
+ "top_professions = df['mapped_profession'].value_counts().nlargest(7).index.tolist()\n",
47
+ "\n",
48
+ "# Explicitly remove 'Unknown' and 'Sports Professional' even if they are in the top 7\n",
49
+ "top_professions = [p for p in top_professions if p not in ['Unknown', 'Sports Professional']]\n",
50
+ "\n",
51
+ "# Normalize 'Unknown' and '\n",
52
+ "# Sports Professional' into 'Other'\n",
53
+ "# Normalize 'Unknown' and 'Sports Professional' into 'Other'\n",
54
+ "df['profession_limited'] = df['mapped_profession'].apply(\n",
55
+ " lambda x: 'Other' if x in ['Unknown', 'Sports Professional'] or x not in top_professions else x\n",
56
+ ")\n",
57
+ "\n",
58
+ "\n",
59
+ "\n",
60
+ "# ---- Step 3: Group by the limited country and profession ----\n",
61
+ "sunburst_data = df.groupby(['country_limited', 'profession_limited']).size().reset_index(name='count')\n",
62
+ "\n",
63
+ "# ---- Step 4: Create a nested structure for D3.js ----\n",
64
+ "sunburst_dict = {\"name\": \"root\", \"children\": []}\n",
65
+ "country_map = defaultdict(list)\n",
66
+ "\n",
67
+ "for _, row in sunburst_data.iterrows():\n",
68
+ " country = row['country_limited']\n",
69
+ " profession = row['profession_limited']\n",
70
+ " count = int(row['count'])\n",
71
+ " country_map[country].append({\"name\": profession, \"value\": count})\n",
72
+ "\n",
73
+ "# For each country, sort the profession list so that \"Other\" appears at the end\n",
74
+ "for country, professions in country_map.items():\n",
75
+ " professions_sorted = sorted(professions, key=lambda d: (d[\"name\"] == \"Other\", d[\"name\"]))\n",
76
+ " country_map[country] = professions_sorted\n",
77
+ "\n",
78
+ "for country, professions in country_map.items():\n",
79
+ " sunburst_dict[\"children\"].append({\"name\": country, \"children\": professions})\n",
80
+ "\n",
81
+ "# ---- Step 5: Save to a JSON file ----\n",
82
+ "with open(sunburst_path, \"w\", encoding='utf-8') as f:\n",
83
+ " json.dump(sunburst_dict, f, ensure_ascii=False, indent=2)\n",
84
+ "\n",
85
+ "\n",
86
+ "print(\"✅ Sunburst data saved to sunburst_data.json\")\n"
87
+ ]
88
+ },
89
+ {
90
+ "cell_type": "markdown",
91
+ "metadata": {},
92
+ "source": [
93
+ "the resulting *.json is the input for Figure_8.html"
94
+ ]
95
+ },
96
+ {
97
+ "cell_type": "markdown",
98
+ "metadata": {},
99
+ "source": []
100
+ }
101
+ ],
102
+ "metadata": {
103
+ "kernelspec": {
104
+ "display_name": "latm",
105
+ "language": "python",
106
+ "name": "python3"
107
+ },
108
+ "language_info": {
109
+ "codemirror_mode": {
110
+ "name": "ipython",
111
+ "version": 3
112
+ },
113
+ "file_extension": ".py",
114
+ "mimetype": "text/x-python",
115
+ "name": "python",
116
+ "nbconvert_exporter": "python",
117
+ "pygments_lexer": "ipython3",
118
+ "version": "3.10.15"
119
+ }
120
+ },
121
+ "nbformat": 4,
122
+ "nbformat_minor": 2
123
+ }
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_compare-models-checkpoint.ipynb ADDED
@@ -0,0 +1,237 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": null,
6
+ "id": "6cbaef9d-3058-4a59-a8ee-32fcc2062ed6",
7
+ "metadata": {},
8
+ "outputs": [],
9
+ "source": [
10
+ "# Jupyter Notebook Cell for Country Name Standardization\n",
11
+ "# Copy and paste this entire cell into your Jupyter notebook\n",
12
+ "\n",
13
+ "import pandas as pd\n",
14
+ "import os\n",
15
+ "import re\n",
16
+ "\n",
17
+ "# Define country standardization mappings\n",
18
+ "COUNTRY_MAPPINGS = {\n",
19
+ " # USA variations\n",
20
+ " 'united states': 'USA',\n",
21
+ " 'united states of america': 'USA',\n",
22
+ " 'america': 'USA',\n",
23
+ " 'us': 'USA',\n",
24
+ " 'u.s.': 'USA',\n",
25
+ " 'u.s.a.': 'USA',\n",
26
+ " 'states': 'USA',\n",
27
+ " \n",
28
+ " # UK variations\n",
29
+ " 'united kingdom': 'UK',\n",
30
+ " 'england': 'UK',\n",
31
+ " 'britain': 'UK',\n",
32
+ " 'great britain': 'UK',\n",
33
+ " 'uk': 'UK',\n",
34
+ " 'u.k.': 'UK',\n",
35
+ " \n",
36
+ " # Turkey -> Türkiye\n",
37
+ " 'turkey': 'Türkiye',\n",
38
+ " \n",
39
+ " # Czech Republic -> Czechia\n",
40
+ " 'czech republic': 'Czechia',\n",
41
+ " 'czechoslovakia': 'Czechia',\n",
42
+ " 'czechoslowakia': 'Czechia', # Common misspelling\n",
43
+ " \n",
44
+ " # USSR -> Russia\n",
45
+ " 'ussr': 'Russia',\n",
46
+ " 'udssr': 'Russia',\n",
47
+ " 'soviet union': 'Russia',\n",
48
+ " \n",
49
+ " # Korea variations (maintain distinction between North and South)\n",
50
+ " 'korea': 'South Korea', # If just \"Korea\", assume South Korea\n",
51
+ " 'south korea': 'South Korea',\n",
52
+ " 'republic of korea': 'South Korea',\n",
53
+ " 'north korea': 'North Korea',\n",
54
+ " 'dprk': 'North Korea',\n",
55
+ " \n",
56
+ " # China variations\n",
57
+ " 'china': 'China',\n",
58
+ " 'people\\'s republic of china': 'China',\n",
59
+ " 'prc': 'China',\n",
60
+ " 'mainland china': 'China',\n",
61
+ " \n",
62
+ " # Common standardizations\n",
63
+ " 'holland': 'Netherlands',\n",
64
+ " 'the netherlands': 'Netherlands',\n",
65
+ " 'deutschland': 'Germany',\n",
66
+ " 'nippon': 'Japan',\n",
67
+ " 'espana': 'Spain',\n",
68
+ " 'españa': 'Spain',\n",
69
+ " \n",
70
+ " # Keep these as-is but included for completeness\n",
71
+ " 'usa': 'USA',\n",
72
+ " 'russia': 'Russia',\n",
73
+ "}\n",
74
+ "\n",
75
+ "def standardize_country(country_value):\n",
76
+ " \"\"\"\n",
77
+ " Standardize a single country name based on the mapping.\n",
78
+ " \"\"\"\n",
79
+ " if pd.isna(country_value):\n",
80
+ " return country_value\n",
81
+ " \n",
82
+ " # Convert to string and strip whitespace\n",
83
+ " country_str = str(country_value).strip()\n",
84
+ " \n",
85
+ " # Return if empty or 'Unknown'\n",
86
+ " if not country_str or country_str.lower() == 'unknown':\n",
87
+ " return country_str\n",
88
+ " \n",
89
+ " # Convert to lowercase for matching\n",
90
+ " country_lower = country_str.lower()\n",
91
+ " \n",
92
+ " # Check if it matches any of our mappings\n",
93
+ " for pattern, replacement in COUNTRY_MAPPINGS.items():\n",
94
+ " if country_lower == pattern:\n",
95
+ " return replacement\n",
96
+ " \n",
97
+ " # If no exact match found, return original with proper capitalization\n",
98
+ " # This preserves countries not in our mapping\n",
99
+ " return country_str\n",
100
+ "\n",
101
+ "def process_csv_file(input_file, output_file):\n",
102
+ " \"\"\"\n",
103
+ " Process a CSV file to standardize country names.\n",
104
+ " \"\"\"\n",
105
+ " print(f\"Processing: {input_file}\")\n",
106
+ " \n",
107
+ " # For mistral.csv which has no header, we need special handling\n",
108
+ " if 'mistral' in input_file.lower():\n",
109
+ " # Read without header\n",
110
+ " df = pd.read_csv(input_file, header=None)\n",
111
+ " \n",
112
+ " # Check if the last column contains country data\n",
113
+ " # Based on the structure, country should be in the last column\n",
114
+ " last_col = df.columns[-1]\n",
115
+ " \n",
116
+ " # Apply standardization to the last column\n",
117
+ " df[last_col] = df[last_col].apply(standardize_country)\n",
118
+ " \n",
119
+ " print(f\" - Standardized column {last_col} (assumed to be country)\")\n",
120
+ " print(f\" - Sample values after standardization: {df[last_col].dropna().head(10).tolist()}\")\n",
121
+ " else:\n",
122
+ " # Normal CSV with header\n",
123
+ " df = pd.read_csv(input_file)\n",
124
+ " \n",
125
+ " # Check if 'country' column exists\n",
126
+ " if 'country' in df.columns:\n",
127
+ " # Apply standardization\n",
128
+ " df['country'] = df['country'].apply(standardize_country)\n",
129
+ " \n",
130
+ " print(f\" - Found and standardized 'country' column\")\n",
131
+ " print(f\" - Unique countries after standardization: {sorted(df['country'].dropna().unique())}\")\n",
132
+ " else:\n",
133
+ " print(f\" - Warning: No 'country' column found in {input_file}\")\n",
134
+ " \n",
135
+ " # Save the processed file\n",
136
+ " df.to_csv(output_file, index=False)\n",
137
+ " print(f\" - Saved to: {output_file}\\n\")\n",
138
+ " \n",
139
+ " return df\n",
140
+ "\n",
141
+ "# ============================================================\n",
142
+ "# MAIN EXECUTION - Adjust paths as needed for your environment\n",
143
+ "# ============================================================\n",
144
+ "\n",
145
+ "# Input files - adjust these paths to match your file locations\n",
146
+ "input_files = [\n",
147
+ " 'gemma.csv',\n",
148
+ " 'mistral.csv',\n",
149
+ " 'qwen.csv'\n",
150
+ "]\n",
151
+ "\n",
152
+ "# Create a results dictionary to store processed dataframes\n",
153
+ "results = {}\n",
154
+ "\n",
155
+ "for filename in input_files:\n",
156
+ " # Adjust these paths based on where your files are located\n",
157
+ " # For example, if files are in current directory, use: input_path = filename\n",
158
+ " input_path = filename # Modify this based on your file location\n",
159
+ " \n",
160
+ " # Create output filename\n",
161
+ " base_name = filename.replace('.csv', '')\n",
162
+ " output_path = f'{base_name}_standardized_country.csv'\n",
163
+ " \n",
164
+ " # Check if input file exists\n",
165
+ " if os.path.exists(input_path):\n",
166
+ " df = process_csv_file(input_path, output_path)\n",
167
+ " results[base_name] = df\n",
168
+ " else:\n",
169
+ " print(f\"Warning: {input_path} not found! Please check the file path.\\n\")\n",
170
+ "\n",
171
+ "# Display summary statistics\n",
172
+ "print(\"=\" * 60)\n",
173
+ "print(\"SUMMARY OF COUNTRY STANDARDIZATION\")\n",
174
+ "print(\"=\" * 60)\n",
175
+ "\n",
176
+ "for name, df in results.items():\n",
177
+ " print(f\"\\n{name.upper()}:\")\n",
178
+ " print(f\" Total rows: {len(df)}\")\n",
179
+ " \n",
180
+ " # Find the country column\n",
181
+ " if name == 'mistral':\n",
182
+ " # For mistral, assume last column is country\n",
183
+ " country_col = df.columns[-1]\n",
184
+ " else:\n",
185
+ " country_col = 'country' if 'country' in df.columns else None\n",
186
+ " \n",
187
+ " if country_col is not None:\n",
188
+ " country_counts = df[country_col].value_counts()\n",
189
+ " print(f\" Top 10 countries:\")\n",
190
+ " for country, count in country_counts.head(10).items():\n",
191
+ " print(f\" - {country}: {count}\")\n",
192
+ "\n",
193
+ "print(\"\\n\" + \"=\" * 60)\n",
194
+ "print(\"Processing complete!\")\n",
195
+ "print(\"Standardized files have been created with '_standardized_country.csv' suffix\")\n",
196
+ "print(\"=\" * 60)\n",
197
+ "\n",
198
+ "# Optional: Display before/after comparison for a sample\n",
199
+ "print(\"\\n\" + \"=\" * 60)\n",
200
+ "print(\"EXAMPLE TRANSFORMATIONS APPLIED:\")\n",
201
+ "print(\"=\" * 60)\n",
202
+ "print(\"• 'United States' → 'USA'\")\n",
203
+ "print(\"• 'United States of America' → 'USA'\")\n",
204
+ "print(\"• 'States' → 'USA'\")\n",
205
+ "print(\"• 'England' → 'UK'\")\n",
206
+ "print(\"• 'United Kingdom' → 'UK'\")\n",
207
+ "print(\"• 'Britain' → 'UK'\")\n",
208
+ "print(\"• 'Turkey' → 'Türkiye'\")\n",
209
+ "print(\"• 'Czech Republic' → 'Czechia'\")\n",
210
+ "print(\"• 'Korea' → 'South Korea'\")\n",
211
+ "print(\"• 'UDSSR' → 'Russia'\")\n",
212
+ "print(\"=\" * 60)"
213
+ ]
214
+ }
215
+ ],
216
+ "metadata": {
217
+ "kernelspec": {
218
+ "display_name": "Python 3 (ipykernel)",
219
+ "language": "python",
220
+ "name": "python3"
221
+ },
222
+ "language_info": {
223
+ "codemirror_mode": {
224
+ "name": "ipython",
225
+ "version": 3
226
+ },
227
+ "file_extension": ".py",
228
+ "mimetype": "text/x-python",
229
+ "name": "python",
230
+ "nbconvert_exporter": "python",
231
+ "pygments_lexer": "ipython3",
232
+ "version": "3.12.10"
233
+ }
234
+ },
235
+ "nbformat": 4,
236
+ "nbformat_minor": 5
237
+ }
jupyter_notebooks/.ipynb_checkpoints/Section_2-3_Figure_5_co-occurence_promotional_tags-checkpoint.ipynb ADDED
@@ -0,0 +1,314 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Create *.json for figure 5 (Co-occurence network of Tags)"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": 5,
13
+ "metadata": {},
14
+ "outputs": [
15
+ {
16
+ "name": "stdout",
17
+ "output_type": "stream",
18
+ "text": [
19
+ "Processing: america\n",
20
+ " ✅ Saved to /home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/public/json/tags_america.json\n"
21
+ ]
22
+ }
23
+ ],
24
+ "source": [
25
+ "import pandas as pd\n",
26
+ "from itertools import combinations\n",
27
+ "from collections import Counter, defaultdict\n",
28
+ "import json\n",
29
+ "import re\n",
30
+ "import os\n",
31
+ "\n",
32
+ "from pathlib import Path\n",
33
+ "current_dir = Path.cwd()\n",
34
+ "\n",
35
+ "# === CONFIG ===\n",
36
+ "file_path = current_dir.parent / \"data/CSV/Models/Civi_models.csv\"\n",
37
+ "output_dir = current_dir.parent / \"public/json/\"\n",
38
+ "#target_terms = [\"asian\", \"indian\", \"man\", \"woman\", \"german\", \"korean\", \"american\", \"russian\", \"style\", \"japanese\", \"chinese\"] # Add any tags you want to process\n",
39
+ "#target_terms = [\"character\", \"instagram\", \"youtuber\", \"actor\", \"actress\", \"celebrity\", \"vtuber\", \"kpop\"] # Add any tags you want to process\n",
40
+ "target_terms = [\"america\"] # Add any tags you want to process\n",
41
+ "min_connections = 1 # minimum number of link connections per node\n",
42
+ "\n",
43
+ "# === LOAD DATA ===\n",
44
+ "df = pd.read_csv(file_path)\n",
45
+ "tag_columns = [f\"tag_{i}\" for i in range(1, 8)]\n",
46
+ "df_tags = df[tag_columns]\n",
47
+ "\n",
48
+ "# === MAIN LOOP ===\n",
49
+ "for target_term in target_terms:\n",
50
+ " print(f\"Processing: {target_term}\")\n",
51
+ " \n",
52
+ " pattern = re.compile(rf'\\b{re.escape(target_term)}\\b', flags=re.IGNORECASE)\n",
53
+ " df_filtered = df_tags[df_tags.apply(\n",
54
+ " lambda row: row.astype(str).apply(lambda x: bool(pattern.search(x))).any(),\n",
55
+ " axis=1\n",
56
+ " )]\n",
57
+ "\n",
58
+ " # Skip if no data matches\n",
59
+ " if df_filtered.empty:\n",
60
+ " print(f\" ⚠️ No matches for '{target_term}', skipping.\")\n",
61
+ " continue\n",
62
+ "\n",
63
+ " # === COUNT INDIVIDUAL TAGS ===\n",
64
+ " all_tags = df_filtered.values.flatten()\n",
65
+ " all_tags = [tag for tag in all_tags if pd.notna(tag)]\n",
66
+ " tag_counts = Counter(all_tags)\n",
67
+ "\n",
68
+ " # === CO-OCCURRENCE ===\n",
69
+ " co_occurrences = defaultdict(int)\n",
70
+ " for tags in df_filtered.itertuples(index=False, name=None):\n",
71
+ " tags = [tag for tag in tags if pd.notna(tag)]\n",
72
+ " for tag1, tag2 in combinations(tags, 2):\n",
73
+ " co_occurrences[frozenset([tag1, tag2])] += 1\n",
74
+ "\n",
75
+ " edges = [(list(pair)[0], list(pair)[1], weight) for pair, weight in co_occurrences.items()]\n",
76
+ "\n",
77
+ " # === FILTER BY CONNECTIONS ===\n",
78
+ " connected_tags = Counter()\n",
79
+ " for tag1, tag2, _ in edges:\n",
80
+ " connected_tags[tag1] += 1\n",
81
+ " connected_tags[tag2] += 1\n",
82
+ "\n",
83
+ " nodes = [{\"id\": tag, \"size\": tag_counts[tag]} for tag in tag_counts if connected_tags[tag] >= min_connections]\n",
84
+ " valid_ids = set(node[\"id\"] for node in nodes)\n",
85
+ " links = [{\"source\": tag1, \"target\": tag2, \"value\": weight}\n",
86
+ " for tag1, tag2, weight in edges\n",
87
+ " if tag1 in valid_ids and tag2 in valid_ids]\n",
88
+ "\n",
89
+ " if not nodes or not links:\n",
90
+ " print(f\" ⚠️ Not enough connections for '{target_term}', skipping.\")\n",
91
+ " continue\n",
92
+ "\n",
93
+ " # === EXPORT ===\n",
94
+ " d3_data = {\"nodes\": nodes, \"links\": links}\n",
95
+ " safe_term = re.sub(r'\\W+', '_', target_term.lower())\n",
96
+ " output_file = os.path.join(output_dir, f\"tags_{safe_term}.json\")\n",
97
+ " \n",
98
+ " with open(output_file, \"w\") as f:\n",
99
+ " json.dump(d3_data, f, indent=4)\n",
100
+ " \n",
101
+ " print(f\" ✅ Saved to {output_file}\")\n"
102
+ ]
103
+ },
104
+ {
105
+ "cell_type": "markdown",
106
+ "metadata": {},
107
+ "source": [
108
+ "## Different Countries"
109
+ ]
110
+ },
111
+ {
112
+ "cell_type": "code",
113
+ "execution_count": 2,
114
+ "metadata": {},
115
+ "outputs": [
116
+ {
117
+ "name": "stderr",
118
+ "output_type": "stream",
119
+ "text": [
120
+ "/tmp/ipykernel_68582/2797381217.py:15: DtypeWarning: Columns (50,51,54,55,56,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,458) have mixed types. Specify dtype option on import or set low_memory=False.\n",
121
+ " df = pd.read_csv(file_path)\n"
122
+ ]
123
+ },
124
+ {
125
+ "name": "stdout",
126
+ "output_type": "stream",
127
+ "text": [
128
+ "Processing: united states\n",
129
+ " ⚠️ No matches for 'united states', skipping.\n",
130
+ "Processing: korea\n",
131
+ " ✅ Saved to data/json/tags_korea_poi.json\n",
132
+ "Processing: uk\n",
133
+ " ✅ Saved to data/json/tags_uk_poi.json\n",
134
+ "Processing: russia\n",
135
+ " ✅ Saved to data/json/tags_russia_poi.json\n",
136
+ "Processing: china\n",
137
+ " ✅ Saved to data/json/tags_china_poi.json\n",
138
+ "Processing: canada\n",
139
+ " ✅ Saved to data/json/tags_canada_poi.json\n",
140
+ "Processing: India\n",
141
+ " ✅ Saved to data/json/tags_india_poi.json\n",
142
+ "Processing: germany\n",
143
+ " ✅ Saved to data/json/tags_germany_poi.json\n"
144
+ ]
145
+ }
146
+ ],
147
+ "source": [
148
+ "import pandas as pd\n",
149
+ "from itertools import combinations\n",
150
+ "from collections import Counter, defaultdict\n",
151
+ "import json\n",
152
+ "import re\n",
153
+ "import os\n",
154
+ "\n",
155
+ "# === CONFIG ===\n",
156
+ "file_path = \"data/model_subsets/all_models_poi_true.csv\"\n",
157
+ "output_dir = \"data/json/\"\n",
158
+ "target_terms = [\"united states\", \"korea\", \"uk\", \"russia\", \"china\", \"canada\", \"India\", \"germany\"] # Add any tags you want to process\n",
159
+ "min_connections = 1 # minimum number of link connections per node\n",
160
+ "\n",
161
+ "# === LOAD DATA ===\n",
162
+ "df = pd.read_csv(file_path)\n",
163
+ "tag_columns = [f\"tag_{i}\" for i in range(1, 8)]\n",
164
+ "df_tags = df[tag_columns]\n",
165
+ "\n",
166
+ "# === MAIN LOOP ===\n",
167
+ "for target_term in target_terms:\n",
168
+ " print(f\"Processing: {target_term}\")\n",
169
+ " \n",
170
+ " pattern = re.compile(rf'\\b{re.escape(target_term)}\\b', flags=re.IGNORECASE)\n",
171
+ " df_filtered = df_tags[df_tags.apply(\n",
172
+ " lambda row: row.astype(str).apply(lambda x: bool(pattern.search(x))).any(),\n",
173
+ " axis=1\n",
174
+ " )]\n",
175
+ "\n",
176
+ " # Skip if no data matches\n",
177
+ " if df_filtered.empty:\n",
178
+ " print(f\" ⚠️ No matches for '{target_term}', skipping.\")\n",
179
+ " continue\n",
180
+ "\n",
181
+ " # === COUNT INDIVIDUAL TAGS ===\n",
182
+ " all_tags = df_filtered.values.flatten()\n",
183
+ " all_tags = [tag for tag in all_tags if pd.notna(tag)]\n",
184
+ " tag_counts = Counter(all_tags)\n",
185
+ "\n",
186
+ " # === CO-OCCURRENCE ===\n",
187
+ " co_occurrences = defaultdict(int)\n",
188
+ " for tags in df_filtered.itertuples(index=False, name=None):\n",
189
+ " tags = [tag for tag in tags if pd.notna(tag)]\n",
190
+ " for tag1, tag2 in combinations(tags, 2):\n",
191
+ " co_occurrences[frozenset([tag1, tag2])] += 1\n",
192
+ "\n",
193
+ " edges = [(list(pair)[0], list(pair)[1], weight) for pair, weight in co_occurrences.items()]\n",
194
+ "\n",
195
+ " # === FILTER BY CONNECTIONS ===\n",
196
+ " connected_tags = Counter()\n",
197
+ " for tag1, tag2, _ in edges:\n",
198
+ " connected_tags[tag1] += 1\n",
199
+ " connected_tags[tag2] += 1\n",
200
+ "\n",
201
+ " nodes = [{\"id\": tag, \"size\": tag_counts[tag]} for tag in tag_counts if connected_tags[tag] >= min_connections]\n",
202
+ " valid_ids = set(node[\"id\"] for node in nodes)\n",
203
+ " links = [{\"source\": tag1, \"target\": tag2, \"value\": weight}\n",
204
+ " for tag1, tag2, weight in edges\n",
205
+ " if tag1 in valid_ids and tag2 in valid_ids]\n",
206
+ "\n",
207
+ " if not nodes or not links:\n",
208
+ " print(f\" ⚠️ Not enough connections for '{target_term}', skipping.\")\n",
209
+ " continue\n",
210
+ "\n",
211
+ " # === EXPORT ===\n",
212
+ " d3_data = {\"nodes\": nodes, \"links\": links}\n",
213
+ " safe_term = re.sub(r'\\W+', '_', target_term.lower())\n",
214
+ " output_file = os.path.join(output_dir, f\"tags_{safe_term}_poi.json\")\n",
215
+ " \n",
216
+ " with open(output_file, \"w\") as f:\n",
217
+ " json.dump(d3_data, f, indent=4)\n",
218
+ " \n",
219
+ " print(f\" ✅ Saved to {output_file}\")\n"
220
+ ]
221
+ },
222
+ {
223
+ "cell_type": "code",
224
+ "execution_count": 5,
225
+ "metadata": {},
226
+ "outputs": [
227
+ {
228
+ "name": "stderr",
229
+ "output_type": "stream",
230
+ "text": [
231
+ "/tmp/ipykernel_79893/1420003123.py:13: DtypeWarning: Columns (50,51,54,55,56,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,458) have mixed types. Specify dtype option on import or set low_memory=False.\n",
232
+ " df = pd.read_csv(file_path)\n"
233
+ ]
234
+ },
235
+ {
236
+ "name": "stdout",
237
+ "output_type": "stream",
238
+ "text": [
239
+ "✅ Exported 60330 nodes and 16921 links to public/json/nodes_all.json\n"
240
+ ]
241
+ }
242
+ ],
243
+ "source": [
244
+ "import pandas as pd\n",
245
+ "from itertools import combinations\n",
246
+ "from collections import Counter, defaultdict\n",
247
+ "import json\n",
248
+ "import os\n",
249
+ "\n",
250
+ "# === CONFIG ===\n",
251
+ "file_path = \"data/model_subsets/all_models_poi_false.csv\"\n",
252
+ "output_file = \"public/json/nodes_all.json\"\n",
253
+ "min_link_threshold = 10 # Only keep edges with co-occurrence >= this\n",
254
+ "\n",
255
+ "# === LOAD DATA ===\n",
256
+ "df = pd.read_csv(file_path)\n",
257
+ "tag_columns = [f\"tag_{i}\" for i in range(1, 8)]\n",
258
+ "df_tags = df[tag_columns]\n",
259
+ "\n",
260
+ "# === COUNT INDIVIDUAL TAGS ===\n",
261
+ "all_tags = df_tags.values.flatten()\n",
262
+ "all_tags = [tag for tag in all_tags if pd.notna(tag)]\n",
263
+ "tag_counts = Counter(all_tags)\n",
264
+ "\n",
265
+ "# === CO-OCCURRENCE ===\n",
266
+ "co_occurrences = defaultdict(int)\n",
267
+ "for tags in df_tags.itertuples(index=False, name=None):\n",
268
+ " tags = [tag for tag in tags if pd.notna(tag)]\n",
269
+ " for tag1, tag2 in combinations(tags, 2):\n",
270
+ " co_occurrences[frozenset([tag1, tag2])] += 1\n",
271
+ "\n",
272
+ "# === Build Edges (Filtered by co-occurrence threshold)\n",
273
+ "edges = [\n",
274
+ " {\"source\": list(pair)[0], \"target\": list(pair)[1], \"value\": weight}\n",
275
+ " for pair, weight in co_occurrences.items()\n",
276
+ " if weight >= min_link_threshold\n",
277
+ "]\n",
278
+ "\n",
279
+ "# === Build Nodes (All tags that appear, regardless of links)\n",
280
+ "nodes = [{\"id\": tag, \"size\": tag_counts[tag]} for tag in tag_counts]\n",
281
+ "\n",
282
+ "# === EXPORT ===\n",
283
+ "d3_data = {\"nodes\": nodes, \"links\": edges}\n",
284
+ "\n",
285
+ "os.makedirs(os.path.dirname(output_file), exist_ok=True)\n",
286
+ "with open(output_file, \"w\") as f:\n",
287
+ " json.dump(d3_data, f, indent=4)\n",
288
+ "\n",
289
+ "print(f\"✅ Exported {len(nodes)} nodes and {len(edges)} links to {output_file}\")\n"
290
+ ]
291
+ }
292
+ ],
293
+ "metadata": {
294
+ "kernelspec": {
295
+ "display_name": "latm",
296
+ "language": "python",
297
+ "name": "python3"
298
+ },
299
+ "language_info": {
300
+ "codemirror_mode": {
301
+ "name": "ipython",
302
+ "version": 3
303
+ },
304
+ "file_extension": ".py",
305
+ "mimetype": "text/x-python",
306
+ "name": "python",
307
+ "nbconvert_exporter": "python",
308
+ "pygments_lexer": "ipython3",
309
+ "version": "3.10.15"
310
+ }
311
+ },
312
+ "nbformat": 4,
313
+ "nbformat_minor": 2
314
+ }
jupyter_notebooks/.ipynb_checkpoints/Section_2-4_Figure_9_ectract_LoRA_metadata_v2-checkpoint.ipynb ADDED
@@ -0,0 +1,400 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "f36422c8",
6
+ "metadata": {},
7
+ "source": [
8
+ "# LoRA metadata"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "raw",
13
+ "id": "8a2feb6e",
14
+ "metadata": {},
15
+ "source": [
16
+ "LoRA Metadata Processing Workflow\n",
17
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
18
+ "│ Load CSV File │ --> │ Read adapter metadata CSV file. │\n",
19
+ "│ Read Model Versions │ │ Extract model version IDs and relevant data. │\n",
20
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
21
+ " ↓\n",
22
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
23
+ "│ Download Adapter │ --> │ Use stored download URLs to fetch adapter files │\n",
24
+ "│ Files Using API │ │ using rotating API keys. │\n",
25
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
26
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
27
+ "│ Parse Metadata │ --> │ Extract safetensors metadata, such as training │\n",
28
+ "│ from SafeTensor │ │ images, model type, and architecture. │\n",
29
+ "│ Files │ │ │\n",
30
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
31
+ " ↓\n",
32
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
33
+ "│ Store Parsed │ --> │ Save extracted metadata into structured JSON │\n",
34
+ "│ Metadata as JSON │ │ files for later analysis. │\n",
35
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
36
+ "\n",
37
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
38
+ "│ Process JSON Files │ --> │ Read saved JSON metadata, extract relevant │\n",
39
+ "│ for Consolidation │ │ details, and filter necessary attributes. │\n",
40
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
41
+ " ↓\n",
42
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
43
+ "│ Extract Training │ --> │ Identify most frequent training tags, architectures│\n",
44
+ "│ Tags & Model Info │ │ and systems used for model creation. │\n",
45
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
46
+ " ↓\n",
47
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
48
+ "│ Save Consolidated │ --> │ Store all processed metadata in a structured CSV │\n",
49
+ "│ Metadata to CSV │ │ format for final analysis. ���\n",
50
+ "└──────────────────────┘ └───────────────────────────────────────────────────┘\n"
51
+ ]
52
+ },
53
+ {
54
+ "cell_type": "code",
55
+ "execution_count": null,
56
+ "id": "efc9939d",
57
+ "metadata": {},
58
+ "outputs": [],
59
+ "source": [
60
+ "import os\n",
61
+ "import re\n",
62
+ "import json\n",
63
+ "import csv\n",
64
+ "import struct\n",
65
+ "import requests\n",
66
+ "from pathlib import Path\n",
67
+ "import pandas as pd\n",
68
+ "from collections import Counter\n",
69
+ "from concurrent.futures import ProcessPoolExecutor\n",
70
+ "from pathlib import Path\n",
71
+ "import matplotlib.pyplot as plt\n",
72
+ "from matplotlib.font_manager import FontProperties\n",
73
+ "from matplotlib import font_manager\n",
74
+ "import pandas as pd\n",
75
+ "from collections import Counter\n",
76
+ "from concurrent.futures import ProcessPoolExecutor\n",
77
+ "\n",
78
+ "# Define the current directory and important file paths\n",
79
+ "current_dir = Path.cwd()\n",
80
+ "\n",
81
+ "# Define frequently used directories\n",
82
+ "\n",
83
+ "data_dir = current_dir.parent / 'data/csv/adapters.csv'\n",
84
+ "fonts_dir = current_dir.parent / 'misc/assets/fonts'\n",
85
+ "plots_dir = current_dir.parent / 'results/plots'\n",
86
+ "raw_data_dir = current_dir.parent / 'data/adapter_metadata/lora' ### location of the LoRA metadata (JSON)\n",
87
+ "temp_dir = current_dir.parent / 'data/raw/adapters_safetensors'\n",
88
+ "misc_dir = current_dir.parent / 'misc'\n",
89
+ "\n",
90
+ "# File paths\n",
91
+ "adapters_csv = current_dir.parent / 'data/csv/adapters.csv'\n",
92
+ "output_json_dir = raw_data_dir\n",
93
+ "api_keys_file = misc_dir / 'credentials/civit.txt'\n",
94
+ "\n",
95
+ "# Ensure directories exist\n",
96
+ "os.makedirs(output_json_dir, exist_ok=True)\n",
97
+ "os.makedirs(temp_dir, exist_ok=True)\n",
98
+ "\n",
99
+ "\n",
100
+ "# Load fonts into Matplotlib\n",
101
+ "for font_path in font_paths:\n",
102
+ " font_manager.fontManager.addfont(font_path)\n",
103
+ "\n",
104
+ "# Set default font family for plots\n",
105
+ "plt.rcParams['font.family'] = ['Noto Sans JP', 'Noto Sans SC', 'sans-serif']\n",
106
+ "\n",
107
+ "print('Paths and fonts initialized successfully.')\n",
108
+ "\n",
109
+ "print('Paths initialized successfully.')"
110
+ ]
111
+ },
112
+ {
113
+ "cell_type": "markdown",
114
+ "id": "87a58593",
115
+ "metadata": {},
116
+ "source": [
117
+ "## Step 2: Download LoRA and extract *.safetensors metadata\n",
118
+ "This script downloads LoRA adapters from the filtered Civiverse-Models dataset and extracts the metadata found within the *.safetensors' data structure"
119
+ ]
120
+ },
121
+ {
122
+ "cell_type": "code",
123
+ "execution_count": null,
124
+ "id": "abd3a0bc",
125
+ "metadata": {},
126
+ "outputs": [],
127
+ "source": [
128
+ "import os\n",
129
+ "import sys\n",
130
+ "import csv\n",
131
+ "import json\n",
132
+ "import struct\n",
133
+ "import time\n",
134
+ "import requests\n",
135
+ "import signal\n",
136
+ "import contextlib\n",
137
+ "from pathlib import Path\n",
138
+ "import re\n",
139
+ "\n",
140
+ "# === Paste your API keys here ===\n",
141
+ "API_KEYS = [\n",
142
+ " \"399c06ea6d1b7349556a115376ec346b\", #DISCORD\n",
143
+ " \"213be9d373130f86e394c6fea4d75162\", #ASDD 1\n",
144
+ " \"4f180c0c56334b74394b467c5e5b8201\", #ASDD 2\n",
145
+ " \"bdfba7ac53290f66bc76130f25b74336\", #BSDD \n",
146
+ " \"43294f4a27b388624a896db5a65f445a\"\n",
147
+ "]\n",
148
+ "if not API_KEYS or any(not isinstance(k, str) or not k.strip() for k in API_KEYS):\n",
149
+ " raise ValueError(\"Please paste at least one valid API key into API_KEYS.\")\n",
150
+ "\n",
151
+ "# === Config (adjust paths as needed) ===\n",
152
+ "current_dir = Path.cwd()\n",
153
+ "output_json_dir = current_dir.parent / \"data/adapter_metadata/lora\" # where JSON outputs go\n",
154
+ "temp_dir = current_dir.parent / \"data/raw/adapters_safetensors\" # where downloads go\n",
155
+ "csv_path = current_dir.parent / \"data/csv/adapters_poi_false_sfw.csv\"\n",
156
+ "\n",
157
+ "os.makedirs(output_json_dir, exist_ok=True)\n",
158
+ "os.makedirs(temp_dir, exist_ok=True)\n",
159
+ "\n",
160
+ "# === API key state ===\n",
161
+ "current_key_index = 0\n",
162
+ "\n",
163
+ "\n",
164
+ "\n",
165
+ "def safe_filename(name: str, max_length: int = 100) -> str:\n",
166
+ " # Replace unsafe chars\n",
167
+ " sanitized = re.sub(r'[^a-zA-Z0-9_\\-]', '_', name)\n",
168
+ " # Truncate if too long\n",
169
+ " if len(sanitized) > max_length:\n",
170
+ " sanitized = sanitized[:max_length]\n",
171
+ " return sanitized\n",
172
+ "\n",
173
+ "\n",
174
+ "def get_headers():\n",
175
+ " global current_key_index\n",
176
+ " return {\n",
177
+ " \"Accept\": \"application/json\",\n",
178
+ " \"Authorization\": f\"Bearer {API_KEYS[current_key_index].strip()}\"\n",
179
+ " }\n",
180
+ "\n",
181
+ "def rotate_api_key():\n",
182
+ " global current_key_index\n",
183
+ " if current_key_index < len(API_KEYS) - 1:\n",
184
+ " current_key_index += 1\n",
185
+ " print(f\"🔁 Rotated to API key #{current_key_index + 1}\")\n",
186
+ " else:\n",
187
+ " raise Exception(\"All API keys have been exhausted.\")\n",
188
+ "\n",
189
+ "# === Utilities ===\n",
190
+ "def save_json(data, filename):\n",
191
+ " with open(filename, 'w', encoding=\"utf-8\") as f:\n",
192
+ " json.dump(data, f, indent=4, ensure_ascii=False)\n",
193
+ "\n",
194
+ "def parse_safetensors(file_path):\n",
195
+ " # Minimal, tolerant metadata reader; returns {} on failure.\n",
196
+ " try:\n",
197
+ " with open(file_path, 'rb') as f:\n",
198
+ " file_data = f.read()\n",
199
+ " # Many safetensors use 8-byte header length; this code follows your original logic\n",
200
+ " # (4-byte) but keeps the 8-byte skip. Keep if it's working in your dataset.\n",
201
+ " metadata_size = struct.unpack('<I', file_data[:4])[0]\n",
202
+ " metadata_bytes = file_data[8:8 + metadata_size]\n",
203
+ " metadata_str = metadata_bytes.decode('utf-8', errors='replace')\n",
204
+ " metadata = json.loads(metadata_str)\n",
205
+ " return metadata.get('__metadata__', {})\n",
206
+ " except Exception as e:\n",
207
+ " print(f\"Error parsing safetensors file: {e}\")\n",
208
+ " return {}\n",
209
+ "\n",
210
+ "# === Timeout context ===\n",
211
+ "class TimeoutException(Exception):\n",
212
+ " pass\n",
213
+ "\n",
214
+ "@contextlib.contextmanager\n",
215
+ "def time_limit(seconds):\n",
216
+ " def signal_handler(signum, frame):\n",
217
+ " raise TimeoutException(f\"Timed out after {seconds} seconds\")\n",
218
+ " # Note: SIGALRM works on Unix-like OS; on Windows this will be a no-op.\n",
219
+ " try:\n",
220
+ " signal.signal(signal.SIGALRM, signal_handler)\n",
221
+ " signal.alarm(seconds)\n",
222
+ " except Exception:\n",
223
+ " # Fallback: no hard alarm on non-Unix systems\n",
224
+ " pass\n",
225
+ " try:\n",
226
+ " yield\n",
227
+ " finally:\n",
228
+ " try:\n",
229
+ " signal.alarm(0)\n",
230
+ " except Exception:\n",
231
+ " pass\n",
232
+ "\n",
233
+ "# === Download with timeout, retries, backoff, and key rotation ===\n",
234
+ "def download_file(url, output_folder, timeout=30, overall_timeout=120, max_retries=3):\n",
235
+ " filename = url.split(\"/\")[-1]\n",
236
+ " output_path = os.path.join(output_folder, filename)\n",
237
+ "\n",
238
+ " global current_key_index\n",
239
+ " retries = 0\n",
240
+ " backoff = 2\n",
241
+ "\n",
242
+ " while current_key_index < len(API_KEYS):\n",
243
+ " try:\n",
244
+ " with time_limit(overall_timeout): # global cap per download\n",
245
+ " #print(f\"➡️ GET {url} using key #{current_key_index + 1}\")\n",
246
+ " resp = requests.get(\n",
247
+ " url,\n",
248
+ " headers=get_headers(),\n",
249
+ " stream=True,\n",
250
+ " timeout=(10, timeout), # (connect timeout, per-chunk read timeout)\n",
251
+ " )\n",
252
+ "\n",
253
+ " # Auth errors → rotate key\n",
254
+ " if resp.status_code in (401, 403):\n",
255
+ " print(f\"❌ Auth {resp.status_code} with key #{current_key_index + 1}. Rotating.\")\n",
256
+ " rotate_api_key()\n",
257
+ " retries = 0\n",
258
+ " backoff = 2\n",
259
+ " continue\n",
260
+ "\n",
261
+ " # Not found → bubble up as FileNotFoundError (do not rotate)\n",
262
+ " if resp.status_code == 404:\n",
263
+ " raise FileNotFoundError(f\"Model not found at {url}\")\n",
264
+ "\n",
265
+ " # Rate limit → either rotate or wait/backoff\n",
266
+ " if resp.status_code == 429:\n",
267
+ " print(\"⏳ Rate limited (429).\", end=\" \")\n",
268
+ " if current_key_index < len(API_KEYS) - 1:\n",
269
+ " print(\"Rotating key.\")\n",
270
+ " rotate_api_key()\n",
271
+ " retries = 0\n",
272
+ " backoff = 2\n",
273
+ " continue\n",
274
+ " else:\n",
275
+ " print(f\"Waiting {backoff}s (no other keys).\")\n",
276
+ " time.sleep(backoff)\n",
277
+ " backoff = min(backoff * 2, 60)\n",
278
+ " continue\n",
279
+ "\n",
280
+ " # Other HTTP errors → raise to RequestException path\n",
281
+ " resp.raise_for_status()\n",
282
+ "\n",
283
+ " # Save file\n",
284
+ " with open(output_path, 'wb') as fh:\n",
285
+ " for chunk in resp.iter_content(chunk_size=8192):\n",
286
+ " if chunk:\n",
287
+ " fh.write(chunk)\n",
288
+ "\n",
289
+ " return output_path, filename\n",
290
+ "\n",
291
+ " except TimeoutException as e:\n",
292
+ " # Hard overall timeout → propagate\n",
293
+ " raise e\n",
294
+ " except requests.exceptions.RequestException as e:\n",
295
+ " # Network-ish errors: retry same key with backoff up to max_retries\n",
296
+ " retries += 1\n",
297
+ " if retries <= max_retries:\n",
298
+ " print(f\"🌐 Network error (try {retries}/{max_retries}) with key #{current_key_index + 1}: {e}\")\n",
299
+ " time.sleep(backoff)\n",
300
+ " backoff = min(backoff * 2, 60)\n",
301
+ " continue\n",
302
+ " else:\n",
303
+ " raise Exception(f\"Failed to download {url} after {max_retries} retries: {e}\")\n",
304
+ "\n",
305
+ " # If we exit the loop, we truly ran out\n",
306
+ " raise Exception(\"All API keys have been exhausted or failed.\")\n",
307
+ "\n",
308
+ "# === Main processing ===\n",
309
+ "def process_csv(csv_path):\n",
310
+ " with open(csv_path, newline='', encoding='utf-8') as csvfile:\n",
311
+ " reader = csv.DictReader(csvfile)\n",
312
+ " for index, row in enumerate(reader):\n",
313
+ " # Collect up to 20 version IDs; use the most recent\n",
314
+ " version_ids = []\n",
315
+ " for i in range(1, 21):\n",
316
+ " k = f'version_id_{i}'\n",
317
+ " if k in row and row[k]:\n",
318
+ " try:\n",
319
+ " version_ids.append(int(float(row[k])))\n",
320
+ " except ValueError:\n",
321
+ " print(f\"Invalid version_id value '{row[k]}' in row: {row}\")\n",
322
+ "\n",
323
+ " if not version_ids:\n",
324
+ " print(f\"No valid version IDs found in row: {row}\")\n",
325
+ " continue\n",
326
+ "\n",
327
+ " most_recent_version_id = str(max(version_ids))\n",
328
+ " name = row.get('name', 'unknown')\n",
329
+ " sanitized_name = safe_filename(name, max_length=100)\n",
330
+ " new_json_file = os.path.join(\n",
331
+ " output_json_dir,\n",
332
+ " f\"{index:08d}_{most_recent_version_id}_{sanitized_name}.json\"\n",
333
+ " )\n",
334
+ "\n",
335
+ " # Skip if JSON already exists\n",
336
+ " if os.path.exists(new_json_file):\n",
337
+ " #print(f\"↩️ Skipping versionID {most_recent_version_id} (JSON already exists)\")\n",
338
+ " continue\n",
339
+ "\n",
340
+ " try:\n",
341
+ " adapter_file, fname = download_file(\n",
342
+ " row['downloadUrl'], str(temp_dir),\n",
343
+ " timeout=30, overall_timeout=180\n",
344
+ " )\n",
345
+ " metadata = parse_safetensors(adapter_file)\n",
346
+ "\n",
347
+ " civitaidata = {\n",
348
+ " k: (int(v) if str(v).isdigit() else v)\n",
349
+ " for k, v in row.items()\n",
350
+ " }\n",
351
+ " new_json_data = {\n",
352
+ " \"civitaidata\": civitaidata,\n",
353
+ " \"metadata\": metadata,\n",
354
+ " \"versionID\": most_recent_version_id\n",
355
+ " }\n",
356
+ " save_json(new_json_data, new_json_file)\n",
357
+ " #print(f\"✅ Created JSON for versionID {most_recent_version_id} with file {fname}\")\n",
358
+ "\n",
359
+ " except FileNotFoundError as e:\n",
360
+ " print(f\"⚠️ {e} — saving empty metadata.\")\n",
361
+ " civitaidata = {\n",
362
+ " k: (int(v) if str(v).isdigit() else v)\n",
363
+ " for k, v in row.items()\n",
364
+ " }\n",
365
+ " empty_json = {\n",
366
+ " \"civitaidata\": civitaidata,\n",
367
+ " \"metadata\": {},\n",
368
+ " \"versionID\": most_recent_version_id,\n",
369
+ " \"error\": \"Model not found (404)\"\n",
370
+ " }\n",
371
+ " save_json(empty_json, new_json_file)\n",
372
+ " except Exception as e:\n",
373
+ " print(f\"⚠️ Error processing versionID {most_recent_version_id}: {e}\")\n",
374
+ " civitaidata = {\n",
375
+ " k: (int(v) if str(v).isdigit() else v)\n",
376
+ " for k, v in row.items()\n",
377
+ " }\n",
378
+ " empty_json = {\n",
379
+ " \"civitaidata\": civitaidata,\n",
380
+ " \"metadata\": {},\n",
381
+ " \"versionID\": most_recent_version_id,\n",
382
+ " \"error\": str(e)\n",
383
+ " }\n",
384
+ " save_json(empty_json, new_json_file)\n",
385
+ " print(f\"💾 Saved empty JSON for versionID {most_recent_version_id} due to failure.\")\n",
386
+ "\n",
387
+ "# === Run ===\n",
388
+ "if __name__ == \"__main__\":\n",
389
+ " process_csv(csv_path)\n"
390
+ ]
391
+ }
392
+ ],
393
+ "metadata": {
394
+ "language_info": {
395
+ "name": "python"
396
+ }
397
+ },
398
+ "nbformat": 4,
399
+ "nbformat_minor": 5
400
+ }
jupyter_notebooks/0_Scraping_image_metadata.ipynb ADDED
@@ -0,0 +1,1345 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "1111ea95-d385-49b9-a4d9-ef886ace5c7a",
6
+ "metadata": {
7
+ "execution": {
8
+ "iopub.execute_input": "2025-02-06T11:24:25.566747Z",
9
+ "iopub.status.busy": "2025-02-06T11:24:25.566066Z",
10
+ "iopub.status.idle": "2025-02-06T11:24:25.571748Z",
11
+ "shell.execute_reply": "2025-02-06T11:24:25.571305Z",
12
+ "shell.execute_reply.started": "2025-02-06T11:24:25.566705Z"
13
+ }
14
+ },
15
+ "source": [
16
+ "# 0 Scraping Metadata and Dataset consolidation\n"
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "markdown",
21
+ "id": "6632505a-e7ca-4463-9ffc-e36fad42235f",
22
+ "metadata": {},
23
+ "source": [
24
+ "## IMAGES\n",
25
+ "---"
26
+ ]
27
+ },
28
+ {
29
+ "cell_type": "markdown",
30
+ "id": "e3388bac-bb71-40bc-a693-9ac7a2d5f32c",
31
+ "metadata": {
32
+ "execution": {
33
+ "iopub.execute_input": "2025-02-06T10:08:22.229784Z",
34
+ "iopub.status.busy": "2025-02-06T10:08:22.229287Z",
35
+ "iopub.status.idle": "2025-02-06T10:08:22.232210Z",
36
+ "shell.execute_reply": "2025-02-06T10:08:22.231793Z",
37
+ "shell.execute_reply.started": "2025-02-06T10:08:22.229766Z"
38
+ }
39
+ },
40
+ "source": [
41
+ "### Step 1: Image metadata scraping, sorting and CSV consolidation"
42
+ ]
43
+ },
44
+ {
45
+ "cell_type": "raw",
46
+ "id": "8b5d623a-a647-42ae-9951-efa99048cdf3",
47
+ "metadata": {},
48
+ "source": [
49
+ "Dataset creation workflow\n",
50
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
51
+ "│ Scrape Data from API │ --> │ Use CivitAI API with a timestamp-based cursor to │\n",
52
+ "│ Paginate Using Cursor│ │ scrape image metadata and save in JSON batches. │\n",
53
+ "│ Save in JSON Format │ └───────────────────────────────────────────────────┘\n",
54
+ "└─────────┬────────────┘\n",
55
+ " ↓\n",
56
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
57
+ "│ Locate JSON Files │ --> │ Find all JSON files saved in the directory. │\n",
58
+ "│ in Directory │ │ │\n",
59
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
60
+ " ↓\n",
61
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
62
+ "│ Read Each JSON File │ --> │ Parse JSON files and extract metadata, reactions, │\n",
63
+ "│ Parse and Extract │ │ and resource details. │\n",
64
+ "│ Metadata, Reactions, │ │ │\n",
65
+ "│ and Resources │ │ │\n",
66
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
67
+ " ↓\n",
68
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
69
+ "│ Sort Items by │ --> │ Sort the extracted items chronologically using │\n",
70
+ "│ createdAt Timestamp │ │ their createdAt timestamp. │\n",
71
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
72
+ " ↓\n",
73
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
74
+ "│ Write Extracted Data │ --> │ Save the processed and sorted data into a │\n",
75
+ "│ to a Consolidated │ │ consolidated CSV file with structured columns. │\n",
76
+ "│ CSV File │ │ │\n",
77
+ "└────────��─────────────┘ └───────────────────────────────────────────────────┘\n"
78
+ ]
79
+ },
80
+ {
81
+ "cell_type": "code",
82
+ "execution_count": 1,
83
+ "id": "f8decb63-43f5-4731-823d-94632eee7618",
84
+ "metadata": {
85
+ "execution": {
86
+ "iopub.execute_input": "2025-02-08T19:37:51.115763Z",
87
+ "iopub.status.busy": "2025-02-08T19:37:51.114573Z",
88
+ "iopub.status.idle": "2025-02-08T19:37:51.170027Z",
89
+ "shell.execute_reply": "2025-02-08T19:37:51.169401Z",
90
+ "shell.execute_reply.started": "2025-02-08T19:37:51.115738Z"
91
+ }
92
+ },
93
+ "outputs": [],
94
+ "source": [
95
+ "import os\n",
96
+ "import json\n",
97
+ "import csv\n",
98
+ "import requests\n",
99
+ "from datetime import datetime\n",
100
+ "import time\n",
101
+ "from pathlib import Path\n",
102
+ "import hashlib\n",
103
+ "import pandas as pd\n",
104
+ "import sys\n",
105
+ "from datetime import datetime, timedelta\n",
106
+ "import shutil"
107
+ ]
108
+ },
109
+ {
110
+ "cell_type": "code",
111
+ "execution_count": 2,
112
+ "id": "4b2426c3-96a0-468e-b6dc-78dea9c3e92b",
113
+ "metadata": {
114
+ "execution": {
115
+ "iopub.execute_input": "2025-02-08T19:37:51.809027Z",
116
+ "iopub.status.busy": "2025-02-08T19:37:51.808835Z",
117
+ "iopub.status.idle": "2025-02-08T19:37:51.812922Z",
118
+ "shell.execute_reply": "2025-02-08T19:37:51.812429Z",
119
+ "shell.execute_reply.started": "2025-02-08T19:37:51.809009Z"
120
+ }
121
+ },
122
+ "outputs": [],
123
+ "source": [
124
+ "current_dir = Path.cwd()"
125
+ ]
126
+ },
127
+ {
128
+ "cell_type": "code",
129
+ "execution_count": 3,
130
+ "id": "dd0daa43-d988-4ac5-b9c7-2f9c65078325",
131
+ "metadata": {
132
+ "execution": {
133
+ "iopub.execute_input": "2025-02-07T20:59:30.413939Z",
134
+ "iopub.status.busy": "2025-02-07T20:59:30.412861Z",
135
+ "iopub.status.idle": "2025-02-07T20:59:30.418660Z",
136
+ "shell.execute_reply": "2025-02-07T20:59:30.418054Z",
137
+ "shell.execute_reply.started": "2025-02-07T20:59:30.413918Z"
138
+ }
139
+ },
140
+ "outputs": [],
141
+ "source": [
142
+ "# Define the input timestamp in ISO 8601 format\n",
143
+ "input_timestamp = \"2025-03-24T12:59:03.335Z\" # point in time from when you want to obtain metadata (you can copy the timestamp from the last *.json batch obtained to get the data of longer timespans)\n",
144
+ "\n",
145
+ "# Function to convert an ISO 8601 date string to a Unix timestamp in milliseconds with a 2-hour offset \n",
146
+ "def iso_to_timestamp(iso_str):\n",
147
+ " # Parse the ISO 8601 date string (including milliseconds and 'Z' indicating UTC)\n",
148
+ " date = datetime.strptime(iso_str, \"%Y-%m-%dT%H:%M:%S.%fZ\") \n",
149
+ " # Add 2 hours to the parsed date\n",
150
+ " adjusted_date = date + timedelta(hours=2)\n",
151
+ " # Convert to Unix timestamp (in milliseconds)\n",
152
+ " timestamp = int(adjusted_date.timestamp() * 1000)\n",
153
+ " return timestamp\n",
154
+ "\n",
155
+ "# Convert the input timestamp to an initial cursor\n",
156
+ "initial_cursor = iso_to_timestamp(input_timestamp)\n"
157
+ ]
158
+ },
159
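
The cursor passed to the API is the string `"0|<unix-ms>"`. A quick sanity check of the conversion above; the two-hour shift appears to compensate for the machine running in a UTC+2 timezone, since `datetime.timestamp()` interprets a naive datetime as local time:

```python
from datetime import datetime, timedelta

ts = "2025-03-24T12:59:03.335Z"
date = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")
cursor_ms = int((date + timedelta(hours=2)).timestamp() * 1000)
print(f"0|{cursor_ms}")  # exact value depends on the local timezone
```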
+ {
160
+ "cell_type": "code",
161
+ "execution_count": 4,
162
+ "id": "20b4bc72-80c4-4af6-bc76-aeecfa65f9dc",
163
+ "metadata": {
164
+ "execution": {
165
+ "iopub.execute_input": "2025-02-07T20:59:31.134620Z",
166
+ "iopub.status.busy": "2025-02-07T20:59:31.133548Z",
167
+ "iopub.status.idle": "2025-02-07T20:59:31.138081Z",
168
+ "shell.execute_reply": "2025-02-07T20:59:31.137555Z",
169
+ "shell.execute_reply.started": "2025-02-07T20:59:31.134599Z"
170
+ }
171
+ },
172
+ "outputs": [],
173
+ "source": [
174
+ "data_raw = current_dir.parent / f\"data/raw/image_metadata/001/{input_timestamp.replace(':', '').replace('T', '_').replace('.', '_').replace('Z', '')}\""
175
+ ]
176
+ },
177
+ {
178
+ "cell_type": "markdown",
179
+ "id": "f69c921d-ca57-4746-8124-a0fa4fce80ff",
180
+ "metadata": {},
181
+ "source": [
182
+ "#### The *.json files will be saved in the following structure:"
183
+ ]
184
+ },
185
+ {
186
+ "cell_type": "raw",
187
+ "id": "d731001e-32a4-4894-a2fe-835ea69214fc",
188
+ "metadata": {},
189
+ "source": [
190
+ "data/\n",
191
+ "└── raw/\n",
192
+ " └── image_metadata\n",
193
+ " └── 001/\n",
194
+ " └── YYYYMMDD_HHMMSS/ # Folder named based on the input timestamp\n",
195
+ " ├── most_recent_1.json # Sequential JSON files\n",
196
+ " ├── most_recent_2.json\n",
197
+ " ├── ...\n",
198
+ " ├── most_recent_50000.json\n",
199
+ " └── YYYYMMDD_HHMMSS_session_TIMESTAMP/ # New folder after API limit is reached\n",
200
+ " ├── most_recent_50001.json\n",
201
+ " ├── most_recent_50002.json\n",
202
+ " ├── ...\n"
203
+ ]
204
+ },
205
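
For reference, this is how the folder name shown above is derived from `input_timestamp` by the chain of `replace` calls used in the path cell:

```python
input_timestamp = "2025-03-24T12:59:03.335Z"
folder = (input_timestamp.replace(":", "").replace("T", "_")
          .replace(".", "_").replace("Z", ""))
print(folder)  # -> 2025-03-24_125903_335
```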
+ {
206
+ "cell_type": "code",
207
+ "execution_count": 5,
208
+ "id": "4cb3c005-bf70-43fc-afe8-93bf30872fb4",
209
+ "metadata": {
210
+ "execution": {
211
+ "iopub.execute_input": "2025-02-06T16:51:57.402732Z",
212
+ "iopub.status.busy": "2025-02-06T16:51:57.402532Z",
213
+ "iopub.status.idle": "2025-02-06T16:52:00.855604Z",
214
+ "shell.execute_reply": "2025-02-06T16:52:00.854678Z",
215
+ "shell.execute_reply.started": "2025-02-06T16:51:57.402712Z"
216
+ }
217
+ },
218
+ "outputs": [],
219
+ "source": [
220
+ "max_images = 50000 # some huge number now redundant because CivitAI API currently limits free api usage to 50'000\n",
221
+ "\n",
222
+ "def get_image_metadata():\n",
223
+ " base_url = \"https://civitai.com/api/v1/images\"\n",
224
+ " headers = {\n",
225
+ " \"Accept\": \"application/json\",\n",
226
+ " \"Authorization\": \"Bearer APITOKEN\" # Replace with your actual API token\n",
227
+ " }\n",
228
+ " params = {\n",
229
+ " \"sort\": \"Most Reactions\",\n",
230
+ " \"nsfw\": \"X\",\n",
231
+ " \"cursor\": f\"0|{initial_cursor}\"\n",
232
+ " }\n",
233
+ "\n",
234
+ " # Use pathlib to create the base directory for saving files\n",
235
+ "\n",
236
+ " file_counter = 0\n",
237
+ "\n",
238
+ " # Create a folder based on the input timestamp\n",
239
+ " data_raw.mkdir(parents=True, exist_ok=True)\n",
240
+ " sub_directory_path = data_raw\n",
241
+ "\n",
242
+ " retry_delay = 300 # 300 seconds / 5 minutes\n",
243
+ " retries_without_cursor = 0 # Track consecutive retries without a new cursor\n",
244
+ "\n",
245
+ " while True:\n",
246
+ " response = requests.get(base_url, headers=headers, params=params)\n",
247
+ " if response.status_code == 200:\n",
248
+ " data = response.json()\n",
249
+ " items = data.get('items', [])\n",
250
+ " if not items:\n",
251
+ " print(\"No more data available.\")\n",
252
+ " retries_without_cursor += 1\n",
253
+ " if retries_without_cursor > 5: # Allow up to 5 retries before stopping\n",
254
+ " print(\"Reached the end of the data after multiple retries.\")\n",
255
+ " break\n",
256
+ " time.sleep(retry_delay)\n",
257
+ " continue\n",
258
+ "\n",
259
+ " retries_without_cursor = 0 # Reset if we get data\n",
260
+ "\n",
261
+ " next_cursor = data.get('metadata', {}).get('nextCursor')\n",
262
+ " if next_cursor:\n",
263
+ " # Increment the cursor by 50 (e.g., \"0|1722470401000\" -> \"50|1722470401000\")\n",
264
+ " cursor_value = int(params['cursor'].split(\"|\")[0])\n",
265
+ " new_cursor_value = cursor_value + 50\n",
266
+ " params['cursor'] = f\"{new_cursor_value}|{params['cursor'].split('|')[1]}\"\n",
267
+ " else:\n",
268
+ " print(\"No new cursor returned, stopping.\")\n",
269
+ " break\n",
270
+ "\n",
271
+ " file_counter += 1\n",
272
+ " if file_counter % max_images == 0:\n",
273
+ " time_stamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n",
274
+ " sub_directory_path = data_raw / f\"{input_timestamp.replace(':', '').replace('T', '_').replace('.', '_').replace('Z', '')}_session_{time_stamp}\"\n",
275
+ " sub_directory_path.mkdir(parents=True, exist_ok=True)\n",
276
+ "\n",
277
+ " file_path = sub_directory_path / f'most_recent_{file_counter}.json'\n",
278
+ " with open(file_path, 'w', encoding='utf-8') as file:\n",
279
+ " json.dump(data, file, indent=4)\n",
280
+ "\n",
281
+ " elif response.status_code == 502:\n",
282
+ " print(f\"Received HTTP 502. Retrying in {retry_delay // 60} minutes.\")\n",
283
+ " time.sleep(retry_delay) # Wait for 5 minutes before retrying\n",
284
+ " else:\n",
285
+ " print(f\"Failed to fetch data: HTTP {response.status_code}\")\n",
286
+ " break\n"
287
+ ]
288
+ },
289
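
To pick up where a capped run left off, the comment on `input_timestamp` earlier suggests copying the timestamp of the last scraped item. A hedged sketch of automating that lookup, assuming the batch files follow the schema written by `get_image_metadata()` above:

```python
import json
from pathlib import Path

def last_created_at(session_dir: Path) -> str | None:
    # Find the highest-numbered most_recent_*.json batch and return the
    # createdAt of its last item (None if nothing was scraped yet).
    batches = sorted(session_dir.glob("most_recent_*.json"),
                     key=lambda p: int(p.stem.split("_")[-1]))
    if not batches:
        return None
    with open(batches[-1], encoding="utf-8") as f:
        items = json.load(f).get("items", [])
    return items[-1].get("createdAt") if items else None
```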
+ {
290
+ "cell_type": "markdown",
291
+ "id": "238044f1-544f-4821-a91e-57386d464f9e",
292
+ "metadata": {
293
+ "execution": {
294
+ "iopub.execute_input": "2025-02-06T16:13:44.748157Z",
295
+ "iopub.status.busy": "2025-02-06T16:13:44.747986Z",
296
+ "iopub.status.idle": "2025-02-06T16:13:44.752230Z",
297
+ "shell.execute_reply": "2025-02-06T16:13:44.751694Z",
298
+ "shell.execute_reply.started": "2025-02-06T16:13:44.748140Z"
299
+ }
300
+ },
301
+ "source": [
302
+ "uncomment this line to scrape image metadata:"
303
+ ]
304
+ },
305
+ {
306
+ "cell_type": "code",
307
+ "execution_count": null,
308
+ "id": "7c89cb68-983b-47df-a028-e02d7ca0829d",
309
+ "metadata": {
310
+ "execution": {
311
+ "iopub.execute_input": "2025-02-06T16:52:00.857210Z",
312
+ "iopub.status.busy": "2025-02-06T16:52:00.857015Z",
313
+ "iopub.status.idle": "2025-02-06T16:52:04.901096Z",
314
+ "shell.execute_reply": "2025-02-06T16:52:04.900159Z",
315
+ "shell.execute_reply.started": "2025-02-06T16:52:00.857191Z"
316
+ }
317
+ },
318
+ "outputs": [],
319
+ "source": [
320
+ "get_image_metadata()"
321
+ ]
322
+ },
323
+ {
324
+ "cell_type": "markdown",
325
+ "id": "06f5f27e-7278-44c9-ae34-9a2020cac2f2",
326
+ "metadata": {},
327
+ "source": [
328
+ "---\n",
329
+ "---"
330
+ ]
331
+ },
332
+ {
333
+ "cell_type": "markdown",
334
+ "id": "648aee1b-9e01-4400-a6a8-3361c97c377a",
335
+ "metadata": {
336
+ "execution": {
337
+ "iopub.execute_input": "2025-02-06T11:33:11.524990Z",
338
+ "iopub.status.busy": "2025-02-06T11:33:11.523287Z",
339
+ "iopub.status.idle": "2025-02-06T11:33:11.531385Z",
340
+ "shell.execute_reply": "2025-02-06T11:33:11.530884Z",
341
+ "shell.execute_reply.started": "2025-02-06T11:33:11.524893Z"
342
+ }
343
+ },
344
+ "source": [
345
+ "### Step 02 Chronological Sorting"
346
+ ]
347
+ },
348
+ {
349
+ "cell_type": "code",
350
+ "execution_count": 7,
351
+ "id": "b73d7ea5-50e0-485d-8bc5-a4736f31c789",
352
+ "metadata": {
353
+ "execution": {
354
+ "iopub.execute_input": "2025-02-06T16:52:04.902663Z",
355
+ "iopub.status.busy": "2025-02-06T16:52:04.902478Z",
356
+ "iopub.status.idle": "2025-02-06T16:52:08.491068Z",
357
+ "shell.execute_reply": "2025-02-06T16:52:08.490105Z",
358
+ "shell.execute_reply.started": "2025-02-06T16:52:04.902646Z"
359
+ }
360
+ },
361
+ "outputs": [],
362
+ "source": [
363
+ "def organize_files(source_dir, target_dir, max_items_per_file=100):\n",
364
+ " print(f\"Starting to organize files from {source_dir} to {target_dir}\")\n",
365
+ " item_buffer = []\n",
366
+ " file_count = 0\n",
367
+ "\n",
368
+ " # Walk through all files in the source directory\n",
369
+ " for root, dirs, files in os.walk(source_dir):\n",
370
+ " if '.ipynb_checkpoints' in root:\n",
371
+ " continue # Skip .ipynb_checkpoints directories\n",
372
+ " print(f\"Checking directory: {root}\")\n",
373
+ " for filename in files:\n",
374
+ " if filename.lower().endswith('.json'):\n",
375
+ " file_path = os.path.join(root, filename)\n",
376
+ " try:\n",
377
+ " with open(file_path, 'r') as file:\n",
378
+ " data = json.load(file)\n",
379
+ " items = data.get('items', []) # Get the list of items\n",
380
+ " for item in items:\n",
381
+ " item_buffer.append(item)\n",
382
+ " # Write out the buffer if it has reached the maximum size\n",
383
+ " if len(item_buffer) >= max_items_per_file:\n",
384
+ " write_items(item_buffer[:max_items_per_file], target_dir)\n",
385
+ " item_buffer = item_buffer[max_items_per_file:]\n",
386
+ " file_count += 1\n",
387
+ " except json.JSONDecodeError:\n",
388
+ " print(f\"Error decoding JSON from file {file_path}\")\n",
389
+ " except Exception as e:\n",
390
+ " print(f\"An error occurred with file {file_path}: {e}\")\n",
391
+ "\n",
392
+ " # Write any remaining items in the buffer\n",
393
+ " if item_buffer:\n",
394
+ " write_items(item_buffer, target_dir)\n",
395
+ " file_count += 1\n",
396
+ "\n",
397
+ " #print(f\"Processed {file_count} files.\")\n",
398
+ "\n",
399
+ "def write_items(items, target_dir):\n",
400
+ " # Use the createdAt from the first item to determine the directory\n",
401
+ " created_at = items[0].get('createdAt')\n",
402
+ " if created_at:\n",
403
+ " date_obj = datetime.fromisoformat(created_at.rstrip(\"Z\"))\n",
404
+ " new_dir = os.path.join(target_dir, f\"{date_obj.year}\", f\"{date_obj.year}-{date_obj.month:02d}\", f\"{date_obj.year}-{date_obj.month:02d}-{date_obj.day:02d}\")\n",
405
+ " os.makedirs(new_dir, exist_ok=True)\n",
406
+ " new_file_path = os.path.join(new_dir, f\"batch_{date_obj.strftime('%Y%m%dT%H%M%S')}.json\")\n",
407
+ " with open(new_file_path, 'w', encoding='utf-8') as new_file:\n",
408
+ " json.dump(items, new_file, indent=4)\n",
409
+ " #print(f\"Wrote {len(items)} items to {new_file_path}\")\n"
410
+ ]
411
+ },
412
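
As a concrete example of the date-derived path `write_items()` produces (matching the file named in the error message that appears in the CSV step below):

```python
import os
from datetime import datetime

created_at = "2024-07-29T19:28:00.000Z"
d = datetime.fromisoformat(created_at.rstrip("Z"))
print(os.path.join("data/sorted/image_metadata",
                   f"{d.year}", f"{d.year}-{d.month:02d}",
                   f"{d.year}-{d.month:02d}-{d.day:02d}",
                   f"batch_{d.strftime('%Y%m%dT%H%M%S')}.json"))
# -> data/sorted/image_metadata/2024/2024-07/2024-07-29/batch_20240729T192800.json
```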
+ {
413
+ "cell_type": "code",
414
+ "execution_count": 8,
415
+ "id": "07cebb8a-2955-4432-a0d1-7ee94975a34b",
416
+ "metadata": {
417
+ "execution": {
418
+ "iopub.execute_input": "2025-02-06T16:52:08.492517Z",
419
+ "iopub.status.busy": "2025-02-06T16:52:08.492328Z",
420
+ "iopub.status.idle": "2025-02-06T16:53:13.648165Z",
421
+ "shell.execute_reply": "2025-02-06T16:53:13.647511Z",
422
+ "shell.execute_reply.started": "2025-02-06T16:52:08.492499Z"
423
+ }
424
+ },
425
+ "outputs": [
426
+ {
427
+ "name": "stdout",
428
+ "output_type": "stream",
429
+ "text": [
430
+ "Starting to organize files from /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/image_metadata/001/2025-02-01_000000_000 to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/sorted/image_metadata\n",
431
+ "Checking directory: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/image_metadata/001/2025-02-01_000000_000\n"
432
+ ]
433
+ }
434
+ ],
435
+ "source": [
436
+ "data_sorted = current_dir.parent / 'data/sorted/image_metadata'\n",
437
+ "\n",
438
+ "origin = data_raw\n",
439
+ "target = data_sorted\n",
440
+ "\n",
441
+ "organize_files( origin, target)"
442
+ ]
443
+ },
444
+ {
445
+ "cell_type": "markdown",
446
+ "id": "9255fd24-54ee-4235-bc44-a0246e2fd927",
447
+ "metadata": {},
448
+ "source": [
449
+ "---\n",
450
+ "---"
451
+ ]
452
+ },
453
+ {
454
+ "cell_type": "markdown",
455
+ "id": "64639464-6dc4-4997-a805-3f9f41f305c7",
456
+ "metadata": {},
457
+ "source": [
458
+ "### Step 3 CSV consolidation"
459
+ ]
460
+ },
461
+ {
462
+ "cell_type": "markdown",
463
+ "id": "add5f97b-562a-4af5-a8f2-635433a8158f",
464
+ "metadata": {
465
+ "execution": {
466
+ "iopub.execute_input": "2025-02-06T12:03:15.512133Z",
467
+ "iopub.status.busy": "2025-02-06T12:03:15.511399Z",
468
+ "iopub.status.idle": "2025-02-06T12:03:15.516297Z",
469
+ "shell.execute_reply": "2025-02-06T12:03:15.515815Z",
470
+ "shell.execute_reply.started": "2025-02-06T12:03:15.512114Z"
471
+ }
472
+ },
473
+ "source": [
474
+ "#### this script walks through the directories in **image-metadata/** and creates the dataset as **.csv** \n"
475
+ ]
476
+ },
477
+ {
478
+ "cell_type": "markdown",
479
+ "id": "0f8c83aa-c143-4fac-b890-a8d4249067a0",
480
+ "metadata": {
481
+ "execution": {
482
+ "iopub.execute_input": "2025-02-06T12:02:29.925370Z",
483
+ "iopub.status.busy": "2025-02-06T12:02:29.923935Z",
484
+ "iopub.status.idle": "2025-02-06T12:02:31.421575Z",
485
+ "shell.execute_reply": "2025-02-06T12:02:31.420575Z",
486
+ "shell.execute_reply.started": "2025-02-06T12:02:29.925332Z"
487
+ }
488
+ },
489
+ "source": [
490
+ "\n",
491
+ "### Dataset Columns\n",
492
+ "\n",
493
+ "| **Column Name** | **Description** |\n",
494
+ "|------------------------|-----------------------------------------------------------------------------------------------|\n",
495
+ "| `createdAt` | Timestamp when the item was created. |\n",
496
+ "| `url` | URL associated with the item. |\n",
497
+ "| `positivePrompt` | Positive prompts used in the generation process. |\n",
498
+ "| `negativePrompt` | Negative prompts used in the generation process. |\n",
499
+ "| `nsfw` | Indicates whether the item is NSFW (Not Safe for Work). |\n",
500
+ "| `nsfwLevel` | Level of NSFW content (e.g., Soft, Mature). we only considered SFW!! |\n",
501
+ "| `browsingLevel` | Browsing level required to access the content. |\n",
502
+ "| `statsSummary` | All social reactions summed up: cryCount, likeCount, heartCount, CommentCount |\n",
503
+ "| `commentCount` | Number of comments received. |\n",
504
+ "| `username` | Username of the creator of the item. |\n",
505
+ "| `Model` | Model used to generate the content. |\n",
506
+ "| `Meta` | Simplified meta details, including size, seed, steps, sampler, and version. |\n",
507
+ "| `VAE` | Variational Autoencoder (VAE) used, if any. |\n",
508
+ "| `resourceIDs`| Array of up to six resources (LoRA etc.), including name, type, and weight. |\n",
509
+ "\n"
510
+ ]
511
+ },
512
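
A note for downstream use: `resourceIDs` is stored as an unquoted bracketed string (e.g. `"[hash1, name2]"`), so `ast.literal_eval` will not parse it. A minimal sketch of reading it back into Python lists (the path assumes the layout used in this notebook):

```python
import pandas as pd

df = pd.read_csv("data/CSV/Civiverse.csv", low_memory=False)  # path assumed
df["resourceIDs"] = df["resourceIDs"].fillna("[]").apply(
    lambda s: [t for t in s.strip("[]").split(", ") if t]
)
```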
+ {
513
+ "cell_type": "code",
514
+ "execution_count": 5,
515
+ "id": "f8db24f9-6678-4e7d-a3b7-453f33ad0f71",
516
+ "metadata": {
517
+ "execution": {
518
+ "iopub.execute_input": "2025-02-07T20:59:38.328530Z",
519
+ "iopub.status.busy": "2025-02-07T20:59:38.328049Z",
520
+ "iopub.status.idle": "2025-02-07T20:59:38.342970Z",
521
+ "shell.execute_reply": "2025-02-07T20:59:38.342400Z",
522
+ "shell.execute_reply.started": "2025-02-07T20:59:38.328508Z"
523
+ }
524
+ },
525
+ "outputs": [],
526
+ "source": [
527
+ "def find_json_files(directory):\n",
528
+ " \"\"\"Walk through the directory and its subdirectories to find all JSON files.\"\"\"\n",
529
+ " json_files = []\n",
530
+ " for root, _, files in os.walk(directory):\n",
531
+ " for file in files:\n",
532
+ " if file.endswith('.json'):\n",
533
+ " json_files.append(os.path.join(root, file))\n",
534
+ " return json_files\n",
535
+ "\n",
536
+ "def hash_username(username):\n",
537
+ " \"\"\"Convert a username into a unique hash.\"\"\"\n",
538
+ " return hashlib.sha256(username.encode('utf-8')).hexdigest()[:16] # Use first 16 characters for brevity\n",
539
+ "\n",
540
+ "def extract_resource_ids(meta, resources):\n",
541
+ " \"\"\"Extract resource IDs from metadata and resources.\"\"\"\n",
542
+ " resource_ids = set() # Use a set to avoid duplicates\n",
543
+ " hash_ids = set() # Separate set to track hashes specifically\n",
544
+ "\n",
545
+ " # Extract CivitAI model IDs\n",
546
+ " if \"civitaiResources\" in meta:\n",
547
+ " for resource in meta[\"civitaiResources\"]:\n",
548
+ " if \"modelVersionId\" in resource:\n",
549
+ " resource_ids.add(str(resource[\"modelVersionId\"])) # Ensure all IDs are strings\n",
550
+ "\n",
551
+ " # Extract model hashes\n",
552
+ " if \"Model hash\" in meta:\n",
553
+ " hash_ids.add(meta[\"Model hash\"])\n",
554
+ " if \"base_model_hash\" in meta:\n",
555
+ " hash_ids.add(meta[\"base_model_hash\"])\n",
556
+ " if \"models\" in meta and isinstance(meta[\"models\"], list):\n",
557
+ " hash_ids.update(str(model) for model in meta[\"models\"])\n",
558
+ "\n",
559
+ " # Extract identifiers from resources list\n",
560
+ " for resource in resources:\n",
561
+ " if isinstance(resource, dict): # Ensure resource is a dictionary\n",
562
+ " resource_hash = resource.get(\"hash\")\n",
563
+ " if resource_hash:\n",
564
+ " hash_ids.add(str(resource_hash)) # Add to hash set\n",
565
+ " else:\n",
566
+ " name = resource.get(\"name\")\n",
567
+ " if name:\n",
568
+ " resource_ids.add(str(name))\n",
569
+ "\n",
570
+ " # Remove any named IDs if their hash exists\n",
571
+ " resource_ids = {rid for rid in resource_ids if rid not in hash_ids}\n",
572
+ "\n",
573
+ " # Combine hashes and remaining resource IDs\n",
574
+ " final_ids = list(hash_ids) + list(resource_ids)\n",
575
+ " return final_ids\n",
576
+ "\n",
577
+ "def format_resource_ids(resource_ids):\n",
578
+ " \"\"\"Format resource IDs in a standardized list format.\"\"\"\n",
579
+ " formatted_ids = [str(rid).strip() for rid in resource_ids if rid]\n",
580
+ " return f\"[{', '.join(formatted_ids)}]\"\n",
581
+ "\n",
582
+ "def truncate_prompt(prompt, max_tokens=77):\n",
583
+ " \"\"\"Truncate a prompt to a maximum number of tokens.\"\"\"\n",
584
+ " return ' '.join(prompt.split()[:max_tokens]) if prompt else \"\"\n",
585
+ "\n",
586
+ "def write_to_csv(json_files, output_csv):\n",
587
+ " \"\"\"Read JSON files, extract data, and write to a CSV file.\"\"\"\n",
588
+ " with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:\n",
589
+ " fieldnames = [\n",
590
+ " 'id', 'createdAt', 'url', 'positivePrompt', 'negativePrompt', 'nsfw',\n",
591
+ " 'browsingLevel', 'statsSummary', 'usernameHash', 'Model', 'cfgScale',\n",
592
+ " 'sampler', 'Size', 'seed', 'VAE', 'generationSystem', 'resourceIDs'\n",
593
+ " ]\n",
594
+ " writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\n",
595
+ " writer.writeheader()\n",
596
+ "\n",
597
+ " for json_file in json_files:\n",
598
+ " with open(json_file, 'r', encoding='utf-8') as file:\n",
599
+ " try:\n",
600
+ " data = json.load(file)\n",
601
+ " for item in data:\n",
602
+ " meta = item.get('meta', {}) or {}\n",
603
+ " stats = item.get('stats', {}) or {}\n",
604
+ " resources = meta.get('resources', []) if isinstance(meta, dict) else []\n",
605
+ "\n",
606
+ " # Anonymize username\n",
607
+ " username = item.get('username', '')\n",
608
+ " username_hash = hash_username(username) if username else ''\n",
609
+ "\n",
610
+ " # Extract resource IDs\n",
611
+ " resource_ids = extract_resource_ids(meta, resources)\n",
612
+ " formatted_resource_ids = format_resource_ids(resource_ids)\n",
613
+ "\n",
614
+ " # Summarize stats counts\n",
615
+ " stats_summary = sum(stats.get(key, 0) for key in stats)\n",
616
+ "\n",
617
+ " # Truncate prompts\n",
618
+ " positive_prompt = truncate_prompt(meta.get('prompt', ''))\n",
619
+ " negative_prompt = truncate_prompt(meta.get('negativePrompt', ''))\n",
620
+ "\n",
621
+ " # Extract metadata fields\n",
622
+ " cfg_scale = meta.get('cfgScale', 'N/A')\n",
623
+ " sampler = meta.get('sampler', 'N/A')\n",
624
+ " size = meta.get('Size', 'N/A')\n",
625
+ " seed = meta.get('seed', 'N/A')\n",
626
+ " vae = meta.get('VAE', 'N/A')\n",
627
+ " generation_system = \"CivitAI\" if \"civitaiResources\" in meta else \"Undetermined\"\n",
628
+ "\n",
629
+ " # Process a single row\n",
630
+ " row = {\n",
631
+ " 'id': item.get('id', ''),\n",
632
+ " 'createdAt': item.get('createdAt', ''),\n",
633
+ " 'url': item.get('url', ''),\n",
634
+ " 'positivePrompt': positive_prompt,\n",
635
+ " 'negativePrompt': negative_prompt,\n",
636
+ " 'nsfw': item.get('nsfw', False),\n",
637
+ " 'browsingLevel': item.get('browsingLevel', 'N/A'),\n",
638
+ " 'statsSummary': stats_summary,\n",
639
+ " 'usernameHash': username_hash,\n",
640
+ " 'Model': meta.get('Model', ''),\n",
641
+ " 'cfgScale': cfg_scale,\n",
642
+ " 'sampler': sampler,\n",
643
+ " 'Size': size,\n",
644
+ " 'seed': seed,\n",
645
+ " 'VAE': vae,\n",
646
+ " 'generationSystem': generation_system,\n",
647
+ " 'resourceIDs': formatted_resource_ids\n",
648
+ " }\n",
649
+ "\n",
650
+ " writer.writerow(row)\n",
651
+ " except (json.JSONDecodeError, KeyError, TypeError) as e:\n",
652
+ " print(f\"Error processing file {json_file}: {e}\")\n"
653
+ ]
654
+ },
655
+ {
656
+ "cell_type": "markdown",
657
+ "id": "1c51189e-5eb2-40ef-a5dd-807cd9cded5f",
658
+ "metadata": {
659
+ "execution": {
660
+ "iopub.execute_input": "2025-02-06T12:49:11.129158Z",
661
+ "iopub.status.busy": "2025-02-06T12:49:11.128195Z",
662
+ "iopub.status.idle": "2025-02-06T12:49:11.132440Z",
663
+ "shell.execute_reply": "2025-02-06T12:49:11.131779Z",
664
+ "shell.execute_reply.started": "2025-02-06T12:49:11.129132Z"
665
+ }
666
+ },
667
+ "source": [
668
+ "### Save as 'Civiverse' dataset in data/CSV"
669
+ ]
670
+ },
671
+ {
672
+ "cell_type": "code",
673
+ "execution_count": 6,
674
+ "id": "07c97ff9-7ce0-43b7-a156-f5fb79c49cd9",
675
+ "metadata": {
676
+ "execution": {
677
+ "iopub.execute_input": "2025-02-07T20:59:40.218975Z",
678
+ "iopub.status.busy": "2025-02-07T20:59:40.218224Z",
679
+ "iopub.status.idle": "2025-02-07T21:04:11.221872Z",
680
+ "shell.execute_reply": "2025-02-07T21:04:11.220974Z",
681
+ "shell.execute_reply.started": "2025-02-07T20:59:40.218926Z"
682
+ }
683
+ },
684
+ "outputs": [
685
+ {
686
+ "name": "stdout",
687
+ "output_type": "stream",
688
+ "text": [
689
+ "Error processing file /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/sorted/image_metadata/2024/2024-07/2024-07-29/batch_20240729T192800.json: Expecting value: line 1 column 1 (char 0)\n"
690
+ ]
691
+ }
692
+ ],
693
+ "source": [
694
+ "from pathlib import Path\n",
695
+ "\n",
696
+ "raw = current_dir.parent / 'data/sorted/image_metadata/'\n",
697
+ "csv_output = current_dir.parent / 'data/CSV/Civiverse.csv'\n",
698
+ "\n",
699
+ "csv_output.parent.mkdir(parents=True, exist_ok=True)\n",
700
+ "\n",
701
+ "json_files = find_json_files(raw)\n",
702
+ "write_to_csv(json_files, csv_output)"
703
+ ]
704
+ },
705
+ {
706
+ "cell_type": "markdown",
707
+ "id": "b3e64b07-98d1-480d-b4e1-f624a17db848",
708
+ "metadata": {
709
+ "execution": {
710
+ "iopub.execute_input": "2025-02-06T13:50:08.120671Z",
711
+ "iopub.status.busy": "2025-02-06T13:50:08.119949Z",
712
+ "iopub.status.idle": "2025-02-06T13:50:08.124761Z",
713
+ "shell.execute_reply": "2025-02-06T13:50:08.124298Z",
714
+ "shell.execute_reply.started": "2025-02-06T13:50:08.120632Z"
715
+ }
716
+ },
717
+ "source": [
718
+ "### Optional step 4: Create sampled subset"
719
+ ]
720
+ },
721
+ {
722
+ "cell_type": "code",
723
+ "execution_count": 11,
724
+ "id": "b531b8cd-4653-451d-a6d5-386a2721827c",
725
+ "metadata": {
726
+ "execution": {
727
+ "iopub.execute_input": "2025-02-06T16:55:01.916121Z",
728
+ "iopub.status.busy": "2025-02-06T16:55:01.915489Z",
729
+ "iopub.status.idle": "2025-02-06T16:55:01.920043Z",
730
+ "shell.execute_reply": "2025-02-06T16:55:01.919655Z",
731
+ "shell.execute_reply.started": "2025-02-06T16:55:01.916103Z"
732
+ }
733
+ },
734
+ "outputs": [],
735
+ "source": [
736
+ "input_file = current_dir.parent / 'data/CSV/Civiverse.csv'\n",
737
+ "output_file = current_dir.parent / 'data/CSV/image_subsets/Civiverse-mini.csv'\n",
738
+ "output_file.parent.mkdir(parents=True, exist_ok=True)"
739
+ ]
740
+ },
741
+ {
742
+ "cell_type": "code",
743
+ "execution_count": 12,
744
+ "id": "b82db8b3-310c-49e4-9d7a-aa34662f9556",
745
+ "metadata": {
746
+ "execution": {
747
+ "iopub.execute_input": "2025-02-06T16:55:01.920861Z",
748
+ "iopub.status.busy": "2025-02-06T16:55:01.920586Z",
749
+ "iopub.status.idle": "2025-02-06T16:55:06.790164Z",
750
+ "shell.execute_reply": "2025-02-06T16:55:06.789503Z",
751
+ "shell.execute_reply.started": "2025-02-06T16:55:01.920844Z"
752
+ }
753
+ },
754
+ "outputs": [
755
+ {
756
+ "name": "stderr",
757
+ "output_type": "stream",
758
+ "text": [
759
+ "/sctmp/lauwag/ipykernel_1043676/3764041865.py:4: DtypeWarning: Columns (13) have mixed types. Specify dtype option on import or set low_memory=False.\n",
760
+ " df = pd.read_csv(input_file)\n",
761
+ "/sctmp/lauwag/ipykernel_1043676/3764041865.py:10: SettingWithCopyWarning: \n",
762
+ "A value is trying to be set on a copy of a slice from a DataFrame.\n",
763
+ "Try using .loc[row_indexer,col_indexer] = value instead\n",
764
+ "\n",
765
+ "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
766
+ " df_sample['createdAt'] = pd.to_datetime(df_sample['createdAt'], errors='coerce')\n"
767
+ ]
768
+ }
769
+ ],
770
+ "source": [
771
+ "\n",
772
+ "# Define the sampling rate\n",
773
+ "n = 100\n",
774
+ "# Read the CSV file\n",
775
+ "df = pd.read_csv(input_file)\n",
776
+ "\n",
777
+ "# Sample every 15th row\n",
778
+ "df_sample = df.iloc[::n, :]\n",
779
+ "\n",
780
+ "# Convert the 'createdAt' column to datetime, inferring mixed formats\n",
781
+ "df_sample['createdAt'] = pd.to_datetime(df_sample['createdAt'], errors='coerce')\n",
782
+ "\n",
783
+ "# Sort the sampled data by the 'createdAt' column\n",
784
+ "df_sample_sorted = df_sample.sort_values(by='createdAt')\n",
785
+ "\n",
786
+ "# Save the sorted sample DataFrame to a new CSV file\n",
787
+ "df_sample_sorted.to_csv(output_file, index=False)"
788
+ ]
789
+ },
790
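
Taking every n-th row is deterministic but can alias with periodic posting patterns; a seeded random sample of the same expected size is a common alternative. This sketch reuses `df`, `n`, and `output_file` from the cell above; point it at a distinct path if both subsets are wanted:

```python
# Seeded random sample instead of the systematic every-n-th slice.
df_random = df.sample(frac=1 / n, random_state=42).copy()
df_random["createdAt"] = pd.to_datetime(df_random["createdAt"], errors="coerce")
df_random.sort_values(by="createdAt").to_csv(output_file, index=False)
```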
+ {
791
+ "cell_type": "markdown",
792
+ "id": "d915b7d6-937a-4fa9-9edc-ded67e3606f5",
793
+ "metadata": {},
794
+ "source": [
795
+ "---\n",
796
+ "---\n",
797
+ "---"
798
+ ]
799
+ },
800
+ {
801
+ "cell_type": "markdown",
802
+ "id": "11647bb7-5ce9-414a-8486-5bdce8d9cfea",
803
+ "metadata": {
804
+ "execution": {
805
+ "iopub.execute_input": "2025-02-06T12:49:43.126762Z",
806
+ "iopub.status.busy": "2025-02-06T12:49:43.125797Z",
807
+ "iopub.status.idle": "2025-02-06T12:49:43.129759Z",
808
+ "shell.execute_reply": "2025-02-06T12:49:43.129176Z",
809
+ "shell.execute_reply.started": "2025-02-06T12:49:43.126736Z"
810
+ }
811
+ },
812
+ "source": [
813
+ "## MODELS"
814
+ ]
815
+ },
816
+ {
817
+ "cell_type": "markdown",
818
+ "id": "5c9cb5e3-7cea-4574-9319-f3cd89354b1f",
819
+ "metadata": {
820
+ "execution": {
821
+ "iopub.execute_input": "2025-02-06T13:23:38.017078Z",
822
+ "iopub.status.busy": "2025-02-06T13:23:38.016639Z",
823
+ "iopub.status.idle": "2025-02-06T13:23:38.019993Z",
824
+ "shell.execute_reply": "2025-02-06T13:23:38.019549Z",
825
+ "shell.execute_reply.started": "2025-02-06T13:23:38.017053Z"
826
+ }
827
+ },
828
+ "source": [
829
+ "### Step 1: Scrape model metadata"
830
+ ]
831
+ },
832
+ {
833
+ "cell_type": "markdown",
834
+ "id": "83e12b9f-5ae1-407d-9754-5979d837f787",
835
+ "metadata": {
836
+ "execution": {
837
+ "iopub.execute_input": "2025-02-06T14:06:04.857874Z",
838
+ "iopub.status.busy": "2025-02-06T14:06:04.857500Z",
839
+ "iopub.status.idle": "2025-02-06T14:06:04.860438Z",
840
+ "shell.execute_reply": "2025-02-06T14:06:04.860030Z",
841
+ "shell.execute_reply.started": "2025-02-06T14:06:04.857856Z"
842
+ }
843
+ },
844
+ "source": [
845
+ "#### the resulting files will appear in data/raw/model_metadata as *.json"
846
+ ]
847
+ },
848
+ {
849
+ "cell_type": "code",
850
+ "execution_count": 5,
851
+ "id": "41ee14e0-fb78-4f91-aba1-13faf05af7d8",
852
+ "metadata": {
853
+ "execution": {
854
+ "iopub.execute_input": "2025-02-08T19:37:56.687832Z",
855
+ "iopub.status.busy": "2025-02-08T19:37:56.687251Z",
856
+ "iopub.status.idle": "2025-02-08T19:37:56.696572Z",
857
+ "shell.execute_reply": "2025-02-08T19:37:56.696059Z",
858
+ "shell.execute_reply.started": "2025-02-08T19:37:56.687809Z"
859
+ }
860
+ },
861
+ "outputs": [],
862
+ "source": [
863
+ "import datetime\n",
864
+ "\n",
865
+ "def load_api_keys():\n",
866
+ " \"\"\"Load API keys from a text file, one per line.\"\"\"\n",
867
+ " if not os.path.exists(key_karussell):\n",
868
+ " raise FileNotFoundError(f\"API key file '{API_KEYS_FILE}' not found!\")\n",
869
+ " \n",
870
+ " with open(key_karussell, 'r') as file:\n",
871
+ " keys = [line.strip() for line in file if line.strip()]\n",
872
+ " \n",
873
+ " if not keys:\n",
874
+ " raise ValueError(\"No API keys found in the file!\")\n",
875
+ " \n",
876
+ " return keys\n",
877
+ "\n",
878
+ "def get_model_metadata():\n",
879
+ " base_url = \"https://civitai.com/api/v1/models\"\n",
880
+ " params = {\"sort\": \"Newest\", \"nsfw\": True}\n",
881
+ "\n",
882
+ " # Load API keys\n",
883
+ " api_keys = load_api_keys()\n",
884
+ " key_index = 0 # Start with the first key\n",
885
+ "\n",
886
+ " page_counter = 0\n",
887
+ " max_pages = 300000000 # Adjust as needed\n",
888
+ " os.makedirs(directory_path, exist_ok=True)\n",
889
+ "\n",
890
+ " while True:\n",
891
+ " if page_counter >= max_pages:\n",
892
+ " print(f\"Reached the limit of {max_pages} pages.\")\n",
893
+ " break\n",
894
+ "\n",
895
+ " headers = {\n",
896
+ " \"Accept\": \"application/json\",\n",
897
+ " \"Authorization\": f\"Bearer {api_keys[key_index]}\"\n",
898
+ " }\n",
899
+ "\n",
900
+ " response = requests.get(base_url, headers=headers, params=params)\n",
901
+ "\n",
902
+ " if response.status_code == 200:\n",
903
+ " data = response.json()\n",
904
+ " page_counter += 1\n",
905
+ "\n",
906
+ " # Add timestamp\n",
907
+ " formatted_timestamp = datetime.datetime.now().strftime(\"data obtained on the %d.%m.%Y at %H:%M CEST\")\n",
908
+ "\n",
909
+ " data['timestamp'] = formatted_timestamp\n",
910
+ "\n",
911
+ " # Save data to file\n",
912
+ " file_path = os.path.join(directory_path, f'newest_models_{page_counter}.json')\n",
913
+ " with open(file_path, 'w', encoding='utf-8') as file:\n",
914
+ " json.dump(data, file, indent=4)\n",
915
+ "\n",
916
+ " # Check for nextCursor\n",
917
+ " next_cursor = data.get('metadata', {}).get('nextCursor')\n",
918
+ " if not next_cursor:\n",
919
+ " print(\"No more data available.\")\n",
920
+ " break\n",
921
+ " else:\n",
922
+ " params['cursor'] = next_cursor\n",
923
+ " \n",
924
+ " elif response.status_code in (401, 403): # Unauthorized or Forbidden\n",
925
+ " print(f\"API Key {key_index + 1} failed with status {response.status_code}. Trying next key...\")\n",
926
+ " key_index += 1\n",
927
+ "\n",
928
+ " if key_index >= len(api_keys):\n",
929
+ " print(\"All API keys failed. Exiting.\")\n",
930
+ " break # Stop if all keys fail\n",
931
+ " \n",
932
+ " else:\n",
933
+ " print(f\"Failed to fetch data: HTTP {response.status_code}\")\n",
934
+ " break # Stop on other errors\n"
935
+ ]
936
+ },
937
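
`load_api_keys()` expects a plain text file with one CivitAI token per line (blank lines are ignored). The loop above gives up once every key has returned 401/403; if keys are merely rate-limited rather than revoked, cycling through them indefinitely is one hedged alternative:

```python
from itertools import cycle

key_pool = cycle(load_api_keys())  # wraps around instead of exhausting

def next_headers():
    # Rotate to the next key on every call (e.g. after a 401/403 response).
    return {"Accept": "application/json",
            "Authorization": f"Bearer {next(key_pool)}"}
```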
+ {
938
+ "cell_type": "code",
939
+ "execution_count": 6,
940
+ "id": "0b574bc0-8bc5-48fd-9ced-8723d07e0d0e",
941
+ "metadata": {
942
+ "execution": {
943
+ "iopub.execute_input": "2025-02-08T19:37:57.395918Z",
944
+ "iopub.status.busy": "2025-02-08T19:37:57.395733Z",
945
+ "iopub.status.idle": "2025-02-08T19:37:57.399679Z",
946
+ "shell.execute_reply": "2025-02-08T19:37:57.399183Z",
947
+ "shell.execute_reply.started": "2025-02-08T19:37:57.395902Z"
948
+ }
949
+ },
950
+ "outputs": [],
951
+ "source": [
952
+ "key_karussell = current_dir.parent / 'misc/api_keys.txt'\n",
953
+ "directory_path = current_dir.parent / 'data/raw/model_metadata/'"
954
+ ]
955
+ },
956
+ {
957
+ "cell_type": "markdown",
958
+ "id": "8f43ce23-5986-4b67-be2d-453831af9a6e",
959
+ "metadata": {},
960
+ "source": [
961
+ "uncomment this to get model metadata"
962
+ ]
963
+ },
964
+ {
965
+ "cell_type": "code",
966
+ "execution_count": 7,
967
+ "id": "3f95b4ba-5742-4268-b2e8-de9145faf495",
968
+ "metadata": {
969
+ "execution": {
970
+ "iopub.execute_input": "2025-02-08T19:37:58.117386Z",
971
+ "iopub.status.busy": "2025-02-08T19:37:58.117158Z",
972
+ "iopub.status.idle": "2025-02-08T19:37:58.121162Z",
973
+ "shell.execute_reply": "2025-02-08T19:37:58.120580Z",
974
+ "shell.execute_reply.started": "2025-02-08T19:37:58.117369Z"
975
+ }
976
+ },
977
+ "outputs": [],
978
+ "source": [
979
+ "#get_model_metadata()"
980
+ ]
981
+ },
982
+ {
983
+ "cell_type": "markdown",
984
+ "id": "1ed5f44e-612e-4b6d-8974-904fb3e058d6",
985
+ "metadata": {
986
+ "execution": {
987
+ "iopub.execute_input": "2025-02-06T12:52:40.162173Z",
988
+ "iopub.status.busy": "2025-02-06T12:52:40.159989Z",
989
+ "iopub.status.idle": "2025-02-06T12:52:40.170634Z",
990
+ "shell.execute_reply": "2025-02-06T12:52:40.169945Z",
991
+ "shell.execute_reply.started": "2025-02-06T12:52:40.162124Z"
992
+ }
993
+ },
994
+ "source": [
995
+ "### Step 2 Consolidate Model-dataset CSV"
996
+ ]
997
+ },
998
+ {
999
+ "cell_type": "code",
1000
+ "execution_count": 8,
1001
+ "id": "464d82f5-c24e-4b53-9b50-682fa4bf3430",
1002
+ "metadata": {
1003
+ "execution": {
1004
+ "iopub.execute_input": "2025-02-08T19:37:59.052245Z",
1005
+ "iopub.status.busy": "2025-02-08T19:37:59.051645Z",
1006
+ "iopub.status.idle": "2025-02-08T19:37:59.056017Z",
1007
+ "shell.execute_reply": "2025-02-08T19:37:59.055598Z",
1008
+ "shell.execute_reply.started": "2025-02-08T19:37:59.052224Z"
1009
+ }
1010
+ },
1011
+ "outputs": [],
1012
+ "source": [
1013
+ "## path thingy\n",
1014
+ "try: #scripts\n",
1015
+ " current_dir = Path(__file__).resolve().parent\n",
1016
+ "except NameError:\n",
1017
+ " # jupyter\n",
1018
+ " current_dir = Path.cwd()"
1019
+ ]
1020
+ },
1021
+ {
1022
+ "cell_type": "code",
1023
+ "execution_count": 9,
1024
+ "id": "14f3db4d-ef93-4629-8a27-971adb49248b",
1025
+ "metadata": {
1026
+ "execution": {
1027
+ "iopub.execute_input": "2025-02-08T19:37:59.443457Z",
1028
+ "iopub.status.busy": "2025-02-08T19:37:59.443236Z",
1029
+ "iopub.status.idle": "2025-02-08T19:37:59.890103Z",
1030
+ "shell.execute_reply": "2025-02-08T19:37:59.889571Z",
1031
+ "shell.execute_reply.started": "2025-02-08T19:37:59.443439Z"
1032
+ }
1033
+ },
1034
+ "outputs": [
1035
+ {
1036
+ "name": "stdout",
1037
+ "output_type": "stream",
1038
+ "text": [
1039
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_2.json\n",
1040
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_1.json\n",
1041
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_4.json\n",
1042
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_6.json\n",
1043
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_7.json\n",
1044
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_5.json\n",
1045
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_3.json\n"
1046
+ ]
1047
+ }
1048
+ ],
1049
+ "source": [
1050
+ "import os\n",
1051
+ "import json\n",
1052
+ "import pandas as pd\n",
1053
+ "import hashlib\n",
1054
+ "\n",
1055
+ "\n",
1056
+ "json_path = current_dir.parent / 'data/raw/model_metadata/'\n",
1057
+ "\n",
1058
+ "# Initialize a list to hold all the processed records\n",
1059
+ "data_records = []\n",
1060
+ "\n",
1061
+ "def hash_username(username):\n",
1062
+ " \"\"\"Convert a username into a unique hash.\"\"\"\n",
1063
+ " return hashlib.sha256(username.encode('utf-8')).hexdigest()[:16] # Use first 16 characters for brevity\n",
1064
+ "\n",
1065
+ "# Loop through each file in the directory\n",
1066
+ "for filename in os.listdir(json_path):\n",
1067
+ " # Construct the full path to the JSON file\n",
1068
+ " file_path = os.path.join(json_path, filename)\n",
1069
+ "\n",
1070
+ " # Only process files with .json extension\n",
1071
+ " if filename.endswith('.json') and os.path.isfile(file_path):\n",
1072
+ " print(f\"Processing file: {file_path}\")\n",
1073
+ "\n",
1074
+ " # Open and load the JSON file\n",
1075
+ " with open(file_path, 'r') as file:\n",
1076
+ " try:\n",
1077
+ " data = json.load(file)\n",
1078
+ " except json.JSONDecodeError as e:\n",
1079
+ " print(f\"Error decoding JSON in file {filename}: {e}\")\n",
1080
+ " continue\n",
1081
+ "\n",
1082
+ " # Determine how the items are structured within the dictionary\n",
1083
+ " if 'items' in data:\n",
1084
+ " items = data['items']\n",
1085
+ " elif 'data' in data:\n",
1086
+ " items = data['data']\n",
1087
+ " else:\n",
1088
+ " print(f\"No known item list key found in {filename}\")\n",
1089
+ " continue\n",
1090
+ "\n",
1091
+ " for item in items:\n",
1092
+ " if not isinstance(item, dict):\n",
1093
+ " print(f\"Expected a dictionary for each item, but found: {type(item)} in {filename}\")\n",
1094
+ " continue\n",
1095
+ "\n",
1096
+ " # Extract required information\n",
1097
+ " model_versions = item.get('modelVersions', [])\n",
1098
+ " download_url = model_versions[0]['files'][0]['downloadUrl'] if model_versions and 'files' in model_versions[0] and model_versions[0]['files'] else ''\n",
1099
+ " auto_hashes = model_versions[0]['files'][0]['hashes'] if model_versions and 'files' in model_versions[0] and model_versions[0]['files'] else {}\n",
1100
+ "\n",
1101
+ " # Collect up to 20 modelVersion IDs\n",
1102
+ " model_version_ids = [mv.get('id', '') for mv in model_versions[:20]]\n",
1103
+ "\n",
1104
+ " # Get preview images: first and latest\n",
1105
+ " first_image_url = model_versions[0]['images'][0]['url'] if model_versions and 'images' in model_versions[0] and model_versions[0]['images'] else ''\n",
1106
+ " latest_index = len(model_versions) - 1\n",
1107
+ " latest_image_url = model_versions[latest_index]['images'][0]['url'] if model_versions and 'images' in model_versions[latest_index] and model_versions[latest_index]['images'] else ''\n",
1108
+ "\n",
1109
+ " # Hash the username\n",
1110
+ " username = item.get('creator', {}).get('username', '')\n",
1111
+ " username_hash = hash_username(username) if username else ''\n",
1112
+ "\n",
1113
+ " # Extract additional fields\n",
1114
+ " record = {\n",
1115
+ " 'id': item.get('id', ''),\n",
1116
+ " 'name': item.get('name', ''),\n",
1117
+ " 'type': item.get('type', ''),\n",
1118
+ " 'baseModel': model_versions[0].get('baseModel', '') if model_versions else '',\n",
1119
+ " 'downloadCount': item.get('stats', {}).get('downloadCount', 0),\n",
1120
+ " 'nsfwLevel': item.get('nsfwLevel', 0),\n",
1121
+ " 'modelVersions': len(model_versions),\n",
1122
+ " 'publishedAt': model_versions[0].get('publishedAt', '') if model_versions else '',\n",
1123
+ " 'usernameHash': username_hash,\n",
1124
+ " 'downloadUrl': download_url,\n",
1125
+ " 'firstImageUrl': first_image_url,\n",
1126
+ " 'latestImageUrl': latest_image_url,\n",
1127
+ " 'poi': item.get('poi', False),\n",
1128
+ " 'AutoV1': auto_hashes.get('AutoV1', ''),\n",
1129
+ " 'AutoV2': auto_hashes.get('AutoV2', ''),\n",
1130
+ " 'AutoV3': auto_hashes.get('AutoV3', ''),\n",
1131
+ " 'SHA256': auto_hashes.get('SHA256', ''),\n",
1132
+ " 'CRC32': auto_hashes.get('CRC32', ''),\n",
1133
+ " 'BLAKE3': auto_hashes.get('BLAKE3', ''),\n",
1134
+ " 'previewImage': latest_image_url\n",
1135
+ " }\n",
1136
+ "\n",
1137
+ " # Add version IDs to the record\n",
1138
+ " for i in range(20): # Collect up to 20 version IDs\n",
1139
+ " version_key = f'version_id_{i+1}'\n",
1140
+ " record[version_key] = model_version_ids[i] if len(model_version_ids) > i else ''\n",
1141
+ "\n",
1142
+ " # Add tags\n",
1143
+ " tags = item.get('tags', [])\n",
1144
+ " for i in range(7): # To retrieve up to tag 6\n",
1145
+ " tag_key = f'tag_{i+1}'\n",
1146
+ " record[tag_key] = tags[i] if len(tags) > i else ''\n",
1147
+ "\n",
1148
+ " data_records.append(record)\n",
1149
+ "\n",
1150
+ "# Create a DataFrame and sort by download count\n",
1151
+ "df = pd.DataFrame(data_records)\n",
1152
+ "df_sorted = df.sort_values(by='downloadCount', ascending=False)"
1153
+ ]
1154
+ },
1155
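
The long conditional expressions that pull `downloadUrl`, hashes, and preview images out of `modelVersions` are easy to get subtly wrong; a hedged refactor sketch the cell above could use instead (behaviour-equivalent for the fields shown):

```python
def first_of(version: dict, key: str):
    # Return version[key][0] when that list exists and is non-empty, else {}.
    entries = version.get(key) or []
    return entries[0] if entries else {}

# usage, mirroring the original ternaries:
# mv0 = model_versions[0] if model_versions else {}
# download_url    = first_of(mv0, "files").get("downloadUrl", "")
# auto_hashes     = first_of(mv0, "files").get("hashes", {})
# first_image_url = first_of(mv0, "images").get("url", "")
```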
+ {
1156
+ "cell_type": "markdown",
1157
+ "id": "2d6f822f-754f-4feb-9ed6-6f868421c454",
1158
+ "metadata": {},
1159
+ "source": [
1160
+ "### Save model-data to CSV"
1161
+ ]
1162
+ },
1163
+ {
1164
+ "cell_type": "code",
1165
+ "execution_count": 10,
1166
+ "id": "f861cf7b-5eb3-46ad-ab2d-9e2fdf2169b4",
1167
+ "metadata": {
1168
+ "execution": {
1169
+ "iopub.execute_input": "2025-02-08T19:38:02.047421Z",
1170
+ "iopub.status.busy": "2025-02-08T19:38:02.046761Z",
1171
+ "iopub.status.idle": "2025-02-08T19:38:02.193381Z",
1172
+ "shell.execute_reply": "2025-02-08T19:38:02.192886Z",
1173
+ "shell.execute_reply.started": "2025-02-08T19:38:02.047397Z"
1174
+ }
1175
+ },
1176
+ "outputs": [
1177
+ {
1178
+ "name": "stdout",
1179
+ "output_type": "stream",
1180
+ "text": [
1181
+ "Data has been saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/Civiverse-Models.csv\n"
1182
+ ]
1183
+ }
1184
+ ],
1185
+ "source": [
1186
+ "output_csv = current_dir.parent / 'data/CSV/Civiverse-Models.csv'\n",
1187
+ "output_csv.parent.mkdir(parents=True, exist_ok=True)\n",
1188
+ "df_sorted.to_csv(output_csv, index=False)\n",
1189
+ "print(f\"Data has been saved to {output_csv}\")"
1190
+ ]
1191
+ },
1192
+ {
1193
+ "cell_type": "markdown",
1194
+ "id": "9b948933-2a2f-42b0-9083-2d81586ae3f2",
1195
+ "metadata": {
1196
+ "execution": {
1197
+ "iopub.execute_input": "2025-02-06T13:42:32.447860Z",
1198
+ "iopub.status.busy": "2025-02-06T13:42:32.446832Z",
1199
+ "iopub.status.idle": "2025-02-06T13:42:32.453827Z",
1200
+ "shell.execute_reply": "2025-02-06T13:42:32.453271Z",
1201
+ "shell.execute_reply.started": "2025-02-06T13:42:32.447821Z"
1202
+ }
1203
+ },
1204
+ "source": [
1205
+ "### Step 3 Create Subsets: Checkpoint only, POI True, POI False"
1206
+ ]
1207
+ },
1208
+ {
1209
+ "cell_type": "code",
1210
+ "execution_count": 11,
1211
+ "id": "ae894e26-4984-40e6-80a6-54fb6b61c873",
1212
+ "metadata": {
1213
+ "execution": {
1214
+ "iopub.execute_input": "2025-02-08T19:38:03.089160Z",
1215
+ "iopub.status.busy": "2025-02-08T19:38:03.088490Z",
1216
+ "iopub.status.idle": "2025-02-08T19:38:03.093067Z",
1217
+ "shell.execute_reply": "2025-02-08T19:38:03.092634Z",
1218
+ "shell.execute_reply.started": "2025-02-08T19:38:03.089138Z"
1219
+ }
1220
+ },
1221
+ "outputs": [],
1222
+ "source": [
1223
+ "file_path = current_dir.parent / 'data/CSV/Civiverse-Models.csv' # Update this with your actual file path\n",
1224
+ "(current_dir.parent / 'data/CSV/model_subsets').mkdir(parents=True, exist_ok=True)\n"
1225
+ ]
1226
+ },
1227
+ {
1228
+ "cell_type": "code",
1229
+ "execution_count": 13,
1230
+ "id": "88438f88-c723-423e-a4e5-e07f59096b72",
1231
+ "metadata": {
1232
+ "execution": {
1233
+ "iopub.execute_input": "2025-02-08T19:41:39.279495Z",
1234
+ "iopub.status.busy": "2025-02-08T19:41:39.279039Z",
1235
+ "iopub.status.idle": "2025-02-08T19:41:39.674445Z",
1236
+ "shell.execute_reply": "2025-02-08T19:41:39.673981Z",
1237
+ "shell.execute_reply.started": "2025-02-08T19:41:39.279476Z"
1238
+ }
1239
+ },
1240
+ "outputs": [
1241
+ {
1242
+ "name": "stdout",
1243
+ "output_type": "stream",
1244
+ "text": [
1245
+ "Files saved successfully!\n"
1246
+ ]
1247
+ }
1248
+ ],
1249
+ "source": [
1250
+ "import pandas as pd\n",
1251
+ "\n",
1252
+ "# Load the dataset\n",
1253
+ "\n",
1254
+ "data = pd.read_csv(file_path)\n",
1255
+ "\n",
1256
+ "# Version 1: Only 'poi' true models\n",
1257
+ "poi_true_models = data[data['poi'] == True]\n",
1258
+ "\n",
1259
+ "# Version 2: Only types lora, dora, locon, textual inversion\n",
1260
+ "specific_types = ['LORA', 'DORA', 'LOCON', 'textualInversion']\n",
1261
+ "adapters = data[data['type'].isin(specific_types)]\n",
1262
+ "\n",
1263
+ "# Version 3: Only type checkpoint\n",
1264
+ "checkpoint_models = data[data['type'] == 'Checkpoint']\n",
1265
+ "\n",
1266
+ "# Version 4: All models apart from 'poi' true\n",
1267
+ "non_poi_models = data[data['poi'] != True]\n",
1268
+ "\n",
1269
+ "# Version 5: All models apart from 'poi' true and with nsfwLevel below 13\n",
1270
+ "non_poi_low_nsfw_models = data[(data['poi'] != True) & (data['nsfwLevel'] < 13)]\n",
1271
+ "\n",
1272
+ "# Save the versions as separate CSV files\n",
1273
+ "poi_true_models.to_csv(current_dir.parent / 'data/CSV/model_subsets/Civiverse_adapters_poi_true.csv', index=False)\n",
1274
+ "adapters.to_csv(current_dir.parent / 'data/CSV/adapters.csv', index=False)\n",
1275
+ "checkpoint_models.to_csv(current_dir.parent / 'data/CSV/model_subsets/Civiverse_checkpoint_only.csv', index=False)\n",
1276
+ "non_poi_models.to_csv(current_dir.parent / 'data/CSV/model_subsets/Civiverse_adapters_poi_false.csv', index=False)\n",
1277
+ "\n",
1278
+ "print(\"Files saved successfully!\")\n"
1279
+ ]
1280
+ },
1281
+ {
1282
+ "cell_type": "code",
1283
+ "execution_count": null,
1284
+ "id": "49e35088-8b2e-4189-83e1-3098d55dcad2",
1285
+ "metadata": {},
1286
+ "outputs": [],
1287
+ "source": [
1288
+ "import pandas as pd\n",
1289
+ "import os\n",
1290
+ "\n",
1291
+ "# Load the dataset\n",
1292
+ "df = pd.read_csv('data/all_models_with_tags.csv')\n",
1293
+ "\n",
1294
+ "# Filter for rows where poi is True\n",
1295
+ "filtered_df = df[df['poi'] == True]\n",
1296
+ "os.makedirs('data/model_subsets', exist_ok=True)\n",
1297
+ "\n",
1298
+ "# Save the filtered DataFrame to a new CSV file\n",
1299
+ "filtered_df.to_csv('data/model_subsets/all_models_poi.csv', index=False)\n"
1300
+ ]
1301
+ },
1302
+ {
1303
+ "cell_type": "code",
1304
+ "execution_count": null,
1305
+ "id": "06c15f2c",
1306
+ "metadata": {},
1307
+ "outputs": [],
1308
+ "source": [
1309
+ "import pandas as pd\n",
1310
+ "import os\n",
1311
+ "\n",
1312
+ "# Load the dataset\n",
1313
+ "df = pd.read_csv('data/all_models_with_tags.csv')\n",
1314
+ "\n",
1315
+ "# Filter for rows where poi is True\n",
1316
+ "filtered_df = df[df['poi'] == False]\n",
1317
+ "os.makedirs('data/model_subsets', exist_ok=True)\n",
1318
+ "\n",
1319
+ "# Save the filtered DataFrame to a new CSV file\n",
1320
+ "filtered_df.to_csv('data/model_subsets/all_models_poi_false.csv', index=False)\n"
1321
+ ]
1322
+ }
1323
+ ],
1324
+ "metadata": {
1325
+ "kernelspec": {
1326
+ "display_name": "latm",
1327
+ "language": "python",
1328
+ "name": "python3"
1329
+ },
1330
+ "language_info": {
1331
+ "codemirror_mode": {
1332
+ "name": "ipython",
1333
+ "version": 3
1334
+ },
1335
+ "file_extension": ".py",
1336
+ "mimetype": "text/x-python",
1337
+ "name": "python",
1338
+ "nbconvert_exporter": "python",
1339
+ "pygments_lexer": "ipython3",
1340
+ "version": "3.10.15"
1341
+ }
1342
+ },
1343
+ "nbformat": 4,
1344
+ "nbformat_minor": 5
1345
+ }
jupyter_notebooks/0_Scraping_model_metadata.ipynb ADDED
@@ -0,0 +1,643 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "1111ea95-d385-49b9-a4d9-ef886ace5c7a",
6
+ "metadata": {
7
+ "execution": {
8
+ "iopub.execute_input": "2025-02-06T11:24:25.566747Z",
9
+ "iopub.status.busy": "2025-02-06T11:24:25.566066Z",
10
+ "iopub.status.idle": "2025-02-06T11:24:25.571748Z",
11
+ "shell.execute_reply": "2025-02-06T11:24:25.571305Z",
12
+ "shell.execute_reply.started": "2025-02-06T11:24:25.566705Z"
13
+ }
14
+ },
15
+ "source": [
16
+ "# 0 Scraping Metadata and Dataset consolidation\n"
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "markdown",
21
+ "id": "6632505a-e7ca-4463-9ffc-e36fad42235f",
22
+ "metadata": {},
23
+ "source": [
24
+ "## IMAGES\n",
25
+ "---"
26
+ ]
27
+ },
28
+ {
29
+ "cell_type": "markdown",
30
+ "id": "e3388bac-bb71-40bc-a693-9ac7a2d5f32c",
31
+ "metadata": {
32
+ "execution": {
33
+ "iopub.execute_input": "2025-02-06T10:08:22.229784Z",
34
+ "iopub.status.busy": "2025-02-06T10:08:22.229287Z",
35
+ "iopub.status.idle": "2025-02-06T10:08:22.232210Z",
36
+ "shell.execute_reply": "2025-02-06T10:08:22.231793Z",
37
+ "shell.execute_reply.started": "2025-02-06T10:08:22.229766Z"
38
+ }
39
+ },
40
+ "source": [
41
+ "### Step 1: Image metadata scraping, sorting and CSV consolidation"
42
+ ]
43
+ },
44
+ {
45
+ "cell_type": "code",
46
+ "execution_count": 1,
47
+ "id": "f8decb63-43f5-4731-823d-94632eee7618",
48
+ "metadata": {
49
+ "execution": {
50
+ "iopub.execute_input": "2025-02-08T19:37:51.115763Z",
51
+ "iopub.status.busy": "2025-02-08T19:37:51.114573Z",
52
+ "iopub.status.idle": "2025-02-08T19:37:51.170027Z",
53
+ "shell.execute_reply": "2025-02-08T19:37:51.169401Z",
54
+ "shell.execute_reply.started": "2025-02-08T19:37:51.115738Z"
55
+ }
56
+ },
57
+ "outputs": [],
58
+ "source": [
59
+ "import os\n",
60
+ "import json\n",
61
+ "import csv\n",
62
+ "import requests\n",
63
+ "from datetime import datetime\n",
64
+ "import time\n",
65
+ "from pathlib import Path\n",
66
+ "import hashlib\n",
67
+ "import pandas as pd\n",
68
+ "import sys\n",
69
+ "from datetime import datetime, timedelta\n",
70
+ "import shutil"
71
+ ]
72
+ },
73
+ {
74
+ "cell_type": "code",
75
+ "execution_count": 2,
76
+ "id": "4b2426c3-96a0-468e-b6dc-78dea9c3e92b",
77
+ "metadata": {
78
+ "execution": {
79
+ "iopub.execute_input": "2025-02-08T19:37:51.809027Z",
80
+ "iopub.status.busy": "2025-02-08T19:37:51.808835Z",
81
+ "iopub.status.idle": "2025-02-08T19:37:51.812922Z",
82
+ "shell.execute_reply": "2025-02-08T19:37:51.812429Z",
83
+ "shell.execute_reply.started": "2025-02-08T19:37:51.809009Z"
84
+ }
85
+ },
86
+ "outputs": [],
87
+ "source": [
88
+ "current_dir = Path.cwd()"
89
+ ]
90
+ },
91
+ {
92
+ "cell_type": "markdown",
93
+ "id": "11647bb7-5ce9-414a-8486-5bdce8d9cfea",
94
+ "metadata": {
95
+ "execution": {
96
+ "iopub.execute_input": "2025-02-06T12:49:43.126762Z",
97
+ "iopub.status.busy": "2025-02-06T12:49:43.125797Z",
98
+ "iopub.status.idle": "2025-02-06T12:49:43.129759Z",
99
+ "shell.execute_reply": "2025-02-06T12:49:43.129176Z",
100
+ "shell.execute_reply.started": "2025-02-06T12:49:43.126736Z"
101
+ }
102
+ },
103
+ "source": [
104
+ "## MODELS"
105
+ ]
106
+ },
107
+ {
108
+ "cell_type": "markdown",
109
+ "id": "5c9cb5e3-7cea-4574-9319-f3cd89354b1f",
110
+ "metadata": {
111
+ "execution": {
112
+ "iopub.execute_input": "2025-02-06T13:23:38.017078Z",
113
+ "iopub.status.busy": "2025-02-06T13:23:38.016639Z",
114
+ "iopub.status.idle": "2025-02-06T13:23:38.019993Z",
115
+ "shell.execute_reply": "2025-02-06T13:23:38.019549Z",
116
+ "shell.execute_reply.started": "2025-02-06T13:23:38.017053Z"
117
+ }
118
+ },
119
+ "source": [
120
+ "### Step 1: Scrape model metadata"
121
+ ]
122
+ },
123
+ {
124
+ "cell_type": "markdown",
125
+ "id": "83e12b9f-5ae1-407d-9754-5979d837f787",
126
+ "metadata": {
127
+ "execution": {
128
+ "iopub.execute_input": "2025-02-06T14:06:04.857874Z",
129
+ "iopub.status.busy": "2025-02-06T14:06:04.857500Z",
130
+ "iopub.status.idle": "2025-02-06T14:06:04.860438Z",
131
+ "shell.execute_reply": "2025-02-06T14:06:04.860030Z",
132
+ "shell.execute_reply.started": "2025-02-06T14:06:04.857856Z"
133
+ }
134
+ },
135
+ "source": [
136
+ "#### the resulting files will appear in data/raw/model_metadata as *.json"
137
+ ]
138
+ },
139
+ {
140
+ "cell_type": "code",
141
+ "execution_count": 12,
142
+ "id": "5db9c00e",
143
+ "metadata": {},
144
+ "outputs": [],
145
+ "source": [
146
+ "key_karussell = current_dir.parent / 'misc/credentials/civitai_api_keys.txt'\n",
147
+ "directory_path = current_dir.parent / 'data/raw/model_metadata/'"
148
+ ]
149
+ },
150
+ {
151
+ "cell_type": "code",
152
+ "execution_count": 13,
153
+ "id": "41ee14e0-fb78-4f91-aba1-13faf05af7d8",
154
+ "metadata": {
155
+ "execution": {
156
+ "iopub.execute_input": "2025-02-08T19:37:56.687832Z",
157
+ "iopub.status.busy": "2025-02-08T19:37:56.687251Z",
158
+ "iopub.status.idle": "2025-02-08T19:37:56.696572Z",
159
+ "shell.execute_reply": "2025-02-08T19:37:56.696059Z",
160
+ "shell.execute_reply.started": "2025-02-08T19:37:56.687809Z"
161
+ }
162
+ },
163
+ "outputs": [],
164
+ "source": [
165
+ "import datetime\n",
166
+ "\n",
167
+ "def load_api_keys():\n",
168
+ " \"\"\"Load API keys from a text file, one per line.\"\"\"\n",
169
+ " if not os.path.exists(key_karussell):\n",
170
+ " raise FileNotFoundError(f\"API key file '{API_KEYS_FILE}' not found!\")\n",
171
+ " \n",
172
+ " with open(key_karussell, 'r') as file:\n",
173
+ " keys = [line.strip() for line in file if line.strip()]\n",
174
+ " \n",
175
+ " if not keys:\n",
176
+ " raise ValueError(\"No API keys found in the file!\")\n",
177
+ " \n",
178
+ " return keys\n",
179
+ "\n",
180
+ "def get_model_metadata():\n",
181
+ " base_url = \"https://civitai.com/api/v1/models\"\n",
182
+ " params = {\"sort\": \"Newest\", \"nsfw\": True}\n",
183
+ "\n",
184
+ " # Load API keys\n",
185
+ " api_keys = load_api_keys()\n",
186
+ " key_index = 0 # Start with the first key\n",
187
+ "\n",
188
+ " page_counter = 0\n",
189
+ " max_pages = 300000000 # Adjust as needed\n",
190
+ " os.makedirs(directory_path, exist_ok=True)\n",
191
+ "\n",
192
+ " while True:\n",
193
+ " if page_counter >= max_pages:\n",
194
+ " print(f\"Reached the limit of {max_pages} pages.\")\n",
195
+ " break\n",
196
+ "\n",
197
+ " headers = {\n",
198
+ " \"Accept\": \"application/json\",\n",
199
+ " \"Authorization\": f\"Bearer {api_keys[key_index]}\"\n",
200
+ " }\n",
201
+ "\n",
202
+ " response = requests.get(base_url, headers=headers, params=params)\n",
203
+ "\n",
204
+ " if response.status_code == 200:\n",
205
+ " data = response.json()\n",
206
+ " page_counter += 1\n",
207
+ "\n",
208
+ " # Add timestamp\n",
209
+ " formatted_timestamp = datetime.datetime.now().strftime(\"data obtained on the %d.%m.%Y at %H:%M CEST\")\n",
210
+ "\n",
211
+ " data['timestamp'] = formatted_timestamp\n",
212
+ "\n",
213
+ " # Save data to file\n",
214
+ " file_path = os.path.join(directory_path, f'newest_models_{page_counter}.json')\n",
215
+ " with open(file_path, 'w', encoding='utf-8') as file:\n",
216
+ " json.dump(data, file, indent=4)\n",
217
+ "\n",
218
+ " # Check for nextCursor\n",
219
+ " next_cursor = data.get('metadata', {}).get('nextCursor')\n",
220
+ " if not next_cursor:\n",
221
+ " print(\"No more data available.\")\n",
222
+ " break\n",
223
+ " else:\n",
224
+ " params['cursor'] = next_cursor\n",
225
+ " \n",
226
+ " elif response.status_code in (401, 403): # Unauthorized or Forbidden\n",
227
+ " print(f\"API Key {key_index + 1} failed with status {response.status_code}. Trying next key...\")\n",
228
+ " key_index += 1\n",
229
+ "\n",
230
+ " if key_index >= len(api_keys):\n",
231
+ " print(\"All API keys failed. Exiting.\")\n",
232
+ " break # Stop if all keys fail\n",
233
+ " \n",
234
+ " else:\n",
235
+ " print(f\"Failed to fetch data: HTTP {response.status_code}\")\n",
236
+ " break # Stop on other errors\n"
237
+ ]
238
+ },
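+ {
+ "cell_type": "markdown",
+ "id": "added-backoff-note",
+ "metadata": {},
+ "source": [
+ "The scrape above stops on the first non-auth error (e.g. the HTTP 500 seen below). A minimal sketch of a retry wrapper with exponential backoff that could stand in for the bare `requests.get` call; the retry count and delays are assumptions, not tuned values."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "added-backoff-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import time\n",
+ "import requests\n",
+ "\n",
+ "def get_with_backoff(url, headers=None, params=None, max_retries=5, base_delay=2.0):\n",
+ "    \"\"\"Retry transient HTTP errors (5xx and 429) with exponential backoff.\"\"\"\n",
+ "    response = None\n",
+ "    for attempt in range(max_retries):\n",
+ "        response = requests.get(url, headers=headers, params=params, timeout=30)\n",
+ "        if response.status_code < 500 and response.status_code != 429:\n",
+ "            return response  # success or a non-transient error: let the caller decide\n",
+ "        delay = base_delay * (2 ** attempt)\n",
+ "        print(f\"HTTP {response.status_code}, retrying in {delay:.0f}s ({attempt + 1}/{max_retries})\")\n",
+ "        time.sleep(delay)\n",
+ "    return response  # still failing after max_retries; caller sees the last status"
+ ]
+ },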
239
+ {
240
+ "cell_type": "markdown",
241
+ "id": "8f43ce23-5986-4b67-be2d-453831af9a6e",
242
+ "metadata": {},
243
+ "source": [
244
+ "uncomment this to get model metadata"
245
+ ]
246
+ },
247
+ {
248
+ "cell_type": "code",
249
+ "execution_count": 14,
250
+ "id": "3f95b4ba-5742-4268-b2e8-de9145faf495",
251
+ "metadata": {
252
+ "execution": {
253
+ "iopub.execute_input": "2025-02-08T19:37:58.117386Z",
254
+ "iopub.status.busy": "2025-02-08T19:37:58.117158Z",
255
+ "iopub.status.idle": "2025-02-08T19:37:58.121162Z",
256
+ "shell.execute_reply": "2025-02-08T19:37:58.120580Z",
257
+ "shell.execute_reply.started": "2025-02-08T19:37:58.117369Z"
258
+ }
259
+ },
260
+ "outputs": [
261
+ {
262
+ "name": "stdout",
263
+ "output_type": "stream",
264
+ "text": [
265
+ "Failed to fetch data: HTTP 500\n"
266
+ ]
267
+ }
268
+ ],
269
+ "source": [
270
+ "get_model_metadata()"
271
+ ]
272
+ },
273
+ {
274
+ "cell_type": "markdown",
275
+ "id": "1ed5f44e-612e-4b6d-8974-904fb3e058d6",
276
+ "metadata": {
277
+ "execution": {
278
+ "iopub.execute_input": "2025-02-06T12:52:40.162173Z",
279
+ "iopub.status.busy": "2025-02-06T12:52:40.159989Z",
280
+ "iopub.status.idle": "2025-02-06T12:52:40.170634Z",
281
+ "shell.execute_reply": "2025-02-06T12:52:40.169945Z",
282
+ "shell.execute_reply.started": "2025-02-06T12:52:40.162124Z"
283
+ }
284
+ },
285
+ "source": [
286
+ "### Step 2 Consolidate Model-dataset CSV"
287
+ ]
288
+ },
289
+ {
290
+ "cell_type": "code",
291
+ "execution_count": 8,
292
+ "id": "464d82f5-c24e-4b53-9b50-682fa4bf3430",
293
+ "metadata": {
294
+ "execution": {
295
+ "iopub.execute_input": "2025-02-08T19:37:59.052245Z",
296
+ "iopub.status.busy": "2025-02-08T19:37:59.051645Z",
297
+ "iopub.status.idle": "2025-02-08T19:37:59.056017Z",
298
+ "shell.execute_reply": "2025-02-08T19:37:59.055598Z",
299
+ "shell.execute_reply.started": "2025-02-08T19:37:59.052224Z"
300
+ }
301
+ },
302
+ "outputs": [],
303
+ "source": [
304
+ "## path thingy\n",
305
+ "try: #scripts\n",
306
+ " current_dir = Path(__file__).resolve().parent\n",
307
+ "except NameError:\n",
308
+ " # jupyter\n",
309
+ " current_dir = Path.cwd()"
310
+ ]
311
+ },
312
+ {
313
+ "cell_type": "code",
314
+ "execution_count": null,
315
+ "id": "14f3db4d-ef93-4629-8a27-971adb49248b",
316
+ "metadata": {
317
+ "execution": {
318
+ "iopub.execute_input": "2025-02-08T19:37:59.443457Z",
319
+ "iopub.status.busy": "2025-02-08T19:37:59.443236Z",
320
+ "iopub.status.idle": "2025-02-08T19:37:59.890103Z",
321
+ "shell.execute_reply": "2025-02-08T19:37:59.889571Z",
322
+ "shell.execute_reply.started": "2025-02-08T19:37:59.443439Z"
323
+ }
324
+ },
325
+ "outputs": [
326
+ {
327
+ "name": "stdout",
328
+ "output_type": "stream",
329
+ "text": [
330
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_2.json\n",
331
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_1.json\n",
332
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_4.json\n",
333
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_6.json\n",
334
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_7.json\n",
335
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_5.json\n",
336
+ "Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_3.json\n"
337
+ ]
338
+ }
339
+ ],
340
+ "source": [
341
+ "import os\n",
342
+ "import json\n",
343
+ "import pandas as pd\n",
344
+ "import hashlib\n",
345
+ "from pathlib import Path\n",
346
+ "from datetime import datetime, timezone\n",
347
+ "\n",
348
+ "\n",
349
+ "\n",
350
+ "\n",
351
+ "def hash_username(username):\n",
352
+ " return hashlib.sha256(username.encode('utf-8')).hexdigest()[:16]\n",
353
+ "\n",
354
+ "def parse_date(date_str):\n",
355
+ " try:\n",
356
+ " return datetime.fromisoformat(date_str.replace('Z', '+00:00'))\n",
357
+ " except Exception:\n",
358
+ " return datetime.min.replace(tzinfo=timezone.utc) # Make it timezone-aware\n",
359
+ "\n",
360
+ "def get_latest_model_version(model_versions):\n",
361
+ " return max(model_versions, key=lambda mv: parse_date(mv.get('publishedAt', '')))\n",
362
+ "\n",
363
+ "def process_directory_recursively(root_dir):\n",
364
+ " root_path = Path(root_dir)\n",
365
+ " seen = {} # id -> (publishedAt, record)\n",
366
+ " data_records = []\n",
367
+ "\n",
368
+ " for json_file in root_path.rglob('*.json'):\n",
369
+ " if not json_file.is_file():\n",
370
+ " continue\n",
371
+ "\n",
372
+ " #print(f\"Processing file: {json_file}\")\n",
373
+ " try:\n",
374
+ " with open(json_file, 'r', encoding='utf-8') as f:\n",
375
+ " data = json.load(f)\n",
376
+ " except Exception as e:\n",
377
+ " print(f\"Failed to load {json_file}: {e}\")\n",
378
+ " continue\n",
379
+ "\n",
380
+ " items = data.get('items') or data.get('data') or []\n",
381
+ " for item in items:\n",
382
+ " if not isinstance(item, dict):\n",
383
+ " continue\n",
384
+ "\n",
385
+ " model_id = item.get('id')\n",
386
+ " model_versions = item.get('modelVersions', [])\n",
387
+ " if not model_versions:\n",
388
+ " continue\n",
389
+ "\n",
390
+ " latest_version = get_latest_model_version(model_versions)\n",
391
+ " published_at = latest_version.get('publishedAt', '')\n",
392
+ " current_dt = parse_date(published_at)\n",
393
+ "\n",
394
+ " if model_id in seen and current_dt <= seen[model_id][0]:\n",
395
+ " continue\n",
396
+ " seen[model_id] = (current_dt, item)\n",
397
+ "\n",
398
+ " for model_id, (_, item) in seen.items():\n",
399
+ " model_versions = item.get('modelVersions', [])\n",
400
+ " latest_version = get_latest_model_version(model_versions)\n",
401
+ " version_ids = [mv.get('id', '') for mv in model_versions[:20]]\n",
402
+ "\n",
403
+ " files = latest_version.get('files', [])\n",
404
+ " auto_hashes = files[0].get('hashes', {}) if files else {}\n",
405
+ " images = latest_version.get('images', [])\n",
406
+ " first_image_url = images[0]['url'] if images else ''\n",
407
+ " latest_image_url = images[-1]['url'] if images else ''\n",
408
+ "\n",
409
+ " username = item.get('creator', {}).get('username', '')\n",
410
+ " record = {\n",
411
+ " 'id': item.get('id', ''),\n",
412
+ " 'name': item.get('name', ''),\n",
413
+ " 'type': item.get('type', ''),\n",
414
+ " 'baseModel': latest_version.get('baseModel', ''),\n",
415
+ " 'downloadCount': item.get('stats', {}).get('downloadCount', 0),\n",
416
+ " 'nsfwLevel': item.get('nsfwLevel', 0),\n",
417
+ " 'modelVersions': len(model_versions),\n",
418
+ " 'publishedAt': latest_version.get('publishedAt', ''),\n",
419
+ " 'usernameHash': hash_username(username) if username else '',\n",
420
+ " 'downloadUrl': latest_version.get('downloadUrl', ''),\n",
421
+ " 'firstImageUrl': first_image_url,\n",
422
+ " 'latestImageUrl': latest_image_url,\n",
423
+ " 'poi': item.get('poi', False),\n",
424
+ " 'AutoV1': auto_hashes.get('AutoV1', ''),\n",
425
+ " 'AutoV2': auto_hashes.get('AutoV2', ''),\n",
426
+ " 'AutoV3': auto_hashes.get('AutoV3', ''),\n",
427
+ " 'SHA256': auto_hashes.get('SHA256', ''),\n",
428
+ " 'CRC32': auto_hashes.get('CRC32', ''),\n",
429
+ " 'BLAKE3': auto_hashes.get('BLAKE3', ''),\n",
430
+ " 'previewImage': latest_image_url\n",
431
+ " }\n",
432
+ "\n",
433
+ " for i in range(20):\n",
434
+ " record[f'version_id_{i+1}'] = version_ids[i] if i < len(version_ids) else ''\n",
435
+ "\n",
436
+ " tags = item.get('tags', [])\n",
437
+ " for i in range(7):\n",
438
+ " record[f'tag_{i+1}'] = tags[i] if i < len(tags) else ''\n",
439
+ "\n",
440
+ " data_records.append(record)\n",
441
+ "\n",
442
+ " return pd.DataFrame(data_records)\n",
443
+ "\n",
444
+ "# Usage Example\n",
445
+ "root_directory = current_dir.parent / 'data/raw/model_metadata/'\n",
446
+ "df = process_directory_recursively(root_directory)\n",
447
+ "df_sorted = df.sort_values(by='downloadCount', ascending=False)\n",
448
+ "\n",
449
+ "# Optionally save\n",
450
+ "# df_sorted.to_csv('combined_metadata.csv', index=False)\n"
451
+ ]
452
+ },
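+ {
+ "cell_type": "markdown",
+ "id": "added-sanity-note",
+ "metadata": {},
+ "source": [
+ "A quick sanity check on the consolidated frame before saving (a minimal sketch, assuming `df_sorted` from the cell above): after the per-id deduplication every model id should occur exactly once, and the type counts give a first overview of the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "added-sanity-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# df_sorted comes from the consolidation cell above\n",
+ "print(f\"Rows: {len(df_sorted)}, unique ids: {df_sorted['id'].nunique()}\")\n",
+ "assert df_sorted['id'].is_unique, 'Duplicate model ids survived deduplication'\n",
+ "print(df_sorted['type'].value_counts().head(10))"
+ ]
+ },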
453
+ {
454
+ "cell_type": "markdown",
455
+ "id": "2d6f822f-754f-4feb-9ed6-6f868421c454",
456
+ "metadata": {},
457
+ "source": [
458
+ "### Save model-data to CSV"
459
+ ]
460
+ },
461
+ {
462
+ "cell_type": "code",
463
+ "execution_count": null,
464
+ "id": "f861cf7b-5eb3-46ad-ab2d-9e2fdf2169b4",
465
+ "metadata": {
466
+ "execution": {
467
+ "iopub.execute_input": "2025-02-08T19:38:02.047421Z",
468
+ "iopub.status.busy": "2025-02-08T19:38:02.046761Z",
469
+ "iopub.status.idle": "2025-02-08T19:38:02.193381Z",
470
+ "shell.execute_reply": "2025-02-08T19:38:02.192886Z",
471
+ "shell.execute_reply.started": "2025-02-08T19:38:02.047397Z"
472
+ }
473
+ },
474
+ "outputs": [
475
+ {
476
+ "name": "stdout",
477
+ "output_type": "stream",
478
+ "text": [
479
+ "Data has been saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/Civiverse-Models.csv\n"
480
+ ]
481
+ }
482
+ ],
483
+ "source": [
484
+ "output_csv = current_dir.parent / 'data/CSV/Civiverse-Models_2025.csv'\n",
485
+ "output_csv.parent.mkdir(parents=True, exist_ok=True)\n",
486
+ "df_sorted.to_csv(output_csv, index=False)\n",
487
+ "print(f\"Data has been saved to {output_csv}\")"
488
+ ]
489
+ },
490
+ {
491
+ "cell_type": "markdown",
492
+ "id": "9b948933-2a2f-42b0-9083-2d81586ae3f2",
493
+ "metadata": {
494
+ "execution": {
495
+ "iopub.execute_input": "2025-02-06T13:42:32.447860Z",
496
+ "iopub.status.busy": "2025-02-06T13:42:32.446832Z",
497
+ "iopub.status.idle": "2025-02-06T13:42:32.453827Z",
498
+ "shell.execute_reply": "2025-02-06T13:42:32.453271Z",
499
+ "shell.execute_reply.started": "2025-02-06T13:42:32.447821Z"
500
+ }
501
+ },
502
+ "source": [
503
+ "### Step 3 Create Subsets: Checkpoint only, POI True, POI False"
504
+ ]
505
+ },
506
+ {
507
+ "cell_type": "code",
508
+ "execution_count": 11,
509
+ "id": "ae894e26-4984-40e6-80a6-54fb6b61c873",
510
+ "metadata": {
511
+ "execution": {
512
+ "iopub.execute_input": "2025-02-08T19:38:03.089160Z",
513
+ "iopub.status.busy": "2025-02-08T19:38:03.088490Z",
514
+ "iopub.status.idle": "2025-02-08T19:38:03.093067Z",
515
+ "shell.execute_reply": "2025-02-08T19:38:03.092634Z",
516
+ "shell.execute_reply.started": "2025-02-08T19:38:03.089138Z"
517
+ }
518
+ },
519
+ "outputs": [],
520
+ "source": [
521
+ "file_path = current_dir.parent / 'data/CSV/Civiverse-Models.csv' # Update this with your actual file path\n",
522
+ "(current_dir.parent / 'data/CSV/model_subsets').mkdir(parents=True, exist_ok=True)\n"
523
+ ]
524
+ },
525
+ {
526
+ "cell_type": "code",
527
+ "execution_count": 13,
528
+ "id": "88438f88-c723-423e-a4e5-e07f59096b72",
529
+ "metadata": {
530
+ "execution": {
531
+ "iopub.execute_input": "2025-02-08T19:41:39.279495Z",
532
+ "iopub.status.busy": "2025-02-08T19:41:39.279039Z",
533
+ "iopub.status.idle": "2025-02-08T19:41:39.674445Z",
534
+ "shell.execute_reply": "2025-02-08T19:41:39.673981Z",
535
+ "shell.execute_reply.started": "2025-02-08T19:41:39.279476Z"
536
+ }
537
+ },
538
+ "outputs": [
539
+ {
540
+ "name": "stdout",
541
+ "output_type": "stream",
542
+ "text": [
543
+ "Files saved successfully!\n"
544
+ ]
545
+ }
546
+ ],
547
+ "source": [
548
+ "import pandas as pd\n",
549
+ "\n",
550
+ "# Load the dataset\n",
551
+ "\n",
552
+ "data = pd.read_csv(file_path)\n",
553
+ "\n",
554
+ "# Version 1: Only 'poi' true models\n",
555
+ "poi_true_models = data[data['poi'] == True]\n",
556
+ "\n",
557
+ "# Version 2: Only types lora, dora, locon, textual inversion\n",
558
+ "specific_types = ['LORA', 'DORA', 'LOCON', 'textualInversion']\n",
559
+ "adapters = data[data['type'].isin(specific_types)]\n",
560
+ "\n",
561
+ "# Version 3: Only type checkpoint\n",
562
+ "checkpoint_models = data[data['type'] == 'Checkpoint']\n",
563
+ "\n",
564
+ "# Version 4: All models apart from 'poi' true\n",
565
+ "non_poi_models = data[data['poi'] != True]\n",
566
+ "\n",
567
+ "# Version 5: All models apart from 'poi' true and with nsfwLevel below 13\n",
568
+ "non_poi_low_nsfw_models = data[(data['poi'] != True) & (data['nsfwLevel'] < 13)]\n",
569
+ "\n",
570
+ "# Save the versions as separate CSV files\n",
571
+ "poi_true_models.to_csv(current_dir.parent / 'data/CSV/model_subsets/Civiverse_adapters_poi_true.csv', index=False)\n",
572
+ "adapters.to_csv(current_dir.parent / 'data/CSV/adapters.csv', index=False)\n",
573
+ "checkpoint_models.to_csv(current_dir.parent / 'data/CSV/model_subsets/Civiverse_checkpoint_only.csv', index=False)\n",
574
+ "non_poi_models.to_csv(current_dir.parent / 'data/CSV/model_subsets/Civiverse_adapters_poi_false.csv', index=False)\n",
575
+ "\n",
576
+ "print(\"Files saved successfully!\")\n"
577
+ ]
578
+ },
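+ {
+ "cell_type": "markdown",
+ "id": "added-subset-check-note",
+ "metadata": {},
+ "source": [
+ "A minimal sketch to verify the subsets just written (assumes the variables from the cell above): the POI-true and the complementary POI-false subsets should partition the full dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "added-subset-check-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(f\"Total: {len(data)}\")\n",
+ "print(f\"POI true: {len(poi_true_models)}, POI false/other: {len(non_poi_models)}\")\n",
+ "print(f\"Adapters: {len(adapters)}, Checkpoints: {len(checkpoint_models)}\")\n",
+ "assert len(poi_true_models) + len(non_poi_models) == len(data)"
+ ]
+ },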
579
+ {
580
+ "cell_type": "code",
581
+ "execution_count": null,
582
+ "id": "49e35088-8b2e-4189-83e1-3098d55dcad2",
583
+ "metadata": {},
584
+ "outputs": [],
585
+ "source": [
586
+ "import pandas as pd\n",
587
+ "import os\n",
588
+ "\n",
589
+ "# Load the dataset\n",
590
+ "df = pd.read_csv('data/all_models_with_tags.csv')\n",
591
+ "\n",
592
+ "# Filter for rows where poi is True\n",
593
+ "filtered_df = df[df['poi'] == True]\n",
594
+ "os.makedirs('data/model_subsets', exist_ok=True)\n",
595
+ "\n",
596
+ "# Save the filtered DataFrame to a new CSV file\n",
597
+ "filtered_df.to_csv('data/model_subsets/all_models_poi.csv', index=False)\n"
598
+ ]
599
+ },
600
+ {
601
+ "cell_type": "code",
602
+ "execution_count": null,
603
+ "id": "06c15f2c",
604
+ "metadata": {},
605
+ "outputs": [],
606
+ "source": [
607
+ "import pandas as pd\n",
608
+ "import os\n",
609
+ "\n",
610
+ "# Load the dataset\n",
611
+ "df = pd.read_csv('data/all_models_with_tags.csv')\n",
612
+ "\n",
613
+ "# Filter for rows where poi is True\n",
614
+ "filtered_df = df[df['poi'] == False]\n",
615
+ "os.makedirs('data/model_subsets', exist_ok=True)\n",
616
+ "\n",
617
+ "# Save the filtered DataFrame to a new CSV file\n",
618
+ "filtered_df.to_csv('data/model_subsets/all_models_poi_false.csv', index=False)\n"
619
+ ]
620
+ }
621
+ ],
622
+ "metadata": {
623
+ "kernelspec": {
624
+ "display_name": "latm",
625
+ "language": "python",
626
+ "name": "python3"
627
+ },
628
+ "language_info": {
629
+ "codemirror_mode": {
630
+ "name": "ipython",
631
+ "version": 3
632
+ },
633
+ "file_extension": ".py",
634
+ "mimetype": "text/x-python",
635
+ "name": "python",
636
+ "nbconvert_exporter": "python",
637
+ "pygments_lexer": "ipython3",
638
+ "version": "3.10.15"
639
+ }
640
+ },
641
+ "nbformat": 4,
642
+ "nbformat_minor": 5
643
+ }
jupyter_notebooks/Section_1_Figure_1_image_grid.ipynb ADDED
@@ -0,0 +1,417 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "e8752a29-06b6-49a2-a724-a7b1c8f3d23b",
6
+ "metadata": {
7
+ "execution": {
8
+ "iopub.execute_input": "2025-02-06T14:16:50.014780Z",
9
+ "iopub.status.busy": "2025-02-06T14:16:50.013936Z",
10
+ "iopub.status.idle": "2025-02-06T14:16:50.018030Z",
11
+ "shell.execute_reply": "2025-02-06T14:16:50.017506Z",
12
+ "shell.execute_reply.started": "2025-02-06T14:16:50.014757Z"
13
+ }
14
+ },
15
+ "source": [
16
+ "# Section 2. Download Images and create image Grid"
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "markdown",
21
+ "id": "6f81ff2d-7763-4ac5-a9e3-d9a74541e143",
22
+ "metadata": {
23
+ "execution": {
24
+ "iopub.execute_input": "2025-02-06T16:47:42.502304Z",
25
+ "iopub.status.busy": "2025-02-06T16:47:42.501200Z",
26
+ "iopub.status.idle": "2025-02-06T16:47:42.746600Z",
27
+ "shell.execute_reply": "2025-02-06T16:47:42.745736Z",
28
+ "shell.execute_reply.started": "2025-02-06T16:47:42.502269Z"
29
+ }
30
+ },
31
+ "source": [
32
+ "This script downloads images and places them under data/sorted/YYYY/YYYY-MM/YYYY-MM-DD. A grid is created that blurs all NSFW true images and assigns a color overlay based on the color-coding of the BrowsingLevel"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "markdown",
37
+ "id": "d6d7d826-353c-4c1b-9d69-fa67f07e5245",
38
+ "metadata": {
39
+ "execution": {
40
+ "iopub.execute_input": "2025-02-06T16:17:00.985310Z",
41
+ "iopub.status.busy": "2025-02-06T16:17:00.984540Z",
42
+ "iopub.status.idle": "2025-02-06T16:17:01.124348Z",
43
+ "shell.execute_reply": "2025-02-06T16:17:01.123740Z",
44
+ "shell.execute_reply.started": "2025-02-06T16:17:00.985282Z"
45
+ }
46
+ },
47
+ "source": [
48
+ "![Alt text](../plots/grid2x24.png)"
49
+ ]
50
+ },
51
+ {
52
+ "cell_type": "code",
53
+ "execution_count": 1,
54
+ "id": "6e5227f5-925c-4155-8015-53045794b986",
55
+ "metadata": {
56
+ "execution": {
57
+ "iopub.execute_input": "2025-02-08T21:58:48.202108Z",
58
+ "iopub.status.busy": "2025-02-08T21:58:48.201629Z",
59
+ "iopub.status.idle": "2025-02-08T21:58:48.485464Z",
60
+ "shell.execute_reply": "2025-02-08T21:58:48.484783Z",
61
+ "shell.execute_reply.started": "2025-02-08T21:58:48.202077Z"
62
+ }
63
+ },
64
+ "outputs": [],
65
+ "source": [
66
+ "from PIL import Image, ImageFilter, ImageDraw\n",
67
+ "import os\n",
68
+ "from pathlib import Path\n",
69
+ "import json\n",
70
+ "import matplotlib.colors as mcolors\n",
71
+ "import os\n",
72
+ "import json\n",
73
+ "import requests\n",
74
+ "from datetime import datetime\n",
75
+ "import argparse"
76
+ ]
77
+ },
78
+ {
79
+ "cell_type": "code",
80
+ "execution_count": 2,
81
+ "id": "25cb523d-6d22-4d13-84f1-f9e38534d08e",
82
+ "metadata": {
83
+ "execution": {
84
+ "iopub.execute_input": "2025-02-08T21:58:49.153040Z",
85
+ "iopub.status.busy": "2025-02-08T21:58:49.152592Z",
86
+ "iopub.status.idle": "2025-02-08T21:58:49.156982Z",
87
+ "shell.execute_reply": "2025-02-08T21:58:49.156561Z",
88
+ "shell.execute_reply.started": "2025-02-08T21:58:49.153017Z"
89
+ }
90
+ },
91
+ "outputs": [],
92
+ "source": [
93
+ "current_dir = Path.cwd()"
94
+ ]
95
+ },
96
+ {
97
+ "cell_type": "markdown",
98
+ "id": "037ecae3-a060-4d90-beaf-040dfb018696",
99
+ "metadata": {},
100
+ "source": [
101
+ "## Step 1: Download images "
102
+ ]
103
+ },
104
+ {
105
+ "cell_type": "code",
106
+ "execution_count": 3,
107
+ "id": "948809b0-e78a-4821-86d0-fe4eb156265a",
108
+ "metadata": {
109
+ "execution": {
110
+ "iopub.execute_input": "2025-02-08T21:58:50.811934Z",
111
+ "iopub.status.busy": "2025-02-08T21:58:50.811459Z",
112
+ "iopub.status.idle": "2025-02-08T21:58:50.823112Z",
113
+ "shell.execute_reply": "2025-02-08T21:58:50.822600Z",
114
+ "shell.execute_reply.started": "2025-02-08T21:58:50.811911Z"
115
+ }
116
+ },
117
+ "outputs": [],
118
+ "source": [
119
+ "import os\n",
120
+ "import json\n",
121
+ "import requests\n",
122
+ "from datetime import datetime\n",
123
+ "from PIL import Image\n",
124
+ "\n",
125
+ "def download_and_save_data(input_dir, output_dir):\n",
126
+ " print(f\"Scanning directory: {input_dir}\")\n",
127
+ " found_json = False\n",
128
+ " for root, dirs, files in os.walk(input_dir):\n",
129
+ " for file in files:\n",
130
+ " if file.endswith('.json'):\n",
131
+ " found_json = True\n",
132
+ " file_path = os.path.join(root, file)\n",
133
+ " print(f\"Processing JSON file: {file_path}\")\n",
134
+ " try:\n",
135
+ " with open(file_path, 'r', encoding='utf-8') as json_file:\n",
136
+ " items = json.load(json_file)\n",
137
+ " for item in items:\n",
138
+ " if isinstance(item, dict):\n",
139
+ " process_item(item, root, output_dir, input_dir)\n",
140
+ " except json.JSONDecodeError:\n",
141
+ " print(f\"Error decoding JSON from file {file_path}\")\n",
142
+ " except Exception as e:\n",
143
+ " print(f\"An error occurred with file {file_path}: {e}\")\n",
144
+ " if not found_json:\n",
145
+ " print(\"No JSON files found in the directory.\")\n",
146
+ "\n",
147
+ "def process_item(item, root, output_dir, input_dir):\n",
148
+ " if 'createdAt' in item and 'url' in item:\n",
149
+ " created_at = datetime.strptime(item['createdAt'], \"%Y-%m-%dT%H:%M:%S.%fZ\")\n",
150
+ " time_str = created_at.strftime(\"%Y%m%dT%H%M%S%f\") # Include microseconds in the filename\n",
151
+ " relative_path = os.path.relpath(root, input_dir)\n",
152
+ " save_path = os.path.join(output_dir, relative_path)\n",
153
+ " os.makedirs(save_path, exist_ok=True)\n",
154
+ " image_path = os.path.join(save_path, f\"{time_str}.jpeg\")\n",
155
+ " download_image(item['url'], image_path)\n",
156
+ " resize_image(image_path, 512) # Resize if necessary\n",
157
+ " save_prompt(item.get('meta', {}).get('prompt', 'No prompt available'), os.path.join(save_path, f\"{time_str}_positive.txt\"))\n",
158
+ " save_prompt(item.get('meta', {}).get('negativePrompt', 'No negative prompt available'), os.path.join(save_path, f\"{time_str}_negative.txt\"))\n",
159
+ " save_json(item, os.path.join(save_path, f\"{time_str}.json\"))\n",
160
+ " else:\n",
161
+ " print(\"Item missing 'createdAt' or 'url', skipping...\")\n",
162
+ "\n",
163
+ "def download_image(url, path):\n",
164
+ " try:\n",
165
+ " response = requests.get(url)\n",
166
+ " if response.status_code == 200:\n",
167
+ " with open(path, 'wb') as f:\n",
168
+ " f.write(response.content)\n",
169
+ " except requests.RequestException as e:\n",
170
+ " print(f\"Request failed for {url}: {e}\")\n",
171
+ "\n",
172
+ "def resize_image(image_path, max_size):\n",
173
+ " with Image.open(image_path) as img:\n",
174
+ " if img.width > max_size or img.height > max_size:\n",
175
+ " img.thumbnail((max_size, max_size))\n",
176
+ " img.save(image_path)\n",
177
+ "\n",
178
+ "def save_prompt(prompt, path):\n",
179
+ " with open(path, 'w') as f:\n",
180
+ " f.write(prompt)\n",
181
+ "\n",
182
+ "def save_json(data, path):\n",
183
+ " with open(path, 'w', encoding='utf-8') as f:\n",
184
+ " json.dump(data, f, indent=4)\n",
185
+ "\n"
186
+ ]
187
+ },
188
+ {
189
+ "cell_type": "code",
190
+ "execution_count": 4,
191
+ "id": "133e143d-1bb5-4145-8d59-323570bb6e95",
192
+ "metadata": {
193
+ "execution": {
194
+ "iopub.execute_input": "2025-02-08T21:58:51.566234Z",
195
+ "iopub.status.busy": "2025-02-08T21:58:51.565711Z",
196
+ "iopub.status.idle": "2025-02-08T21:58:51.568973Z",
197
+ "shell.execute_reply": "2025-02-08T21:58:51.568474Z",
198
+ "shell.execute_reply.started": "2025-02-08T21:58:51.566214Z"
199
+ }
200
+ },
201
+ "outputs": [],
202
+ "source": [
203
+ "input_dir = current_dir.parent / 'data/sorted/image_metadata/'\n",
204
+ "images = current_dir.parent / 'data/sorted/images'"
205
+ ]
206
+ },
207
+ {
208
+ "cell_type": "markdown",
209
+ "id": "e11c4ad5-c040-42b7-a1b9-a0e21b33ce20",
210
+ "metadata": {},
211
+ "source": [
212
+ "uncomment this to download the images, otherwise proceed with grid creation"
213
+ ]
214
+ },
215
+ {
216
+ "cell_type": "code",
217
+ "execution_count": 6,
218
+ "id": "88a6ca8e-51ee-4d27-ba0c-5f026d5750df",
219
+ "metadata": {
220
+ "execution": {
221
+ "iopub.execute_input": "2025-02-08T21:58:52.986899Z",
222
+ "iopub.status.busy": "2025-02-08T21:58:52.986456Z",
223
+ "iopub.status.idle": "2025-02-08T21:58:52.989508Z",
224
+ "shell.execute_reply": "2025-02-08T21:58:52.988892Z",
225
+ "shell.execute_reply.started": "2025-02-08T21:58:52.986878Z"
226
+ }
227
+ },
228
+ "outputs": [
229
+ {
230
+ "name": "stdout",
231
+ "output_type": "stream",
232
+ "text": [
233
+ "Scanning directory: /home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/sorted/image_metadata\n",
234
+ "No JSON files found in the directory.\n"
235
+ ]
236
+ }
237
+ ],
238
+ "source": [
239
+ "download_and_save_data(input_dir, images) "
240
+ ]
241
+ },
242
+ {
243
+ "cell_type": "code",
244
+ "execution_count": null,
245
+ "id": "f247091b-7b45-4cd4-a6ca-3d4dea414e19",
246
+ "metadata": {
247
+ "execution": {
248
+ "iopub.execute_input": "2025-02-08T21:58:54.127935Z",
249
+ "iopub.status.busy": "2025-02-08T21:58:54.127626Z",
250
+ "iopub.status.idle": "2025-02-08T21:58:54.254290Z",
251
+ "shell.execute_reply": "2025-02-08T21:58:54.253243Z",
252
+ "shell.execute_reply.started": "2025-02-08T21:58:54.127914Z"
253
+ }
254
+ },
255
+ "outputs": [
256
+ {
257
+ "name": "stdout",
258
+ "output_type": "stream",
259
+ "text": [
260
+ "Grid image saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/plots/grid10x120.png\n"
261
+ ]
262
+ }
263
+ ],
264
+ "source": [
265
+ "import random\n",
266
+ "\n",
267
+ "# For creating the figure grid\n",
268
+ "images = []\n",
269
+ "all_valid_images = []\n",
270
+ "\n",
271
+ "\n",
272
+ "directory = current_dir.parent / 'data/sorted/images/'\n",
273
+ "#directory = '/home/lauwag/shares/laura_wagner/Civitai_page_analysis/Civitai_dataset/dataset/chronological/full/prompts-images/2024/2024-05/2024-05-31/'\n",
274
+ "output = current_dir.parent / 'plots/grid10x120.png'\n",
275
+ "\n",
276
+ "cell_size = 64 # Resize to 32x32 pixels\n",
277
+ "grid_size = (10, 120)\n",
278
+ "\n",
279
+ "\n",
280
+ "def apply_pixelation(img, level):\n",
281
+ " \"\"\" Apply pixelation based on the level. \"\"\"\n",
282
+ " if level > 1:\n",
283
+ " pixel_size = 8 # Adjust this value for more or less pixelation\n",
284
+ " img_small = img.resize(\n",
285
+ " (img.width // pixel_size, img.height // pixel_size), Image.NEAREST\n",
286
+ " )\n",
287
+ " img_pixelated = img_small.resize(img.size, Image.NEAREST)\n",
288
+ " return img_pixelated\n",
289
+ " return img\n",
290
+ "\n",
291
+ "def apply_blur(img, level):\n",
292
+ " \"\"\" Apply Gaussian blur based on the level of browsing. \"\"\"\n",
293
+ " if level > 1:\n",
294
+ " return img.filter(ImageFilter.GaussianBlur(radius=1.75)) # Adjust radius as needed\n",
295
+ " return img\n",
296
+ "\n",
297
+ "def apply_border_and_overlay(img, color):\n",
298
+ " \"\"\" Apply a colored border and a slight color overlay. \"\"\"\n",
299
+ " # Convert Matplotlib color name to RGB\n",
300
+ " rgb_color = tuple(int(x * 255) for x in mcolors.to_rgb(color))\n",
301
+ " # Create a border\n",
302
+ " border_size = 0 # Adjust border size as needed\n",
303
+ " border_img = Image.new('RGB', (img.width + 2 * border_size, img.height + 2 * border_size), rgb_color)\n",
304
+ " border_img.paste(img, (border_size, border_size))\n",
305
+ " # Create an overlay\n",
306
+ " overlay = Image.new('RGBA', border_img.size, (*rgb_color, 0)) # Semi-transparent overlay\n",
307
+ " #overlay = Image.new('RGBA', border_img.size, (*rgb_color, 128)) # Semi-transparent overlay\n",
308
+ " final_img = Image.alpha_composite(border_img.convert('RGBA'), overlay)\n",
309
+ " return final_img.convert('RGB')\n",
310
+ "\n",
311
+ "def process_image(image_path, json_path, cell_size):\n",
312
+ " \"\"\" Process each image: resize, crop, blur, and add border and overlay based on JSON metadata. \"\"\"\n",
313
+ " with open(json_path, 'r') as f:\n",
314
+ " metadata = json.load(f)\n",
315
+ " browsing_level = metadata.get('browsingLevel', 1)\n",
316
+ "\n",
317
+ " with Image.open(image_path) as img:\n",
318
+ " img = resize_and_crop_image(img, cell_size)\n",
319
+ " img = apply_pixelation(img, browsing_level)\n",
320
+ "\n",
321
+ " color_map = {\n",
322
+ " 2: 'rosybrown', # Matplotlib color name\n",
323
+ " 4: 'coral',\n",
324
+ " 8: 'red',\n",
325
+ " 16: 'magenta'\n",
326
+ " }\n",
327
+ " if browsing_level in color_map:\n",
328
+ " img = apply_border_and_overlay(img, color_map[browsing_level])\n",
329
+ "\n",
330
+ " return img\n",
331
+ "\n",
332
+ "def resize_and_crop_image(img, output_size):\n",
333
+ " \"\"\" Resize and crop the image to a square of the specified size. \"\"\"\n",
334
+ " ratio = min(img.width / output_size, img.height / output_size)\n",
335
+ " new_size = (int(img.width / ratio), int(img.height / ratio))\n",
336
+ " img = img.resize(new_size, Image.Resampling.LANCZOS)\n",
337
+ " left = (img.width - output_size) // 2\n",
338
+ " top = (img.height - output_size) // 2\n",
339
+ " right = left + output_size\n",
340
+ " bottom = top + output_size\n",
341
+ " return img.crop((left, top, right, bottom))\n",
342
+ "\n",
343
+ "# Example usage and the rest of your script remains unchanged\n",
344
+ "\n",
345
+ "# Example usage\n",
346
+ " # 4x4 grid\n",
347
+ "\n",
348
+ "file_types = ('png', 'jpg', 'jpeg') # Define acceptable image file types\n",
349
+ "\n",
350
+ "\n",
351
+ "\n",
352
+ "\n",
353
+ "\n",
354
+ "# First, collect all valid image paths\n",
355
+ "for root, dirs, files in os.walk(directory):\n",
356
+ " for file in files:\n",
357
+ " if file.lower().endswith(file_types):\n",
358
+ " image_path = os.path.join(root, file)\n",
359
+ " json_path = image_path.rsplit('.', 1)[0] + '.json'\n",
360
+ " if os.path.exists(json_path):\n",
361
+ " all_valid_images.append((image_path, json_path))\n",
362
+ "\n",
363
+ "# Randomly sample from valid images\n",
364
+ "num_needed = grid_size[0] * grid_size[1]\n",
365
+ "if len(all_valid_images) >= num_needed:\n",
366
+ " random.seed(42) # For reproducibility\n",
367
+ " sampled_images = random.sample(all_valid_images, num_needed)\n",
368
+ " \n",
369
+ " for image_path, json_path in sampled_images:\n",
370
+ " img = process_image(image_path, json_path, cell_size)\n",
371
+ " images.append(img)\n",
372
+ "else:\n",
373
+ " print(f\"Warning: Only {len(all_valid_images)} valid images found, need {num_needed}\")\n",
374
+ "\n",
375
+ "\n",
376
+ "# Create the grid image\n",
377
+ "grid_img = Image.new('RGB', (grid_size[1] * cell_size, grid_size[0] * cell_size))\n",
378
+ "for index, img in enumerate(images):\n",
379
+ " x = (index % grid_size[1]) * cell_size\n",
380
+ " y = (index // grid_size[1]) * cell_size\n",
381
+ " grid_img.paste(img, (x, y))\n",
382
+ "\n",
383
+ "grid_img.save(output)\n",
384
+ "print(f\"Grid image saved to {output}\")"
385
+ ]
386
+ },
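+ {
+ "cell_type": "markdown",
+ "id": "added-level-count-note",
+ "metadata": {},
+ "source": [
+ "To document what the overlay colors in the grid encode, a minimal sketch that counts the browsingLevel values of the sampled images (assumes `sampled_images` from the cell above, i.e. that enough valid images were found; the level-to-color mapping mirrors the `color_map` used in `process_image`)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "added-level-count-code",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from collections import Counter\n",
+ "import json\n",
+ "\n",
+ "# Count browsing levels among the images actually placed in the grid\n",
+ "levels = Counter()\n",
+ "for _, json_path in sampled_images:\n",
+ "    with open(json_path, 'r') as f:\n",
+ "        levels[json.load(f).get('browsingLevel', 1)] += 1\n",
+ "\n",
+ "color_map = {2: 'rosybrown', 4: 'coral', 8: 'red', 16: 'magenta'}\n",
+ "for level, count in sorted(levels.items()):\n",
+ "    print(f\"browsingLevel {level} ({color_map.get(level, 'no overlay')}): {count}\")"
+ ]
+ },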
387
+ {
388
+ "cell_type": "code",
389
+ "execution_count": null,
390
+ "id": "28bba299-c5ef-449b-a7fa-1afdc5e26262",
391
+ "metadata": {},
392
+ "outputs": [],
393
+ "source": []
394
+ }
395
+ ],
396
+ "metadata": {
397
+ "kernelspec": {
398
+ "display_name": "latm",
399
+ "language": "python",
400
+ "name": "python3"
401
+ },
402
+ "language_info": {
403
+ "codemirror_mode": {
404
+ "name": "ipython",
405
+ "version": 3
406
+ },
407
+ "file_extension": ".py",
408
+ "mimetype": "text/x-python",
409
+ "name": "python",
410
+ "nbconvert_exporter": "python",
411
+ "pygments_lexer": "ipython3",
412
+ "version": "3.10.15"
413
+ }
414
+ },
415
+ "nbformat": 4,
416
+ "nbformat_minor": 5
417
+ }
jupyter_notebooks/Section_2-2-2_Figure_3_histogram_monthly_images_nsfw_levels.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
jupyter_notebooks/Section_2-2-2_Figure_4_Demographic_patterns_in_gen_images.ipynb ADDED
@@ -0,0 +1,1795 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "d8161f06-4fd2-436d-9e59-3f68b5a67f2c",
6
+ "metadata": {
7
+ "execution": {
8
+ "iopub.execute_input": "2025-02-06T18:30:16.974712Z",
9
+ "iopub.status.busy": "2025-02-06T18:30:16.974296Z",
10
+ "iopub.status.idle": "2025-02-06T18:30:16.976909Z",
11
+ "shell.execute_reply": "2025-02-06T18:30:16.976526Z",
12
+ "shell.execute_reply.started": "2025-02-06T18:30:16.974692Z"
13
+ }
14
+ },
15
+ "source": [
16
+ "# Section 6.2: Age and Gender Estimation using MiVOLO"
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "markdown",
21
+ "id": "64aaedec-ef56-4a62-b61e-12de2675a1ae",
22
+ "metadata": {
23
+ "execution": {
24
+ "iopub.execute_input": "2025-02-06T19:52:51.171282Z",
25
+ "iopub.status.busy": "2025-02-06T19:52:51.170711Z",
26
+ "iopub.status.idle": "2025-02-06T19:52:55.405039Z",
27
+ "shell.execute_reply": "2025-02-06T19:52:55.404308Z",
28
+ "shell.execute_reply.started": "2025-02-06T19:52:51.171245Z"
29
+ }
30
+ },
31
+ "source": [
32
+ "![Alt text](../plots/mivolo.svg)"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "code",
37
+ "execution_count": 1,
38
+ "id": "4293f307-44fd-455e-90fe-6e6928be9af5",
39
+ "metadata": {
40
+ "execution": {
41
+ "iopub.execute_input": "2025-02-08T21:59:21.970807Z",
42
+ "iopub.status.busy": "2025-02-08T21:59:21.969931Z",
43
+ "iopub.status.idle": "2025-02-08T22:00:09.724295Z",
44
+ "shell.execute_reply": "2025-02-08T22:00:09.723583Z",
45
+ "shell.execute_reply.started": "2025-02-08T21:59:21.970784Z"
46
+ }
47
+ },
48
+ "outputs": [
49
+ {
50
+ "name": "stderr",
51
+ "output_type": "stream",
52
+ "text": [
53
+ "/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
54
+ " from .autonotebook import tqdm as notebook_tqdm\n"
55
+ ]
56
+ }
57
+ ],
58
+ "source": [
59
+ "import csv\n",
60
+ "from pathlib import Path\n",
61
+ "import logging\n",
62
+ "import os\n",
63
+ "import pandas as pd\n",
64
+ "import requests\n",
65
+ "import numpy as np\n",
66
+ "import torch\n",
67
+ "import cv2\n",
68
+ "from io import BytesIO\n",
69
+ "from PIL import Image, UnidentifiedImageError\n",
70
+ "from datetime import datetime, timedelta\n",
71
+ "from dateutil.relativedelta import relativedelta\n",
72
+ "from mivolo.predictor import Predictor\n",
73
+ "import matplotlib.pyplot as plt\n",
74
+ "import matplotlib.patches as mpatches"
75
+ ]
76
+ },
77
+ {
78
+ "cell_type": "code",
79
+ "execution_count": 2,
80
+ "id": "63a54f5f-900c-48dd-8932-2632e56c5670",
81
+ "metadata": {
82
+ "execution": {
83
+ "iopub.execute_input": "2025-02-08T22:00:09.726069Z",
84
+ "iopub.status.busy": "2025-02-08T22:00:09.725699Z",
85
+ "iopub.status.idle": "2025-02-08T22:00:09.730626Z",
86
+ "shell.execute_reply": "2025-02-08T22:00:09.730099Z",
87
+ "shell.execute_reply.started": "2025-02-08T22:00:09.726050Z"
88
+ }
89
+ },
90
+ "outputs": [],
91
+ "source": [
92
+ "current_dir = Path.cwd()\n",
93
+ "mini = current_dir.parent / 'data/CSV/image_subsets/Civiverse-mini.csv'\n",
94
+ "mivolo_in = current_dir.parent / 'data/CSV/image_subsets/Civiverse-mini-by-month/'\n",
95
+ "(current_dir.parent / 'data/CSV/image_subsets/Civiverse-mini-by-month/').mkdir(parents=True, exist_ok=True)"
96
+ ]
97
+ },
98
+ {
99
+ "cell_type": "code",
100
+ "execution_count": 3,
101
+ "id": "1fdfb89a-6094-4382-8755-fae213221ea5",
102
+ "metadata": {
103
+ "execution": {
104
+ "iopub.execute_input": "2025-02-08T22:00:09.731540Z",
105
+ "iopub.status.busy": "2025-02-08T22:00:09.731359Z",
106
+ "iopub.status.idle": "2025-02-08T22:00:09.825738Z",
107
+ "shell.execute_reply": "2025-02-08T22:00:09.825258Z",
108
+ "shell.execute_reply.started": "2025-02-08T22:00:09.731524Z"
109
+ }
110
+ },
111
+ "outputs": [],
112
+ "source": [
113
+ "def split_by_month(input_path, output_dir):\n",
114
+ " # Load the dataset\n",
115
+ " df = pd.read_csv(input_path)\n",
116
+ " \n",
117
+ " # Convert the 'createdAt' column to datetime\n",
118
+ " df['createdAt'] = pd.to_datetime(df['createdAt'], errors='coerce')\n",
119
+ " \n",
120
+ " # Extract year and month\n",
121
+ " df['year_month'] = df['createdAt'].dt.to_period('M')\n",
122
+ " \n",
123
+ " # Group the data by year and month and save each group as a CSV file\n",
124
+ " unique_months = df['year_month'].unique()\n",
125
+ "\n",
126
+ " for month in unique_months:\n",
127
+ " # Filter data for the specific month\n",
128
+ " df_month = df[df['year_month'] == month]\n",
129
+ " \n",
130
+ " # Define the file name based on the year and month\n",
131
+ " file_name = f'{output_dir}/Civiverse-{month}.csv'\n",
132
+ " \n",
133
+ " # Save the file\n",
134
+ " df_month.to_csv(file_name, index=False)\n",
135
+ "\n",
136
+ " print(f\"Data has been split and saved to {output_dir}\")"
137
+ ]
138
+ },
139
+ {
140
+ "cell_type": "code",
141
+ "execution_count": 4,
142
+ "id": "2c909d7c-7d16-4dc7-8364-7f1c0784414c",
143
+ "metadata": {
144
+ "execution": {
145
+ "iopub.execute_input": "2025-02-08T22:00:09.827095Z",
146
+ "iopub.status.busy": "2025-02-08T22:00:09.826919Z",
147
+ "iopub.status.idle": "2025-02-08T22:00:10.479484Z",
148
+ "shell.execute_reply": "2025-02-08T22:00:10.478777Z",
149
+ "shell.execute_reply.started": "2025-02-08T22:00:09.827079Z"
150
+ }
151
+ },
152
+ "outputs": [
153
+ {
154
+ "name": "stderr",
155
+ "output_type": "stream",
156
+ "text": [
157
+ "/sctmp/lauwag/ipykernel_1497673/1825509207.py:9: UserWarning: Converting to PeriodArray/Index representation will drop timezone information.\n",
158
+ " df['year_month'] = df['createdAt'].dt.to_period('M')\n"
159
+ ]
160
+ },
161
+ {
162
+ "name": "stdout",
163
+ "output_type": "stream",
164
+ "text": [
165
+ "Data has been split and saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month\n"
166
+ ]
167
+ }
168
+ ],
169
+ "source": [
170
+ "split_by_month(mini, mivolo_in)"
171
+ ]
172
+ },
173
+ {
174
+ "cell_type": "code",
175
+ "execution_count": 5,
176
+ "id": "4c543306-ffc8-4b9c-a3df-b49b2271caa9",
177
+ "metadata": {
178
+ "execution": {
179
+ "iopub.execute_input": "2025-02-08T22:00:10.480505Z",
180
+ "iopub.status.busy": "2025-02-08T22:00:10.480310Z",
181
+ "iopub.status.idle": "2025-02-08T22:00:10.483961Z",
182
+ "shell.execute_reply": "2025-02-08T22:00:10.483400Z",
183
+ "shell.execute_reply.started": "2025-02-08T22:00:10.480486Z"
184
+ }
185
+ },
186
+ "outputs": [],
187
+ "source": [
188
+ "mivolo_out = current_dir.parent / 'data/CSV/image_subsets/MiVOLO-results/'\n",
189
+ "mivolo_out.mkdir(parents=True, exist_ok=True) # Create the output directory if it doesn't exist"
190
+ ]
191
+ },
192
+ {
193
+ "cell_type": "markdown",
194
+ "id": "ffb7dd23",
195
+ "metadata": {},
196
+ "source": [
197
+ "## MiVOLO gender and age inference"
198
+ ]
199
+ },
200
+ {
201
+ "cell_type": "code",
202
+ "execution_count": null,
203
+ "id": "304ed12f-c7b6-4129-b24d-7ccc793a62c7",
204
+ "metadata": {
205
+ "execution": {
206
+ "iopub.execute_input": "2025-02-08T22:00:10.484802Z",
207
+ "iopub.status.busy": "2025-02-08T22:00:10.484639Z"
208
+ }
209
+ },
210
+ "outputs": [
211
+ {
212
+ "name": "stderr",
213
+ "output_type": "stream",
214
+ "text": [
215
+ "/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/ultralytics/nn/tasks.py:634: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
216
+ " return torch.load(file, map_location=\"cpu\"), file # load\n"
217
+ ]
218
+ },
219
+ {
220
+ "name": "stdout",
221
+ "output_type": "stream",
222
+ "text": [
223
+ "Model summary (fused): 268 layers, 68125494 parameters, 0 gradients, 257.4 GFLOPs\n"
224
+ ]
225
+ },
226
+ {
227
+ "name": "stderr",
228
+ "output_type": "stream",
229
+ "text": [
230
+ "[W208 23:00:15.738708520 NNPACK.cpp:61] Could not initialize NNPACK! Reason: Unsupported hardware.\n",
231
+ "/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/mivolo/model/mi_volo.py:33: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
232
+ " state = torch.load(ckpt_path, map_location=\"cpu\")\n",
233
+ "INFO:MiVOLO:Model meta:\n",
234
+ "min_age: 1, max_age: 95, avg_age: 48.0, num_classes: 3, in_chans: 6, with_persons_model: True, disable_faces: False, use_persons: True, only_age: False, num_classes_gender: 2, input_size: 224, use_person_crops: True, use_face_crops: True\n",
235
+ "/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/timm/models/_helpers.py:39: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
236
+ " checkpoint = torch.load(checkpoint_path, map_location='cpu')\n",
237
+ "INFO:timm.models._helpers:Loaded state_dict from checkpoint '/shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/ext/MiVOLO/models/mivolo_imbd.pth.tar'\n",
238
+ "INFO:MiVOLO:Model mivolo_d1_224 created, param count: 27432414\n",
239
+ "INFO:timm.data.config:Data processing configuration for current model + dataset:\n",
240
+ "INFO:timm.data.config:\tinput_size: (3, 224, 224)\n",
241
+ "INFO:timm.data.config:\tinterpolation: bicubic\n",
242
+ "INFO:timm.data.config:\tmean: (0.485, 0.456, 0.406)\n",
243
+ "INFO:timm.data.config:\tstd: (0.229, 0.224, 0.225)\n",
244
+ "INFO:timm.data.config:\tcrop_pct: 0.96\n",
245
+ "INFO:timm.data.config:\tcrop_mode: center\n"
246
+ ]
247
+ },
248
+ {
249
+ "name": "stdout",
250
+ "output_type": "stream",
251
+ "text": [
252
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2022-11.csv\n",
253
+ "\n",
254
+ "0: 640x640 (no detections), 723.9ms\n",
255
+ "Speed: 12.1ms preprocess, 723.9ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 640)\n",
256
+ "Processed and saved 1 images so far.\n",
257
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2022-11.csv\n",
258
+ "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2022-12.csv\n",
259
+ "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-01.csv\n",
260
+ "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-02.csv\n",
261
+ "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-03.csv\n",
262
+ "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-04.csv\n",
263
+ "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-05.csv\n",
264
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-06.csv\n",
265
+ "\n",
266
+ "0: 416x640 1 person, 455.1ms\n",
267
+ "Speed: 3.5ms preprocess, 455.1ms inference, 33.5ms postprocess per image at shape (1, 3, 416, 640)\n"
268
+ ]
269
+ },
270
+ {
271
+ "name": "stderr",
272
+ "output_type": "stream",
273
+ "text": [
274
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
275
+ "INFO:MiVOLO:\tage: 32.89\n",
276
+ "INFO:MiVOLO:\tgender: male [99%]\n"
277
+ ]
278
+ },
279
+ {
280
+ "name": "stdout",
281
+ "output_type": "stream",
282
+ "text": [
283
+ "Processed and saved 1 images so far.\n",
284
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-06.csv\n",
285
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-07.csv\n",
286
+ "\n",
287
+ "0: 640x320 1 person, 395.7ms\n",
288
+ "Speed: 2.9ms preprocess, 395.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 320)\n"
289
+ ]
290
+ },
291
+ {
292
+ "name": "stderr",
293
+ "output_type": "stream",
294
+ "text": [
295
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
296
+ "INFO:MiVOLO:\tage: 33.49\n",
297
+ "INFO:MiVOLO:\tgender: female [99%]\n"
298
+ ]
299
+ },
300
+ {
301
+ "name": "stdout",
302
+ "output_type": "stream",
303
+ "text": [
304
+ "Processed and saved 1 images so far.\n",
305
+ "\n",
306
+ "0: 640x448 1 person, 1 face, 478.5ms\n",
307
+ "Speed: 1.9ms preprocess, 478.5ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
308
+ ]
309
+ },
310
+ {
311
+ "name": "stderr",
312
+ "output_type": "stream",
313
+ "text": [
314
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
315
+ "INFO:MiVOLO:\tage: 17.81\n",
316
+ "INFO:MiVOLO:\tgender: female [99%]\n"
317
+ ]
318
+ },
319
+ {
320
+ "name": "stdout",
321
+ "output_type": "stream",
322
+ "text": [
323
+ "Processed and saved 2 images so far.\n",
324
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-07.csv\n",
325
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-08.csv\n",
326
+ "\n",
327
+ "0: 640x448 1 person, 478.0ms\n",
328
+ "Speed: 2.9ms preprocess, 478.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
329
+ ]
330
+ },
331
+ {
332
+ "name": "stderr",
333
+ "output_type": "stream",
334
+ "text": [
335
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
336
+ "INFO:MiVOLO:\tage: 40.62\n",
337
+ "INFO:MiVOLO:\tgender: male [99%]\n"
338
+ ]
339
+ },
340
+ {
341
+ "name": "stdout",
342
+ "output_type": "stream",
343
+ "text": [
344
+ "Processed and saved 1 images so far.\n",
345
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-08.csv\n",
346
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-09.csv\n",
347
+ "\n",
348
+ "0: 416x640 (no detections), 567.1ms\n",
349
+ "Speed: 2.4ms preprocess, 567.1ms inference, 0.4ms postprocess per image at shape (1, 3, 416, 640)\n",
350
+ "Processed and saved 1 images so far.\n",
351
+ "\n",
352
+ "0: 320x640 (no detections), 393.6ms\n",
353
+ "Speed: 1.7ms preprocess, 393.6ms inference, 0.4ms postprocess per image at shape (1, 3, 320, 640)\n",
354
+ "\n",
355
+ "0: 640x640 (no detections), 711.9ms\n",
356
+ "Speed: 3.4ms preprocess, 711.9ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
357
+ "\n",
358
+ "0: 640x640 (no detections), 699.8ms\n",
359
+ "Speed: 2.3ms preprocess, 699.8ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
360
+ "\n",
361
+ "0: 640x576 1 person, 629.6ms\n",
362
+ "Speed: 2.4ms preprocess, 629.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 576)\n"
363
+ ]
364
+ },
365
+ {
366
+ "name": "stderr",
367
+ "output_type": "stream",
368
+ "text": [
369
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
370
+ "INFO:MiVOLO:\tage: 28.65\n",
371
+ "INFO:MiVOLO:\tgender: female [99%]\n"
372
+ ]
373
+ },
374
+ {
375
+ "name": "stdout",
376
+ "output_type": "stream",
377
+ "text": [
378
+ "\n",
379
+ "0: 640x448 1 person, 1 face, 598.3ms\n",
380
+ "Speed: 2.1ms preprocess, 598.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
381
+ ]
382
+ },
383
+ {
384
+ "name": "stderr",
385
+ "output_type": "stream",
386
+ "text": [
387
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
388
+ "INFO:MiVOLO:\tage: 25.85\n",
389
+ "INFO:MiVOLO:\tgender: female [99%]\n",
390
+ "ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/98848c97-3d1e-4b52-9967-aeeca354a30e/width=656/98848c97-3d1e-4b52-9967-aeeca354a30e.mp4: cannot identify image file <_io.BytesIO object at 0x14cb00133740>\n"
391
+ ]
392
+ },
393
+ {
394
+ "name": "stdout",
395
+ "output_type": "stream",
396
+ "text": [
397
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-09.csv\n",
398
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-10.csv\n"
399
+ ]
400
+ },
401
+ {
402
+ "name": "stderr",
403
+ "output_type": "stream",
404
+ "text": [
405
+ "ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/e6469288-b487-4a06-99c1-59e7ac22fa77/width=1024/e6469288-b487-4a06-99c1-59e7ac22fa77.mp4: cannot identify image file <_io.BytesIO object at 0x14cb00ecbdd0>\n"
406
+ ]
407
+ },
408
+ {
409
+ "name": "stdout",
410
+ "output_type": "stream",
411
+ "text": [
412
+ "\n",
413
+ "0: 448x640 (no detections), 536.6ms\n",
414
+ "Speed: 10.1ms preprocess, 536.6ms inference, 0.4ms postprocess per image at shape (1, 3, 448, 640)\n",
415
+ "Processed and saved 2 images so far.\n",
416
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-10.csv\n",
417
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-11.csv\n",
418
+ "\n",
419
+ "0: 640x448 1 person, 1 face, 662.9ms\n",
420
+ "Speed: 2.6ms preprocess, 662.9ms inference, 0.9ms postprocess per image at shape (1, 3, 640, 448)\n"
421
+ ]
422
+ },
423
+ {
424
+ "name": "stderr",
425
+ "output_type": "stream",
426
+ "text": [
427
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
428
+ "INFO:MiVOLO:\tage: 17.0\n",
429
+ "INFO:MiVOLO:\tgender: female [99%]\n"
430
+ ]
431
+ },
432
+ {
433
+ "name": "stdout",
434
+ "output_type": "stream",
435
+ "text": [
436
+ "Processed and saved 1 images so far.\n",
437
+ "\n",
438
+ "0: 640x384 1 person, 895.9ms\n",
439
+ "Speed: 2.0ms preprocess, 895.9ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 384)\n"
440
+ ]
441
+ },
442
+ {
443
+ "name": "stderr",
444
+ "output_type": "stream",
445
+ "text": [
446
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
447
+ "INFO:MiVOLO:\tage: 43.33\n",
448
+ "INFO:MiVOLO:\tgender: male [99%]\n"
449
+ ]
450
+ },
451
+ {
452
+ "name": "stdout",
453
+ "output_type": "stream",
454
+ "text": [
455
+ "\n",
456
+ "0: 640x448 (no detections), 529.4ms\n",
457
+ "Speed: 2.6ms preprocess, 529.4ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
458
+ "\n",
459
+ "0: 640x448 1 person, 539.3ms\n",
460
+ "Speed: 2.8ms preprocess, 539.3ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
461
+ ]
462
+ },
463
+ {
464
+ "name": "stderr",
465
+ "output_type": "stream",
466
+ "text": [
467
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
468
+ "INFO:MiVOLO:\tage: 39.15\n",
469
+ "INFO:MiVOLO:\tgender: male [99%]\n"
470
+ ]
471
+ },
472
+ {
473
+ "name": "stdout",
474
+ "output_type": "stream",
475
+ "text": [
476
+ "\n",
477
+ "0: 640x448 1 person, 1 face, 708.6ms\n",
478
+ "Speed: 2.5ms preprocess, 708.6ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
479
+ ]
480
+ },
481
+ {
482
+ "name": "stderr",
483
+ "output_type": "stream",
484
+ "text": [
485
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
486
+ "INFO:MiVOLO:\tage: 29.64\n",
487
+ "INFO:MiVOLO:\tgender: female [99%]\n",
488
+ "ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/ad4b5481-0a58-4196-a3e8-fca2fe22a3cc/width=1080/ad4b5481-0a58-4196-a3e8-fca2fe22a3cc.mp4: cannot identify image file <_io.BytesIO object at 0x14cb010c24d0>\n"
489
+ ]
490
+ },
491
+ {
492
+ "name": "stdout",
493
+ "output_type": "stream",
494
+ "text": [
495
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-11.csv\n",
496
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-12.csv\n",
497
+ "\n",
498
+ "0: 640x384 1 person, 1 face, 461.0ms\n",
499
+ "Speed: 2.4ms preprocess, 461.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 384)\n"
500
+ ]
501
+ },
502
+ {
503
+ "name": "stderr",
504
+ "output_type": "stream",
505
+ "text": [
506
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
507
+ "INFO:MiVOLO:\tage: 19.61\n",
508
+ "INFO:MiVOLO:\tgender: female [99%]\n"
509
+ ]
510
+ },
511
+ {
512
+ "name": "stdout",
513
+ "output_type": "stream",
514
+ "text": [
515
+ "Processed and saved 1 images so far.\n",
516
+ "\n",
517
+ "0: 640x448 1 person, 1 face, 501.3ms\n",
518
+ "Speed: 3.1ms preprocess, 501.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
519
+ ]
520
+ },
521
+ {
522
+ "name": "stderr",
523
+ "output_type": "stream",
524
+ "text": [
525
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
526
+ "INFO:MiVOLO:\tage: 22.58\n",
527
+ "INFO:MiVOLO:\tgender: female [99%]\n",
528
+ "ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3004b5fa-af81-4de7-829d-1d809d70b878/width=512/3004b5fa-af81-4de7-829d-1d809d70b878.mp4: cannot identify image file <_io.BytesIO object at 0x14cb0020ad40>\n"
529
+ ]
530
+ },
531
+ {
532
+ "name": "stdout",
533
+ "output_type": "stream",
534
+ "text": [
535
+ "\n",
536
+ "0: 640x640 (no detections), 842.5ms\n",
537
+ "Speed: 4.5ms preprocess, 842.5ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 640)\n",
538
+ "\n",
539
+ "0: 640x416 (no detections), 446.8ms\n",
540
+ "Speed: 2.5ms preprocess, 446.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 416)\n",
541
+ "Processed and saved 5 images so far.\n",
542
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-12.csv\n",
543
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-01.csv\n",
544
+ "\n",
545
+ "0: 640x448 (no detections), 638.5ms\n",
546
+ "Speed: 2.3ms preprocess, 638.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
547
+ "Processed and saved 1 images so far.\n",
548
+ "\n",
549
+ "0: 640x416 (no detections), 441.7ms\n",
550
+ "Speed: 2.5ms preprocess, 441.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 416)\n",
551
+ "\n",
552
+ "0: 640x448 (no detections), 470.3ms\n",
553
+ "Speed: 2.3ms preprocess, 470.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
554
+ "\n",
555
+ "0: 640x448 (no detections), 693.9ms\n",
556
+ "Speed: 2.5ms preprocess, 693.9ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
557
+ "\n",
558
+ "0: 640x512 1 person, 1 face, 808.6ms\n",
559
+ "Speed: 3.2ms preprocess, 808.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
560
+ ]
561
+ },
562
+ {
563
+ "name": "stderr",
564
+ "output_type": "stream",
565
+ "text": [
566
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
567
+ "INFO:MiVOLO:\tage: 15.01\n",
568
+ "INFO:MiVOLO:\tgender: female [99%]\n"
569
+ ]
570
+ },
571
+ {
572
+ "name": "stdout",
573
+ "output_type": "stream",
574
+ "text": [
575
+ "\n",
576
+ "0: 640x320 1 person, 1 face, 345.6ms\n",
577
+ "Speed: 2.0ms preprocess, 345.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 320)\n"
578
+ ]
579
+ },
580
+ {
581
+ "name": "stderr",
582
+ "output_type": "stream",
583
+ "text": [
584
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
585
+ "INFO:MiVOLO:\tage: 20.86\n",
586
+ "INFO:MiVOLO:\tgender: female [99%]\n"
587
+ ]
588
+ },
589
+ {
590
+ "name": "stdout",
591
+ "output_type": "stream",
592
+ "text": [
593
+ "Processed and saved 6 images so far.\n",
594
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-01.csv\n",
595
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-02.csv\n",
596
+ "\n",
597
+ "0: 640x384 1 person, 1 face, 387.8ms\n",
598
+ "Speed: 1.9ms preprocess, 387.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 384)\n"
599
+ ]
600
+ },
601
+ {
602
+ "name": "stderr",
603
+ "output_type": "stream",
604
+ "text": [
605
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
606
+ "INFO:MiVOLO:\tage: 17.31\n",
607
+ "INFO:MiVOLO:\tgender: female [99%]\n"
608
+ ]
609
+ },
610
+ {
611
+ "name": "stdout",
612
+ "output_type": "stream",
613
+ "text": [
614
+ "Processed and saved 1 images so far.\n",
615
+ "\n",
616
+ "0: 640x480 1 person, 1 face, 540.4ms\n",
617
+ "Speed: 2.5ms preprocess, 540.4ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 480)\n"
618
+ ]
619
+ },
620
+ {
621
+ "name": "stderr",
622
+ "output_type": "stream",
623
+ "text": [
624
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
625
+ "INFO:MiVOLO:\tage: 17.47\n",
626
+ "INFO:MiVOLO:\tgender: female [99%]\n"
627
+ ]
628
+ },
629
+ {
630
+ "name": "stdout",
631
+ "output_type": "stream",
632
+ "text": [
633
+ "\n",
634
+ "0: 640x640 1 person, 1 face, 713.1ms\n",
635
+ "Speed: 3.8ms preprocess, 713.1ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 640)\n"
636
+ ]
637
+ },
638
+ {
639
+ "name": "stderr",
640
+ "output_type": "stream",
641
+ "text": [
642
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
643
+ "INFO:MiVOLO:\tage: 17.85\n",
644
+ "INFO:MiVOLO:\tgender: female [99%]\n"
645
+ ]
646
+ },
647
+ {
648
+ "name": "stdout",
649
+ "output_type": "stream",
650
+ "text": [
651
+ "\n",
652
+ "0: 640x640 (no detections), 778.8ms\n",
653
+ "Speed: 28.7ms preprocess, 778.8ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
654
+ "\n",
655
+ "0: 640x448 1 person, 1 face, 528.2ms\n",
656
+ "Speed: 2.3ms preprocess, 528.2ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
657
+ ]
658
+ },
659
+ {
660
+ "name": "stderr",
661
+ "output_type": "stream",
662
+ "text": [
663
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
664
+ "INFO:MiVOLO:\tage: 21.63\n",
665
+ "INFO:MiVOLO:\tgender: female [99%]\n"
666
+ ]
667
+ },
668
+ {
669
+ "name": "stdout",
670
+ "output_type": "stream",
671
+ "text": [
672
+ "\n",
673
+ "0: 640x448 1 person, 1 face, 518.4ms\n",
674
+ "Speed: 3.9ms preprocess, 518.4ms inference, 0.9ms postprocess per image at shape (1, 3, 640, 448)\n"
675
+ ]
676
+ },
677
+ {
678
+ "name": "stderr",
679
+ "output_type": "stream",
680
+ "text": [
681
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
682
+ "INFO:MiVOLO:\tage: 18.25\n",
683
+ "INFO:MiVOLO:\tgender: female [99%]\n"
684
+ ]
685
+ },
686
+ {
687
+ "name": "stdout",
688
+ "output_type": "stream",
689
+ "text": [
690
+ "\n",
691
+ "0: 640x448 1 person, 1 face, 470.7ms\n",
692
+ "Speed: 2.5ms preprocess, 470.7ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
693
+ ]
694
+ },
695
+ {
696
+ "name": "stderr",
697
+ "output_type": "stream",
698
+ "text": [
699
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
700
+ "INFO:MiVOLO:\tage: 20.51\n",
701
+ "INFO:MiVOLO:\tgender: female [99%]\n"
702
+ ]
703
+ },
704
+ {
705
+ "name": "stdout",
706
+ "output_type": "stream",
707
+ "text": [
708
+ "\n",
709
+ "0: 640x480 1 person, 1 face, 647.1ms\n",
710
+ "Speed: 2.4ms preprocess, 647.1ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 480)\n"
711
+ ]
712
+ },
713
+ {
714
+ "name": "stderr",
715
+ "output_type": "stream",
716
+ "text": [
717
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
718
+ "INFO:MiVOLO:\tage: 58.87\n",
719
+ "INFO:MiVOLO:\tgender: male [99%]\n"
720
+ ]
721
+ },
722
+ {
723
+ "name": "stdout",
724
+ "output_type": "stream",
725
+ "text": [
726
+ "\n",
727
+ "0: 640x448 (no detections), 469.8ms\n",
728
+ "Speed: 2.6ms preprocess, 469.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
729
+ "\n",
730
+ "0: 640x448 1 person, 1 face, 477.5ms\n",
731
+ "Speed: 2.3ms preprocess, 477.5ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
732
+ ]
733
+ },
734
+ {
735
+ "name": "stderr",
736
+ "output_type": "stream",
737
+ "text": [
738
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
739
+ "INFO:MiVOLO:\tage: 23.79\n",
740
+ "INFO:MiVOLO:\tgender: female [99%]\n"
741
+ ]
742
+ },
743
+ {
744
+ "name": "stdout",
745
+ "output_type": "stream",
746
+ "text": [
747
+ "Processed and saved 10 images so far.\n",
748
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-02.csv\n",
749
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-03.csv\n",
750
+ "\n",
751
+ "0: 640x448 1 face, 511.4ms\n",
752
+ "Speed: 2.5ms preprocess, 511.4ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
753
+ ]
754
+ },
755
+ {
756
+ "name": "stderr",
757
+ "output_type": "stream",
758
+ "text": [
759
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
760
+ "INFO:MiVOLO:\tage: 24.87\n",
761
+ "INFO:MiVOLO:\tgender: female [99%]\n"
762
+ ]
763
+ },
764
+ {
765
+ "name": "stdout",
766
+ "output_type": "stream",
767
+ "text": [
768
+ "Processed and saved 1 images so far.\n",
769
+ "\n",
770
+ "0: 640x544 (no detections), 576.5ms\n",
771
+ "Speed: 2.9ms preprocess, 576.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 544)\n",
772
+ "\n",
773
+ "0: 640x448 1 person, 1 face, 687.1ms\n",
774
+ "Speed: 9.9ms preprocess, 687.1ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
775
+ ]
776
+ },
777
+ {
778
+ "name": "stderr",
779
+ "output_type": "stream",
780
+ "text": [
781
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
782
+ "INFO:MiVOLO:\tage: 25.76\n",
783
+ "INFO:MiVOLO:\tgender: female [99%]\n"
784
+ ]
785
+ },
786
+ {
787
+ "name": "stdout",
788
+ "output_type": "stream",
789
+ "text": [
790
+ "\n",
791
+ "0: 640x448 (no detections), 498.3ms\n",
792
+ "Speed: 2.3ms preprocess, 498.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
793
+ "\n",
794
+ "0: 640x512 (no detections), 573.2ms\n",
795
+ "Speed: 3.0ms preprocess, 573.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
796
+ "Processed and saved 5 images so far.\n",
797
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-03.csv\n",
798
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-04.csv\n",
799
+ "\n",
800
+ "0: 640x384 (no detections), 518.2ms\n",
801
+ "Speed: 2.7ms preprocess, 518.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 384)\n",
802
+ "Processed and saved 1 images so far.\n",
803
+ "\n",
804
+ "0: 640x512 (no detections), 707.7ms\n",
805
+ "Speed: 3.6ms preprocess, 707.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
806
+ "\n",
807
+ "0: 640x416 (no detections), 453.7ms\n",
808
+ "Speed: 2.4ms preprocess, 453.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 416)\n",
809
+ "\n",
810
+ "0: 640x384 (no detections), 391.0ms\n",
811
+ "Speed: 2.0ms preprocess, 391.0ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 384)\n",
812
+ "\n",
813
+ "0: 640x448 1 person, 1 face, 449.8ms\n",
814
+ "Speed: 2.3ms preprocess, 449.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
815
+ ]
816
+ },
817
+ {
818
+ "name": "stderr",
819
+ "output_type": "stream",
820
+ "text": [
821
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
822
+ "INFO:MiVOLO:\tage: 22.39\n",
823
+ "INFO:MiVOLO:\tgender: female [99%]\n"
824
+ ]
825
+ },
826
+ {
827
+ "name": "stdout",
828
+ "output_type": "stream",
829
+ "text": [
830
+ "\n",
831
+ "0: 640x448 (no detections), 618.4ms\n",
832
+ "Speed: 2.3ms preprocess, 618.4ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
833
+ "\n",
834
+ "0: 640x448 1 person, 1 face, 631.0ms\n",
835
+ "Speed: 2.2ms preprocess, 631.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
836
+ ]
837
+ },
838
+ {
839
+ "name": "stderr",
840
+ "output_type": "stream",
841
+ "text": [
842
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
843
+ "INFO:MiVOLO:\tage: 24.05\n",
844
+ "INFO:MiVOLO:\tgender: female [99%]\n"
845
+ ]
846
+ },
847
+ {
848
+ "name": "stdout",
849
+ "output_type": "stream",
850
+ "text": [
851
+ "\n",
852
+ "0: 640x512 1 person, 496.4ms\n",
853
+ "Speed: 2.6ms preprocess, 496.4ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
854
+ ]
855
+ },
856
+ {
857
+ "name": "stderr",
858
+ "output_type": "stream",
859
+ "text": [
860
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
861
+ "INFO:MiVOLO:\tage: 22.81\n",
862
+ "INFO:MiVOLO:\tgender: male [99%]\n"
863
+ ]
864
+ },
865
+ {
866
+ "name": "stdout",
867
+ "output_type": "stream",
868
+ "text": [
869
+ "\n",
870
+ "0: 640x448 (no detections), 442.8ms\n",
871
+ "Speed: 2.3ms preprocess, 442.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
872
+ "\n",
873
+ "0: 640x448 1 person, 1 face, 477.7ms\n",
874
+ "Speed: 2.4ms preprocess, 477.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
875
+ ]
876
+ },
877
+ {
878
+ "name": "stderr",
879
+ "output_type": "stream",
880
+ "text": [
881
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
882
+ "INFO:MiVOLO:\tage: 21.62\n",
883
+ "INFO:MiVOLO:\tgender: female [99%]\n"
884
+ ]
885
+ },
886
+ {
887
+ "name": "stdout",
888
+ "output_type": "stream",
889
+ "text": [
890
+ "\n",
891
+ "0: 640x448 1 person, 1 face, 447.0ms\n",
892
+ "Speed: 2.2ms preprocess, 447.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
893
+ ]
894
+ },
895
+ {
896
+ "name": "stderr",
897
+ "output_type": "stream",
898
+ "text": [
899
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
900
+ "INFO:MiVOLO:\tage: 54.31\n",
901
+ "INFO:MiVOLO:\tgender: male [99%]\n"
902
+ ]
903
+ },
904
+ {
905
+ "name": "stdout",
906
+ "output_type": "stream",
907
+ "text": [
908
+ "\n",
909
+ "0: 640x640 (no detections), 819.0ms\n",
910
+ "Speed: 3.6ms preprocess, 819.0ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
911
+ "\n",
912
+ "0: 640x448 1 person, 1 face, 478.2ms\n",
913
+ "Speed: 1.8ms preprocess, 478.2ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
914
+ ]
915
+ },
916
+ {
917
+ "name": "stderr",
918
+ "output_type": "stream",
919
+ "text": [
920
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
921
+ "INFO:MiVOLO:\tage: 20.56\n",
922
+ "INFO:MiVOLO:\tgender: female [99%]\n"
923
+ ]
924
+ },
925
+ {
926
+ "name": "stdout",
927
+ "output_type": "stream",
928
+ "text": [
929
+ "\n",
930
+ "0: 640x448 1 person, 1 face, 471.2ms\n",
931
+ "Speed: 2.7ms preprocess, 471.2ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
932
+ ]
933
+ },
934
+ {
935
+ "name": "stderr",
936
+ "output_type": "stream",
937
+ "text": [
938
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
939
+ "INFO:MiVOLO:\tage: 21.31\n",
940
+ "INFO:MiVOLO:\tgender: female [99%]\n"
941
+ ]
942
+ },
943
+ {
944
+ "name": "stdout",
945
+ "output_type": "stream",
946
+ "text": [
947
+ "\n",
948
+ "0: 640x448 (no detections), 484.0ms\n",
949
+ "Speed: 2.2ms preprocess, 484.0ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
950
+ "\n",
951
+ "0: 640x640 (no detections), 832.6ms\n",
952
+ "Speed: 3.0ms preprocess, 832.6ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 640)\n",
953
+ "\n",
954
+ "0: 640x448 1 person, 1 face, 508.9ms\n",
955
+ "Speed: 2.5ms preprocess, 508.9ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
956
+ ]
957
+ },
958
+ {
959
+ "name": "stderr",
960
+ "output_type": "stream",
961
+ "text": [
962
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
963
+ "INFO:MiVOLO:\tage: 27.19\n",
964
+ "INFO:MiVOLO:\tgender: female [99%]\n",
965
+ "ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/6879c7b9-5cb3-42db-b409-30b4e2f71945/width=1080/6879c7b9-5cb3-42db-b409-30b4e2f71945.mp4: cannot identify image file <_io.BytesIO object at 0x14cb0020ad40>\n"
966
+ ]
967
+ },
968
+ {
969
+ "name": "stdout",
970
+ "output_type": "stream",
971
+ "text": [
972
+ "\n",
973
+ "0: 640x448 9 persons, 461.8ms\n",
974
+ "Speed: 2.4ms preprocess, 461.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
975
+ ]
976
+ },
977
+ {
978
+ "name": "stderr",
979
+ "output_type": "stream",
980
+ "text": [
981
+ "INFO:MiVOLO:faces_input: torch.Size([9, 3, 224, 224]), person_input: torch.Size([9, 3, 224, 224])\n",
982
+ "INFO:MiVOLO:\tage: 30.4\n",
983
+ "INFO:MiVOLO:\tgender: male [55%]\n",
984
+ "INFO:MiVOLO:\tage: 28.89\n",
985
+ "INFO:MiVOLO:\tgender: female [63%]\n",
986
+ "INFO:MiVOLO:\tage: 30.31\n",
987
+ "INFO:MiVOLO:\tgender: female [68%]\n",
988
+ "INFO:MiVOLO:\tage: 31.62\n",
989
+ "INFO:MiVOLO:\tgender: female [53%]\n",
990
+ "INFO:MiVOLO:\tage: 35.17\n",
991
+ "INFO:MiVOLO:\tgender: male [53%]\n",
992
+ "INFO:MiVOLO:\tage: 33.02\n",
993
+ "INFO:MiVOLO:\tgender: male [95%]\n",
994
+ "INFO:MiVOLO:\tage: 35.17\n",
995
+ "INFO:MiVOLO:\tgender: male [53%]\n",
996
+ "INFO:MiVOLO:\tage: 35.17\n",
997
+ "INFO:MiVOLO:\tgender: male [53%]\n",
998
+ "INFO:MiVOLO:\tage: 35.17\n",
999
+ "INFO:MiVOLO:\tgender: male [53%]\n"
1000
+ ]
1001
+ },
1002
+ {
1003
+ "name": "stdout",
1004
+ "output_type": "stream",
1005
+ "text": [
1006
+ "Processed and saved 19 images so far.\n",
1007
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-04.csv\n",
1008
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-05.csv\n",
1009
+ "\n",
1010
+ "0: 640x448 1 person, 455.5ms\n",
1011
+ "Speed: 2.2ms preprocess, 455.5ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
1012
+ ]
1013
+ },
1014
+ {
1015
+ "name": "stderr",
1016
+ "output_type": "stream",
1017
+ "text": [
1018
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1019
+ "INFO:MiVOLO:\tage: 37.57\n",
1020
+ "INFO:MiVOLO:\tgender: male [95%]\n"
1021
+ ]
1022
+ },
1023
+ {
1024
+ "name": "stdout",
1025
+ "output_type": "stream",
1026
+ "text": [
1027
+ "Processed and saved 1 images so far.\n",
1028
+ "\n",
1029
+ "0: 640x448 1 person, 1 face, 438.7ms\n",
1030
+ "Speed: 2.2ms preprocess, 438.7ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
1031
+ ]
1032
+ },
1033
+ {
1034
+ "name": "stderr",
1035
+ "output_type": "stream",
1036
+ "text": [
1037
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1038
+ "INFO:MiVOLO:\tage: 15.62\n",
1039
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1040
+ ]
1041
+ },
1042
+ {
1043
+ "name": "stdout",
1044
+ "output_type": "stream",
1045
+ "text": [
1046
+ "\n",
1047
+ "0: 640x448 (no detections), 444.8ms\n",
1048
+ "Speed: 2.3ms preprocess, 444.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n"
1049
+ ]
1050
+ },
1051
+ {
1052
+ "name": "stderr",
1053
+ "output_type": "stream",
1054
+ "text": [
1055
+ "ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/6032bd70-6d53-4007-9e89-e69d4748efb5/width=528/6032bd70-6d53-4007-9e89-e69d4748efb5.mp4: cannot identify image file <_io.BytesIO object at 0x14cb00ed0950>\n"
1056
+ ]
1057
+ },
1058
+ {
1059
+ "name": "stdout",
1060
+ "output_type": "stream",
1061
+ "text": [
1062
+ "\n",
1063
+ "0: 640x448 (no detections), 453.9ms\n",
1064
+ "Speed: 2.3ms preprocess, 453.9ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
1065
+ "\n",
1066
+ "0: 640x448 1 person, 1 face, 475.0ms\n",
1067
+ "Speed: 1.6ms preprocess, 475.0ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
1068
+ ]
1069
+ },
1070
+ {
1071
+ "name": "stderr",
1072
+ "output_type": "stream",
1073
+ "text": [
1074
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1075
+ "INFO:MiVOLO:\tage: 22.5\n",
1076
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1077
+ ]
1078
+ },
1079
+ {
1080
+ "name": "stdout",
1081
+ "output_type": "stream",
1082
+ "text": [
1083
+ "\n",
1084
+ "0: 640x448 1 person, 1 face, 447.6ms\n",
1085
+ "Speed: 2.5ms preprocess, 447.6ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
1086
+ ]
1087
+ },
1088
+ {
1089
+ "name": "stderr",
1090
+ "output_type": "stream",
1091
+ "text": [
1092
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1093
+ "INFO:MiVOLO:\tage: 23.46\n",
1094
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1095
+ ]
1096
+ },
1097
+ {
1098
+ "name": "stdout",
1099
+ "output_type": "stream",
1100
+ "text": [
1101
+ "\n",
1102
+ "0: 640x512 (no detections), 528.5ms\n",
1103
+ "Speed: 3.2ms preprocess, 528.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
1104
+ "\n",
1105
+ "0: 640x448 1 person, 1 face, 449.8ms\n",
1106
+ "Speed: 2.3ms preprocess, 449.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
1107
+ ]
1108
+ },
1109
+ {
1110
+ "name": "stderr",
1111
+ "output_type": "stream",
1112
+ "text": [
1113
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1114
+ "INFO:MiVOLO:\tage: 29.32\n",
1115
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1116
+ ]
1117
+ },
1118
+ {
1119
+ "name": "stdout",
1120
+ "output_type": "stream",
1121
+ "text": [
1122
+ "\n",
1123
+ "0: 640x448 1 person, 1 face, 617.7ms\n",
1124
+ "Speed: 2.4ms preprocess, 617.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
1125
+ ]
1126
+ },
1127
+ {
1128
+ "name": "stderr",
1129
+ "output_type": "stream",
1130
+ "text": [
1131
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1132
+ "INFO:MiVOLO:\tage: 21.32\n",
1133
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1134
+ ]
1135
+ },
1136
+ {
1137
+ "name": "stdout",
1138
+ "output_type": "stream",
1139
+ "text": [
1140
+ "\n",
1141
+ "0: 640x448 (no detections), 609.1ms\n",
1142
+ "Speed: 2.3ms preprocess, 609.1ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
1143
+ "\n",
1144
+ "0: 640x448 (no detections), 436.2ms\n",
1145
+ "Speed: 2.5ms preprocess, 436.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1146
+ "\n",
1147
+ "0: 640x512 1 person, 1 face, 585.6ms\n",
1148
+ "Speed: 3.1ms preprocess, 585.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
1149
+ ]
1150
+ },
1151
+ {
1152
+ "name": "stderr",
1153
+ "output_type": "stream",
1154
+ "text": [
1155
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1156
+ "INFO:MiVOLO:\tage: 20.5\n",
1157
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1158
+ ]
1159
+ },
1160
+ {
1161
+ "name": "stdout",
1162
+ "output_type": "stream",
1163
+ "text": [
1164
+ "\n",
1165
+ "0: 640x448 1 person, 457.3ms\n",
1166
+ "Speed: 2.1ms preprocess, 457.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
1167
+ ]
1168
+ },
1169
+ {
1170
+ "name": "stderr",
1171
+ "output_type": "stream",
1172
+ "text": [
1173
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1174
+ "INFO:MiVOLO:\tage: 25.19\n",
1175
+ "INFO:MiVOLO:\tgender: male [81%]\n"
1176
+ ]
1177
+ },
1178
+ {
1179
+ "name": "stdout",
1180
+ "output_type": "stream",
1181
+ "text": [
1182
+ "Processed and saved 14 images so far.\n",
1183
+ "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-05.csv\n",
1184
+ "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-06.csv\n",
1185
+ "\n",
1186
+ "0: 640x448 (no detections), 484.5ms\n",
1187
+ "Speed: 2.8ms preprocess, 484.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1188
+ "Processed and saved 1 images so far.\n",
1189
+ "\n",
1190
+ "0: 640x512 (no detections), 524.8ms\n",
1191
+ "Speed: 2.9ms preprocess, 524.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
1192
+ "\n",
1193
+ "0: 640x480 1 person, 478.0ms\n",
1194
+ "Speed: 2.6ms preprocess, 478.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 480)\n"
1195
+ ]
1196
+ },
1197
+ {
1198
+ "name": "stderr",
1199
+ "output_type": "stream",
1200
+ "text": [
1201
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1202
+ "INFO:MiVOLO:\tage: 39.4\n",
1203
+ "INFO:MiVOLO:\tgender: male [99%]\n"
1204
+ ]
1205
+ },
1206
+ {
1207
+ "name": "stdout",
1208
+ "output_type": "stream",
1209
+ "text": [
1210
+ "\n",
1211
+ "0: 640x512 1 person, 1 face, 539.8ms\n",
1212
+ "Speed: 2.6ms preprocess, 539.8ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 512)\n"
1213
+ ]
1214
+ },
1215
+ {
1216
+ "name": "stderr",
1217
+ "output_type": "stream",
1218
+ "text": [
1219
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1220
+ "INFO:MiVOLO:\tage: 21.33\n",
1221
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1222
+ ]
1223
+ },
1224
+ {
1225
+ "name": "stdout",
1226
+ "output_type": "stream",
1227
+ "text": [
1228
+ "\n",
1229
+ "0: 640x448 1 person, 2 faces, 446.7ms\n",
1230
+ "Speed: 2.4ms preprocess, 446.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
1231
+ ]
1232
+ },
1233
+ {
1234
+ "name": "stderr",
1235
+ "output_type": "stream",
1236
+ "text": [
1237
+ "INFO:MiVOLO:faces_input: torch.Size([2, 3, 224, 224]), person_input: torch.Size([2, 3, 224, 224])\n",
1238
+ "INFO:MiVOLO:\tage: 20.65\n",
1239
+ "INFO:MiVOLO:\tgender: female [99%]\n",
1240
+ "INFO:MiVOLO:\tage: 20.53\n",
1241
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1242
+ ]
1243
+ },
1244
+ {
1245
+ "name": "stdout",
1246
+ "output_type": "stream",
1247
+ "text": [
1248
+ "\n",
1249
+ "0: 640x640 1 person, 1 face, 655.0ms\n",
1250
+ "Speed: 3.3ms preprocess, 655.0ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 640)\n"
1251
+ ]
1252
+ },
1253
+ {
1254
+ "name": "stderr",
1255
+ "output_type": "stream",
1256
+ "text": [
1257
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1258
+ "INFO:MiVOLO:\tage: 26.34\n",
1259
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1260
+ ]
1261
+ },
1262
+ {
1263
+ "name": "stdout",
1264
+ "output_type": "stream",
1265
+ "text": [
1266
+ "\n",
1267
+ "0: 640x384 (no detections), 400.6ms\n",
1268
+ "Speed: 2.1ms preprocess, 400.6ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 384)\n",
1269
+ "\n",
1270
+ "0: 640x448 1 person, 587.9ms\n",
1271
+ "Speed: 2.2ms preprocess, 587.9ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
1272
+ ]
1273
+ },
1274
+ {
1275
+ "name": "stderr",
1276
+ "output_type": "stream",
1277
+ "text": [
1278
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1279
+ "INFO:MiVOLO:\tage: 30.4\n",
1280
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1281
+ ]
1282
+ },
1283
+ {
1284
+ "name": "stdout",
1285
+ "output_type": "stream",
1286
+ "text": [
1287
+ "\n",
1288
+ "0: 640x448 (no detections), 610.3ms\n",
1289
+ "Speed: 2.3ms preprocess, 610.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1290
+ "\n",
1291
+ "0: 640x448 (no detections), 453.6ms\n",
1292
+ "Speed: 2.3ms preprocess, 453.6ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1293
+ "\n",
1294
+ "0: 640x512 1 person, 1 face, 511.3ms\n",
1295
+ "Speed: 2.8ms preprocess, 511.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
1296
+ ]
1297
+ },
1298
+ {
1299
+ "name": "stderr",
1300
+ "output_type": "stream",
1301
+ "text": [
1302
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1303
+ "INFO:MiVOLO:\tage: 34.28\n",
1304
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1305
+ ]
1306
+ },
1307
+ {
1308
+ "name": "stdout",
1309
+ "output_type": "stream",
1310
+ "text": [
1311
+ "\n",
1312
+ "0: 640x448 (no detections), 441.2ms\n",
1313
+ "Speed: 2.3ms preprocess, 441.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1314
+ "\n",
1315
+ "0: 640x448 (no detections), 586.3ms\n",
1316
+ "Speed: 2.3ms preprocess, 586.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1317
+ "\n",
1318
+ "0: 640x448 (no detections), 437.5ms\n",
1319
+ "Speed: 2.4ms preprocess, 437.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1320
+ "\n",
1321
+ "0: 448x640 1 person, 1 face, 437.4ms\n",
1322
+ "Speed: 2.4ms preprocess, 437.4ms inference, 0.7ms postprocess per image at shape (1, 3, 448, 640)\n"
1323
+ ]
1324
+ },
1325
+ {
1326
+ "name": "stderr",
1327
+ "output_type": "stream",
1328
+ "text": [
1329
+ "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
1330
+ "INFO:MiVOLO:\tage: 22.81\n",
1331
+ "INFO:MiVOLO:\tgender: female [99%]\n"
1332
+ ]
1333
+ },
1334
+ {
1335
+ "name": "stdout",
1336
+ "output_type": "stream",
1337
+ "text": [
1338
+ "\n",
1339
+ "0: 640x448 (no detections), 436.8ms\n",
1340
+ "Speed: 2.6ms preprocess, 436.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1341
+ "\n",
1342
+ "0: 448x640 (no detections), 433.0ms\n",
1343
+ "Speed: 1.9ms preprocess, 433.0ms inference, 0.4ms postprocess per image at shape (1, 3, 448, 640)\n",
1344
+ "\n",
1345
+ "0: 640x448 (no detections), 599.7ms\n",
1346
+ "Speed: 2.5ms preprocess, 599.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
1347
+ "\n"
1348
+ ]
1349
+ }
1350
+ ],
+ "source": [
+ "# Paths to the YOLOv8 person/face detector and the MiVOLO checkpoint weights\n",
+ "detector_weights = current_dir.parent / 'ext/MiVOLO/models/yolov8x_person_face.pt'\n",
+ "checkpoint = current_dir.parent / 'ext/MiVOLO/models/mivolo_imbd.pth.tar'\n",
+ "\n",
+ "# Set up logging\n",
+ "_logger = logging.getLogger(\"inference\")\n",
+ "logging.basicConfig(level=logging.INFO)\n",
+ "\n",
+ "# Minimal configuration object and predictor initialization for MiVOLO\n",
+ "class Config:\n",
+ "    def __init__(self, detector_weights, checkpoint, device, with_persons=True, disable_faces=False, draw=False):\n",
+ "        self.detector_weights = detector_weights\n",
+ "        self.checkpoint = checkpoint\n",
+ "        self.device = device\n",
+ "        self.with_persons = with_persons\n",
+ "        self.disable_faces = disable_faces\n",
+ "        self.draw = draw\n",
+ "\n",
+ "\n",
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+ "config = Config(detector_weights=detector_weights, checkpoint=checkpoint, device=device)\n",
+ "predictor = Predictor(config, verbose=True)\n",
+ "\n",
+ "def download_image(url):\n",
+ "    try:\n",
+ "        response = requests.get(url)\n",
+ "        response.raise_for_status()\n",
+ "        return Image.open(BytesIO(response.content)).convert(\"RGB\")\n",
+ "    except requests.RequestException as e:\n",
+ "        _logger.error(f\"Failed to download image from {url}: {e}\")\n",
+ "        return None\n",
+ "    except UnidentifiedImageError as e:\n",
+ "        _logger.error(f\"Unidentified image error for URL {url}: {e}\")\n",
+ "        return None\n",
+ "\n",
+ "def process_images_with_progress(data, predictor, output_file, start_idx=0):\n",
+ "    results = []\n",
+ "    total_images = len(data)\n",
+ "\n",
+ "    for idx, row in data.iterrows():\n",
+ "        if idx < start_idx:\n",
+ "            continue\n",
+ "\n",
+ "        img_url = row[\"url\"]\n",
+ "        pil_image = download_image(img_url)\n",
+ "        if pil_image is None:\n",
+ "            continue\n",
+ "\n",
+ "        # MiVOLO expects BGR arrays (OpenCV convention)\n",
+ "        np_image = np.array(pil_image)\n",
+ "        np_image = cv2.cvtColor(np_image, cv2.COLOR_RGB2BGR)\n",
+ "        detected_objects, _ = predictor.recognize(np_image)\n",
+ "\n",
+ "        row_result = row.to_dict()  # Start with the original row's data\n",
+ "\n",
+ "        if detected_objects and detected_objects.ages:\n",
+ "            for i in range(len(detected_objects.ages)):\n",
+ "                age = detected_objects.ages[i]\n",
+ "                gender = detected_objects.genders[i]\n",
+ "                gender_confidence = detected_objects.gender_scores[i]\n",
+ "\n",
+ "                # Keep only detections with a confident gender prediction\n",
+ "                if gender_confidence >= 0.83:\n",
+ "                    detection = {\n",
+ "                        \"detection_type\": 'face' if i in detected_objects.face_to_person_map else 'person',\n",
+ "                        \"gender\": gender,\n",
+ "                        \"gender_confidence\": gender_confidence,\n",
+ "                        \"age\": age,\n",
+ "                        \"n_persons\": detected_objects.n_persons,\n",
+ "                        \"n_faces\": detected_objects.n_faces,\n",
+ "                        \"detected\": True\n",
+ "                    }\n",
+ "                else:\n",
+ "                    detection = {\n",
+ "                        \"detection_type\": \"N/A\",\n",
+ "                        \"gender\": \"N/A\",\n",
+ "                        \"gender_confidence\": 0,\n",
+ "                        \"age\": 0,\n",
+ "                        \"n_persons\": detected_objects.n_persons if detected_objects else 0,\n",
+ "                        \"n_faces\": detected_objects.n_faces if detected_objects else 0,\n",
+ "                        \"detected\": False\n",
+ "                    }\n",
+ "\n",
+ "                results.append({**row_result, **detection})\n",
+ "        else:\n",
+ "            # No detections at all: record the image with an empty detection\n",
+ "            detection = {\n",
+ "                \"detection_type\": \"N/A\",\n",
+ "                \"gender\": \"N/A\",\n",
+ "                \"gender_confidence\": 0,\n",
+ "                \"age\": 0,\n",
+ "                \"n_persons\": detected_objects.n_persons if detected_objects else 0,\n",
+ "                \"n_faces\": detected_objects.n_faces if detected_objects else 0,\n",
+ "                \"detected\": False\n",
+ "            }\n",
+ "            results.append({**row_result, **detection})\n",
+ "\n",
+ "        # Flush accumulated rows to disk every 100 images (and at the end)\n",
+ "        if idx % 100 == 0 or idx == total_images - 1:\n",
+ "            df = pd.DataFrame(results)\n",
+ "            if os.path.exists(output_file):\n",
+ "                df.to_csv(output_file, mode='a', header=False, index=False)\n",
+ "            else:\n",
+ "                df.to_csv(output_file, mode='w', header=True, index=False)\n",
+ "            results = []\n",
+ "            print(f\"Processed and saved {idx + 1} images so far.\")\n",
+ "\n",
+ "def generate_months(start, end):\n",
+ "    start_date = datetime.strptime(start, '%Y-%m')\n",
+ "    end_date = datetime.strptime(end, '%Y-%m')\n",
+ "    while start_date <= end_date:\n",
+ "        yield start_date.strftime('%Y-%m')\n",
+ "        start_date += relativedelta(months=1)  # Increment by calendar months\n",
+ "\n",
+ "\n",
+ "start_month = '2022-11'\n",
+ "end_month = '2024-12'\n",
+ "\n",
+ "for month in generate_months(start_month, end_month):\n",
+ "    input_file_path = mivolo_in / f'Civiverse-{month}.csv'\n",
+ "    output_file_path = mivolo_out / f'{month}.csv'\n",
+ "\n",
+ "    if input_file_path.exists():\n",
+ "        print(f\"Processing: {input_file_path}\")\n",
+ "\n",
+ "        data = pd.read_csv(input_file_path)\n",
+ "        start_index = 0\n",
+ "        process_images_with_progress(data, predictor, output_file_path, start_idx=start_index)\n",
+ "\n",
+ "        print(f\"Processed and saved to: {output_file_path}\")\n",
+ "    else:\n",
+ "        print(f\"File not found: {input_file_path}\")"
+ ]
+ },
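+ {
+ "cell_type": "markdown",
+ "id": "resume-index-note",
+ "metadata": {},
+ "source": [
+ "The loop above always passes `start_idx=0`, so an interrupted run reprocesses a month from the beginning. Below is a minimal sketch of one way to derive a resume index from an already-written output CSV. It assumes every successfully processed image writes at least one row carrying its `id`, so the count of unique ids approximates how many images were handled; images skipped on download errors are not counted and would simply be retried."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "resume-index-sketch",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch (assumption, not part of the original pipeline): approximate a\n",
+ "# resume index from the rows already written to the output CSV.\n",
+ "def get_resume_index(output_file):\n",
+ "    if not os.path.exists(output_file):\n",
+ "        return 0\n",
+ "    done = pd.read_csv(output_file)\n",
+ "    # One image can yield several detection rows, so count unique image ids.\n",
+ "    return done['id'].nunique()\n",
+ "\n",
+ "# Usage sketch: in the loop above, replace `start_index = 0` with\n",
+ "# start_index = get_resume_index(output_file_path)"
+ ]
+ },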
+ {
+ "cell_type": "markdown",
+ "id": "26aeeef7",
+ "metadata": {},
+ "source": [
+ "## Visualization code"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "88ec896a-bf9b-4cc6-a787-c1343f8acb41",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import matplotlib.pyplot as plt\n",
+ "import matplotlib.patches as mpatches\n",
+ "\n",
+ "input_dir = current_dir.parent / 'data/CSV/image_subsets/MiVOLO-results/'\n",
+ "plot_dir = current_dir.parent / 'plots/'\n",
+ "\n",
+ "# Load and concatenate all monthly MiVOLO result CSVs\n",
+ "all_data = pd.DataFrame()\n",
+ "for file_path in input_dir.glob('*.csv'):  # Reads all CSV files in the folder\n",
+ "    data = pd.read_csv(file_path)\n",
+ "    all_data = pd.concat([all_data, data], ignore_index=True)\n",
+ "\n",
+ "# Filter rows where detection_type equals \"person\"\n",
+ "person_data = all_data[all_data['detection_type'] == 'person']\n",
+ "\n",
+ "# Count unique images and categorize by persons detected\n",
+ "n_images = all_data['id'].nunique()\n",
+ "images_with_zero_persons = all_data[all_data['n_persons'] == 0]['id'].nunique()\n",
+ "images_with_one_or_more_persons = all_data[all_data['n_persons'] > 0]['id'].nunique()\n",
+ "\n",
+ "n_persons_detected = person_data['id'].nunique()  # Unique images with at least one detected person\n",
+ "total_persons_detected = person_data.shape[0]  # Total number of persons detected\n",
+ "\n",
+ "n_total_female = person_data[person_data['gender'] == 'female']['id'].nunique()\n",
+ "n_total_male = person_data[person_data['gender'] == 'male']['id'].nunique()\n",
+ "\n",
+ "# Keep only rows with non-missing age and gender; copy to avoid SettingWithCopyWarning\n",
+ "filtered_data = person_data.dropna(subset=['age', 'gender']).copy()\n",
+ "\n",
+ "# Round ages to quarter-year steps for consistent plotting\n",
+ "filtered_data['rounded_age'] = np.round(filtered_data['age'] * 4) / 4\n",
+ "\n",
+ "# Map browsingLevel to colors\n",
+ "def get_browsing_color(browsing_level):\n",
+ "    color_mapping = {\n",
+ "        1: 'silver',\n",
+ "        2: 'rosybrown',\n",
+ "        4: 'coral',\n",
+ "        8: 'crimson',\n",
+ "        16: 'blueviolet'\n",
+ "    }\n",
+ "    return color_mapping.get(browsing_level, 'black')  # Default to black for unknown values\n",
+ "\n",
+ "filtered_data['color'] = filtered_data['browsingLevel'].apply(get_browsing_color)\n",
+ "\n",
+ "# Aggregate counts per (rounded_age, gender), one column per browsing-level color\n",
+ "aggregated_data = (\n",
+ "    filtered_data.groupby(['rounded_age', 'gender', 'color'])\n",
+ "    .size()\n",
+ "    .unstack(fill_value=0)\n",
+ ")\n",
+ "\n",
+ "# Define NSFW colors (stacked from most to least explicit)\n",
+ "nsfw_colors = ['blueviolet', 'crimson', 'coral', 'rosybrown', 'silver']\n",
+ "\n",
+ "# Plotting function\n",
+ "def plot_gender_data(ax, data, gender_label):\n",
+ "    ages = data.index\n",
+ "    bottom = np.zeros(len(ages))\n",
+ "\n",
+ "    for color in nsfw_colors:\n",
+ "        counts = data[color] if color in data.columns else np.zeros(len(ages))\n",
+ "        ax.bar(\n",
+ "            ages,\n",
+ "            counts,\n",
+ "            color=color,\n",
+ "            edgecolor=color,\n",
+ "            linewidth=1,\n",
+ "            width=0.2,\n",
+ "            bottom=bottom,\n",
+ "            alpha=0.5\n",
+ "        )\n",
+ "        bottom += counts\n",
+ "\n",
+ "    x_min = 5\n",
+ "    x_max = filtered_data['rounded_age'].max()\n",
+ "    ax.set_xticks(np.arange(x_min, x_max + 0.5, 5))\n",
+ "    ax.set_xticklabels([f'{int(age)}' for age in np.arange(x_min, x_max + 0.5, 5)], fontsize=12, fontweight='bold')\n",
+ "    ax.set_xticks(np.arange(x_min, x_max + 0.5, 0.5), minor=True)\n",
+ "\n",
+ "    y_min = 0\n",
+ "    y_max = bottom.max() + 100\n",
+ "    y_ticks = np.arange(y_min, y_max + 1, 100)  # Fine-grained steps of 100\n",
+ "    ax.set_yticks(y_ticks)\n",
+ "    ax.set_yticklabels([str(int(y)) for y in y_ticks], fontsize=12, fontweight='bold')\n",
+ "\n",
+ "    ax.grid(True, which='major', color='lightgrey', linestyle='-', linewidth=0.5)\n",
+ "    ax.grid(True, which='minor', color='lightgrey', linestyle=':', linewidth=0.5)\n",
+ "\n",
+ "    for spine in ('top', 'right', 'left', 'bottom'):\n",
+ "        ax.spines[spine].set_visible(False)\n",
+ "\n",
+ "    ax.set_xlabel('Age', fontsize=12, fontweight='bold')\n",
+ "    if gender_label == 'Female':\n",
+ "        ax.set_ylabel('Number of Subjects', fontsize=14, fontweight='bold')\n",
+ "    ax.set_title(f'{gender_label}-Read', fontsize=14, fontweight='bold')\n",
+ "\n",
+ "# Set up the subplots\n",
+ "fig, axes = plt.subplots(1, 2, figsize=(14, 6.5), sharey=True)\n",
+ "\n",
+ "plot_gender_data(axes[0], aggregated_data.xs('male', level='gender'), 'Male')\n",
+ "plot_gender_data(axes[1], aggregated_data.xs('female', level='gender'), 'Female')\n",
+ "\n",
+ "legend_patches = [\n",
+ "    mpatches.Patch(facecolor='blueviolet', edgecolor='blueviolet', linewidth=2, label='Level 16: XXX'),\n",
+ "    mpatches.Patch(facecolor='crimson', edgecolor='crimson', linewidth=2, label='Level 8: X'),\n",
+ "    mpatches.Patch(facecolor='coral', edgecolor='coral', linewidth=2, label='Level 4: Mature'),\n",
+ "    mpatches.Patch(facecolor='rosybrown', edgecolor='rosybrown', linewidth=2, label='Level 2: Soft'),\n",
+ "    mpatches.Patch(facecolor='silver', edgecolor='silver', linewidth=2, label='Level 1: SFW'),\n",
+ "    mpatches.Patch(facecolor='none', edgecolor='none', label=f'n images: {n_images}', alpha=0),\n",
+ "    mpatches.Patch(facecolor='none', edgecolor='none', label=f'Total persons detected: {total_persons_detected}', alpha=0),\n",
+ "    mpatches.Patch(facecolor='none', edgecolor='none', label=f'Unique images containing persons: {n_persons_detected}', alpha=0),\n",
+ "]\n",
+ "\n",
+ "axes[0].legend(handles=legend_patches, title=\"Browsing Levels\", loc='upper left', fontsize=12, title_fontsize=12, frameon=True)\n",
+ "plt.tight_layout()  # Apply the layout before saving so the exported SVG matches the figure\n",
+ "plt.savefig(f'{plot_dir}/mivolo.svg', format='svg', bbox_inches='tight')\n",
+ "\n",
+ "# Sanity checks for potential inconsistencies\n",
+ "print(f\"Total images: {n_images}\")\n",
+ "print(f\"Images with at least one person: {images_with_one_or_more_persons}\")\n",
+ "print(f\"Unique images in `person_data`: {n_persons_detected}\")\n",
+ "print(f\"Total number of persons detected: {total_persons_detected}\")\n",
+ "\n",
+ "plt.show()"
+ ]
+ },
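+ {
+ "cell_type": "markdown",
+ "id": "age-binning-note",
+ "metadata": {},
+ "source": [
+ "A quick self-contained check of the quarter-year binning used above (sample ages taken from the logged predictions): `np.round(age * 4) / 4` snaps each age to the nearest 0.25, which is what lets the 0.2-wide bars sit on a regular grid."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "age-binning-demo",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Demo of the quarter-year binning: each value snaps to the nearest 0.25\n",
+ "sample_ages = np.array([17.81, 22.58, 28.65, 33.49])\n",
+ "print(np.round(sample_ages * 4) / 4)  # -> [17.75 22.5  28.75 33.5 ]"
+ ]
+ },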
+ {
+ "cell_type": "markdown",
+ "id": "42b2a557-b8f4-4d0f-8907-98e3012a1b34",
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-02-06T20:01:54.848400Z",
+ "iopub.status.busy": "2025-02-06T20:01:54.847713Z",
+ "iopub.status.idle": "2025-02-06T20:01:54.851533Z",
+ "shell.execute_reply": "2025-02-06T20:01:54.851102Z",
+ "shell.execute_reply.started": "2025-02-06T20:01:54.848376Z"
+ }
+ },
+ "source": [
+ "### LaTeX table"
+ ]
+ },
1655
+ {
1656
+ "cell_type": "code",
1657
+ "execution_count": null,
1658
+ "id": "3e506c41-6497-4ece-99f3-73f09fe1129e",
1659
+ "metadata": {},
1660
+ "outputs": [],
1661
+ "source": [
1662
+ "import os\n",
1663
+ "import pandas as pd\n",
1664
+ "\n",
1665
+ "# Define the directory containing CSV files\n",
1666
+ "directory_path = current_dir.parent / 'data/CSV/image_subsets/MiVOLO-results/' # Update this path with the actual directory path\n",
1667
+ "\n",
1668
+ "# Prepare data for LaTeX table\n",
1669
+ "table_rows = []\n",
1670
+ "\n",
1671
+ "# Loop through each file in the directory\n",
1672
+ "for file_name in os.listdir(directory_path):\n",
1673
+ " if file_name.endswith('.csv'):\n",
1674
+ " file_path = os.path.join(directory_path, file_name)\n",
1675
+ " print(f\"Processing file: {file_name}\")\n",
1676
+ "\n",
1677
+ " # Load the data\n",
1678
+ " data = pd.read_csv(file_path)\n",
1679
+ "\n",
1680
+ " # Total images analyzed\n",
1681
+ " total_images = data['id'].nunique()\n",
1682
+ "\n",
1683
+ " # Count of images with no persons detected\n",
1684
+ " images_no_persons = data[data['n_persons'] == 0]['id'].nunique()\n",
1685
+ "\n",
1686
+ " # Total persons detected (only using \"person\" detection type)\n",
1687
+ " total_persons_count = data[data['detection_type'] == 'person'].shape[0]\n",
1688
+ "\n",
1689
+ " # Average age and standard deviation for male and female individuals\n",
1690
+ " male_age_stats = data[data['gender'] == 'male']['age'].agg(['mean', 'std']).fillna(0)\n",
1691
+ " female_age_stats = data[data['gender'] == 'female']['age'].agg(['mean', 'std']).fillna(0)\n",
1692
+ "\n",
1693
+ " # Count of female and male subjects\n",
1694
+ " female_images_count = data[data['gender'] == 'female']['id'].nunique()\n",
1695
+ " male_images_count = data[data['gender'] == 'male']['id'].nunique()\n",
1696
+ "\n",
1697
+ " # Female to male ratio\n",
1698
+ " female_to_male_ratio = female_images_count / male_images_count if male_images_count else None\n",
1699
+ "\n",
1700
+ " # Browsing level analysis for females\n",
1701
+ " female_browsing_level_1 = data[(data['gender'] == 'female') & (data['browsingLevel'] == 1)]['id'].nunique()\n",
1702
+ " female_browsing_level_2_16 = data[(data['gender'] == 'female') & (data['browsingLevel'].isin([2, 4, 8, 16]))]['id'].nunique()\n",
1703
+ " \n",
1704
+ " female_browsing_level_1_percentage = (female_browsing_level_1 / female_images_count * 100) if female_images_count else 0\n",
1705
+ " female_browsing_level_2_16_percentage = (female_browsing_level_2_16 / female_images_count * 100) if female_images_count else 0\n",
1706
+ "\n",
1707
+ " # Browsing level analysis for males\n",
1708
+ " male_browsing_level_1 = data[(data['gender'] == 'male') & (data['browsingLevel'] == 1)]['id'].nunique()\n",
1709
+ " male_browsing_level_2_16 = data[(data['gender'] == 'male') & (data['browsingLevel'].isin([2, 4, 8, 16]))]['id'].nunique()\n",
1710
+ "\n",
1711
+ " male_browsing_level_1_percentage = (male_browsing_level_1 / male_images_count * 100) if male_images_count else 0\n",
1712
+ " male_browsing_level_2_16_percentage = (male_browsing_level_2_16 / male_images_count * 100) if male_images_count else 0\n",
1713
+ "\n",
1714
+ " # Add row to table data\n",
1715
+ " table_rows.append([\n",
1716
+ " file_name.replace('.csv', ''), # Remove file extension\n",
1717
+ " total_images,\n",
1718
+ " total_persons_count,\n",
1719
+ " images_no_persons,\n",
1720
+ " f\"{female_browsing_level_1_percentage:.2f}\",\n",
1721
+ " f\"{female_browsing_level_2_16_percentage:.2f}\",\n",
1722
+ " f\"{male_browsing_level_1_percentage:.2f}\",\n",
1723
+ " f\"{male_browsing_level_2_16_percentage:.2f}\",\n",
1724
+ " f\"{female_to_male_ratio:.2f}\" if female_to_male_ratio is not None else \"N/A\",\n",
1725
+ " f\"{female_age_stats['mean']:.2f} ({female_age_stats['std']:.2f})\",\n",
1726
+ " f\"{male_age_stats['mean']:.2f} ({male_age_stats['std']:.2f})\"\n",
1727
+ " ])\n",
1728
+ "\n",
1729
+ "# Sort table rows by the filename (assumes filenames are formatted with sortable dates)\n",
1730
+ "table_rows = sorted(table_rows, key=lambda x: x[0])\n",
1731
+ "\n",
1732
+ "# Generate LaTeX table\n",
1733
+ "latex_table = r\"\"\"\n",
1734
+ "\\begin{table}[H]\n",
1735
+ "\\centering\n",
1736
+ "\\scriptsize\n",
1737
+ "\\renewcommand{\\arraystretch}{0.9}\n",
1738
+ "\\caption{Summary of Image Classification for 2023-2024}\n",
1739
+ "\\label{table:image_classification_2023_2024}\n",
1740
+ "\\begin{tabular}{lrrrrrrrrrr}\n",
1741
+ "\\toprule\n",
1742
+ "File Name & Total Images & Total Persons & No Persons & \\multicolumn{2}{c}{Female (\\%)} & \\multicolumn{2}{c}{Male (\\%)} & Female:Male & Female Age (Mean ± SD) & Male Age (Mean ± SD) \\\\\n",
1743
+ " & & & & L1 & L2-16 & L1 & L2-16 & & & \\\\\n",
1744
+ "\\midrule\n",
1745
+ "\"\"\"\n",
1746
+ "for row in table_rows:\n",
1747
+ " latex_table += \" & \".join(map(str, row)) + r\" \\\\\\\\\\n\"\n",
1748
+ "\n",
1749
+ "latex_table += r\"\"\"\n",
1750
+ "\\bottomrule\n",
1751
+ "\\end{tabular}\n",
1752
+ "\\vspace{1em}\n",
1753
+ "\\noindent\n",
1754
+ "\\textbf{Disclaimer:} \\(\\female\\) and \\(\\male\\) refer to female-read and male-read classifications as determined by the MiVOLO system's weights. \n",
1755
+ "We acknowledge the complexities of gender presentations and stress that these terms do not necessarily correspond to biological sex.\n",
1756
+ "\\end{table}\n",
1757
+ "\"\"\"\n",
1758
+ "\n",
1759
+ "# Output LaTeX table\n",
1760
+ "print(\"\\nGenerated LaTeX Table:\")\n",
1761
+ "print(latex_table)\n",
1762
+ "\n"
1763
+ ]
1764
+ },
1765
+ {
1766
+ "cell_type": "code",
1767
+ "execution_count": null,
1768
+ "id": "3ef428be-856b-4c4c-b1b0-a052d181d03b",
1769
+ "metadata": {},
1770
+ "outputs": [],
1771
+ "source": []
1772
+ }
1773
+ ],
1774
+ "metadata": {
1775
+ "kernelspec": {
1776
+ "display_name": "Python 3 (ipykernel)",
1777
+ "language": "python",
1778
+ "name": "python3"
1779
+ },
1780
+ "language_info": {
1781
+ "codemirror_mode": {
1782
+ "name": "ipython",
1783
+ "version": 3
1784
+ },
1785
+ "file_extension": ".py",
1786
+ "mimetype": "text/x-python",
1787
+ "name": "python",
1788
+ "nbconvert_exporter": "python",
1789
+ "pygments_lexer": "ipython3",
1790
+ "version": "3.12.10"
1791
+ }
1792
+ },
1793
+ "nbformat": 4,
1794
+ "nbformat_minor": 5
1795
+ }
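One subtlety in the row join above: the LaTeX row terminator has to be a plain string, because the raw form `r" \\\\\n"` writes five literal backslashes plus the letters `\n` into the table instead of `\\` followed by a newline. A minimal sketch of the difference (illustrative only, not part of the notebook):

```python
# Raw vs. plain escaping for the LaTeX row break:
raw = r" \\\\\n"    # raw string -> five literal backslashes + the letter 'n' (invalid LaTeX)
plain = " \\\\\n"   # plain string -> the two characters '\\' + a real newline (valid row end)
assert raw == " " + "\\" * 5 + "n"
assert plain == " " + "\\" * 2 + "\n"
```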
jupyter_notebooks/Section_2-3-1_Tag_occurences.ipynb ADDED
@@ -0,0 +1,801 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "71008ae2-4465-45d1-9ad0-6e6d54c99a69",
6
+ "metadata": {
7
+ "execution": {
8
+ "iopub.execute_input": "2025-12-09T20:15:54.294327Z",
9
+ "iopub.status.busy": "2025-12-09T20:15:54.294119Z",
10
+ "iopub.status.idle": "2025-12-09T20:15:54.296418Z",
11
+ "shell.execute_reply": "2025-12-09T20:15:54.295943Z",
12
+ "shell.execute_reply.started": "2025-12-09T20:15:54.294313Z"
13
+ }
14
+ },
15
+ "source": [
16
+ "# Tag occurence percentages"
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "code",
21
+ "execution_count": 31,
22
+ "id": "1577b529-19b4-471a-af8b-bc331087bb61",
23
+ "metadata": {
24
+ "execution": {
25
+ "iopub.execute_input": "2025-12-09T20:33:24.090495Z",
26
+ "iopub.status.busy": "2025-12-09T20:33:24.090266Z",
27
+ "iopub.status.idle": "2025-12-09T20:33:35.733409Z",
28
+ "shell.execute_reply": "2025-12-09T20:33:35.732815Z",
29
+ "shell.execute_reply.started": "2025-12-09T20:33:24.090478Z"
30
+ }
31
+ },
32
+ "outputs": [
33
+ {
34
+ "name": "stdout",
35
+ "output_type": "stream",
36
+ "text": [
37
+ "\n",
38
+ "==================================================\n",
39
+ "Tag Analysis for 'anime'\n",
40
+ "==================================================\n",
41
+ "Models with tag: 74721\n",
42
+ "Total models: 232164\n",
43
+ "Percentage: 32.18%\n",
44
+ "==================================================\n",
45
+ "\n"
46
+ ]
47
+ }
48
+ ],
49
+ "source": [
50
+ "from pathlib import Path\n",
51
+ "import pandas as pd\n",
52
+ "import sys\n",
53
+ "\n",
54
+ "current_dir = Path.cwd()\n",
55
+ "\n",
56
+ "# ============================================\n",
57
+ "# INPUT: Change these values\n",
58
+ "# ============================================\n",
59
+ "csv_file = current_dir.parent / \"data/CSV/models/Civi_models.csv\" # Your CSV file path\n",
60
+ "tag_to_find = \"anime\" # Tag to search for\n",
61
+ "# ============================================\n",
62
+ "\n",
63
+ "def calculate_tag_percentage(csv_file, tag_to_find):\n",
64
+ " \"\"\"\n",
65
+ " Calculate what percentage of models contain a specific tag.\n",
66
+ " \"\"\"\n",
67
+ " # Read the CSV file\n",
68
+ " df = pd.read_csv(csv_file)\n",
69
+ " \n",
70
+ " # Get all tag columns\n",
71
+ " tag_columns = [col for col in df.columns if col.startswith('tag_')]\n",
72
+ " \n",
73
+ " # Count total models\n",
74
+ " total_models = len(df)\n",
75
+ " \n",
76
+ " # Count models containing the tag (case-insensitive search)\n",
77
+ " tag_lower = tag_to_find.lower()\n",
78
+ " models_with_tag = 0\n",
79
+ " \n",
80
+ " for idx, row in df.iterrows():\n",
81
+ " # Check if the tag appears in any of the tag columns\n",
82
+ " for tag_col in tag_columns:\n",
83
+ " tag_value = str(row[tag_col]).lower().strip()\n",
84
+ " if tag_value == tag_lower:\n",
85
+ " models_with_tag += 1\n",
86
+ " break # Count each model only once\n",
87
+ " \n",
88
+ " # Calculate percentage\n",
89
+ " percentage = (models_with_tag / total_models * 100) if total_models > 0 else 0\n",
90
+ " \n",
91
+ " return {\n",
92
+ " 'tag': tag_to_find,\n",
93
+ " 'count': models_with_tag,\n",
94
+ " 'total': total_models,\n",
95
+ " 'percentage': percentage\n",
96
+ " }\n",
97
+ "\n",
98
+ "# Calculate and display results\n",
99
+ "result = calculate_tag_percentage(csv_file, tag_to_find)\n",
100
+ "\n",
101
+ "print(f\"\\n{'='*50}\")\n",
102
+ "print(f\"Tag Analysis for '{result['tag']}'\")\n",
103
+ "print(f\"{'='*50}\")\n",
104
+ "print(f\"Models with tag: {result['count']}\")\n",
105
+ "print(f\"Total models: {result['total']}\")\n",
106
+ "print(f\"Percentage: {result['percentage']:.2f}%\")\n",
107
+ "print(f\"{'='*50}\\n\")"
108
+ ]
109
+ },
110
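The per-row `iterrows` scan above is correct but pure-Python-slow on a 232,164-row table, and the same loop recurs in the next cell. The count can be done with vectorized pandas operations over the `tag_*` columns instead; a sketch under the same wide-format assumption (hypothetical helper, not part of the notebook):

```python
import pandas as pd

def tag_percentage_vectorized(csv_file, tag_to_find):
    # Same semantics as calculate_tag_percentage above: a model counts once
    # if ANY of its tag_* columns equals the tag (case-insensitive).
    df = pd.read_csv(csv_file)
    tag_cols = [c for c in df.columns if c.startswith('tag_')]
    tag = tag_to_find.lower()
    normalised = df[tag_cols].astype(str).apply(lambda col: col.str.lower().str.strip())
    hits = normalised.eq(tag).any(axis=1)
    pct = 100 * hits.mean() if len(df) else 0.0
    return int(hits.sum()), len(df), pct
```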
+ {
111
+ "cell_type": "code",
112
+ "execution_count": 28,
113
+ "id": "825dc9b8-ff0e-4afd-b2f1-e4bca1036aea",
114
+ "metadata": {
115
+ "execution": {
116
+ "iopub.execute_input": "2025-12-09T20:22:12.641296Z",
117
+ "iopub.status.busy": "2025-12-09T20:22:12.641105Z",
118
+ "iopub.status.idle": "2025-12-09T20:22:25.293171Z",
119
+ "shell.execute_reply": "2025-12-09T20:22:25.292530Z",
120
+ "shell.execute_reply.started": "2025-12-09T20:22:12.641281Z"
121
+ }
122
+ },
123
+ "outputs": [
124
+ {
125
+ "name": "stdout",
126
+ "output_type": "stream",
127
+ "text": [
128
+ "\n",
129
+ "=== Tag Analysis for '-f' ===\n",
130
+ "Models with tag: 0\n",
131
+ "Total models: 232164\n",
132
+ "Percentage: 0.00%\n",
133
+ "\n"
134
+ ]
135
+ }
136
+ ],
137
+ "source": [
138
+ "from pathlib import Path\n",
139
+ "import pandas as pd\n",
140
+ "import sys\n",
141
+ "\n",
142
+ "current_dir = Path.cwd()\n",
143
+ "\n",
144
+ "\n",
145
+ "def calculate_tag_percentage(csv_file, tag_to_find):\n",
146
+ " # Read the CSV file\n",
147
+ " df = pd.read_csv(csv_file)\n",
148
+ " \n",
149
+ " # Get all tag columns\n",
150
+ " tag_columns = [col for col in df.columns if col.startswith('tag_')]\n",
151
+ " \n",
152
+ " # Count total models\n",
153
+ " total_models = len(df)\n",
154
+ " \n",
155
+ " # Count models containing the tag (case-insensitive search)\n",
156
+ " tag_lower = tag_to_find.lower()\n",
157
+ " models_with_tag = 0\n",
158
+ " \n",
159
+ " for idx, row in df.iterrows():\n",
160
+ " # Check if the tag appears in any of the tag columns\n",
161
+ " for tag_col in tag_columns:\n",
162
+ " tag_value = str(row[tag_col]).lower().strip()\n",
163
+ " if tag_value == tag_lower:\n",
164
+ " models_with_tag += 1\n",
165
+ " break # Count each model only once\n",
166
+ " \n",
167
+ " # Calculate percentage\n",
168
+ " percentage = (models_with_tag / total_models * 100) if total_models > 0 else 0\n",
169
+ " \n",
170
+ " return {\n",
171
+ " 'tag': tag_to_find,\n",
172
+ " 'count': models_with_tag,\n",
173
+ " 'total': total_models,\n",
174
+ " 'percentage': percentage\n",
175
+ " }\n",
176
+ "\n",
177
+ "\n",
178
+ "def analyze_all_tags(csv_file):\n",
179
+ "\n",
180
+ " df = pd.read_csv(csv_file)\n",
181
+ " tag_columns = [col for col in df.columns if col.startswith('tag_')]\n",
182
+ " total_models = len(df)\n",
183
+ " \n",
184
+ " # Collect all tags and count occurrences\n",
185
+ " tag_counts = {}\n",
186
+ " for tag_col in tag_columns:\n",
187
+ " for tag in df[tag_col].dropna():\n",
188
+ " tag = str(tag).strip()\n",
189
+ " if tag: # Ignore empty strings\n",
190
+ " tag_counts[tag] = tag_counts.get(tag, 0) + 1\n",
191
+ " \n",
192
+ " # Create results DataFrame\n",
193
+ " results = []\n",
194
+ " for tag, count in tag_counts.items():\n",
195
+ " percentage = (count / total_models * 100)\n",
196
+ " results.append({\n",
197
+ " 'tag': tag,\n",
198
+ " 'count': count,\n",
199
+ " 'percentage': round(percentage, 2)\n",
200
+ " })\n",
201
+ " \n",
202
+ " results_df = pd.DataFrame(results)\n",
203
+ " results_df = results_df.sort_values('count', ascending=False)\n",
204
+ " \n",
205
+ " return results_df\n",
206
+ "\n",
207
+ "\n",
208
+ "if __name__ == \"__main__\":\n",
209
+ " # Default CSV file path\n",
210
+ " csv_file = current_dir.parent / \"data/CSV/models/Civi_models.csv\"\n",
211
+ " \n",
212
+ " # Check if a specific tag is provided as argument\n",
213
+ " if len(sys.argv) > 1:\n",
214
+ " tag = sys.argv[1]\n",
215
+ " result = calculate_tag_percentage(csv_file, tag)\n",
216
+ " \n",
217
+ " print(f\"\\n=== Tag Analysis for '{result['tag']}' ===\")\n",
218
+ " print(f\"Models with tag: {result['count']}\")\n",
219
+ " print(f\"Total models: {result['total']}\")\n",
220
+ " print(f\"Percentage: {result['percentage']:.2f}%\\n\")\n",
221
+ " else:\n",
222
+ " # If no specific tag provided, show all tags\n",
223
+ " print(\"\\n=== All Tags Analysis ===\\n\")\n",
224
+ " results_df = analyze_all_tags(csv_file)\n",
225
+ " print(results_df.to_string(index=False))\n",
226
+ " print(f\"\\nTotal unique tags: {len(results_df)}\")\n",
227
+ " print(f\"Total models: {len(pd.read_csv(csv_file))}\\n\")\n",
228
+ " \n",
229
+ " # Show example usage\n",
230
+ " print(\"\\nTo search for a specific tag, run:\")\n",
231
+ " print(\" python tag_percentage_calculator.py <tag_name>\")\n",
232
+ " print(\"\\nExample:\")\n",
233
+ " print(\" python tag_percentage_calculator.py anime\")"
234
+ ]
235
+ },
236
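The `'-f'` analysis in the output above is a Jupyter artifact: the kernel is launched as `ipykernel_launcher.py -f <connection-file>`, so `len(sys.argv) > 1` is always true inside a notebook, `sys.argv[1]` is the literal flag `-f`, and the all-tags branch never runs. A guard along these lines (a sketch, not part of the notebook) keeps the CLI path script-only:

```python
import sys

def cli_tag_or_none():
    # Accept a tag from argv only when it is a real positional argument,
    # not a kernel flag such as '-f /path/to/kernel-....json'.
    if len(sys.argv) > 1 and not sys.argv[1].startswith("-"):
        return sys.argv[1]
    return None
```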
+ {
237
+ "cell_type": "code",
238
+ "execution_count": null,
239
+ "id": "63187f58-9777-4ffb-bdf9-93b191a60241",
240
+ "metadata": {
241
+ "execution": {
242
+ "iopub.execute_input": "2025-12-09T19:08:49.572641Z",
243
+ "iopub.status.busy": "2025-12-09T19:08:49.572453Z",
244
+ "iopub.status.idle": "2025-12-09T19:08:49.634561Z",
245
+ "shell.execute_reply": "2025-12-09T19:08:49.634109Z",
246
+ "shell.execute_reply.started": "2025-12-09T19:08:49.572627Z"
247
+ }
248
+ },
249
+ "outputs": [],
250
+ "source": [
251
+ "from pathlib import Path\n",
252
+ "import json\n",
253
+ "from collections import defaultdict\n",
254
+ "\n",
255
+ "current_dir = Path.cwd()\n",
256
+ "\n",
257
+ "\n",
258
+ "\n",
259
+ "def load_data(filepath):\n",
260
+ " \"\"\"Load the JSON data from file.\"\"\"\n",
261
+ " with open(filepath, 'r', encoding='utf-8') as f:\n",
262
+ " return json.load(f)\n",
263
+ "\n",
264
+ "def calculate_cooccurrence_rate(data, target_tag, cooccurring_tags):\n",
265
+ " \"\"\"\n",
266
+ " Calculate what percentage of target_tag occurrences co-occur with each tag in cooccurring_tags.\n",
267
+ " \n",
268
+ " Args:\n",
269
+ " data: Dictionary with 'nodes' and 'links'\n",
270
+ " target_tag: The main tag to analyze (e.g., \"woman\")\n",
271
+ " cooccurring_tags: List of tags to check co-occurrence with (e.g., [\"sexy\", \"pose\"])\n",
272
+ " \n",
273
+ " Returns:\n",
274
+ " Dictionary with results\n",
275
+ " \"\"\"\n",
276
+ " # Find the target tag's total occurrences\n",
277
+ " target_size = None\n",
278
+ " for node in data['nodes']:\n",
279
+ " if node['id'] == target_tag:\n",
280
+ " target_size = node['size']\n",
281
+ " break\n",
282
+ " \n",
283
+ " if target_size is None:\n",
284
+ " print(f\"Warning: Tag '{target_tag}' not found in nodes!\")\n",
285
+ " return None\n",
286
+ " \n",
287
+ " print(f\"\\n{'='*60}\")\n",
288
+ " print(f\"Analysis for tag: '{target_tag}'\")\n",
289
+ " print(f\"{'='*60}\")\n",
290
+ " print(f\"Total occurrences of '{target_tag}': {target_size:,}\")\n",
291
+ " print()\n",
292
+ " \n",
293
+ " # Find co-occurrences in links\n",
294
+ " results = {}\n",
295
+ " for cooccurring_tag in cooccurring_tags:\n",
296
+ " cooccurrence_count = 0\n",
297
+ " \n",
298
+ " # Check both directions in links\n",
299
+ " for link in data['links']:\n",
300
+ " if (link['source'] == target_tag and link['target'] == cooccurring_tag) or \\\n",
301
+ " (link['source'] == cooccurring_tag and link['target'] == target_tag):\n",
302
+ " cooccurrence_count = link['value']\n",
303
+ " break\n",
304
+ " \n",
305
+ " if cooccurrence_count > 0:\n",
306
+ " percentage = (cooccurrence_count / target_size) * 100\n",
307
+ " results[cooccurring_tag] = {\n",
308
+ " 'count': cooccurrence_count,\n",
309
+ " 'percentage': percentage\n",
310
+ " }\n",
311
+ " print(f\"Tag: '{cooccurring_tag}'\")\n",
312
+ " print(f\" Co-occurrences: {cooccurrence_count:,}\")\n",
313
+ " print(f\" Percentage: {percentage:.2f}%\")\n",
314
+ " print(f\" (i.e., {percentage:.2f}% of '{target_tag}' occurrences also have '{cooccurring_tag}')\")\n",
315
+ " else:\n",
316
+ " results[cooccurring_tag] = {\n",
317
+ " 'count': 0,\n",
318
+ " 'percentage': 0.0\n",
319
+ " }\n",
320
+ " print(f\"Tag: '{cooccurring_tag}'\")\n",
321
+ " print(f\" No co-occurrences found\")\n",
322
+ " print()\n",
323
+ " \n",
324
+ " # Calculate combined co-occurrence (both tags together)\n",
325
+ " print(f\"\\n{'='*60}\")\n",
326
+ " print(\"Combined Analysis\")\n",
327
+ " print(f\"{'='*60}\")\n",
328
+ " \n",
329
+ " # To find items with ALL tags, we'd need to look at the underlying data\n",
330
+ " # With just the graph structure, we can only report individual co-occurrences\n",
331
+ " print(f\"Individual co-occurrence rates calculated above.\")\n",
332
+ " print(f\"Note: To calculate how often ALL tags appear together,\")\n",
333
+ " print(f\"we would need access to the raw item-level data.\")\n",
334
+ " \n",
335
+ " return results\n",
336
+ "\n",
337
+ "def main():\n",
338
+ " # Load the data\n",
339
+ " filepath = current_dir.parent / \"public/json/nodes_all.json\"\n",
340
+ " print(\"Loading data...\")\n",
341
+ " data = load_data(filepath)\n",
342
+ " print(f\"Loaded {len(data['nodes']):,} nodes and {len(data['links']):,} links\")\n",
343
+ " \n",
344
+ " # Calculate co-occurrence rates\n",
345
+ " target_tag = \"woman\"\n",
346
+ " cooccurring_tags = [\"sexy\", \"pose\"]\n",
347
+ " \n",
348
+ " results = calculate_cooccurrence_rate(data, target_tag, cooccurring_tags)\n",
349
+ " \n",
350
+ " # Summary\n",
351
+ " print(f\"\\n{'='*60}\")\n",
352
+ " print(\"SUMMARY\")\n",
353
+ " print(f\"{'='*60}\")\n",
354
+ " if results:\n",
355
+ " for tag, stats in results.items():\n",
356
+ " print(f\"'{target_tag}' + '{tag}': {stats['percentage']:.2f}% ({stats['count']:,} occurrences)\")\n",
357
+ "\n",
358
+ "if __name__ == \"__main__\":\n",
359
+ " main()"
360
+ ]
361
+ },
362
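The cell above and the two that follow each re-scan `data['links']` linearly for every queried pair. Since the graph is static once loaded, it can be indexed a single time so that each pair lookup is O(1); a sketch assuming the same `{'nodes': [...], 'links': [...]}` schema as `nodes_all.json`:

```python
def build_cooccurrence_index(data):
    """Index node sizes and undirected link weights once for O(1) lookups."""
    sizes = {node["id"]: node["size"] for node in data["nodes"]}
    pair_counts = {frozenset((link["source"], link["target"])): link["value"]
                   for link in data["links"]}
    return sizes, pair_counts

def cooccurrence_pct(sizes, pair_counts, tag1, tag2):
    """Percentage of tag1 occurrences that also carry tag2 (None if tag1 unknown)."""
    if not sizes.get(tag1):
        return None
    return 100.0 * pair_counts.get(frozenset((tag1, tag2)), 0) / sizes[tag1]
```

The `frozenset` key makes the lookup direction-agnostic, matching the source/target symmetry handled explicitly in the loops above.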
+ {
363
+ "cell_type": "code",
364
+ "execution_count": 26,
365
+ "id": "6e0e8c6b-547e-4899-b001-1d4c6b31476f",
366
+ "metadata": {
367
+ "execution": {
368
+ "iopub.execute_input": "2025-12-09T19:48:36.009465Z",
369
+ "iopub.status.busy": "2025-12-09T19:48:36.009249Z",
370
+ "iopub.status.idle": "2025-12-09T19:48:36.063251Z",
371
+ "shell.execute_reply": "2025-12-09T19:48:36.062750Z",
372
+ "shell.execute_reply.started": "2025-12-09T19:48:36.009449Z"
373
+ }
374
+ },
375
+ "outputs": [
376
+ {
377
+ "name": "stdout",
378
+ "output_type": "stream",
379
+ "text": [
380
+ "Loading data...\n",
381
+ "Loaded 60,330 nodes and 16,921 links\n",
382
+ "\n",
383
+ "================================================================================\n",
384
+ "Top 100 Co-occurring Tags for: 'anime'\n",
385
+ "================================================================================\n",
386
+ "Total occurrences of 'anime': 74,187\n",
387
+ "\n",
388
+ "Rank Tag Count Percentage \n",
389
+ "------ ------------------------------ ------------ ------------\n",
390
+ "1 character 53,792 72.51%\n",
391
+ "2 woman 30,731 41.42%\n",
392
+ "3 girls 21,434 28.89%\n",
393
+ "4 female 14,286 19.26%\n",
394
+ "5 style 11,593 15.63%\n",
395
+ "6 game character 9,309 12.55%\n",
396
+ "7 sexy 8,876 11.96%\n",
397
+ "8 male 4,476 6.03%\n",
398
+ "9 video game 3,411 4.60%\n",
399
+ "10 man 3,333 4.49%\n",
400
+ "11 lora 3,260 4.39%\n",
401
+ "12 concept 2,568 3.46%\n",
402
+ "13 girl 2,472 3.33%\n",
403
+ "14 base model 2,197 2.96%\n",
404
+ "15 photorealistic 2,191 2.95%\n",
405
+ "16 manga 2,015 2.72%\n",
406
+ "17 boys 2,006 2.70%\n",
407
+ "18 anime character 1,841 2.48%\n",
408
+ "19 cartoon 1,759 2.37%\n",
409
+ "20 game 1,536 2.07%\n",
410
+ "21 men 1,431 1.93%\n",
411
+ "22 cute 1,325 1.79%\n",
412
+ "23 hentai 1,222 1.65%\n",
413
+ "24 clothing 1,131 1.52%\n",
414
+ "25 furry 1,114 1.50%\n",
415
+ "26 realistic 1,074 1.45%\n",
416
+ "27 styles 1,057 1.42%\n",
417
+ "28 illustration 971 1.31%\n",
418
+ "29 characters 967 1.30%\n",
419
+ "30 art style 949 1.28%\n",
420
+ "31 pokemon 932 1.26%\n",
421
+ "32 vtuber 890 1.20%\n",
422
+ "33 person 872 1.18%\n",
423
+ "34 artstyle 801 1.08%\n",
424
+ "35 anime girl 756 1.02%\n",
425
+ "36 blue archive 684 0.92%\n",
426
+ "37 2d 666 0.90%\n",
427
+ "38 fantasy 643 0.87%\n",
428
+ "39 art 633 0.85%\n",
429
+ "40 poses 619 0.83%\n",
430
+ "41 3d 578 0.78%\n",
431
+ "42 nsfw 550 0.74%\n",
432
+ "43 artist 544 0.73%\n",
433
+ "44 genshin impact 519 0.70%\n",
434
+ "45 idolmaster 483 0.65%\n",
435
+ "46 fire emblem 481 0.65%\n",
436
+ "47 fate 432 0.58%\n",
437
+ "48 waifu 406 0.55%\n",
438
+ "49 azur lane 399 0.54%\n",
439
+ "50 dragon ball 388 0.52%\n",
440
+ "51 ponyxl 379 0.51%\n",
441
+ "52 naruto 379 0.51%\n",
442
+ "53 precure 379 0.51%\n",
443
+ "54 videogame 372 0.50%\n",
444
+ "55 retro 356 0.48%\n",
445
+ "56 meme 355 0.48%\n",
446
+ "57 arknights 354 0.48%\n",
447
+ "58 hololive 347 0.47%\n",
448
+ "59 virtual youtuber 347 0.47%\n",
449
+ "60 umamusume 333 0.45%\n",
450
+ "61 falcom 326 0.44%\n",
451
+ "62 one piece 325 0.44%\n",
452
+ "63 boy 321 0.43%\n",
453
+ "64 chibi 315 0.42%\n",
454
+ "65 comics 303 0.41%\n",
455
+ "66 idolm@ster 295 0.40%\n",
456
+ "67 gundam 294 0.40%\n",
457
+ "68 bleach 294 0.40%\n",
458
+ "69 pose 288 0.39%\n",
459
+ "70 guy 284 0.38%\n",
460
+ "71 milf 281 0.38%\n",
461
+ "72 my hero academia 279 0.38%\n",
462
+ "73 genshin 276 0.37%\n",
463
+ "74 porn 268 0.36%\n",
464
+ "75 kawaii 257 0.35%\n",
465
+ "76 kantai collection 255 0.34%\n",
466
+ "77 galgame 250 0.34%\n",
467
+ "78 eiyuu densetsu 246 0.33%\n",
468
+ "79 animals 239 0.32%\n",
469
+ "80 yu-gi-oh! 236 0.32%\n",
470
+ "81 comic 232 0.31%\n",
471
+ "82 sex 228 0.31%\n",
472
+ "83 cinderella girls 227 0.31%\n",
473
+ "84 kancolle 223 0.30%\n",
474
+ "85 huge breasts 220 0.30%\n",
475
+ "86 clothes 219 0.30%\n",
476
+ "87 digital art 217 0.29%\n",
477
+ "88 oc 216 0.29%\n",
478
+ "89 scenery 215 0.29%\n",
479
+ "90 nintendo 215 0.29%\n",
480
+ "91 manhwa 214 0.29%\n",
481
+ "92 final fantasy 211 0.28%\n",
482
+ "93 nikke 211 0.28%\n",
483
+ "94 cosplay 208 0.28%\n",
484
+ "95 beautiful 207 0.28%\n",
485
+ "96 dragon ball z 207 0.28%\n",
486
+ "97 concepts 207 0.28%\n",
487
+ "98 videogame character 207 0.28%\n",
488
+ "99 thick thighs 205 0.28%\n",
489
+ "100 wide hips 202 0.27%\n",
490
+ "\n",
491
+ "================================================================================\n",
492
+ "\n"
493
+ ]
494
+ }
495
+ ],
496
+ "source": [
497
+ "from pathlib import Path\n",
498
+ "import json\n",
499
+ "from collections import defaultdict\n",
500
+ "\n",
501
+ "current_dir = Path.cwd()\n",
502
+ "\n",
503
+ "def load_data(filepath):\n",
504
+ " \"\"\"Load the JSON data from file.\"\"\"\n",
505
+ " with open(filepath, 'r', encoding='utf-8') as f:\n",
506
+ " return json.load(f)\n",
507
+ "\n",
508
+ "def get_top_cooccurrences(data, target_tag, top_n=10):\n",
509
+ " \"\"\"\n",
510
+ " Find the top N tags that co-occur with the target tag.\n",
511
+ " \n",
512
+ " Args:\n",
513
+ " data: Dictionary with 'nodes' and 'links'\n",
514
+ " target_tag: The main tag to analyze (e.g., \"woman\")\n",
515
+ " top_n: Number of top co-occurring tags to return (default: 10)\n",
516
+ " \n",
517
+ " Returns:\n",
518
+ " List of tuples (tag, count, percentage) sorted by count\n",
519
+ " \"\"\"\n",
520
+ " # Find the target tag's total occurrences\n",
521
+ " target_size = None\n",
522
+ " for node in data['nodes']:\n",
523
+ " if node['id'] == target_tag:\n",
524
+ " target_size = node['size']\n",
525
+ " break\n",
526
+ " \n",
527
+ " if target_size is None:\n",
528
+ " print(f\"Error: Tag '{target_tag}' not found in nodes!\")\n",
529
+ " return None, None\n",
530
+ " \n",
531
+ " # Find all co-occurrences in links\n",
532
+ " cooccurrences = []\n",
533
+ " \n",
534
+ " for link in data['links']:\n",
535
+ " if link['source'] == target_tag:\n",
536
+ " cooccurrences.append({\n",
537
+ " 'tag': link['target'],\n",
538
+ " 'count': link['value']\n",
539
+ " })\n",
540
+ " elif link['target'] == target_tag:\n",
541
+ " cooccurrences.append({\n",
542
+ " 'tag': link['source'],\n",
543
+ " 'count': link['value']\n",
544
+ " })\n",
545
+ " \n",
546
+ " # Sort by count (descending) and take top N\n",
547
+ " cooccurrences.sort(key=lambda x: x['count'], reverse=True)\n",
548
+ " top_cooccurrences = cooccurrences[:top_n]\n",
549
+ " \n",
550
+ " # Calculate percentages\n",
551
+ " results = []\n",
552
+ " for item in top_cooccurrences:\n",
553
+ " percentage = (item['count'] / target_size) * 100\n",
554
+ " results.append((item['tag'], item['count'], percentage))\n",
555
+ " \n",
556
+ " return results, target_size\n",
557
+ "\n",
558
+ "def display_results(target_tag, results, target_size, top_n):\n",
559
+ " \"\"\"Display the results in a formatted table.\"\"\"\n",
560
+ " if results is None:\n",
561
+ " return\n",
562
+ " \n",
563
+ " print(f\"\\n{'='*80}\")\n",
564
+ " print(f\"Top {top_n} Co-occurring Tags for: '{target_tag}'\")\n",
565
+ " print(f\"{'='*80}\")\n",
566
+ " print(f\"Total occurrences of '{target_tag}': {target_size:,}\\n\")\n",
567
+ " \n",
568
+ " if not results:\n",
569
+ " print(f\"No co-occurrences found for '{target_tag}'\")\n",
570
+ " return\n",
571
+ " \n",
572
+ " # Print header\n",
573
+ " print(f\"{'Rank':<6} {'Tag':<30} {'Count':<12} {'Percentage':<12}\")\n",
574
+ " print(f\"{'-'*6} {'-'*30} {'-'*12} {'-'*12}\")\n",
575
+ " \n",
576
+ " # Print results\n",
577
+ " for i, (tag, count, percentage) in enumerate(results, 1):\n",
578
+ " print(f\"{i:<6} {tag:<30} {count:<12,} {percentage:>10.2f}%\")\n",
579
+ " \n",
580
+ " print(f\"\\n{'='*80}\\n\")\n",
581
+ "\n",
582
+ "def main():\n",
583
+ " # Load the data\n",
584
+ " filepath = current_dir.parent / \"public/json/nodes_all.json\"\n",
585
+ " print(\"Loading data...\")\n",
586
+ " data = load_data(filepath)\n",
587
+ " print(f\"Loaded {len(data['nodes']):,} nodes and {len(data['links']):,} links\")\n",
588
+ " \n",
589
+ " # Analyze different tags\n",
590
+ " target_tags = [\"anime\"] # Add more tags here to analyze multiple\n",
591
+ " top_n = 100\n",
592
+ " \n",
593
+ " for target_tag in target_tags:\n",
594
+ " results, target_size = get_top_cooccurrences(data, target_tag, top_n)\n",
595
+ " display_results(target_tag, results, target_size, top_n)\n",
596
+ "\n",
597
+ "if __name__ == \"__main__\":\n",
598
+ " main()"
599
+ ]
600
+ },
601
+ {
602
+ "cell_type": "code",
603
+ "execution_count": 25,
604
+ "id": "e8af35af-a4b9-4011-b8c3-5ec8e75ce6c1",
605
+ "metadata": {
606
+ "execution": {
607
+ "iopub.execute_input": "2025-12-09T19:47:44.592679Z",
608
+ "iopub.status.busy": "2025-12-09T19:47:44.592464Z",
609
+ "iopub.status.idle": "2025-12-09T19:47:44.649077Z",
610
+ "shell.execute_reply": "2025-12-09T19:47:44.648523Z",
611
+ "shell.execute_reply.started": "2025-12-09T19:47:44.592664Z"
612
+ }
613
+ },
614
+ "outputs": [
615
+ {
616
+ "name": "stdout",
617
+ "output_type": "stream",
618
+ "text": [
619
+ "Loading data...\n",
620
+ "Loaded 60,330 nodes and 16,921 links\n",
621
+ "\n",
622
+ "\n",
623
+ "================================================================================\n",
624
+ "Co-occurrence Analysis: 'anime' + 'dragon ball'\n",
625
+ "================================================================================\n",
626
+ "\n",
627
+ "Total occurrences of 'anime': 74,187\n",
628
+ "Total occurrences of 'dragon ball': 479\n",
629
+ "\n",
630
+ "Items with BOTH tags: 388\n",
631
+ "\n",
632
+ ">>> 0.52% of 'anime' occurrences also have 'dragon ball'\n",
633
+ "\n",
634
+ "================================================================================\n",
635
+ "\n"
636
+ ]
637
+ }
638
+ ],
639
+ "source": [
640
+ "from pathlib import Path\n",
641
+ "import json\n",
642
+ "from collections import defaultdict\n",
643
+ "\n",
644
+ "current_dir = Path.cwd()\n",
645
+ "\n",
646
+ "def load_data(filepath):\n",
647
+ " \"\"\"Load the JSON data from file.\"\"\"\n",
648
+ " with open(filepath, 'r', encoding='utf-8') as f:\n",
649
+ " return json.load(f)\n",
650
+ "\n",
651
+ "def get_tag_cooccurrence(data, tag1, tag2):\n",
652
+ " \"\"\"\n",
653
+ " Find what percentage of tag1 occurrences also have tag2.\n",
654
+ " \n",
655
+ " Args:\n",
656
+ " data: Dictionary with 'nodes' and 'links'\n",
657
+ " tag1: Primary tag to analyze (e.g., \"cat\")\n",
658
+ " tag2: Secondary tag to check for (e.g., \"dog\")\n",
659
+ " \n",
660
+ " Returns:\n",
661
+ " Dictionary with co-occurrence information\n",
662
+ " \"\"\"\n",
663
+ " # Find the tags' total occurrences\n",
664
+ " tag1_size = None\n",
665
+ " tag2_size = None\n",
666
+ " \n",
667
+ " for node in data['nodes']:\n",
668
+ " if node['id'] == tag1:\n",
669
+ " tag1_size = node['size']\n",
670
+ " if node['id'] == tag2:\n",
671
+ " tag2_size = node['size']\n",
672
+ " \n",
673
+ " if tag1_size is None:\n",
674
+ " print(f\"Error: Tag '{tag1}' not found in nodes!\")\n",
675
+ " return None\n",
676
+ " \n",
677
+ " if tag2_size is None:\n",
678
+ " print(f\"Error: Tag '{tag2}' not found in nodes!\")\n",
679
+ " return None\n",
680
+ " \n",
681
+ " # Find co-occurrence count in links\n",
682
+ " # This represents how many items have BOTH tag1 AND tag2\n",
683
+ " cooccurrence_count = 0\n",
684
+ " \n",
685
+ " for link in data['links']:\n",
686
+ " if (link['source'] == tag1 and link['target'] == tag2) or \\\n",
687
+ " (link['source'] == tag2 and link['target'] == tag1):\n",
688
+ " cooccurrence_count = link['value']\n",
689
+ " break\n",
690
+ " \n",
691
+ " # Calculate percentage: what % of tag1 items also have tag2\n",
692
+ " percentage_with_tag2 = (cooccurrence_count / tag1_size) * 100 if tag1_size > 0 else 0\n",
693
+ " \n",
694
+ " return {\n",
695
+ " 'primary_tag': tag1,\n",
696
+ " 'secondary_tag': tag2,\n",
697
+ " 'primary_tag_total': tag1_size,\n",
698
+ " 'secondary_tag_total': tag2_size,\n",
699
+ " 'cooccurrence_count': cooccurrence_count,\n",
700
+ " 'percentage_with_secondary': percentage_with_tag2\n",
701
+ " }\n",
702
+ "\n",
703
+ "def display_cooccurrence_results(result):\n",
704
+ " \"\"\"Display the co-occurrence results in a formatted way.\"\"\"\n",
705
+ " if result is None:\n",
706
+ " return\n",
707
+ " \n",
708
+ " print(f\"\\n{'='*80}\")\n",
709
+ " print(f\"Co-occurrence Analysis: '{result['primary_tag']}' + '{result['secondary_tag']}'\")\n",
710
+ " print(f\"{'='*80}\\n\")\n",
711
+ " \n",
712
+ " print(f\"Total occurrences of '{result['primary_tag']}': {result['primary_tag_total']:,}\")\n",
713
+ " print(f\"Total occurrences of '{result['secondary_tag']}': {result['secondary_tag_total']:,}\")\n",
714
+ " print(f\"\\nItems with BOTH tags: {result['cooccurrence_count']:,}\")\n",
715
+ " print(f\"\\n>>> {result['percentage_with_secondary']:.2f}% of '{result['primary_tag']}' occurrences also have '{result['secondary_tag']}'\")\n",
716
+ " \n",
717
+ " print(f\"\\n{'='*80}\\n\")\n",
718
+ "\n",
719
+ "def analyze_multiple_pairs(data, tag_pairs):\n",
720
+ " \"\"\"\n",
721
+ " Analyze multiple tag pairs at once.\n",
722
+ " \n",
723
+ " Args:\n",
724
+ " data: Dictionary with 'nodes' and 'links'\n",
725
+ " tag_pairs: List of tuples, each containing two tags to compare\n",
726
+ " \"\"\"\n",
727
+ " results = []\n",
728
+ " \n",
729
+ " for tag1, tag2 in tag_pairs:\n",
730
+ " result = get_tag_cooccurrence(data, tag1, tag2)\n",
731
+ " if result:\n",
732
+ " results.append(result)\n",
733
+ " display_cooccurrence_results(result)\n",
734
+ " \n",
735
+ " return results\n",
736
+ "\n",
737
+ "def main():\n",
738
+ " # Load the data\n",
739
+ " filepath = current_dir.parent / \"public/json/nodes_all.json\"\n",
740
+ " print(\"Loading data...\")\n",
741
+ " data = load_data(filepath)\n",
742
+ " print(f\"Loaded {len(data['nodes']):,} nodes and {len(data['links']):,} links\\n\")\n",
743
+ " \n",
744
+ " # Analyze: What percentage of \"cat\" occurrences also have \"dog\"?\n",
745
+ " primary_tag = \"anime\" # The main tag you're interested in\n",
746
+ " secondary_tag = \"dragon ball\" # The tag you want to check for\n",
747
+ " \n",
748
+ " result = get_tag_cooccurrence(data, primary_tag, secondary_tag)\n",
749
+ " display_cooccurrence_results(result)\n",
750
+ " \n",
751
+ " # You can also check the reverse: What percentage of \"dog\" occurrences also have \"cat\"?\n",
752
+ " # result_reverse = get_tag_cooccurrence(data, \"dog\", \"cat\")\n",
753
+ " # display_cooccurrence_results(result_reverse)\n",
754
+ " \n",
755
+ " # Option: Analyze multiple pairs at once\n",
756
+ " # Uncomment the lines below to analyze multiple pairs\n",
757
+ " \"\"\"\n",
758
+ " tag_pairs = [\n",
759
+ " (\"cat\", \"dog\"),\n",
760
+ " (\"boy\", \"anime\"),\n",
761
+ " (\"girl\", \"anime\"),\n",
762
+ " (\"man\", \"photorealistic\")\n",
763
+ " ]\n",
764
+ " results = analyze_multiple_pairs(data, tag_pairs)\n",
765
+ " \"\"\"\n",
766
+ "\n",
767
+ "if __name__ == \"__main__\":\n",
768
+ " main()"
769
+ ]
770
+ },
771
+ {
772
+ "cell_type": "code",
773
+ "execution_count": null,
774
+ "id": "a79ee96c-a060-4cc5-82d8-5642ccbef328",
775
+ "metadata": {},
776
+ "outputs": [],
777
+ "source": []
778
+ }
779
+ ],
780
+ "metadata": {
781
+ "kernelspec": {
782
+ "display_name": "Python 3 (ipykernel)",
783
+ "language": "python",
784
+ "name": "python3"
785
+ },
786
+ "language_info": {
787
+ "codemirror_mode": {
788
+ "name": "ipython",
789
+ "version": 3
790
+ },
791
+ "file_extension": ".py",
792
+ "mimetype": "text/x-python",
793
+ "name": "python",
794
+ "nbconvert_exporter": "python",
795
+ "pygments_lexer": "ipython3",
796
+ "version": "3.13.9"
797
+ }
798
+ },
799
+ "nbformat": 4,
800
+ "nbformat_minor": 5
801
+ }
jupyter_notebooks/Section_2-3-2_top_10_most_popular_checkpoints.ipynb ADDED
@@ -0,0 +1,210 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "14f6b6e3-5edb-458a-9553-09616455ba9f",
6
+ "metadata": {},
7
+ "source": [
8
+ "# Section 6.3: Download Models"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "markdown",
13
+ "id": "64d0907d-780d-48a0-b2c4-a9de650a3760",
14
+ "metadata": {},
15
+ "source": [
16
+ "### Download models from CIVITAI to the respective folders to ~/ComfyUI/models\n"
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "code",
21
+ "execution_count": 1,
22
+ "id": "2f788cc1-d8ff-43b8-9051-a3d5d8d79086",
23
+ "metadata": {
24
+ "execution": {
25
+ "iopub.execute_input": "2025-02-06T18:25:06.875713Z",
26
+ "iopub.status.busy": "2025-02-06T18:25:06.875579Z",
27
+ "iopub.status.idle": "2025-02-06T18:25:07.380814Z",
28
+ "shell.execute_reply": "2025-02-06T18:25:07.380184Z",
29
+ "shell.execute_reply.started": "2025-02-06T18:25:06.875698Z"
30
+ }
31
+ },
32
+ "outputs": [],
33
+ "source": [
34
+ "import os\n",
35
+ "import re\n",
36
+ "import csv\n",
37
+ "import json\n",
38
+ "import time\n",
39
+ "import requests\n",
40
+ "from itertools import cycle\n",
41
+ "import pandas as pd\n",
42
+ "from requests.adapters import HTTPAdapter\n",
43
+ "from requests.packages.urllib3.util.retry import Retry\n",
44
+ "from pathlib import Path"
45
+ ]
46
+ },
47
+ {
48
+ "cell_type": "code",
49
+ "execution_count": 2,
50
+ "id": "7b6349d2-b839-4b8e-895e-cc7eb3fba88f",
51
+ "metadata": {
52
+ "execution": {
53
+ "iopub.execute_input": "2025-02-06T18:25:07.848340Z",
54
+ "iopub.status.busy": "2025-02-06T18:25:07.848183Z",
55
+ "iopub.status.idle": "2025-02-06T18:25:07.851420Z",
56
+ "shell.execute_reply": "2025-02-06T18:25:07.850952Z",
57
+ "shell.execute_reply.started": "2025-02-06T18:25:07.848325Z"
58
+ }
59
+ },
60
+ "outputs": [],
61
+ "source": [
62
+ "current_dir = Path.cwd()\n",
63
+ "api_karussell = current_dir.parent / 'misc/api_keys.txt'\n",
64
+ "target_dir = current_dir.parent / 'data/models/checkpoints/' \n",
65
+ "target_dir.parent.mkdir(parents=True, exist_ok=True)"
66
+ ]
67
+ },
68
+ {
69
+ "cell_type": "markdown",
70
+ "id": "3c2a9244-5070-4159-821f-39bb92051daf",
71
+ "metadata": {},
72
+ "source": [
73
+ "## function definition:"
74
+ ]
75
+ },
76
+ {
77
+ "cell_type": "code",
78
+ "execution_count": 3,
79
+ "id": "aeaaf1eb-9768-47ad-9298-7d7a8be59094",
80
+ "metadata": {
81
+ "execution": {
82
+ "iopub.execute_input": "2025-02-06T18:25:09.379460Z",
83
+ "iopub.status.busy": "2025-02-06T18:25:09.379303Z",
84
+ "iopub.status.idle": "2025-02-06T18:25:09.382034Z",
85
+ "shell.execute_reply": "2025-02-06T18:25:09.381557Z",
86
+ "shell.execute_reply.started": "2025-02-06T18:25:09.379446Z"
87
+ }
88
+ },
89
+ "outputs": [],
90
+ "source": [
91
+ "csv_file_path = current_dir.parent / 'data/CSV/model_subsets/Civiverse_checkpoint_only.csv'"
92
+ ]
93
+ },
94
+ {
95
+ "cell_type": "code",
96
+ "execution_count": null,
97
+ "id": "2cdc7051-af24-439c-b985-b5088bb13cc0",
98
+ "metadata": {
99
+ "execution": {
100
+ "iopub.execute_input": "2025-02-06T18:25:10.192171Z",
101
+ "iopub.status.busy": "2025-02-06T18:25:10.191997Z"
102
+ }
103
+ },
104
+ "outputs": [
105
+ {
106
+ "name": "stdout",
107
+ "output_type": "stream",
108
+ "text": [
109
+ "Downloaded: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/models/checkpoints/Pony/realDream_sdxlPony14.safetensors\n",
110
+ "Downloaded: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/models/checkpoints/Flux.1 D/acornIsSpinningFLUX_aisFluxV15.safetensors\n",
111
+ "Failed to download https://civitai.com/api/download/models/1376998: 401 Client Error: Unauthorized for url: https://civitai.com/api/download/models/1376998\n",
112
+ "Failed to download https://civitai.com/api/download/models/1379842: 401 Client Error: Unauthorized for url: https://civitai.com/api/download/models/1379842\n",
113
+ "Downloaded: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/models/checkpoints/Illustrious/miaomiaoHarem_v15a.safetensors\n"
114
+ ]
115
+ }
116
+ ],
117
+ "source": [
118
+ "with open(api_karussell, 'r') as f:\n",
119
+ " api_keys = [line.strip() for line in f if line.strip()]\n",
120
+ "api_key_cycle = cycle(api_keys) # Rotate API keys\n",
121
+ "\n",
122
+ "# Load CSV file\n",
123
+ "\n",
124
+ "df = pd.read_csv(csv_file_path)\n",
125
+ "\n",
126
+ "\n",
127
+ "\n",
128
+ "def get_filename_from_response(response, url):\n",
129
+ " if 'content-disposition' in response.headers:\n",
130
+ " content_disposition = response.headers['content-disposition']\n",
131
+ " filename = content_disposition.split(\"filename=\")[-1].strip(\"\\\"\")\n",
132
+ " else:\n",
133
+ " filename = url.split(\"/\")[-1]\n",
134
+ " return filename\n",
135
+ "\n",
136
+ "def download_file(url, dest_folder, api_key):\n",
137
+ " headers = {'Authorization': f'Bearer {api_key}'} # If authentication is needed\n",
138
+ " \n",
139
+ " try:\n",
140
+ " response = requests.get(url, headers=headers, stream=True)\n",
141
+ " response.raise_for_status()\n",
142
+ " \n",
143
+ " filename = get_filename_from_response(response, url)\n",
144
+ " dest_path = dest_folder / filename\n",
145
+ " \n",
146
+ " with open(dest_path, 'wb') as f:\n",
147
+ " for chunk in response.iter_content(chunk_size=8192):\n",
148
+ " f.write(chunk)\n",
149
+ " \n",
150
+ " print(f\"Downloaded: {dest_path}\")\n",
151
+ " except requests.exceptions.RequestException as e:\n",
152
+ " print(f\"Failed to download {url}: {e}\")\n",
153
+ "\n",
154
+ "def main():\n",
155
+ " for _, row in df.iterrows():\n",
156
+ " url = row['downloadUrl']\n",
157
+ " base_model = row['baseModel']\n",
158
+ " \n",
159
+ " if pd.isna(url) or pd.isna(base_model):\n",
160
+ " continue # Skip missing values\n",
161
+ " \n",
162
+ " # Create model-specific folder\n",
163
+ " model_folder = target_dir / base_model\n",
164
+ " model_folder.mkdir(parents=True, exist_ok=True)\n",
165
+ " \n",
166
+ " # Rotate API keys\n",
167
+ " api_key = next(api_key_cycle)\n",
168
+ " \n",
169
+ " # Download filfile\n",
170
+ " download_file(url, model_folder, api_key)\n",
171
+ " \n",
172
+ " # Sleep to avoid rate limits\n",
173
+ " time.sleep(2) # Adjust based on API limits\n",
174
+ "\n",
175
+ "\n",
176
+ "if __name__ == \"__main__\":\n",
177
+ " main()\n"
178
+ ]
179
+ },
180
+ {
181
+ "cell_type": "code",
182
+ "execution_count": null,
183
+ "id": "c99e2fee-eef2-4530-b69a-9a3db3bda84f",
184
+ "metadata": {},
185
+ "outputs": [],
186
+ "source": []
187
+ }
188
+ ],
189
+ "metadata": {
190
+ "kernelspec": {
191
+ "display_name": "Python 3 (ipykernel)",
192
+ "language": "python",
193
+ "name": "python3"
194
+ },
195
+ "language_info": {
196
+ "codemirror_mode": {
197
+ "name": "ipython",
198
+ "version": 3
199
+ },
200
+ "file_extension": ".py",
201
+ "mimetype": "text/x-python",
202
+ "name": "python",
203
+ "nbconvert_exporter": "python",
204
+ "pygments_lexer": "ipython3",
205
+ "version": "3.11.9"
206
+ }
207
+ },
208
+ "nbformat": 4,
209
+ "nbformat_minor": 5
210
+ }
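A side note on the download cell: `HTTPAdapter` and `Retry` are imported but never mounted, so every request gets exactly one attempt. The 401s in the output are authorization failures that retries cannot fix, but transient 429/5xx responses would benefit from backoff. A sketch of wiring the retry machinery into a per-key session (parameter names follow current urllib3; `make_session` is a hypothetical helper):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(api_key: str) -> requests.Session:
    # Retry transient failures (429/5xx) with exponential backoff;
    # a 401 still fails fast, since it signals a bad or expired key.
    retries = Retry(total=5, backoff_factor=1.0,
                    status_forcelist=[429, 500, 502, 503, 504],
                    allowed_methods=["GET"])
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retries))
    session.headers.update({"Authorization": f"Bearer {api_key}"})
    return session
```

`download_file` could then call `session.get(url, stream=True)` instead of building headers for a bare `requests.get`.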
jupyter_notebooks/Section_2-3-3_Figure_7_top_30_adapters.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
jupyter_notebooks/Section_2-3-4_Figure_8_Step_1_LLM_annotation.ipynb ADDED
@@ -0,0 +1,1941 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "23d0ae58",
6
+ "metadata": {},
7
+ "source": [
8
+ "# Deepfake Adapter Dataset - LLM Annotation Pipeline"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "markdown",
13
+ "id": "e4407358",
14
+ "metadata": {},
15
+ "source": [
16
+ "### Unified Model Loading & Inference\n",
17
+ "Code for querying Mistral, Gemma, and Qwen models."
18
+ ]
19
+ },
20
+ {
21
+ "cell_type": "markdown",
22
+ "id": "1a1b9d0e",
23
+ "metadata": {},
24
+ "source": [
25
+ "## CLEANING & PREPROCESSING"
26
+ ]
27
+ },
28
+ {
29
+ "cell_type": "markdown",
30
+ "id": "3df42c46",
31
+ "metadata": {},
32
+ "source": [
33
+ "#### Named Entity Recognitition (NER) using SpaCy "
34
+ ]
35
+ },
36
+ {
37
+ "cell_type": "code",
38
+ "execution_count": null,
39
+ "id": "a287eef4",
40
+ "metadata": {},
41
+ "outputs": [],
42
+ "source": [
43
+ "import pandas as pd\n",
44
+ "import re\n",
45
+ "from pathlib import Path\n",
46
+ "import emoji\n",
47
+ "import spacy\n",
48
+ "\n",
49
+ "# Load spaCy model\n",
50
+ "# You may need to download it first: python -m spacy download en_core_web_sm\n",
51
+ "try:\n",
52
+ " nlp = spacy.load(\"en_core_web_sm\")\n",
53
+ " print(\"✅ spaCy model loaded: en_core_web_sm\")\n",
54
+ "except OSError:\n",
55
+ " print(\"❌ spaCy model not found. Downloading...\")\n",
56
+ " import subprocess\n",
57
+ " subprocess.run([\"python\", \"-m\", \"spacy\", \"download\", \"en_core_web_sm\"])\n",
58
+ " nlp = spacy.load(\"en_core_web_sm\")\n",
59
+ " print(\"✅ spaCy model downloaded and loaded\")\n",
60
+ "\n",
61
+ "# Set up paths\n",
62
+ "current_dir = Path.cwd()\n",
63
+ "#input_file = current_dir.parent / \"data/CSV/real_person_adapters.csv\"\n",
64
+ "input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter.csv\"\n",
65
+ "\n",
66
+ "# Load dataset\n",
67
+ "df = pd.read_csv(input_file)\n",
68
+ "print(f\"Loaded {len(df)} rows\")\n",
69
+ "\n",
70
+ "def translate_leetspeak(text: str) -> str:\n",
71
+ " \"\"\"\n",
72
+ " Translate common leetspeak patterns to normal letters.\n",
73
+ " Examples: 4kira -> Akira, 3mma -> Emma, 1rene -> Irene\n",
74
+ " \"\"\"\n",
75
+ " if not text:\n",
76
+ " return text\n",
77
+ " \n",
78
+ " # Common leetspeak mappings (order matters!)\n",
79
+ " leetspeak_map = {\n",
80
+ " '4': 'a',\n",
81
+ " '3': 'e', \n",
82
+ " '1': 'i',\n",
83
+ " '0': 'o',\n",
84
+ " '7': 't',\n",
85
+ " '5': 's',\n",
86
+ " '8': 'b',\n",
87
+ " '9': 'g',\n",
88
+ " '@': 'a',\n",
89
+ " '$': 's',\n",
90
+ " '!': 'i',\n",
91
+ " }\n",
92
+ " \n",
93
+ " result = text\n",
94
+ " # Apply mappings at word boundaries or start of string\n",
95
+ " for leet, normal in leetspeak_map.items():\n",
96
+ " # Replace at start of word\n",
97
+ " result = re.sub(rf'\\b{re.escape(leet)}', normal, result, flags=re.IGNORECASE)\n",
98
+ " # Replace standalone numbers that look like letters in context\n",
99
+ " result = re.sub(rf'(?<=[a-z]){re.escape(leet)}(?=[a-z])', normal, result, flags=re.IGNORECASE)\n",
100
+ " \n",
101
+ " return result\n",
102
+ "\n",
103
+ "def preprocess_for_ner(name: str) -> str:\n",
104
+ " \"\"\"\n",
105
+ " Preprocess the name before spaCy NER.\n",
106
+ " Remove noise but keep the actual name parts.\n",
107
+ " \"\"\"\n",
108
+ " if pd.isna(name):\n",
109
+ " return \"\"\n",
110
+ " \n",
111
+ " name = str(name)\n",
112
+ " \n",
113
+ " # FIRST: Translate leetspeak\n",
114
+ " name = translate_leetspeak(name)\n",
115
+ " \n",
116
+ " # Remove emoji\n",
117
+ " name = emoji.replace_emoji(name, replace=' ')\n",
118
+ " \n",
119
+ " # Remove version indicators (v1, v2, v1.0, etc.)\n",
120
+ " name = re.sub(r'\\s*[vV]\\d+(\\.\\d+)?\\s*', ' ', name)\n",
121
+ " \n",
122
+ " # Remove LoRA-related terms (case insensitive)\n",
123
+ " lora_terms = ['lora', 'loha', 'lycoris', 'controlnet', 'textual inversion', \n",
124
+ " 'embedding', 'ti', 'checkpoint', 'model', 'adapter', 'pony', 'sdxl', 'flux', 'illustrious', 'sd14', 'sd14', 'sd2', 'sd3', 'diffusion', 'stable', 'hunyuan']\n",
125
+ " for term in lora_terms:\n",
126
+ " name = re.sub(rf'\\b{term}\\b', '', name, flags=re.IGNORECASE)\n",
127
+ " \n",
128
+ " # Remove content in parentheses or brackets (often metadata)\n",
129
+ " name = re.sub(r'\\([^)]*\\)', '', name)\n",
130
+ " name = re.sub(r'\\[[^\\]]*\\]', '', name)\n",
131
+ " \n",
132
+ " # Remove special characters like 「」\n",
133
+ " name = re.sub(r'[「」『』【】〈〉《》]', '', name)\n",
134
+ " \n",
135
+ " # Handle pipe - keep first part\n",
136
+ " if '|' in name:\n",
137
+ " name = name.split('|')[0]\n",
138
+ " \n",
139
+ " # Handle forward slash - keep first part\n",
140
+ " if '/' in name:\n",
141
+ " name = name.split('/')[0]\n",
142
+ " \n",
143
+ " # Replace underscores with spaces\n",
144
+ " name = name.replace('_', ' ')\n",
145
+ " \n",
146
+ " # Remove multiple spaces\n",
147
+ " name = re.sub(r'\\s+', ' ', name)\n",
148
+ " \n",
149
+ " # Strip\n",
150
+ " name = name.strip()\n",
151
+ " \n",
152
+ " return name\n",
153
+ "\n",
154
+ "def extract_person_name(text: str) -> str:\n",
155
+ " \"\"\"\n",
156
+ " Use spaCy NER to extract person names from text.\n",
157
+ " Falls back to cleaned text if no PERSON entity found.\n",
158
+ " \"\"\"\n",
159
+ " if not text:\n",
160
+ " return \"\"\n",
161
+ " \n",
162
+ " # Run spaCy NER\n",
163
+ " doc = nlp(text)\n",
164
+ " \n",
165
+ " # Extract PERSON entities\n",
166
+ " person_entities = [ent.text for ent in doc.ents if ent.label_ == \"PERSON\"]\n",
167
+ " \n",
168
+ " if person_entities:\n",
169
+ " # Return the first (usually longest/best) person name\n",
170
+ " return person_entities[0].strip()\n",
171
+ " \n",
172
+ " # If no PERSON entity found, try to extract capitalized words (likely names)\n",
173
+ " # This helps with names spaCy might miss\n",
174
+ " words = text.split()\n",
175
+ " capitalized_words = [w for w in words if w and w[0].isupper() and len(w) > 1]\n",
176
+ " \n",
177
+ " if capitalized_words:\n",
178
+ " # Join first few capitalized words (likely the name)\n",
179
+ " return ' '.join(capitalized_words[:3]).strip()\n",
180
+ " \n",
181
+ " # Last resort: return cleaned text\n",
182
+ " return text.strip()\n",
183
+ "\n",
184
+ "def clean_name_with_spacy(name: str) -> str:\n",
185
+ " \"\"\"\n",
186
+ " Complete name cleaning pipeline with spaCy NER.\n",
187
+ " \n",
188
+ " Pipeline:\n",
189
+ " 1. Translate leetspeak (4→a, 3→e, 1→i, etc.)\n",
190
+ " 2. Remove noise (emoji, version tags, LoRA terms)\n",
191
+ " 3. Use spaCy to extract PERSON entities\n",
192
+ " 4. Fallback to capitalized words or cleaned text\n",
193
+ " \"\"\"\n",
194
+ " # Step 1 & 2: Preprocess (leetspeak + noise removal)\n",
195
+ " preprocessed = preprocess_for_ner(name)\n",
196
+ " \n",
197
+ " if not preprocessed:\n",
198
+ " return \"\"\n",
199
+ " \n",
200
+ " # Step 3: Extract person name using spaCy NER\n",
201
+ " person_name = extract_person_name(preprocessed)\n",
202
+ " \n",
203
+ " return person_name\n",
204
+ "\n",
205
+ "# Apply name cleaning with spaCy\n",
206
+ "print(\"\\n🔄 Processing names with spaCy NER...\")\n",
207
+ "df['real_name'] = df['name'].apply(clean_name_with_spacy)\n",
208
+ "\n",
209
+ "# Show examples with detailed comparison\n",
210
+ "print(\"\\n📊 Name cleaning examples (with spaCy NER):\")\n",
211
+ "print(\"=\" * 100)\n",
212
+ "print(f\"{'Original Name':<50} | {'Cleaned Name':<30}\")\n",
213
+ "print(\"=\" * 100)\n",
214
+ "\n",
215
+ "examples = df[['name', 'real_name']].head(30)\n",
216
+ "shown = 0\n",
217
+ "for idx, row in examples.iterrows():\n",
218
+ " if row['name'] != row['real_name'] and shown < 20:\n",
219
+ " print(f\"{row['name']:<50} | {row['real_name']:<30}\")\n",
220
+ " shown += 1\n",
221
+ "\n",
222
+ "print(\"=\" * 100)\n",
223
+ "\n",
224
+ "# Show specific test cases\n",
225
+ "print(\"\\n🧪 Leetspeak translation examples:\")\n",
226
+ "test_names = ['4kira LoRA', '3mma Watson v2', '1rene LORA', 'L3vi Ackerman']\n",
227
+ "for test in test_names:\n",
228
+ " result = clean_name_with_spacy(test)\n",
229
+ " print(f\" {test:<30} -> {result}\")\n",
230
+ "\n",
231
+ "# Statistics\n",
232
+ "print(f\"\\n📈 Statistics:\")\n",
233
+ "print(f\" Total rows: {len(df)}\")\n",
234
+ "print(f\" Non-empty names: {(df['real_name'] != '').sum()}\")\n",
235
+ "print(f\" Empty names: {(df['real_name'] == '').sum()}\")\n",
236
+ "\n",
237
+ "# Show some examples of what spaCy identified\n",
238
+ "print(\"\\n🎯 Sample spaCy NER results:\")\n",
239
+ "sample_names = df['real_name'].head(20).tolist()\n",
240
+ "for i, name in enumerate(sample_names[:10], 1):\n",
241
+ " if name:\n",
242
+ " print(f\" {i}. {name}\")\n",
243
+ "\n",
244
+ "print(f\"\\n✅ Cleaned {len(df)} names using spaCy NER\")\n",
245
+ "\n",
246
+ "# Save intermediate result\n",
247
+ "output_step1 = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_01_NER.csv\"\n",
248
+ "df.to_csv(output_step1, index=False)\n",
249
+ "print(f\"💾 Saved to {output_step1}\")\n"
250
+ ]
251
+ },
252
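A performance note on the cell above: `df['name'].apply(clean_name_with_spacy)` pushes one document at a time through spaCy, whereas `nlp.pipe` batches the NER pass. A sketch that reuses the helpers defined above (`preprocess_for_ner`, the PERSON-entity check, and the capitalized-word fallback):

```python
def clean_names_batched(names, batch_size=256):
    # Preprocess per string, then run NER over the whole batch with nlp.pipe.
    pre = [preprocess_for_ner(n) for n in names]
    cleaned = []
    for text, doc in zip(pre, nlp.pipe(pre, batch_size=batch_size)):
        persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
        if persons:
            cleaned.append(persons[0].strip())
        else:
            caps = [w for w in text.split() if w and w[0].isupper() and len(w) > 1]
            cleaned.append(" ".join(caps[:3]).strip() if caps else text.strip())
    return cleaned

# df['real_name'] = clean_names_batched(df['name'].tolist())
```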
+ {
253
+ "cell_type": "markdown",
254
+ "id": "64687c72",
255
+ "metadata": {},
256
+ "source": [
257
+ "#### STEP 02: Nationality tag to Country hint\n",
258
+ "here tags related to nationality gets converted to the country equivalent."
259
+ ]
260
+ },
261
+ {
262
+ "cell_type": "code",
263
+ "execution_count": null,
264
+ "id": "d6eaef5b",
265
+ "metadata": {},
266
+ "outputs": [],
267
+ "source": [
268
+ "import pandas as pd\n",
269
+ "from pathlib import Path\n",
270
+ "\n",
271
+ "# Set up paths\n",
272
+ "current_dir = Path.cwd()\n",
273
+ "countries_file = current_dir.parent / \"misc/lists/countries.csv\"\n",
274
+ "professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
275
+ "input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_01_NER.csv\"\n",
276
+ "output_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
277
+ "\n",
278
+ "# Load datasets\n",
279
+ "poi_df = pd.read_csv(input_file)\n",
280
+ "countries_df = pd.read_csv(countries_file)\n",
281
+ "professions_df = pd.read_csv(professions_file)\n",
282
+ "\n",
283
+ "# Define uninhabited or non-relevant territories to exclude\n",
284
+ "excluded_territories = {\n",
285
+ " 'isle of man', 'bouvet island', 'heard island and mcdonald islands',\n",
286
+ " 'french southern territories', 'south georgia and the south sandwich islands',\n",
287
+ " 'svalbard and jan mayen', 'british indian ocean territory', 'antarctica',\n",
288
+ " 'christmas island', 'cocos (keeling) islands', 'norfolk island',\n",
289
+ " 'pitcairn', 'tokelau', 'united states minor outlying islands',\n",
290
+ " 'wallis and futuna', 'western sahara'\n",
291
+ "}\n",
292
+ "\n",
293
+ "# Step 1: Combine tags into one lowercase list\n",
294
+ "def combine_tags(row):\n",
295
+ " return [str(row[f\"tag_{i}\"]).strip().lower() for i in range(1, 8) if pd.notna(row.get(f\"tag_{i}\"))]\n",
296
+ "\n",
297
+ "poi_df[\"tags\"] = poi_df.apply(combine_tags, axis=1)\n",
298
+ "\n",
299
+ "# Step 2: Build tag → (country, nationality) mapping with PRIORITIES\n",
300
+ "tag_to_country_nationality = {}\n",
301
+ "# We'll use a priority score: direct country name = 3, nationality = 2, word parts = 1\n",
302
+ "\n",
303
+ "for _, row in countries_df.iterrows():\n",
304
+ " country = str(row[\"en_short_name\"]).strip()\n",
305
+ " nationality = str(row[\"nationality\"]).strip()\n",
306
+ " \n",
307
+ " # Skip excluded territories\n",
308
+ " if country.lower() in excluded_territories:\n",
309
+ " continue\n",
310
+ "\n",
311
+ " country_lc = country.lower()\n",
312
+ " nationality_lc = nationality.lower()\n",
313
+ "\n",
314
+ " # Store as (country, nationality, priority)\n",
315
+ " # Exact country name match = highest priority\n",
316
+ " if country_lc not in tag_to_country_nationality:\n",
317
+ " tag_to_country_nationality[country_lc] = (country, \"\", 3)\n",
318
+ " \n",
319
+ " # Exact nationality match = medium priority \n",
320
+ " if nationality_lc not in tag_to_country_nationality:\n",
321
+ " tag_to_country_nationality[nationality_lc] = (\"\", nationality, 2)\n",
322
+ " \n",
323
+ " # No-space versions\n",
324
+ " country_no_space = country_lc.replace(\" \", \"\")\n",
325
+ " nationality_no_space = nationality_lc.replace(\" \", \"\")\n",
326
+ " \n",
327
+ " if country_no_space not in tag_to_country_nationality:\n",
328
+ " tag_to_country_nationality[country_no_space] = (country, \"\", 3)\n",
329
+ " if nationality_no_space not in tag_to_country_nationality:\n",
330
+ " tag_to_country_nationality[nationality_no_space] = (\"\", nationality, 2)\n",
331
+ "\n",
332
+ " # Word parts = lowest priority (only for longer words to avoid false matches)\n",
333
+ " for part in country_lc.split():\n",
334
+ " if len(part) > 4: # Only words longer than 4 chars\n",
335
+ " if part not in tag_to_country_nationality:\n",
336
+ " tag_to_country_nationality[part] = (country, \"\", 1)\n",
337
+ " for part in nationality_lc.split():\n",
338
+ " if len(part) > 4:\n",
339
+ " if part not in tag_to_country_nationality:\n",
340
+ " tag_to_country_nationality[part] = (\"\", nationality, 1)\n",
341
+ "\n",
342
+ "print(f\"Built country/nationality mapping with {len(tag_to_country_nationality)} entries\")\n",
343
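+ "# e.g. (assuming Japan appears in the countries CSV):\n",
+ "#   tag_to_country_nationality['japan']    -> ('Japan', '', 3)\n",
+ "#   tag_to_country_nationality['japanese'] -> ('', 'Japanese', 2)\n",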
+ "\n",
344
+ "# Step 3: Infer likely_country and likely_nationality by checking ALL tags\n",
345
+ "def infer_country_and_nationality(tags):\n",
346
+ " \"\"\"\n",
347
+ " Check ALL tags and return the best match based on priority.\n",
348
+ " Priority: exact country name > nationality > word parts\n",
349
+ " \"\"\"\n",
350
+ " best_match = None\n",
351
+ " best_priority = 0\n",
352
+ " \n",
353
+ " for tag in tags:\n",
354
+ " # Try cleaned version (no spaces)\n",
355
+ " cleaned = tag.replace(\" \", \"\").lower()\n",
356
+ " \n",
357
+ " # Check cleaned version\n",
358
+ " if cleaned in tag_to_country_nationality:\n",
359
+ " country, nationality, priority = tag_to_country_nationality[cleaned]\n",
360
+ " if priority > best_priority and country and country.lower() not in excluded_territories:\n",
361
+ " best_match = (country, nationality)\n",
362
+ " best_priority = priority\n",
363
+ " \n",
364
+ " # Also check original tag\n",
365
+ " if tag in tag_to_country_nationality:\n",
366
+ " country, nationality, priority = tag_to_country_nationality[tag]\n",
367
+ " if priority > best_priority and country and country.lower() not in excluded_territories:\n",
368
+ " best_match = (country, nationality)\n",
369
+ " best_priority = priority\n",
370
+ " \n",
371
+ " if best_match:\n",
372
+ " return pd.Series(best_match)\n",
373
+ " return pd.Series([\"\", \"\"])\n",
374
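+ "# e.g. infer_country_and_nationality(['japan', 'idol']) -> ('Japan', ''), via the\n",
+ "# exact country-name match (priority 3), assuming Japan is in the countries CSV\n",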
+ "\n",
375
+ "poi_df[[\"likely_country\", \"likely_nationality\"]] = poi_df[\"tags\"].apply(infer_country_and_nationality)\n",
376
+ "\n",
377
+ "# Step 4: Build tag → profession mapping\n",
378
+ "profession_alias_map = {}\n",
379
+ "\n",
380
+ "for _, row in professions_df.iterrows():\n",
381
+ " canonical = str(row['profession']).strip().lower()\n",
382
+ " profession_alias_map[canonical] = canonical\n",
383
+ " for alias_col in ['alias_1', 'alias_2', 'alias_3']:\n",
384
+ " alias = row.get(alias_col)\n",
385
+ " if pd.notna(alias):\n",
386
+ " profession_alias_map[str(alias).strip().lower()] = canonical\n",
387
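+ "# e.g. a row with profession='actor', alias_1='actress' (hypothetical) maps both\n",
+ "# 'actor' and 'actress' to the canonical key 'actor'\n",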
+ "\n",
388
+ "# Step 5: Infer likely profession from tags\n",
389
+ "def infer_profession_from_tags(tags):\n",
390
+ " matched = []\n",
391
+ " for tag in tags:\n",
392
+ " cleaned = tag.strip().lower()\n",
393
+ " if cleaned in profession_alias_map:\n",
394
+ " matched.append(profession_alias_map[cleaned])\n",
395
+ "\n",
396
+ " if not matched:\n",
397
+ " return \"\"\n",
398
+ " if \"celebrity\" in matched and len(set(matched)) > 1:\n",
399
+ " # Drop 'celebrity' if other professions are present\n",
400
+ " matched = [m for m in matched if m != \"celebrity\"]\n",
401
+ "\n",
402
+ " return matched[0] # Return the first specific match\n",
403
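+ "\n",
+ "# e.g. tags resolving to ['celebrity', 'model'] return 'model': 'celebrity' is\n",
+ "# dropped whenever a more specific profession is also matched\n",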
+ "\n",
404
+ "\n",
405
+ "poi_df[\"likely_profession\"] = poi_df[\"tags\"].apply(infer_profession_from_tags)\n",
406
+ "\n",
407
+ "# Step 6: Save enriched dataset\n",
408
+ "poi_df.to_csv(output_file, index=False)\n",
409
+ "\n",
410
+ "# Preview results\n",
411
+ "print(f\"\\nProcessed {len(poi_df)} rows\")\n",
412
+ "print(f\"Rows with country: {(poi_df['likely_country'] != '').sum()}\")\n",
413
+ "print(f\"Rows with nationality: {(poi_df['likely_nationality'] != '').sum()}\")\n",
414
+ "print(f\"Rows with profession: {(poi_df['likely_profession'] != '').sum()}\")\n",
415
+ "\n",
416
+ "print(f\"\\nTop 10 countries:\")\n",
417
+ "print(poi_df[poi_df['likely_country'] != '']['likely_country'].value_counts().head(10))\n"
418
+ ]
419
+ },
420
+ {
421
+ "cell_type": "markdown",
422
+ "id": "4a4a58b3",
423
+ "metadata": {},
424
+ "source": [
425
+ "## LLM ANNOTATION"
426
+ ]
427
+ },
428
+ {
429
+ "cell_type": "markdown",
430
+ "id": "b298844d",
431
+ "metadata": {},
432
+ "source": [
433
+ "#### Model Configurations"
434
+ ]
435
+ },
436
+ {
437
+ "cell_type": "code",
438
+ "execution_count": null,
439
+ "id": "39f3d65e",
440
+ "metadata": {},
441
+ "outputs": [],
442
+ "source": [
443
+ "import pandas as pd\n",
444
+ "import json\n",
445
+ "import time\n",
446
+ "import re\n",
447
+ "from pathlib import Path\n",
448
+ "from tqdm import tqdm\n",
449
+ "import torch\n",
450
+ "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
451
+ "import signal\n",
452
+ "from contextlib import contextmanager\n",
453
+ "\n",
454
+ "# Configuration\n",
455
+ "current_dir = Path.cwd()\n",
456
+ "CACHE_DIR = current_dir.parent / \"data/models\"\n",
457
+ "CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
458
+ "\n",
459
+ "# Model configurations\n",
460
+ "MODEL_CONFIGS = {\n",
461
+ " 'mistral': {\n",
462
+ " 'name': 'mistralai/Mistral-7B-Instruct-v0.3',\n",
463
+ " 'dtype': torch.bfloat16,\n",
464
+ " 'quantization': None,\n",
465
+ " 'generation_params': {\n",
466
+ " 'max_new_tokens': 512,\n",
467
+ " 'temperature': 0.05,\n",
468
+ " 'do_sample': True,\n",
469
+ " 'top_p': 1.0,\n",
470
+ " }\n",
471
+ " },\n",
472
+ " 'gemma': {\n",
473
+ " 'name': 'google/gemma-3-27b-it',\n",
474
+ " 'dtype': torch.bfloat16,\n",
475
+ " 'quantization': None,\n",
476
+ " 'generation_params': {\n",
477
+ " 'max_new_tokens': 512,\n",
478
+ " 'temperature': 0.1,\n",
479
+ " 'do_sample': True,\n",
480
+ " 'top_p': 1.0,\n",
481
+ " }\n",
482
+ " },\n",
483
+ " 'qwen': {\n",
484
+ " 'name': 'Qwen/Qwen2.5-32B-Instruct',\n",
485
+ " 'dtype': None, # Will use quantization\n",
486
+ " 'quantization': BitsAndBytesConfig(\n",
487
+ " load_in_8bit=True,\n",
488
+ " llm_int8_threshold=6.0,\n",
489
+ " llm_int8_has_fp16_weight=False\n",
490
+ " ),\n",
491
+ " 'generation_params': {\n",
492
+ " 'max_new_tokens': 512,\n",
493
+ " 'temperature': 0.1,\n",
494
+ " 'do_sample': False,\n",
495
+ " }\n",
496
+ " }\n",
497
+ "}\n",
498
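+ "# e.g. MODEL_CONFIGS['qwen']['generation_params'] is what query_model() below\n",
+ "# passes to model.generate() (plus pad_token_id); 'quantization' triggers\n",
+ "# 8-bit loading in load_model()\n",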
+ "\n",
499
+ "PROFESSION_CATEGORIES = [\n",
500
+ " \"actor\",\n",
501
+ " \"adult performer\",\n",
502
+ " \"singer/musician\",\n",
503
+ " \"model\",\n",
504
+ " \"online personality\",\n",
505
+ " \"public figure\",\n",
506
+ " \"voice actor/ASMR\",\n",
507
+ " \"sports professional\",\n",
508
+ " \"tv personality\"\n",
509
+ "]\n"
510
+ ]
511
+ },
512
+ {
513
+ "cell_type": "markdown",
514
+ "id": "c215b38c",
515
+ "metadata": {},
516
+ "source": [
517
+ "#### Load Model Function"
518
+ ]
519
+ },
520
+ {
521
+ "cell_type": "code",
522
+ "execution_count": null,
523
+ "id": "cfb5b13e",
524
+ "metadata": {},
525
+ "outputs": [],
526
+ "source": [
527
+ "def load_model(model_type='mistral'):\n",
528
+ " \"\"\"\n",
529
+ " Load model and tokenizer based on type.\n",
530
+ " \n",
531
+ " Args:\n",
532
+ " model_type: 'mistral', 'gemma', or 'qwen'\n",
533
+ " \n",
534
+ " Returns:\n",
535
+ " tuple: (model, tokenizer, config)\n",
536
+ " \"\"\"\n",
537
+ " if model_type not in MODEL_CONFIGS:\n",
538
+ " raise ValueError(f\"Unknown model type: {model_type}. Choose from {list(MODEL_CONFIGS.keys())}\")\n",
539
+ " \n",
540
+ " config = MODEL_CONFIGS[model_type]\n",
541
+ " model_name = config['name']\n",
542
+ " \n",
543
+ " device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
544
+ " print(f\"Loading model: {model_name}\")\n",
545
+ " print(f\"Cache directory: {CACHE_DIR}\")\n",
546
+ " print(f\"Device: {device}\\n\")\n",
547
+ " \n",
548
+ " if device == \"cpu\":\n",
549
+ " print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
550
+ " \n",
551
+ " # Load tokenizer\n",
552
+ " try:\n",
553
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
554
+ " model_name,\n",
555
+ " cache_dir=str(CACHE_DIR),\n",
556
+ " use_fast=True\n",
557
+ " )\n",
558
+ " except:\n",
559
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
560
+ " model_name,\n",
561
+ " cache_dir=str(CACHE_DIR),\n",
562
+ " use_fast=False\n",
563
+ " )\n",
564
+ " \n",
565
+ " if tokenizer.pad_token is None:\n",
566
+ " tokenizer.pad_token = tokenizer.eos_token\n",
567
+ " \n",
568
+ " # Load model\n",
569
+ " model_kwargs = {\n",
570
+ " 'cache_dir': str(CACHE_DIR),\n",
571
+ " 'device_map': 'auto',\n",
572
+ " 'trust_remote_code': False\n",
573
+ " }\n",
574
+ " \n",
575
+ " if config['quantization']:\n",
576
+ " model_kwargs['quantization_config'] = config['quantization']\n",
577
+ " else:\n",
578
+ " model_kwargs['torch_dtype'] = config['dtype']\n",
579
+ " \n",
580
+ " model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)\n",
581
+ " model.eval()\n",
582
+ " \n",
583
+ " # Check VRAM\n",
584
+ " if torch.cuda.is_available():\n",
585
+ " vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
586
+ " print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
587
+ " \n",
588
+ " return model, tokenizer, config\n"
589
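+ "\n",
+ "# Typical use (editor's sketch; any key from MODEL_CONFIGS works):\n",
+ "# model, tokenizer, config = load_model('mistral')\n"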
+ ]
590
+ },
591
+ {
592
+ "cell_type": "markdown",
593
+ "id": "11b2221a",
594
+ "metadata": {},
595
+ "source": [
596
+ "#### Inference Code"
597
+ ]
598
+ },
599
+ {
600
+ "cell_type": "code",
601
+ "execution_count": null,
602
+ "id": "229f96bd",
603
+ "metadata": {},
604
+ "outputs": [],
605
+ "source": [
606
+ "@contextmanager\n",
607
+ "def timeout(duration):\n",
608
+ " \"\"\"Context manager for timeout.\"\"\"\n",
609
+ " def handler(signum, frame):\n",
610
+ " raise TimeoutError(\"Operation timed out\")\n",
611
+ " \n",
612
+ " signal.signal(signal.SIGALRM, handler)\n",
613
+ " signal.alarm(duration)\n",
614
+ " try:\n",
615
+ " yield\n",
616
+ " finally:\n",
617
+ " signal.alarm(0)\n",
618
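+ "# NOTE: SIGALRM-based timeouts work only on Unix and only in the main thread.\n",
+ "# Usage: with timeout(60): outputs = model.generate(...)\n",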
+ "\n",
619
+ "def query_model(prompt, model, tokenizer, config, use_timeout=False):\n",
620
+ " \"\"\"\n",
621
+ " Query model with given prompt.\n",
622
+ " \n",
623
+ " Args:\n",
624
+ " prompt: Input prompt string\n",
625
+ " model: Loaded model\n",
626
+ " tokenizer: Loaded tokenizer\n",
627
+ " config: Model configuration dict\n",
628
+ " use_timeout: Whether to use 60s timeout (for Qwen)\n",
629
+ " \n",
630
+ " Returns:\n",
631
+ " str: Model response or None on error\n",
632
+ " \"\"\"\n",
633
+ " try:\n",
634
+ " device = next(model.parameters()).device\n",
635
+ " \n",
636
+ " # Format as chat message\n",
637
+ " messages = [\n",
638
+ " {\"role\": \"system\", \"content\": \"You are a data extraction assistant. Respond with exactly 5 numbered lines containing ONLY values. No labels, no explanations, no prefixes. Follow the format precisely.\"},\n",
639
+ " {\"role\": \"user\", \"content\": prompt}\n",
640
+ " ]\n",
641
+ " \n",
642
+ " # Tokenize\n",
643
+ " if hasattr(tokenizer, 'apply_chat_template'):\n",
644
+ " text = tokenizer.apply_chat_template(\n",
645
+ " messages,\n",
646
+ " tokenize=False,\n",
647
+ " add_generation_prompt=True\n",
648
+ " )\n",
649
+ " else:\n",
650
+ " text = f\"[INST] {prompt} [/INST]\"\n",
651
+ " \n",
652
+ " inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
653
+ " \n",
654
+ " # Generation parameters\n",
655
+ " gen_kwargs = config['generation_params'].copy()\n",
656
+ " gen_kwargs['pad_token_id'] = tokenizer.eos_token_id\n",
657
+ " \n",
658
+ " # Generate\n",
659
+ " generation_fn = lambda: model.generate(**inputs, **gen_kwargs)\n",
660
+ " \n",
661
+ " if use_timeout:\n",
662
+ " with timeout(60):\n",
663
+ " with torch.no_grad():\n",
664
+ " outputs = generation_fn()\n",
665
+ " else:\n",
666
+ " with torch.no_grad():\n",
667
+ " outputs = generation_fn()\n",
668
+ " \n",
669
+ " # Decode\n",
670
+ " generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
671
+ " response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
672
+ " \n",
673
+ " return response.strip()\n",
674
+ " \n",
675
+ " except TimeoutError:\n",
676
+ " print(f\"[ERROR] Generation timed out after 60 seconds\")\n",
677
+ " return None\n",
678
+ " except Exception as e:\n",
679
+ " print(f\"[ERROR] Generation failed: {e}\")\n",
680
+ " return None\n"
681
+ ]
682
+ },
683
+ {
684
+ "cell_type": "markdown",
685
+ "id": "88f005f8",
686
+ "metadata": {},
687
+ "source": [
688
+ "#### Prompt creation"
689
+ ]
690
+ },
691
+ {
692
+ "cell_type": "code",
693
+ "execution_count": null,
694
+ "id": "dfe05463",
695
+ "metadata": {},
696
+ "outputs": [],
697
+ "source": [
698
+ "def create_prompt(row):\n",
699
+ " \"\"\"Create annotation prompt from row data.\"\"\"\n",
700
+ " name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
701
+ " \n",
702
+ " # Gather hints\n",
703
+ " hints = []\n",
704
+ " if pd.notna(row.get('likely_profession')):\n",
705
+ " hints.append(str(row['likely_profession']))\n",
706
+ " if pd.notna(row.get('likely_nationality')):\n",
707
+ " hints.append(str(row['likely_nationality']))\n",
708
+ " if pd.notna(row.get('likely_country')):\n",
709
+ " hints.append(str(row['likely_country']))\n",
710
+ " \n",
711
+ " # Add tags if needed\n",
712
+ " if len(hints) < 3:\n",
713
+ " for i in range(1, 8):\n",
714
+ " tag_col = f'tag_{i}'\n",
715
+ " if tag_col in row and pd.notna(row[tag_col]):\n",
716
+ " tag_val = str(row[tag_col])\n",
717
+ " if tag_val not in hints:\n",
718
+ " hints.append(tag_val)\n",
719
+ " if len(hints) >= 5:\n",
720
+ " break\n",
721
+ " \n",
722
+ " hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
723
+ " \n",
724
+ " return f\"\"\"Extract information about '{name}' ({hint_text}).\n",
725
+ "\n",
726
+ "Respond with EXACTLY 5 numbered lines. Each line must contain ONLY the value, no labels or extra text.\n",
727
+ "\n",
728
+ "FORMAT REQUIREMENTS:\n",
729
+ "1. Full legal name in Western order (first last). VALUE ONLY.\n",
730
+ "2. Stage names/aliases, comma-separated. If none, write \"None\". VALUE ONLY.\n",
731
+ "3. Gender: MUST be exactly one word: Male, Female, Other, or Unknown. VALUE ONLY.\n",
732
+ "4. Professions: Choose up to 3 from this list ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality. Comma-separated. VALUE ONLY.\n",
733
+ "5. Primary country: Country name only (e.g., \"China\", \"United States\", \"Colombia\"). VALUE ONLY.\n",
734
+ "\n",
735
+ "RULES:\n",
736
+ "- Professions MUST match the exact categories listed (actress = actor)\n",
737
+ "- \"online personality\" includes streamers, cosplayers, YouTubers, influencers\n",
738
+ "- \"public figure\" includes politicians, activists, journalists, authors\n",
739
+ "- Use \"Unknown\" when uncertain or for fictional characters\n",
740
+ "- NO explanatory text, NO labels like \"Gender:\", NO prefixes\n",
741
+ "- For multi-role people, list up to 3 categories by relevance\n",
742
+ "\n",
743
+ "EXAMPLE FORMAT:\n",
744
+ "1. Taylor Swift\n",
745
+ "2. None\n",
746
+ "3. Female\n",
747
+ "4. singer/musician, public figure\n",
748
+ "5. United States\"\"\"\n"
749
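+ "\n",
+ "# Example (editor's sketch, hypothetical row):\n",
+ "# create_prompt(pd.Series({'real_name': 'Jane Doe'}))  # -> prompt about 'Jane Doe'\n"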
+ ]
750
+ },
751
+ {
752
+ "cell_type": "markdown",
753
+ "id": "854fa668",
754
+ "metadata": {},
755
+ "source": [
756
+ "#### Response parsing code"
757
+ ]
758
+ },
759
+ {
760
+ "cell_type": "code",
761
+ "execution_count": null,
762
+ "id": "1a4be2ee",
763
+ "metadata": {},
764
+ "outputs": [],
765
+ "source": [
766
+ "def parse_response(response):\n",
767
+ " \"\"\"Parse model response into structured fields.\"\"\"\n",
768
+ " if not response:\n",
769
+ " return {\n",
770
+ " 'full_name': 'Unknown',\n",
771
+ " 'aliases': 'Unknown',\n",
772
+ " 'gender': 'Unknown',\n",
773
+ " 'profession_llm': 'Unknown',\n",
774
+ " 'country': 'Unknown'\n",
775
+ " }\n",
776
+ " \n",
777
+ " lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
778
+ " \n",
779
+ " fields = {\n",
780
+ " 'full_name': 'Unknown',\n",
781
+ " 'aliases': 'Unknown',\n",
782
+ " 'gender': 'Unknown',\n",
783
+ " 'profession_llm': 'Unknown',\n",
784
+ " 'country': 'Unknown'\n",
785
+ " }\n",
786
+ " \n",
787
+ " for line in lines:\n",
788
+ " if line.startswith('1.'):\n",
789
+ " fields['full_name'] = line[2:].strip()\n",
790
+ " elif line.startswith('2.'):\n",
791
+ " fields['aliases'] = line[2:].strip()\n",
792
+ " elif line.startswith('3.'):\n",
793
+ " gender_raw = line[2:].strip()\n",
794
+ " gender_raw = re.sub(r'^(Gender:|gender:)\\s*', '', gender_raw, flags=re.IGNORECASE)\n",
795
+ " gender_match = re.search(r'\\b(Male|Female|Other|Unknown)\\b', gender_raw, re.IGNORECASE)\n",
796
+ " fields['gender'] = gender_match.group(1).capitalize() if gender_match else gender_raw\n",
797
+ " elif line.startswith('4.'):\n",
798
+ " fields['profession_llm'] = line[2:].strip()\n",
799
+ " elif line.startswith('5.'):\n",
800
+ " country_raw = line[2:].strip()\n",
801
+ " country_raw = re.sub(r'^(Primary\\s+)?(associated\\s+)?country:\\s*', '', country_raw, flags=re.IGNORECASE)\n",
802
+ " fields['country'] = country_raw\n",
803
+ " \n",
804
+ " return fields\n"
805
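+ "\n",
+ "# Example (editor's sketch): a well-formed reply parses field by field, e.g.\n",
+ "# parse_response('1. Jane Doe\\n2. None\\n3. Female\\n4. actor\\n5. Canada') ->\n",
+ "# {'full_name': 'Jane Doe', 'aliases': 'None', 'gender': 'Female',\n",
+ "#  'profession_llm': 'actor', 'country': 'Canada'}\n"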
+ ]
806
+ },
807
+ {
808
+ "cell_type": "markdown",
809
+ "id": "7e2f7a86",
810
+ "metadata": {},
811
+ "source": [
812
+ "#### CSV annotation"
813
+ ]
814
+ },
815
+ {
816
+ "cell_type": "code",
817
+ "execution_count": null,
818
+ "id": "5f3dd5d6",
819
+ "metadata": {},
820
+ "outputs": [],
821
+ "source": [
822
+ "def annotate_dataset(model_type='mistral', test_mode=False, test_size=100, max_rows=50862, save_interval=10):\n",
823
+ " \"\"\"\n",
824
+ " Annotate dataset using specified model.\n",
825
+ " \n",
826
+ " Args:\n",
827
+ " model_type: 'mistral', 'gemma', or 'qwen'\n",
828
+ " test_mode: If True, only process test_size rows\n",
829
+ " test_size: Number of rows to process in test mode\n",
830
+ " max_rows: Maximum rows to process\n",
831
+ " save_interval: Save progress every N rows\n",
832
+ " \"\"\"\n",
833
+ " # Setup paths\n",
834
+ " input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
835
+ " output_file = current_dir.parent / f\"data/CSV/{model_type}_local_annotated_POI{'_test' if test_mode else ''}.csv\"\n",
836
+ " index_file = current_dir.parent / f\"misc/query_indicies/{model_type}_local_query_index.txt\"\n",
837
+ " index_file.parent.mkdir(parents=True, exist_ok=True)\n",
838
+ " \n",
839
+ " # Load model\n",
840
+ " model, tokenizer, config = load_model(model_type)\n",
841
+ " \n",
842
+ " # Load data\n",
843
+ " print(f\"Loaded {len(df)} rows from input file\")\n",
844
+ " df = pd.read_csv(input_file)\n",
845
+ " \n",
846
+ " # Merge existing annotations if available\n",
847
+ " if output_file.exists():\n",
848
+ " existing_df = pd.read_csv(output_file)\n",
849
+ " annotation_cols = ['full_name', 'aliases', 'gender', 'profession_llm', 'country']\n",
850
+ " for col in annotation_cols:\n",
851
+ " if col in existing_df.columns:\n",
852
+ " df[col] = existing_df[col][:len(df)]\n",
853
+ " \n",
854
+ " # Apply limits\n",
855
+ " if test_mode:\n",
856
+ " df = df.head(test_size).copy()\n",
857
+ " elif max_rows:\n",
858
+ " df = df.head(max_rows).copy()\n",
859
+ " \n",
860
+ " # Create prompts\n",
861
+ " df['prompt'] = df.apply(create_prompt, axis=1)\n",
862
+ " \n",
863
+ " # Load progress index\n",
864
+ " current_index = 0\n",
865
+ " if index_file.exists():\n",
866
+ " try:\n",
867
+ " current_index = int(index_file.read_text().strip())\n",
868
+ " except:\n",
869
+ " current_index = 0\n",
870
+ " \n",
871
+ " print(f\"Resuming from index {current_index}\")\n",
872
+ " \n",
873
+ " # Process rows\n",
874
+ " use_timeout = (model_type == 'qwen')\n",
875
+ " \n",
876
+ " for i in tqdm(range(current_index, len(df)), desc=f\"{model_type.capitalize()} Annotation\"):\n",
877
+ " prompt = df.at[i, \"prompt\"]\n",
878
+ " \n",
879
+ " # Query with retries\n",
880
+ " response = None\n",
881
+ " for attempt in range(3):\n",
882
+ " response = query_model(prompt, model, tokenizer, config, use_timeout)\n",
883
+ " \n",
884
+ " if response and len(response.strip()) > 10:\n",
885
+ " break\n",
886
+ " \n",
887
+ " print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
888
+ " time.sleep(0.5)\n",
889
+ " \n",
890
+ " # Skip if invalid\n",
891
+ " if not response or len(response.strip()) <= 10:\n",
892
+ " print(f\"❌ Row {i}: failed after retries, skipping\")\n",
893
+ " continue\n",
894
+ " \n",
895
+ " # Parse and validate\n",
896
+ " parsed = parse_response(response)\n",
897
+ " \n",
898
+ " if all(v == \"Unknown\" for v in parsed.values()):\n",
899
+ " print(f\"❌ Row {i}: parsed as all Unknown, skipping\")\n",
900
+ " continue\n",
901
+ " \n",
902
+ " # Write fields\n",
903
+ " for key, value in parsed.items():\n",
904
+ " df.at[i, key] = value\n",
905
+ " \n",
906
+ " current_index = i + 1\n",
907
+ " \n",
908
+ " # GPU cleanup\n",
909
+ " if torch.cuda.is_available():\n",
910
+ " torch.cuda.empty_cache()\n",
911
+ " torch.cuda.synchronize()\n",
912
+ " \n",
913
+ " # Save progress\n",
914
+ " if (i + 1) % save_interval == 0 or (i + 1) == len(df):\n",
915
+ " df.to_csv(output_file, index=False)\n",
916
+ " index_file.write_text(str(current_index))\n",
917
+ " print(f\"💾 Progress saved after row {i+1}\")\n",
918
+ " \n",
919
+ " # Final save\n",
920
+ " df.to_csv(output_file, index=False)\n",
921
+ " index_file.write_text(str(current_index))\n",
922
+ " print(f\"✓ Finished annotation with {model_type}\")\n"
923
+ ]
924
+ },
925
+ {
926
+ "cell_type": "markdown",
927
+ "id": "55da2f4c",
928
+ "metadata": {},
929
+ "source": [
930
+ "### Usage Examples\n",
931
+ "Run annotation with your chosen model."
932
+ ]
933
+ },
934
+ {
935
+ "cell_type": "code",
936
+ "execution_count": null,
937
+ "id": "351ea40c",
938
+ "metadata": {},
939
+ "outputs": [],
940
+ "source": [
941
+ "# Example 1: Annotate with Mistral (13.5 GB VRAM)\n",
942
+ "# annotate_dataset(model_type='mistral', test_mode=False)\n",
943
+ "\n",
944
+ "# Example 2: Annotate with Gemma (56.3 GB VRAM)\n",
945
+ "# annotate_dataset(model_type='gemma', test_mode=False)\n",
946
+ "\n",
947
+ "# Example 3: Annotate with Qwen (32.7 GB VRAM, 8-bit)\n",
948
+ "# annotate_dataset(model_type='qwen', test_mode=False)\n",
949
+ "\n",
950
+ "# Test mode (first 100 rows)\n",
951
+ "# annotate_dataset(model_type='mistral', test_mode=True, test_size=100)\n"
952
+ ]
953
+ },
954
+ {
955
+ "cell_type": "markdown",
956
+ "id": "6431d347-d80c-4e8b-83a7-531e5df95a72",
957
+ "metadata": {},
958
+ "source": [
959
+ "## EuroLLM-9B-Instruct"
960
+ ]
961
+ },
962
+ {
963
+ "cell_type": "code",
964
+ "execution_count": null,
965
+ "id": "e8203abc-e7c3-4cb6-aaeb-fdc6933981fc",
966
+ "metadata": {},
967
+ "outputs": [],
968
+ "source": [
969
+ "import pandas as pd\n",
970
+ "import json\n",
971
+ "import time\n",
972
+ "import re\n",
973
+ "from pathlib import Path\n",
974
+ "from tqdm import tqdm\n",
975
+ "import torch\n",
976
+ "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
977
+ "import signal\n",
978
+ "from contextlib import contextmanager\n",
979
+ "\n",
980
+ "current_dir = Path.cwd()\n",
981
+ "input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
982
+ "professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
983
+ "professions_mapped_file = current_dir.parent / \"misc/lists/professions_mapped.csv\"\n",
984
+ "# === PROCESS DATA ===\n",
985
+ "\n",
986
+ "\n",
987
+ "# === CONFIGURATION ===\n",
988
+ "TEST_MODE = False\n",
989
+ "TEST_SIZE = 100\n",
990
+ "MAX_ROWS = 50862\n",
991
+ "SAVE_INTERVAL = 10\n",
992
+ "\n",
993
+ "\n",
994
+ "index_file = current_dir.parent / \"misc/query_indicies/eurollm_local_query_index.txt\"\n",
995
+ "output_file = current_dir.parent / f\"data/CSV/eurollm_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
996
+ "\n",
997
+ "# Model settings\n",
998
+ "MODEL_NAME = \"utter-project/EuroLLM-9B-Instruct\"\n",
999
+ "#MODEL_NAME = \"Qwen/Qwen2.5-32B-Instruct\"\n",
1000
+ "#MODEL_NAME = \"Qwen/Qwen2.5-14B-Instruct\"\n",
1001
+ "#MODEL_NAME = \"Qwen/Qwen3-235B-A22B-Instruct-2507-FP8\"\n",
1002
+ "#MODEL_NAME = \"mistralai/Mistral-7B-Instruct-v0.3\"\n",
1003
+ "CACHE_DIR = current_dir.parent / \"data/models\"\n",
1004
+ "CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
1005
+ "\n",
1006
+ "# Define the SPECIFIC profession categories\n",
1007
+ "PROFESSION_CATEGORIES = [\n",
1008
+ " \"actor\",\n",
1009
+ " \"adult performer\",\n",
1010
+ " \"singer/musician\",\n",
1011
+ " \"model\",\n",
1012
+ " \"online personality\",\n",
1013
+ " \"public figure\",\n",
1014
+ " \"voice actor/ASMR\",\n",
1015
+ " \"sports professional\",\n",
1016
+ " \"tv personality\"\n",
1017
+ "]\n",
1018
+ "\n",
1019
+ "# === LOAD MODEL ===\n",
1020
+ "print(f\"Loading model: {MODEL_NAME}\")\n",
1021
+ "print(f\"Cache directory: {CACHE_DIR}\")\n",
1022
+ "print(f\"This may take a while on first run...\\n\")\n",
1023
+ "\n",
1024
+ "# Check GPU availability\n",
1025
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
1026
+ "print(f\"Device: {device}\")\n",
1027
+ "\n",
1028
+ "if device == \"cpu\":\n",
1029
+ " print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
1030
+ " print(\" Consider using a GPU or reducing model size.\")\n",
1031
+ "\n",
1032
+ "# Get HF token from credentials file\n",
1033
+ "import os\n",
1034
+ "credentials_dir = current_dir.parent / \"misc/credentials\"\n",
1035
+ "hf_token_file = credentials_dir / \"hf_token.txt\"\n",
1036
+ "\n",
1037
+ "HF_TOKEN = None\n",
1038
+ "if hf_token_file.exists():\n",
1039
+ " HF_TOKEN = hf_token_file.read_text().strip()\n",
1040
+ " print(\"✅ HF token loaded from credentials file\")\n",
1041
+ "else:\n",
1042
+ " print(\"⚠️ HF token file not found at:\", hf_token_file)\n",
1043
+ " print(\" The script will try to use cached credentials from 'huggingface-cli login'\")\n",
1044
+ " print(\" Or create the file: misc/credentials/hf_token.txt with your token\")\n",
1045
+ " HF_TOKEN = None # Will use cached token if available\n",
1046
+ "\n",
1047
+ "# Load tokenizer\n",
1048
+ "print(\"Loading tokenizer...\")\n",
1049
+ "try:\n",
1050
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
1051
+ " MODEL_NAME,\n",
1052
+ " cache_dir=str(CACHE_DIR),\n",
1053
+ " use_fast=True,\n",
1054
+ " token=HF_TOKEN\n",
1055
+ " )\n",
1056
+ "except Exception as e:\n",
1057
+ " print(f\"Failed with use_fast=True, trying use_fast=False...\")\n",
1058
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
1059
+ " MODEL_NAME,\n",
1060
+ " cache_dir=str(CACHE_DIR),\n",
1061
+ " use_fast=False,\n",
1062
+ " token=HF_TOKEN\n",
1063
+ " )\n",
1064
+ "\n",
1065
+ "# Ensure pad token is set\n",
1066
+ "if tokenizer.pad_token is None:\n",
1067
+ " tokenizer.pad_token = tokenizer.eos_token\n",
1068
+ "\n",
1069
+ "print(\"✅ Tokenizer loaded\")\n",
1070
+ "\n",
1071
+ "# Configure 8-bit quantization for A100\n",
1072
+ "print(\"Configuring 8-bit quantization...\")\n",
1073
+ "quantization_config = BitsAndBytesConfig(\n",
1074
+ " load_in_8bit=True,\n",
1075
+ " llm_int8_threshold=6.0,\n",
1076
+ " llm_int8_has_fp16_weight=False\n",
1077
+ ")\n",
1078
+ "\n",
1079
+ "# Load model with 8-bit quantization\n",
1080
+ "print(\"Loading model with 8-bit quantization (this may take several minutes)...\")\n",
1081
+ "model = AutoModelForCausalLM.from_pretrained(\n",
1082
+ " MODEL_NAME,\n",
1083
+ " cache_dir=str(CACHE_DIR),\n",
1084
+ " quantization_config=quantization_config,\n",
1085
+ " device_map=\"auto\",\n",
1086
+ " trust_remote_code=False,\n",
1087
+ " token=HF_TOKEN\n",
1088
+ ")\n",
1089
+ "model.eval()\n",
1090
+ "print(\"✅ Model loaded with 8-bit quantization\")\n",
1091
+ "\n",
1092
+ "# Check VRAM usage\n",
1093
+ "if torch.cuda.is_available():\n",
1094
+ " vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
1095
+ " print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
1096
+ "\n",
1097
+ "# === LOAD DATA ===\n",
1098
+ "if output_file.exists():\n",
1099
+ " print(\"Loading annotated CSV...\")\n",
1100
+ " df = pd.read_csv(output_file)\n",
1101
+ "else:\n",
1102
+ " print(\"Loading raw input CSV...\")\n",
1103
+ " df = pd.read_csv(input_file)\n",
1104
+ "\n",
1105
+ "\n",
1106
+ "# Try to load profession mapping files\n",
1107
+ "try:\n",
1108
+ " professions_df = pd.read_csv(professions_file)\n",
1109
+ " print(f\"✅ Loaded professions.csv\")\n",
1110
+ "except:\n",
1111
+ " print(\"⚠️ Warning: professions.csv not found\")\n",
1112
+ "\n",
1113
+ "try:\n",
1114
+ " prof_mapped_df = pd.read_csv(professions_mapped_file)\n",
1115
+ " print(f\"✅ Loaded profession mapping with {len(prof_mapped_df)} categories\")\n",
1116
+ "except:\n",
1117
+ " print(\"⚠️ Warning: professions_mapped.csv not found, using default categories\")\n",
1118
+ "\n",
1119
+ "profession_str = \", \".join(PROFESSION_CATEGORIES)\n",
1120
+ "\n",
1121
+ "print(f\"Loaded {len(df)} rows\")\n",
1122
+ "print(f\"\\nProfession categories ({len(PROFESSION_CATEGORIES)}):\")\n",
1123
+ "for cat in PROFESSION_CATEGORIES:\n",
1124
+ " print(f\" - {cat}\")\n",
1125
+ "\n",
1126
+ "if TEST_MODE:\n",
1127
+ " print(f\"\\nRunning in TEST MODE with {TEST_SIZE} samples\")\n",
1128
+ " df = df.head(TEST_SIZE).copy()\n",
1129
+ "elif MAX_ROWS:\n",
1130
+ " df = df.head(MAX_ROWS).copy()\n",
1131
+ "\n",
1132
+ "# === CREATE PROMPTS (OPTIMIZED FOR CLEAN OUTPUTS) ===\n",
1133
+ "def create_prompt(row):\n",
1134
+ " \"\"\"Create prompt for EuroLLM annotation with strict formatting requirements.\"\"\"\n",
1135
+ " name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
1136
+ " \n",
1137
+ " # Gather hints\n",
1138
+ " hints = []\n",
1139
+ " if pd.notna(row.get('likely_profession')):\n",
1140
+ " hints.append(str(row['likely_profession']))\n",
1141
+ " if pd.notna(row.get('likely_nationality')):\n",
1142
+ " hints.append(str(row['likely_nationality']))\n",
1143
+ " if pd.notna(row.get('likely_country')):\n",
1144
+ " hints.append(str(row['likely_country']))\n",
1145
+ " \n",
1146
+ " # Add tags if we don't have enough hints\n",
1147
+ " if len(hints) < 3:\n",
1148
+ " for i in range(1, 8):\n",
1149
+ " tag_col = f'tag_{i}'\n",
1150
+ " if tag_col in row and pd.notna(row[tag_col]):\n",
1151
+ " tag_val = str(row[tag_col])\n",
1152
+ " if tag_val not in hints:\n",
1153
+ " hints.append(tag_val)\n",
1154
+ " if len(hints) >= 5:\n",
1155
+ " break\n",
1156
+ " \n",
1157
+ " hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
1158
+ " \n",
1159
+ " return f\"\"\"Extract information about '{name}'. \n",
1160
+ "Context hints (DO NOT copy these as professions): {hint_text}\n",
1161
+ "\n",
1162
+ "Respond with EXACTLY 5 numbered lines. Each line must contain ONLY the value, no labels or extra text.\n",
1163
+ "\n",
1164
+ "FORMAT REQUIREMENTS:\n",
1165
+ "1. Full legal name in Western order (first last). VALUE ONLY.\n",
1166
+ "2. Stage names/aliases, comma-separated. If none, write \"None\". VALUE ONLY.\n",
1167
+ "3. Gender: MUST be exactly one word: Male, Female, Other, or Unknown. VALUE ONLY.\n",
1168
+ "4. Professions: Choose up to 3 from this list ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality. Comma-separated. VALUE ONLY.\n",
1169
+ "5. Primary country: Country name only (e.g., \"China\", \"United States\", \"Colombia\"). VALUE ONLY.\n",
1170
+ "\n",
1171
+ "CRITICAL RULES FOR PROFESSIONS (Line 4):\n",
1172
+ "- ONLY use the exact profession categories listed above\n",
1173
+ "- DO NOT use descriptive words like \"sexy\", \"photorealistic\", \"celebrity\"\n",
1174
+ "- DO NOT copy the hint words as professions\n",
1175
+ "- If uncertain about profession, write \"Unknown\"\n",
1176
+ "- Valid professions are ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality\n",
1177
+ "- Actress = actor, streamer = online personality, YouTuber = online personality\n",
1178
+ "\n",
1179
+ "OTHER RULES:\n",
1180
+ "- Use \"Unknown\" when uncertain or for fictional characters\n",
1181
+ "- NO explanatory text, NO labels like \"Gender:\", NO prefixes\n",
1182
+ "- For multi-role people, list up to 3 categories by relevance\"\"\"\n",
1183
+ "\n",
1184
+ "# Create prompts\n",
1185
+ "print(\"\\nCreating prompts...\")\n",
1186
+ "df['prompt'] = df.apply(create_prompt, axis=1)\n",
1187
+ "print(\"✅ Prompts created\")\n",
1188
+ "\n",
1189
+ "@contextmanager\n",
1190
+ "def timeout(duration):\n",
1191
+ " def handler(signum, frame):\n",
1192
+ " raise TimeoutError(\"Operation timed out\")\n",
1193
+ " \n",
1194
+ " # Set the signal handler and alarm\n",
1195
+ " signal.signal(signal.SIGALRM, handler)\n",
1196
+ " signal.alarm(duration)\n",
1197
+ " try:\n",
1198
+ " yield\n",
1199
+ " finally:\n",
1200
+ " signal.alarm(0) # Disable the alarm\n",
1201
+ "\n",
1202
+ "\n",
1203
+ "def query_eurollm_local(prompt: str) -> str:\n",
1204
+ " \"\"\"Query EuroLLM locally via transformers with very low temperature.\"\"\"\n",
1205
+ " try:\n",
1206
+ " # Format as chat message for EuroLLM with strict system prompt\n",
1207
+ " messages = [\n",
1208
+ " {\"role\": \"system\", \"content\": \"You are a data extraction assistant. Respond with exactly 5 numbered lines containing ONLY values. No labels, no explanations, no prefixes. Follow the format precisely.\"},\n",
1209
+ " {\"role\": \"user\", \"content\": prompt}\n",
1210
+ " ]\n",
1211
+ " \n",
1212
+ " # Tokenize\n",
1213
+ " if hasattr(tokenizer, 'apply_chat_template') and tokenizer.chat_template is not None:\n",
1214
+ " text = tokenizer.apply_chat_template(\n",
1215
+ " messages,\n",
1216
+ " tokenize=False,\n",
1217
+ " add_generation_prompt=True\n",
1218
+ " )\n",
1219
+ " else:\n",
1220
+ " # Fallback for models without chat template\n",
1221
+ " text = f\"[INST] {prompt} [/INST]\"\n",
1222
+ " \n",
1223
+ " inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
1224
+ " \n",
1225
+ " # Generate with timeout and very low temperature\n",
1226
+ " with timeout(60):\n",
1227
+ " with torch.no_grad():\n",
1228
+ " outputs = model.generate(\n",
1229
+ " **inputs,\n",
1230
+ " max_new_tokens=100,\n",
1231
+ " temperature=0.01, # Very low temperature for more deterministic outputs\n",
1232
+ " do_sample=True, # Must be True when temperature is set\n",
1233
+ " pad_token_id=tokenizer.eos_token_id\n",
1234
+ " )\n",
1235
+ " \n",
1236
+ " # Decode\n",
1237
+ " generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
1238
+ " response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
1239
+ " \n",
1240
+ " return response.strip()\n",
1241
+ " \n",
1242
+ " except TimeoutError:\n",
1243
+ " print(f\"[ERROR] Generation timed out after 60 seconds\")\n",
1244
+ " return None\n",
1245
+ " except Exception as e:\n",
1246
+ " print(f\"Generation error: {e}\")\n",
1247
+ " import traceback\n",
1248
+ " traceback.print_exc()\n",
1249
+ " return None\n",
1250
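+ "# Smoke test (editor's sketch): print(query_eurollm_local('Who is Taylor Swift?'))\n",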
+ "\n",
1251
+ " \n",
1252
+ "# === PARSE RESPONSE WITH CLEANING ===\n",
1253
+ "def parse_response(response):\n",
1254
+ " \"\"\"Parse EuroLLM response into structured fields with cleaning.\"\"\"\n",
1255
+ " if not response:\n",
1256
+ " return {\n",
1257
+ " 'full_name': 'Unknown',\n",
1258
+ " 'aliases': 'Unknown',\n",
1259
+ " 'gender': 'Unknown',\n",
1260
+ " 'profession_llm': 'Unknown',\n",
1261
+ " 'country': 'Unknown'\n",
1262
+ " }\n",
1263
+ " \n",
1264
+ " # Valid profession categories\n",
1265
+ " VALID_PROFESSIONS = {\n",
1266
+ " \"actor\", \"adult performer\", \"singer/musician\", \"model\", \n",
1267
+ " \"online personality\", \"public figure\", \"voice actor/asmr\", \n",
1268
+ " \"sports professional\", \"tv personality\"\n",
1269
+ " }\n",
1270
+ " \n",
1271
+ " # Split into lines and clean\n",
1272
+ " lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
1273
+ " \n",
1274
+ " # Initialize with Unknown values\n",
1275
+ " fields = {\n",
1276
+ " 'full_name': 'Unknown',\n",
1277
+ " 'aliases': 'Unknown',\n",
1278
+ " 'gender': 'Unknown',\n",
1279
+ " 'profession_llm': 'Unknown',\n",
1280
+ " 'country': 'Unknown'\n",
1281
+ " }\n",
1282
+ " \n",
1283
+ " # Extract information from each numbered line\n",
1284
+ " for line in lines:\n",
1285
+ " if line.startswith('1.'):\n",
1286
+ " fields['full_name'] = line[2:].strip()\n",
1287
+ " elif line.startswith('2.'):\n",
1288
+ " fields['aliases'] = line[2:].strip()\n",
1289
+ " elif line.startswith('3.'):\n",
1290
+ " # Clean gender field - remove any labels\n",
1291
+ " gender_raw = line[2:].strip()\n",
1292
+ " # Remove common prefixes\n",
1293
+ " gender_raw = re.sub(r'^(Gender:|gender:)\\s*', '', gender_raw, flags=re.IGNORECASE)\n",
1294
+ " # Extract just the gender word\n",
1295
+ " gender_match = re.search(r'\\b(Male|Female|Other|Unknown)\\b', gender_raw, re.IGNORECASE)\n",
1296
+ " fields['gender'] = gender_match.group(1).capitalize() if gender_match else gender_raw\n",
1297
+ " elif line.startswith('4.'):\n",
1298
+ " # Clean and validate profession field\n",
1299
+ " profession_raw = line[2:].strip()\n",
1300
+ " \n",
1301
+ " # Split by comma and validate each profession\n",
1302
+ " professions = [p.strip().lower() for p in profession_raw.split(',')]\n",
1303
+ " valid_profs = []\n",
1304
+ " \n",
1305
+ " for prof in professions:\n",
1306
+ " # Check if it's a valid profession\n",
1307
+ " if prof in VALID_PROFESSIONS:\n",
1308
+ " valid_profs.append(prof)\n",
1309
+ " # Check for common invalid entries\n",
1310
+ " elif prof in ['unknown', '']:\n",
1311
+ " continue\n",
1312
+ " # Reject descriptive words that aren't professions\n",
1313
+ " elif prof in ['sexy', 'photorealistic', 'celebrity', 'famous', 'popular', \n",
1314
+ " 'beautiful', 'attractive', 'hot', 'gorgeous']:\n",
1315
+ " continue\n",
1316
+ " # If it looks like it might be close to a valid profession, keep it\n",
1317
+ " elif any(valid in prof for valid in VALID_PROFESSIONS):\n",
1318
+ " # Try to extract the valid part\n",
1319
+ " for valid in VALID_PROFESSIONS:\n",
1320
+ " if valid in prof:\n",
1321
+ " valid_profs.append(valid)\n",
1322
+ " break\n",
1323
+ " \n",
1324
+ " # Set the cleaned professions or Unknown if none are valid\n",
1325
+ " if valid_profs:\n",
1326
+ " fields['profession_llm'] = ', '.join(valid_profs)\n",
1327
+ " else:\n",
1328
+ " fields['profession_llm'] = 'Unknown'\n",
1329
+ " \n",
1330
+ " elif line.startswith('5.'):\n",
1331
+ " # Clean country field - remove any labels\n",
1332
+ " country_raw = line[2:].strip()\n",
1333
+ " # Remove common prefixes like \"Primary country:\", \"Country:\", etc.\n",
1334
+ " country_raw = re.sub(r'^(Primary\\s+)?(associated\\s+)?country:\\s*', '', country_raw, flags=re.IGNORECASE)\n",
1335
+ " fields['country'] = country_raw\n",
1336
+ " \n",
1337
+ " return fields\n",
1338
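+ "\n",
+ "# Example (editor's sketch): descriptive words are filtered out, e.g. a line\n",
+ "# '4. celebrity, actor' yields profession_llm = 'actor'\n",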
+ "\n",
1339
+ "# === PROCESS DATA ===\n",
1340
+ "index_file.parent.mkdir(parents=True, exist_ok=True)\n",
1341
+ "\n",
1342
+ "# Load index\n",
1343
+ "current_index = 0\n",
1344
+ "if index_file.exists():\n",
1345
+ " try:\n",
1346
+ " current_index = int(index_file.read_text().strip())\n",
1347
+ " except:\n",
1348
+ " current_index = 0\n",
1349
+ "\n",
1350
+ "print(f\"Resuming from index {current_index}\")\n",
1351
+ "\n",
1352
+ "start_time = time.time()\n",
1353
+ "\n",
1354
+ "for i in tqdm(range(current_index, len(df)), desc=\"EuroLLM Local\"):\n",
1355
+ "\n",
1356
+ " prompt = df.at[i, \"prompt\"]\n",
1357
+ "\n",
1358
+ " # -------- MODEL QUERY WITH RETRIES --------\n",
1359
+ " response = None\n",
1360
+ " for attempt in range(3):\n",
1361
+ " response = query_eurollm_local(prompt)\n",
1362
+ " \n",
1363
+ " # DEBUG: Print first few responses to see what's happening\n",
1364
+ " if i < 5:\n",
1365
+ " print(f\"\\n=== DEBUG Row {i}, Attempt {attempt+1} ===\")\n",
1366
+ " print(f\"Response length: {len(response) if response else 0}\")\n",
1367
+ " print(f\"Response: {response[:500] if response else 'None'}\")\n",
1368
+ " print(\"=\" * 50)\n",
1369
+ " \n",
1370
+ " # Valid response?\n",
1371
+ " if response and len(response.strip()) > 10:\n",
1372
+ " break\n",
1373
+ " \n",
1374
+ " print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
1375
+ " time.sleep(0.5)\n",
1376
+ "\n",
1377
+ " # If still invalid → DO NOT overwrite previous data\n",
1378
+ " if not response or len(response.strip()) <= 10:\n",
1379
+ " print(f\"❌ Row {i}: failed after retries, not writing, not advancing index\")\n",
1380
+ " continue\n",
1381
+ "\n",
1382
+ " parsed = parse_response(response)\n",
1383
+ "\n",
1384
+ " # DEBUG: Print first few parsed results\n",
1385
+ " if i < 5:\n",
1386
+ " print(f\"\\n=== PARSED Row {i} ===\")\n",
1387
+ " for key, value in parsed.items():\n",
1388
+ " print(f\" {key}: {value}\")\n",
1389
+ " print(\"=\" * 50)\n",
1390
+ "\n",
1391
+ " # Additional safety: skip rows that parsed as all 'Unknown'\n",
1392
+ " if all(v == \"Unknown\" for v in parsed.values()):\n",
1393
+ " print(f\"❌ Row {i}: parsed as all Unknown (likely model crash); skipping.\")\n",
1394
+ " continue\n",
1395
+ "\n",
1396
+ " # -------- WRITE PARSED FIELDS SAFELY --------\n",
1397
+ " for key, value in parsed.items():\n",
1398
+ " df.at[i, key] = value\n",
1399
+ "\n",
1400
+ " # Advance progress ONLY after successful write\n",
1401
+ " current_index = i + 1\n",
1402
+ "\n",
1403
+ " # -------- GPU MEMORY CLEANUP --------\n",
1404
+ " if torch.cuda.is_available():\n",
1405
+ " torch.cuda.empty_cache()\n",
1406
+ " torch.cuda.synchronize()\n",
1407
+ "\n",
1408
+ " # -------- SAVE LIKE YOUR DEEPSEEK VERSION --------\n",
1409
+ " if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
1410
+ " df.to_csv(output_file, index=False)\n",
1411
+ " with open(index_file, \"w\") as f:\n",
1412
+ " f.write(str(current_index))\n",
1413
+ " print(f\"💾 Progress saved after row {i+1}\")\n",
1414
+ "\n",
1415
+ "# Final save\n",
1416
+ "df.to_csv(output_file, index=False)\n",
1417
+ "index_file.write_text(str(current_index))\n",
1418
+ "print(\"✅ Finished full dataset.\")"
1419
+ ]
1420
+ },
1421
+ {
1422
+ "cell_type": "markdown",
1423
+ "id": "472e5ac2-ec04-4bfa-8a67-116277238c15",
1424
+ "metadata": {},
1425
+ "source": [
1426
+ "## Mistral 24b instruct"
1427
+ ]
1428
+ },
1429
+ {
1430
+ "cell_type": "code",
1431
+ "execution_count": null,
1432
+ "id": "a55a5e30-83f3-4f7c-a537-b1216d4e8a07",
1433
+ "metadata": {
1434
+ "execution": {
1435
+ "iopub.execute_input": "2025-12-09T22:16:21.002786Z",
1436
+ "iopub.status.busy": "2025-12-09T22:16:21.002337Z"
1437
+ }
1438
+ },
1439
+ "outputs": [
1440
+ {
1441
+ "name": "stderr",
1442
+ "output_type": "stream",
1443
+ "text": [
1444
+ "/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
1445
+ " from .autonotebook import tqdm as notebook_tqdm\n"
1446
+ ]
1447
+ },
1448
+ {
1449
+ "name": "stdout",
1450
+ "output_type": "stream",
1451
+ "text": [
1452
+ "Loading model: mistralai/Mistral-Small-Instruct-2409\n",
1453
+ "Cache directory: /shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/data/models\n",
1454
+ "This may take a while on first run (~65GB download)...\n",
1455
+ "\n",
1456
+ "Device: cuda\n",
1457
+ "Loading tokenizer...\n",
1458
+ "✅ Tokenizer loaded\n",
1459
+ "Loading model (this may take several minutes)...\n"
1460
+ ]
1461
+ },
1462
+ {
1463
+ "name": "stderr",
1464
+ "output_type": "stream",
1465
+ "text": [
1466
+ "`torch_dtype` is deprecated! Use `dtype` instead!\n",
1467
+ "Loading checkpoint shards: 100%|██████████| 9/9 [02:42<00:00, 18.06s/it]\n"
1468
+ ]
1469
+ },
1470
+ {
1471
+ "name": "stdout",
1472
+ "output_type": "stream",
1473
+ "text": [
1474
+ "✅ Model loaded\n",
1475
+ "VRAM used: 21.40 GB\n",
1476
+ "\n",
1477
+ "Loading raw input CSV...\n",
1478
+ "Loaded 50861 rows from input file\n",
1479
+ "Found existing annotations, merging...\n"
1480
+ ]
1481
+ },
1482
+ {
1483
+ "name": "stderr",
1484
+ "output_type": "stream",
1485
+ "text": [
1486
+ "/tmp/ipykernel_3104208/1997558719.py:113: DtypeWarning: Columns (52,53,54,55,56) have mixed types. Specify dtype option on import or set low_memory=False.\n",
1487
+ " existing_df = pd.read_csv(output_file)\n"
1488
+ ]
1489
+ },
1490
+ {
1491
+ "name": "stdout",
1492
+ "output_type": "stream",
1493
+ "text": [
1494
+ "Existing annotations has 50861 rows\n",
1495
+ "Merged annotations, continuing with 50861 total rows\n",
1496
+ "✅ Loaded professions.csv\n",
1497
+ "✅ Loaded profession mapping with 9 categories\n",
1498
+ "Loaded 50861 rows\n",
1499
+ "\n",
1500
+ "Profession categories (9):\n",
1501
+ " - actor\n",
1502
+ " - adult performer\n",
1503
+ " - singer/musician\n",
1504
+ " - model\n",
1505
+ " - online personality\n",
1506
+ " - public figure\n",
1507
+ " - voice actor/ASMR\n",
1508
+ " - sports professional\n",
1509
+ " - tv personality\n",
1510
+ "\n",
1511
+ "Creating prompts...\n",
1512
+ "✅ Prompts created\n",
1513
+ "Resuming from index 8810\n"
1514
+ ]
1515
+ },
1516
+ {
1517
+ "name": "stderr",
1518
+ "output_type": "stream",
1519
+ "text": [
1520
+ "Mistral Local: 0%| | 0/42051 [00:00<?, ?it/s]/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py:181: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization\n",
1521
+ " warnings.warn(f\"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization\")\n",
1522
+ "Mistral Local: 0%| | 7/42051 [00:57<93:01:03, 7.96s/it] "
1523
+ ]
1524
+ }
1525
+ ],
1526
+ "source": [
1527
+ "import pandas as pd\n",
1528
+ "import json\n",
1529
+ "import time\n",
1530
+ "import re\n",
1531
+ "from pathlib import Path\n",
1532
+ "from tqdm import tqdm\n",
1533
+ "import torch\n",
1534
+ "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
1535
+ "\n",
1536
+ "current_dir = Path.cwd()\n",
1537
+ "input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
1538
+ "professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
1539
+ "professions_mapped_file = current_dir.parent / \"misc/lists/professions_mapped.csv\"\n",
1540
+ "# === PROCESS DATA ===\n",
1541
+ "\n",
1542
+ "\n",
1543
+ "# === CONFIGURATION ===\n",
1544
+ "TEST_MODE = False\n",
1545
+ "TEST_SIZE = 100\n",
1546
+ "MAX_ROWS = 50862\n",
1547
+ "SAVE_INTERVAL = 10\n",
1548
+ "\n",
1549
+ "output_file = current_dir.parent / f\"data/CSV/mistral24_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
1550
+ "index_file = current_dir.parent / \"misc/query_indicies/mistral24_local_query_index.txt\"\n",
1551
+ "\n",
1552
+ "\n",
1553
+ "# Model settings\n",
1554
+ "#MODEL_NAME = \"mistralai/Mistral-Small-3.1-24B-Instruct-2503\"\n",
1555
+ "MODEL_NAME = \"mistralai/Mistral-Small-Instruct-2409\"\n",
1556
+ "#MODEL_NAME = \"mistralai/Mistral-7B-Instruct-v0.3\"\n",
1557
+ "CACHE_DIR = current_dir.parent / \"data/models\"\n",
1558
+ "CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
1559
+ "\n",
1560
+ "# Define the SPECIFIC profession categories\n",
1561
+ "PROFESSION_CATEGORIES = [\n",
1562
+ " \"actor\",\n",
1563
+ " \"adult performer\",\n",
1564
+ " \"singer/musician\",\n",
1565
+ " \"model\",\n",
1566
+ " \"online personality\",\n",
1567
+ " \"public figure\",\n",
1568
+ " \"voice actor/ASMR\",\n",
1569
+ " \"sports professional\",\n",
1570
+ " \"tv personality\"\n",
1571
+ "]\n",
1572
+ "\n",
1573
+ "# === LOAD MODEL ===\n",
1574
+ "print(f\"Loading model: {MODEL_NAME}\")\n",
1575
+ "print(f\"Cache directory: {CACHE_DIR}\")\n",
1576
+ "print(f\"This may take a while on first run (~65GB download)...\\n\")\n",
1577
+ "\n",
1578
+ "# Check GPU availability\n",
1579
+ "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
1580
+ "print(f\"Device: {device}\")\n",
1581
+ "\n",
1582
+ "if device == \"cpu\":\n",
1583
+ " print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
1584
+ " print(\" Consider using a GPU or reducing model size.\")\n",
1585
+ "\n",
1586
+ "# Load tokenizer\n",
1587
+ "print(\"Loading tokenizer...\")\n",
1588
+ "try:\n",
1589
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
1590
+ " MODEL_NAME,\n",
1591
+ " cache_dir=str(CACHE_DIR),\n",
1592
+ " use_fast=True\n",
1593
+ " )\n",
1594
+ "except Exception as e:\n",
1595
+ " print(f\"Failed with use_fast=True, trying use_fast=False...\")\n",
1596
+ " tokenizer = AutoTokenizer.from_pretrained(\n",
1597
+ " MODEL_NAME,\n",
1598
+ " cache_dir=str(CACHE_DIR),\n",
1599
+ " use_fast=False\n",
1600
+ " )\n",
1601
+ "\n",
1602
+ "# Ensure pad token is set\n",
1603
+ "if tokenizer.pad_token is None:\n",
1604
+ " tokenizer.pad_token = tokenizer.eos_token\n",
1605
+ "\n",
1606
+ "print(\"✅ Tokenizer loaded\")\n",
1607
+ "\n",
1608
+ "quantization_config = BitsAndBytesConfig(\n",
1609
+ " load_in_8bit=True\n",
1610
+ ")\n",
1611
+ "\n",
1612
+ "\n",
1613
+ "# Load model with optimizations\n",
1614
+ "print(\"Loading model (this may take several minutes)...\")\n",
1615
+ "model = AutoModelForCausalLM.from_pretrained(\n",
1616
+ " MODEL_NAME,\n",
1617
+ " cache_dir=str(CACHE_DIR),\n",
1618
+ " torch_dtype=torch.bfloat16,\n",
1619
+ " quantization_config=quantization_config,\n",
1620
+ " device_map=\"auto\",\n",
1621
+ " trust_remote_code=False\n",
1622
+ ")\n",
1623
+ "model.eval()\n",
1624
+ "print(\"✅ Model loaded\")\n",
1625
+ "\n",
1626
+ "# Check VRAM usage\n",
1627
+ "if torch.cuda.is_available():\n",
1628
+ " vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
1629
+ " print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
1630
+ "\n",
1631
+ "# === LOAD DATA ===\n",
1632
+ "print(\"Loading raw input CSV...\")\n",
1633
+ "df = pd.read_csv(input_file) # ALWAYS load the full input\n",
1634
+ "print(f\"Loaded {len(df)} rows from input file\")\n",
1635
+ "\n",
1636
+ "# If we have previous annotations, merge them\n",
1637
+ "if output_file.exists():\n",
1638
+ " print(\"Found existing annotations, merging...\")\n",
1639
+ " existing_df = pd.read_csv(output_file)\n",
1640
+ " print(f\"Existing annotations has {len(existing_df)} rows\")\n",
1641
+ " \n",
1642
+ " # Update df with existing annotations\n",
1643
+ " # Only update the columns that were annotated\n",
1644
+ " annotation_cols = ['full_name', 'aliases', 'gender', 'profession_llm', 'country']\n",
1645
+ " for col in annotation_cols:\n",
1646
+ " if col in existing_df.columns:\n",
1647
+ " df[col] = existing_df[col][:len(df)] # Make sure we don't exceed df length\n",
1648
+ " \n",
1649
+ " print(f\"Merged annotations, continuing with {len(df)} total rows\")\n",
1650
+ "\n",
1651
+ "\n",
1652
+ "# Try to load profession mapping files\n",
1653
+ "try:\n",
1654
+ " professions_df = pd.read_csv(professions_file)\n",
1655
+ " print(f\"✅ Loaded professions.csv\")\n",
1656
+ "except:\n",
1657
+ " print(\"⚠️ Warning: professions.csv not found\")\n",
1658
+ "\n",
1659
+ "try:\n",
1660
+ " prof_mapped_df = pd.read_csv(professions_mapped_file)\n",
1661
+ " print(f\"✅ Loaded profession mapping with {len(prof_mapped_df)} categories\")\n",
1662
+ "except:\n",
1663
+ " print(\"⚠️ Warning: professions_mapped.csv not found, using default categories\")\n",
1664
+ "\n",
1665
+ "profession_str = \", \".join(PROFESSION_CATEGORIES)\n",
1666
+ "\n",
1667
+ "print(f\"Loaded {len(df)} rows\")\n",
1668
+ "print(f\"\\nProfession categories ({len(PROFESSION_CATEGORIES)}):\")\n",
1669
+ "for cat in PROFESSION_CATEGORIES:\n",
1670
+ " print(f\" - {cat}\")\n",
1671
+ "\n",
1672
+ "if TEST_MODE:\n",
1673
+ " print(f\"\\nRunning in TEST MODE with {TEST_SIZE} samples\")\n",
1674
+ " df = df.head(TEST_SIZE).copy()\n",
1675
+ "elif MAX_ROWS:\n",
1676
+ " df = df.head(MAX_ROWS).copy()\n",
1677
+ "\n",
1678
+ "# === CREATE PROMPTS (DEEPSEEK STYLE) ===\n",
1679
+ "def create_prompt(row):\n",
1680
+ " \"\"\"Create prompt for Mistral annotation with specific profession categories.\"\"\"\n",
1681
+ " name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
1682
+ " \n",
1683
+ " # Gather hints\n",
1684
+ " hints = []\n",
1685
+ " if pd.notna(row.get('likely_profession')):\n",
1686
+ " hints.append(str(row['likely_profession']))\n",
1687
+ " if pd.notna(row.get('likely_nationality')):\n",
1688
+ " hints.append(str(row['likely_nationality']))\n",
1689
+ " if pd.notna(row.get('likely_country')):\n",
1690
+ " hints.append(str(row['likely_country']))\n",
1691
+ " \n",
1692
+ " # Add tags if we don't have enough hints\n",
1693
+ " if len(hints) < 3:\n",
1694
+ " for i in range(1, 8):\n",
1695
+ " tag_col = f'tag_{i}'\n",
1696
+ " if tag_col in row and pd.notna(row[tag_col]):\n",
1697
+ " tag_val = str(row[tag_col])\n",
1698
+ " if tag_val not in hints:\n",
1699
+ " hints.append(tag_val)\n",
1700
+ " if len(hints) >= 5:\n",
1701
+ " break\n",
1702
+ " \n",
1703
+ " hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
1704
+ " \n",
1705
+ " return f\"\"\"Given '{name}' ({hint_text}), provide:\n",
1706
+ "1. Full legal name (Western order if non-latin script)\n",
1707
+ "2. Any stage names/aliases (comma separated)\n",
1708
+ "3. Gender (Male/Female/Other/Unknown)\n",
1709
+ "4. Top 3 most likely professions from ONLY these categories:\n",
1710
+ " - actor\n",
1711
+ " - adult performer\n",
1712
+ " - singer/musician\n",
1713
+ " - model\n",
1714
+ " - online personality (includes streamers, cosplayers, influencers)\n",
1715
+ " - public figure (includes politicians, activists, journalists, authors)\n",
1716
+ " - voice actor/ASMR\n",
1717
+ " - sports professional\n",
1718
+ " - tv personality (includes hosts, presenters, reality TV)\n",
1719
+ "\n",
1720
+ "5. Primary country associated\n",
1721
+ "\n",
1722
+ "IMPORTANT:\n",
1723
+ "- Choose professions ONLY from the 9 categories above\n",
1724
+ "- Provide up to 3 professions, comma-separated, ordered by relevance\n",
1725
+ "- Be SPECIFIC: choose the most accurate category for each role\n",
1726
+ "- \"online personality\" includes: streamers, cosplayers, YouTubers, influencers, content creators\n",
1727
+ "- Use 'Unknown' when uncertain or for fictional characters/places\n",
1728
+ "- For multi-role people, list all relevant categories (e.g., \"actor, singer/musician, online personality\")\n",
1729
+ "- For country respond with one word only, for example China or Columbia\n",
1730
+ "- actress = actor\n",
1731
+ "\n",
1732
+ "Respond with exactly 5 numbered lines.\"\"\"\n",
1733
+ "\n",
1734
+ "# Create prompts\n",
1735
+ "print(\"\\nCreating prompts...\")\n",
1736
+ "df['prompt'] = df.apply(create_prompt, axis=1)\n",
1737
+ "print(\"✅ Prompts created\")\n",
1738
+ "\n",
1739
+ "# === QUERY MISTRAL LOCAL ===\n",
1740
+ "def query_mistral_local(prompt: str) -> str:\n",
1741
+ " \"\"\"Query Mistral locally via transformers.\"\"\"\n",
1742
+ " try:\n",
1743
+ " # Format as chat message for Mistral\n",
1744
+ " messages = [\n",
1745
+ " {\"role\": \"system\", \"content\": \"You are an assistant that extracts key data on a person based on the name. Respond with exactly 5 numbered lines. For professions, choose ONLY from these categories: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality.\"},\n",
1746
+ " {\"role\": \"user\", \"content\": prompt}\n",
1747
+ " ]\n",
1748
+ " \n",
1749
+ " # Tokenize\n",
1750
+ " if hasattr(tokenizer, 'apply_chat_template'):\n",
1751
+ " text = tokenizer.apply_chat_template(\n",
1752
+ " messages,\n",
1753
+ " tokenize=False,\n",
1754
+ " add_generation_prompt=True\n",
1755
+ " )\n",
1756
+ " else:\n",
1757
+ " # Fallback for older tokenizers\n",
1758
+ " text = f\"[INST] {prompt} [/INST]\"\n",
1759
+ " \n",
1760
+ " inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
1761
+ " \n",
1762
+ " # Generate\n",
1763
+ " with torch.no_grad():\n",
1764
+ " outputs = model.generate(\n",
1765
+ " **inputs,\n",
1766
+ " max_new_tokens=512,\n",
1767
+ " temperature=0.05,\n",
1768
+ " do_sample=True,\n",
1769
+ " top_p=0.8,\n",
1770
+ " pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id else tokenizer.eos_token_id\n",
1771
+ " )\n",
1772
+ " \n",
1773
+ " # Decode\n",
1774
+ " generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
1775
+ " response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
1776
+ " \n",
1777
+ " return response.strip()\n",
1778
+ " \n",
1779
+ " except Exception as e:\n",
1780
+ " print(f\"Generation error: {e}\")\n",
1781
+ " return None\n",
1782
+ "\n",
1783
+ "# === PARSE RESPONSE (DEEPSEEK STYLE) ===\n",
1784
+ "def parse_response(response):\n",
1785
+ " \"\"\"Parse Mistral response into structured fields.\"\"\"\n",
1786
+ " if not response:\n",
1787
+ " return {\n",
1788
+ " 'full_name': 'Unknown',\n",
1789
+ " 'aliases': 'Unknown',\n",
1790
+ " 'gender': 'Unknown',\n",
1791
+ " 'profession_llm': 'Unknown',\n",
1792
+ " 'country': 'Unknown'\n",
1793
+ " }\n",
1794
+ " \n",
1795
+ " # Split into lines and clean\n",
1796
+ " lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
1797
+ " \n",
1798
+ " # Initialize with Unknown values\n",
1799
+ " fields = {\n",
1800
+ " 'full_name': 'Unknown',\n",
1801
+ " 'aliases': 'Unknown',\n",
1802
+ " 'gender': 'Unknown',\n",
1803
+ " 'profession_llm': 'Unknown',\n",
1804
+ " 'country': 'Unknown'\n",
1805
+ " }\n",
1806
+ " \n",
1807
+ " # Extract information from each numbered line\n",
1808
+ " for line in lines:\n",
1809
+ " if line.startswith('1.'):\n",
1810
+ " fields['full_name'] = line[2:].strip()\n",
1811
+ " elif line.startswith('2.'):\n",
1812
+ " fields['aliases'] = line[2:].strip()\n",
1813
+ " elif line.startswith('3.'):\n",
1814
+ " fields['gender'] = line[2:].strip()\n",
1815
+ " elif line.startswith('4.'):\n",
1816
+ " fields['profession_llm'] = line[2:].strip()\n",
1817
+ " elif line.startswith('5.'):\n",
1818
+ " fields['country'] = line[2:].strip()\n",
1819
+ " \n",
1820
+ " return fields\n",
1821
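+ "\n",
+ "# Illustrative example (assumed response shape, not real model output):\n",
+ "#   parse_response(\"1. Jane Doe\\n2. JD\\n3. Female\\n4. actor, model\\n5. Canada\")\n",
+ "#   -> {'full_name': 'Jane Doe', 'aliases': 'JD', 'gender': 'Female',\n",
+ "#       'profession_llm': 'actor, model', 'country': 'Canada'}\n",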
+ "\n",
1822
+ "# === PROCESS DATA ===\n",
1823
+ "output_file = current_dir.parent / f\"data/CSV/mistral24_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
1824
+ "index_file = current_dir.parent / \"misc/query_indicies/mistral24_local_query_index.txt\"\n",
1825
+ "\n",
1826
+ "index_file.parent.mkdir(parents=True, exist_ok=True)\n",
1827
+ "\n",
1828
+ "# Load index\n",
1829
+ "current_index = 0\n",
1830
+ "if index_file.exists():\n",
1831
+ " try:\n",
1832
+ " current_index = int(index_file.read_text().strip())\n",
1833
+ "    except (ValueError, OSError):\n",
1834
+ " current_index = 0\n",
1835
+ "\n",
1836
+ "print(f\"Resuming from index {current_index}\")\n",
1837
+ "\n",
1838
+ "start_time = time.time()\n",
1839
+ "\n",
1840
+ "for i in tqdm(range(current_index, len(df)), desc=\"Mistral Local\"):\n",
1841
+ "\n",
1842
+ " prompt = df.at[i, \"prompt\"]\n",
1843
+ "\n",
1844
+ " # -------- MODEL QUERY WITH RETRIES --------\n",
1845
+ " response = None\n",
1846
+ " for attempt in range(3):\n",
1847
+ " response = query_mistral_local(prompt)\n",
1848
+ " \n",
1849
+ " # Valid response?\n",
1850
+ " if response and len(response.strip()) > 10:\n",
1851
+ " break\n",
1852
+ " \n",
1853
+ " print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
1854
+ " time.sleep(0.5)\n",
1855
+ "\n",
1856
+ " # If still invalid → DO NOT overwrite previous data\n",
1857
+ " if not response or len(response.strip()) <= 10:\n",
1858
+ " print(f\"❌ Row {i}: failed after retries, not writing, not advancing index\")\n",
1859
+ " continue\n",
1860
+ "\n",
1861
+ " parsed = parse_response(response)\n",
1862
+ "\n",
1863
+ " # Additional safety: skip rows that parsed as all 'Unknown'\n",
1864
+ " if all(v == \"Unknown\" for v in parsed.values()):\n",
1865
+ " print(f\"❌ Row {i}: parsed as all Unknown (likely model crash); skipping.\")\n",
1866
+ " continue\n",
1867
+ "\n",
1868
+ " # -------- WRITE PARSED FIELDS SAFELY --------\n",
1869
+ " for key, value in parsed.items():\n",
1870
+ " df.at[i, key] = value\n",
1871
+ "\n",
1872
+ " # Advance progress ONLY after successful write\n",
1873
+ " current_index = i + 1\n",
1874
+ "\n",
1875
+ " # -------- GPU MEMORY CLEANUP --------\n",
1876
+ " if torch.cuda.is_available():\n",
1877
+ " torch.cuda.empty_cache()\n",
1878
+ " torch.cuda.synchronize()\n",
1879
+ "\n",
1880
+ "    # -------- PERIODIC SAVE (same scheme as the DeepSeek notebook) --------\n",
1881
+ " if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
1882
+ " df.to_csv(output_file, index=False)\n",
1883
+ " with open(index_file, \"w\") as f:\n",
1884
+ " f.write(str(current_index))\n",
1885
+ " print(f\"💾 Progress saved after row {i+1}\")\n",
1886
+ "\n",
1887
+ "# Final save\n",
1888
+ "df.to_csv(output_file, index=False)\n",
1889
+ "index_file.write_text(str(current_index))\n",
1890
+ "print(\"✅ Finished full dataset.\")\n"
1891
+ ]
1892
+ },
1893
+ {
1894
+ "cell_type": "code",
1895
+ "execution_count": null,
1896
+ "id": "d7212e75-0ff6-45a0-8695-c4a3d3e02818",
1897
+ "metadata": {},
1898
+ "outputs": [],
1899
+ "source": [
1900
+ "import transformers\n",
1901
+ "print(f\"Transformers version: {transformers.__version__}\")\n",
1902
+ "\n",
1903
+ "# Check if Mistral3 is available\n",
1904
+ "try:\n",
1905
+ " from transformers import Mistral3ForCausalLM\n",
1906
+ " print(\"✅ Mistral3 is available\")\n",
1907
+ "except ImportError:\n",
1908
+ " print(\"❌ Mistral3 not available in this transformers version\")"
1909
+ ]
1910
+ },
1911
+ {
1912
+ "cell_type": "code",
1913
+ "execution_count": null,
1914
+ "id": "a6ab032e-246e-4c4e-9776-ff0bfbf6fd9c",
1915
+ "metadata": {},
1916
+ "outputs": [],
1917
+ "source": []
1918
+ }
1919
+ ],
1920
+ "metadata": {
1921
+ "kernelspec": {
1922
+ "display_name": "pm-paper",
1923
+ "language": "python",
1924
+ "name": "pm-paper"
1925
+ },
1926
+ "language_info": {
1927
+ "codemirror_mode": {
1928
+ "name": "ipython",
1929
+ "version": 3
1930
+ },
1931
+ "file_extension": ".py",
1932
+ "mimetype": "text/x-python",
1933
+ "name": "python",
1934
+ "nbconvert_exporter": "python",
1935
+ "pygments_lexer": "ipython3",
1936
+ "version": "3.11.13"
1937
+ }
1938
+ },
1939
+ "nbformat": 4,
1940
+ "nbformat_minor": 5
1941
+ }
jupyter_notebooks/Section_2-3-4_Figure_8_Step_2_response_comparison_and_consensus_extraction.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
jupyter_notebooks/Section_2-3-4__Figure_8a_sunburst_gender.ipynb ADDED
@@ -0,0 +1,129 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Prepare *.json for Figure 8a"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": 1,
13
+ "metadata": {},
14
+ "outputs": [
15
+ {
16
+ "name": "stdout",
17
+ "output_type": "stream",
18
+ "text": [
19
+ "✓ Saved 8a.json\n"
20
+ ]
21
+ }
22
+ ],
23
+ "source": [
24
+ "import pandas as pd\n",
25
+ "from collections import defaultdict\n",
26
+ "import json\n",
27
+ "from pathlib import Path \n",
28
+ "\n",
29
+ "current_dir = Path.cwd()\n",
30
+ "sunburst_json = current_dir.parent / \"public/json/8a.json\"\n",
31
+ "\n",
32
+ "# Load consensus CSV\n",
33
+ "consensus_file = current_dir.parent / \"data/CSV/analyzed_llm_agreement_consensus.csv\"\n",
34
+ "df = pd.read_csv(consensus_file)\n",
35
+ "\n",
36
+ "# ---- Normalize Gender (group Non-binary and Unknown into 'Other') ----\n",
37
+ "def normalize_gender(g):\n",
38
+ " g = str(g).strip().lower()\n",
39
+ " if g in [\"female\", \"woman\", \"female (group)\", \"female (transgender)\", \"female (virtual persona)\", \"female (group members)\"]:\n",
40
+ " return \"Female\"\n",
41
+ " elif g in [\"male\", \"male (android)\", \"male (character)\"]:\n",
42
+ " return \"Male\"\n",
43
+ " else:\n",
44
+ " return \"Other\"\n",
45
+ "\n",
46
+ "df['gender_normalized'] = df['consensus_gender'].apply(normalize_gender)\n",
47
+ "\n",
48
+ "# ---- Step 1: Limit to top 10 countries ----\n",
49
+ "df['country_cleaned'] = df['consensus_country'].apply(lambda x: x if x not in ['Unknown', '', None] else 'Other')\n",
50
+ "top_countries = df['country_cleaned'].value_counts().nlargest(12).index.tolist()\n",
51
+ "df['country_limited'] = df['country_cleaned'].apply(lambda x: x if x in top_countries else 'Other')\n",
52
+ "\n",
53
+ "# ---- Step 2: Normalize and limit professions ----\n",
54
+ "valid_categories = [\n",
55
+ " \"actor\", \"adult performer\", \"singer/musician\", \"model\",\n",
56
+ " \"online personality\", \"tv personality\", \"voice actor/asmr\", \"public figure\", \"sports professional\"\n",
57
+ "]\n",
58
+ "\n",
59
+ "def remap_profession(profession):\n",
60
+ " profession_lower = str(profession).strip().lower()\n",
61
+ " if profession_lower == 'unknown' or profession_lower not in valid_categories:\n",
62
+ " return 'Other'\n",
63
+ " elif profession_lower == 'fictional character':\n",
64
+ " return 'actor'\n",
65
+ " elif profession_lower in ['voice actor', 'voice actor/asmr']:\n",
66
+ " return 'voice actor/ASMR'\n",
67
+ " return profession_lower\n",
68
+ "\n",
69
+ "df['profession_limited'] = df['consensus_primary_profession'].apply(remap_profession)\n",
70
+ "\n",
71
+ "# ---- Step 3: Group by gender and profession ----\n",
72
+ "sunburst_data = df.groupby(['gender_normalized', 'profession_limited']).size().reset_index(name='count')\n",
73
+ "\n",
74
+ "# ---- Step 4: Create nested structure for D3.js ----\n",
75
+ "sunburst_dict = {\"name\": \"root\", \"children\": []}\n",
76
+ "gender_map = defaultdict(list)\n",
77
+ "\n",
78
+ "for _, row in sunburst_data.iterrows():\n",
79
+ " gender = row['gender_normalized']\n",
80
+ " profession = row['profession_limited']\n",
81
+ " count = int(row['count'])\n",
82
+ " gender_map[gender].append({\"name\": profession, \"value\": count})\n",
83
+ "\n",
84
+ "# Sort 'Other' professions to appear last\n",
85
+ "for gender, professions in gender_map.items():\n",
86
+ " professions_sorted = sorted(professions, key=lambda d: (d[\"name\"] == \"Other\", d[\"name\"]))\n",
87
+ " gender_map[gender] = professions_sorted\n",
88
+ "\n",
89
+ "# Build the final nested structure\n",
90
+ "for gender, professions in gender_map.items():\n",
91
+ " sunburst_dict[\"children\"].append({\"name\": gender, \"children\": professions})\n",
92
+ "\n",
93
+ "# ---- Step 5: Save to a JSON file ----\n",
94
+ "with open(sunburst_json, \"w\", encoding='utf-8') as f:\n",
95
+ " json.dump(sunburst_dict, f, ensure_ascii=False, indent=2)\n",
96
+ "\n",
97
+ "print(\"✓ Saved 8a.json\")\n"
98
+ ]
99
+ },
100
+ {
101
+ "cell_type": "code",
102
+ "execution_count": null,
103
+ "metadata": {},
104
+ "outputs": [],
105
+ "source": []
106
+ }
107
+ ],
108
+ "metadata": {
109
+ "kernelspec": {
110
+ "display_name": "latm",
111
+ "language": "python",
112
+ "name": "python3"
113
+ },
114
+ "language_info": {
115
+ "codemirror_mode": {
116
+ "name": "ipython",
117
+ "version": 3
118
+ },
119
+ "file_extension": ".py",
120
+ "mimetype": "text/x-python",
121
+ "name": "python",
122
+ "nbconvert_exporter": "python",
123
+ "pygments_lexer": "ipython3",
124
+ "version": "3.10.15"
125
+ }
126
+ },
127
+ "nbformat": 4,
128
+ "nbformat_minor": 2
129
+ }
jupyter_notebooks/Section_2-3-4__Figure_8b_sunburst_profession.ipynb ADDED
@@ -0,0 +1,332 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Prepare *.json for Figure 8b"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": 9,
13
+ "metadata": {},
14
+ "outputs": [
15
+ {
16
+ "name": "stdout",
17
+ "output_type": "stream",
18
+ "text": [
19
+ "✓ Saved sunburst_countries_A.json\n"
20
+ ]
21
+ }
22
+ ],
23
+ "source": [
24
+ "import pandas as pd\n",
25
+ "from collections import defaultdict\n",
26
+ "import json\n",
27
+ "from pathlib import Path\n",
28
+ "\n",
29
+ "current_dir = Path.cwd()\n",
30
+ "\n",
31
+ "sunburst_path = current_dir.parent / \"public/json/sunburst_countries_A.json\"\n",
32
+ "\n",
33
+ "# Load consensus CSV\n",
34
+ "consensus_file = current_dir.parent / \"data/CSV/analyzed_llm_agreement_consensus_va.csv\"\n",
35
+ "df = pd.read_csv(consensus_file)\n",
36
+ "\n",
37
+ "# ============================================================\n",
38
+ "# 1. NORMALIZE COUNTRIES\n",
39
+ "# ============================================================\n",
40
+ "\n",
41
+ "def normalize_country(x: str):\n",
42
+ " if not isinstance(x, str) or x.strip() == \"\" or x.lower() == \"unknown\":\n",
43
+ " return \"Other\"\n",
44
+ " x = x.strip()\n",
45
+ "\n",
46
+ " # Ensure matching with your HTML manualOrder\n",
47
+ " replacements = {\n",
48
+ " \"USA\": \"United States\",\n",
49
+ " \"US\": \"United States\",\n",
50
+ " \"U.S.\": \"United States\",\n",
51
+ " \"UK\": \"United Kingdom\",\n",
52
+ " \"U.K.\": \"United Kingdom\"\n",
53
+ " }\n",
54
+ " return replacements.get(x, x)\n",
55
+ "\n",
56
+ "df[\"country_clean\"] = df[\"consensus_country\"].apply(normalize_country)\n",
57
+ "\n",
58
+ "# Limit to the top 25 countries, merge the rest into \"Other\"\n",
59
+ "top_countries = df[\"country_clean\"].value_counts().nlargest(25).index.tolist()\n",
60
+ "\n",
61
+ "df[\"country_limited\"] = df[\"country_clean\"].apply(\n",
62
+ " lambda x: x if x in top_countries else \"Other\"\n",
63
+ ")\n",
64
+ "\n",
65
+ "# ============================================================\n",
66
+ "# 2. NORMALIZE PROFESSIONS\n",
67
+ "# ============================================================\n",
68
+ "\n",
69
+ "def normalize_profession(x: str):\n",
70
+ " if not isinstance(x, str) or x.strip() == \"\" or x.lower() == \"unknown\":\n",
71
+ " return \"Other\"\n",
72
+ " x = x.strip().lower()\n",
73
+ " \n",
74
+ " mapping = {\n",
75
+ " \"actor\": \"Actor\",\n",
76
+ " \"model\": \"Model\",\n",
77
+ " \"adult performer\": \"Adult Performer\",\n",
78
+ " \"singer/musician\": \"Singer, Musician\",\n",
79
+ " \"online personality\": \"Online Personality\",\n",
80
+ " \"sports professional\": \"Sports Professional\",\n",
81
+ " \"voice actor/asmr\": \"Voice Actor\", # ← fixed key\n",
82
+ " \"public figure\": \"Public Figure\", # ← now its own category\n",
83
+ " \"tv personality\": \"Other\",\n",
84
+ " }\n",
85
+ " return mapping.get(x, \"Other\")\n",
86
+ "\n",
87
+ "df[\"profession_clean\"] = df[\"consensus_primary_profession\"].apply(normalize_profession)\n",
88
+ "\n",
89
+ "\n",
90
+ "df[\"profession_limited\"] = df[\"profession_clean\"]\n",
91
+ "\n",
92
+ "\n",
93
+ "# ============================================================\n",
94
+ "# 3. GROUP INTO SUNBURST STRUCTURE\n",
95
+ "# ============================================================\n",
96
+ "\n",
97
+ "sunburst_data = (\n",
98
+ " df.groupby([\"country_limited\", \"profession_limited\"])\n",
99
+ " .size()\n",
100
+ " .reset_index(name=\"count\")\n",
101
+ ")\n",
102
+ "\n",
103
+ "sunburst_dict = {\"name\": \"root\", \"children\": []}\n",
104
+ "country_map = defaultdict(list)\n",
105
+ "\n",
106
+ "for _, row in sunburst_data.iterrows():\n",
107
+ " c = row[\"country_limited\"]\n",
108
+ " p = row[\"profession_limited\"]\n",
109
+ " v = int(row[\"count\"])\n",
110
+ "\n",
111
+ " country_map[c].append({\"name\": p, \"value\": v})\n",
112
+ "\n",
113
+ "# Sort professions inside each country so \"Other\" is last\n",
114
+ "for c, profs in country_map.items():\n",
115
+ " profs_sorted = sorted(profs, key=lambda d: (d[\"name\"] == \"Other\", d[\"name\"]))\n",
116
+ " country_map[c] = profs_sorted\n",
117
+ "\n",
118
+ "# Calculate total datapoints per country and sort\n",
119
+ "country_totals = []\n",
120
+ "for c, profs in country_map.items():\n",
121
+ " total = sum(p[\"value\"] for p in profs)\n",
122
+ " country_totals.append((c, total, profs))\n",
123
+ "\n",
124
+ "# Sort by total (descending), but put \"Other\" last\n",
125
+ "country_totals.sort(key=lambda x: (x[0] == \"Other\", -x[1]))\n",
126
+ "\n",
127
+ "# Build final JSON with sorted countries\n",
128
+ "for c, total, profs in country_totals:\n",
129
+ " sunburst_dict[\"children\"].append({\"name\": c, \"children\": profs})\n",
130
+ "\n",
131
+ "# ============================================================\n",
132
+ "# 4. SAVE JSON\n",
133
+ "# ============================================================\n",
134
+ "\n",
135
+ "with open(sunburst_path, \"w\", encoding=\"utf-8\") as f:\n",
136
+ " json.dump(sunburst_dict, f, ensure_ascii=False, indent=2)\n",
137
+ "\n",
138
+ "print(\"✓ Saved sunburst_countries_A.json\")"
139
+ ]
140
+ },
141
+ {
142
+ "cell_type": "markdown",
143
+ "metadata": {},
144
+ "source": [
145
+ "# Version that only considers data up to December 31, 2024"
146
+ ]
147
+ },
148
+ {
149
+ "cell_type": "code",
150
+ "execution_count": 10,
151
+ "metadata": {},
152
+ "outputs": [
153
+ {
154
+ "name": "stdout",
155
+ "output_type": "stream",
156
+ "text": [
157
+ "✓ Filtered to 16059 records published on or before December 31, 2024\n",
158
+ "✓ Saved sunburst_countries_A.json (2024 data only)\n"
159
+ ]
160
+ }
161
+ ],
162
+ "source": [
163
+ "import pandas as pd\n",
164
+ "from collections import defaultdict\n",
165
+ "import json\n",
166
+ "from pathlib import Path\n",
167
+ "\n",
168
+ "current_dir = Path.cwd()\n",
169
+ "sunburst_path = current_dir.parent / \"public/json/sunburst_countries_A.json\"\n",
170
+ "\n",
171
+ "# Load consensus CSV\n",
172
+ "consensus_file = current_dir.parent / \"data/CSV/analyzed_llm_agreement_consensus_va.csv\"\n",
173
+ "df = pd.read_csv(consensus_file)\n",
174
+ "\n",
175
+ "# ============================================================\n",
176
+ "# FILTER DATA UP TO DECEMBER 31, 2024\n",
177
+ "# ============================================================\n",
178
+ "# Convert publishedAt to datetime\n",
179
+ "df[\"publishedAt\"] = pd.to_datetime(df[\"publishedAt\"], errors=\"coerce\", utc=True)\n",
180
+ "\n",
181
+ "# Filter to only include data up to December 31, 2024\n",
182
+ "# Make cutoff_date timezone-aware (UTC) to match publishedAt\n",
183
+ "cutoff_date = pd.Timestamp(\"2024-12-31 23:59:59\", tz=\"UTC\")\n",
184
+ "df = df[df[\"publishedAt\"] <= cutoff_date]\n",
185
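+ "# (pandas raises a TypeError when comparing tz-naive and tz-aware datetimes,\n",
+ "#  hence the explicit tz=\"UTC\" on the cutoff above)\n",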
+ "\n",
186
+ "print(f\"✓ Filtered to {len(df)} records published on or before December 31, 2024\")\n",
187
+ "\n",
188
+ "# ============================================================\n",
189
+ "# 1. NORMALIZE COUNTRIES\n",
190
+ "# ============================================================\n",
191
+ "def normalize_country(x: str):\n",
192
+ " if not isinstance(x, str) or x.strip() == \"\" or x.lower() == \"unknown\":\n",
193
+ " return \"Other\"\n",
194
+ " x = x.strip()\n",
195
+ " # Ensure matching with your HTML manualOrder\n",
196
+ " replacements = {\n",
197
+ " \"USA\": \"United States\",\n",
198
+ " \"US\": \"United States\",\n",
199
+ " \"U.S.\": \"United States\",\n",
200
+ " \"UK\": \"United Kingdom\",\n",
201
+ " \"U.K.\": \"United Kingdom\"\n",
202
+ " }\n",
203
+ " return replacements.get(x, x)\n",
204
+ "\n",
205
+ "df[\"country_clean\"] = df[\"consensus_country\"].apply(normalize_country)\n",
206
+ "\n",
207
+ "# Limit to the top 25 countries, merge the rest into \"Other\"\n",
208
+ "top_countries = df[\"country_clean\"].value_counts().nlargest(25).index.tolist()\n",
209
+ "df[\"country_limited\"] = df[\"country_clean\"].apply(\n",
210
+ " lambda x: x if x in top_countries else \"Other\"\n",
211
+ ")\n",
212
+ "\n",
213
+ "# ============================================================\n",
214
+ "# 2. NORMALIZE PROFESSIONS\n",
215
+ "# ============================================================\n",
216
+ "def normalize_profession(x: str):\n",
217
+ " if not isinstance(x, str) or x.strip() == \"\" or x.lower() == \"unknown\":\n",
218
+ " return \"Other\"\n",
219
+ " x = x.strip().lower()\n",
220
+ " mapping = {\n",
221
+ " \"actor\": \"Actor\",\n",
222
+ " \"model\": \"Model\",\n",
223
+ " \"adult performer\": \"Adult Performer\",\n",
224
+ " \"singer/musician\": \"Singer, Musician\",\n",
225
+ " \"online personality\": \"Online Personality\",\n",
226
+ " \"sports professional\": \"Sports Professional\",\n",
227
+ " \"voice actor/asmr\": \"Voice Actor\",\n",
228
+ " \"public figure\": \"Public Figure\",\n",
229
+ " \"tv personality\": \"Other\",\n",
230
+ " }\n",
231
+ " return mapping.get(x, \"Other\")\n",
232
+ "\n",
233
+ "df[\"profession_clean\"] = df[\"consensus_primary_profession\"].apply(normalize_profession)\n",
234
+ "\n",
235
+ "# Define top professions to keep\n",
236
+ "top_prof = [\n",
237
+ " \"Actor\",\n",
238
+ " \"Model\", \n",
239
+ " \"Adult Performer\",\n",
240
+ " \"Singer, Musician\",\n",
241
+ " \"Online Personality\",\n",
242
+ " \"Sports Professional\",\n",
243
+ " \"Voice Actor\",\n",
244
+ " \"Public Figure\",\n",
245
+ " \"Other\"\n",
246
+ "]\n",
247
+ "\n",
248
+ "# Re-limit professions, everything else → Other\n",
249
+ "df[\"profession_limited\"] = df[\"profession_clean\"].apply(\n",
250
+ " lambda x: x if x in top_prof else \"Other\"\n",
251
+ ")\n",
252
+ "\n",
253
+ "# ============================================================\n",
254
+ "# 3. GROUP INTO SUNBURST STRUCTURE\n",
255
+ "# ============================================================\n",
256
+ "sunburst_data = (\n",
257
+ " df.groupby([\"country_limited\", \"profession_limited\"])\n",
258
+ " .size()\n",
259
+ " .reset_index(name=\"count\")\n",
260
+ ")\n",
261
+ "\n",
262
+ "sunburst_dict = {\"name\": \"root\", \"children\": []}\n",
263
+ "country_map = defaultdict(list)\n",
264
+ "\n",
265
+ "for _, row in sunburst_data.iterrows():\n",
266
+ " c = row[\"country_limited\"]\n",
267
+ " p = row[\"profession_limited\"]\n",
268
+ " v = int(row[\"count\"])\n",
269
+ " country_map[c].append({\"name\": p, \"value\": v})\n",
270
+ "\n",
271
+ "# Sort professions inside each country so \"Other\" is last\n",
272
+ "for c, profs in country_map.items():\n",
273
+ " profs_sorted = sorted(profs, key=lambda d: (d[\"name\"] == \"Other\", d[\"name\"]))\n",
274
+ " country_map[c] = profs_sorted\n",
275
+ "\n",
276
+ "# Calculate total datapoints per country and sort\n",
277
+ "country_totals = []\n",
278
+ "for c, profs in country_map.items():\n",
279
+ " total = sum(p[\"value\"] for p in profs)\n",
280
+ " country_totals.append((c, total, profs))\n",
281
+ "\n",
282
+ "# Sort by total (descending), but put \"Other\" last\n",
283
+ "country_totals.sort(key=lambda x: (x[0] == \"Other\", -x[1]))\n",
284
+ "\n",
285
+ "# Build final JSON with sorted countries\n",
286
+ "for c, total, profs in country_totals:\n",
287
+ " sunburst_dict[\"children\"].append({\"name\": c, \"children\": profs})\n",
288
+ "\n",
289
+ "# ============================================================\n",
290
+ "# 4. SAVE JSON\n",
291
+ "# ============================================================\n",
292
+ "with open(sunburst_path, \"w\", encoding=\"utf-8\") as f:\n",
293
+ " json.dump(sunburst_dict, f, ensure_ascii=False, indent=2)\n",
294
+ "\n",
295
+ "print(\"✓ Saved sunburst_countries_A.json (2024 data only)\")"
296
+ ]
297
+ },
298
+ {
299
+ "cell_type": "markdown",
300
+ "metadata": {},
301
+ "source": [
302
+ "The resulting *.json is the input for Figure_8.html."
303
+ ]
304
+ },
305
+ {
306
+ "cell_type": "markdown",
307
+ "metadata": {},
308
+ "source": []
309
+ }
310
+ ],
311
+ "metadata": {
312
+ "kernelspec": {
313
+ "display_name": "latm",
314
+ "language": "python",
315
+ "name": "python3"
316
+ },
317
+ "language_info": {
318
+ "codemirror_mode": {
319
+ "name": "ipython",
320
+ "version": 3
321
+ },
322
+ "file_extension": ".py",
323
+ "mimetype": "text/x-python",
324
+ "name": "python",
325
+ "nbconvert_exporter": "python",
326
+ "pygments_lexer": "ipython3",
327
+ "version": "3.10.15"
328
+ }
329
+ },
330
+ "nbformat": 4,
331
+ "nbformat_minor": 4
332
+ }
jupyter_notebooks/Section_2-3_Figure_5_co-occurence_promotional_tags.ipynb ADDED
@@ -0,0 +1,314 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Create *.json for Figure 5 (Co-occurrence network of Tags)"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": 5,
13
+ "metadata": {},
14
+ "outputs": [
15
+ {
16
+ "name": "stdout",
17
+ "output_type": "stream",
18
+ "text": [
19
+ "Processing: america\n",
20
+ " ✅ Saved to /home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/public/json/tags_america.json\n"
21
+ ]
22
+ }
23
+ ],
24
+ "source": [
25
+ "import pandas as pd\n",
26
+ "from itertools import combinations\n",
27
+ "from collections import Counter, defaultdict\n",
28
+ "import json\n",
29
+ "import re\n",
30
+ "import os\n",
31
+ "\n",
32
+ "from pathlib import Path\n",
33
+ "current_dir = Path.cwd()\n",
34
+ "\n",
35
+ "# === CONFIG ===\n",
36
+ "file_path = current_dir.parent / \"data/CSV/Models/Civi_models.csv\"\n",
37
+ "output_dir = current_dir.parent / \"public/json/\"\n",
38
+ "#target_terms = [\"asian\", \"indian\", \"man\", \"woman\", \"german\", \"korean\", \"american\", \"russian\", \"style\", \"japanese\", \"chinese\"] # Add any tags you want to process\n",
39
+ "#target_terms = [\"character\", \"instagram\", \"youtuber\", \"actor\", \"actress\", \"celebrity\", \"vtuber\", \"kpop\"] # Add any tags you want to process\n",
40
+ "target_terms = [\"america\"] # Add any tags you want to process\n",
41
+ "min_connections = 1 # minimum number of link connections per node\n",
42
+ "\n",
43
+ "# === LOAD DATA ===\n",
44
+ "df = pd.read_csv(file_path)\n",
45
+ "tag_columns = [f\"tag_{i}\" for i in range(1, 8)]\n",
46
+ "df_tags = df[tag_columns]\n",
47
+ "\n",
48
+ "# === MAIN LOOP ===\n",
49
+ "for target_term in target_terms:\n",
50
+ " print(f\"Processing: {target_term}\")\n",
51
+ " \n",
52
+ " pattern = re.compile(rf'\\b{re.escape(target_term)}\\b', flags=re.IGNORECASE)\n",
53
+ " df_filtered = df_tags[df_tags.apply(\n",
54
+ " lambda row: row.astype(str).apply(lambda x: bool(pattern.search(x))).any(),\n",
55
+ " axis=1\n",
56
+ " )]\n",
57
+ "\n",
58
+ " # Skip if no data matches\n",
59
+ " if df_filtered.empty:\n",
60
+ " print(f\" ⚠️ No matches for '{target_term}', skipping.\")\n",
61
+ " continue\n",
62
+ "\n",
63
+ " # === COUNT INDIVIDUAL TAGS ===\n",
64
+ " all_tags = df_filtered.values.flatten()\n",
65
+ " all_tags = [tag for tag in all_tags if pd.notna(tag)]\n",
66
+ " tag_counts = Counter(all_tags)\n",
67
+ "\n",
68
+ " # === CO-OCCURRENCE ===\n",
69
+ " co_occurrences = defaultdict(int)\n",
70
+ " for tags in df_filtered.itertuples(index=False, name=None):\n",
71
+ " tags = [tag for tag in tags if pd.notna(tag)]\n",
72
+ " for tag1, tag2 in combinations(tags, 2):\n",
73
+ " co_occurrences[frozenset([tag1, tag2])] += 1\n",
74
+ "\n",
75
+ " edges = [(list(pair)[0], list(pair)[1], weight) for pair, weight in co_occurrences.items()]\n",
76
+ "\n",
77
+ " # === FILTER BY CONNECTIONS ===\n",
78
+ " connected_tags = Counter()\n",
79
+ " for tag1, tag2, _ in edges:\n",
80
+ " connected_tags[tag1] += 1\n",
81
+ " connected_tags[tag2] += 1\n",
82
+ "\n",
83
+ " nodes = [{\"id\": tag, \"size\": tag_counts[tag]} for tag in tag_counts if connected_tags[tag] >= min_connections]\n",
84
+ " valid_ids = set(node[\"id\"] for node in nodes)\n",
85
+ " links = [{\"source\": tag1, \"target\": tag2, \"value\": weight}\n",
86
+ " for tag1, tag2, weight in edges\n",
87
+ " if tag1 in valid_ids and tag2 in valid_ids]\n",
88
+ "\n",
89
+ " if not nodes or not links:\n",
90
+ " print(f\" ⚠️ Not enough connections for '{target_term}', skipping.\")\n",
91
+ " continue\n",
92
+ "\n",
93
+ " # === EXPORT ===\n",
94
+ " d3_data = {\"nodes\": nodes, \"links\": links}\n",
95
+ " safe_term = re.sub(r'\\W+', '_', target_term.lower())\n",
96
+ " output_file = os.path.join(output_dir, f\"tags_{safe_term}.json\")\n",
97
+ " \n",
98
+ " with open(output_file, \"w\") as f:\n",
99
+ " json.dump(d3_data, f, indent=4)\n",
100
+ " \n",
101
+ " print(f\" ✅ Saved to {output_file}\")\n"
102
+ ]
103
+ },
104
+ {
105
+ "cell_type": "markdown",
106
+ "metadata": {},
107
+ "source": [
108
+ "## Different Countries"
109
+ ]
110
+ },
111
+ {
112
+ "cell_type": "code",
113
+ "execution_count": 2,
114
+ "metadata": {},
115
+ "outputs": [
116
+ {
117
+ "name": "stderr",
118
+ "output_type": "stream",
119
+ "text": [
120
+ "/tmp/ipykernel_68582/2797381217.py:15: DtypeWarning: Columns (50,51,54,55,56,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,458) have mixed types. Specify dtype option on import or set low_memory=False.\n",
121
+ " df = pd.read_csv(file_path)\n"
122
+ ]
123
+ },
124
+ {
125
+ "name": "stdout",
126
+ "output_type": "stream",
127
+ "text": [
128
+ "Processing: united states\n",
129
+ " ⚠️ No matches for 'united states', skipping.\n",
130
+ "Processing: korea\n",
131
+ " ✅ Saved to data/json/tags_korea_poi.json\n",
132
+ "Processing: uk\n",
133
+ " ✅ Saved to data/json/tags_uk_poi.json\n",
134
+ "Processing: russia\n",
135
+ " ✅ Saved to data/json/tags_russia_poi.json\n",
136
+ "Processing: china\n",
137
+ " ✅ Saved to data/json/tags_china_poi.json\n",
138
+ "Processing: canada\n",
139
+ " ✅ Saved to data/json/tags_canada_poi.json\n",
140
+ "Processing: India\n",
141
+ " ✅ Saved to data/json/tags_india_poi.json\n",
142
+ "Processing: germany\n",
143
+ " ✅ Saved to data/json/tags_germany_poi.json\n"
144
+ ]
145
+ }
146
+ ],
147
+ "source": [
148
+ "import pandas as pd\n",
149
+ "from itertools import combinations\n",
150
+ "from collections import Counter, defaultdict\n",
151
+ "import json\n",
152
+ "import re\n",
153
+ "import os\n",
154
+ "\n",
155
+ "# === CONFIG ===\n",
156
+ "file_path = \"data/model_subsets/all_models_poi_true.csv\"\n",
157
+ "output_dir = \"data/json/\"\n",
158
+ "target_terms = [\"united states\", \"korea\", \"uk\", \"russia\", \"china\", \"canada\", \"India\", \"germany\"] # Add any tags you want to process\n",
159
+ "min_connections = 1 # minimum number of link connections per node\n",
160
+ "\n",
161
+ "# === LOAD DATA ===\n",
162
+ "df = pd.read_csv(file_path)\n",
163
+ "tag_columns = [f\"tag_{i}\" for i in range(1, 8)]\n",
164
+ "df_tags = df[tag_columns]\n",
165
+ "\n",
166
+ "# === MAIN LOOP ===\n",
167
+ "for target_term in target_terms:\n",
168
+ " print(f\"Processing: {target_term}\")\n",
169
+ " \n",
170
+ " pattern = re.compile(rf'\\b{re.escape(target_term)}\\b', flags=re.IGNORECASE)\n",
171
+ " df_filtered = df_tags[df_tags.apply(\n",
172
+ " lambda row: row.astype(str).apply(lambda x: bool(pattern.search(x))).any(),\n",
173
+ " axis=1\n",
174
+ " )]\n",
175
+ "\n",
176
+ " # Skip if no data matches\n",
177
+ " if df_filtered.empty:\n",
178
+ " print(f\" ⚠️ No matches for '{target_term}', skipping.\")\n",
179
+ " continue\n",
180
+ "\n",
181
+ " # === COUNT INDIVIDUAL TAGS ===\n",
182
+ " all_tags = df_filtered.values.flatten()\n",
183
+ " all_tags = [tag for tag in all_tags if pd.notna(tag)]\n",
184
+ " tag_counts = Counter(all_tags)\n",
185
+ "\n",
186
+ " # === CO-OCCURRENCE ===\n",
187
+ " co_occurrences = defaultdict(int)\n",
188
+ " for tags in df_filtered.itertuples(index=False, name=None):\n",
189
+ " tags = [tag for tag in tags if pd.notna(tag)]\n",
190
+ " for tag1, tag2 in combinations(tags, 2):\n",
191
+ " co_occurrences[frozenset([tag1, tag2])] += 1\n",
192
+ "\n",
193
+ " edges = [(list(pair)[0], list(pair)[1], weight) for pair, weight in co_occurrences.items()]\n",
194
+ "\n",
195
+ " # === FILTER BY CONNECTIONS ===\n",
196
+ " connected_tags = Counter()\n",
197
+ " for tag1, tag2, _ in edges:\n",
198
+ " connected_tags[tag1] += 1\n",
199
+ " connected_tags[tag2] += 1\n",
200
+ "\n",
201
+ " nodes = [{\"id\": tag, \"size\": tag_counts[tag]} for tag in tag_counts if connected_tags[tag] >= min_connections]\n",
202
+ " valid_ids = set(node[\"id\"] for node in nodes)\n",
203
+ " links = [{\"source\": tag1, \"target\": tag2, \"value\": weight}\n",
204
+ " for tag1, tag2, weight in edges\n",
205
+ " if tag1 in valid_ids and tag2 in valid_ids]\n",
206
+ "\n",
207
+ " if not nodes or not links:\n",
208
+ " print(f\" ⚠️ Not enough connections for '{target_term}', skipping.\")\n",
209
+ " continue\n",
210
+ "\n",
211
+ " # === EXPORT ===\n",
212
+ " d3_data = {\"nodes\": nodes, \"links\": links}\n",
213
+ " safe_term = re.sub(r'\\W+', '_', target_term.lower())\n",
214
+ " output_file = os.path.join(output_dir, f\"tags_{safe_term}_poi.json\")\n",
215
+ " \n",
216
+ " with open(output_file, \"w\") as f:\n",
217
+ " json.dump(d3_data, f, indent=4)\n",
218
+ " \n",
219
+ " print(f\" ✅ Saved to {output_file}\")\n"
220
+ ]
221
+ },
222
+ {
223
+ "cell_type": "code",
224
+ "execution_count": 5,
225
+ "metadata": {},
226
+ "outputs": [
227
+ {
228
+ "name": "stderr",
229
+ "output_type": "stream",
230
+ "text": [
231
+ "/tmp/ipykernel_79893/1420003123.py:13: DtypeWarning: Columns (50,51,54,55,56,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,458) have mixed types. Specify dtype option on import or set low_memory=False.\n",
232
+ " df = pd.read_csv(file_path)\n"
233
+ ]
234
+ },
235
+ {
236
+ "name": "stdout",
237
+ "output_type": "stream",
238
+ "text": [
239
+ "✅ Exported 60330 nodes and 16921 links to public/json/nodes_all.json\n"
240
+ ]
241
+ }
242
+ ],
243
+ "source": [
244
+ "import pandas as pd\n",
245
+ "from itertools import combinations\n",
246
+ "from collections import Counter, defaultdict\n",
247
+ "import json\n",
248
+ "import os\n",
249
+ "\n",
250
+ "# === CONFIG ===\n",
251
+ "file_path = \"data/model_subsets/all_models_poi_false.csv\"\n",
252
+ "output_file = \"public/json/nodes_all.json\"\n",
253
+ "min_link_threshold = 10 # Only keep edges with co-occurrence >= this\n",
254
+ "\n",
255
+ "# === LOAD DATA ===\n",
256
+ "df = pd.read_csv(file_path)\n",
257
+ "tag_columns = [f\"tag_{i}\" for i in range(1, 8)]\n",
258
+ "df_tags = df[tag_columns]\n",
259
+ "\n",
260
+ "# === COUNT INDIVIDUAL TAGS ===\n",
261
+ "all_tags = df_tags.values.flatten()\n",
262
+ "all_tags = [tag for tag in all_tags if pd.notna(tag)]\n",
263
+ "tag_counts = Counter(all_tags)\n",
264
+ "\n",
265
+ "# === CO-OCCURRENCE ===\n",
266
+ "co_occurrences = defaultdict(int)\n",
267
+ "for tags in df_tags.itertuples(index=False, name=None):\n",
268
+ " tags = [tag for tag in tags if pd.notna(tag)]\n",
269
+ " for tag1, tag2 in combinations(tags, 2):\n",
270
+ " co_occurrences[frozenset([tag1, tag2])] += 1\n",
271
+ "\n",
272
+ "# === Build Edges (Filtered by co-occurrence threshold)\n",
273
+ "edges = [\n",
274
+ " {\"source\": list(pair)[0], \"target\": list(pair)[1], \"value\": weight}\n",
275
+ " for pair, weight in co_occurrences.items()\n",
276
+ " if weight >= min_link_threshold\n",
277
+ "]\n",
278
+ "\n",
279
+ "# === Build Nodes (All tags that appear, regardless of links)\n",
280
+ "nodes = [{\"id\": tag, \"size\": tag_counts[tag]} for tag in tag_counts]\n",
281
+ "\n",
282
+ "# === EXPORT ===\n",
283
+ "d3_data = {\"nodes\": nodes, \"links\": edges}\n",
284
+ "\n",
285
+ "os.makedirs(os.path.dirname(output_file), exist_ok=True)\n",
286
+ "with open(output_file, \"w\") as f:\n",
287
+ " json.dump(d3_data, f, indent=4)\n",
288
+ "\n",
289
+ "print(f\"✅ Exported {len(nodes)} nodes and {len(edges)} links to {output_file}\")\n"
290
+ ]
291
+ }
292
+ ],
293
+ "metadata": {
294
+ "kernelspec": {
295
+ "display_name": "Python 3 (ipykernel)",
296
+ "language": "python",
297
+ "name": "python3"
298
+ },
299
+ "language_info": {
300
+ "codemirror_mode": {
301
+ "name": "ipython",
302
+ "version": 3
303
+ },
304
+ "file_extension": ".py",
305
+ "mimetype": "text/x-python",
306
+ "name": "python",
307
+ "nbconvert_exporter": "python",
308
+ "pygments_lexer": "ipython3",
309
+ "version": "3.12.10"
310
+ }
311
+ },
312
+ "nbformat": 4,
313
+ "nbformat_minor": 4
314
+ }
jupyter_notebooks/Section_2-4_Figure_9_Training_tags_Sankey.ipynb ADDED
@@ -0,0 +1,203 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Sankey Diagram from model dataset"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": null,
13
+ "metadata": {},
14
+ "outputs": [],
15
+ "source": [
16
+ "import pandas as pd\n",
17
+ "from pathlib import Path\n",
18
+ "current_dir = Path.cwd()"
19
+ ]
20
+ },
21
+ {
22
+ "cell_type": "code",
23
+ "execution_count": null,
24
+ "metadata": {},
25
+ "outputs": [
26
+ {
27
+ "name": "stderr",
28
+ "output_type": "stream",
29
+ "text": [
30
+ "/tmp/ipykernel_9667/181225218.py:8: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
31
+ " df['has_tags'] = df['has_tags'].fillna(False)\n",
32
+ "/tmp/ipykernel_9667/181225218.py:9: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
33
+ " df['has_sex_tags'] = df['has_sex_tags'].fillna(False)\n"
34
+ ]
35
+ },
36
+ {
37
+ "name": "stdout",
38
+ "output_type": "stream",
39
+ "text": [
40
+ "🟢 All Adapters with Tags → Explicit: 4970\n",
41
+ "🔵 All Adapters with Tags → Non-Explicit: 6181\n",
42
+ "✅ Sankey structure saved to sankey_tags_focus_output.csv\n"
43
+ ]
44
+ }
45
+ ],
46
+ "source": [
47
+ "import pandas as pd\n",
48
+ "# Load your dataset\n",
49
+ "df = pd.read_csv(\"data/CSV/all_models_with_tags.csv\", nrows=40000)\n",
50
+ "\n",
51
+ "\n",
52
+ "# Fill missing values\n",
53
+ "df['has_tags'] = df['has_tags'].fillna(False)\n",
54
+ "df['has_sex_tags'] = df['has_sex_tags'].fillna(False)\n",
55
+ "df['poi'] = df['poi'].fillna(False)\n",
56
+ "\n",
57
+ "# Normalize and map types\n",
58
+ "df['type_normalized'] = df['type'].str.lower().str.strip()\n",
59
+ "type_mapping = {\n",
60
+ " 'checkpoint': 'Checkpoint',\n",
61
+ " 'lora': 'LoRA',\n",
62
+ " 'locon': 'LOCON',\n",
63
+ " 'textualinversion': 'TextualInversion',\n",
64
+ " 'controlnet': 'Other',\n",
65
+ " 'vae': 'Other',\n",
66
+ " 'upscaler': 'Other',\n",
67
+ " 'poses': 'Other',\n",
68
+ " 'workflows': 'Other',\n",
69
+ " 'other': 'Other'\n",
70
+ "}\n",
71
+ "df['type_mapped'] = df['type_normalized'].map(type_mapping).fillna('Other')\n",
72
+ "\n",
73
+ "\n",
74
+ "\n",
75
+ "\n",
76
+ "\n",
77
+ "# Identify Adapters\n",
78
+ "adapter_types = ['LoRA', 'LOCON', 'TextualInversion', 'Controlnet']\n",
79
+ "df['is_adapter'] = df['type_mapped'].isin(adapter_types)\n",
80
+ "adapter_df = df[df['is_adapter']]\n",
81
+ "\n",
82
+ "# Build Sankey links\n",
83
+ "sankey_links = []\n",
84
+ "\n",
85
+ "# Total → each model type\n",
86
+ "for t in df['type_mapped'].unique():\n",
87
+ " count = df[df['type_mapped'] == t].shape[0]\n",
88
+ " sankey_links.append({'source': 'Total', 'target': t, 'value': count})\n",
89
+ "\n",
90
+ "# Adapter types → Adapters\n",
91
+ "for t in adapter_types:\n",
92
+ " count = df[df['type_mapped'] == t].shape[0]\n",
93
+ " if count > 0:\n",
94
+ " sankey_links.append({'source': t, 'target': 'Adapters', 'value': count})\n",
95
+ "# Redefine adapter_types to only include those relevant for POI analysis\n",
96
+ "adapter_types = ['LoRA', 'LOCON', 'TextualInversion']\n",
97
+ "df['is_adapter'] = df['type_mapped'].isin(adapter_types)\n",
98
+ "adapter_df = df[df['is_adapter']]\n",
99
+ "\n",
100
+ "# Adapters → has_tags / has_no_tags\n",
101
+ "tagged_df = adapter_df[adapter_df['has_tags'] == True]\n",
102
+ "no_tag_df = adapter_df[adapter_df['has_tags'] == False]\n",
103
+ "\n",
104
+ "# Add has_no_tags count\n",
105
+ "no_tag_count = no_tag_df.shape[0]\n",
106
+ "if no_tag_count > 0:\n",
107
+ " sankey_links.append({'source': 'Adapters', 'target': 'has_no_tags', 'value': no_tag_count})\n",
108
+ "\n",
109
+ "# Now subdivide has_tags group\n",
110
+ "has_tag_count = tagged_df.shape[0]\n",
111
+ "if has_tag_count > 0:\n",
112
+ " sankey_links.append({'source': 'Adapters', 'target': 'has_tags', 'value': has_tag_count})\n",
113
+ "\n",
114
+ " for poi_val in [True, False]:\n",
115
+ " poi_label = f'has_tags + POI {\"True\" if poi_val else \"False\"}'\n",
116
+ " subset = tagged_df[tagged_df['poi'] == poi_val]\n",
117
+ " poi_count = subset.shape[0] # ✅ Fix: missing in original code\n",
118
+ "\n",
119
+ " if poi_count > 0:\n",
120
+ " sankey_links.append({'source': 'has_tags', 'target': poi_label, 'value': poi_count})\n",
121
+ "\n",
122
+ " for explicit in [True, False]:\n",
123
+ " explicit_label = f'{poi_label} + {\"explicit\" if explicit else \"non-explicit\"}'\n",
124
+ " exp_count = subset[subset['has_sex_tags'] == explicit].shape[0]\n",
125
+ " if exp_count > 0:\n",
126
+ " sankey_links.append({'source': poi_label, 'target': explicit_label, 'value': exp_count})\n",
127
+ "\n",
128
+ "# --- New Section: Count Models (Rows) Containing Specific Tags ---\n",
129
+ "\n",
130
+ "# --- Output total explicit and non-explicit from all adapters with tags ---\n",
131
+ "\n",
132
+ "explicit_total = tagged_df[tagged_df['has_sex_tags'] == True].shape[0]\n",
133
+ "non_explicit_total = tagged_df[tagged_df['has_sex_tags'] == False].shape[0]\n",
134
+ "\n",
135
+ "# Add to sankey_links (optional)\n",
136
+ "if explicit_total > 0:\n",
137
+ " sankey_links.append({'source': 'has_tags', 'target': 'All Adapters has_tags + explicit', 'value': explicit_total})\n",
138
+ "\n",
139
+ "if non_explicit_total > 0:\n",
140
+ " sankey_links.append({'source': 'has_tags', 'target': 'All Adapters has_tags + non-explicit', 'value': non_explicit_total})\n",
141
+ "\n",
142
+ "# Also print them clearly\n",
143
+ "print(f\"🟢 All Adapters with Tags → Explicit: {explicit_total}\")\n",
144
+ "print(f\"🔵 All Adapters with Tags → Non-Explicit: {non_explicit_total}\")\n",
145
+ "\n",
146
+ "\n",
147
+ "# Define target tags to check\n",
148
+ "target_tags = ['rape', 'loli', 'shota', 'lolicon']\n",
149
+ "\n",
150
+ "# Identify all tag columns\n",
151
+ "tag_columns = [col for col in df.columns if col.startswith('tag') and col[3:].isdigit()]\n",
152
+ "\n",
153
+ "# Lowercase tag values for matching\n",
154
+ "df_tags_lower = df[tag_columns].astype(str).apply(lambda col: col.str.lower().str.strip())\n",
155
+ "\n",
156
+ "# For each target tag, check how many rows contain it\n",
157
+ "for tag in target_tags:\n",
158
+ " contains_tag = df_tags_lower.isin([tag])\n",
159
+ " \n",
160
+ " # Only count rows that are both explicit and contain the tag\n",
161
+ " matching_rows = df[(df['has_sex_tags'] == True) & contains_tag.any(axis=1)]\n",
162
+ " rows_with_tag = matching_rows.shape[0]\n",
163
+ " \n",
164
+ " if rows_with_tag > 0:\n",
165
+ " sankey_links.append({'source': 'has_tags', 'target': f'has_tag_{tag}', 'value': int(rows_with_tag)})\n",
166
+ "\n",
167
+ "\n",
168
+ "# Save to CSV\n",
169
+ "sankey_df = pd.DataFrame(sankey_links)\n",
170
+ "sankey_df.to_csv(\"data/CSV/sankey_data.csv\", index=False)\n",
171
+ "print(\"✅ Sankey structure saved to sankey_tags_focus_output.csv\")\n"
172
+ ]
173
+ },
174
+ {
175
+ "cell_type": "code",
176
+ "execution_count": null,
177
+ "metadata": {},
178
+ "outputs": [],
179
+ "source": []
180
+ }
181
+ ],
182
+ "metadata": {
183
+ "kernelspec": {
184
+ "display_name": "latm",
185
+ "language": "python",
186
+ "name": "python3"
187
+ },
188
+ "language_info": {
189
+ "codemirror_mode": {
190
+ "name": "ipython",
191
+ "version": 3
192
+ },
193
+ "file_extension": ".py",
194
+ "mimetype": "text/x-python",
195
+ "name": "python",
196
+ "nbconvert_exporter": "python",
197
+ "pygments_lexer": "ipython3",
198
+ "version": "3.10.15"
199
+ }
200
+ },
201
+ "nbformat": 4,
202
+ "nbformat_minor": 2
203
+ }
jupyter_notebooks/Section_2-4_Figure_9_ectract_LoRA_metadata_v2.ipynb ADDED
@@ -0,0 +1,414 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "f36422c8",
6
+ "metadata": {},
7
+ "source": [
8
+ "# LoRA metadata"
9
+ ]
10
+ },
11
+ {
12
+ "cell_type": "raw",
13
+ "id": "8a2feb6e",
14
+ "metadata": {},
15
+ "source": [
16
+ "LoRA Metadata Processing Workflow\n",
17
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
18
+ "│ Load CSV File │ --> │ Read adapter metadata CSV file. │\n",
19
+ "│ Read Model Versions │ │ Extract model version IDs and relevant data. │\n",
20
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
21
+ " ↓\n",
22
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
23
+ "│ Download Adapter │ --> │ Use stored download URLs to fetch adapter files │\n",
24
+ "│ Files Using API │ │ using rotating API keys. │\n",
25
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
26
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
27
+ "│ Parse Metadata │ --> │ Extract safetensors metadata, such as training │\n",
28
+ "│ from SafeTensor │ │ images, model type, and architecture. │\n",
29
+ "│ Files │ │ │\n",
30
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
31
+ " ↓\n",
32
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
33
+ "│ Store Parsed │ --> │ Save extracted metadata into structured JSON │\n",
34
+ "│ Metadata as JSON │ │ files for later analysis. │\n",
35
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
36
+ "\n",
37
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
38
+ "│ Process JSON Files │ --> │ Read saved JSON metadata, extract relevant │\n",
39
+ "│ for Consolidation │ │ details, and filter necessary attributes. │\n",
40
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
41
+ " ↓\n",
42
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
43
+ "│ Extract Training │ --> │ Identify most frequent training tags, architectures│\n",
44
+ "│ Tags & Model Info │ │ and systems used for model creation. │\n",
45
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
46
+ " ↓\n",
47
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
48
+ "│ Save Consolidated │ --> │ Store all processed metadata in a structured CSV │\n",
49
+ "│ Metadata to CSV      │     │ format for final analysis.                        │\n",
50
+ "└──────────────────────┘ └───────────────────────────────────────────────────┘\n"
51
+ ]
52
+ },
53
+ {
54
+ "cell_type": "code",
55
+ "execution_count": null,
56
+ "id": "efc9939d",
57
+ "metadata": {},
58
+ "outputs": [],
59
+ "source": [
60
+ "import os\n",
61
+ "import re\n",
62
+ "import json\n",
63
+ "import csv\n",
64
+ "import struct\n",
65
+ "import requests\n",
66
+ "from pathlib import Path\n",
67
+ "import pandas as pd\n",
68
+ "from collections import Counter\n",
69
+ "from concurrent.futures import ProcessPoolExecutor\n",
70
+ "import matplotlib.pyplot as plt\n",
72
+ "from matplotlib.font_manager import FontProperties\n",
73
+ "from matplotlib import font_manager\n",
77
+ "\n",
78
+ "# Define the current directory and important file paths\n",
79
+ "current_dir = Path.cwd()\n",
80
+ "\n",
81
+ "# Define frequently used directories\n",
82
+ "\n",
83
+ "data_dir = current_dir.parent / 'data/csv/adapters.csv'\n",
84
+ "fonts_dir = current_dir.parent / 'misc/assets/fonts'\n",
85
+ "plots_dir = current_dir.parent / 'results/plots'\n",
86
+ "raw_data_dir = current_dir.parent / 'data/adapter_metadata/lora' ### location of the LoRA metadata (JSON)\n",
87
+ "temp_dir = current_dir.parent / 'data/raw/adapters_safetensors'\n",
88
+ "misc_dir = current_dir.parent / 'misc'\n",
89
+ "\n",
90
+ "# File paths\n",
91
+ "adapters_csv = current_dir.parent / 'data/csv/adapters.csv'\n",
92
+ "output_json_dir = raw_data_dir\n",
93
+ "api_keys_file = misc_dir / 'credentials/civit.txt'\n",
94
+ "\n",
95
+ "# Ensure directories exist\n",
96
+ "os.makedirs(output_json_dir, exist_ok=True)\n",
97
+ "os.makedirs(temp_dir, exist_ok=True)\n",
98
+ "\n",
99
+ "\n",
100
+ "# Load fonts into Matplotlib; font_paths was undefined in the original, so we\n",
+ "# collect font files from fonts_dir (assumes .ttf/.otf files live there)\n",
101
+ "font_paths = sorted(fonts_dir.glob('*.ttf')) + sorted(fonts_dir.glob('*.otf'))\n",
+ "for font_path in font_paths:\n",
102
+ " font_manager.fontManager.addfont(font_path)\n",
103
+ "\n",
104
+ "# Set default font family for plots\n",
105
+ "plt.rcParams['font.family'] = ['Noto Sans JP', 'Noto Sans SC', 'sans-serif']\n",
106
+ "\n",
107
+ "print('Paths and fonts initialized successfully.')"
110
+ ]
111
+ },
112
+ {
113
+ "cell_type": "markdown",
114
+ "id": "87a58593",
115
+ "metadata": {},
116
+ "source": [
117
+ "## Step 2: Download LoRA and extract *.safetensors metadata\n",
118
+ "This script downloads LoRA adapters from the filtered Civiverse-Models dataset and extracts the metadata embedded in the *.safetensors data structure."
119
+ ]
120
+ },
121
+ {
122
+ "cell_type": "code",
123
+ "execution_count": null,
124
+ "id": "abd3a0bc",
125
+ "metadata": {},
126
+ "outputs": [],
127
+ "source": [
128
+ "import os\n",
129
+ "import sys\n",
130
+ "import csv\n",
131
+ "import json\n",
132
+ "import struct\n",
133
+ "import time\n",
134
+ "import requests\n",
135
+ "import signal\n",
136
+ "import contextlib\n",
137
+ "from pathlib import Path\n",
138
+ "import re\n",
139
+ "\n",
140
+ "# === Paste your API keys here ===\n",
141
+ "API_KEYS = [\n",
142
+ " \"399c06ea6d1b7349556a115376ec346b\", #DISCORD\n",
143
+ " \"213be9d373130f86e394c6fea4d75162\", #ASDD 1\n",
144
+ " \"4f180c0c56334b74394b467c5e5b8201\", #ASDD 2\n",
145
+ " \"bdfba7ac53290f66bc76130f25b74336\", #BSDD \n",
146
+ " \"43294f4a27b388624a896db5a65f445a\"\n",
147
+ "]\n",
148
+ "if not API_KEYS or any(not isinstance(k, str) or not k.strip() for k in API_KEYS):\n",
149
+ " raise ValueError(\"Please paste at least one valid API key into API_KEYS.\")\n",
150
+ "\n",
151
+ "# === Config (adjust paths as needed) ===\n",
152
+ "current_dir = Path.cwd()\n",
153
+ "output_json_dir = current_dir.parent / \"data/adapter_metadata/lora\" # where JSON outputs go\n",
154
+ "temp_dir = current_dir.parent / \"data/raw/adapters_safetensors\" # where downloads go\n",
155
+ "csv_path = current_dir.parent / \"data/csv/adapters_poi_false_sfw.csv\"\n",
156
+ "\n",
157
+ "os.makedirs(output_json_dir, exist_ok=True)\n",
158
+ "os.makedirs(temp_dir, exist_ok=True)\n",
159
+ "\n",
160
+ "# === API key state ===\n",
161
+ "current_key_index = 0\n",
162
+ "\n",
163
+ "\n",
164
+ "\n",
165
+ "def safe_filename(name: str, max_length: int = 100) -> str:\n",
166
+ " # Replace unsafe chars\n",
167
+ " sanitized = re.sub(r'[^a-zA-Z0-9_\\-]', '_', name)\n",
168
+ " # Truncate if too long\n",
169
+ " if len(sanitized) > max_length:\n",
170
+ " sanitized = sanitized[:max_length]\n",
171
+ " return sanitized\n",
172
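+ "# e.g. safe_filename('My LoRA v2.0!') -> 'My_LoRA_v2_0_'\n",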
+ "\n",
173
+ "\n",
174
+ "def get_headers():\n",
175
+ " global current_key_index\n",
176
+ " return {\n",
177
+ " \"Accept\": \"application/json\",\n",
178
+ " \"Authorization\": f\"Bearer {API_KEYS[current_key_index].strip()}\"\n",
179
+ " }\n",
180
+ "\n",
181
+ "def rotate_api_key():\n",
182
+ " global current_key_index\n",
183
+ " if current_key_index < len(API_KEYS) - 1:\n",
184
+ " current_key_index += 1\n",
185
+ " print(f\"🔁 Rotated to API key #{current_key_index + 1}\")\n",
186
+ " else:\n",
187
+ " raise Exception(\"All API keys have been exhausted.\")\n",
188
+ "\n",
189
+ "# === Utilities ===\n",
190
+ "def save_json(data, filename):\n",
191
+ " with open(filename, 'w', encoding=\"utf-8\") as f:\n",
192
+ " json.dump(data, f, indent=4, ensure_ascii=False)\n",
193
+ "\n",
194
+ "def parse_safetensors(file_path):\n",
195
+ " # Minimal, tolerant metadata reader; returns {} on failure.\n",
196
+ " try:\n",
197
+ " with open(file_path, 'rb') as f:\n",
198
+ " file_data = f.read()\n",
199
+ "        # The safetensors header length is an 8-byte little-endian integer; reading\n",
200
+ "        # only the first 4 bytes works because the length fits in 32 bits, and the\n",
+ "        # JSON header itself starts at byte offset 8.\n",
201
+ " metadata_size = struct.unpack('<I', file_data[:4])[0]\n",
202
+ " metadata_bytes = file_data[8:8 + metadata_size]\n",
203
+ " metadata_str = metadata_bytes.decode('utf-8', errors='replace')\n",
204
+ " metadata = json.loads(metadata_str)\n",
205
+ " return metadata.get('__metadata__', {})\n",
206
+ " except Exception as e:\n",
207
+ " print(f\"Error parsing safetensors file: {e}\")\n",
208
+ " return {}\n",
209
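+ "# Layout of a .safetensors file (per the format spec):\n",
+ "#   bytes 0..7   : u64 little-endian length N of the JSON header\n",
+ "#   bytes 8..8+N : UTF-8 JSON header; training info sits under '__metadata__'\n",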
+ "\n",
210
+ "# === Timeout context ===\n",
211
+ "class TimeoutException(Exception):\n",
212
+ " pass\n",
213
+ "\n",
214
+ "@contextlib.contextmanager\n",
215
+ "def time_limit(seconds):\n",
216
+ " def signal_handler(signum, frame):\n",
217
+ " raise TimeoutException(f\"Timed out after {seconds} seconds\")\n",
218
+ " # Note: SIGALRM works on Unix-like OS; on Windows this will be a no-op.\n",
219
+ " try:\n",
220
+ " signal.signal(signal.SIGALRM, signal_handler)\n",
221
+ " signal.alarm(seconds)\n",
222
+ " except Exception:\n",
223
+ " # Fallback: no hard alarm on non-Unix systems\n",
224
+ " pass\n",
225
+ " try:\n",
226
+ " yield\n",
227
+ " finally:\n",
228
+ " try:\n",
229
+ " signal.alarm(0)\n",
230
+ " except Exception:\n",
231
+ " pass\n",
232
+ "\n",
233
+ "# === Download with timeout, retries, backoff, and key rotation ===\n",
234
+ "def download_file(url, output_folder, timeout=30, overall_timeout=120, max_retries=3):\n",
235
+ " filename = url.split(\"/\")[-1]\n",
236
+ " output_path = os.path.join(output_folder, filename)\n",
237
+ "\n",
238
+ " global current_key_index\n",
239
+ " retries = 0\n",
240
+ " backoff = 2\n",
241
+ "\n",
242
+ " while current_key_index < len(API_KEYS):\n",
243
+ " try:\n",
244
+ " with time_limit(overall_timeout): # global cap per download\n",
245
+ " #print(f\"➡️ GET {url} using key #{current_key_index + 1}\")\n",
246
+ " resp = requests.get(\n",
247
+ " url,\n",
248
+ " headers=get_headers(),\n",
249
+ " stream=True,\n",
250
+ " timeout=(10, timeout), # (connect timeout, per-chunk read timeout)\n",
251
+ " )\n",
252
+ "\n",
253
+ " # Auth errors → rotate key\n",
254
+ " if resp.status_code in (401, 403):\n",
255
+ " print(f\"❌ Auth {resp.status_code} with key #{current_key_index + 1}. Rotating.\")\n",
256
+ " rotate_api_key()\n",
257
+ " retries = 0\n",
258
+ " backoff = 2\n",
259
+ " continue\n",
260
+ "\n",
261
+ " # Not found → bubble up as FileNotFoundError (do not rotate)\n",
262
+ " if resp.status_code == 404:\n",
263
+ " raise FileNotFoundError(f\"Model not found at {url}\")\n",
264
+ "\n",
265
+ " # Rate limit → either rotate or wait/backoff\n",
266
+ " if resp.status_code == 429:\n",
267
+ " print(\"⏳ Rate limited (429).\", end=\" \")\n",
268
+ " if current_key_index < len(API_KEYS) - 1:\n",
269
+ " print(\"Rotating key.\")\n",
270
+ " rotate_api_key()\n",
271
+ " retries = 0\n",
272
+ " backoff = 2\n",
273
+ " continue\n",
274
+ " else:\n",
275
+ " print(f\"Waiting {backoff}s (no other keys).\")\n",
276
+ " time.sleep(backoff)\n",
277
+ " backoff = min(backoff * 2, 60)\n",
278
+ " continue\n",
279
+ "\n",
280
+ " # Other HTTP errors → raise to RequestException path\n",
281
+ " resp.raise_for_status()\n",
282
+ "\n",
283
+ " # Save file\n",
284
+ " with open(output_path, 'wb') as fh:\n",
285
+ " for chunk in resp.iter_content(chunk_size=8192):\n",
286
+ " if chunk:\n",
287
+ " fh.write(chunk)\n",
288
+ "\n",
289
+ " return output_path, filename\n",
290
+ "\n",
291
+ " except TimeoutException as e:\n",
292
+ " # Hard overall timeout → propagate\n",
293
+ " raise e\n",
294
+ " except requests.exceptions.RequestException as e:\n",
295
+ " # Network-ish errors: retry same key with backoff up to max_retries\n",
296
+ " retries += 1\n",
297
+ " if retries <= max_retries:\n",
298
+ " print(f\"🌐 Network error (try {retries}/{max_retries}) with key #{current_key_index + 1}: {e}\")\n",
299
+ " time.sleep(backoff)\n",
300
+ " backoff = min(backoff * 2, 60)\n",
301
+ " continue\n",
302
+ " else:\n",
303
+ " raise Exception(f\"Failed to download {url} after {max_retries} retries: {e}\")\n",
304
+ "\n",
305
+ " # If we exit the loop, we truly ran out\n",
306
+ " raise Exception(\"All API keys have been exhausted or failed.\")\n",
307
+ "\n",
308
+ "# === Main processing ===\n",
309
+ "def process_csv(csv_path):\n",
310
+ " with open(csv_path, newline='', encoding='utf-8') as csvfile:\n",
311
+ " reader = csv.DictReader(csvfile)\n",
312
+ " for index, row in enumerate(reader):\n",
313
+ " # Collect up to 20 version IDs; use the most recent\n",
314
+ " version_ids = []\n",
315
+ " for i in range(1, 21):\n",
316
+ " k = f'version_id_{i}'\n",
317
+ " if k in row and row[k]:\n",
318
+ " try:\n",
319
+ " version_ids.append(int(float(row[k])))\n",
320
+ " except ValueError:\n",
321
+ " print(f\"Invalid version_id value '{row[k]}' in row: {row}\")\n",
322
+ "\n",
323
+ " if not version_ids:\n",
324
+ " print(f\"No valid version IDs found in row: {row}\")\n",
325
+ " continue\n",
326
+ "\n",
327
+ " most_recent_version_id = str(max(version_ids))\n",
328
+ " name = row.get('name', 'unknown')\n",
329
+ " sanitized_name = safe_filename(name, max_length=100)\n",
330
+ " new_json_file = os.path.join(\n",
331
+ " output_json_dir,\n",
332
+ " f\"{index:08d}_{most_recent_version_id}_{sanitized_name}.json\"\n",
333
+ " )\n",
334
+ "\n",
335
+ " # Skip if JSON already exists\n",
336
+ " if os.path.exists(new_json_file):\n",
337
+ " #print(f\"↩️ Skipping versionID {most_recent_version_id} (JSON already exists)\")\n",
338
+ " continue\n",
339
+ "\n",
340
+ " try:\n",
341
+ " adapter_file, fname = download_file(\n",
342
+ " row['downloadUrl'], str(temp_dir),\n",
343
+ " timeout=30, overall_timeout=180\n",
344
+ " )\n",
345
+ " metadata = parse_safetensors(adapter_file)\n",
346
+ "\n",
347
+ " civitaidata = {\n",
348
+ " k: (int(v) if str(v).isdigit() else v)\n",
349
+ " for k, v in row.items()\n",
350
+ " }\n",
351
+ " new_json_data = {\n",
352
+ " \"civitaidata\": civitaidata,\n",
353
+ " \"metadata\": metadata,\n",
354
+ " \"versionID\": most_recent_version_id\n",
355
+ " }\n",
356
+ " save_json(new_json_data, new_json_file)\n",
357
+ " #print(f\"✅ Created JSON for versionID {most_recent_version_id} with file {fname}\")\n",
358
+ "\n",
359
+ " except FileNotFoundError as e:\n",
360
+ " print(f\"⚠️ {e} — saving empty metadata.\")\n",
361
+ " civitaidata = {\n",
362
+ " k: (int(v) if str(v).isdigit() else v)\n",
363
+ " for k, v in row.items()\n",
364
+ " }\n",
365
+ " empty_json = {\n",
366
+ " \"civitaidata\": civitaidata,\n",
367
+ " \"metadata\": {},\n",
368
+ " \"versionID\": most_recent_version_id,\n",
369
+ " \"error\": \"Model not found (404)\"\n",
370
+ " }\n",
371
+ " save_json(empty_json, new_json_file)\n",
372
+ " except Exception as e:\n",
373
+ " print(f\"⚠️ Error processing versionID {most_recent_version_id}: {e}\")\n",
374
+ " civitaidata = {\n",
375
+ " k: (int(v) if str(v).isdigit() else v)\n",
376
+ " for k, v in row.items()\n",
377
+ " }\n",
378
+ " empty_json = {\n",
379
+ " \"civitaidata\": civitaidata,\n",
380
+ " \"metadata\": {},\n",
381
+ " \"versionID\": most_recent_version_id,\n",
382
+ " \"error\": str(e)\n",
383
+ " }\n",
384
+ " save_json(empty_json, new_json_file)\n",
385
+ " print(f\"💾 Saved empty JSON for versionID {most_recent_version_id} due to failure.\")\n",
386
+ "\n",
387
+ "# === Run ===\n",
388
+ "if __name__ == \"__main__\":\n",
389
+ " process_csv(csv_path)\n"
390
+ ]
391
+ }
392
+ ],
393
+ "metadata": {
394
+ "kernelspec": {
395
+ "display_name": "Python 3 (ipykernel)",
396
+ "language": "python",
397
+ "name": "python3"
398
+ },
399
+ "language_info": {
400
+ "codemirror_mode": {
401
+ "name": "ipython",
402
+ "version": 3
403
+ },
404
+ "file_extension": ".py",
405
+ "mimetype": "text/x-python",
406
+ "name": "python",
407
+ "nbconvert_exporter": "python",
408
+ "pygments_lexer": "ipython3",
409
+ "version": "3.13.9"
410
+ }
411
+ },
412
+ "nbformat": 4,
413
+ "nbformat_minor": 5
414
+ }
jupyter_notebooks/Section_2-4_Figure_9_extract_LoRA_metadata.ipynb ADDED
@@ -0,0 +1,557 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "781346f0-fab6-45f9-9193-dd547f201864",
6
+ "metadata": {
7
+ "execution": {
8
+ "iopub.execute_input": "2025-02-08T11:14:38.839305Z",
9
+ "iopub.status.busy": "2025-02-08T11:14:38.837963Z",
10
+ "iopub.status.idle": "2025-02-08T11:14:38.845161Z",
11
+ "shell.execute_reply": "2025-02-08T11:14:38.844121Z",
12
+ "shell.execute_reply.started": "2025-02-08T11:14:38.839221Z"
13
+ }
14
+ },
15
+ "source": [
16
+ "# Section 3-4: LoRA metadata"
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "markdown",
21
+ "id": "90b42e45-31bc-4cbd-b820-cf2daccbd712",
22
+ "metadata": {
23
+ "execution": {
24
+ "iopub.execute_input": "2025-02-08T21:10:00.955922Z",
25
+ "iopub.status.busy": "2025-02-08T21:10:00.955285Z",
26
+ "iopub.status.idle": "2025-02-08T21:10:00.959160Z",
27
+ "shell.execute_reply": "2025-02-08T21:10:00.958709Z",
28
+ "shell.execute_reply.started": "2025-02-08T21:10:00.955902Z"
29
+ }
30
+ },
31
+ "source": [
32
+ "### How are models trained? \n",
33
+ "### What are the most common training tags? \n",
34
+ "### What tagging systems are used?"
35
+ ]
36
+ },
37
+ {
38
+ "cell_type": "raw",
39
+ "id": "79b10960-c7c6-4aa6-b91e-f0caf5c2a7dd",
40
+ "metadata": {},
41
+ "source": [
42
+ "LoRA Metadata Processing Workflow\n",
43
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
44
+ "│ Load CSV File │ --> │ Read adapter metadata CSV file. │\n",
45
+ "│ Read Model Versions │ │ Extract model version IDs and relevant data. │\n",
46
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
47
+ " ↓\n",
48
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
49
+ "│ Download Adapter │ --> │ Use stored download URLs to fetch adapter files │\n",
50
+ "│ Files Using API │ │ using rotating API keys. │\n",
51
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
52
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
53
+ "│ Parse Metadata │ --> │ Extract safetensors metadata, such as training │\n",
54
+ "│ from SafeTensor │ │ images, model type, and architecture. │\n",
55
+ "│ Files │ │ │\n",
56
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
57
+ " ↓\n",
58
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
59
+ "│ Store Parsed │ --> │ Save extracted metadata into structured JSON │\n",
60
+ "│ Metadata as JSON │ │ files for later analysis. │\n",
61
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
62
+ "\n",
63
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
64
+ "│ Process JSON Files │ --> │ Read saved JSON metadata, extract relevant │\n",
65
+ "│ for Consolidation │ │ details, and filter necessary attributes. │\n",
66
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
67
+ " ↓\n",
68
+ "┌──────────────────────┐ ┌───────────────────────��───────────────────────────┐\n",
69
+ "│ Extract Training │ --> │ Identify most frequent training tags, architectures│\n",
70
+ "│ Tags & Model Info │ │ and systems used for model creation. │\n",
71
+ "└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
72
+ " ↓\n",
73
+ "┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
74
+ "│ Save Consolidated │ --> │ Store all processed metadata in a structured CSV │\n",
75
+ "│ Metadata to CSV │ │ format for final analysis. │\n",
76
+ "└──────────────────────┘ └───────────────────────────────────────────────────┘\n"
77
+ ]
78
+ },
79
+ {
80
+ "cell_type": "markdown",
81
+ "id": "fd202dbd-fc0f-45f1-9d88-94c7d3635207",
82
+ "metadata": {
83
+ "execution": {
84
+ "iopub.execute_input": "2025-02-08T20:54:00.478266Z",
85
+ "iopub.status.busy": "2025-02-08T20:54:00.477080Z",
86
+ "iopub.status.idle": "2025-02-08T20:54:00.481464Z",
87
+ "shell.execute_reply": "2025-02-08T20:54:00.480830Z",
88
+ "shell.execute_reply.started": "2025-02-08T20:54:00.478244Z"
89
+ }
90
+ },
91
+ "source": [
92
+ "### Define Common Paths, Fonts etc."
93
+ ]
94
+ },
95
+ {
96
+ "cell_type": "code",
97
+ "execution_count": 24,
98
+ "id": "9da03b00",
99
+ "metadata": {
100
+ "execution": {
101
+ "iopub.execute_input": "2025-02-08T21:11:15.085872Z",
102
+ "iopub.status.busy": "2025-02-08T21:11:15.084865Z",
103
+ "iopub.status.idle": "2025-02-08T21:11:15.117479Z",
104
+ "shell.execute_reply": "2025-02-08T21:11:15.116955Z",
105
+ "shell.execute_reply.started": "2025-02-08T21:11:15.085846Z"
106
+ }
107
+ },
108
+ "outputs": [
109
+ {
110
+ "name": "stdout",
111
+ "output_type": "stream",
112
+ "text": [
113
+ "Paths and fonts initialized successfully.\n",
114
+ "Paths initialized successfully.\n"
115
+ ]
116
+ }
117
+ ],
118
+ "source": [
119
+ "import os\n",
120
+ "import re\n",
121
+ "import json\n",
122
+ "import csv\n",
123
+ "import struct\n",
124
+ "import requests\n",
125
+ "from pathlib import Path\n",
126
+ "import pandas as pd\n",
127
+ "from collections import Counter\n",
128
+ "from concurrent.futures import ProcessPoolExecutor\n",
129
+ "from pathlib import Path\n",
130
+ "import matplotlib.pyplot as plt\n",
131
+ "from matplotlib.font_manager import FontProperties\n",
132
+ "from matplotlib import font_manager\n",
133
+ "import pandas as pd\n",
134
+ "from collections import Counter\n",
135
+ "from concurrent.futures import ProcessPoolExecutor\n",
136
+ "\n",
137
+ "# Define the current directory and important file paths\n",
138
+ "current_dir = Path.cwd()\n",
139
+ "\n",
140
+ "# Define frequently used directories\n",
141
+ "data_dir = current_dir.parent / 'data/CSV'\n",
142
+ "fonts_dir = current_dir.parent / 'misc/assets/fonts'\n",
143
+ "plots_dir = current_dir.parent / 'plots'\n",
144
+ "raw_data_dir = current_dir.parent / 'data/raw/adapter_metadata'\n",
145
+ "temp_dir = current_dir.parent / 'data/models/adapter_temp'\n",
146
+ "misc_dir = current_dir.parent / 'misc'\n",
147
+ "\n",
148
+ "# File paths\n",
149
+ "adapters_csv = data_dir / 'model_subsets/Civiverse_adapters_poi_false.csv'\n",
150
+ "output_json_dir = raw_data_dir\n",
151
+ "api_keys_file = misc_dir / 'api_keys.txt'\n",
152
+ "\n",
153
+ "# Ensure directories exist\n",
154
+ "os.makedirs(output_json_dir, exist_ok=True)\n",
155
+ "os.makedirs(temp_dir, exist_ok=True)\n",
156
+ "\n",
157
+ "# Font paths for Matplotlib\n",
158
+ "font_paths = [\n",
159
+ " fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-ExtraBold.ttf',\n",
160
+ " fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-Bold.ttf',\n",
161
+ " fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-ExtraLight.ttf',\n",
162
+ " fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-Light.ttf',\n",
163
+ " fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-Medium.ttf',\n",
164
+ " fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-SemiBold.ttf',\n",
165
+ " fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-Regular.ttf',\n",
166
+ " fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-ExtraBold.ttf',\n",
167
+ " fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-Bold.ttf',\n",
168
+ " fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-ExtraLight.ttf',\n",
169
+ " fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-Light.ttf',\n",
170
+ " fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-Medium.ttf',\n",
171
+ " fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-SemiBold.ttf',\n",
172
+ " fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-Regular.ttf'\n",
173
+ "]\n",
174
+ "\n",
175
+ "# Load fonts into Matplotlib\n",
176
+ "for font_path in font_paths:\n",
177
+ " font_manager.fontManager.addfont(font_path)\n",
178
+ "\n",
179
+ "# Set default font family for plots\n",
180
+ "plt.rcParams['font.family'] = ['Noto Sans JP', 'Noto Sans SC', 'sans-serif']\n",
181
+ "\n",
182
+ "print('Paths and fonts initialized successfully.')\n",
183
+ "\n",
184
+ "print('Paths initialized successfully.')"
185
+ ]
186
+ },
187
+ {
188
+ "cell_type": "markdown",
189
+ "id": "9c6b84ec-c23e-4047-a288-cfab2f5d8fdb",
190
+ "metadata": {},
191
+ "source": [
192
+ "## Step 1: Download LoRA and extract *.safetensors metadata"
193
+ ]
194
+ },
195
+ {
196
+ "cell_type": "markdown",
197
+ "id": "e484c678-70a8-4b70-b0aa-9ce69414c43b",
198
+ "metadata": {},
199
+ "source": [
200
+ "This script downloads LoRA adapters from the filtered Civiverse-Models dataset and extracts the metadata found within the *.safetensors' data structure"
201
+ ]
202
+ },
203
+ {
204
+ "cell_type": "code",
205
+ "execution_count": 3,
206
+ "id": "ecf31f5c",
207
+ "metadata": {
208
+ "execution": {
209
+ "iopub.execute_input": "2025-02-08T20:49:20.487791Z",
210
+ "iopub.status.busy": "2025-02-08T20:49:20.487464Z",
211
+ "iopub.status.idle": "2025-02-08T20:49:20.499881Z",
212
+ "shell.execute_reply": "2025-02-08T20:49:20.499471Z",
213
+ "shell.execute_reply.started": "2025-02-08T20:49:20.487771Z"
214
+ }
215
+ },
216
+ "outputs": [],
217
+ "source": [
218
+ "output_json_dir = output_json_dir\n",
219
+ "temp_dir = temp_dir ### better delete these later for space efficiency\n",
220
+ "api_karussell = api_keys_file\n",
221
+ "os.makedirs(output_json_dir, exist_ok=True)\n",
222
+ "os.makedirs(temp_dir, exist_ok=True)\n",
223
+ "\n",
224
+ "# Load API keys\n",
225
+ "def load_api_keys(api_path):\n",
226
+ " if not os.path.exists(api_path):\n",
227
+ " raise FileNotFoundError(f\"API keys file does not exist: {api_path}\")\n",
228
+ " with open(api_path, 'r') as file:\n",
229
+ " return [line.strip() for line in file if line.strip()]\n",
230
+ "\n",
231
+ "api_keys = load_api_keys(api_karussell) # Replace with the path to your API keys file\n",
232
+ "current_key_index = 0\n",
233
+ "\n",
234
+ "def get_headers():\n",
235
+ " \"\"\"Generate request headers with the current API key.\"\"\"\n",
236
+ " global current_key_index\n",
237
+ " return {\n",
238
+ " \"Accept\": \"application/json\",\n",
239
+ " \"Authorization\": f\"Bearer {api_keys[current_key_index]}\"\n",
240
+ " }\n",
241
+ "\n",
242
+ "def rotate_api_key():\n",
243
+ " \"\"\"Rotate to the next API key.\"\"\"\n",
244
+ " global current_key_index\n",
245
+ " if current_key_index < len(api_keys) - 1:\n",
246
+ " current_key_index += 1\n",
247
+ " else:\n",
248
+ " raise Exception(\"All API keys have been exhausted.\")\n",
249
+ "\n",
250
+ "# Function to parse .safetensors metadata\n",
251
+ "def parse_safetensors(file_path):\n",
252
+ " try:\n",
253
+ " with open(file_path, 'rb') as f:\n",
254
+ " file_data = f.read()\n",
255
+ " metadata_size = struct.unpack('<I', file_data[:4])[0]\n",
256
+ " metadata_bytes = file_data[8:8 + metadata_size]\n",
257
+ " metadata_str = metadata_bytes.decode('utf-8')\n",
258
+ " metadata = json.loads(metadata_str)\n",
259
+ " return metadata.get('__metadata__', {})\n",
260
+ " except Exception as e:\n",
261
+ " print(f\"Error parsing safetensors file: {e}\")\n",
262
+ " return {}\n",
263
+ "\n",
264
+ "def save_json(data, filename):\n",
265
+ " with open(filename, 'w') as f:\n",
266
+ " json.dump(data, f, indent=4)\n",
267
+ "\n",
268
+ "def download_file(url, output_folder):\n",
269
+ " \"\"\"Download file with API key rotation.\"\"\"\n",
270
+ " filename = url.split(\"/\")[-1]\n",
271
+ " output_path = os.path.join(output_folder, filename)\n",
272
+ "\n",
273
+ " global current_key_index\n",
274
+ " while current_key_index < len(api_keys):\n",
275
+ " try:\n",
276
+ " response = requests.get(url, headers=get_headers(), stream=True)\n",
277
+ " if response.status_code == 401: # Unauthorized\n",
278
+ " print(f\"API key {current_key_index + 1} failed. Trying next key.\")\n",
279
+ " rotate_api_key()\n",
280
+ " continue\n",
281
+ " elif response.status_code == 403: # Forbidden\n",
282
+ " print(f\"Access forbidden for API key {current_key_index + 1}. Rotating to the next key.\")\n",
283
+ " rotate_api_key()\n",
284
+ " continue\n",
285
+ " response.raise_for_status()\n",
286
+ "\n",
287
+ " # Save the file to the specified output folder\n",
288
+ " with open(output_path, 'wb') as file:\n",
289
+ " for chunk in response.iter_content(chunk_size=8192):\n",
290
+ " file.write(chunk)\n",
291
+ " return output_path, filename\n",
292
+ " except requests.exceptions.RequestException as e:\n",
293
+ " print(f\"Error downloading file: {e}\")\n",
294
+ " rotate_api_key()\n",
295
+ " raise Exception(\"All API keys failed.\")\n",
296
+ "\n",
297
+ "def process_csv(csv_path):\n",
298
+ " # Read the CSV and process each row\n",
299
+ " with open(csv_path, newline='', encoding='utf-8') as csvfile:\n",
300
+ " reader = csv.DictReader(csvfile)\n",
301
+ " for index, row in enumerate(reader):\n",
302
+ " version_ids = []\n",
303
+ " for i in range(1, 21):\n",
304
+ " key = f'version_id_{i}'\n",
305
+ " if key in row and row[key]:\n",
306
+ " try:\n",
307
+ " version_ids.append(int(float(row[key])))\n",
308
+ " except ValueError:\n",
309
+ " print(f\"Invalid version_id value '{row[key]}' in row: {row}\")\n",
310
+ "\n",
311
+ " if not version_ids:\n",
312
+ " print(f\"No valid version IDs found in row: {row}\")\n",
313
+ " continue\n",
314
+ "\n",
315
+ " most_recent_version_id = str(max(version_ids))\n",
316
+ "\n",
317
+ " try:\n",
318
+ " adapter_file, filename = download_file(row['downloadUrl'], temp_dir)\n",
319
+ " metadata = parse_safetensors(adapter_file)\n",
320
+ " # Add all CSV data under 'civitaidata'\n",
321
+ " civitaidata = {key: int(value) if value.isdigit() else value for key, value in row.items()}\n",
322
+ " new_json_data = {\n",
323
+ " \"civitaidata\": civitaidata,\n",
324
+ " \"metadata\": metadata,\n",
325
+ " \"versionID\": most_recent_version_id\n",
326
+ " }\n",
327
+ " sanitized_name = row['name'].replace(\" \", \"_\").replace(\"/\", \"_\")\n",
328
+ " new_json_file = os.path.join(\n",
329
+ " output_json_dir,\n",
330
+ " f\"{index:08d}_{most_recent_version_id}_{sanitized_name}.json\"\n",
331
+ " )\n",
332
+ " save_json(new_json_data, new_json_file)\n",
333
+ " print(f\"Created new JSON for versionID {most_recent_version_id} with filename {filename}\")\n",
334
+ " except Exception as e:\n",
335
+ " print(f\"Error processing versionID {most_recent_version_id}: {e}\")"
336
+ ]
337
+ },
338
+ {
339
+ "cell_type": "markdown",
340
+ "id": "708131db-95d8-4427-9397-74c72d3edf48",
341
+ "metadata": {},
342
+ "source": [
343
+ "Uncomment the following to download and process model adapters"
344
+ ]
345
+ },
346
+ {
347
+ "cell_type": "code",
348
+ "execution_count": 4,
349
+ "id": "51ce1943-e509-442d-8b92-1a67bc686471",
350
+ "metadata": {
351
+ "execution": {
352
+ "iopub.execute_input": "2025-02-08T20:49:21.547880Z",
353
+ "iopub.status.busy": "2025-02-08T20:49:21.547383Z",
354
+ "iopub.status.idle": "2025-02-08T20:49:21.550529Z",
355
+ "shell.execute_reply": "2025-02-08T20:49:21.550087Z",
356
+ "shell.execute_reply.started": "2025-02-08T20:49:21.547860Z"
357
+ }
358
+ },
359
+ "outputs": [],
360
+ "source": [
361
+ "csv_path = adapters_csv \n",
362
+ "#process_csv(csv_path) #### UNCOMMENT HERE"
363
+ ]
364
+ },
365
+ {
366
+ "cell_type": "markdown",
367
+ "id": "4c2c52ee-83ad-4269-a7bf-16225b71f427",
368
+ "metadata": {},
369
+ "source": [
370
+ "## Step 2: Accumulate training tags and compare with auto-tagging vocabulary"
371
+ ]
372
+ },
373
+ {
374
+ "cell_type": "code",
375
+ "execution_count": 5,
376
+ "id": "6de2d9a6",
377
+ "metadata": {
378
+ "execution": {
379
+ "iopub.execute_input": "2025-02-08T20:49:31.218029Z",
380
+ "iopub.status.busy": "2025-02-08T20:49:31.217597Z",
381
+ "iopub.status.idle": "2025-02-08T20:49:32.646108Z",
382
+ "shell.execute_reply": "2025-02-08T20:49:32.645340Z",
383
+ "shell.execute_reply.started": "2025-02-08T20:49:31.218005Z"
384
+ }
385
+ },
386
+ "outputs": [
387
+ {
388
+ "name": "stdout",
389
+ "output_type": "stream",
390
+ "text": [
391
+ "Processed JSON files and summary saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/model_subsets/Section_6-5/Lora_metadata.csv\n",
392
+ "Tag summary saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/model_subsets/Section_6-5/LoRA_training_tags_acc.csv\n"
393
+ ]
394
+ }
395
+ ],
396
+ "source": [
397
+ "interrogator_vocab = current_dir.parent / 'misc/autotagging-vocabularies/clip_interrogator.csv'\n",
398
+ "deepbooru_vocab = current_dir.parent / 'misc/autotagging-vocabularies/deepbooru_tags.txt'\n",
399
+ "\n",
400
+ "\n",
401
+ "def load_tags():\n",
402
+ " try:\n",
403
+ " clip_interrogator_tags = pd.read_csv(\n",
404
+ " interrogator_vocab, header=None, usecols=[0], on_bad_lines='skip'\n",
405
+ " )[0].tolist()\n",
406
+ " except pd.errors.ParserError as e:\n",
407
+ " print(f\"Error reading 'clip_interrogator.csv': {e}\")\n",
408
+ " clip_interrogator_tags = []\n",
409
+ "\n",
410
+ " try:\n",
411
+ " with open(deepbooru_vocab, 'r', encoding='utf-8') as file:\n",
412
+ " deepbooru_tags = {line.strip() for line in file}\n",
413
+ " except FileNotFoundError as e:\n",
414
+ " print(f\"Error reading 'deepbooru_tags.txt': {e}\")\n",
415
+ " deepbooru_tags = set()\n",
416
+ "\n",
417
+ " return clip_interrogator_tags, deepbooru_tags\n",
418
+ "\n",
419
+ "clip_interrogator_tags, deepbooru_tags = load_tags()\n",
420
+ "\n",
421
+ "# Determines tagging system based on tag matches\n",
422
+ "def determine_tagging_system(tags):\n",
423
+ " clip_interrogator_matches = sum(1 for tag in tags if tag.strip() in clip_interrogator_tags)\n",
424
+ " deepbooru_matches = sum(1 for tag in tags if tag.strip() in deepbooru_tags or tag.strip().replace(' ', '_') in deepbooru_tags)\n",
425
+ " \n",
426
+ " if deepbooru_matches > clip_interrogator_matches:\n",
427
+ " return 'deepbooru'\n",
428
+ " elif clip_interrogator_matches > deepbooru_matches:\n",
429
+ " return 'clip-interrogator'\n",
430
+ " elif clip_interrogator_matches == 0 and deepbooru_matches == 0 and tags:\n",
431
+ " return 'other'\n",
432
+ " else:\n",
433
+ " return 'no tag metadata'\n",
434
+ "\n",
435
+ "def process_json_file(file_path):\n",
436
+ " try:\n",
437
+ " with open(file_path, 'r', encoding='utf-8') as file:\n",
438
+ " data = json.load(file)\n",
439
+ " except (json.JSONDecodeError, IOError) as e:\n",
440
+ " print(f\"Error processing file {file_path}: {e}\")\n",
441
+ " return None\n",
442
+ "\n",
443
+ " # Handle ss_tag_frequency - Ensure it's parsed correctly\n",
444
+ " tag_frequency_raw = data.get('metadata', {}).get('ss_tag_frequency', {})\n",
445
+ " if isinstance(tag_frequency_raw, str):\n",
446
+ " try:\n",
447
+ " tag_frequency_raw = json.loads(tag_frequency_raw)\n",
448
+ " except json.JSONDecodeError:\n",
449
+ " print(f\"Error decoding ss_tag_frequency in {file_path}\")\n",
450
+ " return None # Skip files with unreadable tag data\n",
451
+ "\n",
452
+ " if not isinstance(tag_frequency_raw, dict) or not tag_frequency_raw:\n",
453
+ " return None # Skip files with empty or nonexistent ss_tag_frequency\n",
454
+ "\n",
455
+ " filename = os.path.basename(file_path).replace('.json', '')\n",
456
+ " modelspec_title = data.get('metadata', {}).get('ss_output_name', '') # Using ss_output_name instead\n",
457
+ " modelspec_architecture = data.get('metadata', {}).get('ss_network_module', '')\n",
458
+ " ss_num_train_images = data.get('metadata', {}).get('ss_num_train_images', 0)\n",
459
+ " ss_steps = data.get('metadata', {}).get('ss_steps', 0)\n",
460
+ " ss_sd_model_name = data.get('metadata', {}).get('ss_sd_model_name', '')\n",
461
+ "\n",
462
+ " # Extract first-level tag frequencies\n",
463
+ " tag_frequency = next(iter(tag_frequency_raw.values()), {}) # Get the first nested dict\n",
464
+ " tags = list(tag_frequency.keys())\n",
465
+ " training_system = determine_tagging_system(tags) if tags else 'undetermined'\n",
466
+ "\n",
467
+ " tag_items = list(tag_frequency.items())[:20]\n",
468
+ " tag_data = {f'tag{i+1:02d}': tag_items[i][0] if i < len(tag_items) else None for i in range(20)}\n",
469
+ " tag_data.update({f'tag{i+1:02d}_no': tag_items[i][1] if i < len(tag_items) else None for i in range(20)})\n",
470
+ "\n",
471
+ " row = {\n",
472
+ " 'filename': filename,\n",
473
+ " 'modelspec_title': modelspec_title,\n",
474
+ " 'modelspec_architecture': modelspec_architecture,\n",
475
+ " 'ss_num_train_images': ss_num_train_images,\n",
476
+ " 'ss_steps': ss_steps,\n",
477
+ " 'ss_sd_model_name': ss_sd_model_name,\n",
478
+ " 'training_system': training_system\n",
479
+ " }\n",
480
+ " row.update(tag_data)\n",
481
+ "\n",
482
+ " return row, tag_frequency\n",
483
+ "\n",
484
+ "def parallel_process_files(folder_path):\n",
485
+ " rows = []\n",
486
+ " tag_occurrences = Counter()\n",
487
+ " json_files = [os.path.join(root, file) for root, _, files in os.walk(folder_path) for file in files if file.endswith('.json')]\n",
488
+ "\n",
489
+ " with ProcessPoolExecutor(max_workers=4) as executor:\n",
490
+ " results = executor.map(process_json_file, json_files, chunksize=1) # Using small chunksize for debugging\n",
491
+ "\n",
492
+ " for result in results:\n",
493
+ " if result:\n",
494
+ " row, tag_frequency = result\n",
495
+ " rows.append(row)\n",
496
+ " tag_occurrences.update(tag_frequency)\n",
497
+ "\n",
498
+ " total_tags_count = sum(tag_occurrences.values())\n",
499
+ " unique_tags_count = len(tag_occurrences)\n",
500
+ "\n",
501
+ " return rows, total_tags_count, unique_tags_count, tag_occurrences\n",
502
+ "\n",
503
+ "def write_summary_and_csv(rows, total_tags_count, unique_tags_count, tag_occurrences, output_file, tag_summary_file):\n",
504
+ " if not rows:\n",
505
+ " print(\"No valid data to write to CSV.\")\n",
506
+ " return\n",
507
+ "\n",
508
+ " df = pd.DataFrame(rows)\n",
509
+ " df.to_csv(output_file, index=False, encoding='utf-8')\n",
510
+ "\n",
511
+ " tag_summary_df = pd.DataFrame(list(tag_occurrences.items()), columns=['Tag (ss_tag_frequency/tag_frequency)', 'No. of Occurrences'])\n",
512
+ " tag_summary_df = tag_summary_df.sort_values(by='No. of Occurrences', ascending=False)\n",
513
+ " tag_summary_df.to_csv(tag_summary_file, index=False, encoding='utf-8')\n",
514
+ "\n",
515
+ " print(f\"Processed JSON files and summary saved to {output_file}\")\n",
516
+ " print(f\"Tag summary saved to {tag_summary_file}\")\n",
517
+ "\n",
518
+ "\n",
519
+ "def main():\n",
520
+ " folder_path = '/home/lauwag/shares/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/adapter_metadata' # Path to your JSON files\n",
521
+ " \n",
522
+ " output_dir = current_dir.parent / 'data/CSV/model_subsets/Section_6-5/'\n",
523
+ " os.makedirs(output_dir, exist_ok=True) # Ensure the directory exists\n",
524
+ "\n",
525
+ " output_file = output_dir / 'Lora_metadata.csv'\n",
526
+ " tag_summary_file = output_dir / 'LoRA_training_tags_acc.csv'\n",
527
+ "\n",
528
+ " rows, total_tags_count, unique_tags_count, tag_occurrences = parallel_process_files(folder_path)\n",
529
+ " write_summary_and_csv(rows, total_tags_count, unique_tags_count, tag_occurrences, output_file, tag_summary_file)\n",
530
+ "\n",
531
+ "if __name__ == \"__main__\":\n",
532
+ " main()\n"
533
+ ]
534
+ }
535
+ ],
536
+ "metadata": {
537
+ "kernelspec": {
538
+ "display_name": "Python 3 (ipykernel)",
539
+ "language": "python",
540
+ "name": "python3"
541
+ },
542
+ "language_info": {
543
+ "codemirror_mode": {
544
+ "name": "ipython",
545
+ "version": 3
546
+ },
547
+ "file_extension": ".py",
548
+ "mimetype": "text/x-python",
549
+ "name": "python",
550
+ "nbconvert_exporter": "python",
551
+ "pygments_lexer": "ipython3",
552
+ "version": "3.11.9"
553
+ }
554
+ },
555
+ "nbformat": 4,
556
+ "nbformat_minor": 5
557
+ }
jupyter_notebooks/SuppM_Figure_13_Danbooru_categories.ipynb ADDED
@@ -0,0 +1,141 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 2,
6
+ "metadata": {},
7
+ "outputs": [],
8
+ "source": [
9
+ "import json\n",
10
+ "from pathlib import Path\n",
11
+ "\n",
12
+ "current_dir = Path.cwd()\n",
13
+ "input_json = current_dir.parent / \"misc/lists/danbooru.json\"\n",
14
+ "output_json = current_dir.parent / \"public/json/danbooru_flat.json\"\n",
15
+ "\n",
16
+ "# ---- CATEGORY COLORS ----\n",
17
+ "CATEGORY_COLORS = {\n",
18
+ " \"attire_and_body_accessories\": \"#DC143C\",\n",
19
+ " \"body\": \"coral\",\n",
20
+ " \"characters\": \"silver\",\n",
21
+ " \"copyrights\": \"#264653\",\n",
22
+ " \"creatures\": \"silver\",\n",
23
+ " \"drawing software\": \"#219ebc\",\n",
24
+ " \"games\": \"silver\",\n",
25
+ " \"metatags\": \"silver\",\n",
26
+ " \"more\": \"silver\",\n",
27
+ " \"objects\": \"#6d6875\",\n",
28
+ " \"plant\": \"#7cb518\",\n",
29
+ " \"real_world\": \"#a5a58d\",\n",
30
+ " \"sex\": \"#ef476f\",\n",
31
+ " \"visual_characteristics\": \"#06d6a0\",\n",
32
+ " \"subject\": \"#ffd166\",\n",
33
+ " \"uncategorized\": \"#adb5bd\",\n",
34
+ " \"actions_and_expressions\": \"#d00000\",\n",
35
+ " \"objects_and_backgrounds\": \"#118ab2\"\n",
36
+ "}\n",
37
+ "\n",
38
+ "# ---- STEP 1: Load the original nested JSON ----\n",
39
+ "with open(input_json, \"r\", encoding=\"utf-8\") as f:\n",
40
+ " nested_data = json.load(f)\n",
41
+ "\n",
42
+ "# ---- STEP 2: Recursively extract paths and limit tags ----\n",
43
+ "def extract_tags(data, path=None, result=None):\n",
44
+ " if path is None:\n",
45
+ " path = []\n",
46
+ " if result is None:\n",
47
+ " result = []\n",
48
+ "\n",
49
+ " if isinstance(data, dict):\n",
50
+ " for key, value in data.items():\n",
51
+ " extract_tags(value, path + [key], result)\n",
52
+ " elif isinstance(data, list) and len(data) > 0:\n",
53
+ " limited_tags = [data[0]] + [\"...\"] if len(data) > 1 else data\n",
54
+ " result.append({\n",
55
+ " \"level_path\": \" > \".join(path),\n",
56
+ " \"top_tags\": limited_tags,\n",
57
+ " \"tag_count\": len(data)\n",
58
+ " })\n",
59
+ "\n",
60
+ " return result\n",
61
+ "\n",
62
+ "# ---- STEP 3: Build the nested structure ----\n",
63
+ "def build_tree_from_flat_list(flat_data):\n",
64
+ " tree = {\"name\": \"root\", \"children\": []}\n",
65
+ "\n",
66
+ " for row in flat_data:\n",
67
+ " parts = row['level_path'].split(\" > \")\n",
68
+ " current = tree\n",
69
+ " for i, part in enumerate(parts):\n",
70
+ " match = next((child for child in current.get(\"children\", []) if child[\"name\"] == part), None)\n",
71
+ " if not match:\n",
72
+ " match = {\"name\": part, \"children\": []}\n",
73
+ " current.setdefault(\"children\", []).append(match)\n",
74
+ "\n",
75
+ " # If this is the top-level category, set its color\n",
76
+ " if current[\"name\"] == \"root\":\n",
77
+ " color = CATEGORY_COLORS.get(part.lower(), \"#888888\")\n",
78
+ " match[\"color\"] = color\n",
79
+ " else:\n",
80
+ " # Inherit from parent\n",
81
+ " match[\"color\"] = current.get(\"color\", \"#888888\")\n",
82
+ "\n",
83
+ " current = match\n",
84
+ "\n",
85
+ " current[\"tags\"] = row[\"top_tags\"]\n",
86
+ " current[\"direct_tag_count\"] = row[\"tag_count\"]\n",
87
+ "\n",
88
+ " return tree\n",
89
+ "\n",
90
+ "# ---- STEP 4: Recursively compute tag and category sizes ----\n",
91
+ "def compute_sizes(node):\n",
92
+ " tag_count = node.get(\"direct_tag_count\", 0)\n",
93
+ " category_count = 0\n",
94
+ "\n",
95
+ " for child in node.get(\"children\", []):\n",
96
+ " compute_sizes(child)\n",
97
+ " tag_count += child.get(\"tag_count\", 0)\n",
98
+ " category_count += 1 + child.get(\"category_count\", 0)\n",
99
+ "\n",
100
+ " node[\"tag_count\"] = tag_count\n",
101
+ " node[\"category_count\"] = category_count\n",
102
+ "\n",
103
+ "# ---- STEP 5: Generate and save final JSON ----\n",
104
+ "flat_data = extract_tags(nested_data)\n",
105
+ "tree_data = build_tree_from_flat_list(flat_data)\n",
106
+ "compute_sizes(tree_data)\n",
107
+ "\n",
108
+ "with open(output_json, \"w\", encoding=\"utf-8\") as f:\n",
109
+ " json.dump(tree_data, f, ensure_ascii=False, indent=2)\n"
110
+ ]
111
+ },
112
+ {
113
+ "cell_type": "code",
114
+ "execution_count": null,
115
+ "metadata": {},
116
+ "outputs": [],
117
+ "source": []
118
+ }
119
+ ],
120
+ "metadata": {
121
+ "kernelspec": {
122
+ "display_name": "latm",
123
+ "language": "python",
124
+ "name": "python3"
125
+ },
126
+ "language_info": {
127
+ "codemirror_mode": {
128
+ "name": "ipython",
129
+ "version": 3
130
+ },
131
+ "file_extension": ".py",
132
+ "mimetype": "text/x-python",
133
+ "name": "python",
134
+ "nbconvert_exporter": "python",
135
+ "pygments_lexer": "ipython3",
136
+ "version": "3.10.15"
137
+ }
138
+ },
139
+ "nbformat": 4,
140
+ "nbformat_minor": 2
141
+ }
jupyter_notebooks/SuppM_Figure_S12_asset_types.ipynb ADDED
@@ -0,0 +1,129 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "id": "c0d18a6a",
7
+ "metadata": {},
8
+ "outputs": [],
9
+ "source": [
10
+ "import pandas as pd\n",
11
+ "import matplotlib.pyplot as plt\n",
12
+ "from matplotlib.ticker import FuncFormatter\n",
13
+ "from pathlib import Path"
14
+ ]
15
+ },
16
+ {
17
+ "cell_type": "code",
18
+ "execution_count": 2,
19
+ "id": "98d25755",
20
+ "metadata": {},
21
+ "outputs": [],
22
+ "source": [
23
+ "current_dir = Path.cwd()\n",
24
+ "\n",
25
+ "def sortByFrequency_model_types_csv(csv_path, output_svg_path):\n",
26
+ " hatch_pattern = '\\\\\\\\\\\\\\\\\\\\\\\\' # Hatch pattern for the bars\n",
27
+ "\n",
28
+ " # Read the CSV file\n",
29
+ " df = pd.read_csv(csv_path)\n",
30
+ "\n",
31
+ " if 'type' not in df.columns:\n",
32
+ " return \"The CSV file does not contain a 'type' column.\"\n",
33
+ "\n",
34
+ " # Count the occurrences of each model type\n",
35
+ " type_counts = df['type'].value_counts().reset_index()\n",
36
+ " type_counts.columns = ['Type', 'Count']\n",
37
+ " total = type_counts['Count'].sum()\n",
38
+ " type_counts['Percentage'] = (type_counts['Count'] / total * 100).round(2)\n",
39
+ "\n",
40
+ " # Sort the data in ascending order\n",
41
+ " type_counts = type_counts.sort_values(by='Count', ascending=True)\n",
42
+ "\n",
43
+ " # Plotting\n",
44
+ " plt.figure(figsize=(10, 3.5))\n",
45
+ " bars = plt.barh(type_counts['Type'], type_counts['Count'], color='white', hatch=hatch_pattern, edgecolor='coral')\n",
46
+ " plt.xlabel('Counts', fontweight='bold')\n",
47
+ " plt.ylabel('Asset Type', fontweight='bold')\n",
48
+ "\n",
49
+ " ax = plt.gca()\n",
50
+ "\n",
51
+ " # Hide all axis spines\n",
52
+ " for spine in ax.spines.values():\n",
53
+ " spine.set_visible(False)\n",
54
+ "\n",
55
+ " # Keep ticks visible\n",
56
+ " ax.xaxis.set_ticks_position('bottom')\n",
57
+ " ax.yaxis.set_ticks_position('left')\n",
58
+ "\n",
59
+ " # Bold tick labels\n",
60
+ " for label in ax.get_xticklabels() + ax.get_yticklabels():\n",
61
+ " label.set_fontweight('bold')\n",
62
+ "\n",
63
+ " # Format x-axis ticks: 25000 → 25 k\n",
64
+ " ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: f'{int(x/1000)} k' if x >= 1000 else f'{int(x)}'))\n",
65
+ "\n",
66
+ " # Add percentage labels to the bars\n",
67
+ " for bar, percentage in zip(bars, type_counts['Percentage']):\n",
68
+ " plt.text(bar.get_width() + 5, bar.get_y() + bar.get_height()/2,\n",
69
+ " f' {percentage}%', va='center', color='blueviolet', fontweight='bold')\n",
70
+ "\n",
71
+ " plt.tight_layout()\n",
72
+ " plt.savefig(out_file, format='svg')\n",
73
+ " plt.show()\n"
74
+ ]
75
+ },
76
+ {
77
+ "cell_type": "code",
78
+ "execution_count": 4,
79
+ "id": "9e610dec",
80
+ "metadata": {},
81
+ "outputs": [
82
+ {
83
+ "data": {
84
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAA94AAAFUCAYAAADS2eS8AAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQAAw3xJREFUeJzs3Xd4FNX6wPHv7G6y6Q0CIUBCDQQQkGJAWlBpIgoIVhREf3hRrFe9otJURFEsV0UvwqWKBcSChQ6hhg6GGkognZbeNlvO74+52bAmQIKEIL6f55knkzNnzpyzIXl45zRNKaUQQgghhBBCCCFElTBUdwWEEEIIIYQQQojrmQTeQgghhBBCCCFEFZLAWwghhBBCCCGEqEISeAshhBBCCCGEEFVIAm8hhBBCCCGEEKIKSeAthBBCCCGEEEJUIQm8hRBCCCGEEEKIKiSBtxBCCCGEEEIIUYUk8BZCCCGEEEIIIaqQBN5CCCGEEEIIIUQVksBbCCGEEEIIIYSoQhJ4CyGEEEIIIYQQVUgCbyGEEEIIIYQQogpJ4C2EEEIIIYQQQlQhCbyFEEIIIYQQQogqJIG3EEIIIYQQQghRhSTwFkIIIYQQQgghqpAE3kIIIYQQQgghRBWSwFsIIYQQQgghhKhCEngLIYQQQgghhBBVSAJvIYQQQgghhBCiCpmquwJCVImsM1CQU921EEIIIYQQ4vrj5QcBwdVdi78UTSmlqrsSQlxxk+8Dq6W6ayGEEEIIIcT1x+QOYz6pUPCdmWhj06fZJO+wUJBhx93bQI0mbnR42Jemt3i65D2yuoAd8/I4dbAYZQef2kZaD/Ym6jG/ClXrWEwhS5486/z+uZ31MJk1AOxWxe6v84hbkk92sg2TWaNBFw96POePb4jeH11c4GDlG5kcjynCYILI272IfjEAg1EvIzvFxuyB6fSeGEiL/t4VqlMJ6fEW1yerBQY/C2dTYf23ENkZug4GTdOv/foFZKRD/1FQK0y/J/EArJgLYZFw6zAwmsBug9ULIPEg9B4OYS30vKcT4ZcZEBQCt/8fuJlBKdi4BA5uge73QPOb9LxZZ+Dnz8DdA+4YDV6+evquVbBjGXToC+1u09MKcvW8xUV63pI/Zoe2STukHdIOaYe0Q9oh7ZB2SDukHdXfjlXzICFOH116icBbKcWiUafJTrZjdIcaTdzITrGRvMNC8k4LwxfVplZzdwC2z8lh3XvZAHjXNOAdbKQgw8HJ2KIKBd75Z+0sG5dxwevLJ2aw/8cCAGo2MZF/1sHBXwpI2W1hxHchmH0NbP0ihwNLC7hnVjDZSTaWT8ykRhM32gzxAWDFpEzqdzBXOugu+TCEUEop1aNHDwWo8PDwC+b58ssvVXR0tAoICFBubm6qfv366uGHH1b79+93yTdhwgQFOA9N01RAQIDq0aOHWrVqVbllz54925nfYDCoxMTEy2/MhIFKpRzVz3euVGrCIKWWfqaU3a6nFRUoNfNlpd56QKmkw6X3Hd6u1OtDlPpqilLWYj3NWqx///oQ/XqJpMP6/TNf1stTSi9/6Wf683auLM17Jlmp90Yq9fEYpXIyStPXfavXdd23pWk5GXq+90bq95WQdkg7pB3SDmmHtEPaIe2Qdkg7qrsdyUdc/699ETlpVjW1ZaKa2jJRxc7MVkopdXJroTPt6Fq9zOxUq3qvjZ62c0GOcjgczjIsefZLPkcppRb947Sa1jZRLXnqjLN8a5HDWca7N+hpa9/NVEopVZhtVx90SFJTWyaqLf/R67b4H6fV1JaJylbsUOcSitXUlolq5Zv6z2Tfj3nqg45JKjvVWqH6/JEE3sLpUoH3o48+6gyM/fz8VPPmzZXRaFSA8vT0VMuXL3fmPT/wbtu2rWrTpo1L3vKC6pLnlxxvvPHG5TdmwkCl9m8p/V7+qEo7pB3SDmmHtEPaIe2Qdkg7pB1/vh0pRysceNttDjWjX6qa2jJRTbsxUc0ZkqY+6pyk3muTqH577Zyy2/TAeMe8HDW1ZaL6oEOSWvrSWfXvm5PVpz2S1c//Oqtyz9gu+ZydC/T7t83JURs/ySoTeBfl2tXUVnraummZelpOaeD99SOnlFJKrf8wU01tmahOxBaqvYty1dSWiWrPolyVn2FTH3dNVjvm51yyLhcigbdwuljg/d133zkD4rvuuksVFRUppZT6/fffVc2aNRWgateurfLz85VSroF3QkKCUkqpWbNmOdMWL17sUv7x48eVpmkKUB06dFCAatKkyeU3ZsJApd55WP6oSjukHdIOaYe0Q9oh7ZB2SDukHVeyHSf2VzjwVkrv9Z47NM0ZDE9tmag+6Zasdi4sDWJXvH7Oee29Nonqv3elOnvA5wxJU7ZixwXLP3OkWL3fLkl9+3+nlcPhKDfwVkrvES9J/+/ANPVx12Tn91/0T1VKKWXJt6ufXz6rPuqcpD7umqxWT8lQdptDLX3prJr/QLo6d7xYfT3ylPqoU5Kad2+6Sv29qEKfgVISeIvzXCzwHjhwoDNoPnHihMu184PsH374oUxaeYH3tm3byi0jJCRE7d6925lvw4YNl6x3UVGRys7OdjmKXh2g1Aej5I+qtEPaIe2Qdkg7pB3SDmmHtEPacSXbMf3ZCgfeDrtDLR6tB7yrp2QoS75dHVqe7wx441fpnXbLJpQG3vt+ylNK6UO7S9JObi284DNmD0pTn3RLdvaMXyjwLsyyqxVvZKjPbklRH3RIUguHn1Lz7tFfCPz3rtQLln98Q4Ga1jZRnY63qPn3pat/d05WCZsL1Yy+qeqzW1Mu+lLgfBJ4C6eLBd6RkZEKUAEBAWWuff/9985A+Z133lFKuQbebdu2VW3btlVGo1G5ubmpV155xeV+h8OhGjZsqAD1/PPPK6WUat26tQLUo48+esl6n/+skmNCj2ZKHdktf1SlHdIOaYe0Q9oh7ZB2SDukHdKOK9mON++rcOCdsKl0Pnf6AYsz/cOoJJf50xs/LQ2Wzx3X63DueLEzLe77vAs+o6SX/IMOSeqDDknOnvKSoeu7vip/eLjD4VAz79CHwX//zJly81jy7erzXilqw8dZypJnd8m7ZmqGmtoyUZ0+bCn33j+SwFs4VSTwDgwMLHPtxx9/vGjgff4RFhamtmzZ4nL/2rVrndd3796tlFLq3Xffdc4lLxm+fiEX7PFOOSp/VKUd0g5ph7RD2iHtkHZIO6Qd0o4r2Y7dayoceJ/fu71nUa5SSumLlv1hvnXSziJnvv1L9SB7/9LSHu+knfqQ7p1f5qiZd6SqmXeU9lCfP4S9vGP7PD3wPnO0WOWfK50vHjsr25nn0LLy443Vb2eoWQNSldXi0OeJt0xUP71wVv9RvJ8pgbe4PJc71HzixImXHGoeHx+vGjdurABVp04dlZdX+tZq+PDhzrz+/v7K399feXt7O9Pmz59f+cac/8dA/qhKO6Qd0g5ph7RD2iHtkHZIO6QdV6YdlVhcrSDTpv7dWZ9L/e4N+tzqkgXN3muTqNIPlgatJauRv9dGz1fSc/31o6ecec4fRn4hFxp
qvvW/2eq9Nolq1oBUNb1nijPP90+fcVlFvUTq70XqvTalQb9SSs27N119fluKyjtjU/PuSVOf3SJDzcVlKAm8w8LCVGFhocuxePFiZyA8cOBA5+JqcXFxFV5c7aeffnKmTZ06VSmlVG5urkuQXd5x6623Vr4xEwYqtfTz0u/lj6q0Q9oh7ZB2SDukHdIOaYe0Q9rx59tRicBbKaXOHi1WS186qz6/LUVNuzFRfdojWS3+x2mVvLvIJZ+1yKHWvZ+pPrtVz/dF/1S14eMsVVxod+b5M4H38Q0Fas6QNPXhTUlqWlt9XvfWWdnKbi0bONutDjV7UJpa8fo5l/Rzx4vVwodPqQ86JKk5Q9JUyp6iMvdeiATewumP23mdf3zwwQdq5MiRLj3TkZGRzi3CPDw8LridWEng7XA4VKtWrZyLqBUWFrrs3b1v3z6X+nz44YeXv6f3hIHyR1XaIe2Qdkg7pB3SDmmHtEPaIe240u3Y8lOlAm+hk8BbOF0q8FZKqQULFqgePXoof39/ZTKZVN26ddVDDz1UJmguL/BWSqn58+c70z/99FPnMyMiIsrUJzEx8fL39C7p8ZY/qtIOaYe0Q9oh7ZB2SDukHdIOaceVa8ekuyXwvgyaUkohxPVm4iAY/CwkHoIdy6BDX2h3m36tIBd+/gyKi+CO0RAQrKcf2gbrv4XIztB1MGgaWC3w6xeQkQ79R0GtMD1v4gFYMRfCIuHWYWA0gd0GqxdA4kHoPRzCWuh5TyfCLzMgKARu/z9wM4NSsHEJHNwC3e+B5jfpebPO6HVz99Dr5uWrp+9aJe2Qdkg7pB3SDmmHtEPaIe2QdlR/O379AtKOwaj3ILQxomIk8BbXp8n36X9EhBBCCCGEEFeWyR3GfFIa0ItLksBbXJ+yzkBBTnXXQgghhBBCiOuPl58E3ZUkgbcQQgghhBBCCFGFTNVdASGqxLXW4y1vBYUQQgghhPjbkh5vcX261uZ4m9xgzKeVCr6L8x3MuTud7GQ7AL3GBdL2Xp+L5t/4cTbJuyzkpNqxFip8Q4w07+vFTSN9cfc2OPOejC1i68wczsRbseQ68AgwULetmZtH+xEc4Q7AsZhCYt7PIjvFTs3GJm59NZDQ1mZnGSvfyCB5p4WHF4VgdNMq+4kIIYQQQgjxtyE93tVoxIgRzJ07lx49erBu3boqf56m6cHR7NmzGTFiRJU/b86cOTzyyCMAXPX3O1aLvqq5yVz9qz+eOgk/fqz3wFci8F41OdMZdFdEYZaDnQvyMLpDUEM38k7byTxpY8t/ckg/UMyQz/RnZ5yw8t3oM9it4OFnoEYTN84esRK/spDknRZGrw2lOF+x9IVz1Gnjzv3zarFw2Gl+fO4co1eHApC8y8Lv3+Vz/9xaEnQLIYQQQghxCYZLZxGXq6ioiPfff5+oqCj8/Pzw8vIiIiKCxx9/nOPHj1d39apccHAwUVFRREVFVfreESNGoGka0dHRl1+BmvWgRScY+Za+9cGyWeAbpG97cMfj0PMBPfg+sktPa9IWHn1bHxb+2xd6sB7aGG65H+58Eg7Gwu5VENIQwlvo5YY0gN9mgsOu5+00AO57GZIOwabv9YC8dnilq35oWQH7fyqgWR/PCt9jNGv0+Kc/YzbUZcR3IfxjVSh12ui91wkbiijKdgCQFleM3arfc/fnNRm+KISox/wAPXi3FigyT1qxFipCb3DH099I7Uh38k7ZKci0Y7cqVkzMoO09PoS2MZdbFyGEEEIIIUQpCbyrSGZmJjfffDP//Oc/2bZtGwCNGzfm1KlTzJgxg/Xr11dzDate//79iY2NJTY2tnoqcEj/3KlZF4a/DpYCmDsecjP19B5D9eB77UKIWaSn+Qbqec1eet6zKXp6u9vgzidgxwr4dQY4HGD2hGHj9eB6/iRIjtfzRnSAe/8FR3bC4ml60F8JOWk2VryeQe0WbnR72r/C9/nUNHLTI37OIeUms0adlnrgrRnA8L/xLaGt3TG66effjT7L3KHpbJ2Zg9lX45axAZh9DQSGueHmqZEaV0xhtp1TB4vxqW3EK9DIlv/kYC1UdHu24nUTQgghhBDi70wC7yoyZswYdu/eDcCLL75IRkYGcXFxZGdnExMTQ7NmzVzyz5w5k4YNG+Lr68sdd9xBenq6y/UFCxbQsWNHvLy88PX1pW/fvuzZs8clT3p6OqNGjaJ+/fq4u7tTu3ZtHnjggQvW8fvvv8fNzQ1N05g8eTIADRo0QNM0Xn75ZcaMGUNQUBD+/v488cQTWCylc6YLCwt59dVXadKkCe7u7gQFBTFw4EDi4uKceebMmYOmac4h7gDR0dFomsbDDz/MhAkTqFOnDoGBgQwbNozc3FxnHebOnQtATEyMs4xKD8df/60+pByqN/hevaDCVVYOxa9jM3BY4Y6pNTCYLn8Yd/45O/GrCgFo3s/LGZAHhrtxz8xaeAUZKMp2cPqgFYcNfGsbqdlYj8g9/A0MeK8G+aftfH5rGm6eGne9X4Ozx6xsm5VDr3GB7P4qj89vS2V6dApr383EYZPlIoQQQgghhCiXEldcVlaWMplMClBt2rRRDoej3HzDhw9XgPL09FQeHh6qadOmClCAeuCBB5z53nnnHWd6RESECg0NVYDy9vZWBw4cUEopdfbsWRUeHu7M17RpUxUWFqYCAgKc5ZRcmz17tvrtt9+Uu7u7AtSUKVOceUrKMJvNqkaNGqphw4bO+5577jlnvttuu00BStM01bx5c+Xj46MA5ePjow4ePKiUUmr27NnOe0v06NFDAcrNzU35+vq6lP/KK68opZQaOHCgqlmzpgKUr6+vioqKUlFRUWrnzp3lfo5FRUUqOzvb5Sh6dYBSX7+j1IRBSu1cWZr5TLJS741U6uMxSuVklKav+1apCQP1ryVyMvR8743U7yuxc6Ve7tLPlLLb/1eJAqVmvqzUWw8olXS4NO/h7UpNulsvO+VoufU/3/a5OWpqy0S1d3GuUkqprGSrmtoyUU1tmah2f517yftLZJy0qi9uT1VTWyaqL4elK0uevbRZ6VY1o59+7eBv+cqSb1er385QU1smqvfbJ6nc07Zyy3TYHWrBg+lq6Ytn1bH1BWpqy0S14o0MteU/2ZWunxBCCCGEEH8n0uNdBeLj47HZ9OHF3bp1c+nxLY/FYiE2Npb4+HgGDRoEwOrVqwEoKChg0qRJAEyaNInDhw9z8uRJOnToQH5+Pm+99RYAn376KSdPngTg22+/JT4+npMnT7JmzZoyz1u/fj2DBw+muLiYd955h5dffrlMnrCwMBISEjh+/Dj333+/8xnZ2dmsXbuWVav0nuT333+fgwcPcvDgQXx8fMjLy2PKlCmX/Iw8PDw4ePAgR48epX379i5t/v777+nfvz8A7dq1cw5Xb9euXbllTZkyBX9/f5djysZ4fSG0Dr3hp+nV2/Pde/glP48Spw8XA7Dm7Sw+7JjM7IGlIx/WvJPJlw+eumQZKXssfPngKTJP2mgc7cHQGcEuK5rv+TqPrEQb7j4azft64e5loOWd3gDYihQpu8tfDX7313lknr
Bxy8sBnIwtAqDtPd60e1Bfaf3ElqIKt1MIIYQQQoi/Ewm8q4A6bwXvSwXdADfccANt2rQBoEWLFgCcOqUHWPv376egoACACRMmoGkabm5u7NixA8A5f3rr1q0ANGnShKFDhzrLvvHGG8s8b/bs2RQWFvLcc8/x0ksvlVunO+64A19ffQXw++67D4Di4mLi4+PZvn27M1/JUPZ69erRrVs3AGfdLuaWW26hbt26GAwGmjdv7tLmyho7dizZ2dkux9iuEfrq47ePqv7gO6xFpdtkLVTOo4S9GKxF+ve5p2zMGpDGrAFpxK8qcOY5vKKAbx89Q2Gmg3YP+DDo3zVx83T9Nbfk6WUU5ysyTuirrKXvL3Zed/Ms+282N93Gho+yiX4xAK8gIyX/xI1u2p8aDi+EEEIIIcTfgWwnVgWaNWuGyWTCZrOxceNGlFIXDcADAgKc5ybThX8kkZGR+Pn5uaTVqFGj0vUr6Zn+6quvePLJJ2ncuHGly/izymuzuswtx8xmM2bzH1bXNhn1Lb8MBj34Bj34Bj1oLgm+547Xj+Gv60F2j/+9tFi7UP/aY2hp8H1+3pp1S7chKyn39lGlwfeC1/Xg+6EJYDBWuC23T67B7ZNLf6bZKTZm9EkDXPfxdtggI0EfVVH8v0A677Sdn/55DhQY3SBtXzFfDjvtLKvXa4HUbuFO01s92f11HiiYN/QU/vVMnDumB+B+oUbqdyy7UvnKNzMJbWOm1V16z3iDTh7snJfH8Y1F+Ibo7QvvJCucCyGEEEIIUR7p8a4C/v7+3HPPPQDs3r2bV155xTn0HGDVqlVs3ry5QmW1bNkST099S6m+ffuyZcsW59Drzz77jFdffRXAuWXX0aNHWbJkifP+Py7ABvDGG2/QqlUr0tPT6dWrF2lpaWXy/PLLL+Tl5QH60HUAd3d3IiIi6NixozPfwoV6gJqcnMyGDRsA6NChQ4XadjFeXl4A5OfnX34hv34BlsLS4Lu6er5PJ15+GyrBblX6bHnAboW034tdDkuevp1YeCcPhnxWk/BOZty8NDJPWvGrY6T13d7cP7cWbh6ufxYOLSsgcauFXuMDnWmNunvS9Sl/ts3KYeWkTNo96EObIT5XpZ1CCCGEEEL85VTzHPPr1rlz51Tbtm2dC4f5+fmp1q1bq8DAQOcCZyWLq/Xo0cN534QJE8osSPbWW28500JDQ1WbNm1UUFCQAtSECROUUmUXV4uIiFANGjS44OJqSUlJql69egpQN9xwg8rI0BcaKynD29tb1axZUzVq1Mh53zPPPOMs6/zF1SIjI5Wvr2+lFlcbPny4M63kcwgPD3emffTRR857W7VqpaKiolRBQUHFfwATBir15n36gmdF/7vPbtcXRLvaC669eV+FF1cTQgghhBBCXH+kx7uKBAUFsWXLFt577z06duyIw+Hg8OHDBAYG8thjj9G9e/cKlzV27Fjmzp1Lx44dyczM5OjRo9SqVYt//OMfDB48GNCHnMfGxvJ///d/1K1bl+PHj1NQUEDfvn3LLbNevXosW7aMgIAA4uLi6N+/v3MuOcAzzzzDsGHDyMzMxNfXl8cff5y3337bef2nn37ilVdeoWHDhhw5cgSTycRdd93F5s2bnXO2/4yRI0dy99134+/vz759+9i6dSt2u71yhfQfpfc2L3i9enu+g0L+9OchhBBCCCGE+OvSlLrMibXiutSgQQNOnjzJhAkTmDhxYnVX5/JNHASDn9UD4F9m6MHv7f8HbmZQCjYugYNboPs90Pwm/Z6sM/DzZ+DuAXeMBi99cTl2rYIdy6BD39J53QW5et7iIj1vQLCefmibvn94ZGd9VXVNg7TjsHQ6jHoPQq/+fHohhBBCCCFE9ZLAW7i4bgLvyfeCtfjS+a4WNzM8+XFpgC6EEEIIIYT425BVzcX16clPoCCnumtRystPgm4hhBBCCCH+pqTHWwghhBBCCCGEqELS4y2uT1lnrp0eb+ntFkIIIYQQ4m9NerzF9WnyfWC1VHctdCZ3GPNJpYLv4nwHc+5OJztZX8m917hA2t578X2y93ybx4Gl+Zw+ZMVaqP9aj/wphBqN3MrNf+pgMV8+cAq7lTJ5j8UUEvN+Ftkpdmo2NnHrq4GEtjY77135RgbJOy08vCgEo5tW4XYJIYQQQgjxdyQ93tVkxIgRzJ07lx49erBu3brqrs4VMWfOHB555BEAqv19jtWir2oeGAKrF0DiQeg9HMJa6NdPJ16d1c5tFljyod77XonAe9XkTGfQXVEJG4s4fciKZ6ABa+HF77UWOfj5pXPOoPt8RTkOlr5wjjpt3Ll/Xi0WDjvNj8+dY/TqUACSd1n4/bt87p9bS4JuIYQQQgghKuAvs493gwYN0DTtoseVWIV7zpw5zvKutpI2RkdHX/VnXwnBwcFERUURFRVV3VXRBYZA/Wb6XtoRHWDlPMjL1Lf0atsThk+CrNOwaj7UCIW6TWDoC9ChD6xfBOkJet4WnWDkW2C3wbJZ4Bukp9/xuL7P945lcGSXntakLTz6tj68/LcvwGS+ZDX/6NCyAvb/VECzPp6Vuq/Xa4E8HVuXLk/4XzLv2qlZZCTYyn1G5km9xzz0Bnc8/Y3UjnQn75Sdgkw7dqtixcQM2t7jQ2ibyrdNCCGEEEKIv6O/TOB94403OoO6unXrOtPbtm3rTK9Xr1411vD6Zbfbsdsv3fvav39/YmNjiY2NvQq1qoDVC8BmBZMbDPknNG0P37wD8Tv06/Ui4KEJeu/3gtfBUggGA9w+Cjr0hp+m673aADXrwvDXwVIAc8dDbqae3mOoHnyvXQgxi/Q030A9r9lL7/2uhJw0Gytez6B2Cze6PX3pAPp8PrWMGIyXfmF0dF0he7/Np90DPjTqVjbwDgxzw81TIzWumMJsO6cOFuNT24hXoJEt/8nBWqjo9mzl6iaEEEIIIcTf2V8m8P7++++dQd1jjz1WJn358uXExcURHh6Ou7s79erV4/nnn6egoACAxMREAgIC0DSNSZMmAZCSkuJMGz9+PCNGjHAOlQbK9KSXfD9nzhxnnujoaDRNY8SIEc604cOH07RpU3x9fXF3dyc8PJynn36anJzKL/ZV0gv+r3/9izFjxlCjRg1q1arFM888g81mA6Bv375omsagQYOc9ymlCAsLQ9M0Xn75ZQAsFgsTJkygadOmuLu7U6tWLUaOHMnZs2ed902cOBFN02jQoAHz5s2jcePGuLu7k5SUxOHDh7nzzjupVasWZrOZevXq0a9fP7Zt2wZceLTA7Nmzad++PZ6ennh7e9OlSxd+/PFH5/UTJ064fLZ33HEHXl5eNGzYkFmzZlX6M3NKPAiLp1Vv8O3uUeHqKofi17EZOKxwx9QaGExXftRF3lk7y8dnULOpGz3+GVBuHg9/AwPeq0H+aTuf35qGm6fGXe/X4OwxK9tm5dBrXCC7v8rj89tSmR6dwtp3M3HYZKkIIYQQQgghLuQvE3hfTHFxMdHR0fz73//m9OnTREZGcu7cOT744AMGDBjgDEI//fRTAN566y3279/P448/TnZ2Njfdd
BPjx4+ncePGNGrUyFnu5fak//jjj2RmZtK4cWPq169PYmIiH3/8MY8++uhlt/GDDz7gq6++wtPTkzNnzvDvf/+b2bNnA3qgD/Dbb785g/stW7aQlJQE4HwpMHjwYF5//XUSEhKIjIzEYrEwe/ZsevToQWFhocvzUlNTGTFiBCaTidq1awNw//33s3TpUmw2Gy1btsThcLBs2TIOHDhwwXq/+eabjBw5kl27dlGrVi38/PzYvHkzAwcOZMGCBWXyjxo1iv379+Pm5saJEycYNWoUhw4duuhnY7FYyMnJcTksNrs+p/vIzuoNvu8YfdG6n2/ngjySdli45eUAghqUvyDan7VyUgbF+Yo7pgZhMl84sG/cw5ORP9XhuR31ePjbEOrc4M7yCRlE9PICDdZ/kE3jaE/aPeDLjrl5/P5dfpXUVwghhBBCiOvBdRF4f/XVV+zZswd3d3d+//139u7d6xzuvGbNGtasWQPAgw8+yL333ktxcTG33norv/zyC97e3ixYsACTycS4ceMYN26cs9zyetgrIiYmhrNnz7Jnzx6OHTvGq6++CsAPP/xAUVHRZbWxXr16HD9+nKNHjxIaqi9ytXr1agAGDhyIn58fFouFH374AYBvvvkGgJtuuonmzZsTExPDr7/+6vxM9u7dy6FDh/D09OTAgQMsXLjQ5XlWq5Xp06dz+PBhUlJSCAsL48iRIwAsXbqUXbt2kZqayvHjxy84Jz0/P5+33noLgEGDBpGQkMCJEye46SZ90bLXXnutzD133XUXx48fZ8OGDQA4HI5LLj43ZcoU/P39XY4pG+P1hdTu/Vf1Bt8lC7FVwOnDxQCseTuLDzsmM3tguvPamncy+fLBUxUu68LPsGK3Kr584DQfdkxmxesZzmvz7z1FzPtZ5d63++s8Mk/YuOXlAE7G6v+G297jTbsH9ZXWT2y5vH/XQgghhBBC/B1cF4F3yVDn4uJiIiIi0DSNtm3bOq+fP+f4s88+IzQ0lFOn9CDmvffeo2nTple0PqtWraJVq1Z4enqiaRqTJ08GwGazcebMmcsq884778Tf3x8PDw8aNmwI4GyDp6cn99xzDwBff/01DoeDRYv0wK+kt7vkMwLo0aMHmqYRGhrq7On+47xsT09PRo0aBehD7A0GAwMGDACgZ8+eREZGcvfdd7Ns2TLq1KlTbp3379/vLP++++7DYDBgNpu5++67ATh58mSZz+PBBx9E0zRatGjhTCtp54WMHTuW7Oxsl2Ns1wj9YkSH6g++K8laqJxHCXsxWIv073NP2Zg1II1ZA9KIX1VQ6fKVo/QZ9mLX59qLyw4Zz023seGjbKJfDMAryEjJgvVGN61KhsMLIYQQQghxvbmuthNzd3fnxhtvLJMeGBjoPM/IyHCZa3306NFKP+f8hcays7Ndrn355Ze88MILANSpU4f69etz9uxZjh8/XubeyggICHCem0z6j+38LbuGDx/OzJkzWbVqFT/++CNpaWmYzWbuu+++MmWVt+p4SEiIy/fBwcEYDK7vZebNm8edd97JunXrOHDgAL/++itLlixh3759zmH8f1ZJO0vaCJfemsxsNmM2/2GFbZNRD6BDG5cG39+8owffQ/5ZGnwvnqan3/svPV9J8D1/kh58DxsPZk89+AY9+AZ9+7CS4HvueP0Y/ro+vLzHUD3P2oWQm0FF3T65BrdPruH8PjvFxow+aYDrPt4OG2Qk6PP7i/NKP5uY97OIX1lIcb7Dmbb48TMYTBrtHvSh/TBfHl8R6vLMfT/k89treh0vtOf3yjczCW1jptVd3gA06OTBznl5HN9YhG+IEYDwTrLCuRBCCCGEEBdyXfR4d+zYEdCD2unTpzuHiK9bt44XX3yRBx54wHn9oYceIi8vjzZt2qBpGh988AExMTHOsry8vJzn+fmu81Zr1aoFQHx8PACHDh0iLi7OJU9Jz7Gvry8JCQls3bqV3r17X+EWl9W1a1caN26M1WrliSeeAPRe8pKXDiWfEeg9xCWf0caNG5k4cWKZ+eflbae2YcMGBg0axOeff8769euZMGECAOvXry+3Ti1btsTTU181+5tvvsHhcGCxWFiyZAkA4eHhBAdXfG/rSvtlBiTrP6tq6/nesazq2vcH+efsZCXZKMgoDbxz0vS0omzHRe68sEPLCkjcaqHX+NKXV426e9L1KX+2zcph5aRM2j3oQ5shPn+6/kIIIYQQQlyvrovA+/7776d169bY7XY6duxIq1ataNasGQEBAQwZMoSsrCxAnwu8ZcsWAgMD+e2333j88cdxOBwMHz7c2QvevHlzZ7ktWrSgU6dObNq0CYBbb70VgGnTptGzZ086d+5cpje2devWAOTm5tKoUSMaNWrEt99+W9UfAQAPP/wwAOnp+tzgkkXXQF99vU+fPoA+J7x58+a0bNmSgIAA+vXrx4kTJy5Z/kMPPURgYCDNmjXjxhtvZPz48UBpm//I29ubV155BYAlS5bQsGFDGjRowNatWwF94bUqFRSi91xXZ/Ddoe9lV9+/rokX99XnxX31nb3df0xvNdDbmX775BrO9D8eXZ4sf/uvVgO9nXnK6+1u3teLZ7fXI6Ce6+CYzo/78cS6uozZWJdbxwbKkHMhhBBCCCEu4roIvM1mMzExMTz99NPUr1+f+Ph4MjMz6dChA5MnT6Z27drs3LmT119/HYCPPvqIOnXq8O6779KwYUNOnjzJmDFjAD2IHDduHLVr1yYxMZGtW7eSmakHUu+//z79+/fH09OTY8eO8corr9C1a1eXujz66KM8//zz1KxZk9zcXKKjo53PrWoPP/yws6c6JCSEvn1dg74ffviB8ePH07RpU44fP056ejqRkZG89tprtGrV6pLlP/LII7Rs2ZKzZ89y4MABQkJCGDVqFJ988skF73nttdeYNWsW7dq14/Tp02RnZ9O5c2d++OEHhg0b9ucafCm3/x/UCqve4LvdbVXbRiGEEEIIIcQ1T1OXmkArxF/RxEEw+FnwD4Zfv4CMdOg/Sg/EARIPwIq5EBYJtw4DownsNli9QN//u/dwfVV00APvX2boPei3/x+4mUEp2LgEDm6B7vdAc32ldrLOwM+f6ft33zEaCrJhyYcw6j19vrkQQgghhBDib0cCb3F9evNesBVfOt/V4GaGJz+GgCqczy6EEEIIIYS4ZkngLa5PWWegIOfS+a4GLz8JuoUQQgghhPgbk8BbCCGEEEIIIYSoQtfVPt5COF3NHm/p0RZCCCGEEEJchPR4i+vT5PvAark6z6rEHO7fXssgeZeF/DN2ALxqGGjc3ZObn/TD0994wfty021s+U8OKXuKyT1lw2EF/7pGWt7lTfthvhjd9NXsbRbF8okZpO8rJuOEDRTUae3OsIW1XcqL+z6PLf/JoeCcg5BW7vSeEEhQg9LtxL4bfQaHHYbOkBcKQgghhBBC/FnS4y2uT1aLvqp5zXpwaBus/xYiO0PXwaBp+vUrsdr5z5/rZRXkVCjwPrq2ELOvRlBDE4WZDrKT7examEfGSRtD/3Ph+zMTbexdlI+bl0ZgmImsZBtn
[... remainder of base64-encoded PNG figure data omitted ...]",
+ "text/plain": [
+ "<Figure size 1000x350 with 1 Axes>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "csv_file = current_dir.parent / \"data/CSV/Models/Civi_models.csv\"\n",
+ "out_file = current_dir.parent / \"plots/Figure_12.svg\"\n",
+ "sortByFrequency_model_types_csv(csv_file, out_file)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e91e903f",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "latm",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.15"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
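
The cell above passes `data/CSV/Models/Civi_models.csv` to `sortByFrequency_model_types_csv`, which renders the model-type frequency figure saved as `plots/Figure_12.svg`. The helper itself is defined in an earlier hunk that is not shown here; the following is only a minimal sketch of what such a function could look like, assuming the CSV exposes a `type` column naming each model's asset type (the column name, styling, and body are assumptions, not the committed implementation).

# Hypothetical sketch of a sortByFrequency_model_types_csv-style helper.
# Assumes the CSV has a "type" column; the committed notebook code may differ.
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

def sortByFrequency_model_types_csv(csv_file: Path, out_file: Path) -> None:
    df = pd.read_csv(csv_file)
    counts = df["type"].value_counts()         # most frequent model type first
    fig, ax = plt.subplots(figsize=(10, 3.5))  # matches the 1000x350 px output above
    counts.plot.bar(ax=ax)
    ax.set_xlabel("Model type")
    ax.set_ylabel("Number of models")
    fig.tight_layout()
    fig.savefig(out_file)                      # the .svg suffix selects SVG output
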
jupyter_notebooks/SuppM_Figure_S13_Danbooru_Taxonomy.ipynb ADDED
@@ -0,0 +1,1848 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "68a6e559-6ef4-4c7f-8736-93bc8588e3bc",
+ "metadata": {},
+ "source": [
+ "# \"Danbooru\" Taxonomy"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "36aff030",
+ "metadata": {},
+ "source": [
+ "### Getting Tags and Categories from Danbooru.donmai.us"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "id": "a97e3a64-5990-44a9-809d-45030e85f161",
+ "metadata": {
+ "execution": {
+ "iopub.execute_input": "2025-03-17T12:48:07.782534Z",
+ "iopub.status.busy": "2025-03-17T12:48:07.781679Z",
+ "iopub.status.idle": "2025-03-17T12:48:07.786016Z",
+ "shell.execute_reply": "2025-03-17T12:48:07.785465Z",
+ "shell.execute_reply.started": "2025-03-17T12:48:07.782509Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from pathlib import Path\n",
+ "import subprocess\n",
+ "import json\n",
+ "import urllib.parse"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "97917f01-8f5b-4aa7-8869-9a7d853843d0",
+ "metadata": {},
+ "source": [
+ "#### Paths credentials"
46
+ ]
47
+ },
48
+ {
49
+ "cell_type": "code",
50
+ "execution_count": 2,
51
+ "id": "1f50a910-66c2-472a-8027-dd589b6ae755",
52
+ "metadata": {
53
+ "execution": {
54
+ "iopub.execute_input": "2025-03-17T12:48:08.799790Z",
55
+ "iopub.status.busy": "2025-03-17T12:48:08.798840Z",
56
+ "iopub.status.idle": "2025-03-17T12:48:08.905544Z",
57
+ "shell.execute_reply": "2025-03-17T12:48:08.904995Z",
58
+ "shell.execute_reply.started": "2025-03-17T12:48:08.799770Z"
59
+ }
60
+ },
61
+ "outputs": [],
62
+ "source": [
63
+ "current_dir = Path.cwd()"
64
+ ]
65
+ },
66
+ {
67
+ "cell_type": "code",
68
+ "execution_count": 6,
69
+ "id": "a0f846e3-1fcc-4839-a459-c974f9c46d06",
70
+ "metadata": {
71
+ "execution": {
72
+ "iopub.execute_input": "2025-03-17T12:48:09.490977Z",
73
+ "iopub.status.busy": "2025-03-17T12:48:09.490799Z",
74
+ "iopub.status.idle": "2025-03-17T12:48:09.701470Z",
75
+ "shell.execute_reply": "2025-03-17T12:48:09.700912Z",
76
+ "shell.execute_reply.started": "2025-03-17T12:48:09.490961Z"
77
+ }
78
+ },
79
+ "outputs": [],
80
+ "source": [
81
+ "api_key = (current_dir.parent / \"misc/credentials/api-key_danbooru\").read_text().strip()\n",
82
+ "username = (current_dir.parent / \"misc/credentials/username_danbooru\").read_text().strip()"
83
+ ]
84
+ },
85
+ {
86
+ "cell_type": "markdown",
87
+ "id": "c4c2290f-7705-4061-9b29-e1b74da732b8",
88
+ "metadata": {
89
+ "execution": {
90
+ "iopub.execute_input": "2025-03-12T20:39:34.470218Z",
91
+ "iopub.status.busy": "2025-03-12T20:39:34.469746Z",
92
+ "iopub.status.idle": "2025-03-12T20:39:34.472498Z",
93
+ "shell.execute_reply": "2025-03-12T20:39:34.472084Z",
94
+ "shell.execute_reply.started": "2025-03-12T20:39:34.470199Z"
95
+ }
96
+ },
97
+ "source": [
98
+ "### Query danbooru.donmai.us for a single Tag"
99
+ ]
100
+ },
101
+ {
102
+ "cell_type": "code",
103
+ "execution_count": 7,
104
+ "id": "00ffaf5f-a4ae-408a-8a7c-e6b174f091f8",
105
+ "metadata": {
106
+ "execution": {
107
+ "iopub.execute_input": "2025-03-17T12:48:11.712064Z",
108
+ "iopub.status.busy": "2025-03-17T12:48:11.711866Z",
109
+ "iopub.status.idle": "2025-03-17T12:48:20.143088Z",
110
+ "shell.execute_reply": "2025-03-17T12:48:20.142318Z",
111
+ "shell.execute_reply.started": "2025-03-17T12:48:11.712047Z"
112
+ }
113
+ },
114
+ "outputs": [
115
+ {
116
+ "name": "stdout",
117
+ "output_type": "stream",
118
+ "text": [
119
+ "🚀 Fetching tag details for 'wombat'...\n",
120
+ "🔎 Running cURL: curl -s -L --user parodyofsomething:FkzGApb17bfJayJMqKTzeyTw -X GET https://danbooru.donmai.us/tags.json?search%5Bname%5D=wombat&only=id,name,category,post_count,is_deprecated,created_at,updated_at,wiki_page,artist,antecedent_alias,consequent_aliases,antecedent_implications,consequent_implications,dtext_links\n",
121
+ "✅ Test complete! Data saved to `/home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/misc/danbooru/tag_info_test/wombat_tag_info.json`.\n"
122
+ ]
123
+ }
124
+ ],
125
+ "source": [
126
+ "output_dir = current_dir.parent / \"misc\" / \"danbooru\" / \"tag_info_test\"\n",
127
+ "output_dir.mkdir(parents=True, exist_ok=True)\n",
128
+ "\n",
129
+ "# Get the tag to test\n",
130
+ "tag_name = input(\"Enter a single tag to test: \").strip()\n",
131
+ "if not tag_name:\n",
132
+ " print(\"❌ ERROR: No tag entered.\")\n",
133
+ " exit()\n",
134
+ "\n",
135
+ "# Encode the tag name properly\n",
136
+ "encoded_tag = urllib.parse.quote(tag_name, safe=\"\") # Encode everything\n",
137
+ "\n",
138
+ "# API endpoints with correct encoding\n",
139
+ "BASE_URL = \"https://danbooru.donmai.us\"\n",
140
+ "#TAGS_API = f\"{BASE_URL}/tags.json?search%5Bname%5D={encoded_tag}\"\n",
141
+ "TAGS_API = f\"{BASE_URL}/tags.json?search%5Bname%5D={encoded_tag}&only=id,name,category,post_count,is_deprecated,created_at,updated_at,wiki_page,artist,antecedent_alias,consequent_aliases,antecedent_implications,consequent_implications,dtext_links\"\n",
142
+ "IMPLICATIONS_API = f\"{BASE_URL}/tag_implications.json?search%5Bantecedent_name%5D={encoded_tag}\"\n",
143
+ "ALIASES_API = f\"{BASE_URL}/tag_aliases.json?search%5Bantecedent_name%5D={encoded_tag}\"\n",
144
+ "WIKI_API = f\"{BASE_URL}/wiki_pages.json?search%5Btitle%5D={encoded_tag}\"\n",
145
+ "\n",
146
+ "# Function to run cURL command properly\n",
147
+ "def fetch_data(api_url, description):\n",
148
+ " print(f\"🚀 Fetching {description} for '{tag_name}'...\")\n",
149
+ "\n",
150
+ " # Build the cURL command\n",
151
+ " curl_command = [\n",
152
+ " \"curl\", \"-s\", \"-L\", # Silent & follow redirects\n",
153
+ " \"--user\", f\"{username}:{api_key}\", # Auth\n",
154
+ " \"-X\", \"GET\",\n",
155
+ " api_url\n",
156
+ " ]\n",
157
+ "\n",
158
+ " print(f\"🔎 Running cURL: {' '.join(curl_command)}\") # Show the command\n",
159
+ "\n",
160
+ " try:\n",
161
+ " result = subprocess.run(curl_command, capture_output=True, text=True, check=True)\n",
162
+ " if result.returncode == 0:\n",
163
+ " return json.loads(result.stdout) if result.stdout else None\n",
164
+ " else:\n",
165
+ " print(f\"⚠️ cURL failed (status {result.returncode}): {result.stderr}\")\n",
166
+ " return None\n",
167
+ " except subprocess.CalledProcessError as e:\n",
168
+ " print(f\"⚠️ cURL error: {e}\")\n",
169
+ " return None\n",
170
+ " except json.JSONDecodeError:\n",
171
+ " print(f\"⚠️ Failed to parse JSON: {result.stdout}\")\n",
172
+ " return None\n",
173
+ "\n",
174
+ "# Fetch tag details\n",
175
+ "tag_info = {\n",
176
+ " \"tag_details\": fetch_data(TAGS_API, \"tag details\"),\n",
177
+ " #\"implications\": fetch_data(IMPLICATIONS_API, \"tag implications\"),\n",
178
+ " #\"aliases\": fetch_data(ALIASES_API, \"tag aliases\"),\n",
179
+ " #\"wiki\": fetch_data(WIKI_API, \"wiki information\"),\n",
180
+ "}\n",
181
+ "\n",
182
+ "# Save to JSON\n",
183
+ "output_file = output_dir / f\"{tag_name}_tag_info.json\"\n",
184
+ "with open(output_file, \"w\", encoding=\"utf-8\") as f:\n",
185
+ " json.dump(tag_info, f, indent=4, ensure_ascii=False)\n",
186
+ "\n",
187
+ "print(f\"✅ Test complete! Data saved to `{output_file}`.\")\n"
188
+ ]
189
+ },
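
The cell above shells out to cURL via subprocess even though the fetcher later in this notebook uses the requests library directly. For reference, a minimal requests equivalent of the same query (a sketch, assuming the username, api_key, and tag_name variables defined above; the shortened only= field list is illustrative):

    import requests

    resp = requests.get(
        "https://danbooru.donmai.us/tags.json",
        params={"search[name]": tag_name, "only": "id,name,category,post_count"},
        auth=(username, api_key),  # HTTP Basic auth, same as curl --user
        timeout=30,
    )
    resp.raise_for_status()
    tag_details = resp.json()  # list of matching tag records

Passing a params dict also lets requests handle the percent-encoding that the cell performs by hand with urllib.parse.quote.
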
190
+ {
191
+ "cell_type": "markdown",
192
+ "id": "75a28e4c-e06f-4578-a338-4f0014953302",
193
+ "metadata": {},
194
+ "source": [
195
+ "## Hierarchy"
196
+ ]
197
+ },
198
+ {
199
+ "cell_type": "code",
200
+ "execution_count": 8,
201
+ "id": "1a58f2d1-909b-41c0-bcb8-0a55e6835f53",
202
+ "metadata": {
203
+ "execution": {
204
+ "iopub.execute_input": "2025-03-17T12:48:42.365809Z",
205
+ "iopub.status.busy": "2025-03-17T12:48:42.365447Z",
206
+ "iopub.status.idle": "2025-03-17T12:48:42.376305Z",
207
+ "shell.execute_reply": "2025-03-17T12:48:42.375752Z",
208
+ "shell.execute_reply.started": "2025-03-17T12:48:42.365788Z"
209
+ }
210
+ },
211
+ "outputs": [],
212
+ "source": [
213
+ "manual_hierarchy = {\n",
214
+ " \"subject\": {\n",
215
+ " \"female\": {\n",
216
+ " \"female_general\": [\"woman\", \"girl\", \"1girl\", \"2girls\", \"3girls\", \"4girls\", \"5girls\", \"6+girls\", \"multiple girls\"]\n",
217
+ " },\n",
218
+ " \"male\": {\n",
219
+ " \"male_general\": [\"man\", \"boy\", \"1boy\", \"2boys\", \"3boys\", \"4boys\", \"5boys\", \"6+boys\", \"multiple boys\"]\n",
220
+ " },\n",
221
+ " \"koma\": {\n",
222
+ " \"koma_general\": [\"1koma\", \"2koma\"]\n",
223
+ " },\n",
224
+ " \"anthro\": {\n",
225
+ " \"anthro_general\": [\"cat girl\", \"fox girl\", \"dog girl\", \"cat boy\", \"furry\"]\n",
226
+ " }\n",
227
+ " },\n",
228
+ "\n",
229
+ " \"visual_characteristics\": {\n",
230
+ " \"image_composition_and_style\": {\n",
231
+ " \"artistic_license\": [],\n",
232
+ " \"image_composition\": {\n",
233
+ " \"backgrounds\": [],\n",
234
+ " \"censorship\": [],\n",
235
+ " \"colors\": [],\n",
236
+ " \"focus_tags\": [],\n",
237
+ " \"lighting\": [],\n",
238
+ " \"prints\": []\n",
239
+ " },\n",
240
+ " \"patterns\": [],\n",
241
+ " \"symbols\": [],\n",
242
+ " \"text\": [],\n",
243
+ " \"year_tags\": []\n",
244
+ " }\n",
245
+ " },\n",
246
+ " \"body\": {\n",
247
+ " \"body_parts\": {\n",
248
+ " \"ass\": [],\n",
249
+ " \"breasts_tags\": [],\n",
250
+ " \"face_tags\": {\n",
251
+ " \"eyes_tags\": []\n",
252
+ " },\n",
253
+ " \"ears_tags\": [],\n",
254
+ " \"hair\": {\n",
255
+ " \"hair_color\": [],\n",
256
+ " \"hair_styles\": []\n",
257
+ " },\n",
258
+ " \"hands\": {\n",
259
+ " \"gestures\": []\n",
260
+ " },\n",
261
+ " \"neck_and_neckwear\": [],\n",
262
+ " \"posture\": [],\n",
263
+ " \"pussy\": [],\n",
264
+ " \"penis\": [],\n",
265
+ " \"shoulders\": [],\n",
266
+ " \"skin_color\": [],\n",
267
+ " \"tail\": [],\n",
268
+ " \"wings\": []\n",
269
+ " },\n",
270
+ " \"injury\": []\n",
271
+ " },\n",
272
+ " \"attire_and_body_accessories\": {\n",
273
+ " \"attire\": {\n",
274
+ " \"dress\": [],\n",
275
+ " \"handwear\": [],\n",
276
+ " \"headwear\": [],\n",
277
+ " \"legwear\": [],\n",
278
+ " \"neck_and_neckwear\": [],\n",
279
+ " \"sexual_attire\": {\n",
280
+ " \"bra\": [],\n",
281
+ " \"panties\": []\n",
282
+ " },\n",
283
+ " \"sleeves\": [],\n",
284
+ " \"swimsuit\": []\n",
285
+ " },\n",
286
+ " \"embellishment\": [],\n",
287
+ " \"eyewear\": [],\n",
288
+ " \"fashion_style\": [],\n",
289
+ " \"nudity\": []\n",
290
+ " },\n",
291
+ " \"sex\": {\n",
292
+ " \"sex_acts\": {\n",
293
+ " \"simulated_sex_acts\": []\n",
294
+ " },\n",
295
+ " \"sexual_positions\": []\n",
296
+ " },\n",
297
+ " \"objects\": {\n",
298
+ " \"computer\": [],\n",
299
+ " \"airplanes\": [],\n",
300
+ " \"armor\": [],\n",
301
+ " \"ground_vehicles\": [],\n",
302
+ " \"helicopters\": [],\n",
303
+ " \"pokemon_objects\": [],\n",
304
+ " \"ships\": [],\n",
305
+ " \"weapons\": [],\n",
306
+ " \"audio_tags\": [],\n",
307
+ " \"cards\": {\n",
308
+ " \"playing_card_faces\": []\n",
309
+ " },\n",
310
+ " \"eyewear\": [],\n",
311
+ " \"piercings\": [],\n",
312
+ " \"sex_objects\": []\n",
313
+ " },\n",
314
+ " \"creatures\": {\n",
315
+ " \"animals\": {\n",
316
+ " \"birds\": [],\n",
317
+ " \"cats\": [],\n",
318
+ " \"dogs\": []\n",
319
+ " },\n",
320
+ " \"legendary_creatures\": []\n",
321
+ " },\n",
322
+ " \"plants\": {\n",
323
+ " \"plant\": {\n",
324
+ " \"tree\": []\n",
325
+ " },\n",
326
+ " \"flowers\": []\n",
327
+ " },\n",
328
+ " \"games\": {\n",
329
+ " \"game_activities\": [],\n",
330
+ " \"board_games\": [],\n",
331
+ " \"sports\": [],\n",
332
+ " \"video_game\": []\n",
333
+ " },\n",
334
+ " \"real_world\": {\n",
335
+ " \"companies_and_brand_names\": [],\n",
336
+ " \"holidays_and_celebrations\": [],\n",
337
+ " \"jobs\": [],\n",
338
+ " \"locations\": [],\n",
339
+ " \"people\": [],\n",
340
+ " \"real_world_locations\": []\n",
341
+ " },\n",
342
+ " \"more\": {\n",
343
+ " \"dances\": [],\n",
344
+ " \"family_relationships\": [],\n",
345
+ " \"food_tags\": [],\n",
346
+ " \"fire\": [],\n",
347
+ " \"groups\": [],\n",
348
+ " \"phrases\": [],\n",
349
+ " \"scan\": [],\n",
350
+ " \"subjective\": [],\n",
351
+ " \"technology\": [],\n",
352
+ " \"verbs_and_gerunds\": [],\n",
353
+ " \"water\": [],\n",
354
+ " \"disambiguation_pages\": [],\n",
355
+ " \"magazine_publications\": [],\n",
356
+ " \"special_moves\": [],\n",
357
+ " \"uniforms\": [],\n",
358
+ " \"pokemon_media\": [],\n",
359
+ " \"tagged_songs\": [],\n",
360
+ " \"vocaloid_derivatives\": [],\n",
361
+ " \"vocaloid_songs\": [],\n",
362
+ " \"vocal_synthesizers\": [],\n",
363
+ " \"vocal_synth_derivatives\": [],\n",
364
+ " \"vocal_synth_songs\": [],\n",
365
+ " \"deemo_songs\": []\n",
366
+ " },\n",
367
+ " \"copyrights_artists_projects_and_media\": {\n",
368
+ " \"genres_of_video_games\": {\n",
369
+ " \"fighting_games\": [],\n",
370
+ " \"platform_games\": [],\n",
371
+ " \"role-playing_games\": [],\n",
372
+ " \"shooter_games\": [],\n",
373
+ " \"visual_novel_games\": []\n",
374
+ " },\n",
375
+ " \"artists\": {\n",
376
+ " \"named_drawfags\": [],\n",
377
+ " \"pixiv_projects\": []\n",
378
+ " }\n",
379
+ " },\n",
380
+ " \"characters\": {\n",
381
+ " \"ace_attorney\": [],\n",
382
+ " \"arknights\": [],\n",
383
+ " \"atelier\": [],\n",
384
+ " \"azur_lane\": [],\n",
385
+ " \"bleach\": [],\n",
386
+ " \"bokujou_monogatari\": [],\n",
387
+ " \"brave_girl_ravens\": [],\n",
388
+ " \"cardcaptor_sakura\": [],\n",
389
+ " \"danganronpa\": [],\n",
390
+ " \"digimon\": {\n",
391
+ " \"digimon_characters\": []\n",
392
+ " },\n",
393
+ " \"dragon_ball\": [],\n",
394
+ " \"dragon_quest\": [],\n",
395
+ " \"fate_series\": [],\n",
396
+ " \"final_fantasy\": [],\n",
397
+ " \"fire_emblem\": [],\n",
398
+ " \"flower_knight_girl\": [],\n",
399
+ " \"gensou_suikoden\": [],\n",
400
+ " \"girls_frontline\": [],\n",
401
+ " \"girls_und_panzer\": [],\n",
402
+ " \"gundam_mechas\": [],\n",
403
+ " \"hunter_x_hunter\": [],\n",
404
+ " \"jojo_no_kimyou_na_bouken\": [],\n",
405
+ " \"kamen_rider\": [],\n",
406
+ " \"kantai_collection\": [],\n",
407
+ " \"kingdom_hearts\": [],\n",
408
+ " \"mahou_sensei_negima\": [],\n",
409
+ " \"meitantei_conan\": [],\n",
410
+ " \"minecraft\": [],\n",
411
+ " \"naruto\": [],\n",
412
+ " \"nippon_ichi\": [],\n",
413
+ " \"one_piece\": [],\n",
414
+ " \"oshiro_project\": [],\n",
415
+ " \"pokemon\": {\n",
416
+ " \"elite_four_members\": [],\n",
417
+ " \"gym_leaders\": [],\n",
418
+ " \"families_of_pokemon_main_characters\": [],\n",
419
+ " \"pokemon_ranger_characters\": [],\n",
420
+ " \"pokemon_trainer_classes\": []\n",
421
+ " },\n",
422
+ " \"pretty_cure\": [],\n",
423
+ " \"ragnarok_online\": [],\n",
424
+ " \"rosenkreuzstilette\": [],\n",
425
+ " \"sailor_moon\": [],\n",
426
+ " \"street_fighter\": [],\n",
427
+ " \"super_smash_bros\": [],\n",
428
+ " \"world_witches_series\": [],\n",
429
+ " \"tales_of\": [],\n",
430
+ " \"toaru_majutsu_no_index\": [],\n",
431
+ " \"touhou\": [],\n",
432
+ " \"touken_ranbu\": [],\n",
433
+ " \"ultra_series\": [],\n",
434
+ " \"umamusume\": [],\n",
435
+ " \"vocaloid\": [],\n",
436
+ " \"yu_gi_oh\": [],\n",
437
+ " \"genderswap_characters\": [],\n",
438
+ " \"official_mascots\": [],\n",
439
+ " \"real_life_racehorses\": []\n",
440
+ " },\n",
441
+ " \"metatags\": {\n",
442
+ " \"tag_group_metatags\": [],\n",
443
+ " \"drawing_software\": []\n",
444
+ " }\n",
445
+ "}\n"
446
+ ]
447
+ },
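
Most leaf lists in manual_hierarchy are intentionally empty: the API fetcher at the end of the notebook fills them from Danbooru wiki pages. A quick sanity check after fetching is to count leaves per top-level branch (a hypothetical helper, mirroring what clean_hierarchy_and_count_tags below does while pruning):

    def count_tags(node) -> int:
        # Leaves are lists of tag strings; everything else is a nested dict.
        if isinstance(node, list):
            return len(node)
        return sum(count_tags(child) for child in node.values())

    for branch, subtree in manual_hierarchy.items():
        print(branch, count_tags(subtree))
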
448
+ {
449
+ "cell_type": "markdown",
450
+ "id": "824760af-a568-47f8-b076-b30e8606b9ac",
451
+ "metadata": {},
452
+ "source": [
453
+ "## Tag Groups"
454
+ ]
455
+ },
456
+ {
457
+ "cell_type": "code",
458
+ "execution_count": 9,
459
+ "id": "557cdd69-f6b1-4c33-9f67-a13b00c0146b",
460
+ "metadata": {
461
+ "execution": {
462
+ "iopub.execute_input": "2025-03-17T12:48:43.777297Z",
463
+ "iopub.status.busy": "2025-03-17T12:48:43.776934Z",
464
+ "iopub.status.idle": "2025-03-17T12:48:43.787862Z",
465
+ "shell.execute_reply": "2025-03-17T12:48:43.787288Z",
466
+ "shell.execute_reply.started": "2025-03-17T12:48:43.777279Z"
467
+ }
468
+ },
469
+ "outputs": [],
470
+ "source": [
471
+ "tag_groups = {\n",
472
+ " # **Subjects (Humans, Anthro, etc.)**\n",
473
+ " \"subject\": \"subject\",\n",
474
+ " \"subject.female\": \"female\",\n",
475
+ " \"subject.female_general\": \"female_general\", # Added\n",
476
+ " \"subject.female.1girl\": \"1girl\",\n",
477
+ " \"subject.female.2girls\": \"2girls\",\n",
478
+ "\n",
479
+ "\n",
480
+ " # **Male Subjects**\n",
481
+ " \"subject.male\": \"male\",\n",
482
+ " \"subject.male.1boy\": \"1boy\",\n",
483
+ " \"subject.male.2boys\": \"2boys\",\n",
484
+ " \"subject.male.3boys\": \"3boys\",\n",
485
+ " \"subject.male.4boys\": \"4boys\",\n",
486
+ " \"subject.male.5boys\": \"5boys\",\n",
487
+ " \"subject.male.6+boys\": \"6+boys\",\n",
488
+ " \"subject.male.multiple_boys\": \"multiple_boys\",\n",
489
+ " \"subject.male.man\": \"man\",\n",
490
+ " \"subject.male.boy\": \"boy\",\n",
491
+ "\n",
492
+ " # **Koma (Manga Panel Counts)**\n",
493
+ " \"subject.koma\": \"koma\",\n",
494
+ " \"subject.koma.1koma\": \"1koma\",\n",
495
+ " \"subject.koma.2koma\": \"2koma\",\n",
496
+ "\n",
497
+ " # **Anthropomorphic Characters**\n",
498
+ " \"subject.anthro\": \"anthro\",\n",
499
+ " \"subject.anthro.cat_girl\": \"cat_girl\",\n",
500
+ " \"subject.anthro.fox_girl\": \"fox_girl\",\n",
501
+ " \"subject.anthro.dog_girl\": \"dog_girl\",\n",
502
+ " \"subject.anthro.cat_boy\": \"cat_boy\",\n",
503
+ " \"subject.anthro.furry\": \"furry\",\n",
504
+ "\n",
505
+ " # **Visual Characteristics**\n",
506
+ " \"visual_characteristics\": \"visual_characteristics\",\n",
507
+ " \"visual_characteristics.image_composition_and_style\": \"image_composition_and_style\",\n",
508
+ " \"visual_characteristics.image_composition_and_style.artistic_license\": \"artistic_license\",\n",
509
+ " \"visual_characteristics.image_composition_and_style.image_composition\": \"image_composition\",\n",
510
+ " \"visual_characteristics.image_composition_and_style.image_composition.backgrounds\": \"backgrounds\",\n",
511
+ " \"visual_characteristics.image_composition_and_style.image_composition.censorship\": \"censorship\",\n",
512
+ " \"visual_characteristics.image_composition_and_style.image_composition.colors\": \"colors\",\n",
513
+ " \"visual_characteristics.image_composition_and_style.image_composition.focus_tags\": \"focus_tags\",\n",
514
+ " \"visual_characteristics.image_composition_and_style.image_composition.lighting\": \"lighting\",\n",
515
+ " \"visual_characteristics.image_composition_and_style.image_composition.prints\": \"prints\",\n",
516
+ " \"visual_characteristics.image_composition_and_style.image_composition.style_parodies\": \"style_parodies\",\n",
517
+ " \"visual_characteristics.image_composition_and_style.patterns\": \"patterns\",\n",
518
+ " \"visual_characteristics.image_composition_and_style.symbols\": \"symbols\",\n",
519
+ " \"visual_characteristics.image_composition_and_style.text\": \"text\",\n",
520
+ " \"visual_characteristics.image_composition_and_style.year_tags\": \"year_tags\",\n",
521
+ "\n",
522
+ "\n",
523
+ " # Body\n",
524
+ " \"body\": \"body\",\n",
525
+ " \"body.body_parts\": \"body_parts\",\n",
526
+ " \"body.body_parts.ass\": \"ass\",\n",
527
+ " \"body.body_parts.breasts_tags\": \"breasts_tags\",\n",
528
+ " \"body.body_parts.face_tags\": \"face_tags\",\n",
529
+ " \"body.body_parts.face_tags.eyes_tags\": \"eyes_tags\",\n",
530
+ " \"body.body_parts.ears_tags\": \"ears_tags\",\n",
531
+ " \"body.body_parts.hair\": \"hair\",\n",
532
+ " \"body.body_parts.hair.hair_color\": \"hair_color\",\n",
533
+ " \"body.body_parts.hair.hair_styles\": \"hair_styles\",\n",
534
+ " \"body.body_parts.hands\": \"hands\",\n",
535
+ " \"body.body_parts.hands.gestures\": \"gestures\",\n",
536
+ " \"body.body_parts.neck_and_neckwear\": \"neck_and_neckwear\",\n",
537
+ " \"body.body_parts.posture\": \"posture\",\n",
538
+ " \"body.body_parts.pussy\": \"pussy\",\n",
539
+ " \"body.body_parts.penis\": \"penis\",\n",
540
+ " \"body.body_parts.shoulders\": \"shoulders\",\n",
541
+ " \"body.body_parts.skin_color\": \"skin_color\",\n",
542
+ " \"body.body_parts.tail\": \"tail\",\n",
543
+ " \"body.body_parts.wings\": \"wings\",\n",
544
+ " \"body.injury\": \"injury\",\n",
545
+ "\n",
546
+ " # Attire & Accessories\n",
547
+ " \"attire_and_body_accessories\": \"attire_and_body_accessories\",\n",
548
+ " \"attire_and_body_accessories.attire\": \"attire\",\n",
549
+ " \"attire_and_body_accessories.attire.dress\": \"dress\",\n",
550
+ " \"attire_and_body_accessories.attire.handwear\": \"handwear\",\n",
551
+ " \"attire_and_body_accessories.attire.headwear\": \"headwear\",\n",
552
+ " \"attire_and_body_accessories.attire.legwear\": \"legwear\",\n",
553
+ " \"attire_and_body_accessories.attire.mask\": \"mask\",\n",
554
+ " \"attire_and_body_accessories.attire.neck_and_neckwear\": \"neck_and_neckwear\",\n",
555
+ " \"attire_and_body_accessories.attire.sexual_attire\": \"sexual_attire\",\n",
556
+ " \"attire_and_body_accessories.attire.sexual_attire.bra\": \"bra\",\n",
557
+ " \"attire_and_body_accessories.attire.sexual_attire.panties\": \"panties\",\n",
558
+ " \"attire_and_body_accessories.attire.sleeves\": \"sleeves\",\n",
559
+ " \"attire_and_body_accessories.attire.swimsuit\": \"swimsuit\",\n",
560
+ " \"attire_and_body_accessories.embellishment\": \"embellishment\",\n",
561
+ " \"attire_and_body_accessories.eyewear\": \"eyewear\",\n",
562
+ " \"attire_and_body_accessories.fashion_style\": \"fashion_style\",\n",
563
+ " \"attire_and_body_accessories.nudity\": \"nudity\",\n",
564
+ "\n",
565
+ " # Sex\n",
566
+ " \"sex\": \"sex\",\n",
567
+ " \"sex.sex_acts\": \"sex_acts\",\n",
568
+ " \"sex.sex_acts.simulated_sex_acts\": \"simulated_sex_acts\",\n",
569
+ " \"sex.sexual_positions\": \"sexual_positions\",\n",
570
+ "\n",
571
+ " # Objects\n",
572
+ " \"objects\": \"objects\",\n",
573
+ " \"objects.computer\": \"computer\",\n",
574
+ " \"objects.airplanes\": \"airplanes\",\n",
575
+ " \"objects.armor\": \"armor\",\n",
576
+ " \"objects.ground_vehicles\": \"ground_vehicles\",\n",
577
+ " \"objects.helicopters\": \"helicopters\",\n",
578
+ " \"objects.pokemon_objects\": \"pokemon_objects\",\n",
579
+ " \"objects.ships\": \"ships\",\n",
580
+ " \"objects.weapons\": \"weapons\",\n",
581
+ " \"objects.audio_tags\": \"audio_tags\",\n",
582
+ " \"objects.cards\": \"cards\",\n",
583
+ " \"objects.cards.playing_card_faces\": \"playing_card_faces\",\n",
584
+ " \"objects.eyewear\": \"eyewear\",\n",
585
+ " \"objects.piercings\": \"piercings\",\n",
586
+ " \"objects.sex_objects\": \"sex_objects\",\n",
587
+ "\n",
588
+ " # Creatures\n",
589
+ " \"creatures\": \"creatures\",\n",
590
+ " \"creatures.animals\": \"animals\",\n",
591
+ " \"creatures.animals.birds\": \"birds\",\n",
592
+ " \"creatures.animals.cats\": \"cats\",\n",
593
+ " \"creatures.animals.dogs\": \"dogs\",\n",
594
+ " \"creatures.legendary_creatures\": \"legendary_creatures\",\n",
595
+ "\n",
596
+ " # Plants\n",
597
+ " \"plants\": \"plant\",\n",
598
+ " \"plant.plant\": \"plant\",\n",
599
+ " \"plant.tree\": \"tree\",\n",
600
+ " \"plant.flowers\": \"flowers\",\n",
601
+ "\n",
602
+ " # Games\n",
603
+ " \"games\": \"games\",\n",
604
+ " \"games.game_activities\": \"game_activities\",\n",
605
+ " \"games.board_games\": \"board_games\",\n",
606
+ " \"games.sports\": \"sports\",\n",
607
+ " \"games.video_game\": \"video_game\",\n",
608
+ " \"games.fighting_games\": \"fighting_games\",\n",
609
+ "\n",
610
+ " # Real World\n",
611
+ " \"real_world\": \"real_world\",\n",
612
+ " \"real_world.companies_and_brand_names\": \"companies_and_brand_names\",\n",
613
+ " \"real_world.holidays_and_celebrations\": \"holidays_and_celebrations\",\n",
614
+ " \"real_world.jobs\": \"jobs\",\n",
615
+ " \"real_world.locations\": \"locations\",\n",
616
+ " \"real_world.people\": \"people\",\n",
617
+ " \"real_world.real_world_locations\": \"real_world_locations\",\n",
618
+ "\n",
619
+ " # More Categories\n",
620
+ " \"more\": \"more\",\n",
621
+ " \"more.dances\": \"dances\",\n",
622
+ " \"more.family_relationships\": \"family_relationships\",\n",
623
+ " \"more.food_tags\": \"food_tags\",\n",
624
+ " \"more.fire\": \"fire\",\n",
625
+ " \"more.groups\": \"groups\",\n",
626
+ " \"more.phrases\": \"phrases\",\n",
627
+ " \"more.scan\": \"scan\",\n",
628
+ " \"more.subjective\": \"subjective\",\n",
629
+ " \"more.technology\": \"technology\",\n",
630
+ " \"more.verbs_and_gerunds\": \"verbs_and_gerunds\",\n",
631
+ " \"more.water\": \"water\",\n",
632
+ "\n",
633
+ " # Genres of Video Games\n",
634
+ " \"copyrights_artists_projects_and_media\": \"copyrights_artists_projects_and_media\",\n",
635
+ " \"copyrights_artists_projects_and_media.genres_of_video_games\": \"genres_of_video_games\",\n",
636
+ " \"copyrights_artists_projects_and_media.genres_of_video_games.fighting_games\": \"fighting_games\",\n",
637
+ " \"copyrights_artists_projects_and_media.genres_of_video_games.platform_games\": \"platform_games\",\n",
638
+ " \"copyrights_artists_projects_and_media.genres_of_video_games.role-playing_games\": \"role-playing_games\",\n",
639
+ " \"copyrights_artists_projects_and_media.genres_of_video_games.shooter_games\": \"shooter_games\",\n",
640
+ " \"copyrights_artists_projects_and_media.genres_of_video_games.visual_novel_games\": \"visual_novel_games\",\n",
641
+ "\n",
642
+ " \"characters\": \"characters\",\n",
643
+ " \"characters.ace_attorney\": \"ace_attorney_characters\",\n",
644
+ " \"characters.arknights\": \"arknights_characters\",\n",
645
+ " \"characters.atelier\": \"atelier_characters\",\n",
646
+ " \"characters.azur_lane\": \"azur_lane_characters\",\n",
647
+ " \"characters.bleach\": \"bleach_characters\",\n",
648
+ " \"characters.bokujou_monogatari\": \"bokujou_monogatari_characters\",\n",
649
+ " \"characters.brave_girl_ravens\": \"brave_girl_ravens_characters\",\n",
650
+ " \"characters.cardcaptor_sakura\": \"cardcaptor_sakura_characters\",\n",
651
+ " \"characters.danganronpa\": \"danganronpa_characters\",\n",
652
+ " \"characters.digimon\": \"digimon\",\n",
653
+ " \"characters.digimon.digimon_characters\": \"digimon_characters\",\n",
654
+ " \"characters.dragon_ball\": \"dragon_ball_characters\",\n",
655
+ " \"characters.dragon_quest\": \"dragon_quest_characters\",\n",
656
+ " \"characters.fate_series\": \"fate_series_characters\",\n",
657
+ " \"characters.final_fantasy\": \"final_fantasy_characters\",\n",
658
+ " \"characters.fire_emblem\": \"fire_emblem_characters\",\n",
659
+ " \"characters.flower_knight_girl\": \"flower_knight_girl_characters\",\n",
660
+ " \"characters.gensou_suikoden\": \"gensou_suikoden_characters\",\n",
661
+ " \"characters.girls_frontline\": \"girls_frontline_characters\",\n",
662
+ " \"characters.girls_und_panzer\": \"girls_und_panzer_characters\",\n",
663
+ " \"characters.gundam_mechas\": \"gundam_mechas\",\n",
664
+ " \"characters.hunter_x_hunter\": \"hunter_x_hunter_characters\",\n",
665
+ " \"characters.jojo_no_kimyou_na_bouken\": \"jojo_no_kimyou_na_bouken_characters\",\n",
666
+ " \"characters.kamen_rider\": \"kamen_rider_characters\",\n",
667
+ " \"characters.kantai_collection\": \"kantai_collection_characters\",\n",
668
+ " \"characters.kingdom_hearts\": \"kingdom_hearts_characters\",\n",
669
+ " \"characters.mahou_sensei_negima\": \"mahou_sensei_negima_characters\",\n",
670
+ " \"characters.meitantei_conan\": \"meitantei_conan_characters\",\n",
671
+ " \"characters.minecraft\": \"minecraft_characters\",\n",
672
+ " \"characters.naruto\": \"naruto_characters\",\n",
673
+ " \"characters.nippon_ichi\": \"nippon_ichi_characters\",\n",
674
+ " \"characters.one_piece\": \"one_piece_characters\",\n",
675
+ " \"characters.oshiro_project\": \"oshiro_project_characters\",\n",
676
+ " \"characters.pokemon\": \"pokemon\",\n",
677
+ " \"characters.pokemon.elite_four_members\": \"elite_four_members\",\n",
678
+ " \"characters.pokemon.gym_leaders\": \"gym_leaders\",\n",
679
+ " \"characters.pokemon.families_of_pokemon_main_characters\": \"families_of_pokemon_main_characters\",\n",
680
+ " \"characters.pokemon.pokemon_ranger_characters\": \"pokemon_ranger_characters\",\n",
681
+ " \"characters.pokemon.pokemon_trainer_classes\": \"pokemon_trainer_classes\",\n",
682
+ " \"characters.pretty_cure\": \"pretty_cure_characters\",\n",
683
+ " \"characters.ragnarok_online\": \"ragnarok_online_characters\",\n",
684
+ " \"characters.rosenkreuzstilette\": \"rosenkreuzstilette_characters\",\n",
685
+ " \"characters.sailor_moon\": \"sailor_moon_characters\",\n",
686
+ " \"characters.street_fighter\": \"street_fighter_characters\",\n",
687
+ " \"characters.super_smash_bros\": \"super_smash_bros_characters\",\n",
688
+ " \"characters.world_witches_series\": \"world_witches_series_characters\",\n",
689
+ " \"characters.tales_of\": \"tales_of_characters\",\n",
690
+ " \"characters.toaru_majutsu_no_index\": \"toaru_majutsu_no_index_characters\",\n",
691
+ " \"characters.touhou\": \"touhou_characters\",\n",
692
+ " \"characters.touken_ranbu\": \"touken_ranbu_characters\",\n",
693
+ " \"characters.ultra_series\": \"ultra_series_characters\",\n",
694
+ " \"characters.umamusume\": \"umamusume_characters\",\n",
695
+ " \"characters.vocaloid\": \"vocaloid_characters\",\n",
696
+ " \"characters.yu_gi_oh\": \"yu_gi_oh_characters\",\n",
697
+ " \"characters.genderswap\": \"genderswap_characters\",\n",
698
+ " \"characters.official_mascots\": \"official_mascots\",\n",
699
+ " \"characters.real_life_racehorses\": \"real_life_racehorses\",\n",
700
+ "\n",
701
+ "\n",
702
+ " # metatags\n",
703
+ " \"metatags\": \"metatags\",\n",
704
+ " \"drawing_software\": \"drawing_software\",\n",
705
+ " \n",
706
+ "}\n"
707
+ ]
708
+ },
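
Each key in tag_groups is a dotted path into the nested hierarchy above. The fetcher in the final cell splits such a path and creates any missing parent dictionaries before attaching the fetched tags; a simplified sketch of that walk (ensure_path is a hypothetical helper, not part of the notebook):

    def ensure_path(root: dict, dotted: str) -> dict:
        node = root
        for part in dotted.split(".")[:-1]:
            node = node.setdefault(part, {})  # create missing parent levels
        return node  # dict that will hold the final path segment

    parent = ensure_path({}, "body.body_parts.hair.hair_color")
    # parent is now the "hair" dict inside {"body": {"body_parts": {"hair": {}}}}

Note that a few entries deviate from this scheme (e.g. "plants" maps to "plant" while its children use the "plant." prefix), which is why the output below reports both a plants and a plant branch.
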
709
+ {
710
+ "cell_type": "code",
711
+ "execution_count": 10,
712
+ "id": "c4759995-396a-4ec1-9f35-e75b3875e13c",
713
+ "metadata": {
714
+ "execution": {
715
+ "iopub.execute_input": "2025-03-17T12:48:44.138013Z",
716
+ "iopub.status.busy": "2025-03-17T12:48:44.137852Z",
717
+ "iopub.status.idle": "2025-03-17T12:48:44.491853Z",
718
+ "shell.execute_reply": "2025-03-17T12:48:44.491271Z",
719
+ "shell.execute_reply.started": "2025-03-17T12:48:44.137999Z"
720
+ }
721
+ },
722
+ "outputs": [],
723
+ "source": [
724
+ "list_based_categories = [\n",
725
+ " # Image composition style\n",
726
+ " \"style_parodies\",\n",
727
+ " # Objects\n",
728
+ " \"computer\",\n",
729
+ " \"airplanes\",\n",
730
+ " \"armor\",\n",
731
+ " \"ground_vehicles\",\n",
732
+ " \"helicopters\",\n",
733
+ " \"pokemon_objects\",\n",
734
+ " \"ships\",\n",
735
+ " \"weapons\",\n",
736
+ " \"playing_card_faces\",\n",
737
+ " \n",
738
+ " # Creatures\n",
739
+ " \"animals\",\n",
740
+ " \"birds\",\n",
741
+ " \"cats\",\n",
742
+ " \"dogs\",\n",
743
+ " \"legendary_creatures\",\n",
744
+ "\n",
745
+ " # Plants\n",
746
+ " \"plant\",\n",
747
+ " \"tree\",\n",
748
+ " \"flowers\",\n",
749
+ "\n",
750
+ " # Games\n",
751
+ " \"game_activities\",\n",
752
+ " \"board_games\",\n",
753
+ " \"sports\",\n",
754
+ " \"video_game\",\n",
755
+ " \"fighting_games\",\n",
756
+ " \"platform_games\",\n",
757
+ " #\"role-playing_games\",\n",
758
+ " \"shooter_games\",\n",
759
+ " \"visual_novel_games\",\n",
760
+ "\n",
761
+ " # Real World\n",
762
+ " \"companies_and_brand_names\",\n",
763
+ " \"holidays_and_celebrations\",\n",
764
+ " \"jobs\",\n",
765
+ " \"locations\",\n",
766
+ " \"people\",\n",
767
+ " \"real_world_locations\",\n",
768
+ "\n",
769
+ " # More Categories\n",
770
+ " \"dances\",\n",
771
+ " \"family_relationships\",\n",
772
+ " \"food_tags\",\n",
773
+ " \"fire\",\n",
774
+ " \"groups\",\n",
775
+ " \"phrases\",\n",
776
+ " \"scan\",\n",
777
+ " \"subjective\",\n",
778
+ " \"technology\",\n",
779
+ " \"verbs_and_gerunds\",\n",
780
+ " \"water\",\n",
782
+ "\n",
783
+ " # Artists\n",
784
+ " \"named_drawfags\",\n",
785
+ " \"pixiv_projects\",\n",
786
+ "\n",
787
+ " # Characters\n",
788
+ " \"ace_attorney_characters\",\n",
789
+ " \"arknights_characters\",\n",
790
+ " \"atelier_characters\",\n",
791
+ " \"azur_lane_characters\",\n",
792
+ " \"bleach_characters\",\n",
793
+ " \"bokujou_monogatari_characters\",\n",
794
+ " \"brave_girl_ravens_characters\",\n",
795
+ " \"cardcaptor_sakura_characters\",\n",
796
+ " \"danganronpa_characters\",\n",
797
+ " \"digimon\",\n",
798
+ " \"digimon_characters\",\n",
799
+ " \"dragon_ball_characters\",\n",
800
+ " \"dragon_quest_characters\",\n",
801
+ " \"fate_series_characters\",\n",
802
+ " \"final_fantasy_characters\",\n",
803
+ " \"fire_emblem_characters\",\n",
804
+ " \"flower_knight_girl_characters\",\n",
805
+ " \"gensou_suikoden_characters\",\n",
806
+ " \"girls_frontline_characters\",\n",
807
+ " \"girls_und_panzer_characters\",\n",
808
+ " \"gundam_mechas\",\n",
809
+ " \"hunter_x_hunter_characters\",\n",
810
+ " \"jojo_no_kimyou_na_bouken_characters\",\n",
811
+ " \"kamen_rider_characters\",\n",
812
+ " \"kantai_collection_characters\",\n",
813
+ " \"kingdom_hearts_characters\",\n",
814
+ " \"mahou_sensei_negima_characters\",\n",
815
+ " \"meitantei_conan_characters\",\n",
816
+ " \"minecraft_characters\",\n",
817
+ " \"naruto_characters\",\n",
818
+ " \"nippon_ichi_characters\",\n",
819
+ " \"one_piece_characters\",\n",
820
+ " \"oshiro_project_characters\",\n",
821
+ " \"pokemon_characters\",\n",
822
+ " \"elite_four_members\",\n",
823
+ " \"gym_leaders\",\n",
824
+ " \"families_of_pokemon_main_characters\",\n",
825
+ " \"pokemon_ranger_characters\",\n",
826
+ " \"pokemon_trainer_classes\",\n",
827
+ " \"pokemon\",\n",
828
+ " \"pretty_cure_characters\",\n",
829
+ " \"ragnarok_online_characters\",\n",
830
+ " \"rosenkreuzstilette_characters\",\n",
831
+ " \"sailor_moon_characters\",\n",
832
+ " \"street_fighter_characters\",\n",
833
+ " \"super_smash_bros_characters\",\n",
834
+ " \"world_witches_series_characters\",\n",
835
+ " \"tales_of_characters\",\n",
836
+ " \"toaru_majutsu_no_index_characters\",\n",
837
+ " \"touhou_characters\",\n",
838
+ " \"touken_ranbu_characters\",\n",
839
+ " \"ultra_series_characters\",\n",
840
+ " \"umamusume_characters\",\n",
841
+ " \"vocaloid_characters\",\n",
842
+ " \"yu_gi_oh_characters\",\n",
843
+ " \"genderswap_characters\",\n",
844
+ " \"official_mascots\",\n",
845
+ " \"real_life_racehorses\",\n",
846
+ "\n",
847
+ "\n",
848
+ " \n",
849
+ " # Other Lists\n",
850
+ " \"disambiguation_pages\",\n",
851
+ " \"magazine_publications\",\n",
852
+ " \"special_moves\",\n",
853
+ " \"uniforms\",\n",
854
+ " \"pokemon_media\",\n",
855
+ " \"tagged_songs\",\n",
856
+ " \"vocaloid_derivatives\",\n",
857
+ " \"vocaloid_songs\",\n",
858
+ " \"vocal_synthesizers\",\n",
859
+ " \"vocal_synth_derivatives\",\n",
860
+ " \"vocal_synth_songs\",\n",
861
+ " \"deemo_songs\",\n",
862
+ " #\"role_playing_games\",\n",
863
+ "\n",
864
+ " # Metatags\n",
865
+ " #\"metatags\",\n",
866
+ " #\"drawing_software\",\n",
867
+ "\n",
868
+ " # Pool Groups & Meta-Wikis\n",
869
+ " \"meta_wikis\"\n",
870
+ "]\n"
871
+ ]
872
+ },
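
Membership in list_based_categories only changes how the wiki page is looked up: for these names, fetch_wiki_page below tries the "list_of_" prefix first and falls back to "tag_group:", whereas ordinary groups are queried with "tag_group:" alone.
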
873
+ {
874
+ "cell_type": "code",
875
+ "execution_count": 11,
876
+ "id": "8f8b6b29-e7bd-4814-8705-2b7a68f2d660",
877
+ "metadata": {
878
+ "execution": {
879
+ "iopub.execute_input": "2025-03-17T12:48:52.253272Z",
880
+ "iopub.status.busy": "2025-03-17T12:48:52.252077Z",
881
+ "iopub.status.idle": "2025-03-17T12:48:52.256578Z",
882
+ "shell.execute_reply": "2025-03-17T12:48:52.256016Z",
883
+ "shell.execute_reply.started": "2025-03-17T12:48:52.253248Z"
884
+ }
885
+ },
886
+ "outputs": [],
887
+ "source": [
888
+ "special_wiki_pages = [\n",
889
+ " \"plant\", \"tree\", \"computer\", \"on_object\", \"injury\", \"swimsuit\", \"on\" , \"mask\" # Add more here if needed\n",
890
+ "]"
891
+ ]
892
+ },
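
These titles have plain wiki pages rather than tag_group: or list_of_ pages, so fetch_wiki_page queries them with no prefix at all.
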
893
+ {
894
+ "cell_type": "code",
895
+ "execution_count": 12,
896
+ "id": "79234bc5-b287-4306-85cd-a27ddba769ea",
897
+ "metadata": {
898
+ "execution": {
899
+ "iopub.execute_input": "2025-03-17T12:48:55.161786Z",
900
+ "iopub.status.busy": "2025-03-17T12:48:55.160800Z",
901
+ "iopub.status.idle": "2025-03-17T12:48:55.165137Z",
902
+ "shell.execute_reply": "2025-03-17T12:48:55.164568Z",
903
+ "shell.execute_reply.started": "2025-03-17T12:48:55.161760Z"
904
+ }
905
+ },
906
+ "outputs": [],
907
+ "source": [
908
+ "import base64\n",
909
+ "HEADERS = {\n",
910
+ " \"Authorization\": f\"Basic {base64.b64encode(f'{username}:{api_key}'.encode()).decode()}\"\n",
911
+ "}"
912
+ ]
913
+ },
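
Note that HEADERS is not referenced by the fetcher code shown below, which authenticates with the AUTH = (username, api_key) tuple instead; requests builds the same Authorization: Basic header from that tuple internally, so the two approaches are interchangeable.
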
914
+ {
915
+ "cell_type": "code",
916
+ "execution_count": 13,
917
+ "id": "64d47678-78a4-4d55-a3a0-1c1b251b03b6",
918
+ "metadata": {
919
+ "execution": {
920
+ "iopub.execute_input": "2025-03-17T12:40:12.358876Z",
921
+ "iopub.status.busy": "2025-03-17T12:40:12.358573Z",
922
+ "iopub.status.idle": "2025-03-17T12:40:43.547891Z",
923
+ "shell.execute_reply": "2025-03-17T12:40:43.547451Z",
924
+ "shell.execute_reply.started": "2025-03-17T12:40:12.358851Z"
925
+ }
926
+ },
927
+ "outputs": [
928
+ {
929
+ "name": "stdout",
930
+ "output_type": "stream",
931
+ "text": [
932
+ "🚀 Starting Danbooru Tag Hierarchy API Fetcher...\n",
933
+ "❌ No data found for attire_and_body_accessories in any format\n",
934
+ "✅ Added 91 tags directly under attire_and_body_accessories.attire.dress\n",
935
+ "✅ Added 100 tags directly under attire_and_body_accessories.attire.handwear\n",
936
+ "✅ Added 251 tags directly under attire_and_body_accessories.attire.headwear\n",
937
+ "✅ Added 69 tags directly under attire_and_body_accessories.attire.legwear\n",
938
+ "✅ Added 64 tags directly under attire_and_body_accessories.attire.mask\n",
939
+ "✅ Added 264 tags directly under attire_and_body_accessories.attire.neck_and_neckwear\n",
940
+ "✅ Added 54 tags directly under attire_and_body_accessories.attire.sexual_attire.bra\n",
941
+ "✅ Added 83 tags directly under attire_and_body_accessories.attire.sexual_attire.panties\n",
942
+ "✅ Added 87 tags directly under attire_and_body_accessories.attire.sleeves\n",
943
+ "✅ Added 83 tags directly under attire_and_body_accessories.attire.swimsuit\n",
944
+ "✅ Added 4 tags directly under attire_and_body_accessories.embellishment\n",
945
+ "✅ Added 94 tags directly under attire_and_body_accessories.eyewear\n",
946
+ "✅ Added 62 tags directly under attire_and_body_accessories.fashion_style\n",
947
+ "✅ Added 189 tags directly under attire_and_body_accessories.nudity\n",
948
+ "❌ No data found for body in any format\n",
949
+ "✅ Added 40 tags directly under body.body_parts.ass\n",
950
+ "✅ Added 213 tags directly under body.body_parts.breasts_tags\n",
951
+ "✅ Added 54 tags directly under body.body_parts.ears_tags\n",
952
+ "✅ Added 200 tags directly under body.body_parts.face_tags.eyes_tags\n",
953
+ "✅ Added 31 tags directly under body.body_parts.hair.hair_color\n",
954
+ "✅ Added 157 tags directly under body.body_parts.hair.hair_styles\n",
955
+ "✅ Added 117 tags directly under body.body_parts.hands.gestures\n",
956
+ "✅ Added 264 tags directly under body.body_parts.neck_and_neckwear\n",
957
+ "✅ Added 74 tags directly under body.body_parts.penis\n",
958
+ "✅ Added 230 tags directly under body.body_parts.posture\n",
959
+ "✅ Added 50 tags directly under body.body_parts.pussy\n",
960
+ "✅ Added 64 tags directly under body.body_parts.shoulders\n",
961
+ "✅ Added 26 tags directly under body.body_parts.skin_color\n",
962
+ "✅ Added 79 tags directly under body.body_parts.tail\n",
963
+ "✅ Added 90 tags directly under body.body_parts.wings\n",
964
+ "✅ Added 55 tags directly under body.injury\n",
965
+ "❌ No data found for characters in any format\n",
966
+ "✅ Added 300 tags directly under characters.ace_attorney\n",
967
+ "✅ Added 579 tags directly under characters.arknights\n",
968
+ "✅ Added 296 tags directly under characters.atelier\n",
969
+ "✅ Added 706 tags directly under characters.azur_lane\n",
970
+ "✅ Added 223 tags directly under characters.bleach\n",
971
+ "✅ Added 98 tags directly under characters.bokujou_monogatari\n",
972
+ "✅ Added 46 tags directly under characters.brave_girl_ravens\n",
973
+ "✅ Added 144 tags directly under characters.cardcaptor_sakura\n",
974
+ "✅ Added 128 tags directly under characters.danganronpa\n",
975
+ "✅ Added 240 tags directly under characters.digimon.digimon_characters\n",
976
+ "✅ Added 613 tags directly under characters.dragon_ball\n",
977
+ "✅ Added 331 tags directly under characters.dragon_quest\n",
978
+ "✅ Added 814 tags directly under characters.fate_series\n",
979
+ "✅ Added 566 tags directly under characters.final_fantasy\n",
980
+ "✅ Added 727 tags directly under characters.fire_emblem\n",
981
+ "✅ Added 441 tags directly under characters.flower_knight_girl\n",
982
+ "✅ Added 221 tags directly under characters.genderswap\n",
983
+ "✅ Added 431 tags directly under characters.gensou_suikoden\n",
984
+ "❌ No data found for girls_frontline_characters in any format\n",
985
+ "✅ Added 0 tags directly under characters.girls_frontline\n",
986
+ "✅ Added 305 tags directly under characters.girls_und_panzer\n",
987
+ "✅ Added 267 tags directly under characters.gundam_mechas\n",
988
+ "✅ Added 313 tags directly under characters.hunter_x_hunter\n",
989
+ "✅ Added 348 tags directly under characters.jojo_no_kimyou_na_bouken\n",
990
+ "✅ Added 308 tags directly under characters.kamen_rider\n",
991
+ "✅ Added 465 tags directly under characters.kantai_collection\n",
992
+ "✅ Added 64 tags directly under characters.kingdom_hearts\n",
993
+ "✅ Added 72 tags directly under characters.mahou_sensei_negima\n",
994
+ "✅ Added 158 tags directly under characters.meitantei_conan\n",
995
+ "✅ Added 153 tags directly under characters.minecraft\n",
996
+ "✅ Added 198 tags directly under characters.naruto\n",
997
+ "✅ Added 336 tags directly under characters.nippon_ichi\n",
998
+ "✅ Added 375 tags directly under characters.official_mascots\n",
999
+ "✅ Added 464 tags directly under characters.one_piece\n",
1000
+ "✅ Added 305 tags directly under characters.oshiro_project\n",
1001
+ "✅ Added 72 tags directly under characters.pokemon.elite_four_members\n",
1002
+ "✅ Added 165 tags directly under characters.pokemon.families_of_pokemon_main_characters\n",
1003
+ "✅ Added 100 tags directly under characters.pokemon.gym_leaders\n",
1004
+ "✅ Added 37 tags directly under characters.pokemon.pokemon_ranger_characters\n",
1005
+ "✅ Added 149 tags directly under characters.pokemon.pokemon_trainer_classes\n",
1006
+ "✅ Added 925 tags directly under characters.pretty_cure\n",
1007
+ "✅ Added 1214 tags directly under characters.ragnarok_online\n",
1008
+ "✅ Added 434 tags directly under characters.real_life_racehorses\n",
1009
+ "✅ Added 23 tags directly under characters.rosenkreuzstilette\n",
1010
+ "✅ Added 320 tags directly under characters.sailor_moon\n",
1011
+ "✅ Added 170 tags directly under characters.street_fighter\n",
1012
+ "❌ No data found for super_smash_bros_characters in any format\n",
1013
+ "✅ Added 0 tags directly under characters.super_smash_bros\n",
1014
+ "❌ No data found for tales_of_characters in any format\n",
1015
+ "✅ Added 0 tags directly under characters.tales_of\n",
1016
+ "✅ Added 189 tags directly under characters.toaru_majutsu_no_index\n",
1017
+ "✅ Added 381 tags directly under characters.touhou\n",
1018
+ "✅ Added 128 tags directly under characters.touken_ranbu\n",
1019
+ "✅ Added 540 tags directly under characters.ultra_series\n",
1020
+ "✅ Added 250 tags directly under characters.umamusume\n",
1021
+ "✅ Added 140 tags directly under characters.vocaloid\n",
1022
+ "✅ Added 184 tags directly under characters.world_witches_series\n",
1023
+ "❌ No data found for yu_gi_oh_characters in any format\n",
1024
+ "✅ Added 0 tags directly under characters.yu_gi_oh\n",
1025
+ "❌ No data found for copyrights_artists_projects_and_media in any format\n",
1026
+ "❌ No data found for genres_of_video_games in any format\n",
1027
+ "✅ Added 175 tags directly under copyrights_artists_projects_and_media.genres_of_video_games.fighting_games\n",
1028
+ "✅ Added 42 tags directly under copyrights_artists_projects_and_media.genres_of_video_games.platform_games\n",
1029
+ "✅ Added 515 tags directly under copyrights_artists_projects_and_media.genres_of_video_games.role-playing_games\n",
1030
+ "✅ Added 114 tags directly under copyrights_artists_projects_and_media.genres_of_video_games.shooter_games\n",
1031
+ "✅ Added 185 tags directly under copyrights_artists_projects_and_media.genres_of_video_games.visual_novel_games\n",
1032
+ "❌ No data found for creatures in any format\n",
1033
+ "✅ Added 232 tags directly under creatures.animals.birds\n",
1034
+ "✅ Added 224 tags directly under creatures.animals.cats\n",
1035
+ "✅ Added 237 tags directly under creatures.animals.dogs\n",
1036
+ "✅ Added 285 tags directly under creatures.legendary_creatures\n",
1037
+ "✅ Added 55 tags directly under drawing_software\n",
1038
+ "❌ No data found for games in any format\n",
1039
+ "✅ Added 21 tags directly under games.board_games\n",
1040
+ "✅ Added 175 tags directly under games.fighting_games\n",
1041
+ "✅ Added 61 tags directly under games.game_activities\n",
1042
+ "✅ Added 420 tags directly under games.sports\n",
1043
+ "✅ Added 144 tags directly under games.video_game\n",
1044
+ "✅ Added 213 tags directly under metatags\n",
1045
+ "❌ No data found for more in any format\n",
1046
+ "✅ Added 110 tags directly under more.dances\n",
1047
+ "✅ Added 29 tags directly under more.family_relationships\n",
1048
+ "✅ Added 68 tags directly under more.fire\n",
1049
+ "✅ Added 1051 tags directly under more.food_tags\n",
1050
+ "✅ Added 31 tags directly under more.groups\n",
1051
+ "✅ Added 76 tags directly under more.phrases\n",
1052
+ "✅ Added 35 tags directly under more.scan\n",
1053
+ "✅ Added 45 tags directly under more.subjective\n",
1054
+ "✅ Added 237 tags directly under more.technology\n",
1055
+ "✅ Added 446 tags directly under more.verbs_and_gerunds\n",
1056
+ "✅ Added 54 tags directly under more.water\n",
1057
+ "❌ No data found for objects in any format\n",
1058
+ "✅ Added 410 tags directly under objects.airplanes\n",
1059
+ "✅ Added 145 tags directly under objects.armor\n",
1060
+ "✅ Added 377 tags directly under objects.audio_tags\n",
1061
+ "✅ Added 66 tags directly under objects.cards.playing_card_faces\n",
1062
+ "✅ Added 80 tags directly under objects.computer\n",
1063
+ "✅ Added 94 tags directly under objects.eyewear\n",
1064
+ "✅ Added 407 tags directly under objects.ground_vehicles\n",
1065
+ "✅ Added 25 tags directly under objects.helicopters\n",
1066
+ "✅ Added 48 tags directly under objects.piercings\n",
1067
+ "✅ Added 94 tags directly under objects.pokemon_objects\n",
1068
+ "✅ Added 105 tags directly under objects.sex_objects\n",
1069
+ "✅ Added 256 tags directly under objects.ships\n",
1070
+ "✅ Added 917 tags directly under objects.weapons\n",
1071
+ "✅ Added 21 tags directly under plant.flowers\n",
1072
+ "✅ Added 59 tags directly under plant.plant\n",
1073
+ "✅ Added 44 tags directly under plant.tree\n",
1074
+ "✅ Added 59 tags directly under plants\n",
1075
+ "❌ No data found for real_world in any format\n",
1076
+ "✅ Added 363 tags directly under real_world.companies_and_brand_names\n",
1077
+ "✅ Added 138 tags directly under real_world.holidays_and_celebrations\n",
1078
+ "✅ Added 75 tags directly under real_world.jobs\n",
1079
+ "✅ Added 263 tags directly under real_world.locations\n",
1080
+ "✅ Added 1526 tags directly under real_world.people\n",
1081
+ "✅ Added 463 tags directly under real_world.real_world_locations\n",
1082
+ "❌ No data found for sex in any format\n",
1083
+ "✅ Added 16 tags directly under sex.sex_acts.simulated_sex_acts\n",
1084
+ "✅ Added 56 tags directly under sex.sexual_positions\n",
1085
+ "❌ No data found for subject in any format\n",
1086
+ "❌ No data found for anthro in any format\n",
1087
+ "❌ No data found for cat_boy in any format\n",
1088
+ "✅ Added 0 tags directly under subject.anthro.cat_boy\n",
1089
+ "❌ No data found for cat_girl in any format\n",
1090
+ "✅ Added 0 tags directly under subject.anthro.cat_girl\n",
1091
+ "❌ No data found for dog_girl in any format\n",
1092
+ "✅ Added 0 tags directly under subject.anthro.dog_girl\n",
1093
+ "❌ No data found for fox_girl in any format\n",
1094
+ "✅ Added 0 tags directly under subject.anthro.fox_girl\n",
1095
+ "❌ No data found for furry in any format\n",
1096
+ "✅ Added 0 tags directly under subject.anthro.furry\n",
1097
+ "❌ No data found for female in any format\n",
1098
+ "❌ No data found for 1girl in any format\n",
1099
+ "✅ Added 0 tags directly under subject.female.1girl\n",
1100
+ "❌ No data found for 2girls in any format\n",
1101
+ "✅ Added 0 tags directly under subject.female.2girls\n",
1102
+ "❌ No data found for female_general in any format\n",
1103
+ "✅ Added 0 tags directly under subject.female_general\n",
1104
+ "❌ No data found for koma in any format\n",
1105
+ "❌ No data found for 1koma in any format\n",
1106
+ "✅ Added 0 tags directly under subject.koma.1koma\n",
1107
+ "❌ No data found for 2koma in any format\n",
1108
+ "✅ Added 0 tags directly under subject.koma.2koma\n",
1109
+ "❌ No data found for male in any format\n",
1110
+ "❌ No data found for 1boy in any format\n",
1111
+ "✅ Added 0 tags directly under subject.male.1boy\n",
1112
+ "❌ No data found for 2boys in any format\n",
1113
+ "✅ Added 0 tags directly under subject.male.2boys\n",
1114
+ "❌ No data found for 3boys in any format\n",
1115
+ "✅ Added 0 tags directly under subject.male.3boys\n",
1116
+ "❌ No data found for 4boys in any format\n",
1117
+ "✅ Added 0 tags directly under subject.male.4boys\n",
1118
+ "❌ No data found for 5boys in any format\n",
1119
+ "✅ Added 0 tags directly under subject.male.5boys\n",
1120
+ "❌ No data found for 6+boys in any format\n",
1121
+ "✅ Added 0 tags directly under subject.male.6+boys\n",
1122
+ "❌ No data found for boy in any format\n",
1123
+ "✅ Added 0 tags directly under subject.male.boy\n",
1124
+ "❌ No data found for man in any format\n",
1125
+ "✅ Added 0 tags directly under subject.male.man\n",
1126
+ "❌ No data found for multiple_boys in any format\n",
1127
+ "✅ Added 0 tags directly under subject.male.multiple_boys\n",
1128
+ "❌ No data found for visual_characteristics in any format\n",
1129
+ "❌ No data found for image_composition_and_style in any format\n",
1130
+ "✅ Added 73 tags directly under visual_characteristics.image_composition_and_style.artistic_license\n",
1131
+ "✅ Added 115 tags directly under visual_characteristics.image_composition_and_style.image_composition.backgrounds\n",
1132
+ "✅ Added 90 tags directly under visual_characteristics.image_composition_and_style.image_composition.censorship\n",
1133
+ "✅ Added 54 tags directly under visual_characteristics.image_composition_and_style.image_composition.colors\n",
1134
+ "✅ Added 28 tags directly under visual_characteristics.image_composition_and_style.image_composition.focus_tags\n",
1135
+ "✅ Added 55 tags directly under visual_characteristics.image_composition_and_style.image_composition.lighting\n",
1136
+ "✅ Added 76 tags directly under visual_characteristics.image_composition_and_style.image_composition.prints\n",
1137
+ "✅ Added 495 tags directly under visual_characteristics.image_composition_and_style.image_composition.style_parodies\n",
1138
+ "✅ Added 41 tags directly under visual_characteristics.image_composition_and_style.patterns\n",
1139
+ "✅ Added 310 tags directly under visual_characteristics.image_composition_and_style.symbols\n",
1140
+ "✅ Added 242 tags directly under visual_characteristics.image_composition_and_style.text\n",
1141
+ "✅ Added 62 tags directly under visual_characteristics.image_composition_and_style.year_tags\n",
1142
+ "✅ Finished building hierarchy.\n"
1143
+ ]
1144
+ },
1145
+ {
1146
+ "ename": "FileNotFoundError",
1147
+ "evalue": "[Errno 2] No such file or directory: '/home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/misc/danbooru_donmai/danbooru_tags_step_01.json'",
1148
+ "output_type": "error",
1149
+ "traceback": [
1150
+ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
1151
+ "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)",
1152
+ "Cell \u001b[0;32mIn[13], line 198\u001b[0m\n\u001b[1;32m 196\u001b[0m \u001b[38;5;66;03m# Save cleaned JSON\u001b[39;00m\n\u001b[1;32m 197\u001b[0m output_file \u001b[38;5;241m=\u001b[39m current_dir\u001b[38;5;241m.\u001b[39mparent \u001b[38;5;241m/\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmisc/danbooru_donmai/danbooru_tags_step_01.json\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m--> 198\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28;43mopen\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43moutput_file\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mw\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mencoding\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mutf-8\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m \u001b[38;5;28;01mas\u001b[39;00m f:\n\u001b[1;32m 199\u001b[0m json\u001b[38;5;241m.\u001b[39mdump(manual_hierarchy, f, indent\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m4\u001b[39m, ensure_ascii\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[1;32m 201\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m✅ Hierarchy saved to `\u001b[39m\u001b[38;5;132;01m{\u001b[39;00moutput_file\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m`\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
1153
+ "File \u001b[0;32m~/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py:324\u001b[0m, in \u001b[0;36m_modified_open\u001b[0;34m(file, *args, **kwargs)\u001b[0m\n\u001b[1;32m 317\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m file \u001b[38;5;129;01min\u001b[39;00m {\u001b[38;5;241m0\u001b[39m, \u001b[38;5;241m1\u001b[39m, \u001b[38;5;241m2\u001b[39m}:\n\u001b[1;32m 318\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 319\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mIPython won\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mt let you open fd=\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mfile\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m by default \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 320\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mas it is likely to crash IPython. If you know what you are doing, \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 321\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124myou can use builtins\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m open.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 322\u001b[0m )\n\u001b[0;32m--> 324\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mio_open\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfile\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
1154
+ "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '/home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/misc/danbooru_donmai/danbooru_tags_step_01.json'"
1155
+ ]
1156
+ }
1157
+ ],
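
The traceback above reflects a missing output directory rather than an API failure: the hierarchy was fully built, but misc/danbooru_donmai/ was never created. The usual fix, following the same pattern the tag_info_test cell uses earlier, is to create the parent directory before writing (a sketch of the corrected save step, using the current_dir, json, and manual_hierarchy names from this notebook):

    output_file = current_dir.parent / "misc/danbooru_donmai/danbooru_tags_step_01.json"
    output_file.parent.mkdir(parents=True, exist_ok=True)  # create misc/danbooru_donmai/ if missing
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(manual_hierarchy, f, indent=4, ensure_ascii=False)
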
1158
+ "source": [
1159
+ "import requests\n",
1160
+ "import json\n",
1161
+ "import time\n",
1162
+ "import re\n",
1163
+ "from urllib.parse import quote\n",
1164
+ "\n",
1165
+ "# Base API URL\n",
1166
+ "BASE_URL = \"https://danbooru.donmai.us\"\n",
1167
+ "\n",
1168
+ "# Authentication (Replace with your credentials)\n",
1169
+ "AUTH = (username, api_key)\n",
1170
+ "\n",
1171
+ "# Storage for missing tags\n",
1172
+ "missing_tags = []\n",
1173
+ "\n",
1174
+ "# ✅ Initialize manual_hierarchy\n",
1175
+ "manual_hierarchy = {}\n",
1176
+ "\n",
1177
+ "def clean_hierarchy_and_count_tags(hierarchy):\n",
1178
+ " \"\"\"\n",
1179
+ " Recursively removes empty categories from the JSON hierarchy and counts total tags.\n",
1180
+ " \"\"\"\n",
1181
+ " total_tag_count = 0\n",
1182
+ "\n",
1183
+ " def clean_dict(d):\n",
1184
+ " nonlocal total_tag_count\n",
1185
+ " keys_to_delete = []\n",
1186
+ "\n",
1187
+ " for key, value in d.items():\n",
1188
+ " if isinstance(value, dict):\n",
1189
+ " clean_dict(value) # Recursively clean subcategories\n",
1190
+ " \n",
1191
+ " # ✅ If the subcategory is empty after cleaning, mark it for removal\n",
1192
+ " if not value:\n",
1193
+ " keys_to_delete.append(key)\n",
1194
+ " elif isinstance(value, list):\n",
1195
+ " # ✅ Count the total number of tags\n",
1196
+ " total_tag_count += len(value)\n",
1197
+ "\n",
1198
+ " # ✅ If the list is empty, mark it for deletion\n",
1199
+ " if not value:\n",
1200
+ " keys_to_delete.append(key)\n",
1201
+ "\n",
1202
+ " # ✅ Remove empty keys\n",
1203
+ " for key in keys_to_delete:\n",
1204
+ " del d[key]\n",
1205
+ "\n",
1206
+ " clean_dict(hierarchy)\n",
1207
+ "\n",
1208
+ " return hierarchy, total_tag_count\n",
1209
+ "\n",
1210
+ "\n",
1211
+ "def fetch_wiki_page(tag, is_list=False):\n",
1212
+ " \"\"\"\n",
1213
+ " Fetches the wiki page of a tag group, list category, or special wiki page using the API.\n",
1214
+ " Uses list_of_ for lists, tag_group: for regular groups, \n",
1215
+ " and direct queries for special cases.\n",
1216
+ " \"\"\"\n",
1217
+ " if tag in special_wiki_pages:\n",
1218
+ " prefixes = [\"\"] # Query directly with no prefix\n",
1219
+ " #print(f\"🔍 {tag} is a special wiki page, querying directly...\")\n",
1220
+ " elif is_list:\n",
1221
+ " prefixes = [\"list_of_\", \"tag_group:\"]\n",
1222
+ " else:\n",
1223
+ " prefixes = [\"tag_group:\"]\n",
1224
+ "\n",
1225
+ " # print(f\"🚀 Fetching {prefixes} {tag}\") # Debugging print\n",
1226
+ "\n",
1227
+ " for prefix in prefixes:\n",
1228
+ " query_tag = f\"{prefix}{tag}\".strip() # Avoid unnecessary :\n",
1229
+ " encoded_query = quote(query_tag, safe=\"\") # Proper URL encoding\n",
1230
+ " url = f\"{BASE_URL}/wiki_pages.json?search[title]={encoded_query}\"\n",
1231
+ "\n",
1232
+ " # print(f\"🚀 Fetching: {query_tag}\")\n",
1233
+ "\n",
1234
+ " try:\n",
1235
+ " response = requests.get(url, auth=AUTH) # Use authentication\n",
1236
+ " response.raise_for_status() # Raise error for bad responses (4xx, 5xx)\n",
1237
+ " wiki_data = response.json()\n",
1238
+ "\n",
1239
+ " if wiki_data and isinstance(wiki_data, list) and len(wiki_data) > 0:\n",
1240
+ " #print(f\"✅ Data found using {query_tag}\")\n",
1241
+ " return wiki_data[0].get(\"body\", \"\") # Extract the \"body\" text\n",
1242
+ "\n",
1243
+ " except requests.exceptions.HTTPError as e:\n",
1244
+ " if response.status_code == 401:\n",
1245
+ " print(f\"❌ Authentication Error (401 Unauthorized) for {query_tag}. Check your API key!\")\n",
1246
+ " exit() # Stop execution if authentication fails\n",
1247
+ " else:\n",
1248
+ " print(f\"❌ Error fetching {query_tag}: {e}\")\n",
1249
+ "\n",
1250
+ " # If all attempts fail\n",
1251
+ " print(f\"❌ No data found for {tag} in any format\")\n",
1252
+ " missing_tags.append(tag)\n",
1253
+ " return None\n",
1254
+ "\n",
1255
+ "\n",
1256
+ "\n",
1257
+ "\n",
1258
+ "def build_tag_hierarchy(tag_groups):\n",
1259
+ " \"\"\"\n",
1260
+ " Builds and structures the hierarchy properly, ensuring:\n",
1261
+ " - Categories with subcategories store their direct tags in `_general`\n",
1262
+ " - Parent categories exist before adding children\n",
1263
+ " - Ensures subject and other key groups are correctly retained\n",
1264
+ " \"\"\"\n",
1265
+ " processed_groups = set()\n",
1266
+ "\n",
1267
+ " for hierarchy_path, tag_group in sorted(tag_groups.items(), key=lambda x: x[0]):\n",
1268
+ " is_list = tag_group in list_based_categories # Check if it's a list-based category\n",
1269
+ "\n",
1270
+ " levels = hierarchy_path.split(\".\")\n",
1271
+ " current_level = manual_hierarchy\n",
1272
+ "\n",
1273
+ " for key in levels[:-1]: # Ensure each parent level exists\n",
1274
+ " if key not in current_level or not isinstance(current_level[key], dict):\n",
1275
+ " current_level[key] = {} # Create dictionary if missing\n",
1276
+ " current_level = current_level[key]\n",
1277
+ "\n",
1278
+ " last_level = levels[-1]\n",
1279
+ " has_subcategories = any(k.startswith(hierarchy_path + \".\") for k in tag_groups.keys())\n",
1280
+ "\n",
1281
+ " # ✅ Ensure the category itself exists\n",
1282
+ " if last_level not in current_level:\n",
1283
+ " current_level[last_level] = {} if has_subcategories else []\n",
1284
+ "\n",
1285
+ " # ✅ Fetch tags from API\n",
1286
+ " wiki_text = fetch_wiki_page(tag_group, is_list)\n",
1287
+ " extracted_tags = extract_tags_from_wiki(wiki_text, is_list)\n",
1288
+ "\n",
1289
+ " # ✅ Store tags under `<category>_general` if there are subcategories\n",
1290
+ " if has_subcategories:\n",
1291
+ " general_key = f\"{last_level}_general\"\n",
1292
+ "\n",
1293
+ " if isinstance(current_level[last_level], list):\n",
1294
+ " # print(f\"⚠️ Warning: {last_level} was a list but has subcategories. Converting to dictionary.\")\n",
1295
+ " current_level[last_level] = {}\n",
1296
+ "\n",
1297
+ " if general_key not in current_level[last_level]:\n",
1298
+ " current_level[last_level][general_key] = [] # Initialize `_general` list\n",
1299
+ "\n",
1300
+ " current_level[last_level][general_key].extend(extracted_tags)\n",
1301
+ " #print(f\"✅ Added {len(extracted_tags)} tags under {hierarchy_path} → {general_key}\")\n",
1302
+ "\n",
1303
+ " else:\n",
1304
+ " # ✅ If no subcategories exist, store tags directly\n",
1305
+ " if isinstance(current_level[last_level], list):\n",
1306
+ " current_level[last_level].extend(extracted_tags)\n",
1307
+ " else:\n",
1308
+ " # print(f\"⚠️ Warning: {last_level} was a dictionary but has no subcategories. Converting to list.\")\n",
1309
+ " current_level[last_level] = extracted_tags\n",
1310
+ "\n",
1311
+ " print(f\"✅ Added {len(extracted_tags)} tags directly under {hierarchy_path}\")\n",
1312
+ "\n",
1313
+ "    print(\"✅ Finished building hierarchy.\")\n",
1314
+ "\n",
1315
+ "\n",
1316
+ "\n",
1317
+ "\n",
1318
+ "\n",
1319
+ "def extract_tags_from_wiki(wiki_text, is_list=False):\n",
1320
+ " \"\"\"\n",
1321
+ " Extracts valid tags from the wiki text.\n",
1322
+ " Uses different extraction logic for tag groups and list-based pages.\n",
1323
+ " \"\"\"\n",
1324
+ " if not wiki_text:\n",
1325
+ " return []\n",
1326
+ "\n",
1327
+ " if is_list:\n",
1328
+ " # Extract `[[tag_name]]` from list pages\n",
1329
+ " tag_pattern = re.compile(r\"\\[\\[(.*?)\\]\\]\")\n",
1330
+ " tags = tag_pattern.findall(wiki_text)\n",
1331
+ " else:\n",
1332
+ "        # Extract \`[[tag_name]]\` from tag groups (the first link is the \"Tag Groups\" header, so skip it)\n",
1333
+ " tag_pattern = re.compile(r\"\\[\\[(.*?)\\]\\]\")\n",
1334
+ " tags = tag_pattern.findall(wiki_text)\n",
1335
+ " tags = tags[1:] if tags else [] # Skip first tag\n",
1336
+ "\n",
1337
+ " # Clean tags by removing alternative names `[[Tag|Alternative]]`\n",
1338
+ " cleaned_tags = [tag.split(\"|\")[0].strip() for tag in tags]\n",
1339
+ "\n",
1340
+ " return cleaned_tags\n",
1341
+ "\n",
1342
+ "if __name__ == \"__main__\":\n",
1343
+ " print(\"🚀 Starting Danbooru Tag Hierarchy API Fetcher...\")\n",
1344
+ "\n",
1345
+ " # ✅ Ensure manual_hierarchy exists before cleaning\n",
1346
+ " manual_hierarchy = {}\n",
1347
+ "\n",
1348
+ " # Build hierarchy dynamically\n",
1349
+ " build_tag_hierarchy(tag_groups)\n",
1350
+ "\n",
1351
+ " # ✅ Clean hierarchy and count total tags\n",
1352
+ " manual_hierarchy, total_tags = clean_hierarchy_and_count_tags(manual_hierarchy)\n",
1353
+ "\n",
1354
+ " # Save cleaned JSON\n",
1355
+ " output_file = current_dir.parent / \"misc/danbooru_donmai/danbooru_tags_step_01.json\"\n",
1356
+ " with open(output_file, \"w\", encoding=\"utf-8\") as f:\n",
1357
+ " json.dump(manual_hierarchy, f, indent=4, ensure_ascii=False)\n",
1358
+ "\n",
1359
+ " print(f\"\\n✅ Hierarchy saved to `{output_file}`\")\n",
1360
+ " print(f\"📊 Total tags in the final hierarchy: {total_tags}\")\n",
1361
+ "\n",
1362
+ " if missing_tags:\n",
1363
+ " print(f\"⚠️ The following tag groups were not found: {missing_tags}\")\n"
1364
+ ]
1365
+ },
1366
+ {
1367
+ "cell_type": "markdown",
1368
+ "id": "5bf6fa19-04dd-4799-8220-e37a4434a1f8",
1369
+ "metadata": {},
1370
+ "source": [
1371
+ "### Add subject keys etc"
1372
+ ]
1373
+ },
1374
+ {
1375
+ "cell_type": "code",
1376
+ "execution_count": 15,
1377
+ "id": "9bf7c0ed-0cfd-41c8-89a4-c59f1783c823",
1378
+ "metadata": {
1379
+ "execution": {
1380
+ "iopub.execute_input": "2025-03-17T12:49:06.895515Z",
1381
+ "iopub.status.busy": "2025-03-17T12:49:06.895119Z",
1382
+ "iopub.status.idle": "2025-03-17T12:49:07.135844Z",
1383
+ "shell.execute_reply": "2025-03-17T12:49:07.135358Z",
1384
+ "shell.execute_reply.started": "2025-03-17T12:49:06.895492Z"
1385
+ }
1386
+ },
1387
+ "outputs": [
1388
+ {
1389
+ "name": "stdout",
1390
+ "output_type": "stream",
1391
+ "text": [
1392
+ "✅ Adding woman to subject.female.female_general\n",
1393
+ "✅ Adding girl to subject.female.female_general\n",
1394
+ "✅ Adding 1girl to subject.female.female_general\n",
1395
+ "🔄 Moving 2girls from more.groups to subject.female.female_general\n",
1396
+ "✅ Adding 2girls to subject.female.female_general\n",
1397
+ "🔄 Moving 3girls from more.groups to subject.female.female_general\n",
1398
+ "✅ Adding 3girls to subject.female.female_general\n",
1399
+ "🔄 Moving 4girls from more.groups to subject.female.female_general\n",
1400
+ "✅ Adding 4girls to subject.female.female_general\n",
1401
+ "🔄 Moving 5girls from more.groups to subject.female.female_general\n",
1402
+ "✅ Adding 5girls to subject.female.female_general\n",
1403
+ "🔄 Moving 6+girls from more.groups to subject.female.female_general\n",
1404
+ "✅ Adding 6+girls to subject.female.female_general\n",
1405
+ "🔄 Moving multiple girls from more.groups to subject.female.female_general\n",
1406
+ "✅ Adding multiple girls to subject.female.female_general\n",
1407
+ "🔄 Moving guitar girl from objects.audio_tags to subject.female.female_general\n",
1408
+ "✅ Adding guitar girl to subject.female.female_general\n",
1409
+ "✅ Adding man to subject.male.male_general\n",
1410
+ "✅ Adding boy to subject.male.male_general\n",
1411
+ "✅ Adding 1boy to subject.male.male_general\n",
1412
+ "🔄 Moving 2boys from more.groups to subject.male.male_general\n",
1413
+ "✅ Adding 2boys to subject.male.male_general\n",
1414
+ "🔄 Moving 3boys from more.groups to subject.male.male_general\n",
1415
+ "✅ Adding 3boys to subject.male.male_general\n",
1416
+ "🔄 Moving 4boys from more.groups to subject.male.male_general\n",
1417
+ "✅ Adding 4boys to subject.male.male_general\n",
1418
+ "🔄 Moving 5boys from more.groups to subject.male.male_general\n",
1419
+ "✅ Adding 5boys to subject.male.male_general\n",
1420
+ "🔄 Moving 6+boys from more.groups to subject.male.male_general\n",
1421
+ "✅ Adding 6+boys to subject.male.male_general\n",
1422
+ "🔄 Moving multiple boys from more.groups to subject.male.male_general\n",
1423
+ "✅ Adding multiple boys to subject.male.male_general\n",
1424
+ "✅ Adding guitar boy to subject.male.male_general\n",
1425
+ "✅ Adding 1koma to subject.koma.koma_general\n",
1426
+ "✅ Adding 2koma to subject.koma.koma_general\n",
1427
+ "🔄 Moving cat girl from creatures.animals.cats to subject.anthro.anthro_general\n",
1428
+ "✅ Adding cat girl to subject.anthro.anthro_general\n",
1429
+ "✅ Adding fox girl to subject.anthro.anthro_general\n",
1430
+ "✅ Adding dog girl to subject.anthro.anthro_general\n",
1431
+ "🔄 Moving plant girl from plant.plant to subject.anthro.anthro_general\n",
1432
+ "🔄 Moving plant girl from plants to subject.anthro.anthro_general\n",
1433
+ "✅ Adding plant girl to subject.anthro.anthro_general\n",
1434
+ "🔄 Moving plant boy from plant.plant to subject.anthro.anthro_general\n",
1435
+ "🔄 Moving plant boy from plants to subject.anthro.anthro_general\n",
1436
+ "✅ Adding plant boy to subject.anthro.anthro_general\n",
1437
+ "🔄 Moving cat boy from creatures.animals.cats to subject.anthro.anthro_general\n",
1438
+ "✅ Adding cat boy to subject.anthro.anthro_general\n",
1439
+ "✅ Adding furry to subject.anthro.anthro_general\n",
1440
+ "✅ Adding monster boy to subject.anthro.anthro_general\n",
1441
+ "✅ Adding monster girl to subject.anthro.anthro_general\n",
1442
+ "✅ Adding demon girl to subject.anthro.anthro_general\n",
1443
+ "✅ Adding demon boy to subject.anthro.anthro_general\n",
1444
+ "✅ Adding magical boy to subject.anthro.anthro_general\n",
1445
+ "🔄 Moving magical girl from characters.sailor_moon to subject.anthro.anthro_general\n",
1446
+ "✅ Adding magical girl to subject.anthro.anthro_general\n",
1447
+ "✅ Updated JSON saved as `/shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/danbooru_tags_step_02.json`\n"
1448
+ ]
1449
+ }
1450
+ ],
1451
+ "source": [
1452
+ "import json\n",
1453
+ "\n",
1454
+ "# Load the existing JSON file\n",
1455
+ "input_file = current_dir.parent / \"misc/danbooru_donmai/danbooru_tags_step_01.json\"\n",
1456
+ "output_file = current_dir.parent / \"misc/danbooru_donmai/danbooru_tags_step_02.json\"\n",
1457
+ "\n",
1458
+ "with open(input_file, \"r\", encoding=\"utf-8\") as f:\n",
1459
+ " manual_hierarchy = json.load(f)\n",
1460
+ "\n",
1461
+ "# ✅ Define the correct subject structure\n",
1462
+ "subject_structure = {\n",
1463
+ " \"subject\": {\n",
1464
+ " \"female\": {\n",
1465
+ "            \"female_general\": [\"woman\", \"girl\", \"1girl\", \"2girls\", \"3girls\", \"4girls\", \"5girls\", \"6+girls\", \"multiple girls\", \"guitar girl\"]\n",
1466
+ " },\n",
1467
+ " \"male\": {\n",
1468
+ " \"male_general\": [\"man\", \"boy\", \"1boy\", \"2boys\", \"3boys\", \"4boys\", \"5boys\", \"6+boys\", \"multiple boys\", \"guitar boy\"]\n",
1469
+ " },\n",
1470
+ " \"koma\": {\n",
1471
+ " \"koma_general\": [\"1koma\", \"2koma\"]\n",
1472
+ " },\n",
1473
+ " \"anthro\": {\n",
1474
+ "            \"anthro_general\": [\"cat girl\", \"fox girl\", \"dog girl\", \"plant girl\", \"plant boy\", \"cat boy\", \"furry\", \"monster boy\", \"monster girl\", \"demon girl\", \"demon boy\", \"magical boy\", \"magical girl\"]\n",
1475
+ " }\n",
1476
+ " }\n",
1477
+ "}\n",
1478
+ "\n",
1479
+ "# ✅ Ensure \"subject\" exists in the hierarchy\n",
1480
+ "if \"subject\" not in manual_hierarchy:\n",
1481
+ " manual_hierarchy[\"subject\"] = {}\n",
1482
+ "\n",
1483
+ "# ✅ Ensure all subcategories exist\n",
1484
+ "for category, subcategories in subject_structure[\"subject\"].items():\n",
1485
+ " if category not in manual_hierarchy[\"subject\"]:\n",
1486
+ " manual_hierarchy[\"subject\"][category] = {}\n",
1487
+ "\n",
1488
+ " for subcategory, tags in subcategories.items():\n",
1489
+ " if subcategory not in manual_hierarchy[\"subject\"][category]:\n",
1490
+ " manual_hierarchy[\"subject\"][category][subcategory] = []\n",
1491
+ "\n",
1492
+ "# ✅ Move misplaced tags and also ensure missing tags are added\n",
1493
+ "for category, subcategories in subject_structure[\"subject\"].items():\n",
1494
+ " for subcategory, tags in subcategories.items():\n",
1495
+ " target_list = manual_hierarchy[\"subject\"][category][subcategory]\n",
1496
+ "\n",
1497
+ " for tag in tags:\n",
1498
+ " found_in_wrong_place = False # Track if tag was found elsewhere\n",
1499
+ "\n",
1500
+ " # ✅ Search for the tag in other categories and remove if found\n",
1501
+ " for key, value in list(manual_hierarchy.items()): # Use list() to avoid runtime changes\n",
1502
+ " if isinstance(value, list) and tag in value:\n",
1503
+ " print(f\"🔄 Moving {tag} from {key} to subject.{category}.{subcategory}\")\n",
1504
+ " value.remove(tag)\n",
1505
+ " found_in_wrong_place = True\n",
1506
+ "\n",
1507
+ " elif isinstance(value, dict): # Search deeper\n",
1508
+ " for subkey, subvalue in list(value.items()):\n",
1509
+ " if isinstance(subvalue, list) and tag in subvalue:\n",
1510
+ " print(f\"🔄 Moving {tag} from {key}.{subkey} to subject.{category}.{subcategory}\")\n",
1511
+ " subvalue.remove(tag)\n",
1512
+ " found_in_wrong_place = True\n",
1513
+ "\n",
1514
+ " elif isinstance(subvalue, dict):\n",
1515
+ " for deepkey, deepvalue in list(subvalue.items()):\n",
1516
+ " if isinstance(deepvalue, list) and tag in deepvalue:\n",
1517
+ " print(f\"🔄 Moving {tag} from {key}.{subkey}.{deepkey} to subject.{category}.{subcategory}\")\n",
1518
+ " deepvalue.remove(tag)\n",
1519
+ " found_in_wrong_place = True\n",
1520
+ "\n",
1521
+ " # ✅ Add the tag to the correct subject category if missing\n",
1522
+ " if tag not in target_list:\n",
1523
+ " print(f\"✅ Adding {tag} to subject.{category}.{subcategory}\")\n",
1524
+ " target_list.append(tag)\n",
1525
+ "\n",
1526
+ "# ✅ Save the updated JSON\n",
1527
+ "with open(output_file, \"w\", encoding=\"utf-8\") as f:\n",
1528
+ " json.dump(manual_hierarchy, f, indent=4, ensure_ascii=False)\n",
1529
+ "\n",
1530
+ "print(f\"✅ Updated JSON saved as `{output_file}`\")\n",
1531
+ "\n"
1532
+ ]
1533
+ },
1534
+ {
1535
+ "cell_type": "markdown",
1536
+ "id": "aff1a30d-17e7-4e55-b881-b5ce622a533a",
1537
+ "metadata": {},
1538
+ "source": [
1539
+ "### Compare with wd-14 vocabulary"
1540
+ ]
1541
+ },
1542
+ {
1543
+ "cell_type": "code",
1544
+ "execution_count": 16,
1545
+ "id": "2a8f986c-0ccd-4235-ac50-be3d6e495f9f",
1546
+ "metadata": {
1547
+ "execution": {
1548
+ "iopub.execute_input": "2025-03-17T12:49:10.749271Z",
1549
+ "iopub.status.busy": "2025-03-17T12:49:10.748481Z",
1550
+ "iopub.status.idle": "2025-03-17T12:49:11.628872Z",
1551
+ "shell.execute_reply": "2025-03-17T12:49:11.628200Z",
1552
+ "shell.execute_reply.started": "2025-03-17T12:49:10.749248Z"
1553
+ }
1554
+ },
1555
+ "outputs": [
1556
+ {
1557
+ "name": "stdout",
1558
+ "output_type": "stream",
1559
+ "text": [
1560
+ "\n",
1561
+ "📌 Tags in CSV but NOT in JSON:\n",
1562
+ "\n",
1563
+ "✅ Missing tags saved to `missing_tags.txt`\n"
1564
+ ]
1565
+ }
1566
+ ],
1567
+ "source": [
1568
+ "import json\n",
1569
+ "import pandas as pd\n",
1570
+ "\n",
1571
+ "# Load CSV file\n",
1572
+ "csv_file = current_dir.parent / \"misc/autotagging-vocabularies/danbooru.csv\" # Update with the actual CSV file name\n",
1573
+ "df = pd.read_csv(csv_file)\n",
1574
+ "\n",
1575
+ "# Load JSON file\n",
1576
+ "json_file = current_dir.parent / \"misc/danbooru_donmai/danbooru_tags_step_03.json\"\n",
1577
+ "with open(json_file, \"r\", encoding=\"utf-8\") as f:\n",
1578
+ " manual_hierarchy = json.load(f)\n",
1579
+ "\n",
1580
+ "# ✅ Extract all tags from the CSV\n",
1581
+ "csv_tags = set(df[\"name\"].astype(str).str.replace(\"_\", \" \").str.lower()) # Normalize tags\n",
1582
+ "\n",
1583
+ "# ✅ Extract all tags from the JSON recursively\n",
1584
+ "def extract_tags_from_json(data):\n",
1585
+ " tags = set()\n",
1586
+ " if isinstance(data, dict):\n",
1587
+ " for value in data.values():\n",
1588
+ " tags.update(extract_tags_from_json(value))\n",
1589
+ " elif isinstance(data, list):\n",
1590
+ " tags.update(str(tag).replace(\"_\", \" \").lower() for tag in data)\n",
1591
+ " return tags\n",
1592
+ "\n",
1593
+ "json_tags = extract_tags_from_json(manual_hierarchy)\n",
1594
+ "\n",
1595
+ "# ✅ Find tags in CSV but NOT in JSON\n",
1596
+ "missing_tags = csv_tags - json_tags\n",
1597
+ "\n",
1598
+ "# ✅ Print the missing tags\n",
1599
+ "print(\"\\n📌 Tags in CSV but NOT in JSON:\")\n",
1600
+ "for tag in sorted(missing_tags):\n",
1601
+ " print(tag)\n",
1602
+ "\n",
1603
+ "# ✅ Save missing tags to a file for review (optional)\n",
1604
+ "missing_tags_file = \"missing_tags.txt\"\n",
1605
+ "with open(missing_tags_file, \"w\", encoding=\"utf-8\") as f:\n",
1606
+ " f.write(\"\\n\".join(sorted(missing_tags)))\n",
1607
+ "\n",
1608
+ "print(f\"\\n✅ Missing tags saved to `{missing_tags_file}`\")\n"
1609
+ ]
1610
+ },
1611
+ {
1612
+ "cell_type": "markdown",
1613
+ "id": "706317c1-498d-40ab-a29a-ee658e1735e2",
1614
+ "metadata": {
1615
+ "execution": {
1616
+ "iopub.execute_input": "2025-03-17T10:53:52.106543Z",
1617
+ "iopub.status.busy": "2025-03-17T10:53:52.105944Z",
1618
+ "iopub.status.idle": "2025-03-17T10:53:52.108812Z",
1619
+ "shell.execute_reply": "2025-03-17T10:53:52.108451Z",
1620
+ "shell.execute_reply.started": "2025-03-17T10:53:52.106526Z"
1621
+ }
1622
+ },
1623
+ "source": [
1624
+ "### Fetch wikidata for collected tags"
1625
+ ]
1626
+ },
1627
+ {
1628
+ "cell_type": "code",
1629
+ "execution_count": null,
1630
+ "id": "ec696e5b-be39-4407-a04f-6ce6b0b08855",
1631
+ "metadata": {
1632
+ "execution": {
1633
+ "execution_failed": "2025-03-17T12:46:52.921Z",
1634
+ "iopub.execute_input": "2025-03-17T12:45:32.542088Z",
1635
+ "iopub.status.busy": "2025-03-17T12:45:32.541554Z"
1636
+ }
1637
+ },
1638
+ "outputs": [
1639
+ {
1640
+ "name": "stdout",
1641
+ "output_type": "stream",
1642
+ "text": [
1643
+ "✅ Extracted 35559 unique tags.\n",
1644
+ "🚀 Fetching data for amamiya_elena (1/35559)...\n",
1645
+ "✅ Saved amamiya_elena to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/amamiya_elena.json\n",
1646
+ "🚀 Fetching data for gilles_de_rais (2/35559)...\n",
1647
+ "✅ Saved gilles_de_rais to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/gilles_de_rais.json\n",
1648
+ "🚀 Fetching data for blue_scrunchie (3/35559)...\n",
1649
+ "✅ Saved blue_scrunchie to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/blue_scrunchie.json\n",
1650
+ "🚀 Fetching data for playing (4/35559)...\n",
1651
+ "✅ Saved playing to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/playing.json\n",
1652
+ "🚀 Fetching data for album_cover_redraw (5/35559)...\n",
1653
+ "✅ Saved album_cover_redraw to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/album_cover_redraw.json\n",
1654
+ "🚀 Fetching data for >3< (6/35559)...\n",
1655
+ "✅ Saved >3< to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/>3<.json\n",
1656
+ "🚀 Fetching data for damom (7/35559)...\n",
1657
+ "🚀 Fetching data for shoulder_cannon (8/35559)...\n",
1658
+ "✅ Saved shoulder_cannon to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/shoulder_cannon.json\n",
1659
+ "🚀 Fetching data for cover_image (9/35559)...\n",
1660
+ "✅ Saved cover_image to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/cover_image.json\n",
1661
+ "🚀 Fetching data for ultraman_legend (10/35559)...\n",
1662
+ "✅ Saved ultraman_legend to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/ultraman_legend.json\n",
1663
+ "🚀 Fetching data for mastiff (11/35559)...\n",
1664
+ "✅ Saved mastiff to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/mastiff.json\n",
1665
+ "🚀 Fetching data for oingo (12/35559)...\n",
1666
+ "✅ Saved oingo to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/oingo.json\n",
1667
+ "🚀 Fetching data for mario_strikers_(series) (13/35559)...\n",
1668
+ "✅ Saved mario_strikers_(series) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/mario_strikers_(series).json\n",
1669
+ "🚀 Fetching data for phallic_symbol (14/35559)...\n",
1670
+ "✅ Saved phallic_symbol to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/phallic_symbol.json\n",
1671
+ "🚀 Fetching data for 4girls (15/35559)...\n",
1672
+ "✅ Saved 4girls to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/4girls.json\n",
1673
+ "🚀 Fetching data for nice_nature_(racehorse) (16/35559)...\n",
1674
+ "✅ Saved nice_nature_(racehorse) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/nice_nature_(racehorse).json\n",
1675
+ "🚀 Fetching data for hair_beads (17/35559)...\n",
1676
+ "✅ Saved hair_beads to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/hair_beads.json\n",
1677
+ "🚀 Fetching data for fu_po_(azur_lane) (18/35559)...\n",
1678
+ "✅ Saved fu_po_(azur_lane) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/fu_po_(azur_lane).json\n",
1679
+ "🚀 Fetching data for togepi (19/35559)...\n",
1680
+ "✅ Saved togepi to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/togepi.json\n",
1681
+ "🚀 Fetching data for yasopp (20/35559)...\n",
1682
+ "✅ Saved yasopp to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/yasopp.json\n",
1683
+ "🚀 Fetching data for oyafune_suama (21/35559)...\n",
1684
+ "✅ Saved oyafune_suama to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/oyafune_suama.json\n",
1685
+ "🚀 Fetching data for phantasy_star_iii (22/35559)...\n",
1686
+ "✅ Saved phantasy_star_iii to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/phantasy_star_iii.json\n",
1687
+ "🚀 Fetching data for qingdai_guanmao (23/35559)...\n",
1688
+ "✅ Saved qingdai_guanmao to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/qingdai_guanmao.json\n",
1689
+ "🚀 Fetching data for kufei (24/35559)...\n",
1690
+ "✅ Saved kufei to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/kufei.json\n",
1691
+ "🚀 Fetching data for stefan_(atelier) (25/35559)...\n",
1692
+ "✅ Saved stefan_(atelier) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/stefan_(atelier).json\n",
1693
+ "🚀 Fetching data for dille_blood (26/35559)...\n",
1694
+ "✅ Saved dille_blood to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/dille_blood.json\n",
1695
+ "🚀 Fetching data for vivillon_(modern) (27/35559)...\n",
1696
+ "✅ Saved vivillon_(modern) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/vivillon_(modern).json\n",
1697
+ "🚀 Fetching data for sweetie_(ragnarok_online) (28/35559)...\n",
1698
+ "✅ Saved sweetie_(ragnarok_online) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/sweetie_(ragnarok_online).json\n",
1699
+ "🚀 Fetching data for whisking (29/35559)...\n",
1700
+ "✅ Saved whisking to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/whisking.json\n",
1701
+ "🚀 Fetching data for h&k_hk33 (30/35559)...\n",
1702
+ "✅ Saved h&k_hk33 to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/h&k_hk33.json\n",
1703
+ "🚀 Fetching data for winx_club (31/35559)...\n",
1704
+ "✅ Saved winx_club to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/winx_club.json\n",
1705
+ "🚀 Fetching data for anti-tank_grenade (32/35559)...\n",
1706
+ "✅ Saved anti-tank_grenade to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/anti-tank_grenade.json\n",
1707
+ "🚀 Fetching data for devin_booker (33/35559)...\n",
1708
+ "✅ Saved devin_booker to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/devin_booker.json\n",
1709
+ "🚀 Fetching data for scylla_(azur_lane) (34/35559)...\n",
1710
+ "✅ Saved scylla_(azur_lane) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/scylla_(azur_lane).json\n",
1711
+ "🚀 Fetching data for penance_(arknights) (35/35559)...\n",
1712
+ "✅ Saved penance_(arknights) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/penance_(arknights).json\n",
1713
+ "🚀 Fetching data for toba_(oshiro_project) (36/35559)...\n",
1714
+ "✅ Saved toba_(oshiro_project) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/toba_(oshiro_project).json\n",
1715
+ "🚀 Fetching data for scott_adams_(style) (37/35559)...\n",
1716
+ "✅ Saved scott_adams_(style) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/scott_adams_(style).json\n",
1717
+ "🚀 Fetching data for 502nd_joint_fighter_wing (38/35559)...\n",
1718
+ "✅ Saved 502nd_joint_fighter_wing to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/502nd_joint_fighter_wing.json\n",
1719
+ "🚀 Fetching data for komica_wiki (39/35559)...\n",
1720
+ "✅ Saved komica_wiki to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/komica_wiki.json\n",
1721
+ "🚀 Fetching data for final_fantasy_vi (40/35559)...\n",
1722
+ "✅ Saved final_fantasy_vi to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/final_fantasy_vi.json\n",
1723
+ "🚀 Fetching data for h&k_hk45 (41/35559)...\n",
1724
+ "✅ Saved h&k_hk45 to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/h&k_hk45.json\n",
1725
+ "🚀 Fetching data for saint_seiya (42/35559)...\n",
1726
+ "✅ Saved saint_seiya to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/saint_seiya.json\n",
1727
+ "🚀 Fetching data for ike_(fire_emblem) (43/35559)...\n",
1728
+ "✅ Saved ike_(fire_emblem) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/ike_(fire_emblem).json\n",
1729
+ "🚀 Fetching data for cooperative_breast_smother (44/35559)...\n",
1730
+ "✅ Saved cooperative_breast_smother to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/cooperative_breast_smother.json\n",
1731
+ "🚀 Fetching data for pamiat_merkuria_(azur_lane) (45/35559)...\n",
1732
+ "✅ Saved pamiat_merkuria_(azur_lane) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/pamiat_merkuria_(azur_lane).json\n",
1733
+ "🚀 Fetching data for ribbon-trimmed_skirt (46/35559)...\n",
1734
+ "✅ Saved ribbon-trimmed_skirt to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/ribbon-trimmed_skirt.json\n"
1735
+ ]
1736
+ }
1737
+ ],
1738
+ "source": [
1739
+ "import json\n",
1740
+ "import os\n",
1741
+ "import requests\n",
1742
+ "import time\n",
1743
+ "import urllib.parse\n",
1744
+ "\n",
1745
+ "# Load the cleaned hierarchy JSON\n",
1746
+ "json_file_path = current_dir.parent / \"misc/danbooru_donmai/danbooru_tags_step_03.json\"\n",
1747
+ "with open(json_file_path, \"r\", encoding=\"utf-8\") as file:\n",
1748
+ " tag_hierarchy = json.load(file)\n",
1749
+ "\n",
1750
+ "# Authentication (Replace with your credentials)\n",
1751
+ "USERNAME = username\n",
1752
+ "API_KEY = api_key\n",
1753
+ "\n",
1754
+ "# Base API URL\n",
1755
+ "BASE_URL = \"https://danbooru.donmai.us\"\n",
1756
+ "\n",
1757
+ "# Output folder for JSON results\n",
1758
+ "output_folder = current_dir.parent / \"misc/danbooru_donmai/tags_wiki\"\n",
1759
+ "os.makedirs(output_folder, exist_ok=True) # Ensure the directory exists\n",
1760
+ "\n",
1761
+ "# Function to extract all unique tags from nested dictionaries/lists\n",
1762
+ "def extract_tags(data):\n",
1763
+ " tags = set()\n",
1764
+ " if isinstance(data, dict):\n",
1765
+ " for key, value in data.items():\n",
1766
+ " tags.update(extract_tags(value)) # Recursively extract tags\n",
1767
+ " elif isinstance(data, list):\n",
1768
+ " for item in data:\n",
1769
+ " if isinstance(item, str):\n",
1770
+ " formatted_tag = item.lower().replace(\" \", \"_\") # ✅ Convert to lowercase and replace spaces\n",
1771
+ " tags.add(formatted_tag)\n",
1772
+ " else:\n",
1773
+ " tags.update(extract_tags(item)) # Handle nested lists\n",
1774
+ " return tags\n",
1775
+ "\n",
1776
+ "# Extract all unique tags\n",
1777
+ "all_tags = extract_tags(tag_hierarchy)\n",
1778
+ "print(f\"✅ Extracted {len(all_tags)} unique tags.\")\n",
1779
+ "\n",
1780
+ "# API query function\n",
1781
+ "def fetch_tag_data(tag):\n",
1782
+ " \"\"\"Fetch tag details from Danbooru API.\"\"\"\n",
1783
+ " encoded_tag = urllib.parse.quote(tag, safe=\"\")\n",
1784
+ " api_url = f\"{BASE_URL}/tags.json?search[name]={encoded_tag}&only=id,name,category,post_count,is_deprecated,created_at,updated_at,wiki_page,artist,antecedent_alias,consequent_aliases,antecedent_implications,consequent_implications,dtext_links\"\n",
1785
+ "\n",
1786
+ " try:\n",
1787
+ " response = requests.get(api_url, auth=(USERNAME, API_KEY))\n",
1788
+ " response.raise_for_status()\n",
1789
+ " return response.json()\n",
1790
+ " except requests.exceptions.RequestException as e:\n",
1791
+ " print(f\"⚠️ Error fetching data for '{tag}': {e}\")\n",
1792
+ " return None\n",
1793
+ "\n",
1794
+ "# Process each tag\n",
1795
+ "for idx, tag in enumerate(all_tags):\n",
1796
+ " tag_filename = os.path.join(output_folder, f\"{tag}.json\")\n",
1797
+ "\n",
1798
+ " # Skip if the file already exists to avoid redundant API calls\n",
1799
+ " if os.path.exists(tag_filename):\n",
1800
+ " print(f\"🔄 Skipping {tag}, already saved.\")\n",
1801
+ " continue\n",
1802
+ "\n",
1803
+ " print(f\"🚀 Fetching data for {tag} ({idx+1}/{len(all_tags)})...\")\n",
1804
+ "\n",
1805
+ " tag_data = fetch_tag_data(tag)\n",
1806
+ "\n",
1807
+ " if tag_data:\n",
1808
+ " with open(tag_filename, \"w\", encoding=\"utf-8\") as f:\n",
1809
+ " json.dump(tag_data, f, indent=4, ensure_ascii=False)\n",
1810
+ " print(f\"✅ Saved {tag} to {tag_filename}\")\n",
1811
+ "\n",
1812
+ " # Respect API rate limits\n",
1813
+ " time.sleep(1.5) # Adjust delay if necessary\n",
1814
+ "\n",
1815
+ "print(f\"\\n✅ All tags processed and saved in '{output_folder}'.\")\n"
1816
+ ]
1817
+ },
1818
+ {
1819
+ "cell_type": "code",
1820
+ "execution_count": null,
1821
+ "id": "8e364060-36e9-4418-8bdd-6018c9edcc33",
1822
+ "metadata": {},
1823
+ "outputs": [],
1824
+ "source": []
1825
+ }
1826
+ ],
1827
+ "metadata": {
1828
+ "kernelspec": {
1829
+ "display_name": "latm",
1830
+ "language": "python",
1831
+ "name": "python3"
1832
+ },
1833
+ "language_info": {
1834
+ "codemirror_mode": {
1835
+ "name": "ipython",
1836
+ "version": 3
1837
+ },
1838
+ "file_extension": ".py",
1839
+ "mimetype": "text/x-python",
1840
+ "name": "python",
1841
+ "nbconvert_exporter": "python",
1842
+ "pygments_lexer": "ipython3",
1843
+ "version": "3.10.15"
1844
+ }
1845
+ },
1846
+ "nbformat": 4,
1847
+ "nbformat_minor": 5
1848
+ }
jupyter_notebooks/SuppM_Figure_S14_co-occurence_training_data.ipynb ADDED
@@ -0,0 +1,152 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Prepare *.json for Figure S14"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": null,
13
+ "metadata": {},
14
+ "outputs": [],
15
+ "source": [
16
+ "import pandas as pd\n",
17
+ "import json\n",
18
+ "from itertools import combinations\n",
19
+ "from collections import Counter\n",
20
+ "\n",
21
+ "# Load the data with only needed columns\n",
22
+ "csv_file_path = \"all_models_with_tags.csv\" # Update with your actual file path\n",
23
+ "\n",
24
+ "df = pd.read_csv(csv_file_path, low_memory=False)\n",
25
+ "\n",
26
+ "# Identify the starting point of tag columns (after 'civitai_id')\n",
27
+ "civitai_index = df.columns.get_loc(\"civitai_id\") + 1\n",
28
+ "\n",
29
+ "df_tags = df.iloc[:, civitai_index:]\n",
30
+ "\n",
31
+ "# Extract tag columns (tag01 to tag199) and corresponding counts\n",
32
+ "tag_columns = [col for col in df_tags.columns if col.startswith(\"tag\") and col[3:].isdigit() and int(col[3:]) <= 199]\n",
33
+ "tag_no_columns = [col for col in df_tags.columns if col.startswith(\"tag\") and col.endswith(\"_no\")]\n",
34
+ "\n",
35
+ "if not tag_columns:\n",
36
+ " print(\"No tag columns found in the dataset.\")\n",
37
+ " exit()\n",
38
+ "\n",
39
+ "df_tags = df[tag_columns]\n",
40
+ "df_tag_counts = df[tag_no_columns] if tag_no_columns else None\n",
41
+ "\n",
42
+ "# Load tag categories from JSON\n",
43
+ "json_file_path = \"danbooru_tags_step_03.json\" # Update with your actual file path\n",
44
+ "with open(json_file_path, \"r\", encoding=\"utf-8\") as f:\n",
45
+ " tag_categories = json.load(f)\n",
46
+ "\n",
47
+ "# Function to normalize tags (lowercase and replace underscores)\n",
48
+ "def normalize_tag(tag):\n",
49
+ " return tag.lower().replace(\"_\", \" \") if isinstance(tag, str) else tag\n",
50
+ "\n",
51
+ "# Flatten JSON structure to map tags to top-level categories\n",
52
+ "def extract_tag_categories(json_data, current_category=None):\n",
53
+ " tag_mapping = {}\n",
54
+ " for key, value in json_data.items():\n",
55
+ " if isinstance(value, dict):\n",
56
+ " tag_mapping.update(extract_tag_categories(value, key))\n",
57
+ " elif isinstance(value, list):\n",
58
+ " for tag in value:\n",
59
+ " normalized_tag = normalize_tag(tag)\n",
60
+ " if normalized_tag:\n",
61
+ " tag_mapping[normalized_tag] = current_category\n",
62
+ " return tag_mapping\n",
63
+ "\n",
64
+ "# Create mapping of tags to categories\n",
65
+ "tag_category_mapping = extract_tag_categories(tag_categories)\n",
66
+ "\n",
67
+ "# Flatten and count occurrences of individual tags\n",
68
+ "all_tags = []\n",
69
+ "tag_counts = Counter()\n",
70
+ "\n",
71
+ "for i, row in df_tags.iterrows():\n",
72
+ " tags = [normalize_tag(tag) for tag in row if pd.notna(tag)]\n",
73
+ " if df_tag_counts is not None:\n",
74
+ "        counts = df_tag_counts.loc[i].fillna(1).tolist()  # .loc: i is the row label from iterrows\n",
75
+ " else:\n",
76
+ " counts = [1] * len(tags)\n",
77
+ " for tag, count in zip(tags, counts):\n",
78
+ " all_tags.append(tag)\n",
79
+ " tag_counts[tag] += count\n",
80
+ "\n",
81
+ "if not all_tags:\n",
82
+ " print(\"No valid tags found in the dataset.\")\n",
83
+ " exit()\n",
84
+ "print(f\"Extracted {len(all_tags)} total tags.\")\n",
85
+ "\n",
86
+ "# Compute co-occurrence frequencies efficiently\n",
87
+ "co_occurrences = Counter()\n",
88
+ "for i, row in df_tags.iterrows():\n",
89
+ " tags = [normalize_tag(tag) for tag in row if pd.notna(tag)]\n",
90
+ " if len(tags) < 2:\n",
91
+ " continue # Skip rows with fewer than two tags\n",
92
+ " for tag1, tag2 in combinations(tags, 2):\n",
93
+ " co_occurrences[frozenset([tag1, tag2])] += 1\n",
94
+ "\n",
95
+ "if not co_occurrences:\n",
96
+ " print(\"No co-occurrence data found.\")\n",
97
+ " exit()\n",
98
+ "print(f\"Computed {len(co_occurrences)} co-occurrence pairs.\")\n",
99
+ "\n",
100
+ "# Convert co-occurrence counts to a weighted edge list\n",
101
+ "edges = [(tuple(pair)[0], tuple(pair)[1], weight) for pair, weight in co_occurrences.items() if len(pair) == 2]\n",
102
+ "\n",
103
+ "# Create a set of connected tags with a minimum connection threshold\n",
104
+ "min_connections = 5 # Increased threshold to reduce memory usage\n",
105
+ "connected_tags = Counter()\n",
106
+ "for tag1, tag2, _ in edges:\n",
107
+ " connected_tags[tag1] += 1\n",
108
+ " connected_tags[tag2] += 1\n",
109
+ "\n",
110
+ "# Filter nodes to keep only those that meet the minimum connection threshold\n",
111
+ "nodes = [\n",
112
+ " {\"id\": tag, \"size\": count, \"category\": tag_category_mapping.get(tag, \"unknown\")}\n",
113
+ " for tag, count in tag_counts.items()\n",
114
+ " if connected_tags[tag] >= min_connections\n",
115
+ "]\n",
116
+ "\n",
117
+ "if not nodes:\n",
118
+ " print(\"No nodes meet the connection threshold.\")\n",
119
+ " exit()\n",
120
+ "print(f\"Final node count: {len(nodes)}\")\n",
121
+ "\n",
122
+ "links = [\n",
123
+ " {\"source\": tag1, \"target\": tag2, \"value\": weight}\n",
124
+ " for tag1, tag2, weight in edges\n",
125
+ " if connected_tags[tag1] >= min_connections and connected_tags[tag2] >= min_connections\n",
126
+ "]\n",
127
+ "\n",
128
+ "if not links:\n",
129
+ " print(\"No links meet the connection threshold.\")\n",
130
+ " exit()\n",
131
+ "print(f\"Final link count: {len(links)}\")\n",
132
+ "\n",
133
+ "# Prepare JSON output\n",
134
+ "d3_data = {\"nodes\": nodes, \"links\": links}\n",
135
+ "\n",
136
+ "# Save as JSON for visualization\n",
137
+ "output_file = \"co_occurrence_network.json\"\n",
138
+ "with open(output_file, \"w\", encoding=\"utf-8\") as f:\n",
139
+ " json.dump(d3_data, f, indent=4)\n",
140
+ "\n",
141
+ "print(f\"D3.js data saved to {output_file}\")"
142
+ ]
143
+ }
144
+ ],
145
+ "metadata": {
146
+ "language_info": {
147
+ "name": "python"
148
+ }
149
+ },
150
+ "nbformat": 4,
151
+ "nbformat_minor": 2
152
+ }
md/DEEPFAKE_PIPELINE_GUIDE.md ADDED
@@ -0,0 +1,210 @@
1
+ # Deepfake Adapter Dataset Processing - Quick Start Guide
2
+
3
+ ## Overview
4
+
5
+ This pipeline processes the `real_person_adapters.csv` dataset to identify and annotate real people used in deepfake LoRA models using three LLM options: **Qwen**, **Llama**, and **Mistral**.
6
+
7
+ ## Quick Start
8
+
9
+ ### 1. Prerequisites
10
+
11
+ ```bash
12
+ # Install required packages
13
+ pip install pandas numpy emoji requests tqdm spacy
14
+
15
+ # Download spaCy English model (for NER)
16
+ python -m spacy download en_core_web_sm
17
+ ```
18
+
19
+ **Note**: The spaCy model will be automatically downloaded when you run the notebook if not already installed.
20
+
21
+ ### 2. Set Up API Keys
22
+
23
+ Choose at least ONE LLM provider and get an API key:
24
+
25
+ | Provider | Model | Sign Up Link | Est. Cost (10k entries) |
26
+ |----------|-------|--------------|-------------------------|
27
+ | **Qwen** | Qwen-Max | https://dashscope.aliyun.com/ | Varies |
28
+ | **Llama** | Llama-3.1-70B | https://www.together.ai/ | ~$5-10 |
29
+ | **Mistral** | Mistral Large | https://mistral.ai/ | ~$40-80 |
30
+
31
+ Create your API key file in `misc/credentials/`:
32
+
33
+ ```bash
34
+ # For Qwen
35
+ echo "your-api-key-here" > misc/credentials/qwen_api_key.txt
36
+
37
+ # For Llama (via Together AI)
38
+ echo "your-api-key-here" > misc/credentials/together_api_key.txt
39
+
40
+ # For Mistral
41
+ echo "your-api-key-here" > misc/credentials/mistral_api_key.txt
42
+ ```
43
+
44
+ ### 3. Run the Notebook
45
+
46
+ Open `Section_2-3-4_Figure_8_deepfake_adapters.ipynb` and:
47
+
48
+ 1. **Run all cells sequentially** from top to bottom
49
+ 2. The default configuration uses Qwen in test mode (10 samples)
50
+ 3. Review the test results
51
+ 4. To process the full dataset, set the following in the LLM annotation cell:
52
+ ```python
53
+ TEST_MODE = False
54
+ ```
55
+
56
+ ## Pipeline Stages
57
+
58
+ ### Stage 1: NER & Name Cleaning
59
+ - **Input**: `data/CSV/real_person_adapters.csv`
60
+ - **Output**: `data/CSV/NER_POI_step01_pre_annotation.csv`
61
+ - **Function**: Cleans adapter names to extract real person names
62
+ - Removes: emoji, "lora", "v1", special characters
63
+ - Example: "IU LoRA v2 🎤" → "IU"
64
+
65
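+ A minimal sketch of the kind of cleaning Stage 1 performs; the exact patterns live in the notebook, so treat the regexes below as illustrative assumptions:
+ 
+ ```python
+ import re
+ import emoji  # same package as in the prerequisites
+ 
+ def clean_adapter_name(name: str) -> str:
+     """Illustrative Stage 1 cleaning: strip emoji, adapter keywords, and version tags."""
+     name = emoji.replace_emoji(name, replace="")                       # remove emoji
+     name = re.sub(r"\blora\b", "", name, flags=re.IGNORECASE)          # remove "lora"
+     name = re.sub(r"\bv\d+(\.\d+)?\b", "", name, flags=re.IGNORECASE)  # remove "v1", "v2.0", ...
+     name = re.sub(r"[^A-Za-z0-9\s\-']", " ", name)                     # drop special characters
+     return re.sub(r"\s+", " ", name).strip()                           # collapse whitespace
+ 
+ print(clean_adapter_name("IU LoRA v2 🎤"))  # -> "IU"
+ ```
+ 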
+ ### Stage 2: Country/Nationality Mapping
66
+ - **Input**: Step 1 output + `misc/lists/countries.csv`
67
+ - **Output**: `data/CSV/NER_POI_step02_annotated.csv`
68
+ - **Function**: Maps tags to standardized countries
69
+ - Example: "korean" → "South Korea"
70
+ - Excludes uninhabited territories
71
+
72
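+ A minimal sketch of the Stage 2 lookup, assuming `countries.csv` pairs a lowercase nationality ("korean") with a standardized country ("South Korea"); the real column names may differ:
+ 
+ ```python
+ import pandas as pd
+ 
+ # Assumed layout: one row per country with 'nationality' and 'country' columns.
+ countries = pd.read_csv("misc/lists/countries.csv")
+ nationality_to_country = dict(zip(countries["nationality"].str.lower(), countries["country"]))
+ 
+ def map_tags_to_country(tags):
+     """Return the first country whose nationality appears among the tags, else None."""
+     for tag in tags:
+         country = nationality_to_country.get(str(tag).lower())
+         if country:
+             return country
+     return None
+ 
+ print(map_tags_to_country(["korean", "celebrity", "singer"]))  # -> "South Korea"
+ ```
+ 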
+ ### Stage 3: LLM Profession Annotation
73
+ - **Input**: Step 2 output + `misc/lists/professions.csv`
74
+ - **Output**: `data/CSV/{llm}_annotated_POI_test.csv` (test) or `{llm}_annotated_POI.csv` (full)
75
+ - **Function**: Uses LLM to identify:
76
+ - Full name
77
+ - Gender
78
+ - Up to 3 professions (from profession list)
79
+ - Country
80
+ - **Progress**: Automatically saves every 10 rows
81
+ - **Resumable**: Can continue from last saved progress if interrupted
82
+
83
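+ A sketch of the prompt-and-parse pattern Stage 3 relies on; the notebook's actual prompt wording differs, and the field names below are assumptions:
+ 
+ ```python
+ import json
+ 
+ FIELDS = ["full_name", "gender", "professions", "country"]  # assumed field names
+ 
+ def build_prompt(real_name, profession_list):
+     """Ask the LLM for a strict-JSON answer so the reply can be parsed mechanically."""
+     return (
+         f"Who is '{real_name}'? Respond with a JSON object containing the keys "
+         f"{FIELDS}. Choose at most 3 professions from: {', '.join(profession_list)}. "
+         'Use "Unknown" when unsure.'
+     )
+ 
+ def parse_response(raw):
+     """Fall back to 'Unknown' for every field if the reply is not valid JSON."""
+     try:
+         data = json.loads(raw)
+         return {key: data.get(key, "Unknown") for key in FIELDS}
+     except json.JSONDecodeError:
+         return {key: "Unknown" for key in FIELDS}
+ ```
+ 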
+ ## Configuration Options
84
+
85
+ In the LLM annotation cell, you can configure:
86
+
87
+ ```python
88
+ # Choose LLM provider
89
+ SELECTED_LLM = 'qwen' # Options: 'qwen', 'llama', 'mistral'
90
+
91
+ # Test mode (recommended for first run)
92
+ TEST_MODE = True # True = test on small sample
93
+ TEST_SIZE = 10 # Number of rows for testing
94
+
95
+ # Processing limits
96
+ MAX_ROWS = 20000 # Maximum rows to process (None = all)
97
+ SAVE_INTERVAL = 10 # Save progress every N rows
98
+ ```
99
+
100
+ ## Expected Output Format
101
+
102
+ The final dataset will include all original columns plus:
103
+
104
+ | Column | Description | Example |
105
+ |--------|-------------|---------|
106
+ | `real_name` | Cleaned name | "IU" |
107
+ | `full_name` | Full name from LLM | "Lee Ji-eun (IU)" |
108
+ | `gender` | Gender from LLM | "Female" |
109
+ | `profession_llm` | Up to 3 professions | "singer, actor, celebrity" |
110
+ | `country` | Country from LLM | "South Korea" |
111
+ | `likely_country` | Country from tags | "South Korea" |
112
+ | `likely_nationality` | Nationality from tags | "South Korean" |
113
+ | `tags` | Combined tags | "['korean', 'celebrity', 'singer']" |
114
+
115
+ ## Troubleshooting
116
+
117
+ ### API Key Errors
118
+ ```
119
+ Warning: No API key for qwen
120
+ ```
121
+ **Solution**: Ensure your API key file exists and contains only the key (no extra whitespace)
122
+
123
+ ### Rate Limiting
124
+ ```
125
+ Qwen API error (attempt 1/3): 429 Too Many Requests
126
+ ```
127
+ **Solution**: The code automatically retries with exponential backoff. You can also:
128
+ - Increase `time.sleep(0.5)` to a higher value
129
+ - Process in smaller batches
130
+
131
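+ For reference, the retry pattern behind these messages looks roughly like this (a sketch; the notebook's constants and endpoint handling may differ):
+ 
+ ```python
+ import time
+ import requests
+ 
+ def query_with_retries(url, payload, headers, max_attempts=3):
+     """Retry a POST with exponential backoff (1s, then 2s, between attempts)."""
+     for attempt in range(max_attempts):
+         try:
+             response = requests.post(url, json=payload, headers=headers, timeout=60)
+             response.raise_for_status()
+             return response.json()
+         except requests.exceptions.RequestException as e:
+             print(f"API error (attempt {attempt + 1}/{max_attempts}): {e}")
+             if attempt < max_attempts - 1:
+                 time.sleep(2 ** attempt)  # exponential backoff
+     return None  # caller records "Unknown" for this row
+ ```
+ 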
+ ### Progress Lost
132
+ **Solution**: The pipeline saves progress automatically. Check:
133
+ - `data/CSV/{llm}_annotated_POI_test.csv` - your partial results
134
+ - `misc/{llm}_query_index.txt` - last processed index
135
+ - Then just re-run the cell; it will resume from the last saved progress
136
+
137
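+ The save/resume mechanism is assumed to work roughly like this sketch (file names follow the patterns above):
+ 
+ ```python
+ import os
+ import pandas as pd
+ 
+ INDEX_FILE = "misc/qwen_query_index.txt"             # per-LLM progress marker
+ OUTPUT_CSV = "data/CSV/qwen_annotated_POI_test.csv"  # partial results
+ 
+ def load_start_index():
+     """Resume from the last saved row index, or start at 0."""
+     if os.path.exists(INDEX_FILE):
+         with open(INDEX_FILE) as f:
+             return int(f.read().strip())
+     return 0
+ 
+ def save_progress(df: pd.DataFrame, index: int):
+     """Write partial results and remember how far processing got."""
+     df.to_csv(OUTPUT_CSV, index=False)
+     with open(INDEX_FILE, "w") as f:
+         f.write(str(index))
+ ```
+ 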
+ ### JSON Parse Errors from LLM
138
+ ```
139
+ Qwen API error: JSONDecodeError
140
+ ```
141
+ **Solution**: This is usually temporary. The code:
142
+ - Returns "Unknown" for failed queries
143
+ - Continues processing
144
+ - You can manually review/reprocess failed entries later
145
+
146
+ ## Cost Management
147
+
148
+ ### Estimate Costs Before Processing
149
+
150
+ For a dataset with N entries:
151
+ - **Qwen**: Contact Alibaba Cloud for pricing
152
+ - **Llama**: ~N × $0.0005 = ~$5 per 10k entries
153
+ - **Mistral**: ~N × $0.004 = ~$40 per 10k entries
154
+
155
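+ A quick back-of-the-envelope check of these rates:
+ 
+ ```python
+ n_entries = 10_000
+ print(f"Llama:   ~${n_entries * 0.0005:.0f}")  # ~$5
+ print(f"Mistral: ~${n_entries * 0.004:.0f}")   # ~$40
+ ```
+ 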
+ ### Best Practices
156
+
157
+ 1. **Always test first**: Run with `TEST_MODE = True` on 10 samples
158
+ 2. **Monitor API usage**: Check your API provider's dashboard
159
+ 3. **Use cheaper models first**: Try Llama before Mistral
160
+ 4. **Process in batches**: Set `MAX_ROWS` to process incrementally
161
+ 5. **Save intermediate results**: The automatic saving feature helps prevent data loss
162
+
163
+ ## Comparing Multiple LLMs
164
+
165
+ To compare results from different LLMs:
166
+
167
+ 1. Run the pipeline with `SELECTED_LLM = 'qwen'`
168
+ 2. Change to `SELECTED_LLM = 'llama'` and run again
169
+ 3. Change to `SELECTED_LLM = 'mistral'` and run again
170
+ 4. Compare the three output files:
171
+ - `qwen_annotated_POI.csv`
172
+ - `llama_annotated_POI.csv`
173
+ - `mistral_annotated_POI.csv`
174
+
175
+ ## Files Created
176
+
177
+ The pipeline creates these files:
178
+
179
+ ```
180
+ data/CSV/
181
+ ├── NER_POI_step01_pre_annotation.csv # After name cleaning
182
+ ├── NER_POI_step02_annotated.csv # After country mapping
183
+ ├── qwen_annotated_POI_test.csv # Test results (Qwen)
184
+ ├── qwen_annotated_POI.csv # Full results (Qwen)
185
+ ├── llama_annotated_POI.csv # Full results (Llama)
186
+ └── mistral_annotated_POI.csv # Full results (Mistral)
187
+
188
+ misc/
189
+ ├── qwen_query_index.txt # Progress tracking
190
+ ├── llama_query_index.txt # Progress tracking
191
+ └── mistral_query_index.txt # Progress tracking
192
+ ```
193
+
194
+ ## Support
195
+
196
+ For issues or questions:
197
+ 1. Check this guide for common problems
198
+ 2. Review `misc/credentials/README.md` for API setup
199
+ 3. Read the notebook documentation (first cell)
200
+ 4. Check API provider documentation for service-specific issues
201
+
202
+ ## Ethical Considerations
203
+
204
+ This research documents ethical problems with AI deepfake models. The dataset and analysis help:
205
+ - Understand the scope of unauthorized person likeness usage
206
+ - Document professions/demographics most affected
207
+ - Inform policy and technical solutions
208
+ - Raise awareness about deepfake technology misuse
209
+
210
+ Use this data responsibly and respect individual privacy and consent.
md/LLM_MODELS_COMPARISON.md ADDED
@@ -0,0 +1,326 @@
1
+ # LLM Models for Deepfake Annotation
2
+
3
+ ## Overview
4
+
5
+ The pipeline now includes **6 LLM options** in individual cells for easy comparison:
6
+
7
+ 1. **Deepseek** - Testing (use first!)
8
+ 2. **Qwen (API)** - Chinese (Alibaba Cloud)
9
+ 3. **Llama** - American (Meta)
10
+ 4. **Mixtral** - French (Mistral AI)
11
+ 5. **Gemma** - American Open Source (Google)
12
+ 6. **Qwen-2.5-32B Local** - FREE local inference (NEW!)
13
+
14
+ ## The 6 LLMs
15
+
16
+ ### 1. Deepseek (Testing)
17
+ **Cell 10**
18
+
19
+ - **Model**: deepseek-chat
20
+ - **Provider**: DeepSeek
21
+ - **API**: https://platform.deepseek.com/
22
+ - **Cost**: ~$0.14-0.28 per 1M tokens (~$1-2 for 10k entries)
23
+ - **Use case**: **Test this first!** Cheapest option to verify pipeline works
24
+ - **API Key**: `misc/credentials/deepseek_api_key.txt`
25
+
26
+ ---
27
+
28
+ ### 2. Qwen API (Chinese)
29
+ **Cells 11-12**
30
+
31
+ - **Model**: qwen-max (automatically uses Qwen3-Max)
32
+ - **Provider**: Alibaba Cloud DashScope
33
+ - **API**: https://dashscope.aliyun.com/
34
+ - **Cost**: Variable (check Alibaba pricing)
35
+ - **Use case**: Chinese company, strong multilingual support
36
+ - **API Key**: `misc/credentials/qwen_api_key.txt`
37
+ - **Note**: Uses latest Qwen3-Max when you specify `qwen-max`
38
+
39
+ ---
40
+
41
+ ### 6. Qwen-2.5-32B Local (FREE!)
42
+ **Cells 19-20** (NEW!)
43
+
44
+ - **Model**: qwen2.5:32b-instruct
45
+ - **Provider**: Ollama (local inference)
46
+ - **Setup**: https://ollama.com/
47
+ - **Cost**: **$0** (FREE - no API costs!)
48
+ - **Requirements**:
49
+ - A100 80GB GPU (or similar)
50
+ - ~25GB VRAM during inference
51
+ - ~20GB storage for model download
52
+ - Ollama installed
53
+ - **Speed**: 5-10 tokens/sec on A100 (~100-200 samples/hour)
54
+ - **Use case**:
55
+ - ✅ Large datasets (>1000 samples) where cost matters
56
+ - ✅ Privacy-sensitive research data
57
+ - ✅ Offline processing
58
+ - ✅ Strong multilingual support
59
+ - **Setup guide**: See `QWEN_LOCAL_SETUP.md`
60
+
61
+ ---
62
+
63
+ ### 3. Llama (American)
64
+ **Cells 13-14**
65
+
66
+ - **Model**: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
67
+ - **Provider**: Together AI (hosting Meta's model)
68
+ - **Developer**: Meta (American)
69
+ - **API**: https://www.together.ai/
70
+ - **Cost**: ~$0.90 per 1M tokens (~$5-10 for 10k entries)
71
+ - **Use case**: Open-source American model, good quality
72
+ - **API Key**: `misc/credentials/together_api_key.txt`
73
+
74
+ ---
75
+
76
+ ### 4. Mixtral (French)
77
+ **Cells 15-16**
78
+
79
+ - **Model**: open-mixtral-8x22b
80
+ - **Provider**: Mistral AI
81
+ - **Developer**: Mistral AI (French)
82
+ - **API**: https://mistral.ai/
83
+ - **Cost**: ~$2 per 1M tokens (~$10-20 for 10k entries)
84
+ - **Use case**: European alternative, Mixture-of-Experts architecture
85
+ - **API Key**: `misc/credentials/mistral_api_key.txt`
86
+ - **Note**: Using open-mixtral-8x22b (cheaper than mistral-large)
87
+
88
+ ---
89
+
90
+ ### 5. Gemma (American Open Source)
91
+ **Cells 17-18**
92
+
93
+ - **Model**: google/gemma-2-27b-it
94
+ - **Provider**: Together AI (hosting Google's model)
95
+ - **Developer**: Google (American)
96
+ - **API**: https://www.together.ai/ (same as Llama)
97
+ - **Cost**: ~$0.80 per 1M tokens (~$4-8 for 10k entries)
98
+ - **Use case**: American open-source alternative, competitive quality
99
+ - **API Key**: `misc/credentials/together_api_key.txt` (same as Llama)
100
+ - **Note**: Fully open-source, can be self-hosted
101
+
102
+ ---
103
+
104
+ ## Cost Comparison (10,000 entries)
105
+
106
+ | Model | Provider | Cost | Time | Origin |
107
+ |-------|----------|------|------|--------|
108
+ | **Qwen-2.5-32B Local** | Ollama (local) | **$0** | ~50-100 hrs | 🇨🇳 Chinese |
109
+ | **Deepseek** | DeepSeek | ~$1-2 | ~5-10 hrs | 🇨🇳 Chinese |
110
+ | **Gemma 2** | Together AI | ~$4-8 | ~5-10 hrs | 🇺🇸 American (open) |
111
+ | **Llama 3.1** | Together AI | ~$5-10 | ~5-10 hrs | 🇺🇸 American (open) |
112
+ | **Mixtral** | Mistral AI | ~$10-20 | ~5-10 hrs | 🇫🇷 French (open) |
113
+ | **Qwen API** | Alibaba | Variable | ~5-10 hrs | 🇨🇳 Chinese |
114
+
115
+ **Note**: Local inference is FREE but slower. Good for large datasets where cost matters more than time.
116
+
117
+ ## Recommended Testing Order
118
+
119
+ ### 1. Start with Deepseek
120
+ ```python
121
+ # Cell 10
122
+ TEST_MODE = True
123
+ TEST_SIZE = 10
124
+ ```
125
+ - **Why**: Cheapest, verify pipeline works
126
+ - **Cost**: Pennies for 10 samples
127
+
128
+ ### 2. Compare on Small Sample
129
+ Pick 2-3 models and run on same 100 samples:
130
+ ```python
131
+ # In each cell:
132
+ TEST_MODE = True
133
+ TEST_SIZE = 100
134
+ ```
135
+
136
+ **Good combinations:**
137
+ - Budget: Deepseek + Gemma
138
+ - Quality: Llama + Mixtral
139
+ - Geographic: Qwen + Llama + Mixtral
140
+
141
+ ### 3. Production Run
142
+ Choose best model from testing and run full dataset:
143
+ ```python
144
+ TEST_MODE = False
145
+ MAX_ROWS = None # or 20000
146
+ ```
147
+
148
+ ## API Key Setup
149
+
150
+ ### For Deepseek & Qwen (separate keys):
151
+ ```bash
152
+ echo "your-deepseek-key" > misc/credentials/deepseek_api_key.txt
153
+ echo "your-qwen-key" > misc/credentials/qwen_api_key.txt
154
+ ```
155
+
156
+ ### For Llama & Gemma (same Together AI key):
157
+ ```bash
158
+ echo "your-together-key" > misc/credentials/together_api_key.txt
159
+ ```
160
+ Both Llama and Gemma use the same Together AI key!
161
+
162
+ ### For Mixtral:
163
+ ```bash
164
+ echo "your-mistral-key" > misc/credentials/mistral_api_key.txt
165
+ ```
166
+
167
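+ For reference, the notebooks are assumed to read these files along these lines (a sketch, not the exact cell):
+ 
+ ```python
+ from pathlib import Path
+ 
+ def load_api_key(provider):
+     """Read a key file written with the echo commands above, stripping stray whitespace."""
+     key_file = Path("misc/credentials") / f"{provider}_api_key.txt"
+     if not key_file.exists():
+         print(f"Warning: No API key for {provider}")
+         return None
+     return key_file.read_text().strip()
+ 
+ together_key = load_api_key("together")  # shared by Llama and Gemma
+ ```
+ 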
+ ## Output Files
168
+
169
+ Each LLM saves to a separate file:
170
+
171
+ ```
172
+ data/CSV/
173
+ ├── deepseek_annotated_POI_test.csv # Deepseek test
174
+ ├── deepseek_annotated_POI.csv # Deepseek full
175
+ ├── qwen_annotated_POI_test.csv # Qwen API test
176
+ ├── qwen_annotated_POI.csv # Qwen API full
177
+ ├── qwen_local_annotated_POI_test.csv # Qwen Local test (NEW!)
178
+ ├── qwen_local_annotated_POI.csv # Qwen Local full (NEW!)
179
+ ├── llama_annotated_POI_test.csv # Llama test
180
+ ├── llama_annotated_POI.csv # Llama full
181
+ ├── mixtral_annotated_POI_test.csv # Mixtral test
182
+ ├── mixtral_annotated_POI.csv # Mixtral full
183
+ ├── gemma_annotated_POI_test.csv # Gemma test
184
+ └── gemma_annotated_POI.csv # Gemma full
185
+ ```
186
+
187
+ ## Comparing Results
188
+
189
+ After running multiple LLMs, compare results:
190
+
191
+ ```python
192
+ import pandas as pd
193
+
194
+ # Load results from different models
195
+ deepseek_df = pd.read_csv('data/CSV/deepseek_annotated_POI_test.csv')
196
+ qwen_df = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
197
+ qwen_local_df = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv') # NEW!
198
+ llama_df = pd.read_csv('data/CSV/llama_annotated_POI_test.csv')
199
+ mixtral_df = pd.read_csv('data/CSV/mixtral_annotated_POI_test.csv')
200
+ gemma_df = pd.read_csv('data/CSV/gemma_annotated_POI_test.csv')
201
+
202
+ # Compare profession distributions
203
+ print("Deepseek professions:", deepseek_df['profession_llm'].value_counts().head())
204
+ print("Qwen API professions:", qwen_df['profession_llm'].value_counts().head())
205
+ print("Qwen Local professions:", qwen_local_df['profession_llm'].value_counts().head()) # NEW!
206
+ print("Llama professions:", llama_df['profession_llm'].value_counts().head())
207
+ print("Mixtral professions:", mixtral_df['profession_llm'].value_counts().head())
208
+ print("Gemma professions:", gemma_df['profession_llm'].value_counts().head())
209
+
210
+ # Compare specific cases
211
+ print("\nIrene identification:")
212
+ print("Deepseek:", deepseek_df[deepseek_df['real_name'] == 'Irene']['full_name'].values)
213
+ print("Qwen API:", qwen_df[qwen_df['real_name'] == 'Irene']['full_name'].values)
214
+ print("Qwen Local:", qwen_local_df[qwen_local_df['real_name'] == 'Irene']['full_name'].values)
215
+ print("Llama:", llama_df[llama_df['real_name'] == 'Irene']['full_name'].values)
216
+ print("Mixtral:", mixtral_df[mixtral_df['real_name'] == 'Irene']['full_name'].values)
217
+ print("Gemma:", gemma_df[gemma_df['real_name'] == 'Irene']['full_name'].values)
218
+ ```
219
+
220
+ ## Model Characteristics
221
+
222
+ ### Deepseek
223
+ - ✅ Very cheap
224
+ - ✅ Good for testing
225
+ - ⚠️ Less documentation
226
+ - 🇨🇳 Chinese company
227
+
228
+ ### Qwen (Qwen3-Max)
229
+ - ✅ Latest version automatically used
230
+ - ✅ Strong multilingual
231
+ - ✅ Good Asian name recognition
232
+ - 💰 Variable cost
233
+ - 🇨🇳 Chinese company (Alibaba)
234
+
235
+ ### Llama 3.1 70B
236
+ - ✅ Open-source
237
+ - ✅ Strong overall performance
238
+ - ✅ Well-documented
239
+ - ✅ American (Meta)
240
+ - 💰 Mid-range cost
241
+
242
+ ### Mixtral 8x22B
243
+ - ✅ Open-source
244
+ - ✅ MoE architecture (efficient)
245
+ - ✅ European alternative
246
+ - 💰 Mid-range cost
247
+ - 🇫🇷 French company
248
+
249
+ ### Gemma 2 27B
250
+ - ✅ Fully open-source
251
+ - ✅ Can self-host
252
+ - ✅ American (Google)
253
+ - ✅ Cheap via API
254
+ - ✅ Good quality for size
255
+
256
+ ### Qwen-2.5-32B Local (NEW!)
257
+ - ✅ **FREE** - $0 cost (no API fees)
258
+ - ✅ **FAST** - Local inference on A100 (5-10 tokens/sec)
259
+ - ✅ **PRIVATE** - Data never leaves your machine
260
+ - ✅ **OFFLINE** - Works without internet
261
+ - ✅ **HIGH QUALITY** - 32B parameter model
262
+ - ✅ Strong multilingual support
263
+ - ⚠️ Requires: A100 80GB GPU, ~25GB VRAM, Ollama installed
264
+ - 🇨🇳 Chinese company (Alibaba)
265
+ - 📦 Model size: ~20GB download
266
+
267
+ ## Decision Matrix
268
+
269
+ ### If you prioritize...
270
+
271
+ **FREE / Zero Cost**: Use **Qwen-2.5-32B Local** (no API fees!)
272
+
273
+ **Cost** (with API): Use **Deepseek** or **Gemma**
274
+
275
+ **Quality**: Use **Qwen-2.5-32B Local**, **Llama**, or **Mixtral**
276
+
277
+ **Privacy**: Use **Qwen-2.5-32B Local** (data stays on your machine)
278
+
279
+ **American/Open Source**: Use **Gemma** or **Llama**
280
+
281
+ **Asian Names**: Use **Qwen** (API or Local - strong multilingual)
282
+
283
+ **European Provider**: Use **Mixtral**
284
+
285
+ **Testing**: Use **Deepseek** first, always!
286
+
287
+ ## Running Multiple Models
288
+
289
+ You can run all 6 models in sequence:
290
+
291
+ ```python
292
+ # 1. Run Cell 10 (Deepseek) - verify works (~$1-2 for 10k)
293
+ # 2. Run Cell 12 (Qwen API) - Chinese perspective (~variable cost)
294
+ # 3. Run Cell 14 (Llama) - American perspective (~$5-10 for 10k)
295
+ # 4. Run Cell 16 (Mixtral) - European perspective (~$10-20 for 10k)
296
+ # 5. Run Cell 18 (Gemma) - Open source perspective (~$4-8 for 10k)
297
+ # 6. Run Cell 20 (Qwen-2.5-32B Local) - FREE local inference ($0!)
298
+ ```
299
+
300
+ Each saves to its own file, so you can compare results!
301
+
302
+ ## Notes
303
+
304
+ - **Llama and Gemma use the same API key** (Together AI)
305
+ - All models use the **same 9 profession categories**
306
+ - All models have **automatic retries** with exponential backoff (see the sketch below)
307
+ - All models **save progress** every 10 rows
308
+ - All models are **resumable** if interrupted
309
+
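+ A minimal sketch of the shared retry pattern (the helper name is hypothetical; each cell inlines its own version around its API call):
+
+ ```python
+ import time
+
+ def with_retries(call, max_retries=5):
+     """Retry `call` with exponential backoff: wait 1s, 2s, 4s, ... between attempts."""
+     for attempt in range(max_retries):
+         try:
+             return call()
+         except Exception:
+             if attempt == max_retries - 1:
+                 raise  # give up after the final attempt
+             time.sleep(2 ** attempt)
+
+ # usage: with_retries(lambda: <whichever API call the cell makes>)
+ ```
+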
310
+ ## Summary
311
+
312
+ You now have **6 LLM options** to choose from:
313
+
314
+ 1. 🧪 **Deepseek** - Test first (cheapest API)
315
+ 2. 🇨🇳 **Qwen3-Max API** - Chinese, strong multilingual
316
+ 3. 🇺🇸 **Llama 3.1 70B** - American, open-source
317
+ 4. 🇫🇷 **Mixtral 8x22B** - French, open-source MoE
318
+ 5. 🇺🇸 **Gemma 2 27B** - American open-source (Google)
319
+ 6. 💰 **Qwen-2.5-32B Local** - FREE local inference (NEW!)
320
+
321
+ Each in its own cell, easy to run and compare! 🎉
322
+
323
+ **Recommended workflow**:
324
+ 1. Test with Deepseek (Cell 10) - verify pipeline works
325
+ 2. For small datasets (<1000): Use API (Deepseek/Gemma/Llama)
326
+ 3. For large datasets (>1000): Use Qwen-2.5-32B Local (Cell 20) - FREE!
md/QUICK_START_LOCAL.md ADDED
@@ -0,0 +1,171 @@
1
+ # Quick Start: Running Qwen-2.5-32B Locally
2
+
3
+ This is a quick guide to get you started with FREE local LLM inference using your A100 GPU.
4
+
5
+ ## Why Local?
6
+
7
+ ✅ **$0 cost** - No API fees
8
+ ✅ **Privacy** - Data stays on your machine
9
+ ✅ **Quality** - 32B parameter model with strong performance
10
+
11
+ ## Setup (One-time)
12
+
13
+ ### 1. Pull the Model (~10-30 minutes)
14
+
15
+ ```bash
16
+ # Pull Qwen-2.5-32B-Instruct
17
+ ollama pull qwen2.5:32b-instruct
18
+
19
+ # Wait for download to complete (~20GB)
20
+ # Model will be cached at: ~/.ollama/models/
21
+ ```
22
+
23
+ ### 2. Verify Model is Ready
24
+
25
+ ```bash
26
+ # List installed models
27
+ ollama list
28
+
29
+ # Should show: qwen2.5:32b-instruct
30
+
31
+ # Test it
32
+ ollama run qwen2.5:32b-instruct "Hello, who are you?"
33
+ ```
34
+
35
+ If you see a response, you're ready! ✅
36
+
37
+ ## Running the Notebook
38
+
39
+ ### Open the Notebook
40
+
41
+ ```bash
42
+ cd jupyter_notebooks
43
+ jupyter notebook Section_2-3-4_Figure_8_deepfake_adapters.ipynb
44
+ ```
45
+
46
+ ### Run the Cells
47
+
48
+ 1. **Cell 5**: NER & Name Cleaning (processes names)
49
+ 2. **Cell 7**: Country/Nationality Mapping
50
+ 3. **Cell 20**: Qwen-2.5-32B Local Annotation 👈 **This is the new one!**
51
+
52
+ ### Configure Cell 20
53
+
54
+ ```python
55
+ # Start with test mode
56
+ TEST_MODE = True
57
+ TEST_SIZE = 10
58
+
59
+ # Then run full dataset
60
+ TEST_MODE = False
61
+ MAX_ROWS = 20000 # or None for all
62
+ ```
63
+
64
+ ### Run Cell 20
65
+
66
+ Just click "Run" or press Shift+Enter. The cell will (a minimal version of these checks is sketched after this list):
67
+ 1. Check if Ollama is installed ✅
68
+ 2. Check if model is available ✅
69
+ 3. Start annotating
70
+ 4. Save progress every 10 rows
71
+ 5. Show completion stats
72
+
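+ A minimal sketch of those checks against Ollama's REST API (`/api/tags` and `/api/generate` are standard Ollama endpoints; the cell's actual code may differ in detail):
+
+ ```python
+ import requests
+
+ OLLAMA_HOST = "http://localhost:11434"
+ MODEL_NAME = "qwen2.5:32b-instruct"
+
+ # 1-2. Server up and model pulled? GET /api/tags lists local models.
+ models = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=5).json()["models"]
+ assert any(MODEL_NAME in m["name"] for m in models), "run: ollama pull qwen2.5:32b-instruct"
+
+ # 3. One non-streaming generation request.
+ reply = requests.post(
+     f"{OLLAMA_HOST}/api/generate",
+     json={"model": MODEL_NAME, "prompt": "Hello, who are you?", "stream": False},
+     timeout=600,
+ ).json()
+ print(reply["response"])
+ ```
+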
73
+ ### Monitor Progress
74
+
75
+ ```
76
+ Qwen Local: 100%|██████████| 10/10 [02:30<00:00, 15.0s/it]
77
+ ✅ Saved after 10 rows (~240.0 samples/hour)
78
+
79
+ ✅ Done! Results: data/CSV/qwen_local_annotated_POI_test.csv
80
+ Total time: 2.5 minutes
81
+ Average speed: 240.0 samples/hour
82
+ ```
83
+
84
+ ## Performance
85
+
86
+ On your A100 80GB:
87
+ - **Speed**: ~5-10 tokens/second
88
+ - **Throughput**: ~100-200 samples/hour
89
+ - **Memory**: ~22-25GB VRAM
90
+ - **Cost**: $0
91
+
92
+ ### Time Estimates
93
+
94
+ | Dataset Size | Time |
95
+ |-------------|------|
96
+ | 10 samples (test) | ~2-3 minutes |
97
+ | 100 samples | ~20-30 minutes |
98
+ | 1,000 samples | ~5-10 hours |
99
+ | 10,000 samples | ~50-100 hours |
100
+
101
+ **Tip**: Run overnight or over the weekend for large datasets!
102
+
103
+ ## Troubleshooting
104
+
105
+ ### "Model not found"
106
+
107
+ ```bash
108
+ ollama pull qwen2.5:32b-instruct
109
+ ```
110
+
111
+ ### "Ollama not running"
112
+
113
+ ```bash
114
+ ollama serve
115
+ ```
116
+
117
+ ### Out of Memory
118
+
119
+ Your A100 has 80GB VRAM - this should NOT happen with the 32B model (~25GB VRAM).
120
+
121
+ If it does, try the quantized version:
122
+ ```bash
123
+ ollama pull qwen2.5:32b-instruct-q4_0 # Only ~12GB VRAM
124
+ ```
125
+
126
+ ## Output
127
+
128
+ Results saved to:
129
+ - Test: `data/CSV/qwen_local_annotated_POI_test.csv`
130
+ - Full: `data/CSV/qwen_local_annotated_POI.csv`
131
+
132
+ Same format as API results - easy to compare!
133
+
134
+ ## Custom Model Cache Location
135
+
136
+ To store models in `data/models/`:
137
+
138
+ ```bash
139
+ export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
140
+ ollama pull qwen2.5:32b-instruct
141
+ ```
142
+
143
+ ## Comparing API vs Local
144
+
145
+ After running both:
146
+
147
+ ```python
148
+ import pandas as pd
149
+
150
+ qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
151
+ qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')
152
+
153
+ # Check agreement (assumes both CSVs hold the same rows in the same order)
154
+ agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
155
+ print(f"Agreement: {agreement*100:.1f}%")
156
+ ```
157
+
158
+ ## Full Documentation
159
+
160
+ For more details, see:
161
+ - `QWEN_LOCAL_SETUP.md` - Complete setup guide
162
+ - `LLM_MODELS_COMPARISON.md` - All 6 LLM options compared
163
+
164
+ ## Summary
165
+
166
+ ✅ Ollama already installed
167
+ ✅ A100 80GB GPU - perfect for Qwen-2.5-32B
168
+ ✅ FREE inference - no API costs
169
+ ✅ Privacy - data stays local
170
+
171
+ **Next step**: Run Cell 20 in the notebook! 🚀
md/QWEN_LOCAL_SETUP.md ADDED
@@ -0,0 +1,321 @@
1
+ # Running Qwen-2.5-32B Locally with Ollama
2
+
3
+ This guide explains how to run Qwen-2.5-32B-Instruct locally on your A100 GPU using Ollama.
4
+
5
+ ## Why Run Locally?
6
+
7
+ ✅ **FREE** - No API costs ($0 per query)
8
+ ✅ **FAST** - Local inference on A100 (5-10 tokens/sec)
9
+ ✅ **PRIVATE** - Data never leaves your machine
10
+ ✅ **OFFLINE** - Works without internet (after model download)
11
+ ✅ **HIGH QUALITY** - 32B parameter model with strong multilingual support
12
+
13
+ ## System Requirements
14
+
15
+ ### Minimum Specs
16
+ - **GPU**: NVIDIA GPU with at least ~25GB free VRAM (an A100 80GB is more than enough)
17
+ - **VRAM**: 22-25GB during inference
18
+ - **RAM**: 32GB system RAM (you have 265GB - more than enough!)
19
+ - **Storage**: ~20GB for model download
20
+ - **OS**: Linux (you're on Ubuntu)
21
+
22
+ ### Your Setup
23
+ ✅ NVIDIA A100 80GB - Perfect for Qwen-2.5-32B
24
+ ✅ 265GB RAM - Excellent
25
+ ✅ Linux (Ubuntu) - Supported
26
+ ✅ Ollama already installed at `/usr/local/bin/ollama`
27
+
28
+ ## Installation Steps
29
+
30
+ ### 1. Verify Ollama Installation
31
+
32
+ ```bash
33
+ # Check if Ollama is installed
34
+ which ollama
35
+ # Should output: /usr/local/bin/ollama
36
+
37
+ # Check Ollama version
38
+ ollama --version
39
+ ```
40
+
41
+ If not installed, install with:
42
+ ```bash
43
+ curl -fsSL https://ollama.com/install.sh | sh
44
+ ```
45
+
46
+ ### 2. Pull Qwen-2.5-32B-Instruct Model
47
+
48
+ ```bash
49
+ # This will download ~20GB
50
+ ollama pull qwen2.5:32b-instruct
51
+
52
+ # Alternative: Use the base model (not instruct-tuned)
53
+ # ollama pull qwen2.5:32b
54
+ ```
55
+
56
+ **Download time**: ~10-30 minutes depending on your internet speed.
57
+
58
+ **Model cache location**: By default, models are cached at:
59
+ - Linux: `~/.ollama/models/`
60
+
61
+ To use custom cache location (e.g., `data/models/`):
62
+ ```bash
63
+ export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
64
+ ollama pull qwen2.5:32b-instruct
65
+ ```
66
+
67
+ ### 3. Verify Model is Ready
68
+
69
+ ```bash
70
+ # List all installed models
71
+ ollama list
72
+
73
+ # Test the model
74
+ ollama run qwen2.5:32b-instruct "Hello, who are you?"
75
+ ```
76
+
77
+ You should see a response from Qwen!
78
+
79
+ ### 4. Start Ollama Server (if needed)
80
+
81
+ Ollama runs as a background service by default. If you need to start it manually:
82
+
83
+ ```bash
84
+ # Start Ollama server
85
+ ollama serve
86
+
87
+ # Or run in background
88
+ nohup ollama serve > /dev/null 2>&1 &
89
+ ```
90
+
91
+ ## Using Qwen-2.5-32B in the Notebook
92
+
93
+ ### Cell 20: Qwen-2.5-32B Local Annotation
94
+
95
+ The notebook cell handles everything automatically (the sketch after this list shows the outer loop in miniature):
96
+
97
+ 1. **Checks Ollama installation**
98
+ 2. **Verifies model availability**
99
+ 3. **Runs inference locally**
100
+ 4. **Saves progress every 10 rows**
101
+
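+ A hedged sketch of that outer loop (`query_ollama`, the index file name, and the stub DataFrame are stand-ins for the notebook's actual names):
+
+ ```python
+ import os
+ import pandas as pd
+
+ def query_ollama(name: str) -> str:
+     """Stub for the cell's actual Ollama call."""
+     return "Singer"
+
+ INDEX_FILE = "qwen_local_query_index.txt"       # hypothetical progress file
+ OUT_FILE = "qwen_local_annotated_POI_test.csv"
+ SAVE_INTERVAL = 10
+
+ df = pd.DataFrame({"clean_name": ["Irene", "IU", "Emma Watson"]})
+ start = int(open(INDEX_FILE).read()) if os.path.exists(INDEX_FILE) else 0
+
+ for i, name in enumerate(df["clean_name"]):
+     if i < start:
+         continue                                # resume: skip finished rows
+     df.loc[i, "profession_llm"] = query_ollama(name)
+     if (i + 1) % SAVE_INTERVAL == 0 or i == len(df) - 1:
+         df.to_csv(OUT_FILE, index=False)        # checkpoint
+         with open(INDEX_FILE, "w") as f:
+             f.write(str(i + 1))
+ ```
+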
102
+ ### Configuration
103
+
104
+ ```python
105
+ # In Cell 20
106
+ TEST_MODE = True # Start with small test
107
+ TEST_SIZE = 10 # Test on 10 samples first
108
+ MAX_ROWS = 20000 # Full dataset size
109
+ SAVE_INTERVAL = 10 # Save every 10 rows
110
+
111
+ MODEL_NAME = "qwen2.5:32b-instruct" # Model to use
112
+ OLLAMA_HOST = "http://localhost:11434" # Default Ollama port
113
+ ```
114
+
115
+ ### Running the Pipeline
116
+
117
+ 1. **Test run first** (recommended):
118
+ ```python
119
+ TEST_MODE = True
120
+ TEST_SIZE = 10
121
+ ```
122
+    Run Cell 20 to test on 10 samples (~2-3 minutes)
123
+
124
+ 2. **Check results**:
125
+ ```python
126
+ # Output saved to:
127
+ data/CSV/qwen_local_annotated_POI_test.csv
128
+ ```
129
+
130
+ 3. **Full run**:
131
+ ```python
132
+ TEST_MODE = False
133
+ MAX_ROWS = 20000 # or None for all rows
134
+ ```
135
+    Run Cell 20 for the full dataset (~50-100 hours for 10k samples on the A100 - see Performance Expectations below)
136
+
137
+ ### Performance Expectations
138
+
139
+ On NVIDIA A100 80GB:
140
+ - **Speed**: 5-10 tokens/second
141
+ - **Throughput**: 100-200 samples/hour (depends on prompt length)
142
+ - **Memory**: ~22-25GB VRAM during inference
143
+ - **Time for 10k samples**: ~50-100 hours (can run overnight/over weekend; the arithmetic is sketched below)
144
+
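+ The estimate is plain throughput arithmetic:
+
+ ```python
+ samples = 10_000
+ for rate in (100, 200):        # observed samples/hour on the A100
+     print(f"{rate}/hr -> {samples / rate:.0f} hours")   # 100 and 50 hours
+ ```
+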
145
+ ### Monitoring
146
+
147
+ The cell shows progress updates:
148
+ ```
149
+ Qwen Local: 100%|██████████| 10/10 [02:30<00:00, 15.0s/it]
150
+ ✅ Saved after 10 rows (~240.0 samples/hour)
151
+
152
+ ✅ Done! Results: data/CSV/qwen_local_annotated_POI_test.csv
153
+ Total time: 2.5 minutes
154
+ Average speed: 240.0 samples/hour
155
+ ```
156
+
157
+ ## Troubleshooting
158
+
159
+ ### Model Not Found
160
+
161
+ ```bash
162
+ # Check if model is installed
163
+ ollama list
164
+
165
+ # If not listed, pull it
166
+ ollama pull qwen2.5:32b-instruct
167
+ ```
168
+
169
+ ### Ollama Server Not Running
170
+
171
+ ```bash
172
+ # Check if Ollama is running
173
+ ps aux | grep ollama
174
+
175
+ # If not running, start it
176
+ ollama serve
177
+ ```
178
+
179
+ ### GPU Not Detected
180
+
181
+ ```bash
182
+ # Check NVIDIA GPU
183
+ nvidia-smi
184
+
185
+ # Check CUDA
186
+ nvcc --version
187
+
188
+ # Ollama should automatically detect GPU
189
+ # If not, check Ollama logs
190
+ journalctl -u ollama
191
+ ```
192
+
193
+ ### Out of Memory (OOM)
194
+
195
+ If you get OOM errors:
196
+
197
+ 1. **Check VRAM usage**:
198
+ ```bash
199
+ watch -n 1 nvidia-smi
200
+ ```
201
+
202
+ 2. **Try smaller batch size** (not applicable here - we process 1 at a time)
203
+
204
+ 3. **Try quantized version** (smaller model):
205
+ ```bash
206
+ # 4-bit quantized version (~12GB VRAM)
207
+ ollama pull qwen2.5:32b-instruct-q4_0
208
+
209
+ # Update MODEL_NAME in notebook
210
+ MODEL_NAME = "qwen2.5:32b-instruct-q4_0"
211
+ ```
212
+
213
+ ### Slow Inference
214
+
215
+ If inference is very slow (<1 token/sec):
216
+
217
+ 1. **Check GPU utilization**:
218
+ ```bash
219
+ nvidia-smi
220
+ ```
221
+ GPU should show ~90%+ utilization during inference
222
+
223
+ 2. **Check CPU vs GPU**:
224
+    Ollama might be running on CPU instead of GPU. One common fix is to restart the server with the GPU explicitly visible:
225
+ ```bash
226
+    # Make the GPU explicitly visible to Ollama (CUDA_VISIBLE_DEVICES is the standard CUDA env var)
227
+    CUDA_VISIBLE_DEVICES=0 ollama serve
228
+ ```
229
+
230
+ ## Model Variants
231
+
232
+ Ollama provides several Qwen-2.5 variants:
233
+
234
+ | Model | Size | VRAM | Speed | Quality |
235
+ |-------|------|------|-------|---------|
236
+ | `qwen2.5:32b-instruct` | 32B | ~25GB | Medium | Best |
237
+ | `qwen2.5:32b-instruct-q4_0` | 32B (4-bit) | ~12GB | Fast | Good |
238
+ | `qwen2.5:14b-instruct` | 14B | ~10GB | Fast | Good |
239
+ | `qwen2.5:7b-instruct` | 7B | ~5GB | Very Fast | OK |
240
+
241
+ For your A100 80GB, **`qwen2.5:32b-instruct`** is recommended (best quality, no VRAM issues).
242
+
243
+ ## Custom Model Cache Location
244
+
245
+ To store models in `data/models/` directory:
246
+
247
+ ```bash
248
+ # Set environment variable
249
+ export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
250
+
251
+ # Add to ~/.bashrc for persistence
252
+ echo 'export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"' >> ~/.bashrc
253
+
254
+ # Pull model (will download to data/models/)
255
+ ollama pull qwen2.5:32b-instruct
256
+
257
+ # Verify
258
+ ls -lh $OLLAMA_MODELS/
259
+ ```
260
+
261
+ ## Comparing Results
262
+
263
+ After running both API and local versions, compare results:
264
+
265
+ ```python
266
+ import pandas as pd
267
+
268
+ # Load results
269
+ qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
270
+ qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')
271
+
272
+ # Compare professions
273
+ print("API professions:", qwen_api['profession_llm'].value_counts().head())
274
+ print("Local professions:", qwen_local['profession_llm'].value_counts().head())
275
+
276
+ # Check agreement (assumes both CSVs hold the same rows in the same order)
277
+ agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
278
+ print(f"Agreement rate: {agreement*100:.1f}%")
279
+ ```
280
+
281
+ ## Cost Comparison (10,000 samples)
282
+
283
+ | Method | Cost | Time | Privacy |
284
+ |--------|------|------|---------|
285
+ | **Qwen Local (A100)** | **$0** | ~50-100 hours | ✅ Full |
286
+ | Qwen API (Alibaba) | ~$10-20 | ~5-10 hours | ⚠️ Data sent to Alibaba |
287
+ | Llama API (Together) | ~$5-10 | ~5-10 hours | ⚠️ Data sent to Together AI |
288
+ | Deepseek API | ~$1-2 | ~5-10 hours | ⚠️ Data sent to Deepseek |
289
+
290
+ **Recommendation**:
291
+ - For **small tests** (<100 samples): Use API (faster)
292
+ - For **large datasets** (>1000 samples): Use local (free, private)
293
+ - For **research papers**: Use local to avoid data privacy concerns
294
+
295
+ ## Advanced: Parallel Processing
296
+
297
+ For faster processing on multi-GPU setup:
298
+
299
+ ```python
300
+ # Not implemented yet, but possible with:
301
+ # - Multiple Ollama instances on different GPUs
302
+ # - Ray or Dask for parallel processing
303
+ # - ~4x speedup with 4 GPUs
304
+ ```
305
+
306
+ ## Summary
307
+
308
+ ✅ **Ollama** already installed
309
+ ✅ **A100 80GB** GPU - perfect for Qwen-2.5-32B
310
+ ✅ **Free inference** - no API costs
311
+ ✅ **Privacy** - data stays local
312
+
313
+ **Next steps:**
314
+ 1. Pull model: `ollama pull qwen2.5:32b-instruct`
315
+ 2. Test with Cell 20: `TEST_MODE = True`, `TEST_SIZE = 10`
316
+ 3. Run full dataset: `TEST_MODE = False`
317
+
318
+ **Estimated time for 10,000 samples**: ~50-100 hours
319
+ **Cost**: $0
320
+
321
+ Good luck! 🚀
md/SPACY_NER_EXPLANATION.md ADDED
@@ -0,0 +1,316 @@
1
+ # spaCy NER Implementation
2
+
3
+ ## Why spaCy for NER?
4
+
5
+ Using **spaCy's Named Entity Recognition (NER)** is significantly better than regex-based cleaning because:
6
+
7
+ 1. **Intelligent entity extraction**: Recognizes PERSON entities using machine learning
8
+ 2. **Context-aware**: Understands sentence structure and context
9
+ 3. **Robust**: Handles various name formats (first, last, full, stage names)
10
+ 4. **Language support**: Works with multiple languages and scripts
11
+ 5. **Industry standard**: Used in production NLP applications
12
+
13
+ ## How It Works
14
+
15
+ ### Pipeline Overview
16
+
17
+ ```
18
+ Original Name
19
+
20
+ 1. Translate Leetspeak (4→a, 3→e, 1→i)
21
+
22
+ 2. Remove Noise (emoji, LoRA terms, versions)
23
+
24
+ 3. spaCy NER - Extract PERSON entities
25
+
26
+ 4. Fallback to capitalized words if needed
27
+
28
+ Cleaned Name
29
+ ```
30
+
31
+ ### Detailed Steps
32
+
33
+ #### Step 1: Leetspeak Translation
34
+ ```python
35
+ "4kira LoRA v2" → "akira LoRA v2"
36
+ "1rene Model" → "irene Model"
37
+ "3mma Watson" → "emma Watson"
38
+ ```
39
+
40
+ #### Step 2: Noise Removal
41
+ ```python
42
+ "akira LoRA v2" → "akira"
43
+ "irene Model" → "irene"
44
+ "emma Watson" → "emma Watson"
45
+ ```
46
+
47
+ #### Step 3: spaCy NER
48
+ ```python
49
+ nlp("akira")
50
+ # Entities: [("akira", PERSON)]
51
+ # Result: "akira"
52
+
53
+ nlp("emma Watson")
54
+ # Entities: [("emma Watson", PERSON)]
55
+ # Result: "emma Watson"
56
+ ```
57
+
58
+ #### Step 4: Fallback
59
+ If spaCy doesn't find a PERSON entity:
60
+ - Extract capitalized words (likely names)
61
+ - Or return cleaned text as-is
62
+
63
+ ## Examples
64
+
65
+ ### Case 1: Simple Name
66
+ ```
67
+ Input: "IU"
68
+ Output: "IU"
69
+
70
+ Process:
71
+ - Preprocess: "IU" (no noise)
72
+ - spaCy NER: Recognizes "IU" as PERSON
73
+ - Result: "IU"
74
+ ```
75
+
76
+ ### Case 2: Name with LoRA Terms
77
+ ```
78
+ Input: "Scarlett Johansson「LoRa」"
79
+ Output: "Scarlett Johansson"
80
+
81
+ Process:
82
+ - Preprocess: "Scarlett Johansson" (removed 「LoRa」)
83
+ - spaCy NER: Recognizes "Scarlett Johansson" as PERSON
84
+ - Result: "Scarlett Johansson"
85
+ ```
86
+
87
+ ### Case 3: Leetspeak Name
88
+ ```
89
+ Input: "4kira Anime Character v1"
90
+ Output: "akira"
91
+
92
+ Process:
93
+ - Leetspeak: "akira Anime Character v1"
94
+ - Preprocess: "akira Anime Character"
95
+ - spaCy NER: Recognizes "akira" as PERSON
96
+ - Result: "akira"
97
+ ```
98
+
99
+ ### Case 4: Complex Format
100
+ ```
101
+ Input: "Gakki | Aragaki Yui | 新垣結衣"
102
+ Output: "Gakki"
103
+
104
+ Process:
105
+ - Preprocess: "Gakki" (kept first part before |)
106
+ - spaCy NER: Recognizes "Gakki" as PERSON
107
+ - Result: "Gakki"
108
+ ```
109
+
110
+ ### Case 5: With Metadata
111
+ ```
112
+ Input: "Emma Watson (JG) v3.5"
113
+ Output: "Emma Watson"
114
+
115
+ Process:
116
+ - Preprocess: "Emma Watson" (removed (JG) and v3.5)
117
+ - spaCy NER: Recognizes "Emma Watson" as PERSON
118
+ - Result: "Emma Watson"
119
+ ```
120
+
121
+ ## Advantages Over Regex-Only
122
+
123
+ ### Old Approach (Regex Only)
124
+ ```python
125
+ # Just remove noise and hope for the best
126
+ name = remove_noise(name)
127
+ name = name.strip()
128
+ # Result: May include non-name words
129
+ ```
130
+
131
+ Problems:
132
+ - Can't distinguish names from other capitalized words
133
+ - May include words like "Model", "Anime", "Character"
134
+ - No context awareness
135
+ - Language-dependent regex patterns needed
136
+
137
+ ### New Approach (spaCy NER)
138
+ ```python
139
+ # Intelligent entity extraction
140
+ preprocessed = remove_noise(name)
141
+ doc = nlp(preprocessed)
142
+ person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
143
+ # Result: Only actual person names
144
+ ```
145
+
146
+ Benefits:
147
+ - ✅ Identifies actual person entities
148
+ - ✅ Ignores non-person words
149
+ - ✅ Context-aware (understands "Emma Watson" is one entity)
150
+ - ✅ Multi-language support
151
+ - ✅ Handles various name formats
152
+
153
+ ## Comparison Examples
154
+
155
+ | Input | Regex Only | spaCy NER |
156
+ |-------|------------|-----------|
157
+ | `"Emma Watson Model"` | `"Emma Watson Model"` ❌ | `"Emma Watson"` ✅ |
158
+ | `"Anime Character Levi"` | `"Anime Character Levi"` ❌ | `"Levi"` ✅ |
159
+ | `"Taylor Swift v2"` | `"Taylor Swift"` ✅ | `"Taylor Swift"` ✅ |
160
+ | `"K4te Middleton"` | `"K4te Middleton"` ❌ | `"Kate Middleton"` ✅ |
161
+ | `"Celebrity IU"` | `"Celebrity IU"` ❌ | `"IU"` ✅ |
162
+
163
+ ## spaCy Model Information
164
+
165
+ ### Model Used
166
+ - **Name**: `en_core_web_sm`
167
+ - **Language**: English (but works reasonably with romanized names)
168
+ - **Size**: ~13 MB
169
+ - **Entities**: Recognizes PERSON, ORG, GPE, etc.
170
+
171
+ ### Installation
172
+ ```bash
173
+ # Install spaCy
174
+ pip install spacy
175
+
176
+ # Download model
177
+ python -m spacy download en_core_web_sm
178
+ ```
179
+
180
+ The notebook automatically downloads the model if not found.
181
+
182
+ ### Performance
183
+ - **Speed**: ~1000-5000 docs/second
184
+ - **Accuracy**: High for common names
185
+ - **Memory**: Low (~100MB loaded)
186
+
187
+ ## Fallback Strategy
188
+
189
+ If spaCy doesn't recognize a PERSON entity:
190
+
191
+ 1. **Extract capitalized words**:
192
+ ```python
193
+ "unknown name here" → ["unknown"]
194
+ ```
195
+
196
+ 2. **Return first few capitalized words**:
197
+ ```python
198
+ "Celebrity Model Actor" → "Celebrity Model Actor"
199
+ ```
200
+
201
+ 3. **Last resort**: Return cleaned text as-is
202
+
203
+ This ensures we always get something, even for:
204
+ - Uncommon/rare names
205
+ - Nicknames
206
+ - Non-English names
207
+ - Stage names
208
+
209
+ ## Testing
210
+
211
+ ### How to Verify spaCy is Working
212
+
213
+ Run Cell 5 and check the output:
214
+
215
+ ```
216
+ ✅ spaCy model loaded: en_core_web_sm
217
+
218
+ 📊 Name cleaning examples (with spaCy NER):
219
+ ===================================================================================================
220
+ Original Name | Cleaned Name
221
+ ===================================================================================================
222
+ Scarlett Johansson「LoRa」 | Scarlett Johansson
223
+ Emma Watson (JG) | Emma Watson
224
+ IU | IU
225
+ Belle Delphine | Belle Delphine
226
+ ...
227
+ ```
228
+
229
+ ### Key Indicators
230
+
231
+ ✅ **Good signs**:
232
+ - Person names cleanly extracted
233
+ - No extra words like "Model", "LoRA", "Celebrity"
234
+ - Multi-word names kept together (e.g., "Emma Watson" not just "Emma")
235
+
236
+ ❌ **Issues to watch**:
237
+ - Empty results (increase fallback logic)
238
+ - Partial names (e.g., only first name)
239
+ - Non-names included (tune preprocessing)
240
+
241
+ ## Customization
242
+
243
+ ### Add More Languages
244
+
245
+ For better support of non-English names:
246
+
247
+ ```python
248
+ # Download multilingual model
249
+ python -m spacy download xx_ent_wiki_sm
250
+
251
+ # Use in code
252
+ nlp = spacy.load("xx_ent_wiki_sm")
253
+ ```
254
+
255
+ ### Adjust Entity Extraction
256
+
257
+ To extract other entities:
258
+
259
+ ```python
260
+ # Extract organizations too
261
+ entities = [ent.text for ent in doc.ents
262
+ if ent.label_ in ["PERSON", "ORG"]]
263
+ ```
264
+
265
+ ### Custom Entity Rules
266
+
267
+ Add custom patterns for names spaCy might miss:
268
+
269
+ ```python
270
+ from spacy.matcher import Matcher
271
+
272
+ matcher = Matcher(nlp.vocab)
273
+ # Add patterns for specific name formats
274
+ ```
275
+
276
+ ## Benefits for This Project
277
+
278
+ ### Better Person Identification
279
+
280
+ With cleaner names:
281
+ - LLMs receive recognizable names
282
+ - "Emma Watson" instead of "Emma Watson Model LoRA v3"
283
+ - Better identification accuracy
284
+
285
+ ### Reduced Ambiguity
286
+
287
+ spaCy helps distinguish:
288
+ - Person names vs. descriptive words
289
+ - "Celebrity IU" → "IU" (person)
290
+ - "Model Bella" → "Bella" (person)
291
+
292
+ ### Improved Context for LLMs
293
+
294
+ Cleaner input = better prompts:
295
+ ```
296
+ Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
297
+ After: "Given 'Emma Watson' (actress)..."
298
+ ```
299
+
300
+ The LLM can now focus on identifying the person, not parsing the noise.
301
+
302
+ ## Summary
303
+
304
+ ✅ **spaCy NER** provides intelligent, context-aware name extraction
305
+ ✅ **Better than regex** for handling complex name formats
306
+ ✅ **Fallback strategy** ensures we always get a result
307
+ ✅ **Industry standard** tool used in production NLP
308
+ ✅ **Easy to use** with minimal code
309
+
310
+ The combination of:
311
+ 1. Leetspeak translation
312
+ 2. Noise removal
313
+ 3. spaCy NER
314
+ 4. Smart fallbacks
315
+
316
+ ...results in clean, accurate person names ready for LLM annotation!
md/TESTING_INSTRUCTIONS.md ADDED
@@ -0,0 +1,148 @@
1
+ # Quick Testing Instructions
2
+
3
+ ## Start Here! 🚀
4
+
5
+ You mentioned you have Deepseek credits, so **start by testing with Deepseek first** before trying the other LLMs.
6
+
7
+ ## Step-by-Step Testing
8
+
9
+ ### 1. Make sure your Deepseek API key is in place
10
+
11
+ Check if this file exists:
12
+ ```bash
13
+ cat misc/credentials/deepseek_api_key.txt
14
+ ```
15
+
16
+ If not, create it:
17
+ ```bash
18
+ echo "your-deepseek-api-key" > misc/credentials/deepseek_api_key.txt
19
+ ```
20
+
21
+ ### 2. Open the notebook
22
+
23
+ ```bash
24
+ jupyter notebook jupyter_notebooks/Section_2-3-4_Figure_8_deepfake_adapters.ipynb
25
+ ```
26
+
27
+ ### 3. Run the cells in order
28
+
29
+ 1. **Cell 0-4**: Introduction and setup (just markdown, no execution needed)
30
+ 2. **Cell 5**: NER & Name Cleaning (processes `real_person_adapters.csv`)
31
+ 3. **Cell 7**: Country/Nationality Mapping
32
+ 4. **Cell 10**: 🌟 **DEEPSEEK ANNOTATION** (TEST THIS FIRST!)
33
+ - Default: `TEST_MODE = True` (10 samples)
34
+ - Will create: `data/CSV/deepseek_annotated_POI_test.csv`
35
+ 5. **Cell 12**: Qwen/Llama/Mistral (run later after Deepseek works)
36
+
37
+ ### 4. Review Deepseek Results
38
+
39
+ After Cell 10 completes, check:
40
+ - Console output shows summary statistics
41
+ - Output file: `data/CSV/deepseek_annotated_POI_test.csv`
42
+
43
+ Example output should look like:
44
+ ```
45
+ ✅ Progress saved after 10 rows
46
+ ✅ Done! Final results saved to data/CSV/deepseek_annotated_POI_test.csv
47
+
48
+ === Summary Statistics ===
49
+ Total processed: 10
50
+
51
+ Gender distribution:
52
+ Female 8
53
+ Male 2
54
+ ...
55
+ ```
56
+
57
+ ### 5. If Deepseek Works Well
58
+
59
+ Once you're satisfied with the Deepseek results:
60
+
61
+ **Option A: Process full dataset with Deepseek**
62
+ ```python
63
+ # In Cell 10, change:
64
+ TEST_MODE = False
65
+ ```
66
+
67
+ **Option B: Try other LLMs for comparison**
68
+ 1. Set up API keys for Qwen/Llama/Mistral (see `misc/credentials/README.md`)
69
+ 2. Run Cell 12 with your chosen LLM:
70
+ ```python
71
+ SELECTED_LLM = 'qwen' # or 'llama' or 'mistral'
72
+ TEST_MODE = True # Test first!
73
+ ```
74
+
75
+ ## Expected Cost (Deepseek)
76
+
77
+ - **10 samples** (test): ~$0.01 or less
78
+ - **1,000 entries**: ~$0.10-0.20
79
+ - **10,000 entries**: ~$1-2 (the scaling is linear - see the sketch below)
80
+
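+ Cost scales linearly with row count; the per-row rate below is an assumption derived from the estimates above:
+
+ ```python
+ per_row = (0.0001, 0.0002)            # assumed $/row at Deepseek's rates
+ for n in (10, 1_000, 10_000):
+     lo, hi = (n * p for p in per_row)
+     print(f"{n:>6} rows: ~${lo:g}-${hi:g}")
+ ```
+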
81
+ Much cheaper than the other options, making it perfect for testing!
82
+
83
+ ## Troubleshooting
84
+
85
+ ### "deepseek_api_key.txt not found"
86
+ ```bash
87
+ # Create the file with your key
88
+ echo "your-api-key" > misc/credentials/deepseek_api_key.txt
89
+ ```
90
+
91
+ ### "File does not exist: real_person_adapters.csv"
92
+ Make sure the input file exists:
93
+ ```bash
94
+ ls -lh data/CSV/real_person_adapters.csv
95
+ ```
96
+
97
+ ### API Rate Limiting
98
+ The code includes automatic rate limiting (`time.sleep(1)` between requests). If you still get rate limited:
99
+ - Increase the sleep time in Cell 10: change `time.sleep(1)` to `time.sleep(2)`
100
+
101
+ ### Pipeline Interrupted
102
+ No problem! The code saves progress every 10 rows. Just re-run the cell and it will resume from where it left off.
103
+
104
+ ## What's Next?
105
+
106
+ After testing with Deepseek:
107
+
108
+ 1. **If results look good**: Scale up to full dataset with Deepseek
109
+ 2. **Compare LLMs**: Test Qwen/Llama/Mistral on the same sample to see which gives best results
110
+ 3. **Production run**: Choose your preferred LLM and process the full dataset
111
+
112
+ ## File Outputs
113
+
114
+ The pipeline creates these files:
115
+
116
+ ```
117
+ data/CSV/
118
+ ├── NER_POI_step01_pre_annotation.csv # After Cell 5 (name cleaning)
119
+ ├── NER_POI_step02_annotated.csv # After Cell 7 (country mapping)
120
+ ├── deepseek_annotated_POI_test.csv # After Cell 10 (test mode)
121
+ ├── deepseek_annotated_POI.csv # After Cell 10 (full mode)
122
+ ├── qwen_annotated_POI_test.csv # After Cell 12 (if using Qwen)
123
+ └── ...
124
+
125
+ misc/
126
+ ├── deepseek_query_index.txt # Progress tracking
127
+ └── ...
128
+ ```
129
+
130
+ ## Quick Commands
131
+
132
+ ```bash
133
+ # View first few results
134
+ head -20 data/CSV/deepseek_annotated_POI_test.csv
135
+
136
+ # Count processed rows
137
+ wc -l data/CSV/deepseek_annotated_POI_test.csv
138
+
139
+ # Check progress
140
+ cat misc/deepseek_query_index.txt
141
+
142
+ # Reset progress (start from scratch)
143
+ rm misc/deepseek_query_index.txt
144
+ ```
145
+
146
+ ---
147
+
148
+ **Ready to start?** Open the notebook and run Cell 5 → Cell 7 → Cell 10! 🎉
md/UPDATES_AND_FIXES.md ADDED
@@ -0,0 +1,235 @@
1
+ # Recent Updates and Fixes
2
+
3
+ ## Overview
4
+
5
+ Two important fixes have been implemented based on testing feedback:
6
+
7
+ 1. **Leetspeak Translation** (before NER)
8
+ 2. **Improved Country Mapping** (check ALL tags)
9
+
10
+ ---
11
+
12
+ ## Fix 1: Leetspeak Translation
13
+
14
+ ### Problem
15
+ Names with leetspeak (numbers replacing letters) weren't being properly cleaned:
16
+ - `4kira` should be `Akira`
17
+ - `1rene` should be `Irene`
18
+ - `3mma` should be `Emma`
19
+
20
+ ### Solution
21
+ Added leetspeak translation **before** other NER processing in Cell 5.
22
+
23
+ ### Mapping Table
24
+ | Leetspeak | Letter |
25
+ |-----------|--------|
26
+ | 4 | A |
27
+ | 3 | E |
28
+ | 1 | I |
29
+ | 0 | O |
30
+ | 7 | T |
31
+ | 5 | S |
32
+ | 8 | B |
33
+ | 9 | G |
34
+ | @ | A |
35
+ | $ | S |
36
+ | ! | I |
37
+
38
+ ### Examples
39
+ ```
40
+ 4kira -> akira
41
+ 3mma -> emma
42
+ 1rene -> irene
43
+ L3vi -> Levi
44
+ S4sha -> Sasha
45
+ K4te -> Kate
46
+ J3ssica -> Jessica
47
+ ```
48
+
49
+ ### Implementation
50
+ The `translate_leetspeak()` function runs FIRST in `clean_name()`, before emoji removal and other cleaning steps. This ensures leetspeak is converted to proper letters before any other processing.
51
+
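+ A minimal sketch of the translation step (the mapping mirrors the table above, lower-casing the target letters to match the examples; the notebook's function may differ in detail):
+
+ ```python
+ LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o", "7": "t",
+                           "5": "s", "8": "b", "9": "g", "@": "a", "$": "s", "!": "i"})
+
+ def translate_leetspeak(name: str) -> str:
+     return name.translate(LEET_MAP)
+
+ assert translate_leetspeak("4kira") == "akira"
+ assert translate_leetspeak("J3ssica") == "Jessica"
+ ```
+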
52
+ ---
53
+
54
+ ## Fix 2: Improved Country Mapping
55
+
56
+ ### Problem
57
+ The country mapping was stopping at the first match, which meant:
58
+ - **Irene** with tags `['girl', 'photorealistic', 'asian', 'woman', 'beautiful', 'celebrity', 'korean']`
59
+ - The `'korean'` tag wasn't being properly mapped to `'South Korea'`
60
+ - This resulted in incomplete hints being sent to the LLM
61
+ - **Expected**: Deepseek should identify **Bae Joo-hyun (Irene)** from Red Velvet
62
+
63
+ ### Solution
64
+ Updated Cell 7 to:
65
+ 1. **Check ALL tags** (not just stop at first match)
66
+ 2. **Use a priority system** to select the best match:
67
+ - Priority 3: Exact country name match (highest)
68
+ - Priority 2: Nationality match (medium)
69
+ - Priority 1: Word parts (lowest)
70
+
71
+ ### How It Works
72
+
73
+ #### Before (Broken)
74
+ ```python
75
+ def infer_country_and_nationality(tags):
76
+ for tag in tags:
77
+ if tag in mapping:
78
+ return mapping[tag] # ❌ Stops at first match!
79
+ return ("", "")
80
+ ```
81
+
82
+ #### After (Fixed)
83
+ ```python
84
+ def infer_country_and_nationality(tags):
85
+ best_match = None
86
+ best_priority = 0
87
+
88
+ for tag in tags: # ✅ Check ALL tags
89
+ if tag in mapping:
90
+ country, nationality, priority = mapping[tag]
91
+ if priority > best_priority:
92
+ best_match = (country, nationality)
93
+ best_priority = priority
94
+
95
+ return best_match or ("", "")
96
+ ```
97
+
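+ For concreteness, the mapping entries look like this (values and priorities are illustrative; the notebook's actual table is much larger):
+
+ ```python
+ # (country, nationality, priority) - 3: exact country, 2: nationality, 1: word part
+ mapping = {
+     "south korea": ("South Korea", "South Korean", 3),
+     "korean":      ("South Korea", "South Korean", 2),
+     "japan":       ("Japan", "Japanese", 3),
+     "japanese":    ("Japan", "Japanese", 2),
+     "america":     ("United States", "American", 1),  # ambiguous word part
+ }
+ ```
+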
98
+ ### Example: Irene Case
99
+
100
+ **Input Tags**: `['girl', 'photorealistic', 'asian', 'woman', 'beautiful', 'celebrity', 'korean']`
101
+
102
+ **Processing**:
103
+ 1. Check `'girl'` → no match
104
+ 2. Check `'photorealistic'` → no match
105
+ 3. Check `'asian'` → no match (too generic)
106
+ 4. Check `'woman'` → no match
107
+ 5. Check `'beautiful'` → no match
108
+ 6. Check `'celebrity'` → no match
109
+ 7. Check `'korean'` → ✅ **MATCH!**
110
+ - Maps to nationality: `'South Korean'`
111
+ - Which maps to country: `'South Korea'`
112
+ - Priority: 2 (nationality match)
113
+
114
+ **Output**:
115
+ - `likely_country`: `'South Korea'`
116
+ - `likely_nationality`: `'South Korean'`
117
+
118
+ **Sent to Deepseek**:
119
+ ```
120
+ Given 'Irene' (celebrity, South Korea), provide:
121
+ 1. Full legal name
122
+ 2. Aliases
123
+ 3. Gender
124
+ 4. Top 3 professions
125
+ 5. Country
126
+ ```
127
+
128
+ **Expected Result**: Deepseek can now identify this as **Bae Joo-hyun (Irene)**, a South Korean singer/actress from the K-pop group Red Velvet.
129
+
130
+ ---
131
+
132
+ ## Impact on Results
133
+
134
+ ### Better Name Recognition
135
+ - Leetspeak names are now properly translated
136
+ - LLMs receive cleaner, more recognizable names
137
+
138
+ ### Better Country Context
139
+ - All tags are now considered for country mapping
140
+ - More accurate country/nationality hints sent to LLMs
141
+ - Better identification of international celebrities
142
+
143
+ ### Example Improvements
144
+
145
+ | Name | Tags | Before | After |
146
+ |------|------|--------|-------|
147
+ | `4kira LoRA` | `['japanese', 'actress']` | `'4kira'` + no country | `'Akira'` + `'Japan'` |
148
+ | `Irene` | `['korean', 'celebrity']` | `'Irene'` + no country | `'Irene'` + `'South Korea'` |
149
+ | `1U` | `['korean', 'singer']` | `'1U'` + no country | `'IU'` + `'South Korea'` |
150
+ | `3lsa` | `['model']` | `'3lsa'` + no country | `'Elsa'` + country if tagged |
151
+
152
+ ---
153
+
154
+ ## Testing Recommendations
155
+
156
+ ### Before Running Full Pipeline
157
+
158
+ 1. **Test Leetspeak Translation** (Cell 5):
159
+ ```python
160
+ # Look for names with numbers in the output
161
+ # Verify they're properly translated
162
+ ```
163
+
164
+ 2. **Test Country Mapping** (Cell 7):
165
+ ```python
166
+ # Check the debug output at the end:
167
+ # "🔍 Checking 'Irene' entries:"
168
+ # Verify country is properly mapped
169
+ ```
170
+
171
+ 3. **Test Deepseek Results** (Cell 10):
172
+ ```python
173
+ # Look for Irene in the results
174
+ # Should now identify as Bae Joo-hyun
175
+ ```
176
+
177
+ ### Validation Checklist
178
+
179
+ - [ ] Leetspeak names are translated (check console output in Cell 5)
180
+ - [ ] Country mapping shows high success rate (check stats in Cell 7)
181
+ - [ ] Irene is correctly identified as Bae Joo-hyun (check results in Cell 10)
182
+ - [ ] Other K-pop/Korean celebrities are properly identified
183
+ - [ ] Japanese/Chinese celebrities also benefit from better country mapping
184
+
185
+ ---
186
+
187
+ ## Notes
188
+
189
+ ### Why Check ALL Tags?
190
+
191
+ Some entries have many tags, and the most informative tag might not be first:
192
+ ```
193
+ tags = ['girl', 'sexy', 'beautiful', 'asian', 'korean', 'celebrity', 'kpop']
194
+ ^^^^ Most informative!
195
+ ```
196
+
197
+ The old code might stop at `'girl'` or `'asian'` (no country info), missing the `'korean'` tag.
198
+
199
+ ### Why Use Priority?
200
+
201
+ Some tags might match multiple countries. Priority ensures we get the best match:
202
+ - `'american'` → exact nationality match (priority 2) → USA
203
+ - `'america'` → could be North/South/Central America (priority 1)
204
+
205
+ The system picks the higher priority match.
206
+
207
+ ### Word Length Filter
208
+
209
+ Word parts only match if >4 characters to avoid false positives:
210
+ - ✅ `'china'` → matches China (5 chars)
211
+ - ❌ `'us'` → too short, might be part of other words
212
+
213
+ ---
214
+
215
+ ## Future Improvements
216
+
217
+ Potential enhancements:
218
+ 1. **More leetspeak patterns**: `|\/|` for M, `(_)` for U, etc.
219
+ 2. **Fuzzy country matching**: Handle typos like `'corean'` → `'korean'`
220
+ 3. **Multi-country support**: Some celebrities work in multiple countries
221
+ 4. **Language detection**: Use name structure to infer origin
222
+
223
+ ---
224
+
225
+ ## Summary
226
+
227
+ ✅ **Leetspeak translation** ensures names are readable before NER
228
+ ✅ **ALL tags checked** ensures no country hints are missed
229
+ ✅ **Priority system** ensures best match is selected
230
+ ✅ **Better LLM results** from improved name quality and country context
231
+
232
+ These fixes should significantly improve the accuracy of person identification, especially for:
233
+ - International celebrities (K-pop, J-pop, C-pop)
234
+ - Names with leetspeak
235
+ - Entries where country info appears later in tag list
misc/assets/fonts/DejaVuSans.ttf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7da195a74c55bef988d0d48f9508bd5d849425c1770dba5d7bfc6ce9ed848954
3
+ size 757076
misc/assets/fonts/Noto_Sans.zip ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c7801615f2c5a7a107313cd0a88a15c3b1f15a2da9d4a3648cf49711d8be44da
3
+ size 47636998
misc/assets/fonts/Noto_Sans/NotoSans-Italic-VariableFont_wdth,wght.ttf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fe5a1fafb96618733aa4ea4c14a2e76ee65fee0d042b040e374f17575467e433
3
+ size 2637272
misc/assets/fonts/Noto_Sans/NotoSans-VariableFont_wdth,wght.ttf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:74df1f61ab9d4bfaa961c65f8dc991deaae2885b0a6a6d6a60ed23980b3c8554
3
+ size 2490816
misc/assets/fonts/Noto_Sans/OFL.txt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:be2f3f8727ac2e18b714ad1c4336d4ddb3f3adbeb9a7f70bfab74d21f4d2b3fb
3
+ size 4489