Laura Wagner committed on
Commit 5f5806d · 1 Parent(s): e15d7a4
to commit or not commit that is the question
This view is limited to 50 files because it contains too many changes.
- .gitignore +12 -0
- README.md +73 -24
- gitlab-ci.yml +11 -0
- jupyter_notebooks/.ipynb_checkpoints/GEMMA_3-checkpoint.ipynb +400 -0
- jupyter_notebooks/.ipynb_checkpoints/MISTRAL-checkpoint.ipynb +6 -0
- jupyter_notebooks/.ipynb_checkpoints/QWEN-checkpoint.ipynb +645 -0
- jupyter_notebooks/.ipynb_checkpoints/Section_2-2-2_Figure_4_Demographic_patterns_in_gen_images-checkpoint.ipynb +1795 -0
- jupyter_notebooks/.ipynb_checkpoints/Section_2-3-1_Tag_occurences-checkpoint.ipynb +6 -0
- jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Bloomz_query-checkpoint.ipynb +370 -0
- jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_Step_1_LLM_annotation-checkpoint.ipynb +1941 -0
- jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_Step_2_response_comparison_and_consensus_extraction-checkpoint.ipynb +0 -0
- jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_deepfake_adapters-Copy1-checkpoint.ipynb +0 -0
- jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_deepfake_adapters-checkpoint.ipynb +0 -0
- jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4__Figure_8_Deepfake_victims-checkpoint.ipynb +668 -0
- jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4__Figure_8b_sunburst_profession-checkpoint.ipynb +123 -0
- jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_compare-models-checkpoint.ipynb +237 -0
- jupyter_notebooks/.ipynb_checkpoints/Section_2-3_Figure_5_co-occurence_promotional_tags-checkpoint.ipynb +314 -0
- jupyter_notebooks/.ipynb_checkpoints/Section_2-4_Figure_9_ectract_LoRA_metadata_v2-checkpoint.ipynb +400 -0
- jupyter_notebooks/0_Scraping_image_metadata.ipynb +1345 -0
- jupyter_notebooks/0_Scraping_model_metadata.ipynb +643 -0
- jupyter_notebooks/Section_1_Figure_1_image_grid.ipynb +417 -0
- jupyter_notebooks/Section_2-2-2_Figure_3_histogram_monthly_images_nsfw_levels.ipynb +0 -0
- jupyter_notebooks/Section_2-2-2_Figure_4_Demographic_patterns_in_gen_images.ipynb +1795 -0
- jupyter_notebooks/Section_2-3-1_Tag_occurences.ipynb +801 -0
- jupyter_notebooks/Section_2-3-2_top_10_most_popular_checkpoints.ipynb +210 -0
- jupyter_notebooks/Section_2-3-3_Figure_7_top_30_adapters.ipynb +0 -0
- jupyter_notebooks/Section_2-3-4_Figure_8_Step_1_LLM_annotation.ipynb +1941 -0
- jupyter_notebooks/Section_2-3-4_Figure_8_Step_2_response_comparison_and_consensus_extraction.ipynb +0 -0
- jupyter_notebooks/Section_2-3-4__Figure_8a_sunburst_gender.ipynb +129 -0
- jupyter_notebooks/Section_2-3-4__Figure_8b_sunburst_profession.ipynb +332 -0
- jupyter_notebooks/Section_2-3_Figure_5_co-occurence_promotional_tags.ipynb +314 -0
- jupyter_notebooks/Section_2-4_Figure_9_Training_tags_Sankey.ipynb +203 -0
- jupyter_notebooks/Section_2-4_Figure_9_ectract_LoRA_metadata_v2.ipynb +414 -0
- jupyter_notebooks/Section_2-4_Figure_9_extract_LoRA_metadata.ipynb +557 -0
- jupyter_notebooks/SuppM_Figure_13_Danbooru_categories.ipynb +141 -0
- jupyter_notebooks/SuppM_Figure_S12_asset_types.ipynb +129 -0
- jupyter_notebooks/SuppM_Figure_S13_Danbooru_Taxonomy.ipynb +1848 -0
- jupyter_notebooks/SuppM_Figure_S14_co-occurence_training_data.ipynb +152 -0
- md/DEEPFAKE_PIPELINE_GUIDE.md +210 -0
- md/LLM_MODELS_COMPARISON.md +326 -0
- md/QUICK_START_LOCAL.md +171 -0
- md/QWEN_LOCAL_SETUP.md +321 -0
- md/SPACY_NER_EXPLANATION.md +316 -0
- md/TESTING_INSTRUCTIONS.md +148 -0
- md/UPDATES_AND_FIXES.md +235 -0
- misc/assets/fonts/DejaVuSans.ttf +3 -0
- misc/assets/fonts/Noto_Sans.zip +3 -0
- misc/assets/fonts/Noto_Sans/NotoSans-Italic-VariableFont_wdth,wght.ttf +3 -0
- misc/assets/fonts/Noto_Sans/NotoSans-VariableFont_wdth,wght.ttf +3 -0
- misc/assets/fonts/Noto_Sans/OFL.txt +3 -0
.gitignore
ADDED
|
@@ -0,0 +1,12 @@
|
| 1 |
+
data
|
| 2 |
+
ext
|
| 3 |
+
ARCHIVE
|
| 4 |
+
misc/api_keys.txt
|
| 5 |
+
misc/training_tags_categories
|
| 6 |
+
misc/credentials/*
|
| 7 |
+
misc/credentials
|
| 8 |
+
scripts/ARCHIVE
|
| 9 |
+
scripts/CEMETARY
|
| 10 |
+
cemetary
|
| 11 |
+
.venv
|
| 12 |
+
logs
|
README.md
CHANGED
|
@@ -1,29 +1,78 @@
|
|
| 1 |
-
#
|
| 2 |
|
| 3 |
-
This repository contains the code for [Perpetuating Misogyny with Generative AI: How Model Personalization Normalizes Gendered Harm](https://arxiv.org/abs/2505.04600).
|
| 4 |
|
| 5 |
-
## Related Datasets
|
| 6 |
-
|
| 7 |
-
This project uses three datasets, all hosted on Hugging Face:
|
| 8 |
-
- Dataset 1: [dataset-name-1](https://huggingface.co/datasets/username/dataset-1)
|
| 9 |
-
- Dataset 2: [dataset-name-2](https://huggingface.co/datasets/username/dataset-2)
|
| 10 |
-
- Dataset 3: [dataset-name-3](https://huggingface.co/datasets/username/dataset-3)
|
| 11 |
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
```
|
| 16 |
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 1 |
+
# Code for the Paper titled ["Perpetual Misogyny: How Gendered Tropes Shape Text-To-Image AI Personalization"](http://arxiv.org/pdf/2505.04600)
|
| 2 |
|
| 3 |
|
| 4 |
|
| 5 |
+
|
| 6 |
+
### Repository Structure
|
| 7 |
+
|
| 8 |
+
```
|
| 9 |
+
CIVITAI_VISUALIZATIONS/
|
| 10 |
+
├── .virtual_documents/ # Temporary files from Jupyter
|
| 11 |
+
├── data/ # Final curated datasets
|
| 12 |
+
│ ├── subset1/ # Specific data splits or versions
|
| 13 |
+
│ ├── subset2/
|
| 14 |
+
│ └── ...
|
| 15 |
+
├── misc/
|
| 16 |
+
│ └── credentials/ # API keys and sensitive config (excluded from versioning)
|
| 17 |
+
├── plots/ # Output plots and figures used in the paper
|
| 18 |
+
├── jupyter_notebooks/ # Main analysis notebooks
|
| 19 |
+
│ ├── 0_Scraping_image_metadata.ipynb # Scrapes CivitAI metadata via API
|
| 20 |
+
│ ├── Section_1_Figure_1_image_grid.ipynb # Grid of sample images
|
| 21 |
+
│ ├── Section_3-2-1_Figure_3_histogram.ipynb # Histogram of upload trends
|
| 22 |
+
│ ├── Section_3-2-1_Figure_4_Mivolo.ipynb # Model activity plot (Mivolo-focused)
|
| 23 |
+
│ ├── Section_3-3-1_Figure_5_tags.ipynb # Tag frequency and usage visualizations
|
| 24 |
+
│ ├── Section_3-3-3_download_popular_models.ipynb # Download models for analysis
|
| 25 |
+
│ ├── Section_3-3-3_Figure_6.ipynb # Promotional tag usage patterns
|
| 26 |
+
│ ├── Section_3-3-4_Figure_8a.ipynb # Ranking of popular models
|
| 27 |
+
│ ├── Section_3-3-4_Figure_8b.ipynb # Continuation of model rankings
|
| 28 |
+
│ ├── Section_3-3-4_Figure_9_Sankey.ipynb # Sankey diagram: user-model contributions
|
| 29 |
+
│ ├── Section_3-3-4_LLM_annotation.ipynb # Annotations using large language models
|
| 30 |
+
│ ├── Section_3-4_extract_LoRA_metadata.ipynb # LoRA metadata extraction
|
| 31 |
+
│ ├── SuppM_Figure_12_Danbooru_Taxonomy.ipynb # Danbooru tag taxonomy: visualization
|
| 32 |
+
│ ├── SuppM_Figure_13_Danbooru_taxonomy.ipynb # Tag grouping and structure
|
| 33 |
+
│ └── SuppM_Figure_13.ipynb # Supplementary figure generation
|
| 34 |
+
├── misc/ # Utility scripts and API credentials (excluded)
|
| 35 |
+
├── plots/ # Output plots and visualizations
|
| 36 |
+
├── public/ # Optional public-facing files
|
| 37 |
+
├── .gitignore
|
| 38 |
+
├── .gitmodules
|
| 39 |
+
├── README.md # This file
|
| 40 |
+
└── requirements.txt # Python dependencies
|
| 41 |
```
|
| 42 |
|
| 43 |
+
# Project Setup
|
| 44 |
+
|
| 45 |
+
## Requirements
|
| 46 |
+
This project requires Python 3.8 or higher. Ensure you have it installed before proceeding.
|
| 47 |
+
|
| 48 |
+
## Installation
|
| 49 |
+
|
| 50 |
+
1. **Clone the Repository**
|
| 51 |
+
```sh
|
| 52 |
+
git clone https://gitlab.uzh.ch/latent-canon/pm-paper.git
|
| 53 |
+
cd pm-paper
|
| 54 |
+
```
|
| 55 |
+
|
| 56 |
+
2. **Create a Virtual Environment**
|
| 57 |
+
(Recommended to avoid dependency conflicts)
|
| 58 |
+
```sh
|
| 59 |
+
python -m venv venv
|
| 60 |
+
source venv/bin/activate # On macOS/Linux
|
| 61 |
+
venv\Scripts\activate # On Windows
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
3. **Install Dependencies**
|
| 65 |
+
```sh
|
| 66 |
+
pip install -r requirements.txt
|
| 67 |
+
```
|
| 68 |
+
|
| 69 |
+
4. **Jupyter Notebook Setup (Optional)**
|
| 70 |
+
To run the Jupyter notebooks, register the virtual environment as a kernel:
|
| 71 |
+
```sh
|
| 72 |
+
python -m ipykernel install --user --name=venv --display-name "Python (venv)"
|
| 73 |
+
```
|
| 74 |
+
|
| 75 |
+
## Notes
|
| 76 |
+
- Ensure the necessary system dependencies are installed (e.g., `opencv` may require additional system libraries).
|
| 77 |
+
- If you encounter any issues, ensure you're using the correct Python environment (`venv` activated).
|
| 78 |
+
|
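The Notes above ask readers to confirm that the correct interpreter and environment are in use. A minimal sketch of such a check follows, assuming only the Python 3.8 minimum stated under Requirements. The helper name `environment_ok` is hypothetical and not part of the repository.

```python
import sys


def environment_ok(min_version=(3, 8)):
    """Return (version_ok, in_venv) for the running interpreter.

    Hypothetical helper: `min_version` defaults to the minimum stated
    in the README's Requirements section.
    """
    # Tuple comparison against sys.version_info checks major/minor order.
    version_ok = sys.version_info >= min_version
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base installation.
    in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    return version_ok, in_venv


version_ok, in_venv = environment_ok()
print(f"Python version OK: {version_ok}; virtual environment active: {in_venv}")
```

Running this in a notebook cell after selecting the "Python (venv)" kernel is one way to confirm that the kernel is actually linked to the environment created during installation.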
gitlab-ci.yml
ADDED
|
@@ -0,0 +1,11 @@
|
| 1 |
+
|
| 2 |
+
# .gitlab-ci.yml
|
| 3 |
+
pages:
|
| 4 |
+
stage: deploy
|
| 5 |
+
script:
|
| 6 |
+
- echo "Nothing to build, serving public/"
|
| 7 |
+
artifacts:
|
| 8 |
+
paths:
|
| 9 |
+
- public
|
| 10 |
+
only:
|
| 11 |
+
- main
|
jupyter_notebooks/.ipynb_checkpoints/GEMMA_3-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,400 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "code",
|
| 5 |
+
"execution_count": null,
|
| 6 |
+
"id": "471e0cef-678e-4403-8eca-f8e1991d86de",
|
| 7 |
+
"metadata": {},
|
| 8 |
+
"outputs": [],
|
| 9 |
+
"source": [
|
| 10 |
+
"import pandas as pd\n",
|
| 11 |
+
"import json\n",
|
| 12 |
+
"import time\n",
|
| 13 |
+
"import re\n",
|
| 14 |
+
"from pathlib import Path\n",
|
| 15 |
+
"from tqdm import tqdm\n",
|
| 16 |
+
"import torch\n",
|
| 17 |
+
"from transformers import AutoModelForCausalLM, AutoTokenizer\n",
|
| 18 |
+
"\n",
|
| 19 |
+
"current_dir = Path.cwd()\n",
|
| 20 |
+
"input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
|
| 21 |
+
"professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
|
| 22 |
+
"professions_mapped_file = current_dir.parent / \"misc/lists/professions_mapped.csv\"\n",
|
| 23 |
+
"# === PROCESS DATA ===\n",
|
| 24 |
+
"\n",
|
| 25 |
+
"\n",
|
| 26 |
+
"# === CONFIGURATION ===\n",
|
| 27 |
+
"TEST_MODE = False\n",
|
| 28 |
+
"TEST_SIZE = 100\n",
|
| 29 |
+
"MAX_ROWS = 50862\n",
|
| 30 |
+
"SAVE_INTERVAL = 10\n",
|
| 31 |
+
"\n",
|
| 32 |
+
"output_file = current_dir.parent / f\"data/CSV/gemma_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
|
| 33 |
+
"index_file = current_dir.parent / \"misc/query_indicies/gemma_local_query_index.txt\"\n",
|
| 34 |
+
"\n",
|
| 35 |
+
"# Model settings\n",
|
| 36 |
+
"MODEL_NAME = \"google/gemma-3-27b-it\"\n",
|
| 37 |
+
"#MODEL_NAME = \"mistralai/Mistral-7B-Instruct-v0.3\"\n",
|
| 38 |
+
"CACHE_DIR = current_dir.parent / \"data/models\"\n",
|
| 39 |
+
"CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
|
| 40 |
+
"\n",
|
| 41 |
+
"# Define the SPECIFIC profession categories\n",
|
| 42 |
+
"PROFESSION_CATEGORIES = [\n",
|
| 43 |
+
" \"actor\",\n",
|
| 44 |
+
" \"adult performer\",\n",
|
| 45 |
+
" \"singer/musician\",\n",
|
| 46 |
+
" \"model\",\n",
|
| 47 |
+
" \"online personality\",\n",
|
| 48 |
+
" \"public figure\",\n",
|
| 49 |
+
" \"voice actor/ASMR\",\n",
|
| 50 |
+
" \"sports professional\",\n",
|
| 51 |
+
" \"tv personality\"\n",
|
| 52 |
+
"]\n",
|
| 53 |
+
"\n",
|
| 54 |
+
"# === LOAD MODEL ===\n",
|
| 55 |
+
"print(f\"Loading model: {MODEL_NAME}\")\n",
|
| 56 |
+
"print(f\"Cache directory: {CACHE_DIR}\")\n",
|
| 57 |
+
"print(f\"This may take a while on first run (~65GB download)...\\n\")\n",
|
| 58 |
+
"\n",
|
| 59 |
+
"# Check GPU availability\n",
|
| 60 |
+
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
|
| 61 |
+
"print(f\"Device: {device}\")\n",
|
| 62 |
+
"\n",
|
| 63 |
+
"if device == \"cpu\":\n",
|
| 64 |
+
" print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
|
| 65 |
+
" print(\" Consider using a GPU or reducing model size.\")\n",
|
| 66 |
+
"\n",
|
| 67 |
+
"# Load tokenizer\n",
|
| 68 |
+
"print(\"Loading tokenizer...\")\n",
|
| 69 |
+
"try:\n",
|
| 70 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 71 |
+
" MODEL_NAME,\n",
|
| 72 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 73 |
+
" use_fast=True\n",
|
| 74 |
+
" )\n",
|
| 75 |
+
"except Exception as e:\n",
|
| 76 |
+
" print(f\"Failed with use_fast=True, trying use_fast=False...\")\n",
|
| 77 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 78 |
+
" MODEL_NAME,\n",
|
| 79 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 80 |
+
" use_fast=False\n",
|
| 81 |
+
" )\n",
|
| 82 |
+
"\n",
|
| 83 |
+
"# Ensure pad token is set\n",
|
| 84 |
+
"if tokenizer.pad_token is None:\n",
|
| 85 |
+
" tokenizer.pad_token = tokenizer.eos_token\n",
|
| 86 |
+
"\n",
|
| 87 |
+
"print(\"✅ Tokenizer loaded\")\n",
|
| 88 |
+
"\n",
|
| 89 |
+
"# Load model with optimizations\n",
|
| 90 |
+
"print(\"Loading model (this may take several minutes)...\")\n",
|
| 91 |
+
"model = AutoModelForCausalLM.from_pretrained(\n",
|
| 92 |
+
" MODEL_NAME,\n",
|
| 93 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 94 |
+
" torch_dtype=torch.bfloat16,\n",
|
| 95 |
+
" device_map=\"auto\",\n",
|
| 96 |
+
" trust_remote_code=False\n",
|
| 97 |
+
")\n",
|
| 98 |
+
"model.eval()\n",
|
| 99 |
+
"print(\"✅ Model loaded\")\n",
|
| 100 |
+
"\n",
|
| 101 |
+
"# Check VRAM usage\n",
|
| 102 |
+
"if torch.cuda.is_available():\n",
|
| 103 |
+
" vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
|
| 104 |
+
" print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
|
| 105 |
+
"\n",
|
| 106 |
+
"# === LOAD DATA ===\n",
|
| 107 |
+
"print(\"Loading raw input CSV...\")\n",
|
| 108 |
+
"df = pd.read_csv(input_file) # ALWAYS load the full input\n",
|
| 109 |
+
"print(f\"Loaded {len(df)} rows from input file\")\n",
|
| 110 |
+
"\n",
|
| 111 |
+
"# If we have previous annotations, merge them\n",
|
| 112 |
+
"if output_file.exists():\n",
|
| 113 |
+
" print(\"Found existing annotations, merging...\")\n",
|
| 114 |
+
" existing_df = pd.read_csv(output_file)\n",
|
| 115 |
+
" print(f\"Existing annotations has {len(existing_df)} rows\")\n",
|
| 116 |
+
" \n",
|
| 117 |
+
" # Update df with existing annotations\n",
|
| 118 |
+
" # Only update the columns that were annotated\n",
|
| 119 |
+
" annotation_cols = ['full_name', 'aliases', 'gender', 'profession_llm', 'country']\n",
|
| 120 |
+
" for col in annotation_cols:\n",
|
| 121 |
+
" if col in existing_df.columns:\n",
|
| 122 |
+
" df[col] = existing_df[col][:len(df)] # Make sure we don't exceed df length\n",
|
| 123 |
+
" \n",
|
| 124 |
+
" print(f\"Merged annotations, continuing with {len(df)} total rows\")\n",
|
| 125 |
+
"\n",
|
| 126 |
+
"\n",
|
| 127 |
+
"# Try to load profession mapping files\n",
|
| 128 |
+
"try:\n",
|
| 129 |
+
" professions_df = pd.read_csv(professions_file)\n",
|
| 130 |
+
" print(f\"✅ Loaded professions.csv\")\n",
|
| 131 |
+
"except:\n",
|
| 132 |
+
" print(\"⚠️ Warning: professions.csv not found\")\n",
|
| 133 |
+
"\n",
|
| 134 |
+
"try:\n",
|
| 135 |
+
" prof_mapped_df = pd.read_csv(professions_mapped_file)\n",
|
| 136 |
+
" print(f\"✅ Loaded profession mapping with {len(prof_mapped_df)} categories\")\n",
|
| 137 |
+
"except:\n",
|
| 138 |
+
" print(\"⚠️ Warning: professions_mapped.csv not found, using default categories\")\n",
|
| 139 |
+
"\n",
|
| 140 |
+
"profession_str = \", \".join(PROFESSION_CATEGORIES)\n",
|
| 141 |
+
"\n",
|
| 142 |
+
"print(f\"Loaded {len(df)} rows\")\n",
|
| 143 |
+
"print(f\"\\nProfession categories ({len(PROFESSION_CATEGORIES)}):\")\n",
|
| 144 |
+
"for cat in PROFESSION_CATEGORIES:\n",
|
| 145 |
+
" print(f\" - {cat}\")\n",
|
| 146 |
+
"\n",
|
| 147 |
+
"if TEST_MODE:\n",
|
| 148 |
+
" print(f\"\\nRunning in TEST MODE with {TEST_SIZE} samples\")\n",
|
| 149 |
+
" df = df.head(TEST_SIZE).copy()\n",
|
| 150 |
+
"elif MAX_ROWS:\n",
|
| 151 |
+
" df = df.head(MAX_ROWS).copy()\n",
|
| 152 |
+
"\n",
|
| 153 |
+
"# === CREATE PROMPTS ===\n",
|
| 154 |
+
"def create_prompt(row):\n",
|
| 155 |
+
" \"\"\"Create prompt for Gemma annotation with specific profession categories.\"\"\"\n",
|
| 156 |
+
" name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
|
| 157 |
+
" \n",
|
| 158 |
+
" # Gather hints\n",
|
| 159 |
+
" hints = []\n",
|
| 160 |
+
" if pd.notna(row.get('likely_profession')):\n",
|
| 161 |
+
" hints.append(str(row['likely_profession']))\n",
|
| 162 |
+
" if pd.notna(row.get('likely_nationality')):\n",
|
| 163 |
+
" hints.append(str(row['likely_nationality']))\n",
|
| 164 |
+
" if pd.notna(row.get('likely_country')):\n",
|
| 165 |
+
" hints.append(str(row['likely_country']))\n",
|
| 166 |
+
" \n",
|
| 167 |
+
" # Add tags if we don't have enough hints\n",
|
| 168 |
+
" if len(hints) < 3:\n",
|
| 169 |
+
" for i in range(1, 8):\n",
|
| 170 |
+
" tag_col = f'tag_{i}'\n",
|
| 171 |
+
" if tag_col in row and pd.notna(row[tag_col]):\n",
|
| 172 |
+
" tag_val = str(row[tag_col])\n",
|
| 173 |
+
" if tag_val not in hints:\n",
|
| 174 |
+
" hints.append(tag_val)\n",
|
| 175 |
+
" if len(hints) >= 5:\n",
|
| 176 |
+
" break\n",
|
| 177 |
+
" \n",
|
| 178 |
+
" hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
|
| 179 |
+
" \n",
|
| 180 |
+
" return f\"\"\"Given '{name}' ({hint_text}), provide:\n",
|
| 181 |
+
"1. Full legal name (Western order if non-latin script)\n",
|
| 182 |
+
"2. Any stage names/aliases (comma separated)\n",
|
| 183 |
+
"3. Gender (Male/Female/Other/Unknown)\n",
|
| 184 |
+
"4. Top 3 most likely professions from ONLY these categories:\n",
|
| 185 |
+
" - actor\n",
|
| 186 |
+
" - adult performer\n",
|
| 187 |
+
" - singer/musician\n",
|
| 188 |
+
" - model\n",
|
| 189 |
+
" - online personality (includes streamers, cosplayers, influencers)\n",
|
| 190 |
+
" - public figure (includes politicians, activists, journalists, authors)\n",
|
| 191 |
+
" - voice actor/ASMR\n",
|
| 192 |
+
" - sports professional\n",
|
| 193 |
+
" - tv personality (includes hosts, presenters, reality TV)\n",
|
| 194 |
+
"\n",
|
| 195 |
+
"5. Primary country associated\n",
|
| 196 |
+
"\n",
|
| 197 |
+
"IMPORTANT:\n",
|
| 198 |
+
"- Choose professions ONLY from the 9 categories above\n",
|
| 199 |
+
"- Provide up to 3 professions, comma-separated, ordered by relevance\n",
|
| 200 |
+
"- Be SPECIFIC: choose the most accurate category for each role\n",
|
| 201 |
+
"- \"online personality\" includes: streamers, cosplayers, YouTubers, influencers, content creators\n",
|
| 202 |
+
"- Use 'Unknown' when uncertain or for fictional characters/places\n",
|
| 203 |
+
"- For multi-role people, list all relevant categories (e.g., \"actor, singer/musician, online personality\")\n",
|
| 204 |
+
"- For country respond with one word only, for example China or Colombia\n",
|
| 205 |
+
"- actress = actor\n",
|
| 206 |
+
"\n",
|
| 207 |
+
"Respond with exactly 5 numbered lines.\"\"\"\n",
|
| 208 |
+
"\n",
|
| 209 |
+
"# Create prompts\n",
|
| 210 |
+
"print(\"\\nCreating prompts...\")\n",
|
| 211 |
+
"df['prompt'] = df.apply(create_prompt, axis=1)\n",
|
| 212 |
+
"print(\"✅ Prompts created\")\n",
|
| 213 |
+
"\n",
|
| 214 |
+
"# === QUERY Gemma LOCAL ===\n",
|
| 215 |
+
"def query_gemma_local(prompt: str) -> str:\n",
|
| 216 |
+
" \"\"\"Query Gemma locally via transformers.\"\"\"\n",
|
| 217 |
+
" try:\n",
|
| 218 |
+
" # Format as chat message for GEMMA\n",
|
| 219 |
+
" messages = [\n",
|
| 220 |
+
" {\"role\": \"system\", \"content\": \"You are an assistant that extracts key data on a person based on the name. Respond with exactly 5 numbered lines. For professions, choose ONLY from these categories: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality.\"},\n",
|
| 221 |
+
" {\"role\": \"user\", \"content\": prompt}\n",
|
| 222 |
+
" ]\n",
|
| 223 |
+
" \n",
|
| 224 |
+
" # Tokenize\n",
|
| 225 |
+
" if hasattr(tokenizer, 'apply_chat_template'):\n",
|
| 226 |
+
" text = tokenizer.apply_chat_template(\n",
|
| 227 |
+
" messages,\n",
|
| 228 |
+
" tokenize=False,\n",
|
| 229 |
+
" add_generation_prompt=True\n",
|
| 230 |
+
" )\n",
|
| 231 |
+
" else:\n",
|
| 232 |
+
" # Fallback for older tokenizers\n",
|
| 233 |
+
" text = f\"[INST] {prompt} [/INST]\"\n",
|
| 234 |
+
" \n",
|
| 235 |
+
" inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
|
| 236 |
+
" \n",
|
| 237 |
+
" # Generate\n",
|
| 238 |
+
" with torch.no_grad():\n",
|
| 239 |
+
" outputs = model.generate(\n",
|
| 240 |
+
" **inputs,\n",
|
| 241 |
+
" max_new_tokens=512,\n",
|
| 242 |
+
" temperature=0.1,\n",
|
| 243 |
+
" do_sample=True,\n",
|
| 244 |
+
" top_p=0.9,\n",
|
| 245 |
+
" pad_token_id=tokenizer.eos_token_id\n",
|
| 246 |
+
" )\n",
|
| 247 |
+
" \n",
|
| 248 |
+
" # Decode\n",
|
| 249 |
+
" generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
|
| 250 |
+
" response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
|
| 251 |
+
" \n",
|
| 252 |
+
" return response.strip()\n",
|
| 253 |
+
" \n",
|
| 254 |
+
" except Exception as e:\n",
|
| 255 |
+
" print(f\"Generation error: {e}\")\n",
|
| 256 |
+
" return None\n",
|
| 257 |
+
"\n",
|
| 258 |
+
"output_file = current_dir.parent / f\"data/CSV/gemma_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
|
| 259 |
+
"index_file = current_dir.parent / \"misc/query_indicies/gemma_local_query_index.txt\"\n",
|
| 260 |
+
"\n",
|
| 261 |
+
"\n",
|
| 262 |
+
"# === PARSE RESPONSE ===\n",
|
| 263 |
+
"def parse_response(response):\n",
|
| 264 |
+
" \"\"\"Parse Gemma response into structured fields.\"\"\"\n",
|
| 265 |
+
" if not response:\n",
|
| 266 |
+
" return {\n",
|
| 267 |
+
" 'full_name': 'Unknown',\n",
|
| 268 |
+
" 'aliases': 'Unknown',\n",
|
| 269 |
+
" 'gender': 'Unknown',\n",
|
| 270 |
+
" 'profession_llm': 'Unknown',\n",
|
| 271 |
+
" 'country': 'Unknown'\n",
|
| 272 |
+
" }\n",
|
| 273 |
+
" \n",
|
| 274 |
+
" # Split into lines and clean\n",
|
| 275 |
+
" lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
|
| 276 |
+
" \n",
|
| 277 |
+
" # Initialize with Unknown values\n",
|
| 278 |
+
" fields = {\n",
|
| 279 |
+
" 'full_name': 'Unknown',\n",
|
| 280 |
+
" 'aliases': 'Unknown',\n",
|
| 281 |
+
" 'gender': 'Unknown',\n",
|
| 282 |
+
" 'profession_llm': 'Unknown',\n",
|
| 283 |
+
" 'country': 'Unknown'\n",
|
| 284 |
+
" }\n",
|
| 285 |
+
" \n",
|
| 286 |
+
" # Extract information from each numbered line\n",
|
| 287 |
+
" for line in lines:\n",
|
| 288 |
+
" if line.startswith('1.'):\n",
|
| 289 |
+
" fields['full_name'] = line[2:].strip()\n",
|
| 290 |
+
" elif line.startswith('2.'):\n",
|
| 291 |
+
" fields['aliases'] = line[2:].strip()\n",
|
| 292 |
+
" elif line.startswith('3.'):\n",
|
| 293 |
+
" fields['gender'] = line[2:].strip()\n",
|
| 294 |
+
" elif line.startswith('4.'):\n",
|
| 295 |
+
" fields['profession_llm'] = line[2:].strip()\n",
|
| 296 |
+
" elif line.startswith('5.'):\n",
|
| 297 |
+
" fields['country'] = line[2:].strip()\n",
|
| 298 |
+
" \n",
|
| 299 |
+
" return fields\n",
|
| 300 |
+
"\n",
|
| 301 |
+
"\n",
|
| 302 |
+
"# === PROCESS DATA ===\n",
|
| 303 |
+
"index_file.parent.mkdir(parents=True, exist_ok=True)\n",
|
| 304 |
+
"\n",
|
| 305 |
+
"# Load index\n",
|
| 306 |
+
"current_index = 0\n",
|
| 307 |
+
"if index_file.exists():\n",
|
| 308 |
+
" try:\n",
|
| 309 |
+
" current_index = int(index_file.read_text().strip())\n",
|
| 310 |
+
" except:\n",
|
| 311 |
+
" current_index = 0\n",
|
| 312 |
+
"\n",
|
| 313 |
+
"print(f\"Resuming from index {current_index}\")\n",
|
| 314 |
+
"\n",
|
| 315 |
+
"start_time = time.time()\n",
|
| 316 |
+
"\n",
|
| 317 |
+
"for i in tqdm(range(current_index, len(df)), desc=\"Gemma Local\"):\n",
|
| 318 |
+
"\n",
|
| 319 |
+
" prompt = df.at[i, \"prompt\"]\n",
|
| 320 |
+
"\n",
|
| 321 |
+
" # -------- MODEL QUERY WITH RETRIES --------\n",
|
| 322 |
+
" response = None\n",
|
| 323 |
+
" for attempt in range(3):\n",
|
| 324 |
+
" response = query_gemma_local(prompt)\n",
|
| 325 |
+
" \n",
|
| 326 |
+
" # Valid response?\n",
|
| 327 |
+
" if response and len(response.strip()) > 10:\n",
|
| 328 |
+
" break\n",
|
| 329 |
+
" \n",
|
| 330 |
+
" print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
|
| 331 |
+
" time.sleep(0.5)\n",
|
| 332 |
+
"\n",
|
| 333 |
+
" # If still invalid → DO NOT overwrite previous data\n",
|
| 334 |
+
" if not response or len(response.strip()) <= 10:\n",
|
| 335 |
+
" print(f\"❌ Row {i}: failed after retries, not writing, not advancing index\")\n",
|
| 336 |
+
" continue\n",
|
| 337 |
+
"\n",
|
| 338 |
+
" parsed = parse_response(response)\n",
|
| 339 |
+
"\n",
|
| 340 |
+
" # Additional safety: skip rows that parsed as all 'Unknown'\n",
|
| 341 |
+
" if all(v == \"Unknown\" for v in parsed.values()):\n",
|
| 342 |
+
" print(f\"❌ Row {i}: parsed as all Unknown (likely model crash); skipping.\")\n",
|
| 343 |
+
" continue\n",
|
| 344 |
+
"\n",
|
| 345 |
+
" # -------- WRITE PARSED FIELDS SAFELY --------\n",
|
| 346 |
+
" for key, value in parsed.items():\n",
|
| 347 |
+
" df.at[i, key] = value\n",
|
| 348 |
+
"\n",
|
| 349 |
+
" # Advance progress ONLY after successful write\n",
|
| 350 |
+
" current_index = i + 1\n",
|
| 351 |
+
"\n",
|
| 352 |
+
" # -------- GPU MEMORY CLEANUP --------\n",
|
| 353 |
+
" if torch.cuda.is_available():\n",
|
| 354 |
+
" torch.cuda.empty_cache()\n",
|
| 355 |
+
" torch.cuda.synchronize()\n",
|
| 356 |
+
"\n",
|
| 357 |
+
" # -------- SAVE LIKE YOUR DEEPSEEK VERSION --------\n",
|
| 358 |
+
" if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
|
| 359 |
+
" df.to_csv(output_file, index=False)\n",
|
| 360 |
+
" with open(index_file, \"w\") as f:\n",
|
| 361 |
+
" f.write(str(current_index))\n",
|
| 362 |
+
" print(f\"💾 Progress saved after row {i+1}\")\n",
|
| 363 |
+
"\n",
|
| 364 |
+
"# Final save\n",
|
| 365 |
+
"df.to_csv(output_file, index=False)\n",
|
| 366 |
+
"index_file.write_text(str(current_index))\n",
|
| 367 |
+
"print(\"✅ Finished full dataset.\")\n"
|
| 368 |
+
]
|
| 369 |
+
},
|
| 370 |
+
{
|
| 371 |
+
"cell_type": "code",
|
| 372 |
+
"execution_count": null,
|
| 373 |
+
"id": "7a1be7c0-ce54-4445-8534-bf3ab5e70197",
|
| 374 |
+
"metadata": {},
|
| 375 |
+
"outputs": [],
|
| 376 |
+
"source": []
|
| 377 |
+
}
|
| 378 |
+
],
|
| 379 |
+
"metadata": {
|
| 380 |
+
"kernelspec": {
|
| 381 |
+
"display_name": "pm-paper",
|
| 382 |
+
"language": "python",
|
| 383 |
+
"name": "pm-paper"
|
| 384 |
+
},
|
| 385 |
+
"language_info": {
|
| 386 |
+
"codemirror_mode": {
|
| 387 |
+
"name": "ipython",
|
| 388 |
+
"version": 3
|
| 389 |
+
},
|
| 390 |
+
"file_extension": ".py",
|
| 391 |
+
"mimetype": "text/x-python",
|
| 392 |
+
"name": "python",
|
| 393 |
+
"nbconvert_exporter": "python",
|
| 394 |
+
"pygments_lexer": "ipython3",
|
| 395 |
+
"version": "3.11.13"
|
| 396 |
+
}
|
| 397 |
+
},
|
| 398 |
+
"nbformat": 4,
|
| 399 |
+
"nbformat_minor": 5
|
| 400 |
+
}
|
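The `parse_response` function in the notebook above only matches lines that begin exactly with `1.` through `5.`. As an illustrative variant (not the authors' code), the sketch below uses a regular expression so that leading whitespace and a `1)` numbering style are also accepted. The name `parse_numbered_response` is hypothetical; the field names are taken from the notebook.

```python
import re

# Field names in the order the prompt requests them (from the notebook).
FIELD_ORDER = ["full_name", "aliases", "gender", "profession_llm", "country"]

# Optional leading whitespace, a digit 1-5, then "." or ")", then the value.
_NUMBERED_LINE = re.compile(r"^\s*([1-5])[.)]\s*(.*)$")


def parse_numbered_response(response):
    """Map a five-line numbered answer onto the annotation fields.

    Fields that are missing or empty keep the default value 'Unknown'.
    """
    fields = {name: "Unknown" for name in FIELD_ORDER}
    if not response:
        return fields
    for line in response.splitlines():
        match = _NUMBERED_LINE.match(line)
        if match is None:
            continue  # ignore preambles and any other unnumbered text
        value = match.group(2).strip()
        if value:
            fields[FIELD_ORDER[int(match.group(1)) - 1]] = value
    return fields
```

Because unnumbered lines are ignored, a short preamble from the model does not disturb the result. Whether the stricter or the looser matching is preferable depends on how consistently the model follows the requested format.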
jupyter_notebooks/.ipynb_checkpoints/MISTRAL-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,6 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [],
|
| 3 |
+
"metadata": {},
|
| 4 |
+
"nbformat": 4,
|
| 5 |
+
"nbformat_minor": 5
|
| 6 |
+
}
|
jupyter_notebooks/.ipynb_checkpoints/QWEN-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,645 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "0543d9c4-055b-49a3-a8d0-9dfb622b2b8c",
|
| 6 |
+
"metadata": {},
|
| 7 |
+
"source": [
|
| 8 |
+
"# QWEN 2.5-32B Local Inference\n",
|
| 9 |
+
"\n",
|
| 10 |
+
"## Hardware Requirements\n",
|
| 11 |
+
"- **GPU**: NVIDIA A100 (40GB or 80GB recommended)\n",
|
| 12 |
+
"- **VRAM Usage**: \n",
|
| 13 |
+
" - 8-bit quantization: ~32GB\n",
|
| 14 |
+
" - 4-bit quantization: ~16-20GB\n",
|
| 15 |
+
" - bfloat16 (no quantization): ~64GB\n",
|
| 16 |
+
"- **System RAM**: 32GB minimum, 64GB recommended\n",
|
| 17 |
+
"- **Storage**: ~65GB for model download\n",
|
| 18 |
+
"\n",
|
| 19 |
+
"## Configuration\n",
|
| 20 |
+
"This notebook uses **8-bit quantization** via `bitsandbytes` so that the model fits on a single A100 GPU:\n",
|
| 21 |
+
"- Reduces VRAM usage from ~64GB to ~32GB\n",
|
| 22 |
+
"- Minimal quality degradation\n",
|
| 23 |
+
"- Inference is typically somewhat slower than bfloat16, a trade-off for the lower memory footprint\n",
|
| 24 |
+
"\n",
|
| 25 |
+
"## Model Details\n",
|
| 26 |
+
"- **Model**: Qwen/Qwen2.5-32B-Instruct\n",
|
| 27 |
+
"- **Task**: Entity annotation and profession classification\n",
|
| 28 |
+
"- **Quantization**: LLM.int8() (8-bit)\n",
|
| 29 |
+
"- **Device**: CUDA (auto device mapping)\n",
|
| 30 |
+
"\n",
|
| 31 |
+
"## Dependencies\n",
|
| 32 |
+
"Make sure to install:\n",
|
| 33 |
+
"```bash\n",
|
| 34 |
+
"pip install 'transformers>=4.35.0' 'bitsandbytes>=0.41.0' accelerate torch pandas tqdm\n",
|
| 35 |
+
"```"
|
| 36 |
+
]
|
| 37 |
+
},
|
| 38 |
+
{
|
| 39 |
+
"cell_type": "code",
|
| 40 |
+
"execution_count": 1,
|
| 41 |
+
"id": "fe6ba282-896b-4272-b82b-ef24810732fb",
|
| 42 |
+
"metadata": {
|
| 43 |
+
"execution": {
|
| 44 |
+
"iopub.execute_input": "2025-11-29T20:14:34.671159Z",
|
| 45 |
+
"iopub.status.busy": "2025-11-29T20:14:34.671015Z",
|
| 46 |
+
"iopub.status.idle": "2025-11-29T20:26:40.952267Z",
|
| 47 |
+
"shell.execute_reply": "2025-11-29T20:26:40.951486Z",
|
| 48 |
+
"shell.execute_reply.started": "2025-11-29T20:14:34.671146Z"
|
| 49 |
+
}
|
| 50 |
+
},
|
| 51 |
+
"outputs": [
|
| 52 |
+
{
|
| 53 |
+
"name": "stderr",
|
| 54 |
+
"output_type": "stream",
|
| 55 |
+
"text": [
|
| 56 |
+
"/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
|
| 57 |
+
" from .autonotebook import tqdm as notebook_tqdm\n"
|
| 58 |
+
]
|
| 59 |
+
},
|
| 60 |
+
{
|
| 61 |
+
"name": "stdout",
|
| 62 |
+
"output_type": "stream",
|
| 63 |
+
"text": [
|
| 64 |
+
"Loading model: Qwen/Qwen2.5-32B-Instruct\n",
|
| 65 |
+
"Cache directory: /shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/data/models\n",
|
| 66 |
+
"This may take a while on first run (~65GB download)...\n",
|
| 67 |
+
"\n",
|
| 68 |
+
"Device: cuda\n",
|
| 69 |
+
"Loading tokenizer...\n",
|
| 70 |
+
"✅ Tokenizer loaded\n",
|
| 71 |
+
"Configuring 8-bit quantization...\n",
|
| 72 |
+
"Loading model with 8-bit quantization (this may take several minutes)...\n"
|
| 73 |
+
]
|
| 74 |
+
},
|
| 75 |
+
{
|
| 76 |
+
"name": "stderr",
|
| 77 |
+
"output_type": "stream",
|
| 78 |
+
"text": [
|
| 79 |
+
"Loading checkpoint shards: 100%|██████████| 17/17 [03:49<00:00, 13.50s/it]\n"
|
| 80 |
+
]
|
| 81 |
+
},
|
| 82 |
+
{
|
| 83 |
+
"name": "stdout",
|
| 84 |
+
"output_type": "stream",
|
| 85 |
+
"text": [
|
| 86 |
+
"✅ Model loaded with 8-bit quantization\n",
|
| 87 |
+
"VRAM used: 32.72 GB\n",
|
| 88 |
+
"\n",
|
| 89 |
+
"Loading raw input CSV...\n",
|
| 90 |
+
"✅ Loaded professions.csv\n",
|
| 91 |
+
"✅ Loaded profession mapping with 9 categories\n",
|
| 92 |
+
"Loaded 50861 rows\n",
|
| 93 |
+
"\n",
|
| 94 |
+
"Profession categories (9):\n",
|
| 95 |
+
" - actor\n",
|
| 96 |
+
" - adult performer\n",
|
| 97 |
+
" - singer/musician\n",
|
| 98 |
+
" - model\n",
|
| 99 |
+
" - online personality\n",
|
| 100 |
+
" - public figure\n",
|
| 101 |
+
" - voice actor/ASMR\n",
|
| 102 |
+
" - sports professional\n",
|
| 103 |
+
" - tv personality\n",
|
| 104 |
+
"\n",
|
| 105 |
+
"Creating prompts...\n",
|
| 106 |
+
"✅ Prompts created\n",
|
| 107 |
+
"Resuming from index 0\n"
|
| 108 |
+
]
|
| 109 |
+
},
|
| 110 |
+
{
|
| 111 |
+
"name": "stderr",
|
| 112 |
+
"output_type": "stream",
|
| 113 |
+
"text": [
|
| 114 |
+
"Qwen Local: 0%| | 0/50861 [00:00<?, ?it/s]The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.\n",
|
| 115 |
+
"Qwen Local: 0%| | 10/50861 [01:35<115:51:08, 8.20s/it]"
|
| 116 |
+
]
|
| 117 |
+
},
|
| 118 |
+
{
|
| 119 |
+
"name": "stdout",
|
| 120 |
+
"output_type": "stream",
|
| 121 |
+
"text": [
|
| 122 |
+
"💾 Progress saved after row 10\n"
|
| 123 |
+
]
|
| 124 |
+
},
|
| 125 |
+
{
|
| 126 |
+
"name": "stderr",
|
| 127 |
+
"output_type": "stream",
|
| 128 |
+
"text": [
|
| 129 |
+
"Qwen Local: 0%| | 20/50861 [03:01<143:54:04, 10.19s/it]"
|
| 130 |
+
]
|
| 131 |
+
},
|
| 132 |
+
{
|
| 133 |
+
"name": "stdout",
|
| 134 |
+
"output_type": "stream",
|
| 135 |
+
"text": [
|
| 136 |
+
"💾 Progress saved after row 20\n"
|
| 137 |
+
]
|
| 138 |
+
},
|
| 139 |
+
{
|
| 140 |
+
"name": "stderr",
|
| 141 |
+
"output_type": "stream",
|
| 142 |
+
"text": [
|
| 143 |
+
"Qwen Local: 0%| | 24/50861 [03:52<151:28:05, 10.73s/it]"
|
| 144 |
+
]
|
| 145 |
+
},
|
| 146 |
+
{
|
| 147 |
+
"name": "stdout",
|
| 148 |
+
"output_type": "stream",
|
| 149 |
+
"text": [
|
| 150 |
+
"❌ Row 23: parsed as all Unknown (likely model crash); skipping.\n"
|
| 151 |
+
]
|
| 152 |
+
},
|
| 153 |
+
{
|
| 154 |
+
"name": "stderr",
|
| 155 |
+
"output_type": "stream",
|
| 156 |
+
"text": [
|
| 157 |
+
"Qwen Local: 0%| | 30/50861 [04:36<116:38:03, 8.26s/it]"
|
| 158 |
+
]
|
| 159 |
+
},
|
| 160 |
+
{
|
| 161 |
+
"name": "stdout",
|
| 162 |
+
"output_type": "stream",
|
| 163 |
+
"text": [
|
| 164 |
+
"💾 Progress saved after row 30\n"
|
| 165 |
+
]
|
| 166 |
+
},
|
| 167 |
+
{
|
| 168 |
+
"name": "stderr",
|
| 169 |
+
"output_type": "stream",
|
| 170 |
+
"text": [
|
| 171 |
+
"Qwen Local: 0%| | 40/50861 [05:55<142:16:40, 10.08s/it]"
|
| 172 |
+
]
|
| 173 |
+
},
|
| 174 |
+
{
|
| 175 |
+
"name": "stdout",
|
| 176 |
+
"output_type": "stream",
|
| 177 |
+
"text": [
|
| 178 |
+
"💾 Progress saved after row 40\n"
|
| 179 |
+
]
|
| 180 |
+
},
|
| 181 |
+
{
|
| 182 |
+
"name": "stderr",
|
| 183 |
+
"output_type": "stream",
|
| 184 |
+
"text": [
|
| 185 |
+
"Qwen Local: 0%| | 45/50861 [06:45<127:20:20, 9.02s/it]\n"
|
| 186 |
+
]
|
| 187 |
+
},
|
| 188 |
+
{
|
| 189 |
+
"ename": "KeyboardInterrupt",
|
| 190 |
+
"evalue": "",
|
| 191 |
+
"output_type": "error",
|
| 192 |
+
"traceback": [
|
| 193 |
+
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
|
| 194 |
+
"\u001b[31mKeyboardInterrupt\u001b[39m Traceback (most recent call last)",
|
| 195 |
+
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[1]\u001b[39m\u001b[32m, line 330\u001b[39m\n\u001b[32m 328\u001b[39m response = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 329\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m attempt \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mrange\u001b[39m(\u001b[32m3\u001b[39m):\n\u001b[32m--> \u001b[39m\u001b[32m330\u001b[39m response = \u001b[43mquery_qwen_local\u001b[49m\u001b[43m(\u001b[49m\u001b[43mprompt\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 332\u001b[39m \u001b[38;5;66;03m# Valid response?\u001b[39;00m\n\u001b[32m 333\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m response \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mlen\u001b[39m(response.strip()) > \u001b[32m10\u001b[39m:\n",
|
| 196 |
+
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[1]\u001b[39m\u001b[32m, line 245\u001b[39m, in \u001b[36mquery_qwen_local\u001b[39m\u001b[34m(prompt)\u001b[39m\n\u001b[32m 243\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m timeout(\u001b[32m60\u001b[39m):\n\u001b[32m 244\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m torch.no_grad():\n\u001b[32m--> \u001b[39m\u001b[32m245\u001b[39m outputs = \u001b[43mmodel\u001b[49m\u001b[43m.\u001b[49m\u001b[43mgenerate\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 246\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43minputs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 247\u001b[39m \u001b[43m \u001b[49m\u001b[43mmax_new_tokens\u001b[49m\u001b[43m=\u001b[49m\u001b[32;43m100\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 248\u001b[39m \u001b[43m \u001b[49m\u001b[43mtemperature\u001b[49m\u001b[43m=\u001b[49m\u001b[32;43m0.1\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 249\u001b[39m \u001b[43m \u001b[49m\u001b[43mdo_sample\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mFalse\u001b[39;49;00m\u001b[43m,\u001b[49m\n\u001b[32m 250\u001b[39m \u001b[43m \u001b[49m\u001b[43mpad_token_id\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtokenizer\u001b[49m\u001b[43m.\u001b[49m\u001b[43meos_token_id\u001b[49m\n\u001b[32m 251\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 253\u001b[39m \u001b[38;5;66;03m# Decode\u001b[39;00m\n\u001b[32m 254\u001b[39m generated_ids = outputs[\u001b[32m0\u001b[39m][inputs[\u001b[33m'\u001b[39m\u001b[33minput_ids\u001b[39m\u001b[33m'\u001b[39m].shape[\u001b[32m1\u001b[39m]:]\n",
|
| 197 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py:120\u001b[39m, in \u001b[36mcontext_decorator.<locals>.decorate_context\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 117\u001b[39m \u001b[38;5;129m@functools\u001b[39m.wraps(func)\n\u001b[32m 118\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mdecorate_context\u001b[39m(*args, **kwargs):\n\u001b[32m 119\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m ctx_factory():\n\u001b[32m--> \u001b[39m\u001b[32m120\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 198 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/generation/utils.py:2564\u001b[39m, in \u001b[36mGenerationMixin.generate\u001b[39m\u001b[34m(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, use_model_defaults, custom_generate, **kwargs)\u001b[39m\n\u001b[32m 2561\u001b[39m model_kwargs[\u001b[33m\"\u001b[39m\u001b[33muse_cache\u001b[39m\u001b[33m\"\u001b[39m] = generation_config.use_cache\n\u001b[32m 2563\u001b[39m \u001b[38;5;66;03m# 9. Call generation mode\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m2564\u001b[39m result = \u001b[43mdecoding_method\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 2565\u001b[39m \u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\n\u001b[32m 2566\u001b[39m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 2567\u001b[39m \u001b[43m \u001b[49m\u001b[43mlogits_processor\u001b[49m\u001b[43m=\u001b[49m\u001b[43mprepared_logits_processor\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 2568\u001b[39m \u001b[43m \u001b[49m\u001b[43mstopping_criteria\u001b[49m\u001b[43m=\u001b[49m\u001b[43mprepared_stopping_criteria\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 2569\u001b[39m \u001b[43m \u001b[49m\u001b[43mgeneration_config\u001b[49m\u001b[43m=\u001b[49m\u001b[43mgeneration_config\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 2570\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mgeneration_mode_kwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 2571\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mmodel_kwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 2572\u001b[39m \u001b[43m\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 2574\u001b[39m \u001b[38;5;66;03m# Convert to legacy cache format if 
requested\u001b[39;00m\n\u001b[32m 2575\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m (\n\u001b[32m 2576\u001b[39m generation_config.return_legacy_cache \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mTrue\u001b[39;00m\n\u001b[32m 2577\u001b[39m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(result, \u001b[33m\"\u001b[39m\u001b[33mpast_key_values\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 2578\u001b[39m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mgetattr\u001b[39m(result.past_key_values, \u001b[33m\"\u001b[39m\u001b[33mto_legacy_cache\u001b[39m\u001b[33m\"\u001b[39m) \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 2579\u001b[39m ):\n",
|
| 199 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/generation/utils.py:2787\u001b[39m, in \u001b[36mGenerationMixin._sample\u001b[39m\u001b[34m(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)\u001b[39m\n\u001b[32m 2785\u001b[39m is_prefill = \u001b[38;5;28;01mFalse\u001b[39;00m\n\u001b[32m 2786\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m2787\u001b[39m outputs = \u001b[43mmodel_forward\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mmodel_inputs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mreturn_dict\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m)\u001b[49m\n\u001b[32m 2789\u001b[39m \u001b[38;5;66;03m# synced_gpus: don't waste resources running the code we don't need; kwargs must be updated before skipping\u001b[39;00m\n\u001b[32m 2790\u001b[39m model_kwargs = \u001b[38;5;28mself\u001b[39m._update_model_kwargs_for_generation(\n\u001b[32m 2791\u001b[39m outputs,\n\u001b[32m 2792\u001b[39m model_kwargs,\n\u001b[32m 2793\u001b[39m is_encoder_decoder=\u001b[38;5;28mself\u001b[39m.config.is_encoder_decoder,\n\u001b[32m 2794\u001b[39m )\n",
|
| 200 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1775\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1773\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1774\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1775\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 201 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1786\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1781\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1782\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1783\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1784\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1785\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1786\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1788\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1789\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
|
| 202 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/utils/generic.py:918\u001b[39m, in \u001b[36mcan_return_tuple.<locals>.wrapper\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 916\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m return_dict_passed \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 917\u001b[39m return_dict = return_dict_passed\n\u001b[32m--> \u001b[39m\u001b[32m918\u001b[39m output = \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 919\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m return_dict \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28misinstance\u001b[39m(output, \u001b[38;5;28mtuple\u001b[39m):\n\u001b[32m 920\u001b[39m output = output.to_tuple()\n",
|
| 203 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py:449\u001b[39m, in \u001b[36mQwen2ForCausalLM.forward\u001b[39m\u001b[34m(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, cache_position, logits_to_keep, **kwargs)\u001b[39m\n\u001b[32m 417\u001b[39m \u001b[38;5;129m@can_return_tuple\u001b[39m\n\u001b[32m 418\u001b[39m \u001b[38;5;129m@auto_docstring\u001b[39m\n\u001b[32m 419\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mforward\u001b[39m(\n\u001b[32m (...)\u001b[39m\u001b[32m 430\u001b[39m **kwargs: Unpack[TransformersKwargs],\n\u001b[32m 431\u001b[39m ) -> CausalLMOutputWithPast:\n\u001b[32m 432\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33mr\u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 433\u001b[39m \u001b[33;03m Example:\u001b[39;00m\n\u001b[32m 434\u001b[39m \n\u001b[32m (...)\u001b[39m\u001b[32m 447\u001b[39m \u001b[33;03m \"Hey, are you conscious? 
Can you talk to me?\\nI'm not conscious, but I can talk to you.\"\u001b[39;00m\n\u001b[32m 448\u001b[39m \u001b[33;03m ```\"\"\"\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m449\u001b[39m outputs: BaseModelOutputWithPast = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmodel\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 450\u001b[39m \u001b[43m \u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43minput_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 451\u001b[39m \u001b[43m \u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m=\u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 452\u001b[39m \u001b[43m \u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 453\u001b[39m \u001b[43m \u001b[49m\u001b[43mpast_key_values\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpast_key_values\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 454\u001b[39m \u001b[43m \u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m=\u001b[49m\u001b[43minputs_embeds\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 455\u001b[39m \u001b[43m \u001b[49m\u001b[43muse_cache\u001b[49m\u001b[43m=\u001b[49m\u001b[43muse_cache\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 456\u001b[39m \u001b[43m \u001b[49m\u001b[43mcache_position\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcache_position\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 457\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 458\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 460\u001b[39m hidden_states = outputs.last_hidden_state\n\u001b[32m 461\u001b[39m \u001b[38;5;66;03m# Only compute necessary logits, and do not upcast them to float if we are not computing the loss\u001b[39;00m\n",
|
| 204 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1775\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1773\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1774\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1775\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 205 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1786\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1781\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1782\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1783\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1784\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1785\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1786\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1788\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1789\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
|
| 206 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/utils/generic.py:1064\u001b[39m, in \u001b[36mcheck_model_inputs.<locals>.wrapper\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1061\u001b[39m monkey_patched_layers.append((module, original_forward))\n\u001b[32m 1063\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1064\u001b[39m outputs = \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1065\u001b[39m \u001b[38;5;28;01mexcept\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m \u001b[38;5;28;01mas\u001b[39;00m original_exception:\n\u001b[32m 1066\u001b[39m \u001b[38;5;66;03m# If we get a TypeError, it's possible that the model is not receiving the recordable kwargs correctly.\u001b[39;00m\n\u001b[32m 1067\u001b[39m \u001b[38;5;66;03m# Get a TypeError even after removing the recordable kwargs -> re-raise the original exception\u001b[39;00m\n\u001b[32m 1068\u001b[39m \u001b[38;5;66;03m# Otherwise -> we're probably missing `**kwargs` in the decorated function\u001b[39;00m\n\u001b[32m 1069\u001b[39m kwargs_without_recordable = {k: v \u001b[38;5;28;01mfor\u001b[39;00m k, v \u001b[38;5;129;01min\u001b[39;00m kwargs.items() \u001b[38;5;28;01mif\u001b[39;00m k \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;129;01min\u001b[39;00m recordable_keys}\n",
|
| 207 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py:384\u001b[39m, in \u001b[36mQwen2Model.forward\u001b[39m\u001b[34m(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, cache_position, **kwargs)\u001b[39m\n\u001b[32m 381\u001b[39m position_embeddings = \u001b[38;5;28mself\u001b[39m.rotary_emb(hidden_states, position_ids)\n\u001b[32m 383\u001b[39m \u001b[38;5;28;01mfor\u001b[39;00m decoder_layer \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mself\u001b[39m.layers[: \u001b[38;5;28mself\u001b[39m.config.num_hidden_layers]:\n\u001b[32m--> \u001b[39m\u001b[32m384\u001b[39m hidden_states = \u001b[43mdecoder_layer\u001b[49m\u001b[43m(\u001b[49m\n\u001b[32m 385\u001b[39m \u001b[43m \u001b[49m\u001b[43mhidden_states\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 386\u001b[39m \u001b[43m \u001b[49m\u001b[43mattention_mask\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcausal_mask_mapping\u001b[49m\u001b[43m[\u001b[49m\u001b[43mdecoder_layer\u001b[49m\u001b[43m.\u001b[49m\u001b[43mattention_type\u001b[49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 387\u001b[39m \u001b[43m \u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m=\u001b[49m\u001b[43mposition_ids\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 388\u001b[39m \u001b[43m \u001b[49m\u001b[43mpast_key_values\u001b[49m\u001b[43m=\u001b[49m\u001b[43mpast_key_values\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 389\u001b[39m \u001b[43m \u001b[49m\u001b[43muse_cache\u001b[49m\u001b[43m=\u001b[49m\u001b[43muse_cache\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 390\u001b[39m \u001b[43m \u001b[49m\u001b[43mcache_position\u001b[49m\u001b[43m=\u001b[49m\u001b[43mcache_position\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 391\u001b[39m \u001b[43m 
\u001b[49m\u001b[43mposition_embeddings\u001b[49m\u001b[43m=\u001b[49m\u001b[43mposition_embeddings\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 392\u001b[39m \u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[32m 393\u001b[39m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 395\u001b[39m hidden_states = \u001b[38;5;28mself\u001b[39m.norm(hidden_states)\n\u001b[32m 396\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m BaseModelOutputWithPast(\n\u001b[32m 397\u001b[39m last_hidden_state=hidden_states,\n\u001b[32m 398\u001b[39m past_key_values=past_key_values \u001b[38;5;28;01mif\u001b[39;00m use_cache \u001b[38;5;28;01melse\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m,\n\u001b[32m 399\u001b[39m )\n",
|
| 208 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/modeling_layers.py:94\u001b[39m, in \u001b[36mGradientCheckpointingLayer.__call__\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 91\u001b[39m logger.warning_once(message)\n\u001b[32m 93\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._gradient_checkpointing_func(partial(\u001b[38;5;28msuper\u001b[39m().\u001b[34m__call__\u001b[39m, **kwargs), *args)\n\u001b[32m---> \u001b[39m\u001b[32m94\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m.\u001b[49m\u001b[34;43m__call__\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 209 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1775\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1773\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1774\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1775\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 210 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1786\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1781\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1782\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1783\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1784\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1785\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1786\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1788\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1789\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
|
| 211 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py:172\u001b[39m, in \u001b[36mdeprecate_kwarg.<locals>.wrapper.<locals>.wrapped_func\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 168\u001b[39m \u001b[38;5;28;01melif\u001b[39;00m minimum_action \u001b[38;5;129;01min\u001b[39;00m (Action.NOTIFY, Action.NOTIFY_ALWAYS) \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m is_torchdynamo_compiling():\n\u001b[32m 169\u001b[39m \u001b[38;5;66;03m# DeprecationWarning is ignored by default, so we use FutureWarning instead\u001b[39;00m\n\u001b[32m 170\u001b[39m warnings.warn(message, \u001b[38;5;167;01mFutureWarning\u001b[39;00m, stacklevel=\u001b[32m2\u001b[39m)\n\u001b[32m--> \u001b[39m\u001b[32m172\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 212 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py:249\u001b[39m, in \u001b[36mQwen2DecoderLayer.forward\u001b[39m\u001b[34m(self, hidden_states, attention_mask, position_ids, past_key_values, use_cache, cache_position, position_embeddings, **kwargs)\u001b[39m\n\u001b[32m 247\u001b[39m residual = hidden_states\n\u001b[32m 248\u001b[39m hidden_states = \u001b[38;5;28mself\u001b[39m.post_attention_layernorm(hidden_states)\n\u001b[32m--> \u001b[39m\u001b[32m249\u001b[39m hidden_states = \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mmlp\u001b[49m\u001b[43m(\u001b[49m\u001b[43mhidden_states\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 250\u001b[39m hidden_states = residual + hidden_states\n\u001b[32m 251\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m hidden_states\n",
|
| 213 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1775\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1773\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1774\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1775\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 214 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1786\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1781\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1782\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1783\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1784\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1785\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1786\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1788\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1789\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
|
| 215 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py:46\u001b[39m, in \u001b[36mQwen2MLP.forward\u001b[39m\u001b[34m(self, x)\u001b[39m\n\u001b[32m 45\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mforward\u001b[39m(\u001b[38;5;28mself\u001b[39m, x):\n\u001b[32m---> \u001b[39m\u001b[32m46\u001b[39m down_proj = \u001b[38;5;28mself\u001b[39m.down_proj(\u001b[38;5;28mself\u001b[39m.act_fn(\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mgate_proj\u001b[49m\u001b[43m(\u001b[49m\u001b[43mx\u001b[49m\u001b[43m)\u001b[49m) * \u001b[38;5;28mself\u001b[39m.up_proj(x))\n\u001b[32m 47\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m down_proj\n",
|
| 216 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1775\u001b[39m, in \u001b[36mModule._wrapped_call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1773\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._compiled_call_impl(*args, **kwargs) \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 1774\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1775\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_call_impl\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 217 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1786\u001b[39m, in \u001b[36mModule._call_impl\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 1781\u001b[39m \u001b[38;5;66;03m# If we don't have any hooks, we want to skip the rest of the logic in\u001b[39;00m\n\u001b[32m 1782\u001b[39m \u001b[38;5;66;03m# this function, and just call forward.\u001b[39;00m\n\u001b[32m 1783\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m (\u001b[38;5;28mself\u001b[39m._backward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_hooks \u001b[38;5;129;01mor\u001b[39;00m \u001b[38;5;28mself\u001b[39m._forward_pre_hooks\n\u001b[32m 1784\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_backward_pre_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_backward_hooks\n\u001b[32m 1785\u001b[39m \u001b[38;5;129;01mor\u001b[39;00m _global_forward_hooks \u001b[38;5;129;01mor\u001b[39;00m _global_forward_pre_hooks):\n\u001b[32m-> \u001b[39m\u001b[32m1786\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mforward_call\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1788\u001b[39m result = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 1789\u001b[39m called_always_called_hooks = \u001b[38;5;28mset\u001b[39m()\n",
|
| 218 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/nn/modules.py:1071\u001b[39m, in \u001b[36mLinear8bitLt.forward\u001b[39m\u001b[34m(self, x)\u001b[39m\n\u001b[32m 1068\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m.bias \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mself\u001b[39m.bias.dtype != x.dtype:\n\u001b[32m 1069\u001b[39m \u001b[38;5;28mself\u001b[39m.bias.data = \u001b[38;5;28mself\u001b[39m.bias.data.to(x.dtype)\n\u001b[32m-> \u001b[39m\u001b[32m1071\u001b[39m out = \u001b[43mbnb\u001b[49m\u001b[43m.\u001b[49m\u001b[43mmatmul\u001b[49m\u001b[43m(\u001b[49m\u001b[43mx\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mweight\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mbias\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mbias\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mstate\u001b[49m\u001b[43m=\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43mstate\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1073\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mself\u001b[39m.state.has_fp16_weights \u001b[38;5;129;01mand\u001b[39;00m \u001b[38;5;28mself\u001b[39m.state.CB \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m 1074\u001b[39m \u001b[38;5;28mself\u001b[39m.weight.data = \u001b[38;5;28mself\u001b[39m.state.CB\n",
|
| 219 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py:424\u001b[39m, in \u001b[36mmatmul\u001b[39m\u001b[34m(A, B, out, state, threshold, bias)\u001b[39m\n\u001b[32m 422\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m A.device.type \u001b[38;5;129;01min\u001b[39;00m (\u001b[33m\"\u001b[39m\u001b[33mcpu\u001b[39m\u001b[33m\"\u001b[39m, \u001b[33m\"\u001b[39m\u001b[33mxpu\u001b[39m\u001b[33m\"\u001b[39m):\n\u001b[32m 423\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m MatMul8bitFp.apply(A, B, out, bias, state)\n\u001b[32m--> \u001b[39m\u001b[32m424\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mMatMul8bitLt\u001b[49m\u001b[43m.\u001b[49m\u001b[43mapply\u001b[49m\u001b[43m(\u001b[49m\u001b[43mA\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mB\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mout\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mbias\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mstate\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 220 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/autograd/function.py:581\u001b[39m, in \u001b[36mFunction.apply\u001b[39m\u001b[34m(cls, *args, **kwargs)\u001b[39m\n\u001b[32m 578\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m torch._C._are_functorch_transforms_active():\n\u001b[32m 579\u001b[39m \u001b[38;5;66;03m# See NOTE: [functorch vjp and autograd interaction]\u001b[39;00m\n\u001b[32m 580\u001b[39m args = _functorch.utils.unwrap_dead_wrappers(args)\n\u001b[32m--> \u001b[39m\u001b[32m581\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[43m.\u001b[49m\u001b[43mapply\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;66;03m# type: ignore[misc]\u001b[39;00m\n\u001b[32m 583\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m is_setup_ctx_defined:\n\u001b[32m 584\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\n\u001b[32m 585\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mIn order to use an autograd.Function with functorch transforms \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 586\u001b[39m \u001b[33m\"\u001b[39m\u001b[33m(vmap, grad, jvp, jacrev, ...), it must override the setup_context \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 587\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mstaticmethod. For more details, please see \u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 588\u001b[39m \u001b[33m\"\u001b[39m\u001b[33mhttps://pytorch.org/docs/main/notes/extending.func.html\u001b[39m\u001b[33m\"\u001b[39m\n\u001b[32m 589\u001b[39m )\n",
|
| 221 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py:192\u001b[39m, in \u001b[36mMatMul8bitLt.forward\u001b[39m\u001b[34m(ctx, A, B, out, bias, state)\u001b[39m\n\u001b[32m 189\u001b[39m CA, CAt, SCA, SCAt, outlier_cols = F.int8_double_quant(A.to(torch.float16), threshold=state.threshold)\n\u001b[32m 190\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m 191\u001b[39m \u001b[38;5;66;03m# Fast path\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m192\u001b[39m CA, SCA, outlier_cols = \u001b[43mF\u001b[49m\u001b[43m.\u001b[49m\u001b[43mint8_vectorwise_quant\u001b[49m\u001b[43m(\u001b[49m\u001b[43mA\u001b[49m\u001b[43m.\u001b[49m\u001b[43mto\u001b[49m\u001b[43m(\u001b[49m\u001b[43mtorch\u001b[49m\u001b[43m.\u001b[49m\u001b[43mfloat16\u001b[49m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mthreshold\u001b[49m\u001b[43m=\u001b[49m\u001b[43mstate\u001b[49m\u001b[43m.\u001b[49m\u001b[43mthreshold\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 193\u001b[39m CAt = SCAt = \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 195\u001b[39m has_grad = \u001b[38;5;28;01mFalse\u001b[39;00m\n",
|
| 222 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/functional.py:2058\u001b[39m, in \u001b[36mint8_vectorwise_quant\u001b[39m\u001b[34m(A, threshold)\u001b[39m\n\u001b[32m 2040\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mint8_vectorwise_quant\u001b[39m(A: torch.Tensor, threshold=\u001b[32m0.0\u001b[39m):\n\u001b[32m 2041\u001b[39m \u001b[38;5;250m \u001b[39m\u001b[33;03m\"\"\"Quantizes a tensor with dtype `torch.float16` to `torch.int8` in accordance to the `LLM.int8()` algorithm.\u001b[39;00m\n\u001b[32m 2042\u001b[39m \n\u001b[32m 2043\u001b[39m \u001b[33;03m For more information, see the [LLM.int8() paper](https://arxiv.org/abs/2208.07339).\u001b[39;00m\n\u001b[32m (...)\u001b[39m\u001b[32m 2056\u001b[39m \u001b[33;03m - `torch.Tensor` with dtype `torch.int32`, *optional*: A list of column indices which contain outlier features.\u001b[39;00m\n\u001b[32m 2057\u001b[39m \u001b[33;03m \"\"\"\u001b[39;00m\n\u001b[32m-> \u001b[39m\u001b[32m2058\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mtorch\u001b[49m\u001b[43m.\u001b[49m\u001b[43mops\u001b[49m\u001b[43m.\u001b[49m\u001b[43mbitsandbytes\u001b[49m\u001b[43m.\u001b[49m\u001b[43mint8_vectorwise_quant\u001b[49m\u001b[43m.\u001b[49m\u001b[43mdefault\u001b[49m\u001b[43m(\u001b[49m\u001b[43mA\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mthreshold\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 223 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/_ops.py:841\u001b[39m, in \u001b[36mOpOverload.__call__\u001b[39m\u001b[34m(self, *args, **kwargs)\u001b[39m\n\u001b[32m 840\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34m__call__\u001b[39m(\u001b[38;5;28mself\u001b[39m, /, *args: _P.args, **kwargs: _P.kwargs) -> _T:\n\u001b[32m--> \u001b[39m\u001b[32m841\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;43mself\u001b[39;49m\u001b[43m.\u001b[49m\u001b[43m_op\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 224 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/_compile.py:53\u001b[39m, in \u001b[36m_disable_dynamo.<locals>.inner\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 50\u001b[39m disable_fn = torch._dynamo.disable(fn, recursive, wrapping=\u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[32m 51\u001b[39m fn.__dynamo_disable = disable_fn \u001b[38;5;66;03m# type: ignore[attr-defined]\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m53\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mdisable_fn\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 225 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:1044\u001b[39m, in \u001b[36mDisableContext.__call__.<locals>._fn\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 1042\u001b[39m _maybe_set_eval_frame(_callback_from_stance(\u001b[38;5;28mself\u001b[39m.callback))\n\u001b[32m 1043\u001b[39m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[32m-> \u001b[39m\u001b[32m1044\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfn\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 1045\u001b[39m \u001b[38;5;28;01mfinally\u001b[39;00m:\n\u001b[32m 1046\u001b[39m set_eval_frame(\u001b[38;5;28;01mNone\u001b[39;00m)\n",
|
| 226 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/torch/library.py:732\u001b[39m, in \u001b[36m_impl.<locals>.register_.<locals>.func_no_dynamo\u001b[39m\u001b[34m(*args, **kwargs)\u001b[39m\n\u001b[32m 730\u001b[39m \u001b[38;5;129m@torch\u001b[39m._disable_dynamo\n\u001b[32m 731\u001b[39m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34mfunc_no_dynamo\u001b[39m(*args, **kwargs):\n\u001b[32m--> \u001b[39m\u001b[32m732\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[43m*\u001b[49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43m*\u001b[49m\u001b[43m*\u001b[49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 227 |
+
"\u001b[36mFile \u001b[39m\u001b[32m/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/backends/cuda/ops.py:148\u001b[39m, in \u001b[36m_\u001b[39m\u001b[34m(A, threshold)\u001b[39m\n\u001b[32m 145\u001b[39m outlier_cols = torch.argwhere(outliers.any(dim=\u001b[32m0\u001b[39m)).view(-\u001b[32m1\u001b[39m)\n\u001b[32m 146\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m 147\u001b[39m \u001b[38;5;66;03m# Needed for torch.compile support.\u001b[39;00m\n\u001b[32m--> \u001b[39m\u001b[32m148\u001b[39m outlier_cols = \u001b[43mtorch\u001b[49m\u001b[43m.\u001b[49m\u001b[43mempty\u001b[49m\u001b[43m(\u001b[49m\u001b[32;43m0\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdevice\u001b[49m\u001b[43m=\u001b[49m\u001b[43mA\u001b[49m\u001b[43m.\u001b[49m\u001b[43mdevice\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdtype\u001b[49m\u001b[43m=\u001b[49m\u001b[43mtorch\u001b[49m\u001b[43m.\u001b[49m\u001b[43mint64\u001b[49m\u001b[43m)\u001b[49m\n\u001b[32m 150\u001b[39m \u001b[38;5;28;01mwith\u001b[39;00m _cuda_device_of(A):\n\u001b[32m 151\u001b[39m lib.cint8_vector_quant(\n\u001b[32m 152\u001b[39m get_ptr(A),\n\u001b[32m 153\u001b[39m get_ptr(out_row),\n\u001b[32m (...)\u001b[39m\u001b[32m 158\u001b[39m _get_tensor_stream(A),\n\u001b[32m 159\u001b[39m )\n",
|
| 228 |
+
"\u001b[31mKeyboardInterrupt\u001b[39m: "
|
| 229 |
+
]
|
| 230 |
+
}
|
| 231 |
+
],
|
| 232 |
+
"source": [
|
| 233 |
+
"import pandas as pd\n",
|
| 234 |
+
"import json\n",
|
| 235 |
+
"import time\n",
|
| 236 |
+
"import re\n",
|
| 237 |
+
"from pathlib import Path\n",
|
| 238 |
+
"from tqdm import tqdm\n",
|
| 239 |
+
"import torch\n",
|
| 240 |
+
"from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
|
| 241 |
+
"import signal\n",
|
| 242 |
+
"from contextlib import contextmanager\n",
|
| 243 |
+
"\n",
|
| 244 |
+
"current_dir = Path.cwd()\n",
|
| 245 |
+
"input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
|
| 246 |
+
"professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
|
| 247 |
+
"professions_mapped_file = current_dir.parent / \"misc/lists/professions_mapped.csv\"\n",
|
| 248 |
+
"# === PROCESS DATA ===\n",
|
| 249 |
+
"\n",
|
| 250 |
+
"\n",
|
| 251 |
+
"# === CONFIGURATION ===\n",
|
| 252 |
+
"TEST_MODE = False\n",
|
| 253 |
+
"TEST_SIZE = 100\n",
|
| 254 |
+
"MAX_ROWS = 50862\n",
|
| 255 |
+
"SAVE_INTERVAL = 10\n",
|
| 256 |
+
"\n",
|
| 257 |
+
"\n",
|
| 258 |
+
"index_file = current_dir.parent / \"misc/query_indicies/qwen_local_query_index.txt\"\n",
|
| 259 |
+
"output_file = current_dir.parent / f\"data/CSV/qwen_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
|
| 260 |
+
"\n",
|
| 261 |
+
"# Model settings\n",
|
| 262 |
+
"MODEL_NAME = \"Qwen/Qwen2.5-32B-Instruct\"\n",
|
| 263 |
+
"#MODEL_NAME = \"Qwen/Qwen2.5-14B-Instruct\"\n",
|
| 264 |
+
"#MODEL_NAME = \"Qwen/Qwen3-235B-A22B-Instruct-2507-FP8\"\n",
|
| 265 |
+
"#MODEL_NAME = \"mistralai/Mistral-7B-Instruct-v0.3\"\n",
|
| 266 |
+
"CACHE_DIR = current_dir.parent / \"data/models\"\n",
|
| 267 |
+
"CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
|
| 268 |
+
"\n",
|
| 269 |
+
"# Define the SPECIFIC profession categories\n",
|
| 270 |
+
"PROFESSION_CATEGORIES = [\n",
|
| 271 |
+
" \"actor\",\n",
|
| 272 |
+
" \"adult performer\",\n",
|
| 273 |
+
" \"singer/musician\",\n",
|
| 274 |
+
" \"model\",\n",
|
| 275 |
+
" \"online personality\",\n",
|
| 276 |
+
" \"public figure\",\n",
|
| 277 |
+
" \"voice actor/ASMR\",\n",
|
| 278 |
+
" \"sports professional\",\n",
|
| 279 |
+
" \"tv personality\"\n",
|
| 280 |
+
"]\n",
|
| 281 |
+
"\n",
|
| 282 |
+
"# === LOAD MODEL ===\n",
|
| 283 |
+
"print(f\"Loading model: {MODEL_NAME}\")\n",
|
| 284 |
+
"print(f\"Cache directory: {CACHE_DIR}\")\n",
|
| 285 |
+
"print(f\"This may take a while on first run (~65GB download)...\\n\")\n",
|
| 286 |
+
"\n",
|
| 287 |
+
"# Check GPU availability\n",
|
| 288 |
+
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
|
| 289 |
+
"print(f\"Device: {device}\")\n",
|
| 290 |
+
"\n",
|
| 291 |
+
"if device == \"cpu\":\n",
|
| 292 |
+
" print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
|
| 293 |
+
" print(\" Consider using a GPU or reducing model size.\")\n",
|
| 294 |
+
"\n",
|
| 295 |
+
"# Load tokenizer\n",
|
| 296 |
+
"print(\"Loading tokenizer...\")\n",
|
| 297 |
+
"try:\n",
|
| 298 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 299 |
+
" MODEL_NAME,\n",
|
| 300 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 301 |
+
" use_fast=True\n",
|
| 302 |
+
" )\n",
|
| 303 |
+
"except Exception as e:\n",
|
| 304 |
+
" print(f\"Failed with use_fast=True ({e}); trying use_fast=False...\")\n",
|
| 305 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 306 |
+
" MODEL_NAME,\n",
|
| 307 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 308 |
+
" use_fast=False\n",
|
| 309 |
+
" )\n",
|
| 310 |
+
"\n",
|
| 311 |
+
"# Ensure pad token is set\n",
|
| 312 |
+
"if tokenizer.pad_token is None:\n",
|
| 313 |
+
" tokenizer.pad_token = tokenizer.eos_token\n",
|
| 314 |
+
"\n",
|
| 315 |
+
"print(\"✅ Tokenizer loaded\")\n",
|
| 316 |
+
"\n",
|
| 317 |
+
"# Configure 8-bit quantization for A100\n",
|
| 318 |
+
"print(\"Configuring 8-bit quantization...\")\n",
|
| 319 |
+
"quantization_config = BitsAndBytesConfig(\n",
|
| 320 |
+
" load_in_8bit=True,\n",
|
| 321 |
+
" llm_int8_threshold=6.0,\n",
|
| 322 |
+
" llm_int8_has_fp16_weight=False\n",
|
| 323 |
+
")\n",
|
| 324 |
+
"\n",
|
| 325 |
+
"# Load model with 8-bit quantization\n",
|
| 326 |
+
"print(\"Loading model with 8-bit quantization (this may take several minutes)...\")\n",
|
| 327 |
+
"model = AutoModelForCausalLM.from_pretrained(\n",
|
| 328 |
+
" MODEL_NAME,\n",
|
| 329 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 330 |
+
" quantization_config=quantization_config,\n",
|
| 331 |
+
" device_map=\"auto\",\n",
|
| 332 |
+
" trust_remote_code=False\n",
|
| 333 |
+
")\n",
|
| 334 |
+
"model.eval()\n",
|
| 335 |
+
"print(\"✅ Model loaded with 8-bit quantization\")\n",
|
| 336 |
+
"\n",
|
| 337 |
+
"# Check VRAM usage\n",
|
| 338 |
+
"if torch.cuda.is_available():\n",
|
| 339 |
+
" vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
|
| 340 |
+
" print(f\"Peak VRAM allocated: {vram_gb:.2f} GB\\n\")\n",
|
| 341 |
+
"\n",
|
| 342 |
+
"# === LOAD DATA ===\n",
|
| 343 |
+
"if output_file.exists():\n",
|
| 344 |
+
" print(\"Loading annotated CSV...\")\n",
|
| 345 |
+
" df = pd.read_csv(output_file)\n",
|
| 346 |
+
"else:\n",
|
| 347 |
+
" print(\"Loading raw input CSV...\")\n",
|
| 348 |
+
" df = pd.read_csv(input_file)\n",
|
| 349 |
+
"\n",
|
| 350 |
+
"\n",
|
| 351 |
+
"# Try to load profession mapping files\n",
|
| 352 |
+
"try:\n",
|
| 353 |
+
" professions_df = pd.read_csv(professions_file)\n",
|
| 354 |
+
" print(f\"✅ Loaded professions.csv\")\n",
|
| 355 |
+
"except FileNotFoundError:\n",
|
| 356 |
+
" print(\"⚠️ Warning: professions.csv not found\")\n",
|
| 357 |
+
"\n",
|
| 358 |
+
"try:\n",
|
| 359 |
+
" prof_mapped_df = pd.read_csv(professions_mapped_file)\n",
|
| 360 |
+
" print(f\"✅ Loaded profession mapping with {len(prof_mapped_df)} categories\")\n",
|
| 361 |
+
"except FileNotFoundError:\n",
|
| 362 |
+
" print(\"⚠️ Warning: professions_mapped.csv not found, using default categories\")\n",
|
| 363 |
+
"\n",
|
| 364 |
+
"profession_str = \", \".join(PROFESSION_CATEGORIES)\n",
|
| 365 |
+
"\n",
|
| 366 |
+
"print(f\"Loaded {len(df)} rows\")\n",
|
| 367 |
+
"print(f\"\\nProfession categories ({len(PROFESSION_CATEGORIES)}):\")\n",
|
| 368 |
+
"for cat in PROFESSION_CATEGORIES:\n",
|
| 369 |
+
" print(f\" - {cat}\")\n",
|
| 370 |
+
"\n",
|
| 371 |
+
"if TEST_MODE:\n",
|
| 372 |
+
" print(f\"\\nRunning in TEST MODE with {TEST_SIZE} samples\")\n",
|
| 373 |
+
" df = df.head(TEST_SIZE).copy()\n",
|
| 374 |
+
"elif MAX_ROWS:\n",
|
| 375 |
+
" df = df.head(MAX_ROWS).copy()\n",
|
| 376 |
+
"\n",
|
| 377 |
+
"# === CREATE PROMPTS (OPTIMIZED FOR CLEAN OUTPUTS) ===\n",
|
| 378 |
+
"def create_prompt(row):\n",
|
| 379 |
+
" \"\"\"Create prompt for Qwen annotation with strict formatting requirements.\"\"\"\n",
|
| 380 |
+
" name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
|
| 381 |
+
" \n",
|
| 382 |
+
" # Gather hints\n",
|
| 383 |
+
" hints = []\n",
|
| 384 |
+
" if pd.notna(row.get('likely_profession')):\n",
|
| 385 |
+
" hints.append(str(row['likely_profession']))\n",
|
| 386 |
+
" if pd.notna(row.get('likely_nationality')):\n",
|
| 387 |
+
" hints.append(str(row['likely_nationality']))\n",
|
| 388 |
+
" if pd.notna(row.get('likely_country')):\n",
|
| 389 |
+
" hints.append(str(row['likely_country']))\n",
|
| 390 |
+
" \n",
|
| 391 |
+
" # Add tags if we don't have enough hints\n",
|
| 392 |
+
" if len(hints) < 3:\n",
|
| 393 |
+
" for i in range(1, 8):\n",
|
| 394 |
+
" tag_col = f'tag_{i}'\n",
|
| 395 |
+
" if tag_col in row and pd.notna(row[tag_col]):\n",
|
| 396 |
+
" tag_val = str(row[tag_col])\n",
|
| 397 |
+
" if tag_val not in hints:\n",
|
| 398 |
+
" hints.append(tag_val)\n",
|
| 399 |
+
" if len(hints) >= 5:\n",
|
| 400 |
+
" break\n",
|
| 401 |
+
" \n",
|
| 402 |
+
" hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
|
| 403 |
+
" \n",
|
| 404 |
+
" return f\"\"\"Extract information about '{name}' ({hint_text}).\n",
|
| 405 |
+
"\n",
|
| 406 |
+
"Respond with EXACTLY 5 numbered lines. Each line must contain ONLY the value, no labels or extra text.\n",
|
| 407 |
+
"\n",
|
| 408 |
+
"FORMAT REQUIREMENTS:\n",
|
| 409 |
+
"1. Full legal name in Western order (first last). VALUE ONLY.\n",
|
| 410 |
+
"2. Stage names/aliases, comma-separated. If none, write \"None\". VALUE ONLY.\n",
|
| 411 |
+
"3. Gender: MUST be exactly one word: Male, Female, Other, or Unknown. VALUE ONLY.\n",
|
| 412 |
+
"4. Professions: Choose up to 3 from this list ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality. Comma-separated. VALUE ONLY.\n",
|
| 413 |
+
"5. Primary country: Country name only (e.g., \"China\", \"United States\", \"Colombia\"). VALUE ONLY.\n",
|
| 414 |
+
"\n",
|
| 415 |
+
"RULES:\n",
|
| 416 |
+
"- Professions MUST match the exact categories listed (actress = actor)\n",
|
| 417 |
+
"- \"online personality\" includes streamers, cosplayers, YouTubers, influencers\n",
|
| 418 |
+
"- \"public figure\" includes politicians, activists, journalists, authors\n",
|
| 419 |
+
"- Use \"Unknown\" when uncertain or for fictional characters\n",
|
| 420 |
+
"- NO explanatory text, NO labels like \"Gender:\", NO prefixes\n",
|
| 421 |
+
"- For multi-role people, list up to 3 categories by relevance\n",
|
| 422 |
+
"\n",
|
| 423 |
+
"EXAMPLE FORMAT:\n",
|
| 424 |
+
"1. Taylor Swift\n",
|
| 425 |
+
"2. None\n",
|
| 426 |
+
"3. Female\n",
|
| 427 |
+
"4. singer/musician, public figure\n",
|
| 428 |
+
"5. United States\"\"\"\n",
|
| 429 |
+
"\n",
|
| 430 |
+
"# Create prompts\n",
|
| 431 |
+
"print(\"\\nCreating prompts...\")\n",
|
| 432 |
+
"df['prompt'] = df.apply(create_prompt, axis=1)\n",
|
| 433 |
+
"print(\"✅ Prompts created\")\n",
|
| 434 |
+
"\n",
|
| 435 |
+
"@contextmanager\n",
|
| 436 |
+
"def timeout(duration):\n",
|
| 437 |
+
" def handler(signum, frame):\n",
|
| 438 |
+
" raise TimeoutError(\"Operation timed out\")\n",
|
| 439 |
+
" \n",
|
| 440 |
+
" # Set the signal handler and alarm\n",
|
| 441 |
+
" signal.signal(signal.SIGALRM, handler)\n",
|
| 442 |
+
" signal.alarm(duration)\n",
|
| 443 |
+
" try:\n",
|
| 444 |
+
" yield\n",
|
| 445 |
+
" finally:\n",
|
| 446 |
+
" signal.alarm(0) # Disable the alarm\n",
|
| 447 |
+
"\n",
|
| 448 |
+
"\n",
|
| 449 |
+
"def query_qwen_local(prompt: str) -> str | None:\n",
|
| 450 |
+
" \"\"\"Query Qwen locally via transformers; returns None on timeout or error.\"\"\"\n",
|
| 451 |
+
" try:\n",
|
| 452 |
+
" # Format as chat message for Qwen with strict system prompt\n",
|
| 453 |
+
" messages = [\n",
|
| 454 |
+
" {\"role\": \"system\", \"content\": \"You are a data extraction assistant. Respond with exactly 5 numbered lines containing ONLY values. No labels, no explanations, no prefixes. Follow the format precisely.\"},\n",
|
| 455 |
+
" {\"role\": \"user\", \"content\": prompt}\n",
|
| 456 |
+
" ]\n",
|
| 457 |
+
" \n",
|
| 458 |
+
" # Tokenize\n",
|
| 459 |
+
" if hasattr(tokenizer, 'apply_chat_template'):\n",
|
| 460 |
+
" text = tokenizer.apply_chat_template(\n",
|
| 461 |
+
" messages,\n",
|
| 462 |
+
" tokenize=False,\n",
|
| 463 |
+
" add_generation_prompt=True\n",
|
| 464 |
+
" )\n",
|
| 465 |
+
" else:\n",
|
| 466 |
+
" # Fallback for older tokenizers\n",
|
| 467 |
+
" text = f\"[INST] {prompt} [/INST]\"\n",
|
| 468 |
+
" \n",
|
| 469 |
+
" inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(model.device)\n",
|
| 470 |
+
" \n",
|
| 471 |
+
" # Generate with timeout\n",
|
| 472 |
+
" with timeout(60):\n",
|
| 473 |
+
" with torch.no_grad():\n",
|
| 474 |
+
" outputs = model.generate(\n",
|
| 475 |
+
" **inputs,\n",
|
| 476 |
+
" max_new_tokens=100,\n",
|
| 477 |
+
" do_sample=False, # greedy decoding; temperature only applies when sampling\n",
|
| 479 |
+
" pad_token_id=tokenizer.eos_token_id\n",
|
| 480 |
+
" )\n",
|
| 481 |
+
" \n",
|
| 482 |
+
" # Decode\n",
|
| 483 |
+
" generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
|
| 484 |
+
" response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
|
| 485 |
+
" \n",
|
| 486 |
+
" return response.strip()\n",
|
| 487 |
+
" \n",
|
| 488 |
+
" except TimeoutError:\n",
|
| 489 |
+
" print(f\"[ERROR] Generation timed out after 60 seconds\")\n",
|
| 490 |
+
" return None\n",
|
| 491 |
+
" except Exception as e:\n",
|
| 492 |
+
" print(f\"Generation error: {e}\")\n",
|
| 493 |
+
" import traceback\n",
|
| 494 |
+
" traceback.print_exc()\n",
|
| 495 |
+
" return None\n",
|
| 496 |
+
"\n",
|
| 497 |
+
" \n",
+    "# === PARSE RESPONSE WITH CLEANING ===\n",
+    "def parse_response(response):\n",
+    "    \"\"\"Parse Qwen response into structured fields with cleaning.\"\"\"\n",
+    "    if not response:\n",
+    "        return {\n",
+    "            'full_name': 'Unknown',\n",
+    "            'aliases': 'Unknown',\n",
+    "            'gender': 'Unknown',\n",
+    "            'profession_llm': 'Unknown',\n",
+    "            'country': 'Unknown'\n",
+    "        }\n",
+    "\n",
+    "    # Split into lines and clean\n",
+    "    lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
+    "\n",
+    "    # Initialize with Unknown values\n",
+    "    fields = {\n",
+    "        'full_name': 'Unknown',\n",
+    "        'aliases': 'Unknown',\n",
+    "        'gender': 'Unknown',\n",
+    "        'profession_llm': 'Unknown',\n",
+    "        'country': 'Unknown'\n",
+    "    }\n",
+    "\n",
+    "    # Extract information from each numbered line\n",
+    "    for line in lines:\n",
+    "        if line.startswith('1.'):\n",
+    "            fields['full_name'] = line[2:].strip()\n",
+    "        elif line.startswith('2.'):\n",
+    "            fields['aliases'] = line[2:].strip()\n",
+    "        elif line.startswith('3.'):\n",
+    "            # Clean gender field - remove any labels\n",
+    "            gender_raw = line[2:].strip()\n",
+    "            # Remove common prefixes\n",
+    "            gender_raw = re.sub(r'^(Gender:|gender:)\\s*', '', gender_raw, flags=re.IGNORECASE)\n",
+    "            # Extract just the gender word\n",
+    "            gender_match = re.search(r'\\b(Male|Female|Other|Unknown)\\b', gender_raw, re.IGNORECASE)\n",
+    "            fields['gender'] = gender_match.group(1).capitalize() if gender_match else gender_raw\n",
+    "        elif line.startswith('4.'):\n",
+    "            fields['profession_llm'] = line[2:].strip()\n",
+    "        elif line.startswith('5.'):\n",
+    "            # Clean country field - remove any labels\n",
+    "            country_raw = line[2:].strip()\n",
+    "            # Remove common prefixes like \"Primary country:\", \"Country:\", etc.\n",
+    "            country_raw = re.sub(r'^(Primary\\s+)?(associated\\s+)?country:\\s*', '', country_raw, flags=re.IGNORECASE)\n",
+    "            fields['country'] = country_raw\n",
+    "\n",
+    "    return fields\n",
+    "\n",
+    "# === PROCESS DATA ===\n",
+    "index_file.parent.mkdir(parents=True, exist_ok=True)\n",
+    "\n",
+    "# Load index\n",
+    "current_index = 0\n",
+    "if index_file.exists():\n",
+    "    try:\n",
+    "        current_index = int(index_file.read_text().strip())\n",
+    "    except (OSError, ValueError):\n",
+    "        current_index = 0\n",
+    "\n",
+    "print(f\"Resuming from index {current_index}\")\n",
+    "\n",
+    "start_time = time.time()\n",
+    "\n",
+    "for i in tqdm(range(current_index, len(df)), desc=\"Qwen Local\"):\n",
+    "\n",
+    "    prompt = df.at[i, \"prompt\"]\n",
+    "\n",
+    "    # -------- MODEL QUERY WITH RETRIES --------\n",
+    "    response = None\n",
+    "    for attempt in range(3):\n",
+    "        response = query_qwen_local(prompt)\n",
+    "\n",
+    "        # Valid response?\n",
+    "        if response and len(response.strip()) > 10:\n",
+    "            break\n",
+    "\n",
+    "        print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
+    "        time.sleep(0.5)\n",
+    "\n",
+    "    # If still invalid → DO NOT overwrite previous data\n",
+    "    if not response or len(response.strip()) <= 10:\n",
+    "        print(f\"❌ Row {i}: failed after retries, not writing, not advancing index\")\n",
+    "        continue\n",
+    "\n",
+    "    parsed = parse_response(response)\n",
+    "\n",
+    "    # Additional safety: skip rows that parsed as all 'Unknown'\n",
+    "    if all(v == \"Unknown\" for v in parsed.values()):\n",
+    "        print(f\"❌ Row {i}: parsed as all Unknown (likely model crash); skipping.\")\n",
+    "        continue\n",
+    "\n",
+    "    # -------- WRITE PARSED FIELDS SAFELY --------\n",
+    "    for key, value in parsed.items():\n",
+    "        df.at[i, key] = value\n",
+    "\n",
+    "    # Advance progress only after a successful write (note: this also moves past earlier failed rows)\n",
+    "    current_index = i + 1\n",
+    "\n",
+    "    # -------- GPU MEMORY CLEANUP --------\n",
+    "    if torch.cuda.is_available():\n",
+    "        torch.cuda.empty_cache()\n",
+    "        torch.cuda.synchronize()\n",
+    "\n",
+    "    # -------- SAVE LIKE YOUR DEEPSEEK VERSION --------\n",
+    "    if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
+    "        df.to_csv(output_file, index=False)\n",
+    "        with open(index_file, \"w\") as f:\n",
+    "            f.write(str(current_index))\n",
+    "        print(f\"💾 Progress saved after row {i+1}\")\n",
+    "\n",
+    "# Final save\n",
+    "df.to_csv(output_file, index=False)\n",
+    "index_file.write_text(str(current_index))\n",
+    "print(\"✅ Finished full dataset.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d9c7deb9-847a-472d-8055-f93dbfa6aa2e",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "pm-paper",
+   "language": "python",
+   "name": "pm-paper"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
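A note on the resume loop in the QWEN checkpoint notebook above: `current_index` is advanced only after a successful write, but a later success still moves it past any earlier row that hit `continue`, so failed rows are not retried on resume. One possible alternative is to record completion per row. The following is a minimal sketch under stated assumptions: `load_done`, `process_rows`, and the JSON file of finished indices are hypothetical names for illustration and are not part of the notebook.

```python
import json
from pathlib import Path


def load_done(path):
    """Return the set of row indices already processed (empty if unreadable)."""
    try:
        return set(json.loads(Path(path).read_text()))
    except (OSError, ValueError, TypeError):
        return set()


def process_rows(prompts, annotate, done_path):
    """Annotate every row not yet recorded in done_path.

    `annotate` is a stand-in for the model call: it takes a prompt and
    returns a dict of fields, or None on failure. Failed rows are not
    recorded, so the next run retries them.
    """
    done = load_done(done_path)
    results = {}
    for i, prompt in enumerate(prompts):
        if i in done:
            continue  # finished in an earlier run
        fields = annotate(prompt)
        if fields is None:
            continue  # stays pending for the next run
        results[i] = fields
        done.add(i)
        # Persist after each success so an interruption loses at most one row.
        Path(done_path).write_text(json.dumps(sorted(done)))
    return results
```

In the notebook's setting the DataFrame would be written to CSV alongside the index file; the sketch leaves that out and returns the new results instead.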
jupyter_notebooks/.ipynb_checkpoints/Section_2-2-2_Figure_4_Demographic_patterns_in_gen_images-checkpoint.ipynb
ADDED
@@ -0,0 +1,1795 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d8161f06-4fd2-436d-9e59-3f68b5a67f2c",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-02-06T18:30:16.974712Z",
+     "iopub.status.busy": "2025-02-06T18:30:16.974296Z",
+     "iopub.status.idle": "2025-02-06T18:30:16.976909Z",
+     "shell.execute_reply": "2025-02-06T18:30:16.976526Z",
+     "shell.execute_reply.started": "2025-02-06T18:30:16.974692Z"
+    }
+   },
+   "source": [
+    "# Section 6.2: Age and Gender Estimation using MiVOLO"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "64aaedec-ef56-4a62-b61e-12de2675a1ae",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-02-06T19:52:51.171282Z",
+     "iopub.status.busy": "2025-02-06T19:52:51.170711Z",
+     "iopub.status.idle": "2025-02-06T19:52:55.405039Z",
+     "shell.execute_reply": "2025-02-06T19:52:55.404308Z",
+     "shell.execute_reply.started": "2025-02-06T19:52:51.171245Z"
+    }
+   },
+   "source": [
+    ""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "4293f307-44fd-455e-90fe-6e6928be9af5",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-02-08T21:59:21.970807Z",
+     "iopub.status.busy": "2025-02-08T21:59:21.969931Z",
+     "iopub.status.idle": "2025-02-08T22:00:09.724295Z",
+     "shell.execute_reply": "2025-02-08T22:00:09.723583Z",
+     "shell.execute_reply.started": "2025-02-08T21:59:21.970784Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
+      "  from .autonotebook import tqdm as notebook_tqdm\n"
+     ]
+    }
+   ],
+   "source": [
+    "import csv\n",
+    "from pathlib import Path\n",
+    "import logging\n",
+    "import os\n",
+    "import pandas as pd\n",
+    "import requests\n",
+    "import numpy as np\n",
+    "import torch\n",
+    "import cv2\n",
+    "from io import BytesIO\n",
+    "from PIL import Image, UnidentifiedImageError\n",
+    "from datetime import datetime, timedelta\n",
+    "from dateutil.relativedelta import relativedelta\n",
+    "from mivolo.predictor import Predictor\n",
+    "import matplotlib.pyplot as plt\n",
+    "import matplotlib.patches as mpatches"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "63a54f5f-900c-48dd-8932-2632e56c5670",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-02-08T22:00:09.726069Z",
+     "iopub.status.busy": "2025-02-08T22:00:09.725699Z",
+     "iopub.status.idle": "2025-02-08T22:00:09.730626Z",
+     "shell.execute_reply": "2025-02-08T22:00:09.730099Z",
+     "shell.execute_reply.started": "2025-02-08T22:00:09.726050Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "current_dir = Path.cwd()\n",
+    "mini = current_dir.parent / 'data/CSV/image_subsets/Civiverse-mini.csv'\n",
+    "mivolo_in = current_dir.parent / 'data/CSV/image_subsets/Civiverse-mini-by-month/'\n",
+    "mivolo_in.mkdir(parents=True, exist_ok=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "1fdfb89a-6094-4382-8755-fae213221ea5",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-02-08T22:00:09.731540Z",
+     "iopub.status.busy": "2025-02-08T22:00:09.731359Z",
+     "iopub.status.idle": "2025-02-08T22:00:09.825738Z",
+     "shell.execute_reply": "2025-02-08T22:00:09.825258Z",
+     "shell.execute_reply.started": "2025-02-08T22:00:09.731524Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "def split_by_month(input_path, output_dir):\n",
+    "    # Load the dataset\n",
+    "    df = pd.read_csv(input_path)\n",
+    "\n",
+    "    # Convert the 'createdAt' column to datetime\n",
+    "    df['createdAt'] = pd.to_datetime(df['createdAt'], errors='coerce')\n",
+    "\n",
+    "    # Extract year and month\n",
+    "    df['year_month'] = df['createdAt'].dt.to_period('M')\n",
+    "\n",
+    "    # Group the data by year and month and save each group as a CSV file\n",
+    "    unique_months = df['year_month'].unique()\n",
+    "\n",
+    "    for month in unique_months:\n",
+    "        # Filter data for the specific month\n",
+    "        df_month = df[df['year_month'] == month]\n",
+    "\n",
+    "        # Define the file name based on the year and month\n",
+    "        file_name = f'{output_dir}/Civiverse-{month}.csv'\n",
+    "\n",
+    "        # Save the file\n",
+    "        df_month.to_csv(file_name, index=False)\n",
+    "\n",
+    "    print(f\"Data has been split and saved to {output_dir}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "2c909d7c-7d16-4dc7-8364-7f1c0784414c",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-02-08T22:00:09.827095Z",
+     "iopub.status.busy": "2025-02-08T22:00:09.826919Z",
+     "iopub.status.idle": "2025-02-08T22:00:10.479484Z",
+     "shell.execute_reply": "2025-02-08T22:00:10.478777Z",
+     "shell.execute_reply.started": "2025-02-08T22:00:09.827079Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/sctmp/lauwag/ipykernel_1497673/1825509207.py:9: UserWarning: Converting to PeriodArray/Index representation will drop timezone information.\n",
+      "  df['year_month'] = df['createdAt'].dt.to_period('M')\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Data has been split and saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month\n"
+     ]
+    }
+   ],
+   "source": [
+    "split_by_month(mini, mivolo_in)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "4c543306-ffc8-4b9c-a3df-b49b2271caa9",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-02-08T22:00:10.480505Z",
+     "iopub.status.busy": "2025-02-08T22:00:10.480310Z",
+     "iopub.status.idle": "2025-02-08T22:00:10.483961Z",
+     "shell.execute_reply": "2025-02-08T22:00:10.483400Z",
+     "shell.execute_reply.started": "2025-02-08T22:00:10.480486Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "mivolo_out = current_dir.parent / 'data/CSV/image_subsets/MiVOLO-results/'\n",
+    "mivolo_out.mkdir(parents=True, exist_ok=True)  # Create the output directory if it doesn't exist"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ffb7dd23",
+   "metadata": {},
+   "source": [
+    "## MiVOLO gender and age inference"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "304ed12f-c7b6-4129-b24d-7ccc793a62c7",
+   "metadata": {
+    "execution": {
+     "iopub.execute_input": "2025-02-08T22:00:10.484802Z",
+     "iopub.status.busy": "2025-02-08T22:00:10.484639Z"
+    }
+   },
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/ultralytics/nn/tasks.py:634: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
+      "  return torch.load(file, map_location=\"cpu\"), file # load\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Model summary (fused): 268 layers, 68125494 parameters, 0 gradients, 257.4 GFLOPs\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[W208 23:00:15.738708520 NNPACK.cpp:61] Could not initialize NNPACK! Reason: Unsupported hardware.\n",
+      "/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/mivolo/model/mi_volo.py:33: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
+      "  state = torch.load(ckpt_path, map_location=\"cpu\")\n",
+      "INFO:MiVOLO:Model meta:\n",
+      "min_age: 1, max_age: 95, avg_age: 48.0, num_classes: 3, in_chans: 6, with_persons_model: True, disable_faces: False, use_persons: True, only_age: False, num_classes_gender: 2, input_size: 224, use_person_crops: True, use_face_crops: True\n",
+      "/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/timm/models/_helpers.py:39: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
+      "  checkpoint = torch.load(checkpoint_path, map_location='cpu')\n",
+      "INFO:timm.models._helpers:Loaded state_dict from checkpoint '/shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/ext/MiVOLO/models/mivolo_imbd.pth.tar'\n",
+      "INFO:MiVOLO:Model mivolo_d1_224 created, param count: 27432414\n",
+      "INFO:timm.data.config:Data processing configuration for current model + dataset:\n",
+      "INFO:timm.data.config:\tinput_size: (3, 224, 224)\n",
+      "INFO:timm.data.config:\tinterpolation: bicubic\n",
+      "INFO:timm.data.config:\tmean: (0.485, 0.456, 0.406)\n",
+      "INFO:timm.data.config:\tstd: (0.229, 0.224, 0.225)\n",
+      "INFO:timm.data.config:\tcrop_pct: 0.96\n",
+      "INFO:timm.data.config:\tcrop_mode: center\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2022-11.csv\n",
+      "\n",
+      "0: 640x640 (no detections), 723.9ms\n",
+      "Speed: 12.1ms preprocess, 723.9ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 640)\n",
+      "Processed and saved 1 images so far.\n",
+      "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2022-11.csv\n",
+      "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2022-12.csv\n",
+      "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-01.csv\n",
+      "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-02.csv\n",
+      "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-03.csv\n",
+      "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-04.csv\n",
+      "File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-05.csv\n",
+      "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-06.csv\n",
+      "\n",
+      "0: 416x640 1 person, 455.1ms\n",
+      "Speed: 3.5ms preprocess, 455.1ms inference, 33.5ms postprocess per image at shape (1, 3, 416, 640)\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
+      "INFO:MiVOLO:\tage: 32.89\n",
+      "INFO:MiVOLO:\tgender: male [99%]\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Processed and saved 1 images so far.\n",
+      "Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-06.csv\n",
+      "Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-07.csv\n",
|
| 286 |
+
"\n",
|
| 287 |
+
"0: 640x320 1 person, 395.7ms\n",
|
| 288 |
+
"Speed: 2.9ms preprocess, 395.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 320)\n"
|
| 289 |
+
]
|
| 290 |
+
},
|
| 291 |
+
{
|
| 292 |
+
"name": "stderr",
|
| 293 |
+
"output_type": "stream",
|
| 294 |
+
"text": [
|
| 295 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 296 |
+
"INFO:MiVOLO:\tage: 33.49\n",
|
| 297 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 298 |
+
]
|
| 299 |
+
},
|
| 300 |
+
{
|
| 301 |
+
"name": "stdout",
|
| 302 |
+
"output_type": "stream",
|
| 303 |
+
"text": [
|
| 304 |
+
"Processed and saved 1 images so far.\n",
|
| 305 |
+
"\n",
|
| 306 |
+
"0: 640x448 1 person, 1 face, 478.5ms\n",
|
| 307 |
+
"Speed: 1.9ms preprocess, 478.5ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 308 |
+
]
|
| 309 |
+
},
|
| 310 |
+
{
|
| 311 |
+
"name": "stderr",
|
| 312 |
+
"output_type": "stream",
|
| 313 |
+
"text": [
|
| 314 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 315 |
+
"INFO:MiVOLO:\tage: 17.81\n",
|
| 316 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 317 |
+
]
|
| 318 |
+
},
|
| 319 |
+
{
|
| 320 |
+
"name": "stdout",
|
| 321 |
+
"output_type": "stream",
|
| 322 |
+
"text": [
|
| 323 |
+
"Processed and saved 2 images so far.\n",
|
| 324 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-07.csv\n",
|
| 325 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-08.csv\n",
|
| 326 |
+
"\n",
|
| 327 |
+
"0: 640x448 1 person, 478.0ms\n",
|
| 328 |
+
"Speed: 2.9ms preprocess, 478.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 329 |
+
]
|
| 330 |
+
},
|
| 331 |
+
{
|
| 332 |
+
"name": "stderr",
|
| 333 |
+
"output_type": "stream",
|
| 334 |
+
"text": [
|
| 335 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 336 |
+
"INFO:MiVOLO:\tage: 40.62\n",
|
| 337 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 338 |
+
]
|
| 339 |
+
},
|
| 340 |
+
{
|
| 341 |
+
"name": "stdout",
|
| 342 |
+
"output_type": "stream",
|
| 343 |
+
"text": [
|
| 344 |
+
"Processed and saved 1 images so far.\n",
|
| 345 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-08.csv\n",
|
| 346 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-09.csv\n",
|
| 347 |
+
"\n",
|
| 348 |
+
"0: 416x640 (no detections), 567.1ms\n",
|
| 349 |
+
"Speed: 2.4ms preprocess, 567.1ms inference, 0.4ms postprocess per image at shape (1, 3, 416, 640)\n",
|
| 350 |
+
"Processed and saved 1 images so far.\n",
|
| 351 |
+
"\n",
|
| 352 |
+
"0: 320x640 (no detections), 393.6ms\n",
|
| 353 |
+
"Speed: 1.7ms preprocess, 393.6ms inference, 0.4ms postprocess per image at shape (1, 3, 320, 640)\n",
|
| 354 |
+
"\n",
|
| 355 |
+
"0: 640x640 (no detections), 711.9ms\n",
|
| 356 |
+
"Speed: 3.4ms preprocess, 711.9ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 357 |
+
"\n",
|
| 358 |
+
"0: 640x640 (no detections), 699.8ms\n",
|
| 359 |
+
"Speed: 2.3ms preprocess, 699.8ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 360 |
+
"\n",
|
| 361 |
+
"0: 640x576 1 person, 629.6ms\n",
|
| 362 |
+
"Speed: 2.4ms preprocess, 629.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 576)\n"
|
| 363 |
+
]
|
| 364 |
+
},
|
| 365 |
+
{
|
| 366 |
+
"name": "stderr",
|
| 367 |
+
"output_type": "stream",
|
| 368 |
+
"text": [
|
| 369 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 370 |
+
"INFO:MiVOLO:\tage: 28.65\n",
|
| 371 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 372 |
+
]
|
| 373 |
+
},
|
| 374 |
+
{
|
| 375 |
+
"name": "stdout",
|
| 376 |
+
"output_type": "stream",
|
| 377 |
+
"text": [
|
| 378 |
+
"\n",
|
| 379 |
+
"0: 640x448 1 person, 1 face, 598.3ms\n",
|
| 380 |
+
"Speed: 2.1ms preprocess, 598.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 381 |
+
]
|
| 382 |
+
},
|
| 383 |
+
{
|
| 384 |
+
"name": "stderr",
|
| 385 |
+
"output_type": "stream",
|
| 386 |
+
"text": [
|
| 387 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 388 |
+
"INFO:MiVOLO:\tage: 25.85\n",
|
| 389 |
+
"INFO:MiVOLO:\tgender: female [99%]\n",
|
| 390 |
+
"ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/98848c97-3d1e-4b52-9967-aeeca354a30e/width=656/98848c97-3d1e-4b52-9967-aeeca354a30e.mp4: cannot identify image file <_io.BytesIO object at 0x14cb00133740>\n"
|
| 391 |
+
]
|
| 392 |
+
},
|
| 393 |
+
{
|
| 394 |
+
"name": "stdout",
|
| 395 |
+
"output_type": "stream",
|
| 396 |
+
"text": [
|
| 397 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-09.csv\n",
|
| 398 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-10.csv\n"
|
| 399 |
+
]
|
| 400 |
+
},
|
| 401 |
+
{
|
| 402 |
+
"name": "stderr",
|
| 403 |
+
"output_type": "stream",
|
| 404 |
+
"text": [
|
| 405 |
+
"ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/e6469288-b487-4a06-99c1-59e7ac22fa77/width=1024/e6469288-b487-4a06-99c1-59e7ac22fa77.mp4: cannot identify image file <_io.BytesIO object at 0x14cb00ecbdd0>\n"
|
| 406 |
+
]
|
| 407 |
+
},
|
| 408 |
+
{
|
| 409 |
+
"name": "stdout",
|
| 410 |
+
"output_type": "stream",
|
| 411 |
+
"text": [
|
| 412 |
+
"\n",
|
| 413 |
+
"0: 448x640 (no detections), 536.6ms\n",
|
| 414 |
+
"Speed: 10.1ms preprocess, 536.6ms inference, 0.4ms postprocess per image at shape (1, 3, 448, 640)\n",
|
| 415 |
+
"Processed and saved 2 images so far.\n",
|
| 416 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-10.csv\n",
|
| 417 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-11.csv\n",
|
| 418 |
+
"\n",
|
| 419 |
+
"0: 640x448 1 person, 1 face, 662.9ms\n",
|
| 420 |
+
"Speed: 2.6ms preprocess, 662.9ms inference, 0.9ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 421 |
+
]
|
| 422 |
+
},
|
| 423 |
+
{
|
| 424 |
+
"name": "stderr",
|
| 425 |
+
"output_type": "stream",
|
| 426 |
+
"text": [
|
| 427 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 428 |
+
"INFO:MiVOLO:\tage: 17.0\n",
|
| 429 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 430 |
+
]
|
| 431 |
+
},
|
| 432 |
+
{
|
| 433 |
+
"name": "stdout",
|
| 434 |
+
"output_type": "stream",
|
| 435 |
+
"text": [
|
| 436 |
+
"Processed and saved 1 images so far.\n",
|
| 437 |
+
"\n",
|
| 438 |
+
"0: 640x384 1 person, 895.9ms\n",
|
| 439 |
+
"Speed: 2.0ms preprocess, 895.9ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 384)\n"
|
| 440 |
+
]
|
| 441 |
+
},
|
| 442 |
+
{
|
| 443 |
+
"name": "stderr",
|
| 444 |
+
"output_type": "stream",
|
| 445 |
+
"text": [
|
| 446 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 447 |
+
"INFO:MiVOLO:\tage: 43.33\n",
|
| 448 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 449 |
+
]
|
| 450 |
+
},
|
| 451 |
+
{
|
| 452 |
+
"name": "stdout",
|
| 453 |
+
"output_type": "stream",
|
| 454 |
+
"text": [
|
| 455 |
+
"\n",
|
| 456 |
+
"0: 640x448 (no detections), 529.4ms\n",
|
| 457 |
+
"Speed: 2.6ms preprocess, 529.4ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 458 |
+
"\n",
|
| 459 |
+
"0: 640x448 1 person, 539.3ms\n",
|
| 460 |
+
"Speed: 2.8ms preprocess, 539.3ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 461 |
+
]
|
| 462 |
+
},
|
| 463 |
+
{
|
| 464 |
+
"name": "stderr",
|
| 465 |
+
"output_type": "stream",
|
| 466 |
+
"text": [
|
| 467 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 468 |
+
"INFO:MiVOLO:\tage: 39.15\n",
|
| 469 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 470 |
+
]
|
| 471 |
+
},
|
| 472 |
+
{
|
| 473 |
+
"name": "stdout",
|
| 474 |
+
"output_type": "stream",
|
| 475 |
+
"text": [
|
| 476 |
+
"\n",
|
| 477 |
+
"0: 640x448 1 person, 1 face, 708.6ms\n",
|
| 478 |
+
"Speed: 2.5ms preprocess, 708.6ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 479 |
+
]
|
| 480 |
+
},
|
| 481 |
+
{
|
| 482 |
+
"name": "stderr",
|
| 483 |
+
"output_type": "stream",
|
| 484 |
+
"text": [
|
| 485 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 486 |
+
"INFO:MiVOLO:\tage: 29.64\n",
|
| 487 |
+
"INFO:MiVOLO:\tgender: female [99%]\n",
|
| 488 |
+
"ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/ad4b5481-0a58-4196-a3e8-fca2fe22a3cc/width=1080/ad4b5481-0a58-4196-a3e8-fca2fe22a3cc.mp4: cannot identify image file <_io.BytesIO object at 0x14cb010c24d0>\n"
|
| 489 |
+
]
|
| 490 |
+
},
|
| 491 |
+
{
|
| 492 |
+
"name": "stdout",
|
| 493 |
+
"output_type": "stream",
|
| 494 |
+
"text": [
|
| 495 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-11.csv\n",
|
| 496 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-12.csv\n",
|
| 497 |
+
"\n",
|
| 498 |
+
"0: 640x384 1 person, 1 face, 461.0ms\n",
|
| 499 |
+
"Speed: 2.4ms preprocess, 461.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 384)\n"
|
| 500 |
+
]
|
| 501 |
+
},
|
| 502 |
+
{
|
| 503 |
+
"name": "stderr",
|
| 504 |
+
"output_type": "stream",
|
| 505 |
+
"text": [
|
| 506 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 507 |
+
"INFO:MiVOLO:\tage: 19.61\n",
|
| 508 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 509 |
+
]
|
| 510 |
+
},
|
| 511 |
+
{
|
| 512 |
+
"name": "stdout",
|
| 513 |
+
"output_type": "stream",
|
| 514 |
+
"text": [
|
| 515 |
+
"Processed and saved 1 images so far.\n",
|
| 516 |
+
"\n",
|
| 517 |
+
"0: 640x448 1 person, 1 face, 501.3ms\n",
|
| 518 |
+
"Speed: 3.1ms preprocess, 501.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 519 |
+
]
|
| 520 |
+
},
|
| 521 |
+
{
|
| 522 |
+
"name": "stderr",
|
| 523 |
+
"output_type": "stream",
|
| 524 |
+
"text": [
|
| 525 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 526 |
+
"INFO:MiVOLO:\tage: 22.58\n",
|
| 527 |
+
"INFO:MiVOLO:\tgender: female [99%]\n",
|
| 528 |
+
"ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3004b5fa-af81-4de7-829d-1d809d70b878/width=512/3004b5fa-af81-4de7-829d-1d809d70b878.mp4: cannot identify image file <_io.BytesIO object at 0x14cb0020ad40>\n"
|
| 529 |
+
]
|
| 530 |
+
},
|
| 531 |
+
{
|
| 532 |
+
"name": "stdout",
|
| 533 |
+
"output_type": "stream",
|
| 534 |
+
"text": [
|
| 535 |
+
"\n",
|
| 536 |
+
"0: 640x640 (no detections), 842.5ms\n",
|
| 537 |
+
"Speed: 4.5ms preprocess, 842.5ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 538 |
+
"\n",
|
| 539 |
+
"0: 640x416 (no detections), 446.8ms\n",
|
| 540 |
+
"Speed: 2.5ms preprocess, 446.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 416)\n",
|
| 541 |
+
"Processed and saved 5 images so far.\n",
|
| 542 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-12.csv\n",
|
| 543 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-01.csv\n",
|
| 544 |
+
"\n",
|
| 545 |
+
"0: 640x448 (no detections), 638.5ms\n",
|
| 546 |
+
"Speed: 2.3ms preprocess, 638.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 547 |
+
"Processed and saved 1 images so far.\n",
|
| 548 |
+
"\n",
|
| 549 |
+
"0: 640x416 (no detections), 441.7ms\n",
|
| 550 |
+
"Speed: 2.5ms preprocess, 441.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 416)\n",
|
| 551 |
+
"\n",
|
| 552 |
+
"0: 640x448 (no detections), 470.3ms\n",
|
| 553 |
+
"Speed: 2.3ms preprocess, 470.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 554 |
+
"\n",
|
| 555 |
+
"0: 640x448 (no detections), 693.9ms\n",
|
| 556 |
+
"Speed: 2.5ms preprocess, 693.9ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 557 |
+
"\n",
|
| 558 |
+
"0: 640x512 1 person, 1 face, 808.6ms\n",
|
| 559 |
+
"Speed: 3.2ms preprocess, 808.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
|
| 560 |
+
]
|
| 561 |
+
},
|
| 562 |
+
{
|
| 563 |
+
"name": "stderr",
|
| 564 |
+
"output_type": "stream",
|
| 565 |
+
"text": [
|
| 566 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 567 |
+
"INFO:MiVOLO:\tage: 15.01\n",
|
| 568 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 569 |
+
]
|
| 570 |
+
},
|
| 571 |
+
{
|
| 572 |
+
"name": "stdout",
|
| 573 |
+
"output_type": "stream",
|
| 574 |
+
"text": [
|
| 575 |
+
"\n",
|
| 576 |
+
"0: 640x320 1 person, 1 face, 345.6ms\n",
|
| 577 |
+
"Speed: 2.0ms preprocess, 345.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 320)\n"
|
| 578 |
+
]
|
| 579 |
+
},
|
| 580 |
+
{
|
| 581 |
+
"name": "stderr",
|
| 582 |
+
"output_type": "stream",
|
| 583 |
+
"text": [
|
| 584 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 585 |
+
"INFO:MiVOLO:\tage: 20.86\n",
|
| 586 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 587 |
+
]
|
| 588 |
+
},
|
| 589 |
+
{
|
| 590 |
+
"name": "stdout",
|
| 591 |
+
"output_type": "stream",
|
| 592 |
+
"text": [
|
| 593 |
+
"Processed and saved 6 images so far.\n",
|
| 594 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-01.csv\n",
|
| 595 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-02.csv\n",
|
| 596 |
+
"\n",
|
| 597 |
+
"0: 640x384 1 person, 1 face, 387.8ms\n",
|
| 598 |
+
"Speed: 1.9ms preprocess, 387.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 384)\n"
|
| 599 |
+
]
|
| 600 |
+
},
|
| 601 |
+
{
|
| 602 |
+
"name": "stderr",
|
| 603 |
+
"output_type": "stream",
|
| 604 |
+
"text": [
|
| 605 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 606 |
+
"INFO:MiVOLO:\tage: 17.31\n",
|
| 607 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 608 |
+
]
|
| 609 |
+
},
|
| 610 |
+
{
|
| 611 |
+
"name": "stdout",
|
| 612 |
+
"output_type": "stream",
|
| 613 |
+
"text": [
|
| 614 |
+
"Processed and saved 1 images so far.\n",
|
| 615 |
+
"\n",
|
| 616 |
+
"0: 640x480 1 person, 1 face, 540.4ms\n",
|
| 617 |
+
"Speed: 2.5ms preprocess, 540.4ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 480)\n"
|
| 618 |
+
]
|
| 619 |
+
},
|
| 620 |
+
{
|
| 621 |
+
"name": "stderr",
|
| 622 |
+
"output_type": "stream",
|
| 623 |
+
"text": [
|
| 624 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 625 |
+
"INFO:MiVOLO:\tage: 17.47\n",
|
| 626 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 627 |
+
]
|
| 628 |
+
},
|
| 629 |
+
{
|
| 630 |
+
"name": "stdout",
|
| 631 |
+
"output_type": "stream",
|
| 632 |
+
"text": [
|
| 633 |
+
"\n",
|
| 634 |
+
"0: 640x640 1 person, 1 face, 713.1ms\n",
|
| 635 |
+
"Speed: 3.8ms preprocess, 713.1ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 640)\n"
|
| 636 |
+
]
|
| 637 |
+
},
|
| 638 |
+
{
|
| 639 |
+
"name": "stderr",
|
| 640 |
+
"output_type": "stream",
|
| 641 |
+
"text": [
|
| 642 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 643 |
+
"INFO:MiVOLO:\tage: 17.85\n",
|
| 644 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 645 |
+
]
|
| 646 |
+
},
|
| 647 |
+
{
|
| 648 |
+
"name": "stdout",
|
| 649 |
+
"output_type": "stream",
|
| 650 |
+
"text": [
|
| 651 |
+
"\n",
|
| 652 |
+
"0: 640x640 (no detections), 778.8ms\n",
|
| 653 |
+
"Speed: 28.7ms preprocess, 778.8ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 654 |
+
"\n",
|
| 655 |
+
"0: 640x448 1 person, 1 face, 528.2ms\n",
|
| 656 |
+
"Speed: 2.3ms preprocess, 528.2ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 657 |
+
]
|
| 658 |
+
},
|
| 659 |
+
{
|
| 660 |
+
"name": "stderr",
|
| 661 |
+
"output_type": "stream",
|
| 662 |
+
"text": [
|
| 663 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 664 |
+
"INFO:MiVOLO:\tage: 21.63\n",
|
| 665 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 666 |
+
]
|
| 667 |
+
},
|
| 668 |
+
{
|
| 669 |
+
"name": "stdout",
|
| 670 |
+
"output_type": "stream",
|
| 671 |
+
"text": [
|
| 672 |
+
"\n",
|
| 673 |
+
"0: 640x448 1 person, 1 face, 518.4ms\n",
|
| 674 |
+
"Speed: 3.9ms preprocess, 518.4ms inference, 0.9ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 675 |
+
]
|
| 676 |
+
},
|
| 677 |
+
{
|
| 678 |
+
"name": "stderr",
|
| 679 |
+
"output_type": "stream",
|
| 680 |
+
"text": [
|
| 681 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 682 |
+
"INFO:MiVOLO:\tage: 18.25\n",
|
| 683 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 684 |
+
]
|
| 685 |
+
},
|
| 686 |
+
{
|
| 687 |
+
"name": "stdout",
|
| 688 |
+
"output_type": "stream",
|
| 689 |
+
"text": [
|
| 690 |
+
"\n",
|
| 691 |
+
"0: 640x448 1 person, 1 face, 470.7ms\n",
|
| 692 |
+
"Speed: 2.5ms preprocess, 470.7ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 693 |
+
]
|
| 694 |
+
},
|
| 695 |
+
{
|
| 696 |
+
"name": "stderr",
|
| 697 |
+
"output_type": "stream",
|
| 698 |
+
"text": [
|
| 699 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 700 |
+
"INFO:MiVOLO:\tage: 20.51\n",
|
| 701 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 702 |
+
]
|
| 703 |
+
},
|
| 704 |
+
{
|
| 705 |
+
"name": "stdout",
|
| 706 |
+
"output_type": "stream",
|
| 707 |
+
"text": [
|
| 708 |
+
"\n",
|
| 709 |
+
"0: 640x480 1 person, 1 face, 647.1ms\n",
|
| 710 |
+
"Speed: 2.4ms preprocess, 647.1ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 480)\n"
|
| 711 |
+
]
|
| 712 |
+
},
|
| 713 |
+
{
|
| 714 |
+
"name": "stderr",
|
| 715 |
+
"output_type": "stream",
|
| 716 |
+
"text": [
|
| 717 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 718 |
+
"INFO:MiVOLO:\tage: 58.87\n",
|
| 719 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 720 |
+
]
|
| 721 |
+
},
|
| 722 |
+
{
|
| 723 |
+
"name": "stdout",
|
| 724 |
+
"output_type": "stream",
|
| 725 |
+
"text": [
|
| 726 |
+
"\n",
|
| 727 |
+
"0: 640x448 (no detections), 469.8ms\n",
|
| 728 |
+
"Speed: 2.6ms preprocess, 469.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 729 |
+
"\n",
|
| 730 |
+
"0: 640x448 1 person, 1 face, 477.5ms\n",
|
| 731 |
+
"Speed: 2.3ms preprocess, 477.5ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 732 |
+
]
|
| 733 |
+
},
|
| 734 |
+
{
|
| 735 |
+
"name": "stderr",
|
| 736 |
+
"output_type": "stream",
|
| 737 |
+
"text": [
|
| 738 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 739 |
+
"INFO:MiVOLO:\tage: 23.79\n",
|
| 740 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 741 |
+
]
|
| 742 |
+
},
|
| 743 |
+
{
|
| 744 |
+
"name": "stdout",
|
| 745 |
+
"output_type": "stream",
|
| 746 |
+
"text": [
|
| 747 |
+
"Processed and saved 10 images so far.\n",
|
| 748 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-02.csv\n",
|
| 749 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-03.csv\n",
|
| 750 |
+
"\n",
|
| 751 |
+
"0: 640x448 1 face, 511.4ms\n",
|
| 752 |
+
"Speed: 2.5ms preprocess, 511.4ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 753 |
+
]
|
| 754 |
+
},
|
| 755 |
+
{
|
| 756 |
+
"name": "stderr",
|
| 757 |
+
"output_type": "stream",
|
| 758 |
+
"text": [
|
| 759 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 760 |
+
"INFO:MiVOLO:\tage: 24.87\n",
|
| 761 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 762 |
+
]
|
| 763 |
+
},
|
| 764 |
+
{
|
| 765 |
+
"name": "stdout",
|
| 766 |
+
"output_type": "stream",
|
| 767 |
+
"text": [
|
| 768 |
+
"Processed and saved 1 images so far.\n",
|
| 769 |
+
"\n",
|
| 770 |
+
"0: 640x544 (no detections), 576.5ms\n",
|
| 771 |
+
"Speed: 2.9ms preprocess, 576.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 544)\n",
|
| 772 |
+
"\n",
|
| 773 |
+
"0: 640x448 1 person, 1 face, 687.1ms\n",
|
| 774 |
+
"Speed: 9.9ms preprocess, 687.1ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 775 |
+
]
|
| 776 |
+
},
|
| 777 |
+
{
|
| 778 |
+
"name": "stderr",
|
| 779 |
+
"output_type": "stream",
|
| 780 |
+
"text": [
|
| 781 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 782 |
+
"INFO:MiVOLO:\tage: 25.76\n",
|
| 783 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 784 |
+
]
|
| 785 |
+
},
|
| 786 |
+
{
|
| 787 |
+
"name": "stdout",
|
| 788 |
+
"output_type": "stream",
|
| 789 |
+
"text": [
|
| 790 |
+
"\n",
|
| 791 |
+
"0: 640x448 (no detections), 498.3ms\n",
|
| 792 |
+
"Speed: 2.3ms preprocess, 498.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 793 |
+
"\n",
|
| 794 |
+
"0: 640x512 (no detections), 573.2ms\n",
|
| 795 |
+
"Speed: 3.0ms preprocess, 573.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
|
| 796 |
+
"Processed and saved 5 images so far.\n",
|
| 797 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-03.csv\n",
|
| 798 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-04.csv\n",
|
| 799 |
+
"\n",
|
| 800 |
+
"0: 640x384 (no detections), 518.2ms\n",
|
| 801 |
+
"Speed: 2.7ms preprocess, 518.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 384)\n",
|
| 802 |
+
"Processed and saved 1 images so far.\n",
|
| 803 |
+
"\n",
|
| 804 |
+
"0: 640x512 (no detections), 707.7ms\n",
|
| 805 |
+
"Speed: 3.6ms preprocess, 707.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
|
| 806 |
+
"\n",
|
| 807 |
+
"0: 640x416 (no detections), 453.7ms\n",
|
| 808 |
+
"Speed: 2.4ms preprocess, 453.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 416)\n",
|
| 809 |
+
"\n",
|
| 810 |
+
"0: 640x384 (no detections), 391.0ms\n",
|
| 811 |
+
"Speed: 2.0ms preprocess, 391.0ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 384)\n",
|
| 812 |
+
"\n",
|
| 813 |
+
"0: 640x448 1 person, 1 face, 449.8ms\n",
|
| 814 |
+
"Speed: 2.3ms preprocess, 449.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 815 |
+
]
|
| 816 |
+
},
|
| 817 |
+
{
|
| 818 |
+
"name": "stderr",
|
| 819 |
+
"output_type": "stream",
|
| 820 |
+
"text": [
|
| 821 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 822 |
+
"INFO:MiVOLO:\tage: 22.39\n",
|
| 823 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 824 |
+
]
|
| 825 |
+
},
|
| 826 |
+
{
|
| 827 |
+
"name": "stdout",
|
| 828 |
+
"output_type": "stream",
|
| 829 |
+
"text": [
|
| 830 |
+
"\n",
|
| 831 |
+
"0: 640x448 (no detections), 618.4ms\n",
|
| 832 |
+
"Speed: 2.3ms preprocess, 618.4ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 833 |
+
"\n",
|
| 834 |
+
"0: 640x448 1 person, 1 face, 631.0ms\n",
|
| 835 |
+
"Speed: 2.2ms preprocess, 631.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 836 |
+
]
|
| 837 |
+
},
|
| 838 |
+
{
|
| 839 |
+
"name": "stderr",
|
| 840 |
+
"output_type": "stream",
|
| 841 |
+
"text": [
|
| 842 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 843 |
+
"INFO:MiVOLO:\tage: 24.05\n",
|
| 844 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 845 |
+
]
|
| 846 |
+
},
|
| 847 |
+
{
|
| 848 |
+
"name": "stdout",
|
| 849 |
+
"output_type": "stream",
|
| 850 |
+
"text": [
|
| 851 |
+
"\n",
|
| 852 |
+
"0: 640x512 1 person, 496.4ms\n",
|
| 853 |
+
"Speed: 2.6ms preprocess, 496.4ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
|
| 854 |
+
]
|
| 855 |
+
},
|
| 856 |
+
{
|
| 857 |
+
"name": "stderr",
|
| 858 |
+
"output_type": "stream",
|
| 859 |
+
"text": [
|
| 860 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 861 |
+
"INFO:MiVOLO:\tage: 22.81\n",
|
| 862 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 863 |
+
]
|
| 864 |
+
},
|
| 865 |
+
{
|
| 866 |
+
"name": "stdout",
|
| 867 |
+
"output_type": "stream",
|
| 868 |
+
"text": [
|
| 869 |
+
"\n",
|
| 870 |
+
"0: 640x448 (no detections), 442.8ms\n",
|
| 871 |
+
"Speed: 2.3ms preprocess, 442.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 872 |
+
"\n",
|
| 873 |
+
"0: 640x448 1 person, 1 face, 477.7ms\n",
|
| 874 |
+
"Speed: 2.4ms preprocess, 477.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 875 |
+
]
|
| 876 |
+
},
|
| 877 |
+
{
|
| 878 |
+
"name": "stderr",
|
| 879 |
+
"output_type": "stream",
|
| 880 |
+
"text": [
|
| 881 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 882 |
+
"INFO:MiVOLO:\tage: 21.62\n",
|
| 883 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 884 |
+
]
|
| 885 |
+
},
|
| 886 |
+
{
|
| 887 |
+
"name": "stdout",
|
| 888 |
+
"output_type": "stream",
|
| 889 |
+
"text": [
|
| 890 |
+
"\n",
|
| 891 |
+
"0: 640x448 1 person, 1 face, 447.0ms\n",
|
| 892 |
+
"Speed: 2.2ms preprocess, 447.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 893 |
+
]
|
| 894 |
+
},
|
| 895 |
+
{
|
| 896 |
+
"name": "stderr",
|
| 897 |
+
"output_type": "stream",
|
| 898 |
+
"text": [
|
| 899 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 900 |
+
"INFO:MiVOLO:\tage: 54.31\n",
|
| 901 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 902 |
+
]
|
| 903 |
+
},
|
| 904 |
+
{
|
| 905 |
+
"name": "stdout",
|
| 906 |
+
"output_type": "stream",
|
| 907 |
+
"text": [
|
| 908 |
+
"\n",
|
| 909 |
+
"0: 640x640 (no detections), 819.0ms\n",
|
| 910 |
+
"Speed: 3.6ms preprocess, 819.0ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 911 |
+
"\n",
|
| 912 |
+
"0: 640x448 1 person, 1 face, 478.2ms\n",
|
| 913 |
+
"Speed: 1.8ms preprocess, 478.2ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 914 |
+
]
|
| 915 |
+
},
|
| 916 |
+
{
|
| 917 |
+
"name": "stderr",
|
| 918 |
+
"output_type": "stream",
|
| 919 |
+
"text": [
|
| 920 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 921 |
+
"INFO:MiVOLO:\tage: 20.56\n",
|
| 922 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 923 |
+
]
|
| 924 |
+
},
|
| 925 |
+
{
|
| 926 |
+
"name": "stdout",
|
| 927 |
+
"output_type": "stream",
|
| 928 |
+
"text": [
|
| 929 |
+
"\n",
|
| 930 |
+
"0: 640x448 1 person, 1 face, 471.2ms\n",
|
| 931 |
+
"Speed: 2.7ms preprocess, 471.2ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 932 |
+
]
|
| 933 |
+
},
|
| 934 |
+
{
|
| 935 |
+
"name": "stderr",
|
| 936 |
+
"output_type": "stream",
|
| 937 |
+
"text": [
|
| 938 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 939 |
+
"INFO:MiVOLO:\tage: 21.31\n",
|
| 940 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 941 |
+
]
|
| 942 |
+
},
|
| 943 |
+
{
|
| 944 |
+
"name": "stdout",
|
| 945 |
+
"output_type": "stream",
|
| 946 |
+
"text": [
|
| 947 |
+
"\n",
|
| 948 |
+
"0: 640x448 (no detections), 484.0ms\n",
|
| 949 |
+
"Speed: 2.2ms preprocess, 484.0ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 950 |
+
"\n",
|
| 951 |
+
"0: 640x640 (no detections), 832.6ms\n",
|
| 952 |
+
"Speed: 3.0ms preprocess, 832.6ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 953 |
+
"\n",
|
| 954 |
+
"0: 640x448 1 person, 1 face, 508.9ms\n",
|
| 955 |
+
"Speed: 2.5ms preprocess, 508.9ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 956 |
+
]
|
| 957 |
+
},
|
| 958 |
+
{
|
| 959 |
+
"name": "stderr",
|
| 960 |
+
"output_type": "stream",
|
| 961 |
+
"text": [
|
| 962 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 963 |
+
"INFO:MiVOLO:\tage: 27.19\n",
|
| 964 |
+
"INFO:MiVOLO:\tgender: female [99%]\n",
|
| 965 |
+
"ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/6879c7b9-5cb3-42db-b409-30b4e2f71945/width=1080/6879c7b9-5cb3-42db-b409-30b4e2f71945.mp4: cannot identify image file <_io.BytesIO object at 0x14cb0020ad40>\n"
|
| 966 |
+
]
|
| 967 |
+
},
|
| 968 |
+
{
|
| 969 |
+
"name": "stdout",
|
| 970 |
+
"output_type": "stream",
|
| 971 |
+
"text": [
|
| 972 |
+
"\n",
|
| 973 |
+
"0: 640x448 9 persons, 461.8ms\n",
|
| 974 |
+
"Speed: 2.4ms preprocess, 461.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 975 |
+
]
|
| 976 |
+
},
|
| 977 |
+
{
|
| 978 |
+
"name": "stderr",
|
| 979 |
+
"output_type": "stream",
|
| 980 |
+
"text": [
|
| 981 |
+
"INFO:MiVOLO:faces_input: torch.Size([9, 3, 224, 224]), person_input: torch.Size([9, 3, 224, 224])\n",
|
| 982 |
+
"INFO:MiVOLO:\tage: 30.4\n",
|
| 983 |
+
"INFO:MiVOLO:\tgender: male [55%]\n",
|
| 984 |
+
"INFO:MiVOLO:\tage: 28.89\n",
|
| 985 |
+
"INFO:MiVOLO:\tgender: female [63%]\n",
|
| 986 |
+
"INFO:MiVOLO:\tage: 30.31\n",
|
| 987 |
+
"INFO:MiVOLO:\tgender: female [68%]\n",
|
| 988 |
+
"INFO:MiVOLO:\tage: 31.62\n",
|
| 989 |
+
"INFO:MiVOLO:\tgender: female [53%]\n",
|
| 990 |
+
"INFO:MiVOLO:\tage: 35.17\n",
|
| 991 |
+
"INFO:MiVOLO:\tgender: male [53%]\n",
|
| 992 |
+
"INFO:MiVOLO:\tage: 33.02\n",
|
| 993 |
+
"INFO:MiVOLO:\tgender: male [95%]\n",
|
| 994 |
+
"INFO:MiVOLO:\tage: 35.17\n",
|
| 995 |
+
"INFO:MiVOLO:\tgender: male [53%]\n",
|
| 996 |
+
"INFO:MiVOLO:\tage: 35.17\n",
|
| 997 |
+
"INFO:MiVOLO:\tgender: male [53%]\n",
|
| 998 |
+
"INFO:MiVOLO:\tage: 35.17\n",
|
| 999 |
+
"INFO:MiVOLO:\tgender: male [53%]\n"
|
| 1000 |
+
]
|
| 1001 |
+
},
|
| 1002 |
+
{
|
| 1003 |
+
"name": "stdout",
|
| 1004 |
+
"output_type": "stream",
|
| 1005 |
+
"text": [
|
| 1006 |
+
"Processed and saved 19 images so far.\n",
|
| 1007 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-04.csv\n",
|
| 1008 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-05.csv\n",
|
| 1009 |
+
"\n",
|
| 1010 |
+
"0: 640x448 1 person, 455.5ms\n",
|
| 1011 |
+
"Speed: 2.2ms preprocess, 455.5ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1012 |
+
]
|
| 1013 |
+
},
|
| 1014 |
+
{
|
| 1015 |
+
"name": "stderr",
|
| 1016 |
+
"output_type": "stream",
|
| 1017 |
+
"text": [
|
| 1018 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1019 |
+
"INFO:MiVOLO:\tage: 37.57\n",
|
| 1020 |
+
"INFO:MiVOLO:\tgender: male [95%]\n"
|
| 1021 |
+
]
|
| 1022 |
+
},
|
| 1023 |
+
{
|
| 1024 |
+
"name": "stdout",
|
| 1025 |
+
"output_type": "stream",
|
| 1026 |
+
"text": [
|
| 1027 |
+
"Processed and saved 1 images so far.\n",
|
| 1028 |
+
"\n",
|
| 1029 |
+
"0: 640x448 1 person, 1 face, 438.7ms\n",
|
| 1030 |
+
"Speed: 2.2ms preprocess, 438.7ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1031 |
+
]
|
| 1032 |
+
},
|
| 1033 |
+
{
|
| 1034 |
+
"name": "stderr",
|
| 1035 |
+
"output_type": "stream",
|
| 1036 |
+
"text": [
|
| 1037 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1038 |
+
"INFO:MiVOLO:\tage: 15.62\n",
|
| 1039 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1040 |
+
]
|
| 1041 |
+
},
|
| 1042 |
+
{
|
| 1043 |
+
"name": "stdout",
|
| 1044 |
+
"output_type": "stream",
|
| 1045 |
+
"text": [
|
| 1046 |
+
"\n",
|
| 1047 |
+
"0: 640x448 (no detections), 444.8ms\n",
|
| 1048 |
+
"Speed: 2.3ms preprocess, 444.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1049 |
+
]
|
| 1050 |
+
},
|
| 1051 |
+
{
|
| 1052 |
+
"name": "stderr",
|
| 1053 |
+
"output_type": "stream",
|
| 1054 |
+
"text": [
|
| 1055 |
+
"ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/6032bd70-6d53-4007-9e89-e69d4748efb5/width=528/6032bd70-6d53-4007-9e89-e69d4748efb5.mp4: cannot identify image file <_io.BytesIO object at 0x14cb00ed0950>\n"
|
| 1056 |
+
]
|
| 1057 |
+
},
|
| 1058 |
+
{
|
| 1059 |
+
"name": "stdout",
|
| 1060 |
+
"output_type": "stream",
|
| 1061 |
+
"text": [
|
| 1062 |
+
"\n",
|
| 1063 |
+
"0: 640x448 (no detections), 453.9ms\n",
|
| 1064 |
+
"Speed: 2.3ms preprocess, 453.9ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1065 |
+
"\n",
|
| 1066 |
+
"0: 640x448 1 person, 1 face, 475.0ms\n",
|
| 1067 |
+
"Speed: 1.6ms preprocess, 475.0ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1068 |
+
]
|
| 1069 |
+
},
|
| 1070 |
+
{
|
| 1071 |
+
"name": "stderr",
|
| 1072 |
+
"output_type": "stream",
|
| 1073 |
+
"text": [
|
| 1074 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1075 |
+
"INFO:MiVOLO:\tage: 22.5\n",
|
| 1076 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1077 |
+
]
|
| 1078 |
+
},
|
| 1079 |
+
{
|
| 1080 |
+
"name": "stdout",
|
| 1081 |
+
"output_type": "stream",
|
| 1082 |
+
"text": [
|
| 1083 |
+
"\n",
|
| 1084 |
+
"0: 640x448 1 person, 1 face, 447.6ms\n",
|
| 1085 |
+
"Speed: 2.5ms preprocess, 447.6ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1086 |
+
]
|
| 1087 |
+
},
|
| 1088 |
+
{
|
| 1089 |
+
"name": "stderr",
|
| 1090 |
+
"output_type": "stream",
|
| 1091 |
+
"text": [
|
| 1092 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1093 |
+
"INFO:MiVOLO:\tage: 23.46\n",
|
| 1094 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1095 |
+
]
|
| 1096 |
+
},
|
| 1097 |
+
{
|
| 1098 |
+
"name": "stdout",
|
| 1099 |
+
"output_type": "stream",
|
| 1100 |
+
"text": [
|
| 1101 |
+
"\n",
|
| 1102 |
+
"0: 640x512 (no detections), 528.5ms\n",
|
| 1103 |
+
"Speed: 3.2ms preprocess, 528.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
|
| 1104 |
+
"\n",
|
| 1105 |
+
"0: 640x448 1 person, 1 face, 449.8ms\n",
|
| 1106 |
+
"Speed: 2.3ms preprocess, 449.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1107 |
+
]
|
| 1108 |
+
},
|
| 1109 |
+
{
|
| 1110 |
+
"name": "stderr",
|
| 1111 |
+
"output_type": "stream",
|
| 1112 |
+
"text": [
|
| 1113 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1114 |
+
"INFO:MiVOLO:\tage: 29.32\n",
|
| 1115 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1116 |
+
]
|
| 1117 |
+
},
|
| 1118 |
+
{
|
| 1119 |
+
"name": "stdout",
|
| 1120 |
+
"output_type": "stream",
|
| 1121 |
+
"text": [
|
| 1122 |
+
"\n",
|
| 1123 |
+
"0: 640x448 1 person, 1 face, 617.7ms\n",
|
| 1124 |
+
"Speed: 2.4ms preprocess, 617.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1125 |
+
]
|
| 1126 |
+
},
|
| 1127 |
+
{
|
| 1128 |
+
"name": "stderr",
|
| 1129 |
+
"output_type": "stream",
|
| 1130 |
+
"text": [
|
| 1131 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1132 |
+
"INFO:MiVOLO:\tage: 21.32\n",
|
| 1133 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1134 |
+
]
|
| 1135 |
+
},
|
| 1136 |
+
{
|
| 1137 |
+
"name": "stdout",
|
| 1138 |
+
"output_type": "stream",
|
| 1139 |
+
"text": [
|
| 1140 |
+
"\n",
|
| 1141 |
+
"0: 640x448 (no detections), 609.1ms\n",
|
| 1142 |
+
"Speed: 2.3ms preprocess, 609.1ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1143 |
+
"\n",
|
| 1144 |
+
"0: 640x448 (no detections), 436.2ms\n",
|
| 1145 |
+
"Speed: 2.5ms preprocess, 436.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1146 |
+
"\n",
|
| 1147 |
+
"0: 640x512 1 person, 1 face, 585.6ms\n",
|
| 1148 |
+
"Speed: 3.1ms preprocess, 585.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
|
| 1149 |
+
]
|
| 1150 |
+
},
|
| 1151 |
+
{
|
| 1152 |
+
"name": "stderr",
|
| 1153 |
+
"output_type": "stream",
|
| 1154 |
+
"text": [
|
| 1155 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1156 |
+
"INFO:MiVOLO:\tage: 20.5\n",
|
| 1157 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1158 |
+
]
|
| 1159 |
+
},
|
| 1160 |
+
{
|
| 1161 |
+
"name": "stdout",
|
| 1162 |
+
"output_type": "stream",
|
| 1163 |
+
"text": [
|
| 1164 |
+
"\n",
|
| 1165 |
+
"0: 640x448 1 person, 457.3ms\n",
|
| 1166 |
+
"Speed: 2.1ms preprocess, 457.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1167 |
+
]
|
| 1168 |
+
},
|
| 1169 |
+
{
|
| 1170 |
+
"name": "stderr",
|
| 1171 |
+
"output_type": "stream",
|
| 1172 |
+
"text": [
|
| 1173 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1174 |
+
"INFO:MiVOLO:\tage: 25.19\n",
|
| 1175 |
+
"INFO:MiVOLO:\tgender: male [81%]\n"
|
| 1176 |
+
]
|
| 1177 |
+
},
|
| 1178 |
+
{
|
| 1179 |
+
"name": "stdout",
|
| 1180 |
+
"output_type": "stream",
|
| 1181 |
+
"text": [
|
| 1182 |
+
"Processed and saved 14 images so far.\n",
|
| 1183 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-05.csv\n",
|
| 1184 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-06.csv\n",
|
| 1185 |
+
"\n",
|
| 1186 |
+
"0: 640x448 (no detections), 484.5ms\n",
|
| 1187 |
+
"Speed: 2.8ms preprocess, 484.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1188 |
+
"Processed and saved 1 images so far.\n",
|
| 1189 |
+
"\n",
|
| 1190 |
+
"0: 640x512 (no detections), 524.8ms\n",
|
| 1191 |
+
"Speed: 2.9ms preprocess, 524.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
|
| 1192 |
+
"\n",
|
| 1193 |
+
"0: 640x480 1 person, 478.0ms\n",
|
| 1194 |
+
"Speed: 2.6ms preprocess, 478.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 480)\n"
|
| 1195 |
+
]
|
| 1196 |
+
},
|
| 1197 |
+
{
|
| 1198 |
+
"name": "stderr",
|
| 1199 |
+
"output_type": "stream",
|
| 1200 |
+
"text": [
|
| 1201 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1202 |
+
"INFO:MiVOLO:\tage: 39.4\n",
|
| 1203 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 1204 |
+
]
|
| 1205 |
+
},
|
| 1206 |
+
{
|
| 1207 |
+
"name": "stdout",
|
| 1208 |
+
"output_type": "stream",
|
| 1209 |
+
"text": [
|
| 1210 |
+
"\n",
|
| 1211 |
+
"0: 640x512 1 person, 1 face, 539.8ms\n",
|
| 1212 |
+
"Speed: 2.6ms preprocess, 539.8ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 512)\n"
|
| 1213 |
+
]
|
| 1214 |
+
},
|
| 1215 |
+
{
|
| 1216 |
+
"name": "stderr",
|
| 1217 |
+
"output_type": "stream",
|
| 1218 |
+
"text": [
|
| 1219 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1220 |
+
"INFO:MiVOLO:\tage: 21.33\n",
|
| 1221 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1222 |
+
]
|
| 1223 |
+
},
|
| 1224 |
+
{
|
| 1225 |
+
"name": "stdout",
|
| 1226 |
+
"output_type": "stream",
|
| 1227 |
+
"text": [
|
| 1228 |
+
"\n",
|
| 1229 |
+
"0: 640x448 1 person, 2 faces, 446.7ms\n",
|
| 1230 |
+
"Speed: 2.4ms preprocess, 446.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1231 |
+
]
|
| 1232 |
+
},
|
| 1233 |
+
{
|
| 1234 |
+
"name": "stderr",
|
| 1235 |
+
"output_type": "stream",
|
| 1236 |
+
"text": [
|
| 1237 |
+
"INFO:MiVOLO:faces_input: torch.Size([2, 3, 224, 224]), person_input: torch.Size([2, 3, 224, 224])\n",
|
| 1238 |
+
"INFO:MiVOLO:\tage: 20.65\n",
|
| 1239 |
+
"INFO:MiVOLO:\tgender: female [99%]\n",
|
| 1240 |
+
"INFO:MiVOLO:\tage: 20.53\n",
|
| 1241 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1242 |
+
]
|
| 1243 |
+
},
|
| 1244 |
+
{
|
| 1245 |
+
"name": "stdout",
|
| 1246 |
+
"output_type": "stream",
|
| 1247 |
+
"text": [
|
| 1248 |
+
"\n",
|
| 1249 |
+
"0: 640x640 1 person, 1 face, 655.0ms\n",
|
| 1250 |
+
"Speed: 3.3ms preprocess, 655.0ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 640)\n"
|
| 1251 |
+
]
|
| 1252 |
+
},
|
| 1253 |
+
{
|
| 1254 |
+
"name": "stderr",
|
| 1255 |
+
"output_type": "stream",
|
| 1256 |
+
"text": [
|
| 1257 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1258 |
+
"INFO:MiVOLO:\tage: 26.34\n",
|
| 1259 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1260 |
+
]
|
| 1261 |
+
},
|
| 1262 |
+
{
|
| 1263 |
+
"name": "stdout",
|
| 1264 |
+
"output_type": "stream",
|
| 1265 |
+
"text": [
|
| 1266 |
+
"\n",
|
| 1267 |
+
"0: 640x384 (no detections), 400.6ms\n",
|
| 1268 |
+
"Speed: 2.1ms preprocess, 400.6ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 384)\n",
|
| 1269 |
+
"\n",
|
| 1270 |
+
"0: 640x448 1 person, 587.9ms\n",
|
| 1271 |
+
"Speed: 2.2ms preprocess, 587.9ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1272 |
+
]
|
| 1273 |
+
},
|
| 1274 |
+
{
|
| 1275 |
+
"name": "stderr",
|
| 1276 |
+
"output_type": "stream",
|
| 1277 |
+
"text": [
|
| 1278 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1279 |
+
"INFO:MiVOLO:\tage: 30.4\n",
|
| 1280 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1281 |
+
]
|
| 1282 |
+
},
|
| 1283 |
+
{
|
| 1284 |
+
"name": "stdout",
|
| 1285 |
+
"output_type": "stream",
|
| 1286 |
+
"text": [
|
| 1287 |
+
"\n",
|
| 1288 |
+
"0: 640x448 (no detections), 610.3ms\n",
|
| 1289 |
+
"Speed: 2.3ms preprocess, 610.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1290 |
+
"\n",
|
| 1291 |
+
"0: 640x448 (no detections), 453.6ms\n",
|
| 1292 |
+
"Speed: 2.3ms preprocess, 453.6ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1293 |
+
"\n",
|
| 1294 |
+
"0: 640x512 1 person, 1 face, 511.3ms\n",
|
| 1295 |
+
"Speed: 2.8ms preprocess, 511.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
|
| 1296 |
+
]
|
| 1297 |
+
},
|
| 1298 |
+
{
|
| 1299 |
+
"name": "stderr",
|
| 1300 |
+
"output_type": "stream",
|
| 1301 |
+
"text": [
|
| 1302 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1303 |
+
"INFO:MiVOLO:\tage: 34.28\n",
|
| 1304 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1305 |
+
]
|
| 1306 |
+
},
|
| 1307 |
+
{
|
| 1308 |
+
"name": "stdout",
|
| 1309 |
+
"output_type": "stream",
|
| 1310 |
+
"text": [
|
| 1311 |
+
"\n",
|
| 1312 |
+
"0: 640x448 (no detections), 441.2ms\n",
|
| 1313 |
+
"Speed: 2.3ms preprocess, 441.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1314 |
+
"\n",
|
| 1315 |
+
"0: 640x448 (no detections), 586.3ms\n",
|
| 1316 |
+
"Speed: 2.3ms preprocess, 586.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1317 |
+
"\n",
|
| 1318 |
+
"0: 640x448 (no detections), 437.5ms\n",
|
| 1319 |
+
"Speed: 2.4ms preprocess, 437.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1320 |
+
"\n",
|
| 1321 |
+
"0: 448x640 1 person, 1 face, 437.4ms\n",
|
| 1322 |
+
"Speed: 2.4ms preprocess, 437.4ms inference, 0.7ms postprocess per image at shape (1, 3, 448, 640)\n"
|
| 1323 |
+
]
|
| 1324 |
+
},
|
| 1325 |
+
{
|
| 1326 |
+
"name": "stderr",
|
| 1327 |
+
"output_type": "stream",
|
| 1328 |
+
"text": [
|
| 1329 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1330 |
+
"INFO:MiVOLO:\tage: 22.81\n",
|
| 1331 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1332 |
+
]
|
| 1333 |
+
},
|
| 1334 |
+
{
|
| 1335 |
+
"name": "stdout",
|
| 1336 |
+
"output_type": "stream",
|
| 1337 |
+
"text": [
|
| 1338 |
+
"\n",
|
| 1339 |
+
"0: 640x448 (no detections), 436.8ms\n",
|
| 1340 |
+
"Speed: 2.6ms preprocess, 436.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1341 |
+
"\n",
|
| 1342 |
+
"0: 448x640 (no detections), 433.0ms\n",
|
| 1343 |
+
"Speed: 1.9ms preprocess, 433.0ms inference, 0.4ms postprocess per image at shape (1, 3, 448, 640)\n",
|
| 1344 |
+
"\n",
|
| 1345 |
+
"0: 640x448 (no detections), 599.7ms\n",
|
| 1346 |
+
"Speed: 2.5ms preprocess, 599.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1347 |
+
"\n"
|
| 1348 |
+
]
|
| 1349 |
+
}
|
| 1350 |
+
],
|
| 1351 |
+
"source": [
|
| 1352 |
+
"# Paths to the detector weights and the MiVOLO checkpoint\n",
|
| 1353 |
+
"detector_weights = current_dir.parent / 'ext/MiVOLO/models/yolov8x_person_face.pt'\n",
|
| 1354 |
+
"checkpoint = current_dir.parent / 'ext/MiVOLO/models/mivolo_imbd.pth.tar'\n",
|
| 1355 |
+
"\n",
|
| 1356 |
+
"_logger = logging.getLogger(\"inference\")\n",
|
| 1357 |
+
"logging.basicConfig(level=logging.INFO)\n",
|
| 1358 |
+
"\n",
|
| 1359 |
+
"# Placeholder configuration and predictor initialization for MiVOLO\n",
|
| 1360 |
+
"class Config:\n",
|
| 1361 |
+
"    def __init__(self, detector_weights, checkpoint, device, with_persons=True, disable_faces=False, draw=False):\n",
|
| 1362 |
+
"        self.detector_weights = detector_weights\n",
|
| 1363 |
+
"        self.checkpoint = checkpoint\n",
|
| 1364 |
+
"        self.device = device\n",
|
| 1365 |
+
"        self.with_persons = with_persons\n",
|
| 1366 |
+
"        self.disable_faces = disable_faces\n",
|
| 1367 |
+
"        self.draw = draw\n",
|
| 1368 |
+
"\n",
|
| 1369 |
+
"\n",
|
| 1370 |
+
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
|
| 1371 |
+
"config = Config(detector_weights=detector_weights, checkpoint=checkpoint, device=device)\n",
|
| 1372 |
+
"predictor = Predictor(config, verbose=True)\n",
|
| 1373 |
+
"\n",
|
| 1374 |
+
"def download_image(url):\n",
|
| 1375 |
+
"    try:\n",
|
| 1376 |
+
"        response = requests.get(url, timeout=30)  # Fail instead of hanging on a stalled connection\n",
|
| 1377 |
+
"        response.raise_for_status()\n",
|
| 1378 |
+
"        return Image.open(BytesIO(response.content)).convert(\"RGB\")\n",
|
| 1379 |
+
"    except requests.RequestException as e:\n",
|
| 1380 |
+
"        _logger.error(f\"Failed to download image from {url}: {e}\")\n",
|
| 1381 |
+
"        return None\n",
|
| 1382 |
+
"    except UnidentifiedImageError as e:\n",
|
| 1383 |
+
"        _logger.error(f\"Unidentified image error for URL {url}: {e}\")\n",
|
| 1384 |
+
"        return None\n",
|
| 1385 |
+
"\n",
|
| 1386 |
+
"def process_images_with_progress(data, predictor, output_file, start_idx=0):\n",
|
| 1387 |
+
"    results = []\n",
|
| 1388 |
+
"    total_images = len(data)\n",
|
| 1389 |
+
"\n",
|
| 1390 |
+
"    for idx, row in data.iterrows():\n",
|
| 1391 |
+
"        if idx < start_idx:\n",
|
| 1392 |
+
"            continue\n",
|
| 1393 |
+
"\n",
|
| 1394 |
+
"        img_url = row[\"url\"]\n",
|
| 1395 |
+
"        pil_image = download_image(img_url)\n",
|
| 1396 |
+
"        if pil_image is None:\n",
|
| 1397 |
+
"            continue\n",
|
| 1398 |
+
"\n",
|
| 1399 |
+
"        np_image = np.array(pil_image)\n",
|
| 1400 |
+
"        np_image = cv2.cvtColor(np_image, cv2.COLOR_RGB2BGR)\n",
|
| 1401 |
+
"        detected_objects, _ = predictor.recognize(np_image)\n",
|
| 1402 |
+
"\n",
|
| 1403 |
+
"        row_result = row.to_dict()  # Start with the original row's data\n",
|
| 1404 |
+
"\n",
|
| 1405 |
+
"        if detected_objects and detected_objects.ages:\n",
|
| 1406 |
+
"            for i in range(len(detected_objects.ages)):\n",
|
| 1407 |
+
"                age = detected_objects.ages[i]\n",
|
| 1408 |
+
"                gender = detected_objects.genders[i]\n",
|
| 1409 |
+
"                gender_confidence = detected_objects.gender_scores[i]\n",
|
| 1410 |
+
"\n",
|
| 1411 |
+
"                if gender_confidence >= 0.83:\n",
|
| 1412 |
+
"                    detection = {\n",
|
| 1413 |
+
"                        \"detection_type\": 'face' if i in detected_objects.face_to_person_map else 'person',\n",
|
| 1414 |
+
"                        \"gender\": gender,\n",
|
| 1415 |
+
"                        \"gender_confidence\": gender_confidence,\n",
|
| 1416 |
+
"                        \"age\": age,\n",
|
| 1417 |
+
"                        \"n_persons\": detected_objects.n_persons,\n",
|
| 1418 |
+
"                        \"n_faces\": detected_objects.n_faces,\n",
|
| 1419 |
+
"                        \"detected\": True\n",
|
| 1420 |
+
"                    }\n",
|
| 1421 |
+
"                else:\n",
|
| 1422 |
+
"                    detection = {\n",
|
| 1423 |
+
"                        \"detection_type\": \"N/A\",\n",
|
| 1424 |
+
"                        \"gender\": \"N/A\",\n",
|
| 1425 |
+
"                        \"gender_confidence\": 0,\n",
|
| 1426 |
+
"                        \"age\": 0,\n",
|
| 1427 |
+
"                        \"n_persons\": detected_objects.n_persons if detected_objects else 0,\n",
|
| 1428 |
+
"                        \"n_faces\": detected_objects.n_faces if detected_objects else 0,\n",
|
| 1429 |
+
"                        \"detected\": False\n",
|
| 1430 |
+
"                    }\n",
|
| 1431 |
+
"\n",
|
| 1432 |
+
"                results.append({**row_result, **detection})\n",
|
| 1433 |
+
"        else:\n",
|
| 1434 |
+
"            detection = {\n",
|
| 1435 |
+
"                \"detection_type\": \"N/A\",\n",
|
| 1436 |
+
"                \"gender\": \"N/A\",\n",
|
| 1437 |
+
"                \"gender_confidence\": 0,\n",
|
| 1438 |
+
"                \"age\": 0,\n",
|
| 1439 |
+
"                \"n_persons\": detected_objects.n_persons if detected_objects else 0,\n",
|
| 1440 |
+
"                \"n_faces\": detected_objects.n_faces if detected_objects else 0,\n",
|
| 1441 |
+
"                \"detected\": False\n",
|
| 1442 |
+
"            }\n",
|
| 1443 |
+
"            results.append({**row_result, **detection})\n",
|
| 1444 |
+
"\n",
|
| 1445 |
+
"        if idx % 100 == 0 or idx == total_images - 1:\n",
|
| 1446 |
+
"            df = pd.DataFrame(results)\n",
|
| 1447 |
+
"            if os.path.exists(output_file):\n",
|
| 1448 |
+
"                df.to_csv(output_file, mode='a', header=False, index=False)\n",
|
| 1449 |
+
"            else:\n",
|
| 1450 |
+
"                df.to_csv(output_file, mode='w', header=True, index=False)\n",
|
| 1451 |
+
"            results = []\n",
|
| 1452 |
+
"            print(f\"Processed and saved {idx + 1} images so far.\")\n",
|
| 1453 |
+
"\n",
|
| 1454 |
+
"def generate_months(start, end):\n",
|
| 1455 |
+
"    start_date = datetime.strptime(start, '%Y-%m')\n",
|
| 1456 |
+
"    end_date = datetime.strptime(end, '%Y-%m')\n",
|
| 1457 |
+
"    while start_date <= end_date:\n",
|
| 1458 |
+
"        yield start_date.strftime('%Y-%m')\n",
|
| 1459 |
+
"        start_date += relativedelta(months=1)  # Increment by calendar months\n",
|
| 1460 |
+
"\n",
|
| 1461 |
+
"\n",
|
| 1462 |
+
"start_month = '2022-11'\n",
|
| 1463 |
+
"end_month = '2024-12'\n",
|
| 1464 |
+
"\n",
|
| 1465 |
+
"for month in generate_months(start_month, end_month):\n",
|
| 1466 |
+
"    input_file_path = mivolo_in / f'Civiverse-{month}.csv'\n",
|
| 1467 |
+
"    output_file_path = mivolo_out / f'{month}.csv'\n",
|
| 1468 |
+
"\n",
|
| 1469 |
+
"    if input_file_path.exists():\n",
|
| 1470 |
+
"        print(f\"Processing: {input_file_path}\")\n",
|
| 1471 |
+
"\n",
|
| 1472 |
+
"        data = pd.read_csv(input_file_path)\n",
|
| 1473 |
+
"        start_index = 0\n",
|
| 1474 |
+
"        process_images_with_progress(data, predictor, output_file_path, start_idx=start_index)\n",
|
| 1475 |
+
"\n",
|
| 1476 |
+
"        print(f\"Processed and saved to: {output_file_path}\")\n",
|
| 1477 |
+
"    else:\n",
|
| 1478 |
+
"        print(f\"File not found: {input_file_path}\")"
|
| 1479 |
+
]
|
| 1480 |
+
},
|
| 1481 |
+
{
|
| 1482 |
+
"cell_type": "markdown",
|
| 1483 |
+
"id": "26aeeef7",
|
| 1484 |
+
"metadata": {},
|
| 1485 |
+
"source": [
|
| 1486 |
+
"## Visualization code"
|
| 1487 |
+
]
|
| 1488 |
+
},
|
| 1489 |
+
{
|
| 1490 |
+
"cell_type": "code",
|
| 1491 |
+
"execution_count": null,
|
| 1492 |
+
"id": "88ec896a-bf9b-4cc6-a787-c1343f8acb41",
|
| 1493 |
+
"metadata": {},
|
| 1494 |
+
"outputs": [],
|
| 1495 |
+
"source": [
|
| 1496 |
+
"import matplotlib.pyplot as plt\n",
|
| 1497 |
+
"import matplotlib.patches as mpatches\n",
|
| 1498 |
+
"\n",
|
| 1499 |
+
"input_dir = current_dir.parent / 'data/CSV/image_subsets/MiVOLO-results/'\n",
|
| 1500 |
+
"plot_dir = current_dir.parent / 'plots/'\n",
|
| 1501 |
+
"\n",
|
| 1502 |
+
"\n",
|
| 1503 |
+
"all_data = pd.DataFrame()\n",
|
| 1504 |
+
"for file_path in input_dir.glob('*.csv'): # Reads all CSV files in the folder\n",
|
| 1505 |
+
"    #print(f\"Loading: {file_path}\")\n",
|
| 1506 |
+
"    data = pd.read_csv(file_path)\n",
|
| 1507 |
+
"    all_data = pd.concat([all_data, data], ignore_index=True)\n",
|
| 1508 |
+
"\n",
|
| 1509 |
+
"# Filter rows where detection_type equals \"person\"\n",
|
| 1510 |
+
"person_data = all_data[all_data['detection_type'] == 'person']\n",
|
| 1511 |
+
"\n",
|
| 1512 |
+
"# Count unique images and categorize by persons detected\n",
|
| 1513 |
+
"n_images = all_data['id'].nunique()\n",
|
| 1514 |
+
"images_with_zero_persons = all_data[all_data['n_persons'] == 0]['id'].nunique()\n",
|
| 1515 |
+
"images_with_one_or_more_persons = all_data[all_data['n_persons'] > 0]['id'].nunique()\n",
|
| 1516 |
+
"\n",
|
| 1517 |
+
"n_persons_detected = person_data['id'].nunique() # Unique images with at least one detected person\n",
|
| 1518 |
+
"total_persons_detected = person_data.shape[0] # Total number of persons detected\n",
|
| 1519 |
+
"\n",
|
| 1520 |
+
"\n",
|
| 1521 |
+
"\n",
|
| 1522 |
+
"n_total_female = person_data[person_data['gender'] == 'female']['id'].nunique()\n",
|
| 1523 |
+
"n_total_male = person_data[person_data['gender'] == 'male']['id'].nunique()\n",
|
| 1524 |
+
"\n",
|
| 1525 |
+
"# Filter the data further for non-missing age and gender\n",
|
| 1526 |
+
"filtered_data = person_data.dropna(subset=['age', 'gender']).copy()\n",
|
| 1527 |
+
"\n",
|
| 1528 |
+
"# Round ages to the nearest quarter year for consistent plotting\n",
|
| 1529 |
+
"filtered_data['rounded_age'] = np.round(filtered_data['age'] * 4) / 4\n",
|
| 1530 |
+
"\n",
|
| 1531 |
+
"# Map browsingLevel to colors\n",
|
| 1532 |
+
"def get_browsing_color(browsing_level):\n",
|
| 1533 |
+
"    color_mapping = {\n",
|
| 1534 |
+
"        1: 'silver',\n",
|
| 1535 |
+
"        2: 'rosybrown',\n",
|
| 1536 |
+
"        4: 'coral',\n",
|
| 1537 |
+
"        8: 'crimson',\n",
|
| 1538 |
+
"        16: 'blueviolet'\n",
|
| 1539 |
+
"    }\n",
|
| 1540 |
+
"    return color_mapping.get(browsing_level, 'black')  # Default to black for unknown values\n",
|
| 1541 |
+
"\n",
|
| 1542 |
+
"filtered_data['color'] = filtered_data['browsingLevel'].apply(get_browsing_color)\n",
|
| 1543 |
+
"\n",
|
| 1544 |
+
"# Aggregate data for plotting\n",
|
| 1545 |
+
"aggregated_data = (\n",
|
| 1546 |
+
"    filtered_data.groupby(['rounded_age', 'gender', 'color'])\n",
|
| 1547 |
+
"    .size()\n",
|
| 1548 |
+
"    .unstack(fill_value=0)\n",
|
| 1549 |
+
")\n",
|
| 1550 |
+
"\n",
|
| 1551 |
+
"# Define NSFW colors\n",
|
| 1552 |
+
"nsfw_colors = ['blueviolet', 'crimson', 'coral', 'rosybrown', 'silver']\n",
|
| 1553 |
+
"\n",
|
| 1554 |
+
"# Plotting function\n",
|
| 1555 |
+
"def plot_gender_data(ax, data, gender_label):\n",
|
| 1556 |
+
"    ages = data.index\n",
|
| 1557 |
+
"    bottom = np.zeros(len(ages))\n",
|
| 1558 |
+
"    \n",
|
| 1559 |
+
"    for color in nsfw_colors:\n",
|
| 1560 |
+
"        counts = data[color] if color in data.columns else np.zeros(len(ages))\n",
|
| 1561 |
+
"        ax.bar(\n",
|
| 1562 |
+
"            ages,\n",
|
| 1563 |
+
"            counts,\n",
|
| 1564 |
+
"            color=color,\n",
|
| 1565 |
+
"            edgecolor=color,\n",
|
| 1566 |
+
"            linewidth=1,\n",
|
| 1567 |
+
"            width=0.2,\n",
|
| 1568 |
+
"            bottom=bottom,\n",
|
| 1569 |
+
"            alpha=0.5\n",
|
| 1570 |
+
"        )\n",
|
| 1571 |
+
"        bottom += counts\n",
|
| 1572 |
+
"\n",
|
| 1573 |
+
"    x_min = 5\n",
|
| 1574 |
+
"    x_max = filtered_data['rounded_age'].max()\n",
|
| 1575 |
+
"    ax.set_xticks(np.arange(x_min, x_max + 0.5, 5))\n",
|
| 1576 |
+
"    ax.set_xticklabels([f'{int(age)}' for age in np.arange(x_min, x_max + 0.5, 5)], fontsize=12, fontweight='bold')\n",
|
| 1577 |
+
"    ax.set_xticks(np.arange(x_min, x_max + 0.5, 0.5), minor=True)\n",
|
| 1578 |
+
"\n",
|
| 1579 |
+
"    y_min = 0\n",
|
| 1580 |
+
"    y_max = bottom.max() + 100\n",
|
| 1581 |
+
"    y_ticks = np.arange(y_min, y_max + 1, 100)  # Fine-grained steps of 100\n",
|
| 1582 |
+
"    ax.set_yticks(y_ticks)\n",
|
| 1583 |
+
"    ax.set_yticklabels([str(int(y)) for y in y_ticks], fontsize=12, fontweight='bold')\n",
|
| 1584 |
+
"\n",
|
| 1585 |
+
"    ax.grid(True, which='major', color='lightgrey', linestyle='-', linewidth=0.5)\n",
|
| 1586 |
+
"    ax.grid(True, which='minor', color='lightgrey', linestyle=':', linewidth=0.5)\n",
|
| 1587 |
+
"\n",
|
| 1588 |
+
"    ax.spines['top'].set_visible(False)\n",
|
| 1589 |
+
"    ax.spines['right'].set_visible(False)\n",
|
| 1590 |
+
"    ax.spines['left'].set_visible(False)\n",
|
| 1591 |
+
"    ax.spines['bottom'].set_visible(False)\n",
|
| 1592 |
+
"    \n",
|
| 1593 |
+
"    ax.set_xlabel('Age', fontsize=12, fontweight='bold')\n",
|
| 1594 |
+
"    if gender_label == 'Female':\n",
|
| 1595 |
+
"        ax.set_ylabel('Number of Subjects', fontsize=14, fontweight='bold')\n",
|
| 1596 |
+
"    ax.set_title(f'{gender_label} Read', fontsize=14, fontweight='bold')\n",
|
| 1597 |
+
"\n",
|
| 1598 |
+
"# Set up the subplots\n",
|
| 1599 |
+
"fig, axes = plt.subplots(1, 2, figsize=(14, 6.5), sharey=True)\n",
|
| 1600 |
+
"\n",
|
| 1601 |
+
"plot_gender_data(axes[0], aggregated_data.xs('male', level='gender'), 'Male')\n",
|
| 1602 |
+
"plot_gender_data(axes[1], aggregated_data.xs('female', level='gender'), 'Female')\n",
|
| 1603 |
+
"\n",
|
| 1604 |
+
"legend_patches = [\n",
|
| 1605 |
+
"    mpatches.Patch(facecolor='blueviolet', edgecolor='blueviolet', linewidth=2, label='Level 16: XXX'),\n",
|
| 1606 |
+
"    mpatches.Patch(facecolor='crimson', edgecolor='crimson', linewidth=2, label='Level 8: X'),\n",
|
| 1607 |
+
"    mpatches.Patch(facecolor='coral', edgecolor='coral', linewidth=2, label='Level 4: Mature'),\n",
|
| 1608 |
+
"    mpatches.Patch(facecolor='rosybrown', edgecolor='rosybrown', linewidth=2, label='Level 2: Soft'),\n",
|
| 1609 |
+
"    mpatches.Patch(facecolor='silver', edgecolor='silver', linewidth=2, label='Level 1: SFW'),\n",
|
| 1610 |
+
"    mpatches.Patch(facecolor='none', edgecolor='none', label=f'n images: {n_images}', alpha=0),\n",
|
| 1611 |
+
"    mpatches.Patch(facecolor='none', edgecolor='none', label=f'Total persons detected: {total_persons_detected}', alpha=0),\n",
|
| 1612 |
+
"    mpatches.Patch(facecolor='none', edgecolor='none', label=f'Unique images containing persons: {n_persons_detected}', alpha=0),\n",
|
| 1613 |
+
"]\n",
|
| 1614 |
+
"\n",
|
| 1615 |
+
"axes[0].legend(handles=legend_patches, title=\"Browsing Levels\", loc='upper left', fontsize=12, title_fontsize=12, frameon=True)\n",
|
| 1616 |
+
"plt.tight_layout()\n",
|
| 1617 |
+
"plt.savefig(f'{plot_dir}/mivolo.svg', format='svg', bbox_inches='tight')\n",
|
| 1618 |
+
"\n",
|
| 1619 |
+
"# Count images with at least one person\n",
|
| 1620 |
+
"images_with_one_or_more_persons = all_data[all_data['n_persons'] > 0]['id'].nunique()\n",
|
| 1621 |
+
"\n",
|
| 1622 |
+
"# Count unique images in `person_data`\n",
|
| 1623 |
+
"n_persons_detected = person_data['id'].nunique()\n",
|
| 1624 |
+
"\n",
|
| 1625 |
+
"# Count total persons detected\n",
|
| 1626 |
+
"total_persons = person_data.shape[0] # This counts all detected persons\n",
|
| 1627 |
+
"\n",
|
| 1628 |
+
"# Display potential inconsistencies\n",
|
| 1629 |
+
"print(f\"Total images: {n_images}\")\n",
|
| 1630 |
+
"print(f\"Images with at least one person: {images_with_one_or_more_persons}\")\n",
|
| 1631 |
+
"print(f\"Unique images in `person_data`: {n_persons_detected}\")\n",
|
| 1632 |
+
"print(f\"Total number of persons detected: {total_persons}\")\n",
|
| 1633 |
+
"\n",
|
| 1634 |
+
"\n",
|
| 1635 |
+
"\n",
|
| 1636 |
+
"plt.show()"
|
| 1637 |
+
]
|
| 1638 |
+
},
|
| 1639 |
+
{
|
| 1640 |
+
"cell_type": "markdown",
|
| 1641 |
+
"id": "42b2a557-b8f4-4d0f-8907-98e3012a1b34",
|
| 1642 |
+
"metadata": {
|
| 1643 |
+
"execution": {
|
| 1644 |
+
"iopub.execute_input": "2025-02-06T20:01:54.848400Z",
|
| 1645 |
+
"iopub.status.busy": "2025-02-06T20:01:54.847713Z",
|
| 1646 |
+
"iopub.status.idle": "2025-02-06T20:01:54.851533Z",
|
| 1647 |
+
"shell.execute_reply": "2025-02-06T20:01:54.851102Z",
|
| 1648 |
+
"shell.execute_reply.started": "2025-02-06T20:01:54.848376Z"
|
| 1649 |
+
}
|
| 1650 |
+
},
|
| 1651 |
+
"source": [
|
| 1652 |
+
"### LaTeX Table"
|
| 1653 |
+
]
|
| 1654 |
+
},
|
| 1655 |
+
{
|
| 1656 |
+
"cell_type": "code",
|
| 1657 |
+
"execution_count": null,
|
| 1658 |
+
"id": "3e506c41-6497-4ece-99f3-73f09fe1129e",
|
| 1659 |
+
"metadata": {},
|
| 1660 |
+
"outputs": [],
|
| 1661 |
+
"source": [
"import os\n",
"import pandas as pd\n",
"\n",
"# Define the directory containing CSV files\n",
"directory_path = current_dir.parent / 'data/CSV/image_subsets/MiVOLO-results/'  # Update this path with the actual directory path\n",
"\n",
"# Prepare data for LaTeX table\n",
"table_rows = []\n",
"\n",
"# Loop through each file in the directory\n",
"for file_name in os.listdir(directory_path):\n",
"    if file_name.endswith('.csv'):\n",
"        file_path = os.path.join(directory_path, file_name)\n",
"        print(f\"Processing file: {file_name}\")\n",
"\n",
"        # Load the data\n",
"        data = pd.read_csv(file_path)\n",
"\n",
"        # Total images analyzed\n",
"        total_images = data['id'].nunique()\n",
"\n",
"        # Count of images with no persons detected\n",
"        images_no_persons = data[data['n_persons'] == 0]['id'].nunique()\n",
"\n",
"        # Total persons detected (only using \"person\" detection type)\n",
"        total_persons_count = data[data['detection_type'] == 'person'].shape[0]\n",
"\n",
"        # Average age and standard deviation for male and female individuals\n",
"        male_age_stats = data[data['gender'] == 'male']['age'].agg(['mean', 'std']).fillna(0)\n",
"        female_age_stats = data[data['gender'] == 'female']['age'].agg(['mean', 'std']).fillna(0)\n",
"\n",
"        # Count of unique images containing female-read and male-read subjects\n",
"        female_images_count = data[data['gender'] == 'female']['id'].nunique()\n",
"        male_images_count = data[data['gender'] == 'male']['id'].nunique()\n",
"\n",
"        # Female to male ratio (by unique images)\n",
"        female_to_male_ratio = female_images_count / male_images_count if male_images_count else None\n",
"\n",
"        # Browsing level analysis for females\n",
"        female_browsing_level_1 = data[(data['gender'] == 'female') & (data['browsingLevel'] == 1)]['id'].nunique()\n",
"        female_browsing_level_2_16 = data[(data['gender'] == 'female') & (data['browsingLevel'].isin([2, 4, 8, 16]))]['id'].nunique()\n",
"\n",
"        female_browsing_level_1_percentage = (female_browsing_level_1 / female_images_count * 100) if female_images_count else 0\n",
"        female_browsing_level_2_16_percentage = (female_browsing_level_2_16 / female_images_count * 100) if female_images_count else 0\n",
"\n",
"        # Browsing level analysis for males\n",
"        male_browsing_level_1 = data[(data['gender'] == 'male') & (data['browsingLevel'] == 1)]['id'].nunique()\n",
"        male_browsing_level_2_16 = data[(data['gender'] == 'male') & (data['browsingLevel'].isin([2, 4, 8, 16]))]['id'].nunique()\n",
"\n",
"        male_browsing_level_1_percentage = (male_browsing_level_1 / male_images_count * 100) if male_images_count else 0\n",
"        male_browsing_level_2_16_percentage = (male_browsing_level_2_16 / male_images_count * 100) if male_images_count else 0\n",
"\n",
"        # Add row to table data\n",
"        table_rows.append([\n",
"            file_name.replace('.csv', ''),  # Remove file extension\n",
"            total_images,\n",
"            total_persons_count,\n",
"            images_no_persons,\n",
"            f\"{female_browsing_level_1_percentage:.2f}\",\n",
"            f\"{female_browsing_level_2_16_percentage:.2f}\",\n",
"            f\"{male_browsing_level_1_percentage:.2f}\",\n",
"            f\"{male_browsing_level_2_16_percentage:.2f}\",\n",
"            f\"{female_to_male_ratio:.2f}\" if female_to_male_ratio is not None else \"N/A\",\n",
"            f\"{female_age_stats['mean']:.2f} ({female_age_stats['std']:.2f})\",\n",
"            f\"{male_age_stats['mean']:.2f} ({male_age_stats['std']:.2f})\"\n",
"        ])\n",
"\n",
"# Sort table rows by the filename (assumes filenames are formatted with sortable dates)\n",
"table_rows = sorted(table_rows, key=lambda x: x[0])\n",
"\n",
"# Generate LaTeX table\n",
"latex_table = r\"\"\"\n",
"\\begin{table}[H]\n",
"\\centering\n",
"\\scriptsize\n",
"\\renewcommand{\\arraystretch}{0.9}\n",
"\\caption{Summary of Image Classification for 2023-2024}\n",
"\\label{table:image_classification_2023_2024}\n",
"\\begin{tabular}{lrrrrrrrrrr}\n",
"\\toprule\n",
"File Name & Total Images & Total Persons & No Persons & \\multicolumn{2}{c}{Female (\\%)} & \\multicolumn{2}{c}{Male (\\%)} & Female:Male & Female Age (Mean ± SD) & Male Age (Mean ± SD) \\\\\n",
" & & & & L1 & L2-16 & L1 & L2-16 & & & \\\\\n",
"\\midrule\n",
"\"\"\"\n",
"for row in table_rows:\n",
"    # End each row with a LaTeX row break and a real newline (non-raw string)\n",
"    latex_table += \" & \".join(map(str, row)) + \" \\\\\\\\\\n\"\n",
"\n",
"latex_table += r\"\"\"\n",
"\\bottomrule\n",
"\\end{tabular}\n",
"\\vspace{1em}\n",
"\\noindent\n",
"\\textbf{Disclaimer:} \\(\\female\\) and \\(\\male\\) refer to female-read and male-read classifications as determined by the MiVOLO system's weights.\n",
"We acknowledge the complexities of gender presentations and stress that these terms do not necessarily correspond to biological sex.\n",
"\\end{table}\n",
"\"\"\"\n",
"\n",
"# Output LaTeX table\n",
"print(\"\\nGenerated LaTeX Table:\")\n",
"print(latex_table)\n",
"\n"
|
| 1763 |
+
]
|
| 1764 |
+
},
|
| 1765 |
+
{
|
| 1766 |
+
"cell_type": "code",
|
| 1767 |
+
"execution_count": null,
|
| 1768 |
+
"id": "3ef428be-856b-4c4c-b1b0-a052d181d03b",
|
| 1769 |
+
"metadata": {},
|
| 1770 |
+
"outputs": [],
|
| 1771 |
+
"source": []
|
| 1772 |
+
}
|
| 1773 |
+
],
|
| 1774 |
+
"metadata": {
|
| 1775 |
+
"kernelspec": {
|
| 1776 |
+
"display_name": "HORDE",
|
| 1777 |
+
"language": "python",
|
| 1778 |
+
"name": "horde"
|
| 1779 |
+
},
|
| 1780 |
+
"language_info": {
|
| 1781 |
+
"codemirror_mode": {
|
| 1782 |
+
"name": "ipython",
|
| 1783 |
+
"version": 3
|
| 1784 |
+
},
|
| 1785 |
+
"file_extension": ".py",
|
| 1786 |
+
"mimetype": "text/x-python",
|
| 1787 |
+
"name": "python",
|
| 1788 |
+
"nbconvert_exporter": "python",
|
| 1789 |
+
"pygments_lexer": "ipython3",
|
| 1790 |
+
"version": "3.12.4"
|
| 1791 |
+
}
|
| 1792 |
+
},
|
| 1793 |
+
"nbformat": 4,
|
| 1794 |
+
"nbformat_minor": 5
|
| 1795 |
+
}
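The LaTeX cell in the notebook above builds each table row by joining the cell values with ` & ` and ending the row with a LaTeX row break. A minimal sketch of that step follows. The helper names `escape_latex` and `rows_to_latex` are hypothetical and do not appear in the notebook, and the escaping of underscores in file names is an added assumption about what the CSV file names may contain.

```python
def escape_latex(value) -> str:
    """Escape a few characters that LaTeX treats specially in table cells."""
    text = str(value)
    for char in ("&", "%", "_", "#"):
        text = text.replace(char, "\\" + char)
    return text


def rows_to_latex(rows) -> str:
    """Join each row with ' & ' and terminate it with a LaTeX row break."""
    lines = [
        " & ".join(escape_latex(cell) for cell in row) + " \\\\"
        for row in rows
    ]
    return "\n".join(lines) + "\n"


# Illustrative row only; these values are not taken from the dataset.
print(rows_to_latex([["2023_01", 100, "51.20"]]))
```

Using a non-raw string for the row terminator keeps the two backslashes and the newline distinct, which is easy to get wrong when mixing raw and regular string literals.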
|
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-1_Tag_occurences-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,6 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [],
|
| 3 |
+
"metadata": {},
|
| 4 |
+
"nbformat": 4,
|
| 5 |
+
"nbformat_minor": 5
|
| 6 |
+
}
|
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Bloomz_query-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,370 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "code",
|
| 5 |
+
"execution_count": null,
|
| 6 |
+
"id": "3b87c378-241e-41ab-be6e-84222594f22f",
|
| 7 |
+
"metadata": {},
|
| 8 |
+
"outputs": [],
|
| 9 |
+
"source": [
"import pandas as pd\n",
"import json\n",
"import time\n",
"import re\n",
"from pathlib import Path\n",
"from tqdm import tqdm\n",
"import torch\n",
"from transformers import AutoModelForCausalLM, AutoTokenizer\n",
"\n",
"# pandas is also used for the pd.notna() and pd.isna() checks below\n",
"\n",
"current_dir = Path.cwd()\n",
"input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
"\n",
"# === CONFIGURATION ===\n",
"TEST_MODE = True\n",
"TEST_SIZE = 10\n",
"MAX_ROWS = 20000\n",
"SAVE_INTERVAL = 10\n",
"\n",
"# Model settings - BLOOMZ (BigScience open research collaboration)\n",
"MODEL_NAME = \"bigscience/bloomz-7b1\"  # 7.1B-parameter instruction-tuned BLOOM variant\n",
"CACHE_DIR = current_dir.parent / \"data/models\"\n",
"CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
"\n",
"PROFESSION_CATEGORIES = [\n",
"    \"actor\", \"adult performer\", \"singer/musician\", \"model\",\n",
"    \"online personality\", \"public figure\", \"voice actor/ASMR\",\n",
"    \"sports professional\", \"tv personality\"\n",
"]\n",
"\n",
"# === LOAD MODEL ===\n",
"print(f\"Loading model: {MODEL_NAME}\")\n",
"print(f\"Cache directory: {CACHE_DIR}\")\n",
"print(f\"This may take a while on first run (~14GB download)...\\n\")\n",
"\n",
"# Check GPU availability\n",
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
"print(f\"Device: {device}\")\n",
"\n",
"if device == \"cpu\":\n",
"    print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
"    print(\"   Consider using a GPU or reducing model size.\")\n",
"\n",
"# Load tokenizer\n",
"print(\"Loading tokenizer...\")\n",
"try:\n",
"    tokenizer = AutoTokenizer.from_pretrained(\n",
"        MODEL_NAME,\n",
"        cache_dir=str(CACHE_DIR)\n",
"    )\n",
"    print(\"✅ Tokenizer loaded\")\n",
"except Exception as e:\n",
"    print(f\"❌ Error loading tokenizer: {e}\")\n",
"    raise\n",
"\n",
"# Ensure pad token is set\n",
"if tokenizer.pad_token is None:\n",
"    tokenizer.pad_token = tokenizer.eos_token\n",
"    print(f\"Set pad_token to eos_token: {tokenizer.eos_token}\")\n",
"\n",
"# Load model with optimizations\n",
"print(\"Loading model (this may take several minutes)...\")\n",
"try:\n",
"    model = AutoModelForCausalLM.from_pretrained(\n",
"        MODEL_NAME,\n",
"        cache_dir=str(CACHE_DIR),\n",
"        torch_dtype=torch.bfloat16,  # Use BF16 for efficiency\n",
"        device_map=\"auto\",  # Automatically distribute across GPUs\n",
"        low_cpu_mem_usage=True  # Optimize memory usage\n",
"    )\n",
"    model.eval()  # Set to evaluation mode\n",
"    print(\"✅ Model loaded\")\n",
"except Exception as e:\n",
"    print(f\"❌ Error loading model: {e}\")\n",
"    raise\n",
"\n",
"# Check VRAM usage\n",
"if torch.cuda.is_available():\n",
"    vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
"    print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
"\n",
"# === LOAD DATA ===\n",
"df = pd.read_csv(input_file)\n",
"print(f\"Loaded {len(df)} rows\")\n",
"\n",
"if TEST_MODE:\n",
"    print(f\"Running in TEST MODE with {TEST_SIZE} samples\")\n",
"    df = df.head(TEST_SIZE).copy()\n",
"elif MAX_ROWS:\n",
"    df = df.head(MAX_ROWS).copy()\n",
"\n",
"# === CREATE PROMPT (Exact DeepSeek style) ===\n",
"def create_prompt(row):\n",
"    \"\"\"Create prompt.\"\"\"\n",
"    name = row.get('real_name', row.get('name', ''))\n",
"    if pd.isna(name):\n",
"        name = row.get('name', '')\n",
"\n",
"    # Gather hints exactly like DeepSeek version\n",
"    hints = []\n",
"    if pd.notna(row.get('likely_profession')):\n",
"        hints.append(str(row['likely_profession']))\n",
"    if pd.notna(row.get('likely_nationality')):\n",
"        hints.append(str(row['likely_nationality']))\n",
"    if pd.notna(row.get('likely_country')):\n",
"        hints.append(str(row['likely_country']))\n",
"\n",
"    # Add tags if we don't have enough hints\n",
"    if len(hints) < 3:\n",
"        for i in range(1, 8):\n",
"            tag_col = f'tag_{i}'\n",
"            if tag_col in row and pd.notna(row[tag_col]):\n",
"                tag_val = str(row[tag_col])\n",
"                if tag_val not in hints:\n",
"                    hints.append(tag_val)\n",
"                if len(hints) >= 5:\n",
"                    break\n",
"\n",
"    hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
"\n",
"    return f\"\"\"Given '{name}' ({hint_text}), provide:\n",
"1. Full legal name (Western order if non-latin script)\n",
"2. Any stage names/aliases (comma separated)\n",
"3. Gender (Male/Female/Other/Unknown)\n",
"4. Top 3 most likely professions from ONLY these categories:\n",
"   - actor\n",
"   - adult performer\n",
"   - singer/musician\n",
"   - model\n",
"   - online personality (includes streamers, cosplayers, influencers)\n",
"   - public figure (includes politicians, activists, journalists, authors)\n",
"   - voice actor/ASMR\n",
"   - sports professional\n",
"   - tv personality (includes hosts, presenters, reality TV)\n",
"\n",
"5. Primary country associated\n",
"\n",
"IMPORTANT:\n",
"- Choose professions ONLY from the 9 categories above\n",
"- Provide up to 3 professions, comma-separated, ordered by relevance\n",
"- Be SPECIFIC: choose the most accurate category for each role\n",
"- \"online personality\" includes: streamers, cosplayers, YouTubers, influencers, content creators\n",
"- Use 'Unknown' when uncertain or for fictional characters/places\n",
"- For multi-role people, list all relevant categories (e.g., \"actor, singer/musician, online personality\")\n",
"\n",
"Respond with exactly 5 numbered lines.\"\"\"\n",
"\n",
"df['prompt'] = df.apply(create_prompt, axis=1)\n",
"\n",
"# === QUERY BLOOMZ LOCAL ===\n",
"def query_bloomz_local(prompt: str) -> str:\n",
"    \"\"\"Query BLOOMZ-7B1 locally via transformers, return raw response string.\"\"\"\n",
"    try:\n",
"        # BLOOMZ works better with instruction-response format\n",
"        full_prompt = f\"\"\"Instruction: Extract key data on a person based on the name and hints.\n",
"You must respond with exactly 5 numbered lines in this format:\n",
"1. Full legal name\n",
"2. Stage names/aliases\n",
"3. Gender\n",
"4. Professions (comma-separated, choose ONLY from: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality)\n",
"5. Country\n",
"\n",
"{prompt}\n",
"\n",
"Response:\"\"\"\n",
"\n",
"        inputs = tokenizer(\n",
"            full_prompt,\n",
"            return_tensors=\"pt\",\n",
"            truncation=True,\n",
"            max_length=2048\n",
"        ).to(device)\n",
"\n",
"        # Generate with adjusted parameters for BLOOMZ\n",
"        with torch.no_grad():\n",
"            outputs = model.generate(\n",
"                **inputs,\n",
"                max_new_tokens=256,\n",
"                temperature=0.3,  # Increased for more variability\n",
"                do_sample=True,\n",
"                top_p=0.9,\n",
"                top_k=40,\n",
"                repetition_penalty=1.1,\n",
"                pad_token_id=tokenizer.eos_token_id,  # Use EOS as pad token\n",
"                eos_token_id=tokenizer.eos_token_id\n",
"            )\n",
"\n",
"        # Decode the entire output to see what's happening\n",
"        full_output = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
"\n",
"        # Extract only the generated part by slicing off the prompt tokens\n",
"        prompt_length = inputs['input_ids'].shape[1]\n",
"        generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)\n",
"\n",
"        # Debug output for the first three queries\n",
"        if not hasattr(query_bloomz_local, 'debug_count'):\n",
"            query_bloomz_local.debug_count = 0\n",
"\n",
"        if query_bloomz_local.debug_count < 3:\n",
"            print(f\"\\n📝 BLOOMZ Debug #{query_bloomz_local.debug_count + 1}:\")\n",
"            print(f\"Prompt: {full_prompt[:200]}...\")\n",
"            print(f\"Full output: {full_output[:500]}...\")\n",
"            print(f\"Generated text: {generated_text}\")\n",
"            print(f\"{'='*60}\\n\")\n",
"            query_bloomz_local.debug_count += 1\n",
"\n",
"        return generated_text.strip()\n",
"\n",
"    except Exception as e:\n",
"        print(f\"Error querying BLOOMZ: {e}\")\n",
"        return None\n",
"\n",
"# === PARSE RESPONSE (Exact DeepSeek format) ===\n",
"def parse_response(response):\n",
"    \"\"\"Parse numbered response into structured fields.\"\"\"\n",
"    if not response:\n",
"        return {\n",
"            'full_name': 'Unknown',\n",
"            'aliases': 'Unknown',\n",
"            'gender': 'Unknown',\n",
"            'profession_llm': 'Unknown',\n",
"            'country': 'Unknown'\n",
"        }\n",
"\n",
"    # Split into lines and clean\n",
"    lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
"\n",
"    # Initialize with Unknown values\n",
"    fields = {\n",
"        'full_name': 'Unknown',\n",
"        'aliases': 'Unknown',\n",
"        'gender': 'Unknown',\n",
"        'profession_llm': 'Unknown',\n",
"        'country': 'Unknown'\n",
"    }\n",
"\n",
"    # Extract information from each numbered line\n",
"    for line in lines:\n",
"        if line.startswith('1.') or line.startswith('1)'):\n",
"            fields['full_name'] = line[2:].strip()\n",
"        elif line.startswith('2.') or line.startswith('2)'):\n",
"            fields['aliases'] = line[2:].strip()\n",
"        elif line.startswith('3.') or line.startswith('3)'):\n",
"            fields['gender'] = line[2:].strip()\n",
"        elif line.startswith('4.') or line.startswith('4)'):\n",
"            fields['profession_llm'] = line[2:].strip()\n",
"        elif line.startswith('5.') or line.startswith('5)'):\n",
"            fields['country'] = line[2:].strip()\n",
"\n",
"    return fields\n",
"\n",
"# === PROCESS ===\n",
"output_file = current_dir.parent / f\"data/CSV/bloomz_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
"index_file = current_dir.parent / \"misc/bloomz_query_index.txt\"\n",
"\n",
"current_index = 0\n",
"if index_file.exists():\n",
"    with open(index_file) as f:\n",
"        current_index = int(f.read().strip())\n",
"    print(f\"Resuming from index {current_index}\")\n",
"    # Reload earlier annotations so a resumed run does not overwrite them with 'Unknown'\n",
"    if output_file.exists():\n",
"        df = pd.read_csv(output_file)\n",
"\n",
"# Initialize columns (same as DeepSeek)\n",
"for col in ['full_name', 'gender', 'profession_llm', 'country', 'aliases']:\n",
"    if col not in df.columns:\n",
"        df[col] = 'Unknown'\n",
"\n",
"# Create prompts for all rows (same as DeepSeek)\n",
"print(\"Creating prompts...\")\n",
"df['prompt'] = df.apply(create_prompt, axis=1)\n",
"\n",
"print(f\"\\nAnnotating with BLOOMZ-7B1 LOCAL - rows {current_index} to {len(df)}...\")\n",
"print(f\"Model: {MODEL_NAME}\")\n",
"print(f\"This may take a while...\\n\")\n",
"\n",
"try:\n",
"    start_time = time.time()\n",
"\n",
"    for i in tqdm(range(current_index, len(df)), desc=\"Annotating\"):\n",
"        row = df.iloc[i]\n",
"\n",
"        # Query BLOOMZ (equivalent to DeepSeek query)\n",
"        response = query_bloomz_local(row['prompt'])\n",
"        parsed_data = parse_response(response)\n",
"\n",
"        # Update dataframe\n",
"        for key, value in parsed_data.items():\n",
"            df.at[i, key] = value\n",
"\n",
"        current_index = i + 1\n",
"\n",
"        # Save progress at intervals\n",
"        if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
"            df.to_csv(output_file, index=False)\n",
"            with open(index_file, 'w') as f:\n",
"                f.write(str(current_index))\n",
"            print(f\"✅ Progress saved after {i+1} rows\")\n",
"\n",
"        # Optional: Add small delay to prevent overheating (not needed for rate limiting like DeepSeek)\n",
"        # time.sleep(0.1)\n",
"\n",
"    elapsed_total = time.time() - start_time\n",
"    print(f\"\\n✅ Done! Final results saved to {output_file}\")\n",
"\n",
"    # Summary statistics (same as DeepSeek)\n",
"    print(\"\\n=== Summary Statistics ===\")\n",
"    print(f\"Total processed: {len(df)}\")\n",
"    print(f\"\\nGender distribution:\")\n",
"    print(df['gender'].value_counts())\n",
"    print(f\"\\nTop 10 profession combinations:\")\n",
"    print(df['profession_llm'].value_counts().head(10))\n",
"    print(f\"\\nTop 10 countries:\")\n",
"    print(df['country'].value_counts().head(10))\n",
"\n",
"    # Sample results\n",
"    print(\"\\n=== Sample Results ===\")\n",
"    display_cols = ['real_name', 'full_name', 'gender', 'profession_llm', 'country']\n",
"    available_cols = [col for col in display_cols if col in df.columns]\n",
"    print(df[available_cols].head(10).to_string(index=False))\n",
"\n",
"    # Additional info for local model\n",
"    print(f\"\\nTotal time: {elapsed_total/60:.1f} minutes\")\n",
"    print(f\"Average speed: {len(df)/(elapsed_total/3600):.1f} samples/hour\")\n",
"    if torch.cuda.is_available():\n",
"        print(f\"Final VRAM usage: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB\")\n",
"\n",
"except Exception as e:\n",
"    print(f\"⚠️ Error encountered: {e}\")\n",
"    print(f\"⚠️ Last processed index: {current_index}\")\n",
"\n",
"    # Save progress before exiting\n",
"    df.to_csv(output_file, index=False)\n",
"    with open(index_file, 'w') as f:\n",
"        f.write(str(current_index))\n",
"\n",
"    print(f\"⚠️ Progress saved up to row {current_index}\")"
|
| 346 |
+
]
|
| 347 |
+
}
|
| 348 |
+
],
|
| 349 |
+
"metadata": {
|
| 350 |
+
"kernelspec": {
|
| 351 |
+
"display_name": "pm-paper",
|
| 352 |
+
"language": "python",
|
| 353 |
+
"name": "pm-paper"
|
| 354 |
+
},
|
| 355 |
+
"language_info": {
|
| 356 |
+
"codemirror_mode": {
|
| 357 |
+
"name": "ipython",
|
| 358 |
+
"version": 3
|
| 359 |
+
},
|
| 360 |
+
"file_extension": ".py",
|
| 361 |
+
"mimetype": "text/x-python",
|
| 362 |
+
"name": "python",
|
| 363 |
+
"nbconvert_exporter": "python",
|
| 364 |
+
"pygments_lexer": "ipython3",
|
| 365 |
+
"version": "3.11.13"
|
| 366 |
+
}
|
| 367 |
+
},
|
| 368 |
+
"nbformat": 4,
|
| 369 |
+
"nbformat_minor": 5
|
| 370 |
+
}
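The `parse_response` function in the notebook above maps each numbered line of the model output to a field by checking string prefixes. The same idea can be sketched with a single regular expression, which also accepts `1:` as a separator and ignores lines such as `10.`. The name `parse_numbered_response` and the constants below are hypothetical and not part of the notebook; this is an illustrative variant under those assumptions, not the authors' implementation.

```python
import re

# Field names in the order the prompt asks for them (lines 1 to 5).
FIELD_ORDER = ["full_name", "aliases", "gender", "profession_llm", "country"]

# A line number from 1 to 5, then '.', ')' or ':', then the value.
LINE_PATTERN = re.compile(r"^\s*([1-5])[.):]\s*(.*)$")


def parse_numbered_response(response):
    """Return a dict of the five fields, defaulting each one to 'Unknown'."""
    fields = {name: "Unknown" for name in FIELD_ORDER}
    if not response:
        return fields
    for line in response.splitlines():
        match = LINE_PATTERN.match(line)
        if match and match.group(2).strip():
            field_name = FIELD_ORDER[int(match.group(1)) - 1]
            fields[field_name] = match.group(2).strip()
    return fields


# Illustrative input only; not real model output.
sample = "1. Jane Doe\n2) JD\n3: Female\n4. actor, model\n5. USA"
print(parse_numbered_response(sample))
```

If a number appears more than once in the response, the last matching line wins, which differs slightly from stopping at the first match; either choice may be reasonable depending on how the model tends to repeat itself.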
|
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_Step_1_LLM_annotation-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,1941 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "23d0ae58",
|
| 6 |
+
"metadata": {},
|
| 7 |
+
"source": [
|
| 8 |
+
"# Deepfake Adapter Dataset - LLM Annotation Pipeline"
|
| 9 |
+
]
|
| 10 |
+
},
|
| 11 |
+
{
|
| 12 |
+
"cell_type": "markdown",
|
| 13 |
+
"id": "e4407358",
|
| 14 |
+
"metadata": {},
|
| 15 |
+
"source": [
|
| 16 |
+
"### Unified Model Loading & Inference\n",
|
| 17 |
+
"Code for querying Mistral, Gemma, and Qwen models."
|
| 18 |
+
]
|
| 19 |
+
},
|
| 20 |
+
{
|
| 21 |
+
"cell_type": "markdown",
|
| 22 |
+
"id": "1a1b9d0e",
|
| 23 |
+
"metadata": {},
|
| 24 |
+
"source": [
|
| 25 |
+
"## CLEANING & PREPROCESSING"
|
| 26 |
+
]
|
| 27 |
+
},
|
| 28 |
+
{
|
| 29 |
+
"cell_type": "markdown",
|
| 30 |
+
"id": "3df42c46",
|
| 31 |
+
"metadata": {},
|
| 32 |
+
"source": [
|
| 33 |
+
"#### Named Entity Recognition (NER) using spaCy"
|
| 34 |
+
]
|
| 35 |
+
},
|
| 36 |
+
{
|
| 37 |
+
"cell_type": "code",
|
| 38 |
+
"execution_count": null,
|
| 39 |
+
"id": "a287eef4",
|
| 40 |
+
"metadata": {},
|
| 41 |
+
"outputs": [],
|
| 42 |
+
"source": [
|
| 43 |
+
"import pandas as pd\n",
|
| 44 |
+
"import re\n",
|
| 45 |
+
"from pathlib import Path\n",
|
| 46 |
+
"import emoji\n",
|
| 47 |
+
"import spacy\n",
|
| 48 |
+
"\n",
|
| 49 |
+
"# Load spaCy model\n",
|
| 50 |
+
"# You may need to download it first: python -m spacy download en_core_web_sm\n",
|
| 51 |
+
"try:\n",
|
| 52 |
+
" nlp = spacy.load(\"en_core_web_sm\")\n",
|
| 53 |
+
" print(\"✅ spaCy model loaded: en_core_web_sm\")\n",
|
| 54 |
+
"except OSError:\n",
|
| 55 |
+
" print(\"❌ spaCy model not found. Downloading...\")\n",
|
| 56 |
+
"    import subprocess, sys\n",
|
| 57 |
+
"    subprocess.run([sys.executable, \"-m\", \"spacy\", \"download\", \"en_core_web_sm\"], check=True)\n",
|
| 58 |
+
" nlp = spacy.load(\"en_core_web_sm\")\n",
|
| 59 |
+
" print(\"✅ spaCy model downloaded and loaded\")\n",
|
| 60 |
+
"\n",
|
| 61 |
+
"# Set up paths\n",
|
| 62 |
+
"current_dir = Path.cwd()\n",
|
| 63 |
+
"#input_file = current_dir.parent / \"data/CSV/real_person_adapters.csv\"\n",
|
| 64 |
+
"input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter.csv\"\n",
|
| 65 |
+
"\n",
|
| 66 |
+
"# Load dataset\n",
|
| 67 |
+
"df = pd.read_csv(input_file)\n",
|
| 68 |
+
"print(f\"Loaded {len(df)} rows\")\n",
|
| 69 |
+
"\n",
|
| 70 |
+
"def translate_leetspeak(text: str) -> str:\n",
|
| 71 |
+
" \"\"\"\n",
|
| 72 |
+
" Translate common leetspeak patterns to normal letters.\n",
|
| 73 |
+
" Examples: 4kira -> Akira, 3mma -> Emma, 1rene -> Irene\n",
|
| 74 |
+
" \"\"\"\n",
|
| 75 |
+
" if not text:\n",
|
| 76 |
+
" return text\n",
|
| 77 |
+
" \n",
|
| 78 |
+
" # Common leetspeak mappings (order matters!)\n",
|
| 79 |
+
" leetspeak_map = {\n",
|
| 80 |
+
" '4': 'a',\n",
|
| 81 |
+
" '3': 'e', \n",
|
| 82 |
+
" '1': 'i',\n",
|
| 83 |
+
" '0': 'o',\n",
|
| 84 |
+
" '7': 't',\n",
|
| 85 |
+
" '5': 's',\n",
|
| 86 |
+
" '8': 'b',\n",
|
| 87 |
+
" '9': 'g',\n",
|
| 88 |
+
" '@': 'a',\n",
|
| 89 |
+
" '$': 's',\n",
|
| 90 |
+
" '!': 'i',\n",
|
| 91 |
+
" }\n",
|
| 92 |
+
" \n",
|
| 93 |
+
" result = text\n",
|
| 94 |
+
"    # Apply mappings at the start of a word or between two letters\n",
|
| 95 |
+
"    for leet, normal in leetspeak_map.items():\n",
|
| 96 |
+
"        # Replace at start of word (not preceded by a letter/digit, followed by a letter), capitalized\n",
|
| 97 |
+
"        result = re.sub(rf'(?<![A-Za-z0-9]){re.escape(leet)}(?=[A-Za-z])', normal.upper(), result)\n",
|
| 98 |
+
"        # Replace leet characters that sit between two letters\n",
|
| 99 |
+
"        result = re.sub(rf'(?<=[a-z]){re.escape(leet)}(?=[a-z])', normal, result, flags=re.IGNORECASE)\n",
|
| 100 |
+
" \n",
|
| 101 |
+
" return result\n",
|
| 102 |
+
"\n",
|
| 103 |
+
"def preprocess_for_ner(name: str) -> str:\n",
|
| 104 |
+
" \"\"\"\n",
|
| 105 |
+
" Preprocess the name before spaCy NER.\n",
|
| 106 |
+
" Remove noise but keep the actual name parts.\n",
|
| 107 |
+
" \"\"\"\n",
|
| 108 |
+
" if pd.isna(name):\n",
|
| 109 |
+
" return \"\"\n",
|
| 110 |
+
" \n",
|
| 111 |
+
" name = str(name)\n",
|
| 112 |
+
" \n",
|
| 113 |
+
" # FIRST: Translate leetspeak\n",
|
| 114 |
+
" name = translate_leetspeak(name)\n",
|
| 115 |
+
" \n",
|
| 116 |
+
" # Remove emoji\n",
|
| 117 |
+
" name = emoji.replace_emoji(name, replace=' ')\n",
|
| 118 |
+
" \n",
|
| 119 |
+
" # Remove version indicators (v1, v2, v1.0, etc.)\n",
|
| 120 |
+
" name = re.sub(r'\\s*[vV]\\d+(\\.\\d+)?\\s*', ' ', name)\n",
|
| 121 |
+
" \n",
|
| 122 |
+
" # Remove LoRA-related terms (case insensitive)\n",
|
| 123 |
+
" lora_terms = ['lora', 'loha', 'lycoris', 'controlnet', 'textual inversion', \n",
|
| 124 |
+
"                  'embedding', 'ti', 'checkpoint', 'model', 'adapter', 'pony', 'sdxl', 'flux', 'illustrious', 'sd14', 'sd2', 'sd3', 'diffusion', 'stable', 'hunyuan']\n",
|
| 125 |
+
" for term in lora_terms:\n",
|
| 126 |
+
" name = re.sub(rf'\\b{term}\\b', '', name, flags=re.IGNORECASE)\n",
|
| 127 |
+
" \n",
|
| 128 |
+
" # Remove content in parentheses or brackets (often metadata)\n",
|
| 129 |
+
" name = re.sub(r'\\([^)]*\\)', '', name)\n",
|
| 130 |
+
" name = re.sub(r'\\[[^\\]]*\\]', '', name)\n",
|
| 131 |
+
" \n",
|
| 132 |
+
" # Remove special characters like 「」\n",
|
| 133 |
+
" name = re.sub(r'[「」『』【】〈〉《》]', '', name)\n",
|
| 134 |
+
" \n",
|
| 135 |
+
" # Handle pipe - keep first part\n",
|
| 136 |
+
" if '|' in name:\n",
|
| 137 |
+
" name = name.split('|')[0]\n",
|
| 138 |
+
" \n",
|
| 139 |
+
" # Handle forward slash - keep first part\n",
|
| 140 |
+
" if '/' in name:\n",
|
| 141 |
+
" name = name.split('/')[0]\n",
|
| 142 |
+
" \n",
|
| 143 |
+
" # Replace underscores with spaces\n",
|
| 144 |
+
" name = name.replace('_', ' ')\n",
|
| 145 |
+
" \n",
|
| 146 |
+
" # Remove multiple spaces\n",
|
| 147 |
+
" name = re.sub(r'\\s+', ' ', name)\n",
|
| 148 |
+
" \n",
|
| 149 |
+
" # Strip\n",
|
| 150 |
+
" name = name.strip()\n",
|
| 151 |
+
" \n",
|
| 152 |
+
" return name\n",
|
| 153 |
+
"\n",
|
| 154 |
+
"def extract_person_name(text: str) -> str:\n",
|
| 155 |
+
" \"\"\"\n",
|
| 156 |
+
" Use spaCy NER to extract person names from text.\n",
|
| 157 |
+
" Falls back to cleaned text if no PERSON entity found.\n",
|
| 158 |
+
" \"\"\"\n",
|
| 159 |
+
" if not text:\n",
|
| 160 |
+
" return \"\"\n",
|
| 161 |
+
" \n",
|
| 162 |
+
" # Run spaCy NER\n",
|
| 163 |
+
" doc = nlp(text)\n",
|
| 164 |
+
" \n",
|
| 165 |
+
" # Extract PERSON entities\n",
|
| 166 |
+
" person_entities = [ent.text for ent in doc.ents if ent.label_ == \"PERSON\"]\n",
|
| 167 |
+
" \n",
|
| 168 |
+
" if person_entities:\n",
|
| 169 |
+
"        # Return the first PERSON entity found\n",
|
| 170 |
+
" return person_entities[0].strip()\n",
|
| 171 |
+
" \n",
|
| 172 |
+
" # If no PERSON entity found, try to extract capitalized words (likely names)\n",
|
| 173 |
+
" # This helps with names spaCy might miss\n",
|
| 174 |
+
" words = text.split()\n",
|
| 175 |
+
" capitalized_words = [w for w in words if w and w[0].isupper() and len(w) > 1]\n",
|
| 176 |
+
" \n",
|
| 177 |
+
" if capitalized_words:\n",
|
| 178 |
+
" # Join first few capitalized words (likely the name)\n",
|
| 179 |
+
" return ' '.join(capitalized_words[:3]).strip()\n",
|
| 180 |
+
" \n",
|
| 181 |
+
" # Last resort: return cleaned text\n",
|
| 182 |
+
" return text.strip()\n",
|
| 183 |
+
"\n",
|
| 184 |
+
"def clean_name_with_spacy(name: str) -> str:\n",
|
| 185 |
+
" \"\"\"\n",
|
| 186 |
+
" Complete name cleaning pipeline with spaCy NER.\n",
|
| 187 |
+
" \n",
|
| 188 |
+
" Pipeline:\n",
|
| 189 |
+
" 1. Translate leetspeak (4→a, 3→e, 1→i, etc.)\n",
|
| 190 |
+
" 2. Remove noise (emoji, version tags, LoRA terms)\n",
|
| 191 |
+
" 3. Use spaCy to extract PERSON entities\n",
|
| 192 |
+
" 4. Fallback to capitalized words or cleaned text\n",
|
| 193 |
+
" \"\"\"\n",
|
| 194 |
+
" # Step 1 & 2: Preprocess (leetspeak + noise removal)\n",
|
| 195 |
+
" preprocessed = preprocess_for_ner(name)\n",
|
| 196 |
+
" \n",
|
| 197 |
+
" if not preprocessed:\n",
|
| 198 |
+
" return \"\"\n",
|
| 199 |
+
" \n",
|
| 200 |
+
" # Step 3: Extract person name using spaCy NER\n",
|
| 201 |
+
" person_name = extract_person_name(preprocessed)\n",
|
| 202 |
+
" \n",
|
| 203 |
+
" return person_name\n",
|
| 204 |
+
"\n",
|
| 205 |
+
"# Apply name cleaning with spaCy\n",
|
| 206 |
+
"print(\"\\n🔄 Processing names with spaCy NER...\")\n",
|
| 207 |
+
"df['real_name'] = df['name'].apply(clean_name_with_spacy)\n",
|
| 208 |
+
"\n",
|
| 209 |
+
"# Show examples with detailed comparison\n",
|
| 210 |
+
"print(\"\\n📊 Name cleaning examples (with spaCy NER):\")\n",
|
| 211 |
+
"print(\"=\" * 100)\n",
|
| 212 |
+
"print(f\"{'Original Name':<50} | {'Cleaned Name':<30}\")\n",
|
| 213 |
+
"print(\"=\" * 100)\n",
|
| 214 |
+
"\n",
|
| 215 |
+
"examples = df[['name', 'real_name']].head(30)\n",
|
| 216 |
+
"shown = 0\n",
|
| 217 |
+
"for idx, row in examples.iterrows():\n",
|
| 218 |
+
" if row['name'] != row['real_name'] and shown < 20:\n",
|
| 219 |
+
" print(f\"{row['name']:<50} | {row['real_name']:<30}\")\n",
|
| 220 |
+
" shown += 1\n",
|
| 221 |
+
"\n",
|
| 222 |
+
"print(\"=\" * 100)\n",
|
| 223 |
+
"\n",
|
| 224 |
+
"# Show specific test cases\n",
|
| 225 |
+
"print(\"\\n🧪 Leetspeak translation examples:\")\n",
|
| 226 |
+
"test_names = ['4kira LoRA', '3mma Watson v2', '1rene LORA', 'L3vi Ackerman']\n",
|
| 227 |
+
"for test in test_names:\n",
|
| 228 |
+
" result = clean_name_with_spacy(test)\n",
|
| 229 |
+
" print(f\" {test:<30} -> {result}\")\n",
|
| 230 |
+
"\n",
|
| 231 |
+
"# Statistics\n",
|
| 232 |
+
"print(f\"\\n📈 Statistics:\")\n",
|
| 233 |
+
"print(f\" Total rows: {len(df)}\")\n",
|
| 234 |
+
"print(f\" Non-empty names: {(df['real_name'] != '').sum()}\")\n",
|
| 235 |
+
"print(f\" Empty names: {(df['real_name'] == '').sum()}\")\n",
|
| 236 |
+
"\n",
|
| 237 |
+
"# Show some examples of what spaCy identified\n",
|
| 238 |
+
"print(\"\\n🎯 Sample spaCy NER results:\")\n",
|
| 239 |
+
"sample_names = df['real_name'].head(20).tolist()\n",
|
| 240 |
+
"for i, name in enumerate(sample_names[:10], 1):\n",
|
| 241 |
+
" if name:\n",
|
| 242 |
+
" print(f\" {i}. {name}\")\n",
|
| 243 |
+
"\n",
|
| 244 |
+
"print(f\"\\n✅ Cleaned {len(df)} names using spaCy NER\")\n",
|
| 245 |
+
"\n",
|
| 246 |
+
"# Save intermediate result\n",
|
| 247 |
+
"output_step1 = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_01_NER.csv\"\n",
|
| 248 |
+
"df.to_csv(output_step1, index=False)\n",
|
| 249 |
+
"print(f\"💾 Saved to {output_step1}\")\n"
|
| 250 |
+
]
|
| 251 |
+
},
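The leetspeak step of the cell above can be tried on its own, without spaCy or the CSV input. The following is a minimal sketch, not the notebook's implementation: `LEET_MAP` is a reduced, hypothetical mapping and `normalize_leetspeak` is an illustrative name. It assumes a leet character should only be translated when it starts a word and is followed by a letter, or when it sits between two letters, so that plain numbers and trailing punctuation are left alone.

```python
import re

# Hypothetical, reduced mapping for illustration; the notebook's map has more entries.
LEET_MAP = {"4": "a", "3": "e", "1": "i", "0": "o", "@": "a", "$": "s"}


def normalize_leetspeak(text: str) -> str:
    """Replace leet characters only where they plausibly stand in for letters."""
    result = text
    for leet, normal in LEET_MAP.items():
        token = re.escape(leet)
        # Word start: not preceded by a letter or digit, followed by a letter.
        result = re.sub(rf"(?<![A-Za-z0-9]){token}(?=[A-Za-z])", normal.upper(), result)
        # Mid-word: directly between two letters.
        result = re.sub(rf"(?<=[A-Za-z]){token}(?=[A-Za-z])", normal, result)
    return result


if __name__ == "__main__":
    for sample in ["4kira", "3mma Watson", "L3vi Ackerman", "Emma 2024"]:
        print(sample, "->", normalize_leetspeak(sample))
```

Capitalizing the word-start replacement is a design choice made here so that the result looks like a proper name, which may help both the NER model and the capitalized-word fallback.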
|
| 252 |
+
{
|
| 253 |
+
"cell_type": "markdown",
|
| 254 |
+
"id": "64687c72",
|
| 255 |
+
"metadata": {},
|
| 256 |
+
"source": [
|
| 257 |
+
"#### STEP 02: Nationality tag to Country hint\n",
|
| 258 |
+
"Here, tags related to nationality are converted to the equivalent country; profession tags are mapped to a canonical profession."
|
| 259 |
+
]
|
| 260 |
+
},
|
| 261 |
+
{
|
| 262 |
+
"cell_type": "code",
|
| 263 |
+
"execution_count": null,
|
| 264 |
+
"id": "d6eaef5b",
|
| 265 |
+
"metadata": {},
|
| 266 |
+
"outputs": [],
|
| 267 |
+
"source": [
|
| 268 |
+
"import pandas as pd\n",
|
| 269 |
+
"from pathlib import Path\n",
|
| 270 |
+
"\n",
|
| 271 |
+
"# Set up paths\n",
|
| 272 |
+
"current_dir = Path.cwd()\n",
|
| 273 |
+
"countries_file = current_dir.parent / \"misc/lists/countries.csv\"\n",
|
| 274 |
+
"professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
|
| 275 |
+
"input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_01_NER.csv\"\n",
|
| 276 |
+
"output_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
|
| 277 |
+
"\n",
|
| 278 |
+
"# Load datasets\n",
|
| 279 |
+
"poi_df = pd.read_csv(input_file)\n",
|
| 280 |
+
"countries_df = pd.read_csv(countries_file)\n",
|
| 281 |
+
"professions_df = pd.read_csv(professions_file)\n",
|
| 282 |
+
"\n",
|
| 283 |
+
"# Define uninhabited or non-relevant territories to exclude\n",
|
| 284 |
+
"excluded_territories = {\n",
|
| 285 |
+
" 'isle of man', 'bouvet island', 'heard island and mcdonald islands',\n",
|
| 286 |
+
" 'french southern territories', 'south georgia and the south sandwich islands',\n",
|
| 287 |
+
" 'svalbard and jan mayen', 'british indian ocean territory', 'antarctica',\n",
|
| 288 |
+
" 'christmas island', 'cocos (keeling) islands', 'norfolk island',\n",
|
| 289 |
+
" 'pitcairn', 'tokelau', 'united states minor outlying islands',\n",
|
| 290 |
+
" 'wallis and futuna', 'western sahara'\n",
|
| 291 |
+
"}\n",
|
| 292 |
+
"\n",
|
| 293 |
+
"# Step 1: Combine tags into one lowercase list\n",
|
| 294 |
+
"def combine_tags(row):\n",
|
| 295 |
+
" return [str(row[f\"tag_{i}\"]).strip().lower() for i in range(1, 8) if pd.notna(row.get(f\"tag_{i}\"))]\n",
|
| 296 |
+
"\n",
|
| 297 |
+
"poi_df[\"tags\"] = poi_df.apply(combine_tags, axis=1)\n",
|
| 298 |
+
"\n",
|
| 299 |
+
"# Step 2: Build tag → (country, nationality) mapping with PRIORITIES\n",
|
| 300 |
+
"tag_to_country_nationality = {}\n",
|
| 301 |
+
"# We'll use a priority score: direct country name = 3, nationality = 2, word parts = 1\n",
|
| 302 |
+
"\n",
|
| 303 |
+
"for _, row in countries_df.iterrows():\n",
|
| 304 |
+
" country = str(row[\"en_short_name\"]).strip()\n",
|
| 305 |
+
" nationality = str(row[\"nationality\"]).strip()\n",
|
| 306 |
+
" \n",
|
| 307 |
+
" # Skip excluded territories\n",
|
| 308 |
+
" if country.lower() in excluded_territories:\n",
|
| 309 |
+
" continue\n",
|
| 310 |
+
"\n",
|
| 311 |
+
" country_lc = country.lower()\n",
|
| 312 |
+
" nationality_lc = nationality.lower()\n",
|
| 313 |
+
"\n",
|
| 314 |
+
"    # Store as (country, nationality, priority); a higher priority overwrites a lower one\n",
|
| 315 |
+
"    # Exact country name match = highest priority\n",
|
| 316 |
+
"    if tag_to_country_nationality.get(country_lc, (\"\", \"\", 0))[2] < 3:\n",
|
| 317 |
+
"        tag_to_country_nationality[country_lc] = (country, nationality, 3)\n",
|
| 318 |
+
"    \n",
|
| 319 |
+
"    # Exact nationality match = medium priority\n",
|
| 320 |
+
"    if tag_to_country_nationality.get(nationality_lc, (\"\", \"\", 0))[2] < 2:\n",
|
| 321 |
+
"        tag_to_country_nationality[nationality_lc] = (country, nationality, 2)\n",
|
| 322 |
+
"    \n",
|
| 323 |
+
"    # No-space versions\n",
|
| 324 |
+
"    country_no_space = country_lc.replace(\" \", \"\")\n",
|
| 325 |
+
"    nationality_no_space = nationality_lc.replace(\" \", \"\")\n",
|
| 326 |
+
"    \n",
|
| 327 |
+
"    if tag_to_country_nationality.get(country_no_space, (\"\", \"\", 0))[2] < 3:\n",
|
| 328 |
+
"        tag_to_country_nationality[country_no_space] = (country, nationality, 3)\n",
|
| 329 |
+
"    if tag_to_country_nationality.get(nationality_no_space, (\"\", \"\", 0))[2] < 2:\n",
|
| 330 |
+
"        tag_to_country_nationality[nationality_no_space] = (country, nationality, 2)\n",
|
| 331 |
+
"\n",
|
| 332 |
+
"    # Word parts = lowest priority (only for longer words to avoid false matches)\n",
|
| 333 |
+
"    for part in country_lc.split():\n",
|
| 334 |
+
"        if len(part) > 4:  # Only words longer than 4 chars\n",
|
| 335 |
+
"            if part not in tag_to_country_nationality:\n",
|
| 336 |
+
"                tag_to_country_nationality[part] = (country, nationality, 1)\n",
|
| 337 |
+
"    for part in nationality_lc.split():\n",
|
| 338 |
+
"        if len(part) > 4:\n",
|
| 339 |
+
"            if part not in tag_to_country_nationality:\n",
|
| 340 |
+
"                tag_to_country_nationality[part] = (country, nationality, 1)\n",
|
| 341 |
+
"\n",
|
| 342 |
+
"print(f\"Built country/nationality mapping with {len(tag_to_country_nationality)} entries\")\n",
|
| 343 |
+
"\n",
|
| 344 |
+
"# Step 3: Infer likely_country and likely_nationality by checking ALL tags\n",
|
| 345 |
+
"def infer_country_and_nationality(tags):\n",
|
| 346 |
+
" \"\"\"\n",
|
| 347 |
+
" Check ALL tags and return the best match based on priority.\n",
|
| 348 |
+
" Priority: exact country name > nationality > word parts\n",
|
| 349 |
+
" \"\"\"\n",
|
| 350 |
+
" best_match = None\n",
|
| 351 |
+
" best_priority = 0\n",
|
| 352 |
+
" \n",
|
| 353 |
+
" for tag in tags:\n",
|
| 354 |
+
" # Try cleaned version (no spaces)\n",
|
| 355 |
+
" cleaned = tag.replace(\" \", \"\").lower()\n",
|
| 356 |
+
" \n",
|
| 357 |
+
" # Check cleaned version\n",
|
| 358 |
+
" if cleaned in tag_to_country_nationality:\n",
|
| 359 |
+
" country, nationality, priority = tag_to_country_nationality[cleaned]\n",
|
| 360 |
+
" if priority > best_priority and country and country.lower() not in excluded_territories:\n",
|
| 361 |
+
" best_match = (country, nationality)\n",
|
| 362 |
+
" best_priority = priority\n",
|
| 363 |
+
" \n",
|
| 364 |
+
" # Also check original tag\n",
|
| 365 |
+
" if tag in tag_to_country_nationality:\n",
|
| 366 |
+
" country, nationality, priority = tag_to_country_nationality[tag]\n",
|
| 367 |
+
" if priority > best_priority and country and country.lower() not in excluded_territories:\n",
|
| 368 |
+
" best_match = (country, nationality)\n",
|
| 369 |
+
" best_priority = priority\n",
|
| 370 |
+
" \n",
|
| 371 |
+
" if best_match:\n",
|
| 372 |
+
" return pd.Series(best_match)\n",
|
| 373 |
+
" return pd.Series([\"\", \"\"])\n",
|
| 374 |
+
"\n",
|
| 375 |
+
"poi_df[[\"likely_country\", \"likely_nationality\"]] = poi_df[\"tags\"].apply(infer_country_and_nationality)\n",
|
| 376 |
+
"\n",
|
| 377 |
+
"# Step 4: Build tag → profession mapping\n",
|
| 378 |
+
"profession_alias_map = {}\n",
|
| 379 |
+
"\n",
|
| 380 |
+
"for _, row in professions_df.iterrows():\n",
|
| 381 |
+
" canonical = str(row['profession']).strip().lower()\n",
|
| 382 |
+
" profession_alias_map[canonical] = canonical\n",
|
| 383 |
+
" for alias_col in ['alias_1', 'alias_2', 'alias_3']:\n",
|
| 384 |
+
" alias = row.get(alias_col)\n",
|
| 385 |
+
" if pd.notna(alias):\n",
|
| 386 |
+
" profession_alias_map[str(alias).strip().lower()] = canonical\n",
|
| 387 |
+
"\n",
|
| 388 |
+
"# Step 5: Infer likely profession from tags\n",
|
| 389 |
+
"def infer_profession_from_tags(tags):\n",
|
| 390 |
+
" matched = []\n",
|
| 391 |
+
" for tag in tags:\n",
|
| 392 |
+
" cleaned = tag.strip().lower()\n",
|
| 393 |
+
" if cleaned in profession_alias_map:\n",
|
| 394 |
+
" matched.append(profession_alias_map[cleaned])\n",
|
| 395 |
+
"\n",
|
| 396 |
+
" if not matched:\n",
|
| 397 |
+
" return \"\"\n",
|
| 398 |
+
" if \"celebrity\" in matched and len(set(matched)) > 1:\n",
|
| 399 |
+
" # Drop 'celebrity' if other professions are present\n",
|
| 400 |
+
" matched = [m for m in matched if m != \"celebrity\"]\n",
|
| 401 |
+
"\n",
|
| 402 |
+
" return matched[0] # Return the first specific match\n",
|
| 403 |
+
"\n",
|
| 404 |
+
"\n",
|
| 405 |
+
"poi_df[\"likely_profession\"] = poi_df[\"tags\"].apply(infer_profession_from_tags)\n",
|
| 406 |
+
"\n",
|
| 407 |
+
"# Step 6: Save enriched dataset\n",
|
| 408 |
+
"poi_df.to_csv(output_file, index=False)\n",
|
| 409 |
+
"\n",
|
| 410 |
+
"# Preview results\n",
|
| 411 |
+
"print(f\"\\nProcessed {len(poi_df)} rows\")\n",
|
| 412 |
+
"print(f\"Rows with country: {(poi_df['likely_country'] != '').sum()}\")\n",
|
| 413 |
+
"print(f\"Rows with nationality: {(poi_df['likely_nationality'] != '').sum()}\")\n",
|
| 414 |
+
"print(f\"Rows with profession: {(poi_df['likely_profession'] != '').sum()}\")\n",
|
| 415 |
+
"\n",
|
| 416 |
+
"print(f\"\\nTop 10 countries:\")\n",
|
| 417 |
+
"print(poi_df[poi_df['likely_country'] != '']['likely_country'].value_counts().head(10))\n"
|
| 418 |
+
]
|
| 419 |
+
},
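The priority-based tag lookup in the cell above can be illustrated with toy data. This is a sketch under stated assumptions rather than the notebook's code: `COUNTRIES` is a hypothetical stand-in for `countries.csv`, and `build_tag_map` and `infer_country` are illustrative names. Each entry stores both country and nationality, and a higher-priority entry replaces a lower-priority one for the same key.

```python
# Hypothetical toy rows standing in for countries.csv (en_short_name, nationality).
COUNTRIES = [
    ("American Samoa", "American Samoan"),
    ("Japan", "Japanese"),
    ("United States of America", "American"),
]


def build_tag_map(rows):
    """Map lowercase tag -> (country, nationality, priority); higher priority wins."""
    tag_map = {}

    def add(key, country, nationality, priority):
        # Insert only if the key is new or the new entry has a higher priority.
        if tag_map.get(key, ("", "", 0))[2] < priority:
            tag_map[key] = (country, nationality, priority)

    for country, nationality in rows:
        add(country.lower(), country, nationality, 3)      # exact country name
        add(nationality.lower(), country, nationality, 2)  # exact nationality
        for part in country.lower().split():               # word parts, longer words only
            if len(part) > 4:
                add(part, country, nationality, 1)
    return tag_map


def infer_country(tags, tag_map):
    """Return (country, nationality) for the highest-priority matching tag."""
    best = ("", "", 0)
    for tag in tags:
        match = tag_map.get(tag.strip().lower())
        if match and match[2] > best[2]:
            best = match
    return best[0], best[1]


if __name__ == "__main__":
    tag_map = build_tag_map(COUNTRIES)
    print(infer_country(["woman", "japanese"], tag_map))
    print(infer_country(["american"], tag_map))
```

In this toy data, the word part "american" from "American Samoa" is registered first with priority 1 and is then replaced by the exact nationality "American" with priority 2, which is the case the priority comparison is meant to handle.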
|
| 420 |
+
{
|
| 421 |
+
"cell_type": "markdown",
|
| 422 |
+
"id": "4a4a58b3",
|
| 423 |
+
"metadata": {},
|
| 424 |
+
"source": [
|
| 425 |
+
"## LLM ANNOTATION"
|
| 426 |
+
]
|
| 427 |
+
},
|
| 428 |
+
{
|
| 429 |
+
"cell_type": "markdown",
|
| 430 |
+
"id": "b298844d",
|
| 431 |
+
"metadata": {},
|
| 432 |
+
"source": [
|
| 433 |
+
"#### Model Configurations"
|
| 434 |
+
]
|
| 435 |
+
},
|
| 436 |
+
{
|
| 437 |
+
"cell_type": "code",
|
| 438 |
+
"execution_count": null,
|
| 439 |
+
"id": "39f3d65e",
|
| 440 |
+
"metadata": {},
|
| 441 |
+
"outputs": [],
|
| 442 |
+
"source": [
|
| 443 |
+
"import pandas as pd\n",
|
| 444 |
+
"import json\n",
|
| 445 |
+
"import time\n",
|
| 446 |
+
"import re\n",
|
| 447 |
+
"from pathlib import Path\n",
|
| 448 |
+
"from tqdm import tqdm\n",
|
| 449 |
+
"import torch\n",
|
| 450 |
+
"from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
|
| 451 |
+
"import signal\n",
|
| 452 |
+
"from contextlib import contextmanager\n",
|
| 453 |
+
"\n",
|
| 454 |
+
"# Configuration\n",
|
| 455 |
+
"current_dir = Path.cwd()\n",
|
| 456 |
+
"CACHE_DIR = current_dir.parent / \"data/models\"\n",
|
| 457 |
+
"CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
|
| 458 |
+
"\n",
|
| 459 |
+
"# Model configurations\n",
|
| 460 |
+
"MODEL_CONFIGS = {\n",
|
| 461 |
+
" 'mistral': {\n",
|
| 462 |
+
" 'name': 'mistralai/Mistral-7B-Instruct-v0.3',\n",
|
| 463 |
+
" 'dtype': torch.bfloat16,\n",
|
| 464 |
+
" 'quantization': None,\n",
|
| 465 |
+
" 'generation_params': {\n",
|
| 466 |
+
" 'max_new_tokens': 512,\n",
|
| 467 |
+
" 'temperature': 0.05,\n",
|
| 468 |
+
" 'do_sample': True,\n",
|
| 469 |
+
" 'top_p': 1.0,\n",
|
| 470 |
+
" }\n",
|
| 471 |
+
" },\n",
|
| 472 |
+
" 'gemma': {\n",
|
| 473 |
+
" 'name': 'google/gemma-3-27b-it',\n",
|
| 474 |
+
" 'dtype': torch.bfloat16,\n",
|
| 475 |
+
" 'quantization': None,\n",
|
| 476 |
+
" 'generation_params': {\n",
|
| 477 |
+
" 'max_new_tokens': 512,\n",
|
| 478 |
+
" 'temperature': 0.1,\n",
|
| 479 |
+
" 'do_sample': True,\n",
|
| 480 |
+
" 'top_p': 1.0,\n",
|
| 481 |
+
" }\n",
|
| 482 |
+
" },\n",
|
| 483 |
+
" 'qwen': {\n",
|
| 484 |
+
" 'name': 'Qwen/Qwen2.5-32B-Instruct',\n",
|
| 485 |
+
" 'dtype': None, # Will use quantization\n",
|
| 486 |
+
" 'quantization': BitsAndBytesConfig(\n",
|
| 487 |
+
" load_in_8bit=True,\n",
|
| 488 |
+
" llm_int8_threshold=6.0,\n",
|
| 489 |
+
" llm_int8_has_fp16_weight=False\n",
|
| 490 |
+
" ),\n",
|
| 491 |
+
" 'generation_params': {\n",
|
| 492 |
+
" 'max_new_tokens': 512,\n",
|
| 493 |
+
" 'temperature': 0.1,\n",
|
| 494 |
+
"            'do_sample': False,  # greedy decoding; the temperature above is ignored\n",
|
| 495 |
+
" }\n",
|
| 496 |
+
" }\n",
|
| 497 |
+
"}\n",
|
| 498 |
+
"\n",
|
| 499 |
+
"PROFESSION_CATEGORIES = [\n",
|
| 500 |
+
" \"actor\",\n",
|
| 501 |
+
" \"adult performer\",\n",
|
| 502 |
+
" \"singer/musician\",\n",
|
| 503 |
+
" \"model\",\n",
|
| 504 |
+
" \"online personality\",\n",
|
| 505 |
+
" \"public figure\",\n",
|
| 506 |
+
" \"voice actor/ASMR\",\n",
|
| 507 |
+
" \"sports professional\",\n",
|
| 508 |
+
" \"tv personality\"\n",
|
| 509 |
+
"]\n"
|
| 510 |
+
]
|
| 511 |
+
},
|
| 512 |
+
{
|
| 513 |
+
"cell_type": "markdown",
|
| 514 |
+
"id": "c215b38c",
|
| 515 |
+
"metadata": {},
|
| 516 |
+
"source": [
|
| 517 |
+
"#### Load Model Function"
|
| 518 |
+
]
|
| 519 |
+
},
|
| 520 |
+
{
|
| 521 |
+
"cell_type": "code",
|
| 522 |
+
"execution_count": null,
|
| 523 |
+
"id": "cfb5b13e",
|
| 524 |
+
"metadata": {},
|
| 525 |
+
"outputs": [],
|
| 526 |
+
"source": [
|
| 527 |
+
"def load_model(model_type='mistral'):\n",
|
| 528 |
+
" \"\"\"\n",
|
| 529 |
+
" Load model and tokenizer based on type.\n",
|
| 530 |
+
" \n",
|
| 531 |
+
" Args:\n",
|
| 532 |
+
" model_type: 'mistral', 'gemma', or 'qwen'\n",
|
| 533 |
+
" \n",
|
| 534 |
+
" Returns:\n",
|
| 535 |
+
" tuple: (model, tokenizer, config)\n",
|
| 536 |
+
" \"\"\"\n",
|
| 537 |
+
" if model_type not in MODEL_CONFIGS:\n",
|
| 538 |
+
" raise ValueError(f\"Unknown model type: {model_type}. Choose from {list(MODEL_CONFIGS.keys())}\")\n",
|
| 539 |
+
" \n",
|
| 540 |
+
" config = MODEL_CONFIGS[model_type]\n",
|
| 541 |
+
" model_name = config['name']\n",
|
| 542 |
+
" \n",
|
| 543 |
+
" device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
|
| 544 |
+
" print(f\"Loading model: {model_name}\")\n",
|
| 545 |
+
" print(f\"Cache directory: {CACHE_DIR}\")\n",
|
| 546 |
+
" print(f\"Device: {device}\\n\")\n",
|
| 547 |
+
" \n",
|
| 548 |
+
" if device == \"cpu\":\n",
|
| 549 |
+
" print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
|
| 550 |
+
" \n",
|
| 551 |
+
" # Load tokenizer\n",
|
| 552 |
+
" try:\n",
|
| 553 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 554 |
+
" model_name,\n",
|
| 555 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 556 |
+
" use_fast=True\n",
|
| 557 |
+
" )\n",
|
| 558 |
+
"    except Exception:\n",
|
| 559 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 560 |
+
" model_name,\n",
|
| 561 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 562 |
+
" use_fast=False\n",
|
| 563 |
+
" )\n",
|
| 564 |
+
" \n",
|
| 565 |
+
" if tokenizer.pad_token is None:\n",
|
| 566 |
+
" tokenizer.pad_token = tokenizer.eos_token\n",
|
| 567 |
+
" \n",
|
| 568 |
+
" # Load model\n",
|
| 569 |
+
" model_kwargs = {\n",
|
| 570 |
+
" 'cache_dir': str(CACHE_DIR),\n",
|
| 571 |
+
" 'device_map': 'auto',\n",
|
| 572 |
+
" 'trust_remote_code': False\n",
|
| 573 |
+
" }\n",
|
| 574 |
+
" \n",
|
| 575 |
+
" if config['quantization']:\n",
|
| 576 |
+
" model_kwargs['quantization_config'] = config['quantization']\n",
|
| 577 |
+
" else:\n",
|
| 578 |
+
" model_kwargs['torch_dtype'] = config['dtype']\n",
|
| 579 |
+
" \n",
|
| 580 |
+
" model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)\n",
|
| 581 |
+
" model.eval()\n",
|
| 582 |
+
" \n",
|
| 583 |
+
" # Check VRAM\n",
|
| 584 |
+
" if torch.cuda.is_available():\n",
|
| 585 |
+
" vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
|
| 586 |
+
" print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
|
| 587 |
+
" \n",
|
| 588 |
+
" return model, tokenizer, config\n"
|
| 589 |
+
]
|
| 590 |
+
},
|
| 591 |
+
{
|
| 592 |
+
"cell_type": "markdown",
|
| 593 |
+
"id": "11b2221a",
|
| 594 |
+
"metadata": {},
|
| 595 |
+
"source": [
|
| 596 |
+
"#### Inference Code"
|
| 597 |
+
]
|
| 598 |
+
},
|
| 599 |
+
{
|
| 600 |
+
"cell_type": "code",
|
| 601 |
+
"execution_count": null,
|
| 602 |
+
"id": "229f96bd",
|
| 603 |
+
"metadata": {},
|
| 604 |
+
"outputs": [],
|
| 605 |
+
"source": [
|
| 606 |
+
"@contextmanager\n",
|
| 607 |
+
"def timeout(duration):\n",
|
| 608 |
+
" \"\"\"Raise TimeoutError if the wrapped block runs longer than `duration` seconds (uses SIGALRM: Unix, main thread only).\"\"\"\n",
|
| 609 |
+
" def handler(signum, frame):\n",
|
| 610 |
+
" raise TimeoutError(\"Operation timed out\")\n",
|
| 611 |
+
" \n",
|
| 612 |
+
" signal.signal(signal.SIGALRM, handler)\n",
|
| 613 |
+
" signal.alarm(duration)\n",
|
| 614 |
+
" try:\n",
|
| 615 |
+
" yield\n",
|
| 616 |
+
" finally:\n",
|
| 617 |
+
" signal.alarm(0)\n",
|
| 618 |
+
"\n",
|
| 619 |
+
"def query_model(prompt, model, tokenizer, config, use_timeout=False):\n",
|
| 620 |
+
" \"\"\"\n",
|
| 621 |
+
" Query model with given prompt.\n",
|
| 622 |
+
" \n",
|
| 623 |
+
" Args:\n",
|
| 624 |
+
" prompt: Input prompt string\n",
|
| 625 |
+
" model: Loaded model\n",
|
| 626 |
+
" tokenizer: Loaded tokenizer\n",
|
| 627 |
+
" config: Model configuration dict\n",
|
| 628 |
+
" use_timeout: Whether to use 60s timeout (for Qwen)\n",
|
| 629 |
+
" \n",
|
| 630 |
+
" Returns:\n",
|
| 631 |
+
" str: Model response or None on error\n",
|
| 632 |
+
" \"\"\"\n",
|
| 633 |
+
" try:\n",
|
| 634 |
+
" device = next(model.parameters()).device\n",
|
| 635 |
+
" \n",
|
| 636 |
+
" # Format as chat message\n",
|
| 637 |
+
" messages = [\n",
|
| 638 |
+
" {\"role\": \"system\", \"content\": \"You are a data extraction assistant. Respond with exactly 5 numbered lines containing ONLY values. No labels, no explanations, no prefixes. Follow the format precisely.\"},\n",
|
| 639 |
+
" {\"role\": \"user\", \"content\": prompt}\n",
|
| 640 |
+
" ]\n",
|
| 641 |
+
" \n",
|
| 642 |
+
" # Tokenize\n",
|
| 643 |
+
" if hasattr(tokenizer, 'apply_chat_template') and tokenizer.chat_template is not None:\n",
|
| 644 |
+
" text = tokenizer.apply_chat_template(\n",
|
| 645 |
+
" messages,\n",
|
| 646 |
+
" tokenize=False,\n",
|
| 647 |
+
" add_generation_prompt=True\n",
|
| 648 |
+
" )\n",
|
| 649 |
+
" else:\n",
|
| 650 |
+
" text = f\"[INST] {prompt} [/INST]\"\n",
|
| 651 |
+
" \n",
|
| 652 |
+
" inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
|
| 653 |
+
" \n",
|
| 654 |
+
" # Generation parameters\n",
|
| 655 |
+
" gen_kwargs = config['generation_params'].copy()\n",
|
| 656 |
+
" gen_kwargs['pad_token_id'] = tokenizer.eos_token_id\n",
|
| 657 |
+
" \n",
|
| 658 |
+
" # Generate\n",
|
| 659 |
+
" generation_fn = lambda: model.generate(**inputs, **gen_kwargs)\n",
|
| 660 |
+
" \n",
|
| 661 |
+
" if use_timeout:\n",
|
| 662 |
+
" with timeout(60):\n",
|
| 663 |
+
" with torch.no_grad():\n",
|
| 664 |
+
" outputs = generation_fn()\n",
|
| 665 |
+
" else:\n",
|
| 666 |
+
" with torch.no_grad():\n",
|
| 667 |
+
" outputs = generation_fn()\n",
|
| 668 |
+
" \n",
|
| 669 |
+
" # Decode\n",
|
| 670 |
+
" generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
|
| 671 |
+
" response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
|
| 672 |
+
" \n",
|
| 673 |
+
" return response.strip()\n",
|
| 674 |
+
" \n",
|
| 675 |
+
" except TimeoutError:\n",
|
| 676 |
+
" print(\"[ERROR] Generation timed out after 60 seconds\")\n",
|
| 677 |
+
" return None\n",
|
| 678 |
+
" except Exception as e:\n",
|
| 679 |
+
" print(f\"[ERROR] Generation failed: {e}\")\n",
|
| 680 |
+
" return None\n"
|
| 681 |
+
]
|
| 682 |
+
},
|
| 683 |
+
{
|
| 684 |
+
"cell_type": "markdown",
|
| 685 |
+
"id": "88f005f8",
|
| 686 |
+
"metadata": {},
|
| 687 |
+
"source": [
|
| 688 |
+
"#### Prompt creation"
|
| 689 |
+
]
|
| 690 |
+
},
|
| 691 |
+
{
|
| 692 |
+
"cell_type": "code",
|
| 693 |
+
"execution_count": null,
|
| 694 |
+
"id": "dfe05463",
|
| 695 |
+
"metadata": {},
|
| 696 |
+
"outputs": [],
|
| 697 |
+
"source": [
|
| 698 |
+
"def create_prompt(row):\n",
|
| 699 |
+
" \"\"\"Create annotation prompt from row data.\"\"\"\n",
|
| 700 |
+
" name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
|
| 701 |
+
" \n",
|
| 702 |
+
" # Gather hints\n",
|
| 703 |
+
" hints = []\n",
|
| 704 |
+
" if pd.notna(row.get('likely_profession')):\n",
|
| 705 |
+
" hints.append(str(row['likely_profession']))\n",
|
| 706 |
+
" if pd.notna(row.get('likely_nationality')):\n",
|
| 707 |
+
" hints.append(str(row['likely_nationality']))\n",
|
| 708 |
+
" if pd.notna(row.get('likely_country')):\n",
|
| 709 |
+
" hints.append(str(row['likely_country']))\n",
|
| 710 |
+
" \n",
|
| 711 |
+
" # Add tags if needed\n",
|
| 712 |
+
" if len(hints) < 3:\n",
|
| 713 |
+
" for i in range(1, 8):\n",
|
| 714 |
+
" tag_col = f'tag_{i}'\n",
|
| 715 |
+
" if tag_col in row and pd.notna(row[tag_col]):\n",
|
| 716 |
+
" tag_val = str(row[tag_col])\n",
|
| 717 |
+
" if tag_val not in hints:\n",
|
| 718 |
+
" hints.append(tag_val)\n",
|
| 719 |
+
" if len(hints) >= 5:\n",
|
| 720 |
+
" break\n",
|
| 721 |
+
" \n",
|
| 722 |
+
" hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
|
| 723 |
+
" \n",
|
| 724 |
+
" return f\"\"\"Extract information about '{name}' ({hint_text}).\n",
|
| 725 |
+
"\n",
|
| 726 |
+
"Respond with EXACTLY 5 numbered lines. Each line must contain ONLY the value, no labels or extra text.\n",
|
| 727 |
+
"\n",
|
| 728 |
+
"FORMAT REQUIREMENTS:\n",
|
| 729 |
+
"1. Full legal name in Western order (first last). VALUE ONLY.\n",
|
| 730 |
+
"2. Stage names/aliases, comma-separated. If none, write \"None\". VALUE ONLY.\n",
|
| 731 |
+
"3. Gender: MUST be exactly one word: Male, Female, Other, or Unknown. VALUE ONLY.\n",
|
| 732 |
+
"4. Professions: Choose up to 3 from this list ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality. Comma-separated. VALUE ONLY.\n",
|
| 733 |
+
"5. Primary country: Country name only (e.g., \"China\", \"United States\", \"Colombia\"). VALUE ONLY.\n",
|
| 734 |
+
"\n",
|
| 735 |
+
"RULES:\n",
|
| 736 |
+
"- Professions MUST match the exact categories listed (actress = actor)\n",
|
| 737 |
+
"- \"online personality\" includes streamers, cosplayers, YouTubers, influencers\n",
|
| 738 |
+
"- \"public figure\" includes politicians, activists, journalists, authors\n",
|
| 739 |
+
"- Use \"Unknown\" when uncertain or for fictional characters\n",
|
| 740 |
+
"- NO explanatory text, NO labels like \"Gender:\", NO prefixes\n",
|
| 741 |
+
"- For multi-role people, list up to 3 categories by relevance\n",
|
| 742 |
+
"\n",
|
| 743 |
+
"EXAMPLE FORMAT:\n",
|
| 744 |
+
"1. Taylor Swift\n",
|
| 745 |
+
"2. None\n",
|
| 746 |
+
"3. Female\n",
|
| 747 |
+
"4. singer/musician, public figure\n",
|
| 748 |
+
"5. United States\"\"\"\n"
|
| 749 |
+
]
|
| 750 |
+
},
|
| 751 |
+
{
|
| 752 |
+
"cell_type": "markdown",
|
| 753 |
+
"id": "854fa668",
|
| 754 |
+
"metadata": {},
|
| 755 |
+
"source": [
|
| 756 |
+
"#### Response parsing code"
|
| 757 |
+
]
|
| 758 |
+
},
|
| 759 |
+
{
|
| 760 |
+
"cell_type": "code",
|
| 761 |
+
"execution_count": null,
|
| 762 |
+
"id": "1a4be2ee",
|
| 763 |
+
"metadata": {},
|
| 764 |
+
"outputs": [],
|
| 765 |
+
"source": [
|
| 766 |
+
"def parse_response(response):\n",
|
| 767 |
+
" \"\"\"Parse model response into structured fields.\"\"\"\n",
|
| 768 |
+
" if not response:\n",
|
| 769 |
+
" return {\n",
|
| 770 |
+
" 'full_name': 'Unknown',\n",
|
| 771 |
+
" 'aliases': 'Unknown',\n",
|
| 772 |
+
" 'gender': 'Unknown',\n",
|
| 773 |
+
" 'profession_llm': 'Unknown',\n",
|
| 774 |
+
" 'country': 'Unknown'\n",
|
| 775 |
+
" }\n",
|
| 776 |
+
" \n",
|
| 777 |
+
" lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
|
| 778 |
+
" \n",
|
| 779 |
+
" fields = {\n",
|
| 780 |
+
" 'full_name': 'Unknown',\n",
|
| 781 |
+
" 'aliases': 'Unknown',\n",
|
| 782 |
+
" 'gender': 'Unknown',\n",
|
| 783 |
+
" 'profession_llm': 'Unknown',\n",
|
| 784 |
+
" 'country': 'Unknown'\n",
|
| 785 |
+
" }\n",
|
| 786 |
+
" \n",
|
| 787 |
+
" for line in lines:\n",
|
| 788 |
+
" if line.startswith('1.'):\n",
|
| 789 |
+
" fields['full_name'] = line[2:].strip()\n",
|
| 790 |
+
" elif line.startswith('2.'):\n",
|
| 791 |
+
" fields['aliases'] = line[2:].strip()\n",
|
| 792 |
+
" elif line.startswith('3.'):\n",
|
| 793 |
+
" gender_raw = line[2:].strip()\n",
|
| 794 |
+
" gender_raw = re.sub(r'^(Gender:|gender:)\\s*', '', gender_raw, flags=re.IGNORECASE)\n",
|
| 795 |
+
" gender_match = re.search(r'\\b(Male|Female|Other|Unknown)\\b', gender_raw, re.IGNORECASE)\n",
|
| 796 |
+
" fields['gender'] = gender_match.group(1).capitalize() if gender_match else gender_raw\n",
|
| 797 |
+
" elif line.startswith('4.'):\n",
|
| 798 |
+
" fields['profession_llm'] = line[2:].strip()\n",
|
| 799 |
+
" elif line.startswith('5.'):\n",
|
| 800 |
+
" country_raw = line[2:].strip()\n",
|
| 801 |
+
" country_raw = re.sub(r'^(Primary\\s+)?(associated\\s+)?country:\\s*', '', country_raw, flags=re.IGNORECASE)\n",
|
| 802 |
+
" fields['country'] = country_raw\n",
|
| 803 |
+
" \n",
|
| 804 |
+
" return fields\n"
|
| 805 |
+
]
|
| 806 |
+
},
|
| 807 |
+
{
|
| 808 |
+
"cell_type": "markdown",
|
| 809 |
+
"id": "7e2f7a86",
|
| 810 |
+
"metadata": {},
|
| 811 |
+
"source": [
|
| 812 |
+
"#### CSV annotation"
|
| 813 |
+
]
|
| 814 |
+
},
|
| 815 |
+
{
|
| 816 |
+
"cell_type": "code",
|
| 817 |
+
"execution_count": null,
|
| 818 |
+
"id": "5f3dd5d6",
|
| 819 |
+
"metadata": {},
|
| 820 |
+
"outputs": [],
|
| 821 |
+
"source": [
|
| 822 |
+
"def annotate_dataset(model_type='mistral', test_mode=False, test_size=100, max_rows=50862, save_interval=10):\n",
|
| 823 |
+
" \"\"\"\n",
|
| 824 |
+
" Annotate dataset using specified model.\n",
|
| 825 |
+
" \n",
|
| 826 |
+
" Args:\n",
|
| 827 |
+
" model_type: 'mistral', 'gemma', or 'qwen'\n",
|
| 828 |
+
" test_mode: If True, only process test_size rows\n",
|
| 829 |
+
" test_size: Number of rows to process in test mode\n",
|
| 830 |
+
" max_rows: Maximum rows to process\n",
|
| 831 |
+
" save_interval: Save progress every N rows\n",
|
| 832 |
+
" \"\"\"\n",
|
| 833 |
+
" # Setup paths\n",
|
| 834 |
+
" input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
|
| 835 |
+
" output_file = current_dir.parent / f\"data/CSV/{model_type}_local_annotated_POI{'_test' if test_mode else ''}.csv\"\n",
|
| 836 |
+
" index_file = current_dir.parent / f\"misc/query_indicies/{model_type}_local_query_index.txt\"\n",
|
| 837 |
+
" index_file.parent.mkdir(parents=True, exist_ok=True)\n",
|
| 838 |
+
" \n",
|
| 839 |
+
" # Load model\n",
|
| 840 |
+
" model, tokenizer, config = load_model(model_type)\n",
|
| 841 |
+
" \n",
|
| 842 |
+
" # Load data\n",
|
| 843 |
+
" df = pd.read_csv(input_file)\n",
|
| 844 |
+
" print(f\"Loaded {len(df)} rows from input file\")\n",
|
| 845 |
+
" \n",
|
| 846 |
+
" # Merge existing annotations if available\n",
|
| 847 |
+
" if output_file.exists():\n",
|
| 848 |
+
" existing_df = pd.read_csv(output_file)\n",
|
| 849 |
+
" annotation_cols = ['full_name', 'aliases', 'gender', 'profession_llm', 'country']\n",
|
| 850 |
+
" for col in annotation_cols:\n",
|
| 851 |
+
" if col in existing_df.columns:\n",
|
| 852 |
+
" df[col] = existing_df[col][:len(df)]\n",
|
| 853 |
+
" \n",
|
| 854 |
+
" # Apply limits\n",
|
| 855 |
+
" if test_mode:\n",
|
| 856 |
+
" df = df.head(test_size).copy()\n",
|
| 857 |
+
" elif max_rows:\n",
|
| 858 |
+
" df = df.head(max_rows).copy()\n",
|
| 859 |
+
" \n",
|
| 860 |
+
" # Create prompts\n",
|
| 861 |
+
" df['prompt'] = df.apply(create_prompt, axis=1)\n",
|
| 862 |
+
" \n",
|
| 863 |
+
" # Load progress index\n",
|
| 864 |
+
" current_index = 0\n",
|
| 865 |
+
" if index_file.exists():\n",
|
| 866 |
+
" try:\n",
|
| 867 |
+
" current_index = int(index_file.read_text().strip())\n",
|
| 868 |
+
" except (ValueError, OSError):\n",
|
| 869 |
+
" current_index = 0\n",
|
| 870 |
+
" \n",
|
| 871 |
+
" print(f\"Resuming from index {current_index}\")\n",
|
| 872 |
+
" \n",
|
| 873 |
+
" # Process rows\n",
|
| 874 |
+
" use_timeout = (model_type == 'qwen')\n",
|
| 875 |
+
" \n",
|
| 876 |
+
" for i in tqdm(range(current_index, len(df)), desc=f\"{model_type.capitalize()} Annotation\"):\n",
|
| 877 |
+
" prompt = df.at[i, \"prompt\"]\n",
|
| 878 |
+
" \n",
|
| 879 |
+
" # Query with retries\n",
|
| 880 |
+
" response = None\n",
|
| 881 |
+
" for attempt in range(3):\n",
|
| 882 |
+
" response = query_model(prompt, model, tokenizer, config, use_timeout)\n",
|
| 883 |
+
" \n",
|
| 884 |
+
" if response and len(response.strip()) > 10:\n",
|
| 885 |
+
" break\n",
|
| 886 |
+
" \n",
|
| 887 |
+
" print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
|
| 888 |
+
" time.sleep(0.5)\n",
|
| 889 |
+
" \n",
|
| 890 |
+
" # Skip if invalid\n",
|
| 891 |
+
" if not response or len(response.strip()) <= 10:\n",
|
| 892 |
+
" print(f\"❌ Row {i}: failed after retries, skipping\")\n",
|
| 893 |
+
" continue\n",
|
| 894 |
+
" \n",
|
| 895 |
+
" # Parse and validate\n",
|
| 896 |
+
" parsed = parse_response(response)\n",
|
| 897 |
+
" \n",
|
| 898 |
+
" if all(v == \"Unknown\" for v in parsed.values()):\n",
|
| 899 |
+
" print(f\"❌ Row {i}: parsed as all Unknown, skipping\")\n",
|
| 900 |
+
" continue\n",
|
| 901 |
+
" \n",
|
| 902 |
+
" # Write fields\n",
|
| 903 |
+
" for key, value in parsed.items():\n",
|
| 904 |
+
" df.at[i, key] = value\n",
|
| 905 |
+
" \n",
|
| 906 |
+
" current_index = i + 1\n",
|
| 907 |
+
" \n",
|
| 908 |
+
" # GPU cleanup\n",
|
| 909 |
+
" if torch.cuda.is_available():\n",
|
| 910 |
+
" torch.cuda.empty_cache()\n",
|
| 911 |
+
" torch.cuda.synchronize()\n",
|
| 912 |
+
" \n",
|
| 913 |
+
" # Save progress\n",
|
| 914 |
+
" if (i + 1) % save_interval == 0 or (i + 1) == len(df):\n",
|
| 915 |
+
" df.to_csv(output_file, index=False)\n",
|
| 916 |
+
" index_file.write_text(str(current_index))\n",
|
| 917 |
+
" print(f\"💾 Progress saved after row {i+1}\")\n",
|
| 918 |
+
" \n",
|
| 919 |
+
" # Final save\n",
|
| 920 |
+
" df.to_csv(output_file, index=False)\n",
|
| 921 |
+
" index_file.write_text(str(current_index))\n",
|
| 922 |
+
" print(f\"✓ Finished annotation with {model_type}\")\n"
|
| 923 |
+
]
|
| 924 |
+
},
|
| 925 |
+
{
|
| 926 |
+
"cell_type": "markdown",
|
| 927 |
+
"id": "55da2f4c",
|
| 928 |
+
"metadata": {},
|
| 929 |
+
"source": [
|
| 930 |
+
"### Usage Examples\n",
|
| 931 |
+
"Run annotation with your chosen model."
|
| 932 |
+
]
|
| 933 |
+
},
|
| 934 |
+
{
|
| 935 |
+
"cell_type": "code",
|
| 936 |
+
"execution_count": null,
|
| 937 |
+
"id": "351ea40c",
|
| 938 |
+
"metadata": {},
|
| 939 |
+
"outputs": [],
|
| 940 |
+
"source": [
|
| 941 |
+
"# Example 1: Annotate with Mistral (13.5 GB VRAM)\n",
|
| 942 |
+
"# annotate_dataset(model_type='mistral', test_mode=False)\n",
|
| 943 |
+
"\n",
|
| 944 |
+
"# Example 2: Annotate with Gemma (56.3 GB VRAM)\n",
|
| 945 |
+
"# annotate_dataset(model_type='gemma', test_mode=False)\n",
|
| 946 |
+
"\n",
|
| 947 |
+
"# Example 3: Annotate with Qwen (32.7 GB VRAM, 8-bit)\n",
|
| 948 |
+
"# annotate_dataset(model_type='qwen', test_mode=False)\n",
|
| 949 |
+
"\n",
|
| 950 |
+
"# Test mode (first 100 rows)\n",
|
| 951 |
+
"# annotate_dataset(model_type='mistral', test_mode=True, test_size=100)\n"
|
| 952 |
+
]
|
| 953 |
+
},
|
| 954 |
+
{
|
| 955 |
+
"cell_type": "markdown",
|
| 956 |
+
"id": "6431d347-d80c-4e8b-83a7-531e5df95a72",
|
| 957 |
+
"metadata": {},
|
| 958 |
+
"source": [
|
| 959 |
+
"## EuroLLM-9B-Instruct"
|
| 960 |
+
]
|
| 961 |
+
},
|
| 962 |
+
{
|
| 963 |
+
"cell_type": "code",
|
| 964 |
+
"execution_count": null,
|
| 965 |
+
"id": "e8203abc-e7c3-4cb6-aaeb-fdc6933981fc",
|
| 966 |
+
"metadata": {},
|
| 967 |
+
"outputs": [],
|
| 968 |
+
"source": [
|
| 969 |
+
"import pandas as pd\n",
|
| 970 |
+
"import json\n",
|
| 971 |
+
"import time\n",
|
| 972 |
+
"import re\n",
|
| 973 |
+
"from pathlib import Path\n",
|
| 974 |
+
"from tqdm import tqdm\n",
|
| 975 |
+
"import torch\n",
|
| 976 |
+
"from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
|
| 977 |
+
"import signal\n",
|
| 978 |
+
"from contextlib import contextmanager\n",
|
| 979 |
+
"\n",
|
| 980 |
+
"current_dir = Path.cwd()\n",
|
| 981 |
+
"input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
|
| 982 |
+
"professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
|
| 983 |
+
"professions_mapped_file = current_dir.parent / \"misc/lists/professions_mapped.csv\"\n",
|
| 984 |
+
"# === PROCESS DATA ===\n",
|
| 985 |
+
"\n",
|
| 986 |
+
"\n",
|
| 987 |
+
"# === CONFIGURATION ===\n",
|
| 988 |
+
"TEST_MODE = False\n",
|
| 989 |
+
"TEST_SIZE = 100\n",
|
| 990 |
+
"MAX_ROWS = 50862\n",
|
| 991 |
+
"SAVE_INTERVAL = 10\n",
|
| 992 |
+
"\n",
|
| 993 |
+
"\n",
|
| 994 |
+
"index_file = current_dir.parent / \"misc/query_indicies/eurollm_local_query_index.txt\"\n",
|
| 995 |
+
"output_file = current_dir.parent / f\"data/CSV/eurollm_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
|
| 996 |
+
"\n",
|
| 997 |
+
"# Model settings\n",
|
| 998 |
+
"MODEL_NAME = \"utter-project/EuroLLM-9B-Instruct\"\n",
|
| 999 |
+
"#MODEL_NAME = \"Qwen/Qwen2.5-32B-Instruct\"\n",
|
| 1000 |
+
"#MODEL_NAME = \"Qwen/Qwen2.5-14B-Instruct\"\n",
|
| 1001 |
+
"#MODEL_NAME = \"Qwen/Qwen3-235B-A22B-Instruct-2507-FP8\"\n",
|
| 1002 |
+
"#MODEL_NAME = \"mistralai/Mistral-7B-Instruct-v0.3\"\n",
|
| 1003 |
+
"CACHE_DIR = current_dir.parent / \"data/models\"\n",
|
| 1004 |
+
"CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
|
| 1005 |
+
"\n",
|
| 1006 |
+
"# Define the SPECIFIC profession categories\n",
|
| 1007 |
+
"PROFESSION_CATEGORIES = [\n",
|
| 1008 |
+
" \"actor\",\n",
|
| 1009 |
+
" \"adult performer\",\n",
|
| 1010 |
+
" \"singer/musician\",\n",
|
| 1011 |
+
" \"model\",\n",
|
| 1012 |
+
" \"online personality\",\n",
|
| 1013 |
+
" \"public figure\",\n",
|
| 1014 |
+
" \"voice actor/ASMR\",\n",
|
| 1015 |
+
" \"sports professional\",\n",
|
| 1016 |
+
" \"tv personality\"\n",
|
| 1017 |
+
"]\n",
|
| 1018 |
+
"\n",
|
| 1019 |
+
"# === LOAD MODEL ===\n",
|
| 1020 |
+
"print(f\"Loading model: {MODEL_NAME}\")\n",
|
| 1021 |
+
"print(f\"Cache directory: {CACHE_DIR}\")\n",
|
| 1022 |
+
"print(f\"This may take a while on first run...\\n\")\n",
|
| 1023 |
+
"\n",
|
| 1024 |
+
"# Check GPU availability\n",
|
| 1025 |
+
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
|
| 1026 |
+
"print(f\"Device: {device}\")\n",
|
| 1027 |
+
"\n",
|
| 1028 |
+
"if device == \"cpu\":\n",
|
| 1029 |
+
" print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
|
| 1030 |
+
" print(\" Consider using a GPU or reducing model size.\")\n",
|
| 1031 |
+
"\n",
|
| 1032 |
+
"# Get HF token from credentials file\n",
|
| 1033 |
+
"import os\n",
|
| 1034 |
+
"credentials_dir = current_dir.parent / \"misc/credentials\"\n",
|
| 1035 |
+
"hf_token_file = credentials_dir / \"hf_token.txt\"\n",
|
| 1036 |
+
"\n",
|
| 1037 |
+
"HF_TOKEN = None\n",
|
| 1038 |
+
"if hf_token_file.exists():\n",
|
| 1039 |
+
" HF_TOKEN = hf_token_file.read_text().strip()\n",
|
| 1040 |
+
" print(\"✅ HF token loaded from credentials file\")\n",
|
| 1041 |
+
"else:\n",
|
| 1042 |
+
" print(\"⚠️ HF token file not found at:\", hf_token_file)\n",
|
| 1043 |
+
" print(\" The script will try to use cached credentials from 'huggingface-cli login'\")\n",
|
| 1044 |
+
" print(\" Or create the file: misc/credentials/hf_token.txt with your token\")\n",
|
| 1045 |
+
" HF_TOKEN = None # Will use cached token if available\n",
|
| 1046 |
+
"\n",
|
| 1047 |
+
"# Load tokenizer\n",
|
| 1048 |
+
"print(\"Loading tokenizer...\")\n",
|
| 1049 |
+
"try:\n",
|
| 1050 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 1051 |
+
" MODEL_NAME,\n",
|
| 1052 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 1053 |
+
" use_fast=True,\n",
|
| 1054 |
+
" token=HF_TOKEN\n",
|
| 1055 |
+
" )\n",
|
| 1056 |
+
"except Exception as e:\n",
|
| 1057 |
+
" print(f\"Failed with use_fast=True ({e}); trying use_fast=False...\")\n",
|
| 1058 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 1059 |
+
" MODEL_NAME,\n",
|
| 1060 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 1061 |
+
" use_fast=False,\n",
|
| 1062 |
+
" token=HF_TOKEN\n",
|
| 1063 |
+
" )\n",
|
| 1064 |
+
"\n",
|
| 1065 |
+
"# Ensure pad token is set\n",
|
| 1066 |
+
"if tokenizer.pad_token is None:\n",
|
| 1067 |
+
" tokenizer.pad_token = tokenizer.eos_token\n",
|
| 1068 |
+
"\n",
|
| 1069 |
+
"print(\"✅ Tokenizer loaded\")\n",
|
| 1070 |
+
"\n",
|
| 1071 |
+
"# Configure 8-bit quantization for A100\n",
|
| 1072 |
+
"print(\"Configuring 8-bit quantization...\")\n",
|
| 1073 |
+
"quantization_config = BitsAndBytesConfig(\n",
|
| 1074 |
+
" load_in_8bit=True,\n",
|
| 1075 |
+
" llm_int8_threshold=6.0,\n",
|
| 1076 |
+
" llm_int8_has_fp16_weight=False\n",
|
| 1077 |
+
")\n",
|
| 1078 |
+
"\n",
|
| 1079 |
+
"# Load model with 8-bit quantization\n",
|
| 1080 |
+
"print(\"Loading model with 8-bit quantization (this may take several minutes)...\")\n",
|
| 1081 |
+
"model = AutoModelForCausalLM.from_pretrained(\n",
|
| 1082 |
+
" MODEL_NAME,\n",
|
| 1083 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 1084 |
+
" quantization_config=quantization_config,\n",
|
| 1085 |
+
" device_map=\"auto\",\n",
|
| 1086 |
+
" trust_remote_code=False,\n",
|
| 1087 |
+
" token=HF_TOKEN\n",
|
| 1088 |
+
")\n",
|
| 1089 |
+
"model.eval()\n",
|
| 1090 |
+
"print(\"✅ Model loaded with 8-bit quantization\")\n",
|
| 1091 |
+
"\n",
|
| 1092 |
+
"# Report peak VRAM allocated by PyTorch so far\n",
|
| 1093 |
+
"if torch.cuda.is_available():\n",
|
| 1094 |
+
" vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
|
| 1095 |
+
" print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
|
| 1096 |
+
"\n",
|
| 1097 |
+
"# === LOAD DATA ===\n",
|
| 1098 |
+
"if output_file.exists():\n",
|
| 1099 |
+
" print(\"Loading annotated CSV...\")\n",
|
| 1100 |
+
" df = pd.read_csv(output_file)\n",
|
| 1101 |
+
"else:\n",
|
| 1102 |
+
" print(\"Loading raw input CSV...\")\n",
|
| 1103 |
+
" df = pd.read_csv(input_file)\n",
|
| 1104 |
+
"\n",
|
| 1105 |
+
"\n",
|
| 1106 |
+
"# Try to load profession mapping files\n",
|
| 1107 |
+
"try:\n",
|
| 1108 |
+
" professions_df = pd.read_csv(professions_file)\n",
|
| 1109 |
+
" print(f\"✅ Loaded professions.csv\")\n",
|
| 1110 |
+
"except FileNotFoundError:\n",
|
| 1111 |
+
" print(\"⚠️ Warning: professions.csv not found\")\n",
|
| 1112 |
+
"\n",
|
| 1113 |
+
"try:\n",
|
| 1114 |
+
" prof_mapped_df = pd.read_csv(professions_mapped_file)\n",
|
| 1115 |
+
" print(f\"✅ Loaded profession mapping with {len(prof_mapped_df)} categories\")\n",
|
| 1116 |
+
"except FileNotFoundError:\n",
|
| 1117 |
+
" print(\"⚠️ Warning: professions_mapped.csv not found, using default categories\")\n",
|
| 1118 |
+
"\n",
|
| 1119 |
+
"profession_str = \", \".join(PROFESSION_CATEGORIES)\n",
|
| 1120 |
+
"\n",
|
| 1121 |
+
"print(f\"Loaded {len(df)} rows\")\n",
|
| 1122 |
+
"print(f\"\\nProfession categories ({len(PROFESSION_CATEGORIES)}):\")\n",
|
| 1123 |
+
"for cat in PROFESSION_CATEGORIES:\n",
|
| 1124 |
+
" print(f\" - {cat}\")\n",
|
| 1125 |
+
"\n",
|
| 1126 |
+
"if TEST_MODE:\n",
|
| 1127 |
+
" print(f\"\\nRunning in TEST MODE with {TEST_SIZE} samples\")\n",
|
| 1128 |
+
" df = df.head(TEST_SIZE).copy()\n",
|
| 1129 |
+
"elif MAX_ROWS:\n",
|
| 1130 |
+
" df = df.head(MAX_ROWS).copy()\n",
|
| 1131 |
+
"\n",
|
| 1132 |
+
"# === CREATE PROMPTS (OPTIMIZED FOR CLEAN OUTPUTS) ===\n",
|
| 1133 |
+
"def create_prompt(row):\n",
|
| 1134 |
+
" \"\"\"Create prompt for EuroLLM annotation with strict formatting requirements.\"\"\"\n",
|
| 1135 |
+
" name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
|
| 1136 |
+
" \n",
|
| 1137 |
+
" # Gather hints\n",
|
| 1138 |
+
" hints = []\n",
|
| 1139 |
+
" if pd.notna(row.get('likely_profession')):\n",
|
| 1140 |
+
" hints.append(str(row['likely_profession']))\n",
|
| 1141 |
+
" if pd.notna(row.get('likely_nationality')):\n",
|
| 1142 |
+
" hints.append(str(row['likely_nationality']))\n",
|
| 1143 |
+
" if pd.notna(row.get('likely_country')):\n",
|
| 1144 |
+
" hints.append(str(row['likely_country']))\n",
|
| 1145 |
+
" \n",
|
| 1146 |
+
" # Add tags if we don't have enough hints\n",
|
| 1147 |
+
" if len(hints) < 3:\n",
|
| 1148 |
+
" for i in range(1, 8):\n",
|
| 1149 |
+
" tag_col = f'tag_{i}'\n",
|
| 1150 |
+
" if tag_col in row and pd.notna(row[tag_col]):\n",
|
| 1151 |
+
" tag_val = str(row[tag_col])\n",
|
| 1152 |
+
" if tag_val not in hints:\n",
|
| 1153 |
+
" hints.append(tag_val)\n",
|
| 1154 |
+
" if len(hints) >= 5:\n",
|
| 1155 |
+
" break\n",
|
| 1156 |
+
" \n",
|
| 1157 |
+
" hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
|
| 1158 |
+
" \n",
|
| 1159 |
+
" return f\"\"\"Extract information about '{name}'. \n",
|
| 1160 |
+
"Context hints (DO NOT copy these as professions): {hint_text}\n",
|
| 1161 |
+
"\n",
|
| 1162 |
+
"Respond with EXACTLY 5 numbered lines. Each line must contain ONLY the value, no labels or extra text.\n",
|
| 1163 |
+
"\n",
|
| 1164 |
+
"FORMAT REQUIREMENTS:\n",
|
| 1165 |
+
"1. Full legal name in Western order (first last). VALUE ONLY.\n",
|
| 1166 |
+
"2. Stage names/aliases, comma-separated. If none, write \"None\". VALUE ONLY.\n",
|
| 1167 |
+
"3. Gender: MUST be exactly one word: Male, Female, Other, or Unknown. VALUE ONLY.\n",
|
| 1168 |
+
"4. Professions: Choose up to 3 from this list ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality. Comma-separated. VALUE ONLY.\n",
|
| 1169 |
+
"5. Primary country: Country name only (e.g., \"China\", \"United States\", \"Colombia\"). VALUE ONLY.\n",
|
| 1170 |
+
"\n",
|
| 1171 |
+
"CRITICAL RULES FOR PROFESSIONS (Line 4):\n",
|
| 1172 |
+
"- ONLY use the exact profession categories listed above\n",
|
| 1173 |
+
"- DO NOT use descriptive words like \"sexy\", \"photorealistic\", \"celebrity\"\n",
|
| 1174 |
+
"- DO NOT copy the hint words as professions\n",
|
| 1175 |
+
"- If uncertain about profession, write \"Unknown\"\n",
|
| 1176 |
+
"- Valid professions are ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality\n",
|
| 1177 |
+
"- Actress = actor, streamer = online personality, YouTuber = online personality\n",
|
| 1178 |
+
"\n",
|
| 1179 |
+
"OTHER RULES:\n",
|
| 1180 |
+
"- Use \"Unknown\" when uncertain or for fictional characters\n",
|
| 1181 |
+
"- NO explanatory text, NO labels like \"Gender:\", NO prefixes\n",
|
| 1182 |
+
"- For multi-role people, list up to 3 categories by relevance\"\"\"\n",
|
| 1183 |
+
"\n",
|
| 1184 |
+
"# Create prompts\n",
|
| 1185 |
+
"print(\"\\nCreating prompts...\")\n",
|
| 1186 |
+
"df['prompt'] = df.apply(create_prompt, axis=1)\n",
|
| 1187 |
+
"print(\"✅ Prompts created\")\n",
|
| 1188 |
+
"\n",
|
| 1189 |
+
"@contextmanager\n",
|
| 1190 |
+
"def timeout(duration):\n",
|
| 1191 |
+
" def handler(signum, frame):\n",
|
| 1192 |
+
" raise TimeoutError(\"Operation timed out\")\n",
|
| 1193 |
+
" \n",
|
| 1194 |
+
" # Set the signal handler and alarm\n",
|
| 1195 |
+
" signal.signal(signal.SIGALRM, handler)\n",
|
| 1196 |
+
" signal.alarm(duration)\n",
|
| 1197 |
+
" try:\n",
|
| 1198 |
+
" yield\n",
|
| 1199 |
+
" finally:\n",
|
| 1200 |
+
" signal.alarm(0) # Disable the alarm\n",
|
| 1201 |
+
"\n",
|
| 1202 |
+
"\n",
|
| 1203 |
+
"def query_eurollm_local(prompt: str) -> str:\n",
|
| 1204 |
+
" \"\"\"Query EuroLLM locally via transformers with very low temperature.\"\"\"\n",
|
| 1205 |
+
" try:\n",
|
| 1206 |
+
" # Format as chat message for EuroLLM with strict system prompt\n",
|
| 1207 |
+
" messages = [\n",
|
| 1208 |
+
" {\"role\": \"system\", \"content\": \"You are a data extraction assistant. Respond with exactly 5 numbered lines containing ONLY values. No labels, no explanations, no prefixes. Follow the format precisely.\"},\n",
|
| 1209 |
+
" {\"role\": \"user\", \"content\": prompt}\n",
|
| 1210 |
+
" ]\n",
|
| 1211 |
+
" \n",
|
| 1212 |
+
" # Tokenize\n",
|
| 1213 |
+
" if hasattr(tokenizer, 'apply_chat_template') and tokenizer.chat_template is not None:\n",
|
| 1214 |
+
" text = tokenizer.apply_chat_template(\n",
|
| 1215 |
+
" messages,\n",
|
| 1216 |
+
" tokenize=False,\n",
|
| 1217 |
+
" add_generation_prompt=True\n",
|
| 1218 |
+
" )\n",
|
| 1219 |
+
" else:\n",
|
| 1220 |
+
" # Fallback for models without chat template\n",
|
| 1221 |
+
" text = f\"[INST] {prompt} [/INST]\"\n",
|
| 1222 |
+
" \n",
|
| 1223 |
+
" inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
|
| 1224 |
+
" \n",
|
| 1225 |
+
" # Generate with timeout and very low temperature\n",
|
| 1226 |
+
" with timeout(60):\n",
|
| 1227 |
+
" with torch.no_grad():\n",
|
| 1228 |
+
" outputs = model.generate(\n",
|
| 1229 |
+
" **inputs,\n",
|
| 1230 |
+
" max_new_tokens=100,\n",
|
| 1231 |
+
" temperature=0.01, # Very low temperature for more deterministic outputs\n",
|
| 1232 |
+
" do_sample=True, # Sampling must be on for temperature to take effect\n",
|
| 1233 |
+
" pad_token_id=tokenizer.eos_token_id\n",
|
| 1234 |
+
" )\n",
|
| 1235 |
+
" \n",
|
| 1236 |
+
" # Decode\n",
|
| 1237 |
+
" generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
|
| 1238 |
+
" response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
|
| 1239 |
+
" \n",
|
| 1240 |
+
" return response.strip()\n",
|
| 1241 |
+
" \n",
|
| 1242 |
+
" except TimeoutError:\n",
|
| 1243 |
+
" print(\"[ERROR] Generation timed out after 60 seconds\")\n",
|
| 1244 |
+
" return None\n",
|
| 1245 |
+
" except Exception as e:\n",
|
| 1246 |
+
" print(f\"Generation error: {e}\")\n",
|
| 1247 |
+
" import traceback\n",
|
| 1248 |
+
" traceback.print_exc()\n",
|
| 1249 |
+
" return None\n",
|
| 1250 |
+
"\n",
|
| 1251 |
+
" \n",
|
| 1252 |
+
"# === PARSE RESPONSE WITH CLEANING ===\n",
|
| 1253 |
+
"def parse_response(response):\n",
|
| 1254 |
+
" \"\"\"Parse EuroLLM response into structured fields with cleaning.\"\"\"\n",
|
| 1255 |
+
" if not response:\n",
|
| 1256 |
+
" return {\n",
|
| 1257 |
+
" 'full_name': 'Unknown',\n",
|
| 1258 |
+
" 'aliases': 'Unknown',\n",
|
| 1259 |
+
" 'gender': 'Unknown',\n",
|
| 1260 |
+
" 'profession_llm': 'Unknown',\n",
|
| 1261 |
+
" 'country': 'Unknown'\n",
|
| 1262 |
+
" }\n",
|
| 1263 |
+
" \n",
|
| 1264 |
+
" # Valid profession categories\n",
|
| 1265 |
+
" VALID_PROFESSIONS = {\n",
|
| 1266 |
+
" \"actor\", \"adult performer\", \"singer/musician\", \"model\", \n",
|
| 1267 |
+
" \"online personality\", \"public figure\", \"voice actor/asmr\", \n",
|
| 1268 |
+
" \"sports professional\", \"tv personality\"\n",
|
| 1269 |
+
" }\n",
|
| 1270 |
+
" \n",
|
| 1271 |
+
" # Split into lines and clean\n",
|
| 1272 |
+
" lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
|
| 1273 |
+
" \n",
|
| 1274 |
+
" # Initialize with Unknown values\n",
|
| 1275 |
+
" fields = {\n",
|
| 1276 |
+
" 'full_name': 'Unknown',\n",
|
| 1277 |
+
" 'aliases': 'Unknown',\n",
|
| 1278 |
+
" 'gender': 'Unknown',\n",
|
| 1279 |
+
" 'profession_llm': 'Unknown',\n",
|
| 1280 |
+
" 'country': 'Unknown'\n",
|
| 1281 |
+
" }\n",
|
| 1282 |
+
" \n",
|
| 1283 |
+
" # Extract information from each numbered line\n",
|
| 1284 |
+
" for line in lines:\n",
|
| 1285 |
+
" if line.startswith('1.'):\n",
|
| 1286 |
+
" fields['full_name'] = line[2:].strip()\n",
|
| 1287 |
+
" elif line.startswith('2.'):\n",
|
| 1288 |
+
" fields['aliases'] = line[2:].strip()\n",
|
| 1289 |
+
" elif line.startswith('3.'):\n",
|
| 1290 |
+
" # Clean gender field - remove any labels\n",
|
| 1291 |
+
" gender_raw = line[2:].strip()\n",
|
| 1292 |
+
" # Remove common prefixes\n",
|
| 1293 |
+
" gender_raw = re.sub(r'^(Gender:|gender:)\\s*', '', gender_raw, flags=re.IGNORECASE)\n",
|
| 1294 |
+
" # Extract just the gender word\n",
|
| 1295 |
+
" gender_match = re.search(r'\\b(Male|Female|Other|Unknown)\\b', gender_raw, re.IGNORECASE)\n",
|
| 1296 |
+
" fields['gender'] = gender_match.group(1).capitalize() if gender_match else gender_raw\n",
|
| 1297 |
+
" elif line.startswith('4.'):\n",
|
| 1298 |
+
" # Clean and validate profession field\n",
|
| 1299 |
+
" profession_raw = line[2:].strip()\n",
|
| 1300 |
+
" \n",
|
| 1301 |
+
" # Split by comma and validate each profession\n",
|
| 1302 |
+
" professions = [p.strip().lower() for p in profession_raw.split(',')]\n",
|
| 1303 |
+
" valid_profs = []\n",
|
| 1304 |
+
" \n",
|
| 1305 |
+
" for prof in professions:\n",
|
| 1306 |
+
" # Check if it's a valid profession\n",
|
| 1307 |
+
" if prof in VALID_PROFESSIONS:\n",
|
| 1308 |
+
" valid_profs.append(prof)\n",
|
| 1309 |
+
" # Check for common invalid entries\n",
|
| 1310 |
+
" elif prof in ['unknown', '']:\n",
|
| 1311 |
+
" continue\n",
|
| 1312 |
+
" # Reject descriptive words that aren't professions\n",
|
| 1313 |
+
" elif prof in ['sexy', 'photorealistic', 'celebrity', 'famous', 'popular', \n",
|
| 1314 |
+
" 'beautiful', 'attractive', 'hot', 'gorgeous']:\n",
|
| 1315 |
+
" continue\n",
|
| 1316 |
+
" # If it looks like it might be close to a valid profession, keep it\n",
|
| 1317 |
+
" elif any(valid in prof for valid in VALID_PROFESSIONS):\n",
|
| 1318 |
+
" # Try to extract the valid part\n",
|
| 1319 |
+
" for valid in VALID_PROFESSIONS:\n",
|
| 1320 |
+
" if valid in prof:\n",
|
| 1321 |
+
" valid_profs.append(valid)\n",
|
| 1322 |
+
" break\n",
|
| 1323 |
+
" \n",
|
| 1324 |
+
" # Set the cleaned professions or Unknown if none are valid\n",
|
| 1325 |
+
" if valid_profs:\n",
|
| 1326 |
+
" fields['profession_llm'] = ', '.join(valid_profs)\n",
|
| 1327 |
+
" else:\n",
|
| 1328 |
+
" fields['profession_llm'] = 'Unknown'\n",
|
| 1329 |
+
" \n",
|
| 1330 |
+
" elif line.startswith('5.'):\n",
|
| 1331 |
+
" # Clean country field - remove any labels\n",
|
| 1332 |
+
" country_raw = line[2:].strip()\n",
|
| 1333 |
+
" # Remove common prefixes like \"Primary country:\", \"Country:\", etc.\n",
|
| 1334 |
+
" country_raw = re.sub(r'^(Primary\\s+)?(associated\\s+)?country:\\s*', '', country_raw, flags=re.IGNORECASE)\n",
|
| 1335 |
+
" fields['country'] = country_raw\n",
|
| 1336 |
+
" \n",
|
| 1337 |
+
" return fields\n",
|
| 1338 |
+
"\n",
|
| 1339 |
+
"# === PROCESS DATA ===\n",
|
| 1340 |
+
"index_file.parent.mkdir(parents=True, exist_ok=True)\n",
|
| 1341 |
+
"\n",
|
| 1342 |
+
"# Load index\n",
|
| 1343 |
+
"current_index = 0\n",
|
| 1344 |
+
"if index_file.exists():\n",
|
| 1345 |
+
" try:\n",
|
| 1346 |
+
" current_index = int(index_file.read_text().strip())\n",
|
| 1347 |
+
" except (ValueError, OSError):\n",
|
| 1348 |
+
" current_index = 0\n",
|
| 1349 |
+
"\n",
|
| 1350 |
+
"print(f\"Resuming from index {current_index}\")\n",
|
| 1351 |
+
"\n",
|
| 1352 |
+
"start_time = time.time()\n",
|
| 1353 |
+
"\n",
|
| 1354 |
+
"for i in tqdm(range(current_index, len(df)), desc=\"EuroLLM Local\"):\n",
|
| 1355 |
+
"\n",
|
| 1356 |
+
" prompt = df.at[i, \"prompt\"]\n",
|
| 1357 |
+
"\n",
|
| 1358 |
+
" # -------- MODEL QUERY WITH RETRIES --------\n",
|
| 1359 |
+
" response = None\n",
|
| 1360 |
+
" for attempt in range(3):\n",
|
| 1361 |
+
" response = query_eurollm_local(prompt)\n",
|
| 1362 |
+
" \n",
|
| 1363 |
+
" # DEBUG: Print first few responses to see what's happening\n",
|
| 1364 |
+
" if i < 5:\n",
|
| 1365 |
+
" print(f\"\\n=== DEBUG Row {i}, Attempt {attempt+1} ===\")\n",
|
| 1366 |
+
" print(f\"Response length: {len(response) if response else 0}\")\n",
|
| 1367 |
+
" print(f\"Response: {response[:500] if response else 'None'}\")\n",
|
| 1368 |
+
" print(\"=\" * 50)\n",
|
| 1369 |
+
" \n",
|
| 1370 |
+
" # Valid response?\n",
|
| 1371 |
+
" if response and len(response.strip()) > 10:\n",
|
| 1372 |
+
" break\n",
|
| 1373 |
+
" \n",
|
| 1374 |
+
" print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
|
| 1375 |
+
" time.sleep(0.5)\n",
|
| 1376 |
+
"\n",
|
| 1377 |
+
" # If still invalid → DO NOT overwrite previous data\n",
|
| 1378 |
+
" if not response or len(response.strip()) <= 10:\n",
|
| 1379 |
+
" print(f\"❌ Row {i}: failed after retries, not writing, not advancing index\")\n",
|
| 1380 |
+
" continue\n",
|
| 1381 |
+
"\n",
|
| 1382 |
+
" parsed = parse_response(response)\n",
|
| 1383 |
+
"\n",
|
| 1384 |
+
" # DEBUG: Print first few parsed results\n",
|
| 1385 |
+
" if i < 5:\n",
|
| 1386 |
+
" print(f\"\\n=== PARSED Row {i} ===\")\n",
|
| 1387 |
+
" for key, value in parsed.items():\n",
|
| 1388 |
+
" print(f\" {key}: {value}\")\n",
|
| 1389 |
+
" print(\"=\" * 50)\n",
|
| 1390 |
+
"\n",
|
| 1391 |
+
" # Additional safety: skip rows that parsed as all 'Unknown'\n",
|
| 1392 |
+
" if all(v == \"Unknown\" for v in parsed.values()):\n",
|
| 1393 |
+
" print(f\"❌ Row {i}: parsed as all Unknown (likely model crash); skipping.\")\n",
|
| 1394 |
+
" continue\n",
|
| 1395 |
+
"\n",
|
| 1396 |
+
" # -------- WRITE PARSED FIELDS SAFELY --------\n",
|
| 1397 |
+
" for key, value in parsed.items():\n",
|
| 1398 |
+
" df.at[i, key] = value\n",
|
| 1399 |
+
"\n",
|
| 1400 |
+
" # Advance progress ONLY after successful write\n",
|
| 1401 |
+
" current_index = i + 1\n",
|
| 1402 |
+
"\n",
|
| 1403 |
+
" # -------- GPU MEMORY CLEANUP --------\n",
|
| 1404 |
+
" if torch.cuda.is_available():\n",
|
| 1405 |
+
" torch.cuda.empty_cache()\n",
|
| 1406 |
+
" torch.cuda.synchronize()\n",
|
| 1407 |
+
"\n",
|
| 1408 |
+
" # -------- PERIODIC SAVE (SAME AS DEEPSEEK VERSION) --------\n",
|
| 1409 |
+
" if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
|
| 1410 |
+
" df.to_csv(output_file, index=False)\n",
|
| 1411 |
+
" with open(index_file, \"w\") as f:\n",
|
| 1412 |
+
" f.write(str(current_index))\n",
|
| 1413 |
+
" print(f\"💾 Progress saved after row {i+1}\")\n",
|
| 1414 |
+
"\n",
|
| 1415 |
+
"# Final save\n",
|
| 1416 |
+
"df.to_csv(output_file, index=False)\n",
|
| 1417 |
+
"index_file.write_text(str(current_index))\n",
|
| 1418 |
+
"print(\"✅ Finished full dataset.\")"
|
| 1419 |
+
]
|
| 1420 |
+
},
|
| 1421 |
+
{
|
| 1422 |
+
"cell_type": "markdown",
|
| 1423 |
+
"id": "472e5ac2-ec04-4bfa-8a67-116277238c15",
|
| 1424 |
+
"metadata": {},
|
| 1425 |
+
"source": [
|
| 1426 |
+
"## Mistral 24b instruct"
|
| 1427 |
+
]
|
| 1428 |
+
},
|
| 1429 |
+
{
|
| 1430 |
+
"cell_type": "code",
|
| 1431 |
+
"execution_count": null,
|
| 1432 |
+
"id": "a55a5e30-83f3-4f7c-a537-b1216d4e8a07",
|
| 1433 |
+
"metadata": {
|
| 1434 |
+
"execution": {
|
| 1435 |
+
"iopub.execute_input": "2025-12-09T22:16:21.002786Z",
|
| 1436 |
+
"iopub.status.busy": "2025-12-09T22:16:21.002337Z"
|
| 1437 |
+
}
|
| 1438 |
+
},
|
| 1439 |
+
"outputs": [
|
| 1440 |
+
{
|
| 1441 |
+
"name": "stderr",
|
| 1442 |
+
"output_type": "stream",
|
| 1443 |
+
"text": [
|
| 1444 |
+
"/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
|
| 1445 |
+
" from .autonotebook import tqdm as notebook_tqdm\n"
|
| 1446 |
+
]
|
| 1447 |
+
},
|
| 1448 |
+
{
|
| 1449 |
+
"name": "stdout",
|
| 1450 |
+
"output_type": "stream",
|
| 1451 |
+
"text": [
|
| 1452 |
+
"Loading model: mistralai/Mistral-Small-Instruct-2409\n",
|
| 1453 |
+
"Cache directory: /shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/data/models\n",
|
| 1454 |
+
"This may take a while on first run (~65GB download)...\n",
|
| 1455 |
+
"\n",
|
| 1456 |
+
"Device: cuda\n",
|
| 1457 |
+
"Loading tokenizer...\n",
|
| 1458 |
+
"✅ Tokenizer loaded\n",
|
| 1459 |
+
"Loading model (this may take several minutes)...\n"
|
| 1460 |
+
]
|
| 1461 |
+
},
|
| 1462 |
+
{
|
| 1463 |
+
"name": "stderr",
|
| 1464 |
+
"output_type": "stream",
|
| 1465 |
+
"text": [
|
| 1466 |
+
"`torch_dtype` is deprecated! Use `dtype` instead!\n",
|
| 1467 |
+
"Loading checkpoint shards: 100%|██████████| 9/9 [02:42<00:00, 18.06s/it]\n"
|
| 1468 |
+
]
|
| 1469 |
+
},
|
| 1470 |
+
{
|
| 1471 |
+
"name": "stdout",
|
| 1472 |
+
"output_type": "stream",
|
| 1473 |
+
"text": [
|
| 1474 |
+
"✅ Model loaded\n",
|
| 1475 |
+
"VRAM used: 21.40 GB\n",
|
| 1476 |
+
"\n",
|
| 1477 |
+
"Loading raw input CSV...\n",
|
| 1478 |
+
"Loaded 50861 rows from input file\n",
|
| 1479 |
+
"Found existing annotations, merging...\n"
|
| 1480 |
+
]
|
| 1481 |
+
},
|
| 1482 |
+
{
|
| 1483 |
+
"name": "stderr",
|
| 1484 |
+
"output_type": "stream",
|
| 1485 |
+
"text": [
|
| 1486 |
+
"/tmp/ipykernel_3104208/1997558719.py:113: DtypeWarning: Columns (52,53,54,55,56) have mixed types. Specify dtype option on import or set low_memory=False.\n",
|
| 1487 |
+
" existing_df = pd.read_csv(output_file)\n"
|
| 1488 |
+
]
|
| 1489 |
+
},
|
| 1490 |
+
{
|
| 1491 |
+
"name": "stdout",
|
| 1492 |
+
"output_type": "stream",
|
| 1493 |
+
"text": [
|
| 1494 |
+
"Existing annotations has 50861 rows\n",
|
| 1495 |
+
"Merged annotations, continuing with 50861 total rows\n",
|
| 1496 |
+
"✅ Loaded professions.csv\n",
|
| 1497 |
+
"✅ Loaded profession mapping with 9 categories\n",
|
| 1498 |
+
"Loaded 50861 rows\n",
|
| 1499 |
+
"\n",
|
| 1500 |
+
"Profession categories (9):\n",
|
| 1501 |
+
" - actor\n",
|
| 1502 |
+
" - adult performer\n",
|
| 1503 |
+
" - singer/musician\n",
|
| 1504 |
+
" - model\n",
|
| 1505 |
+
" - online personality\n",
|
| 1506 |
+
" - public figure\n",
|
| 1507 |
+
" - voice actor/ASMR\n",
|
| 1508 |
+
" - sports professional\n",
|
| 1509 |
+
" - tv personality\n",
|
| 1510 |
+
"\n",
|
| 1511 |
+
"Creating prompts...\n",
|
| 1512 |
+
"✅ Prompts created\n",
|
| 1513 |
+
"Resuming from index 8810\n"
|
| 1514 |
+
]
|
| 1515 |
+
},
|
| 1516 |
+
{
|
| 1517 |
+
"name": "stderr",
|
| 1518 |
+
"output_type": "stream",
|
| 1519 |
+
"text": [
|
| 1520 |
+
"Mistral Local: 0%| | 0/42051 [00:00<?, ?it/s]/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py:181: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization\n",
|
| 1521 |
+
" warnings.warn(f\"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization\")\n",
|
| 1522 |
+
"Mistral Local: 0%| | 7/42051 [00:57<93:01:03, 7.96s/it] "
|
| 1523 |
+
]
|
| 1524 |
+
}
|
| 1525 |
+
],
|
| 1526 |
+
"source": [
|
| 1527 |
+
"import pandas as pd\n",
|
| 1528 |
+
"import json\n",
|
| 1529 |
+
"import time\n",
|
| 1530 |
+
"import re\n",
|
| 1531 |
+
"from pathlib import Path\n",
|
| 1532 |
+
"from tqdm import tqdm\n",
|
| 1533 |
+
"import torch\n",
|
| 1534 |
+
"from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
|
| 1535 |
+
"\n",
|
| 1536 |
+
"current_dir = Path.cwd()\n",
|
| 1537 |
+
"input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
|
| 1538 |
+
"professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
|
| 1539 |
+
"professions_mapped_file = current_dir.parent / \"misc/lists/professions_mapped.csv\"\n",
|
| 1540 |
+
"# === PROCESS DATA ===\n",
|
| 1541 |
+
"\n",
|
| 1542 |
+
"\n",
|
| 1543 |
+
"# === CONFIGURATION ===\n",
|
| 1544 |
+
"TEST_MODE = False\n",
|
| 1545 |
+
"TEST_SIZE = 100\n",
|
| 1546 |
+
"MAX_ROWS = 50862\n",
|
| 1547 |
+
"SAVE_INTERVAL = 10\n",
|
| 1548 |
+
"\n",
|
| 1549 |
+
"output_file = current_dir.parent / f\"data/CSV/mistral24_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
|
| 1550 |
+
"index_file = current_dir.parent / \"misc/query_indicies/mistral24_local_query_index.txt\"\n",
|
| 1551 |
+
"\n",
|
| 1552 |
+
"\n",
|
| 1553 |
+
"# Model settings\n",
|
| 1554 |
+
"#MODEL_NAME = \"mistralai/Mistral-Small-3.1-24B-Instruct-2503\"\n",
|
| 1555 |
+
"MODEL_NAME = \"mistralai/Mistral-Small-Instruct-2409\"\n",
|
| 1556 |
+
"#MODEL_NAME = \"mistralai/Mistral-7B-Instruct-v0.3\"\n",
|
| 1557 |
+
"CACHE_DIR = current_dir.parent / \"data/models\"\n",
|
| 1558 |
+
"CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
|
| 1559 |
+
"\n",
|
| 1560 |
+
"# Define the SPECIFIC profession categories\n",
|
| 1561 |
+
"PROFESSION_CATEGORIES = [\n",
|
| 1562 |
+
" \"actor\",\n",
|
| 1563 |
+
" \"adult performer\",\n",
|
| 1564 |
+
" \"singer/musician\",\n",
|
| 1565 |
+
" \"model\",\n",
|
| 1566 |
+
" \"online personality\",\n",
|
| 1567 |
+
" \"public figure\",\n",
|
| 1568 |
+
" \"voice actor/ASMR\",\n",
|
| 1569 |
+
" \"sports professional\",\n",
|
| 1570 |
+
" \"tv personality\"\n",
|
| 1571 |
+
"]\n",
|
| 1572 |
+
"\n",
|
| 1573 |
+
"# === LOAD MODEL ===\n",
|
| 1574 |
+
"print(f\"Loading model: {MODEL_NAME}\")\n",
|
| 1575 |
+
"print(f\"Cache directory: {CACHE_DIR}\")\n",
|
| 1576 |
+
"print(f\"This may take a while on first run (~65GB download)...\\n\")\n",
|
| 1577 |
+
"\n",
|
| 1578 |
+
"# Check GPU availability\n",
|
| 1579 |
+
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
|
| 1580 |
+
"print(f\"Device: {device}\")\n",
|
| 1581 |
+
"\n",
|
| 1582 |
+
"if device == \"cpu\":\n",
|
| 1583 |
+
" print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
|
| 1584 |
+
" print(\" Consider using a GPU or reducing model size.\")\n",
|
| 1585 |
+
"\n",
|
| 1586 |
+
"# Load tokenizer\n",
|
| 1587 |
+
"print(\"Loading tokenizer...\")\n",
|
| 1588 |
+
"try:\n",
|
| 1589 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 1590 |
+
" MODEL_NAME,\n",
|
| 1591 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 1592 |
+
" use_fast=True\n",
|
| 1593 |
+
" )\n",
|
| 1594 |
+
"except Exception as e:\n",
|
| 1595 |
+
" print(f\"Failed with use_fast=True ({e}), trying use_fast=False...\")\n",
|
| 1596 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 1597 |
+
" MODEL_NAME,\n",
|
| 1598 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 1599 |
+
" use_fast=False\n",
|
| 1600 |
+
" )\n",
|
| 1601 |
+
"\n",
|
| 1602 |
+
"# Ensure pad token is set\n",
|
| 1603 |
+
"if tokenizer.pad_token is None:\n",
|
| 1604 |
+
" tokenizer.pad_token = tokenizer.eos_token\n",
|
| 1605 |
+
"\n",
|
| 1606 |
+
"print(\"✅ Tokenizer loaded\")\n",
|
| 1607 |
+
"\n",
|
| 1608 |
+
"quantization_config = BitsAndBytesConfig(\n",
|
| 1609 |
+
" load_in_8bit=True\n",
|
| 1610 |
+
")\n",
|
| 1611 |
+
"\n",
|
| 1612 |
+
"\n",
|
| 1613 |
+
"# Load model with optimizations\n",
|
| 1614 |
+
"print(\"Loading model (this may take several minutes)...\")\n",
|
| 1615 |
+
"model = AutoModelForCausalLM.from_pretrained(\n",
|
| 1616 |
+
" MODEL_NAME,\n",
|
| 1617 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 1618 |
+
" dtype=torch.bfloat16,\n",
|
| 1619 |
+
" quantization_config=quantization_config,\n",
|
| 1620 |
+
" device_map=\"auto\",\n",
|
| 1621 |
+
" trust_remote_code=False\n",
|
| 1622 |
+
")\n",
|
| 1623 |
+
"model.eval()\n",
|
| 1624 |
+
"print(\"✅ Model loaded\")\n",
|
| 1625 |
+
"\n",
|
| 1626 |
+
"# Check VRAM usage\n",
|
| 1627 |
+
"if torch.cuda.is_available():\n",
|
| 1628 |
+
" vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
|
| 1629 |
+
" print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
|
| 1630 |
+
"\n",
|
| 1631 |
+
"# === LOAD DATA ===\n",
|
| 1632 |
+
"print(\"Loading raw input CSV...\")\n",
|
| 1633 |
+
"df = pd.read_csv(input_file) # ALWAYS load the full input\n",
|
| 1634 |
+
"print(f\"Loaded {len(df)} rows from input file\")\n",
|
| 1635 |
+
"\n",
|
| 1636 |
+
"# If we have previous annotations, merge them\n",
|
| 1637 |
+
"if output_file.exists():\n",
|
| 1638 |
+
" print(\"Found existing annotations, merging...\")\n",
|
| 1639 |
+
" existing_df = pd.read_csv(output_file)\n",
|
| 1640 |
+
" print(f\"Existing annotations has {len(existing_df)} rows\")\n",
|
| 1641 |
+
" \n",
|
| 1642 |
+
" # Update df with existing annotations\n",
|
| 1643 |
+
" # Only update the columns that were annotated\n",
|
| 1644 |
+
" annotation_cols = ['full_name', 'aliases', 'gender', 'profession_llm', 'country']\n",
|
| 1645 |
+
" for col in annotation_cols:\n",
|
| 1646 |
+
" if col in existing_df.columns:\n",
|
| 1647 |
+
" df[col] = existing_df[col][:len(df)] # Make sure we don't exceed df length\n",
|
| 1648 |
+
" \n",
|
| 1649 |
+
" print(f\"Merged annotations, continuing with {len(df)} total rows\")\n",
|
| 1650 |
+
"\n",
|
| 1651 |
+
"\n",
|
| 1652 |
+
"# Try to load profession mapping files\n",
|
| 1653 |
+
"try:\n",
|
| 1654 |
+
" professions_df = pd.read_csv(professions_file)\n",
|
| 1655 |
+
" print(f\"✅ Loaded professions.csv\")\n",
|
| 1656 |
+
"except FileNotFoundError:\n",
|
| 1657 |
+
" print(\"⚠️ Warning: professions.csv not found\")\n",
|
| 1658 |
+
"\n",
|
| 1659 |
+
"try:\n",
|
| 1660 |
+
" prof_mapped_df = pd.read_csv(professions_mapped_file)\n",
|
| 1661 |
+
" print(f\"✅ Loaded profession mapping with {len(prof_mapped_df)} categories\")\n",
|
| 1662 |
+
"except FileNotFoundError:\n",
|
| 1663 |
+
" print(\"⚠️ Warning: professions_mapped.csv not found, using default categories\")\n",
|
| 1664 |
+
"\n",
|
| 1665 |
+
"profession_str = \", \".join(PROFESSION_CATEGORIES)\n",
|
| 1666 |
+
"\n",
|
| 1667 |
+
"print(f\"Loaded {len(df)} rows\")\n",
|
| 1668 |
+
"print(f\"\\nProfession categories ({len(PROFESSION_CATEGORIES)}):\")\n",
|
| 1669 |
+
"for cat in PROFESSION_CATEGORIES:\n",
|
| 1670 |
+
" print(f\" - {cat}\")\n",
|
| 1671 |
+
"\n",
|
| 1672 |
+
"if TEST_MODE:\n",
|
| 1673 |
+
" print(f\"\\nRunning in TEST MODE with {TEST_SIZE} samples\")\n",
|
| 1674 |
+
" df = df.head(TEST_SIZE).copy()\n",
|
| 1675 |
+
"elif MAX_ROWS:\n",
|
| 1676 |
+
" df = df.head(MAX_ROWS).copy()\n",
|
| 1677 |
+
"\n",
|
| 1678 |
+
"# === CREATE PROMPTS (DEEPSEEK STYLE) ===\n",
|
| 1679 |
+
"def create_prompt(row):\n",
|
| 1680 |
+
" \"\"\"Create prompt for Mistral annotation with specific profession categories.\"\"\"\n",
|
| 1681 |
+
" name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
|
| 1682 |
+
" \n",
|
| 1683 |
+
" # Gather hints\n",
|
| 1684 |
+
" hints = []\n",
|
| 1685 |
+
" if pd.notna(row.get('likely_profession')):\n",
|
| 1686 |
+
" hints.append(str(row['likely_profession']))\n",
|
| 1687 |
+
" if pd.notna(row.get('likely_nationality')):\n",
|
| 1688 |
+
" hints.append(str(row['likely_nationality']))\n",
|
| 1689 |
+
" if pd.notna(row.get('likely_country')):\n",
|
| 1690 |
+
" hints.append(str(row['likely_country']))\n",
|
| 1691 |
+
" \n",
|
| 1692 |
+
" # Add tags if we don't have enough hints\n",
|
| 1693 |
+
" if len(hints) < 3:\n",
|
| 1694 |
+
" for i in range(1, 8):\n",
|
| 1695 |
+
" tag_col = f'tag_{i}'\n",
|
| 1696 |
+
" if tag_col in row and pd.notna(row[tag_col]):\n",
|
| 1697 |
+
" tag_val = str(row[tag_col])\n",
|
| 1698 |
+
" if tag_val not in hints:\n",
|
| 1699 |
+
" hints.append(tag_val)\n",
|
| 1700 |
+
" if len(hints) >= 5:\n",
|
| 1701 |
+
" break\n",
|
| 1702 |
+
" \n",
|
| 1703 |
+
" hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
|
| 1704 |
+
" \n",
|
| 1705 |
+
" return f\"\"\"Given '{name}' ({hint_text}), provide:\n",
|
| 1706 |
+
"1. Full legal name (Western order if non-latin script)\n",
|
| 1707 |
+
"2. Any stage names/aliases (comma separated)\n",
|
| 1708 |
+
"3. Gender (Male/Female/Other/Unknown)\n",
|
| 1709 |
+
"4. Top 3 most likely professions from ONLY these categories:\n",
|
| 1710 |
+
" - actor\n",
|
| 1711 |
+
" - adult performer\n",
|
| 1712 |
+
" - singer/musician\n",
|
| 1713 |
+
" - model\n",
|
| 1714 |
+
" - online personality (includes streamers, cosplayers, influencers)\n",
|
| 1715 |
+
" - public figure (includes politicians, activists, journalists, authors)\n",
|
| 1716 |
+
" - voice actor/ASMR\n",
|
| 1717 |
+
" - sports professional\n",
|
| 1718 |
+
" - tv personality (includes hosts, presenters, reality TV)\n",
|
| 1719 |
+
"\n",
|
| 1720 |
+
"5. Primary country associated\n",
|
| 1721 |
+
"\n",
|
| 1722 |
+
"IMPORTANT:\n",
|
| 1723 |
+
"- Choose professions ONLY from the 9 categories above\n",
|
| 1724 |
+
"- Provide up to 3 professions, comma-separated, ordered by relevance\n",
|
| 1725 |
+
"- Be SPECIFIC: choose the most accurate category for each role\n",
|
| 1726 |
+
"- \"online personality\" includes: streamers, cosplayers, YouTubers, influencers, content creators\n",
|
| 1727 |
+
"- Use 'Unknown' when uncertain or for fictional characters/places\n",
|
| 1728 |
+
"- For multi-role people, list all relevant categories (e.g., \"actor, singer/musician, online personality\")\n",
|
| 1729 |
+
"- For country respond with one word only, for example China or Colombia\n",
|
| 1730 |
+
"- actress = actor\n",
|
| 1731 |
+
"\n",
|
| 1732 |
+
"Respond with exactly 5 numbered lines.\"\"\"\n",
|
| 1733 |
+
"\n",
|
| 1734 |
+
"# Create prompts\n",
|
| 1735 |
+
"print(\"\\nCreating prompts...\")\n",
|
| 1736 |
+
"df['prompt'] = df.apply(create_prompt, axis=1)\n",
|
| 1737 |
+
"print(\"✅ Prompts created\")\n",
|
| 1738 |
+
"\n",
|
| 1739 |
+
"# === QUERY MISTRAL LOCAL ===\n",
|
| 1740 |
+
"def query_mistral_local(prompt: str) -> str:\n",
|
| 1741 |
+
" \"\"\"Query Mistral locally via transformers.\"\"\"\n",
|
| 1742 |
+
" try:\n",
|
| 1743 |
+
" # Format as chat message for Mistral\n",
|
| 1744 |
+
" messages = [\n",
|
| 1745 |
+
" {\"role\": \"system\", \"content\": \"You are an assistant that extracts key data on a person based on the name. Respond with exactly 5 numbered lines. For professions, choose ONLY from these categories: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality.\"},\n",
|
| 1746 |
+
" {\"role\": \"user\", \"content\": prompt}\n",
|
| 1747 |
+
" ]\n",
|
| 1748 |
+
" \n",
|
| 1749 |
+
" # Tokenize\n",
|
| 1750 |
+
" if hasattr(tokenizer, 'apply_chat_template'):\n",
|
| 1751 |
+
" text = tokenizer.apply_chat_template(\n",
|
| 1752 |
+
" messages,\n",
|
| 1753 |
+
" tokenize=False,\n",
|
| 1754 |
+
" add_generation_prompt=True\n",
|
| 1755 |
+
" )\n",
|
| 1756 |
+
" else:\n",
|
| 1757 |
+
" # Fallback for older tokenizers\n",
|
| 1758 |
+
" text = f\"[INST] {prompt} [/INST]\"\n",
|
| 1759 |
+
" \n",
|
| 1760 |
+
" inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
|
| 1761 |
+
" \n",
|
| 1762 |
+
" # Generate\n",
|
| 1763 |
+
" with torch.no_grad():\n",
|
| 1764 |
+
" outputs = model.generate(\n",
|
| 1765 |
+
" **inputs,\n",
|
| 1766 |
+
" max_new_tokens=512,\n",
|
| 1767 |
+
" temperature=0.05,\n",
|
| 1768 |
+
" do_sample=True,\n",
|
| 1769 |
+
" top_p=0.8,\n",
|
| 1770 |
+
" pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id\n",
|
| 1771 |
+
" )\n",
|
| 1772 |
+
" \n",
|
| 1773 |
+
" # Decode\n",
|
| 1774 |
+
" generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
|
| 1775 |
+
" response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
|
| 1776 |
+
" \n",
|
| 1777 |
+
" return response.strip()\n",
|
| 1778 |
+
" \n",
|
| 1779 |
+
" except Exception as e:\n",
|
| 1780 |
+
" print(f\"Generation error: {e}\")\n",
|
| 1781 |
+
" return None\n",
|
| 1782 |
+
"\n",
|
| 1783 |
+
"# === PARSE RESPONSE (DEEPSEEK STYLE) ===\n",
|
| 1784 |
+
"def parse_response(response):\n",
|
| 1785 |
+
" \"\"\"Parse Mistral response into structured fields.\"\"\"\n",
|
| 1786 |
+
" if not response:\n",
|
| 1787 |
+
" return {\n",
|
| 1788 |
+
" 'full_name': 'Unknown',\n",
|
| 1789 |
+
" 'aliases': 'Unknown',\n",
|
| 1790 |
+
" 'gender': 'Unknown',\n",
|
| 1791 |
+
" 'profession_llm': 'Unknown',\n",
|
| 1792 |
+
" 'country': 'Unknown'\n",
|
| 1793 |
+
" }\n",
|
| 1794 |
+
" \n",
|
| 1795 |
+
" # Split into lines and clean\n",
|
| 1796 |
+
" lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
|
| 1797 |
+
" \n",
|
| 1798 |
+
" # Initialize with Unknown values\n",
|
| 1799 |
+
" fields = {\n",
|
| 1800 |
+
" 'full_name': 'Unknown',\n",
|
| 1801 |
+
" 'aliases': 'Unknown',\n",
|
| 1802 |
+
" 'gender': 'Unknown',\n",
|
| 1803 |
+
" 'profession_llm': 'Unknown',\n",
|
| 1804 |
+
" 'country': 'Unknown'\n",
|
| 1805 |
+
" }\n",
|
| 1806 |
+
" \n",
|
| 1807 |
+
" # Extract information from each numbered line\n",
|
| 1808 |
+
" for line in lines:\n",
|
| 1809 |
+
" if line.startswith('1.'):\n",
|
| 1810 |
+
" fields['full_name'] = line[2:].strip()\n",
|
| 1811 |
+
" elif line.startswith('2.'):\n",
|
| 1812 |
+
" fields['aliases'] = line[2:].strip()\n",
|
| 1813 |
+
" elif line.startswith('3.'):\n",
|
| 1814 |
+
" fields['gender'] = line[2:].strip()\n",
|
| 1815 |
+
" elif line.startswith('4.'):\n",
|
| 1816 |
+
" fields['profession_llm'] = line[2:].strip()\n",
|
| 1817 |
+
" elif line.startswith('5.'):\n",
|
| 1818 |
+
" fields['country'] = line[2:].strip()\n",
|
| 1819 |
+
" \n",
|
| 1820 |
+
" return fields\n",
|
| 1821 |
+
"\n",
|
| 1822 |
+
"# === PROCESS DATA ===\n",
|
| 1823 |
+
"output_file = current_dir.parent / f\"data/CSV/mistral24_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
|
| 1824 |
+
"index_file = current_dir.parent / \"misc/query_indicies/mistral24_local_query_index.txt\"\n",
|
| 1825 |
+
"\n",
|
| 1826 |
+
"index_file.parent.mkdir(parents=True, exist_ok=True)\n",
|
| 1827 |
+
"\n",
|
| 1828 |
+
"# Load index\n",
|
| 1829 |
+
"current_index = 0\n",
|
| 1830 |
+
"if index_file.exists():\n",
|
| 1831 |
+
" try:\n",
|
| 1832 |
+
" current_index = int(index_file.read_text().strip())\n",
|
| 1833 |
+
" except (ValueError, OSError):\n",
|
| 1834 |
+
" current_index = 0\n",
|
| 1835 |
+
"\n",
|
| 1836 |
+
"print(f\"Resuming from index {current_index}\")\n",
|
| 1837 |
+
"\n",
|
| 1838 |
+
"start_time = time.time()\n",
|
| 1839 |
+
"\n",
|
| 1840 |
+
"for i in tqdm(range(current_index, len(df)), desc=\"Mistral Local\"):\n",
|
| 1841 |
+
"\n",
|
| 1842 |
+
" prompt = df.at[i, \"prompt\"]\n",
|
| 1843 |
+
"\n",
|
| 1844 |
+
" # -------- MODEL QUERY WITH RETRIES --------\n",
|
| 1845 |
+
" response = None\n",
|
| 1846 |
+
" for attempt in range(3):\n",
|
| 1847 |
+
" response = query_mistral_local(prompt)\n",
|
| 1848 |
+
" \n",
|
| 1849 |
+
" # Valid response?\n",
|
| 1850 |
+
" if response and len(response.strip()) > 10:\n",
|
| 1851 |
+
" break\n",
|
| 1852 |
+
" \n",
|
| 1853 |
+
" print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
|
| 1854 |
+
" time.sleep(0.5)\n",
|
| 1855 |
+
"\n",
|
| 1856 |
+
" # If still invalid → DO NOT overwrite previous data\n",
|
| 1857 |
+
" if not response or len(response.strip()) <= 10:\n",
|
| 1858 |
+
" print(f\"❌ Row {i}: failed after retries, not writing, not advancing index\")\n",
|
| 1859 |
+
" continue\n",
|
| 1860 |
+
"\n",
|
| 1861 |
+
" parsed = parse_response(response)\n",
|
| 1862 |
+
"\n",
|
| 1863 |
+
" # Additional safety: skip rows that parsed as all 'Unknown'\n",
|
| 1864 |
+
" if all(v == \"Unknown\" for v in parsed.values()):\n",
|
| 1865 |
+
" print(f\"❌ Row {i}: parsed as all Unknown (likely model crash); skipping.\")\n",
|
| 1866 |
+
" continue\n",
|
| 1867 |
+
"\n",
|
| 1868 |
+
" # -------- WRITE PARSED FIELDS SAFELY --------\n",
|
| 1869 |
+
" for key, value in parsed.items():\n",
|
| 1870 |
+
" df.at[i, key] = value\n",
|
| 1871 |
+
"\n",
|
| 1872 |
+
" # Advance progress ONLY after successful write\n",
|
| 1873 |
+
" current_index = i + 1\n",
|
| 1874 |
+
"\n",
|
| 1875 |
+
" # -------- GPU MEMORY CLEANUP --------\n",
|
| 1876 |
+
" if torch.cuda.is_available():\n",
|
| 1877 |
+
" torch.cuda.empty_cache()\n",
|
| 1878 |
+
" torch.cuda.synchronize()\n",
|
| 1879 |
+
"\n",
|
| 1880 |
+
" # -------- PERIODIC SAVE (SAME AS DEEPSEEK VERSION) --------\n",
|
| 1881 |
+
" if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
|
| 1882 |
+
" df.to_csv(output_file, index=False)\n",
|
| 1883 |
+
" with open(index_file, \"w\") as f:\n",
|
| 1884 |
+
" f.write(str(current_index))\n",
|
| 1885 |
+
" print(f\"💾 Progress saved after row {i+1}\")\n",
|
| 1886 |
+
"\n",
|
| 1887 |
+
"# Final save\n",
|
| 1888 |
+
"df.to_csv(output_file, index=False)\n",
|
| 1889 |
+
"index_file.write_text(str(current_index))\n",
|
| 1890 |
+
"print(\"✅ Finished full dataset.\")\n"
|
| 1891 |
+
]
|
| 1892 |
+
},
|
| 1893 |
+
{
|
| 1894 |
+
"cell_type": "code",
|
| 1895 |
+
"execution_count": null,
|
| 1896 |
+
"id": "d7212e75-0ff6-45a0-8695-c4a3d3e02818",
|
| 1897 |
+
"metadata": {},
|
| 1898 |
+
"outputs": [],
|
| 1899 |
+
"source": [
|
| 1900 |
+
"import transformers\n",
|
| 1901 |
+
"print(f\"Transformers version: {transformers.__version__}\")\n",
|
| 1902 |
+
"\n",
|
| 1903 |
+
"# Check if Mistral3 is available\n",
|
| 1904 |
+
"try:\n",
|
| 1905 |
+
" from transformers import Mistral3ForCausalLM\n",
|
| 1906 |
+
" print(\"✅ Mistral3 is available\")\n",
|
| 1907 |
+
"except ImportError:\n",
|
| 1908 |
+
" print(\"❌ Mistral3 not available in this transformers version\")"
|
| 1909 |
+
]
|
| 1910 |
+
},
|
| 1911 |
+
{
|
| 1912 |
+
"cell_type": "code",
|
| 1913 |
+
"execution_count": null,
|
| 1914 |
+
"id": "a6ab032e-246e-4c4e-9776-ff0bfbf6fd9c",
|
| 1915 |
+
"metadata": {},
|
| 1916 |
+
"outputs": [],
|
| 1917 |
+
"source": []
|
| 1918 |
+
}
|
| 1919 |
+
],
|
| 1920 |
+
"metadata": {
|
| 1921 |
+
"kernelspec": {
|
| 1922 |
+
"display_name": "pm-paper",
|
| 1923 |
+
"language": "python",
|
| 1924 |
+
"name": "pm-paper"
|
| 1925 |
+
},
|
| 1926 |
+
"language_info": {
|
| 1927 |
+
"codemirror_mode": {
|
| 1928 |
+
"name": "ipython",
|
| 1929 |
+
"version": 3
|
| 1930 |
+
},
|
| 1931 |
+
"file_extension": ".py",
|
| 1932 |
+
"mimetype": "text/x-python",
|
| 1933 |
+
"name": "python",
|
| 1934 |
+
"nbconvert_exporter": "python",
|
| 1935 |
+
"pygments_lexer": "ipython3",
|
| 1936 |
+
"version": "3.11.13"
|
| 1937 |
+
}
|
| 1938 |
+
},
|
| 1939 |
+
"nbformat": 4,
|
| 1940 |
+
"nbformat_minor": 5
|
| 1941 |
+
}
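Note on the resume logic in the annotation loops above: a row that fails all retries hits `continue`, but the next successful row sets `current_index = i + 1`, so the failed row is skipped when the run resumes. The following is a minimal sketch of one way to keep failed rows visible, not the notebook's own method. The names `load_index`, `run_resumable`, `process_row` and `failed_file` are hypothetical, and `process_row` stands in for the query, parse and write steps.

```python
from pathlib import Path
import tempfile


def load_index(index_file: Path) -> int:
    """Return the saved resume position, or 0 if the file is missing or unreadable."""
    try:
        return int(index_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0


def run_resumable(n_rows, process_row, index_file: Path, failed_file: Path):
    """Process rows from the saved index up to n_rows.

    process_row(i) is a hypothetical callback that returns True on success.
    Failed rows are appended to failed_file immediately, so they can be
    retried later even though the resume index moves past them.
    """
    failed = []
    for i in range(load_index(index_file), n_rows):
        if not process_row(i):
            failed.append(i)
            with failed_file.open("a") as f:
                f.write(f"{i}\n")
        # Persist progress after every row, whether it succeeded or failed.
        index_file.write_text(str(i + 1))
    return failed


# Demo with a stand-in for the model call: row 2 "fails", all others succeed.
with tempfile.TemporaryDirectory() as tmp:
    index_file = Path(tmp) / "index.txt"
    failed_file = Path(tmp) / "failed.txt"
    failed = run_resumable(5, lambda i: i != 2, index_file, failed_file)
    saved_index = load_index(index_file)
    logged = failed_file.read_text().split()

print(failed, saved_index, logged)
```

Writing the index after every row is cheap compared with one model call. Saving the full CSV can stay on the existing `SAVE_INTERVAL` schedule, with the caveat that the index should then only be written together with the CSV so the two never disagree.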
|
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_Step_2_response_comparison_and_consensus_extraction-checkpoint.ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_deepfake_adapters-Copy1-checkpoint.ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_Figure_8_deepfake_adapters-checkpoint.ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4__Figure_8_Deepfake_victims-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,668 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "06763fde",
|
| 6 |
+
"metadata": {},
|
| 7 |
+
"source": [
|
| 8 |
+
"# LLM annotation of Deepfake adapters"
|
| 9 |
+
]
|
| 10 |
+
},
|
| 11 |
+
{
|
| 12 |
+
"cell_type": "markdown",
|
| 13 |
+
"id": "b773045a",
|
| 14 |
+
"metadata": {},
|
| 15 |
+
"source": [
|
| 16 |
+
"## Step 1: Data cleaning and NER using spaCy"
|
| 17 |
+
]
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"cell_type": "markdown",
|
| 21 |
+
"id": "234a55e5",
|
| 22 |
+
"metadata": {},
|
| 23 |
+
"source": [
|
| 24 |
+
"#### Here we clean leetspeak and architecture specifiers from the names as a preprocessing step for the named entity recognition (NER) below"
|
| 25 |
+
]
|
| 26 |
+
},
|
| 27 |
+
{
|
| 28 |
+
"cell_type": "code",
|
| 29 |
+
"execution_count": 1,
|
| 30 |
+
"id": "f177df11",
|
| 31 |
+
"metadata": {},
|
| 32 |
+
"outputs": [],
|
| 33 |
+
"source": [
|
| 34 |
+
"import pandas as pd\n",
|
| 35 |
+
"import spacy\n",
|
| 36 |
+
"import re\n",
|
| 37 |
+
"import torch\n",
|
| 38 |
+
"from pathlib import Path\n",
|
| 39 |
+
"import unicodedata\n"
|
| 40 |
+
]
|
| 41 |
+
},
|
| 42 |
+
{
|
| 43 |
+
"cell_type": "code",
|
| 44 |
+
"execution_count": 2,
|
| 45 |
+
"id": "a2383f32",
|
| 46 |
+
"metadata": {},
|
| 47 |
+
"outputs": [],
|
| 48 |
+
"source": [
|
| 49 |
+
"current_dir = Path.cwd()\n",
|
| 50 |
+
"poi_models_dir = current_dir.parent / \"data/CSV/model_subsets/POI_models.csv\" ### POI models dataset\n",
|
| 51 |
+
"output = current_dir.parent / \"data/CSV/model_subsets/NER_poi_step_01.csv\" ### Output file"
|
| 52 |
+
]
|
| 53 |
+
},
|
| 54 |
+
{
|
| 55 |
+
"cell_type": "code",
|
| 56 |
+
"execution_count": 3,
|
| 57 |
+
"id": "66fc691f",
|
| 58 |
+
"metadata": {},
|
| 59 |
+
"outputs": [
|
| 60 |
+
{
|
| 61 |
+
"name": "stderr",
|
| 62 |
+
"output_type": "stream",
|
| 63 |
+
"text": [
|
| 64 |
+
"/home/lauhp/anaconda3/envs/latm/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
|
| 65 |
+
" from .autonotebook import tqdm as notebook_tqdm\n"
|
| 66 |
+
]
|
| 67 |
+
}
|
| 68 |
+
],
|
| 69 |
+
"source": [
|
| 70 |
+
"nlp = spacy.load(\"en_core_web_sm\") # or another model of your choice"
|
| 71 |
+
]
|
| 72 |
+
},
|
| 73 |
+
{
|
| 74 |
+
"cell_type": "code",
|
| 75 |
+
"execution_count": 4,
|
| 76 |
+
"id": "8348c4a4",
|
| 77 |
+
"metadata": {},
|
| 78 |
+
"outputs": [
|
| 79 |
+
{
|
| 80 |
+
"name": "stdout",
|
| 81 |
+
"output_type": "stream",
|
| 82 |
+
"text": [
|
| 83 |
+
"Done! Saved to NER_poi_step_01.csv\n"
|
| 84 |
+
]
|
| 85 |
+
}
|
| 86 |
+
],
|
| 87 |
+
"source": [
|
| 88 |
+
"def preprocess_name(name):\n",
|
| 89 |
+
" name = str(name)\n",
|
| 90 |
+
"\n",
|
| 91 |
+
" # Normalize unicode characters (e.g., fancy fonts)\n",
|
| 92 |
+
" name = unicodedata.normalize(\"NFKD\", name)\n",
|
| 93 |
+
"\n",
|
| 94 |
+
" # Lowercase everything\n",
|
| 95 |
+
" name = name.lower()\n",
|
| 96 |
+
"\n",
|
| 97 |
+
" # Remove special keywords and patterns\n",
|
| 98 |
+
" junk_words = [\n",
|
| 99 |
+
" 'jav', 'jp', 'lora', 'locon', 'lycoris', 'requested', 'japanese', 'model',\n",
|
| 100 |
+
" 'flux', 'flux1.d', 'pony', 'realistic'\n",
|
| 101 |
+
" ]\n",
|
| 102 |
+
" for word in junk_words:\n",
|
| 103 |
+
" name = re.sub(rf'\\b{re.escape(word)}\\b', '', name, flags=re.IGNORECASE)\n",
|
| 104 |
+
"\n",
|
| 105 |
+
" # Remove versions like v1, v2.0, etc.\n",
|
| 106 |
+
" name = re.sub(r'v\\.?\\d+(\\.\\d+)?', '', name)\n",
|
| 107 |
+
"\n",
|
| 108 |
+
" # Remove 'not' followed by a word\n",
|
| 109 |
+
" name = re.sub(r'\\bnot\\s+\\w+', '', name)\n",
|
| 110 |
+
"\n",
|
| 111 |
+
" # Replace underscores and pipes with spaces\n",
|
| 112 |
+
" name = re.sub(r'[_|]', ' ', name)\n",
|
| 113 |
+
"\n",
|
| 114 |
+
" # Remove parentheses and content within\n",
|
| 115 |
+
" name = re.sub(r'\\(.*?\\)', '', name)\n",
|
| 116 |
+
"\n",
|
| 117 |
+
" # Remove excess whitespace\n",
|
| 118 |
+
" name = re.sub(r'\\s+', ' ', name).strip()\n",
|
| 119 |
+
"\n",
|
| 120 |
+
" return name\n",
|
| 121 |
+
"\n",
|
| 122 |
+
"# -------------------------\n",
|
| 123 |
+
"# Fallback Extractor\n",
|
| 124 |
+
"# -------------------------\n",
|
| 125 |
+
"def fallback_extract(text):\n",
|
| 126 |
+
" words = text.split()\n",
|
| 127 |
+
" capitalized = [w for w in words if w and w[0].isalpha()]\n",
|
| 128 |
+
" if len(capitalized) >= 2:\n",
|
| 129 |
+
" return \" \".join(capitalized[:2])\n",
|
| 130 |
+
" elif capitalized:\n",
|
| 131 |
+
" return capitalized[0]\n",
|
| 132 |
+
" return None\n",
|
| 133 |
+
"\n",
|
| 134 |
+
"# -------------------------\n",
|
| 135 |
+
"# Full Extraction Logic\n",
|
| 136 |
+
"# -------------------------\n",
|
| 137 |
+
"def extract_real_name(raw_name):\n",
|
| 138 |
+
" cleaned = preprocess_name(raw_name)\n",
|
| 139 |
+
" doc = nlp(cleaned)\n",
|
| 140 |
+
" persons = [ent.text for ent in doc.ents if ent.label_ == \"PERSON\"]\n",
|
| 141 |
+
" if persons:\n",
|
| 142 |
+
" return persons[0]\n",
|
| 143 |
+
" return fallback_extract(cleaned)\n",
|
| 144 |
+
"\n",
|
| 145 |
+
"# -------------------------\n",
|
| 146 |
+
"# Load Data and Apply\n",
|
| 147 |
+
"# -------------------------\n",
|
| 148 |
+
"df = pd.read_csv(poi_models_dir) # Or your own path\n",
|
| 149 |
+
"\n",
|
| 150 |
+
"# Apply the full extractor\n",
|
| 151 |
+
"texts = df['name'].astype(str).tolist()\n",
|
| 152 |
+
"docs = list(nlp.pipe([preprocess_name(t) for t in texts], batch_size=32))\n",
|
| 153 |
+
"\n",
|
| 154 |
+
"def extract_from_doc(doc, raw_text):\n",
|
| 155 |
+
" persons = [ent.text for ent in doc.ents if ent.label_ == \"PERSON\"]\n",
|
| 156 |
+
" if persons:\n",
|
| 157 |
+
" return persons[0]\n",
|
| 158 |
+
" return fallback_extract(preprocess_name(raw_text))\n",
|
| 159 |
+
"\n",
|
| 160 |
+
"df['real_name'] = [extract_from_doc(doc, raw_text) for doc, raw_text in zip(docs, texts)]\n",
|
| 161 |
+
"\n",
|
| 162 |
+
"# Save the result\n",
|
| 163 |
+
"df.to_csv(output, index=False)\n",
|
| 164 |
+
"print(\"Done! Saved to NER_poi_step_01.csv\")\n"
|
| 165 |
+
]
|
| 166 |
+
},
|
| 167 |
+
{
|
| 168 |
+
"cell_type": "markdown",
|
| 169 |
+
"id": "a40b17fe",
|
| 170 |
+
"metadata": {},
|
| 171 |
+
"source": [
|
| 172 |
+
"## Step 2: Compare tags against country and profession lists"
|
| 173 |
+
]
|
| 174 |
+
},
|
| 175 |
+
{
|
| 176 |
+
"cell_type": "code",
|
| 177 |
+
"execution_count": 5,
|
| 178 |
+
"id": "414954ed",
|
| 179 |
+
"metadata": {},
|
| 180 |
+
"outputs": [
|
| 181 |
+
{
|
| 182 |
+
"name": "stdout",
|
| 183 |
+
"output_type": "stream",
|
| 184 |
+
"text": [
|
| 185 |
+
" real_name likely_country likely_nationality likely_profession\n",
|
| 186 |
+
"0 ronnie alonte model\n",
|
| 187 |
+
"1 zh elena kamperi twitch streamer\n",
|
| 188 |
+
"2 sofia vergara model\n",
|
| 189 |
+
"3 安然 anran celebrity\n",
|
| 190 |
+
"4 zoe kravitz celebrity\n"
|
| 191 |
+
]
|
| 192 |
+
}
|
| 193 |
+
],
|
| 194 |
+
"source": [
|
| 195 |
+
"import pandas as pd\n",
|
| 196 |
+
"from pathlib import Path\n",
|
| 197 |
+
"\n",
|
| 198 |
+
"# Set up paths\n",
|
| 199 |
+
"current_dir = Path.cwd()\n",
|
| 200 |
+
"countries = current_dir.parent / \"misc/lists/countries.csv\"\n",
|
| 201 |
+
"professions = current_dir.parent / \"misc/lists/professions.csv\"\n",
|
| 202 |
+
"inputNER = current_dir.parent / \"data/CSV/model_subsets/NER_poi_step_01.csv\"\n",
|
| 203 |
+
"outfile = current_dir.parent / \"data/CSV/model_subsets/NER_poi_step_02.csv\"\n",
|
| 204 |
+
"\n",
|
| 205 |
+
"# Load datasets\n",
|
| 206 |
+
"poi_df = pd.read_csv(inputNER)\n",
|
| 207 |
+
"countries_df = pd.read_csv(countries)\n",
|
| 208 |
+
"professions_df = pd.read_csv(professions)\n",
|
| 209 |
+
"\n",
|
| 210 |
+
"# Step 1: Combine tags into one lowercase list\n",
|
| 211 |
+
"def combine_tags(row):\n",
|
| 212 |
+
" return [str(row[f\"tag_{i}\"]).strip().lower() for i in range(1, 8) if pd.notna(row.get(f\"tag_{i}\"))]\n",
|
| 213 |
+
"\n",
|
| 214 |
+
"poi_df[\"tags\"] = poi_df.apply(combine_tags, axis=1)\n",
|
| 215 |
+
"\n",
|
| 216 |
+
"# Step 2: Build tag → (country, nationality) mapping\n",
|
| 217 |
+
"tag_to_country_nationality = {}\n",
|
| 218 |
+
"\n",
|
| 219 |
+
"for _, row in countries_df.iterrows():\n",
|
| 220 |
+
" country = str(row[\"en_short_name\"]).strip()\n",
|
| 221 |
+
" nationality = str(row[\"nationality\"]).strip()\n",
|
| 222 |
+
"\n",
|
| 223 |
+
" country_lc = country.lower()\n",
|
| 224 |
+
" nationality_lc = nationality.lower()\n",
|
| 225 |
+
"\n",
|
| 226 |
+
" # Add variations to the mapping\n",
|
| 227 |
+
" tag_to_country_nationality[country_lc] = (country, \"\")\n",
|
| 228 |
+
" tag_to_country_nationality[nationality_lc] = (\"\", nationality)\n",
|
| 229 |
+
" tag_to_country_nationality[country_lc.replace(\" \", \"\")] = (country, \"\")\n",
|
| 230 |
+
" tag_to_country_nationality[nationality_lc.replace(\" \", \"\")] = (\"\", nationality)\n",
|
| 231 |
+
"\n",
|
| 232 |
+
" for part in country_lc.split():\n",
|
| 233 |
+
" tag_to_country_nationality[part] = (country, \"\")\n",
|
| 234 |
+
" for part in nationality_lc.split():\n",
|
| 235 |
+
" tag_to_country_nationality[part] = (\"\", nationality)\n",
|
| 236 |
+
"\n",
|
| 237 |
+
"# Step 3: Infer likely_country and likely_nationality\n",
|
| 239 |
+
"def infer_country_and_nationality(tags):\n",
|
| 240 |
+
" for tag in tags:\n",
|
| 241 |
+
" cleaned = tag.replace(\" \", \"\").lower()\n",
|
| 242 |
+
" if cleaned in tag_to_country_nationality:\n",
|
| 243 |
+
" country, nationality = tag_to_country_nationality[cleaned]\n",
|
| 244 |
+
" # Special case: skip if country is \"Isle of Man\"\n",
|
| 245 |
+
" if country == \"Isle of Man\":\n",
|
| 246 |
+
" country = \"\"\n",
|
| 247 |
+
" return pd.Series([country, nationality])\n",
|
| 248 |
+
" return pd.Series([\"\", \"\"])\n",
|
| 249 |
+
"\n",
|
| 250 |
+
"\n",
|
| 251 |
+
"poi_df[[\"likely_country\", \"likely_nationality\"]] = poi_df[\"tags\"].apply(infer_country_and_nationality)\n",
|
| 252 |
+
"\n",
|
| 253 |
+
"# Step 4: Build tag → profession mapping\n",
|
| 254 |
+
"profession_alias_map = {}\n",
|
| 255 |
+
"\n",
|
| 256 |
+
"for _, row in professions_df.iterrows():\n",
|
| 257 |
+
" canonical = str(row['profession']).strip().lower()\n",
|
| 258 |
+
" profession_alias_map[canonical] = canonical\n",
|
| 259 |
+
" for alias_col in ['alias_1', 'alias_2', 'alias_3']:\n",
|
| 260 |
+
" alias = row.get(alias_col)\n",
|
| 261 |
+
" if pd.notna(alias):\n",
|
| 262 |
+
" profession_alias_map[str(alias).strip().lower()] = canonical\n",
|
| 263 |
+
"\n",
|
| 264 |
+
"# Step 5: Infer likely profession from tags\n",
|
| 265 |
+
"def infer_profession_from_tags(tags):\n",
|
| 266 |
+
" matched = []\n",
|
| 267 |
+
" for tag in tags:\n",
|
| 268 |
+
" cleaned = tag.strip().lower()\n",
|
| 269 |
+
" if cleaned in profession_alias_map:\n",
|
| 270 |
+
" matched.append(profession_alias_map[cleaned])\n",
|
| 271 |
+
"\n",
|
| 272 |
+
" if not matched:\n",
|
| 273 |
+
" return \"\"\n",
|
| 274 |
+
" if \"celebrity\" in matched and len(set(matched)) > 1:\n",
|
| 275 |
+
" # Drop 'celebrity' if other professions are present\n",
|
| 276 |
+
" matched = [m for m in matched if m != \"celebrity\"]\n",
|
| 277 |
+
"\n",
|
| 278 |
+
" return matched[0] # Return the first specific match\n",
|
| 279 |
+
"\n",
|
| 280 |
+
"\n",
|
| 281 |
+
"poi_df[\"likely_profession\"] = poi_df[\"tags\"].apply(infer_profession_from_tags)\n",
|
| 282 |
+
"\n",
|
| 283 |
+
"# Step 6: Save enriched dataset\n",
|
| 284 |
+
"poi_df.to_csv(outfile, index=False)\n",
|
| 285 |
+
"\n",
|
| 286 |
+
"# Optional: Preview\n",
|
| 287 |
+
"print(poi_df[[\"real_name\", \"likely_country\", \"likely_nationality\", \"likely_profession\"]].head())\n"
|
| 288 |
+
]
|
| 289 |
+
},
|
| 290 |
+
{
|
| 291 |
+
"cell_type": "code",
|
| 292 |
+
"execution_count": 6,
|
| 293 |
+
"id": "054f230b",
|
| 294 |
+
"metadata": {},
|
| 295 |
+
"outputs": [],
|
| 296 |
+
"source": [
|
| 297 |
+
"#!pip install transformers torch\n",
|
| 298 |
+
"#!python -m spacy download en_core_web_trf\n",
|
| 299 |
+
"#!pip install openai"
|
| 300 |
+
]
|
| 301 |
+
},
|
| 302 |
+
{
|
| 303 |
+
"cell_type": "markdown",
|
| 304 |
+
"id": "59185461",
|
| 305 |
+
"metadata": {},
|
| 306 |
+
"source": [
|
| 307 |
+
"## Step 3: Query DeepSeek-V3 with NAME and HINTS"
|
| 308 |
+
]
|
| 309 |
+
},
|
| 310 |
+
{
|
| 311 |
+
"cell_type": "code",
|
| 312 |
+
"execution_count": 7,
|
| 313 |
+
"id": "504b970f",
|
| 314 |
+
"metadata": {},
|
| 315 |
+
"outputs": [],
|
| 316 |
+
"source": [
|
| 317 |
+
"#!pip install openpyxl"
|
| 318 |
+
]
|
| 319 |
+
},
|
| 320 |
+
{
|
| 321 |
+
"cell_type": "code",
|
| 322 |
+
"execution_count": 10,
|
| 323 |
+
"id": "7c209115",
|
| 324 |
+
"metadata": {},
|
| 325 |
+
"outputs": [
|
| 326 |
+
{
|
| 327 |
+
"name": "stdout",
|
| 328 |
+
"output_type": "stream",
|
| 329 |
+
"text": [
|
| 330 |
+
"Row 1/4...\n",
|
| 331 |
+
"Saved up to row 1\n",
|
| 332 |
+
"Row 2/4...\n",
|
| 333 |
+
"Saved up to row 2\n",
|
| 334 |
+
"Row 3/4...\n",
|
| 335 |
+
"Saved up to row 3\n",
|
| 336 |
+
"Row 4/4...\n",
|
| 337 |
+
"Saved up to row 4\n",
|
| 338 |
+
"All done! Files: /home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/data/CSV/Deepseek_annotated_POI.csv /home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/data/CSV/Deepseek_annotated_POI.xlsx\n"
|
| 339 |
+
]
|
| 340 |
+
}
|
| 341 |
+
],
|
| 342 |
+
"source": [
|
| 343 |
+
"import pandas as pd\n",
|
| 344 |
+
"import openai\n",
|
| 345 |
+
"import time\n",
|
| 346 |
+
"import os\n",
|
| 347 |
+
"from pathlib import Path\n",
|
| 348 |
+
"from openai import OpenAI # OpenAI-compatible client, pointed at the DeepSeek endpoint below\n",
|
| 349 |
+
"\n",
|
| 350 |
+
"# === PATHS & CONFIG ===\n",
|
| 351 |
+
"current_dir = Path.cwd()\n",
|
| 352 |
+
"inputCSV = current_dir.parent / \"data/CSV/model_subsets/NER_poi_step_02.csv\"\n",
|
| 353 |
+
"api_key_file = current_dir.parent / \"misc/credentials/deepseek_api_key.txt\" #store your API key under misc/credentials/deepseek_api_key.txt\n",
|
| 354 |
+
"\n",
|
| 355 |
+
"# Output both CSV and Excel for compatibility\n",
|
| 356 |
+
"OUTPUT_CSV = current_dir.parent / \"data/CSV/Deepseek_annotated_POI.csv\"\n",
|
| 357 |
+
"OUTPUT_XLSX = current_dir.parent / \"data/CSV/Deepseek_annotated_POI.xlsx\"\n",
|
| 358 |
+
"INDEX_FILE = current_dir.parent / \"misc/deepseek_query_index.txt\"\n",
|
| 359 |
+
"SAVE_INTERVAL = 1 # Save every N rows\n",
|
| 360 |
+
"START_ROW = 1 # Row index to start from (0-based)\n",
|
| 361 |
+
"END_ROW = 5 # Row index to end (exclusive)\n",
|
| 362 |
+
"\n",
|
| 363 |
+
"# === LOAD API KEY & CLIENT ===\n",
|
| 364 |
+
"with open(api_key_file) as f:\n",
|
| 365 |
+
" api_key = f.read().strip()\n",
|
| 366 |
+
"\n",
|
| 367 |
+
"client = OpenAI(\n",
|
| 368 |
+
" api_key=api_key,\n",
|
| 369 |
+
" base_url=\"https://api.deepseek.com/v1\"\n",
|
| 370 |
+
")\n",
|
| 371 |
+
"\n",
|
| 372 |
+
"# === LOAD DATA ===\n",
|
| 373 |
+
"df = pd.read_csv(inputCSV)\n",
|
| 374 |
+
"df = df.iloc[START_ROW:END_ROW].reset_index(drop=True)\n",
|
| 375 |
+
"\n",
|
| 376 |
+
"# === PREPARE PROMPTS ===\n",
|
| 377 |
+
"def create_prompt(row):\n",
|
| 378 |
+
" name = row['real_name'] if pd.notna(row['real_name']) else row['name']\n",
|
| 379 |
+
" hints = []\n",
|
| 380 |
+
" for col in ('likely_profession','likely_nationality','likely_country'):\n",
|
| 381 |
+
" if pd.notna(row.get(col, None)):\n",
|
| 382 |
+
" hints.append(row[col])\n",
|
| 383 |
+
" if not hints:\n",
|
| 384 |
+
" tags = [row[f'tag_{i}'] for i in range(1,8) if pd.notna(row.get(f'tag_{i}', None))]\n",
|
| 385 |
+
" hints.extend(tags[:3])\n",
|
| 386 |
+
" hint_text = \", \".join(hints[:5])\n",
|
| 387 |
+
" return (\n",
|
| 388 |
+
" f\"Given '{name}' ({hint_text}), provide:\\n\"\n",
|
| 389 |
+
" \"1. Full legal name (Western order if non-latin script)\\n\"\n",
|
| 390 |
+
" \"2. Any stage names/aliases (comma separated)\\n\"\n",
|
| 391 |
+
" \"3. Gender\\n\"\n",
|
| 392 |
+
" \"4. Top 3 most specific, factual professions (use industry-standard terms, no euphemisms)\\n\"\n",
|
| 393 |
+
" \"5. Primary country associated\\n\"\n",
|
| 394 |
+
" \"Use 'Unknown' when uncertain or you encounter a fictional character or place. \"\n",
|
| 395 |
+
" \"For entertainment fields, specify sub-genres when known (kpop, adult industry, etc.).\"\n",
|
| 396 |
+
" )\n",
|
| 397 |
+
"\n",
|
| 398 |
+
"prompts = df.apply(create_prompt, axis=1).tolist()\n",
|
| 399 |
+
"\n",
|
| 400 |
+
"# === CHECK FOR EXISTING OUTPUT ===\n",
|
| 401 |
+
"if os.path.exists(INDEX_FILE):\n",
|
| 402 |
+
" with open(INDEX_FILE, 'r') as f:\n",
|
| 403 |
+
" current_index = int(f.read().strip())\n",
|
| 404 |
+
"else:\n",
|
| 405 |
+
" current_index = 0\n",
|
| 406 |
+
"\n",
|
| 407 |
+
"results = []\n",
|
| 408 |
+
"if os.path.exists(OUTPUT_XLSX):\n",
|
| 409 |
+
" existing = pd.read_excel(OUTPUT_XLSX)\n",
|
| 410 |
+
" if {'full_name','aliases','gender','profession_llm','country'}.issubset(existing.columns):\n",
|
| 411 |
+
" results = existing[['full_name','aliases','gender','profession_llm','country']].values.tolist()\n",
|
| 412 |
+
"\n",
|
| 413 |
+
"# === QUERY & PARSE ===\n",
|
| 414 |
+
"def query_deepseek(prompt):\n",
|
| 415 |
+
" try:\n",
|
| 416 |
+
" resp = client.chat.completions.create(\n",
|
| 417 |
+
" model=\"deepseek-chat\",\n",
|
| 418 |
+
" messages=[\n",
|
| 419 |
+
" {\"role\":\"system\",\"content\":\"You extract key data on a person; respond with exactly 5 numbered lines.\"},\n",
|
| 420 |
+
" {\"role\":\"user\",\"content\":prompt}\n",
|
| 421 |
+
" ],\n",
|
| 422 |
+
" temperature=0.05, top_p=0.8\n",
|
| 423 |
+
" )\n",
|
| 424 |
+
" return resp.choices[0].message.content.strip()\n",
|
| 425 |
+
" except Exception as e:\n",
|
| 426 |
+
" print(\"API error:\", e)\n",
|
| 427 |
+
" return \"\"\n",
|
| 428 |
+
"\n",
|
| 429 |
+
"def parse_response(resp):\n",
|
| 430 |
+
" lines = [l.strip() for l in resp.split('\\n') if l.strip()]\n",
|
| 431 |
+
" out = [\"Unknown\"]*5\n",
|
| 432 |
+
" for l in lines:\n",
|
| 433 |
+
" if l.startswith('1.'): out[0] = l[2:].strip()\n",
|
| 434 |
+
" elif l.startswith('2.'): out[1] = l[2:].strip()\n",
|
| 435 |
+
" elif l.startswith('3.'): out[2] = l[2:].strip()\n",
|
| 436 |
+
" elif l.startswith('4.'): out[3] = l[2:].strip()\n",
|
| 437 |
+
" elif l.startswith('5.'): out[4] = l[2:].strip()\n",
|
| 438 |
+
" return out\n",
|
| 439 |
+
"\n",
|
| 440 |
+
"# === PROCESS & SAVE ===\n",
|
| 441 |
+
"for i in range(current_index, len(df)):\n",
|
| 442 |
+
" print(f\"Row {i+1}/{len(df)}...\")\n",
|
| 443 |
+
" data = parse_response(query_deepseek(prompts[i]))\n",
|
| 444 |
+
" if i < len(results): results[i] = data\n",
|
| 445 |
+
" else: results.append(data)\n",
|
| 446 |
+
" current_index = i+1\n",
|
| 447 |
+
"\n",
|
| 448 |
+
" if current_index % SAVE_INTERVAL == 0 or current_index == len(df):\n",
|
| 449 |
+
" out_df = df.iloc[:current_index].copy()\n",
|
| 450 |
+
" out_df[['full_name','aliases','gender','profession_llm','country']] = pd.DataFrame(results[:current_index])\n",
|
| 451 |
+
" # CSV\n",
|
| 452 |
+
" out_df.to_csv(OUTPUT_CSV, index=False)\n",
|
| 453 |
+
" # Excel\n",
|
| 454 |
+
" out_df.to_excel(OUTPUT_XLSX, index=False)\n",
|
| 455 |
+
" with open(INDEX_FILE, 'w') as f:\n",
|
| 456 |
+
" f.write(str(current_index))\n",
|
| 457 |
+
" print(\"Saved up to row\", current_index)\n",
|
| 458 |
+
" time.sleep(1)\n",
|
| 459 |
+
"\n",
|
| 460 |
+
"print(\"All done! Files:\", OUTPUT_CSV, OUTPUT_XLSX)\n"
|
| 461 |
+
]
|
| 462 |
+
},
|
| 463 |
+
{
|
| 464 |
+
"cell_type": "markdown",
|
| 465 |
+
"id": "7910e574",
|
| 466 |
+
"metadata": {},
|
| 467 |
+
"source": []
|
| 468 |
+
},
|
| 469 |
+
{
|
| 470 |
+
"cell_type": "markdown",
|
| 471 |
+
"id": "9d377005",
|
| 472 |
+
"metadata": {},
|
| 473 |
+
"source": [
|
| 474 |
+
"# Aggregate by individual names"
|
| 475 |
+
]
|
| 476 |
+
},
|
| 477 |
+
{
|
| 478 |
+
"cell_type": "markdown",
|
| 479 |
+
"id": "d2e75354",
|
| 480 |
+
"metadata": {},
|
| 481 |
+
"source": [
|
| 482 |
+
" ##### e.g. Emma Watson [model1, model2, model3] etc."
|
| 483 |
+
]
|
| 484 |
+
},
|
| 485 |
+
{
|
| 486 |
+
"cell_type": "code",
|
| 487 |
+
"execution_count": 11,
|
| 488 |
+
"id": "747c3a2f",
|
| 489 |
+
"metadata": {},
|
| 490 |
+
"outputs": [],
|
| 491 |
+
"source": [
|
| 492 |
+
"import pandas as pd\n",
|
| 493 |
+
"import re\n",
|
| 494 |
+
"from pathlib import Path\n",
|
| 495 |
+
"current_dir = Path.cwd()\n",
|
| 496 |
+
"\n",
|
| 497 |
+
"profession_map = current_dir.parent / \"misc/lists/mapped_professions.csv\"\n",
|
| 498 |
+
"\n",
|
| 499 |
+
"poi_df = current_dir.parent / \"data/CSV/Deepseek_annotated_POI.csv\"\n",
|
| 500 |
+
"\n",
|
| 501 |
+
"output = current_dir.parent / \"data/CSV/Deepseek_annotated_POI_aggregated.csv\"\n",
|
| 502 |
+
"\n",
|
| 503 |
+
"countries_csv = current_dir.parent / \"misc/lists/countries.csv\"\n",
|
| 504 |
+
"countries_df = pd.read_csv(countries_csv)\n",
|
| 505 |
+
"\n",
|
| 506 |
+
"# Extract valid country names (strip whitespace)\n",
|
| 507 |
+
"valid_countries = set(countries_df['en_short_name'].str.strip())\n",
|
| 508 |
+
"\n",
|
| 509 |
+
"\n",
|
| 510 |
+
"# Load the dataset\n",
|
| 511 |
+
"df = pd.read_csv(poi_df) # Update path if needed\n",
|
| 512 |
+
"\n",
|
| 513 |
+
"# Step 1: Group by 'full_name' and aggregate required information\n",
|
| 514 |
+
"grouped_df = df.groupby('full_name').agg(\n",
|
| 515 |
+
" No_of_models=('id', 'count'),\n",
|
| 516 |
+
" modelIDs=('id', lambda x: list(x)),\n",
|
| 517 |
+
" combinedDownloadCount=('downloadCount', 'sum')\n",
|
| 518 |
+
").reset_index()\n",
|
| 519 |
+
"\n",
|
| 520 |
+
"# Step 2: Keep representative info for each person\n",
|
| 521 |
+
"# Keep representative info (including aliases)\n",
|
| 522 |
+
"additional_columns = df.groupby('full_name').agg(\n",
|
| 523 |
+
" country=('country', 'first'),\n",
|
| 524 |
+
" profession_llm=('profession_llm', 'first'),\n",
|
| 525 |
+
" gender=('gender', 'first'),\n",
|
| 526 |
+
" aliases=('aliases', 'first')\n",
|
| 527 |
+
").reset_index()\n",
|
| 528 |
+
"\n",
|
| 529 |
+
"\n",
|
| 530 |
+
"\n",
|
| 531 |
+
"def standardize_country(country):\n",
|
| 532 |
+
" if not isinstance(country, str):\n",
|
| 533 |
+
" return \"Unknown\"\n",
|
| 534 |
+
"\n",
|
| 535 |
+
" country_clean = country.strip()\n",
|
| 536 |
+
" lowered = country_clean.lower()\n",
|
| 537 |
+
"\n",
|
| 538 |
+
" # Handle fictional or fantasy countries\n",
|
| 539 |
+
" fictional_keywords = [\"fictional\", \"westeros\", \"asgard\", \"middle-earth\", \"naboo\", \"middle earth\", \"latveria\"]\n",
|
| 540 |
+
" if any(keyword in lowered for keyword in fictional_keywords):\n",
|
| 541 |
+
" return \"Unknown\"\n",
|
| 542 |
+
"\n",
|
| 543 |
+
" # Handle known region-based adjustments\n",
|
| 544 |
+
" if \"macau\" in lowered:\n",
|
| 545 |
+
" return \"Macau\"\n",
|
| 546 |
+
" elif \"hong kong\" in lowered:\n",
|
| 547 |
+
" return \"Hong Kong\"\n",
|
| 548 |
+
" elif \"taiwan\" in lowered:\n",
|
| 549 |
+
" return \"Taiwan\"\n",
|
| 550 |
+
"\n",
|
| 551 |
+
" # Normalize complex or alternate country names\n",
|
| 552 |
+
" lowered = lowered.replace(\"united kingdom of great britain and northern ireland\", \"united kingdom\")\n",
|
| 553 |
+
" lowered = lowered.replace(\"england\", \"united kingdom\")\n",
|
| 554 |
+
" lowered = lowered.replace(\"united states of america\", \"united states\")\n",
|
| 555 |
+
"\n",
|
| 556 |
+
" # Remove anything in brackets and after commas (working on the normalized, lowercased string)\n",
|
| 557 |
+
" country_clean = re.sub(r\"\\(.*?\\)\", \"\", lowered)\n",
|
| 558 |
+
" country_clean = country_clean.split(',')[0].strip().lower()\n",
|
| 559 |
+
"\n",
|
| 560 |
+
" # Manual overrides\n",
|
| 561 |
+
" replacements = {\n",
|
| 562 |
+
" \"united kingdom\": \"UK\",\n",
|
| 563 |
+
" \"united kingdom of great britain and northern ireland\": \"UK\",\n",
|
| 564 |
+
" \"french southern territories\": \"Other\",\n",
|
| 565 |
+
" \"united states\": \"US\",\n",
|
| 566 |
+
" \"united states of america\": \"US\",\n",
|
| 567 |
+
" \"turkey\": \"Türkiye\",\n",
|
| 568 |
+
" \"czech republic\": \"Czechia\"\n",
|
| 569 |
+
" }\n",
|
| 570 |
+
"\n",
|
| 571 |
+
" if country_clean in replacements:\n",
|
| 572 |
+
" return replacements[country_clean]\n",
|
| 573 |
+
"\n",
|
| 574 |
+
" # Final check against valid country list (case-insensitive)\n",
|
| 575 |
+
" for valid in valid_countries:\n",
|
| 576 |
+
" if country_clean == valid.lower():\n",
|
| 577 |
+
" return valid\n",
|
| 578 |
+
"\n",
|
| 579 |
+
" return \"Unknown\"\n",
|
| 580 |
+
"\n",
|
| 581 |
+
"\n",
|
| 582 |
+
"# Updated function to fully remove anything in brackets (complete or not)\n",
|
| 583 |
+
"def get_profession_short(profession):\n",
|
| 584 |
+
" if isinstance(profession, str):\n",
|
| 585 |
+
" # Get first part before comma\n",
|
| 586 |
+
" first_prof = profession.split(',')[0].strip()\n",
|
| 587 |
+
" # Remove all bracketed content, even malformed\n",
|
| 588 |
+
" first_prof = re.sub(r\"[\\(\\[].*?[\\)\\]]\", \"\", first_prof) # removes properly closed\n",
|
| 589 |
+
" first_prof = re.sub(r\"[\\(\\[].*\", \"\", first_prof) # removes malformed\n",
|
| 590 |
+
" cleaned = first_prof.strip()\n",
|
| 591 |
+
" # Normalize 'Actress' to 'Actor'\n",
|
| 592 |
+
" if cleaned.lower() == \"actress\":\n",
|
| 593 |
+
" return \"Actor\"\n",
|
| 594 |
+
" return cleaned\n",
|
| 595 |
+
" return None\n",
|
| 596 |
+
"\n",
|
| 597 |
+
"# Load your mapping file\n",
|
| 598 |
+
"mapping_df = pd.read_csv(profession_map, on_bad_lines='skip') # or 'warn'\n",
|
| 599 |
+
"\n",
|
| 600 |
+
"\n",
|
| 601 |
+
"# Ensure the mapping columns are named correctly\n",
|
| 602 |
+
"# (Assuming columns are: 'profession_llm' or 'profession_short', and 'category' or 'mapped_profession')\n",
|
| 603 |
+
"# Adjust these as needed\n",
|
| 604 |
+
"mapping_df.columns = [col.strip().lower() for col in mapping_df.columns]\n",
|
| 605 |
+
"\n",
|
| 606 |
+
"# Rename for clarity and consistency\n",
|
| 607 |
+
"if 'profession_llm' in mapping_df.columns:\n",
|
| 608 |
+
" mapping_df = mapping_df.rename(columns={'profession_llm': 'profession_short'})\n",
|
| 609 |
+
"if 'category' in mapping_df.columns:\n",
|
| 610 |
+
" mapping_df = mapping_df.rename(columns={'category': 'mapped_profession'})\n",
|
| 611 |
+
"\n",
|
| 612 |
+
"# Merge the mapped profession into final_df\n",
|
| 613 |
+
"#final_df = final_df.merge(mapping_df[['profession_short', 'mapped_profession']], on='profession_short', how='left')\n",
|
| 614 |
+
"\n",
|
| 615 |
+
"\n",
|
| 616 |
+
"additional_columns = df.groupby('full_name').agg(\n",
|
| 617 |
+
" country=('country', 'first'),\n",
|
| 618 |
+
" profession_llm=('profession_llm', 'first'),\n",
|
| 619 |
+
" gender=('gender', 'first'),\n",
|
| 620 |
+
" aliases=('aliases', 'first')\n",
|
| 621 |
+
").reset_index()\n",
|
| 622 |
+
"\n",
|
| 623 |
+
"\n",
|
| 624 |
+
"# Step 3: Merge the aggregated info with the representative info\n",
|
| 625 |
+
"final_df = pd.merge(grouped_df, additional_columns, on='full_name', how='left')\n",
|
| 626 |
+
"\n",
|
| 627 |
+
"# Step 4: Clean and transform columns\n",
|
| 628 |
+
"final_df['profession_short'] = final_df['profession_llm'].apply(get_profession_short)\n",
|
| 629 |
+
"final_df['country'] = final_df['country'].apply(standardize_country)\n",
|
| 630 |
+
"\n",
|
| 631 |
+
"# Step 5: Merge with profession mapping\n",
|
| 632 |
+
"final_df = final_df.merge(mapping_df[['profession_short', 'mapped_profession']], on='profession_short', how='left')\n",
|
| 633 |
+
"\n",
|
| 634 |
+
"# Optional: Save the result to a CSV file\n",
|
| 635 |
+
"final_df.to_csv(output, index=False)\n"
|
| 636 |
+
]
|
| 637 |
+
},
|
| 638 |
+
{
|
| 639 |
+
"cell_type": "code",
|
| 640 |
+
"execution_count": null,
|
| 641 |
+
"id": "704e5246",
|
| 642 |
+
"metadata": {},
|
| 643 |
+
"outputs": [],
|
| 644 |
+
"source": []
|
| 645 |
+
}
|
| 646 |
+
],
|
| 647 |
+
"metadata": {
|
| 648 |
+
"kernelspec": {
|
| 649 |
+
"display_name": "latm",
|
| 650 |
+
"language": "python",
|
| 651 |
+
"name": "python3"
|
| 652 |
+
},
|
| 653 |
+
"language_info": {
|
| 654 |
+
"codemirror_mode": {
|
| 655 |
+
"name": "ipython",
|
| 656 |
+
"version": 3
|
| 657 |
+
},
|
| 658 |
+
"file_extension": ".py",
|
| 659 |
+
"mimetype": "text/x-python",
|
| 660 |
+
"name": "python",
|
| 661 |
+
"nbconvert_exporter": "python",
|
| 662 |
+
"pygments_lexer": "ipython3",
|
| 663 |
+
"version": "3.10.15"
|
| 664 |
+
}
|
| 665 |
+
},
|
| 666 |
+
"nbformat": 4,
|
| 667 |
+
"nbformat_minor": 5
|
| 668 |
+
}
|
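Note on the notebook above: `parse_response` relies on each reply line starting with exactly `1.` to `5.`. The following is a minimal standalone sketch of the same idea, not the notebook's implementation. It assumes the reply may also use `1)` style numbering, and the names `FIELDS`, `LINE_RE`, and `parse_numbered_response` are hypothetical helpers introduced only for illustration. The sample reply uses a placeholder name.

```python
import re

# Field order mirrors the five numbered items requested in the prompt.
FIELDS = ("full_name", "aliases", "gender", "profession_llm", "country")

# Accept "1." or "1)" followed by the answer text; other lines are ignored.
LINE_RE = re.compile(r"^\s*([1-5])[.)]\s*(.*)$")


def parse_numbered_response(text):
    """Map a numbered reply onto FIELDS, defaulting every field to 'Unknown'."""
    parsed = dict.fromkeys(FIELDS, "Unknown")
    for line in text.splitlines():
        match = LINE_RE.match(line)
        # Skip non-matching lines and numbered lines with an empty answer.
        if match and match.group(2).strip():
            field = FIELDS[int(match.group(1)) - 1]
            parsed[field] = match.group(2).strip()
    return parsed


if __name__ == "__main__":
    reply = "1. Jane Doe\n2) J. D.\n3. Unknown\n\n5. Unknown"
    print(parse_numbered_response(reply))
```

Returning a dictionary keyed by column name rather than a positional list makes a missing or reordered line easier to spot when the result is written back into the DataFrame.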
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4__Figure_8b_sunburst_profession-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,123 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# Prepare *.json for Figure 8b"
|
| 8 |
+
]
|
| 9 |
+
},
|
| 10 |
+
{
|
| 11 |
+
"cell_type": "code",
|
| 12 |
+
"execution_count": 5,
|
| 13 |
+
"metadata": {},
|
| 14 |
+
"outputs": [
|
| 15 |
+
{
|
| 16 |
+
"name": "stdout",
|
| 17 |
+
"output_type": "stream",
|
| 18 |
+
"text": [
|
| 19 |
+
"✅ Sunburst data saved to sunburst_data.json\n"
|
| 20 |
+
]
|
| 21 |
+
}
|
| 22 |
+
],
|
| 23 |
+
"source": [
|
| 24 |
+
"import pandas as pd\n",
|
| 25 |
+
"from collections import defaultdict\n",
|
| 26 |
+
"import json\n",
|
| 27 |
+
"from pathlib import Path\n",
|
| 28 |
+
"\n",
|
| 29 |
+
"current_dir = Path.cwd()\n",
|
| 30 |
+
"\n",
|
| 31 |
+
"sunburst_path = current_dir.parent / \"public/json/sunburst_countries_A.json\"\n",
|
| 32 |
+
"\n",
|
| 33 |
+
"\n",
|
| 34 |
+
"aggregated_poi = current_dir.parent / \"data/CSV/Deepseek_annotated_POI_aggregated.csv\"\n",
|
| 35 |
+
"\n",
|
| 36 |
+
"df = pd.read_csv(aggregated_poi)\n",
|
| 37 |
+
"df['country_cleaned'] = df['country'].apply(lambda x: x if x not in ['Unknown', '', None] else 'Other')\n",
|
| 38 |
+
"\n",
|
| 39 |
+
"# Get the 15 most frequent values; 'Other' is counted too and may occupy one of the 15 slots\n",
|
| 40 |
+
"top_countries = df['country_cleaned'].value_counts().nlargest(15).index.tolist()\n",
|
| 41 |
+
"\n",
|
| 42 |
+
"# Final limited country column\n",
|
| 43 |
+
"df['country_limited'] = df['country_cleaned'].apply(lambda x: x if x in top_countries else 'Other')\n",
|
| 44 |
+
"\n",
|
| 45 |
+
"# ---- Step 2: Limit to top 7 professions and combine Unknown and Sports Professional with Other ----\n",
|
| 46 |
+
"top_professions = df['mapped_profession'].value_counts().nlargest(7).index.tolist()\n",
|
| 47 |
+
"\n",
|
| 48 |
+
"# Explicitly remove 'Unknown' and 'Sports Professional' even if they are in the top 7\n",
|
| 49 |
+
"top_professions = [p for p in top_professions if p not in ['Unknown', 'Sports Professional']]\n",
|
| 50 |
+
"\n",
|
| 51 |
+
"# Normalize 'Unknown' and 'Sports Professional' into 'Other'\n",
|
| 54 |
+
"df['profession_limited'] = df['mapped_profession'].apply(\n",
|
| 55 |
+
" lambda x: 'Other' if x in ['Unknown', 'Sports Professional'] or x not in top_professions else x\n",
|
| 56 |
+
")\n",
|
| 57 |
+
"\n",
|
| 58 |
+
"\n",
|
| 59 |
+
"\n",
|
| 60 |
+
"# ---- Step 3: Group by the limited country and profession ----\n",
|
| 61 |
+
"sunburst_data = df.groupby(['country_limited', 'profession_limited']).size().reset_index(name='count')\n",
|
| 62 |
+
"\n",
|
| 63 |
+
"# ---- Step 4: Create a nested structure for D3.js ----\n",
|
| 64 |
+
"sunburst_dict = {\"name\": \"root\", \"children\": []}\n",
|
| 65 |
+
"country_map = defaultdict(list)\n",
|
| 66 |
+
"\n",
|
| 67 |
+
"for _, row in sunburst_data.iterrows():\n",
|
| 68 |
+
" country = row['country_limited']\n",
|
| 69 |
+
" profession = row['profession_limited']\n",
|
| 70 |
+
" count = int(row['count'])\n",
|
| 71 |
+
" country_map[country].append({\"name\": profession, \"value\": count})\n",
|
| 72 |
+
"\n",
|
| 73 |
+
"# For each country, sort the profession list so that \"Other\" appears at the end\n",
|
| 74 |
+
"for country, professions in country_map.items():\n",
|
| 75 |
+
" professions_sorted = sorted(professions, key=lambda d: (d[\"name\"] == \"Other\", d[\"name\"]))\n",
|
| 76 |
+
" country_map[country] = professions_sorted\n",
|
| 77 |
+
"\n",
|
| 78 |
+
"for country, professions in country_map.items():\n",
|
| 79 |
+
" sunburst_dict[\"children\"].append({\"name\": country, \"children\": professions})\n",
|
| 80 |
+
"\n",
|
| 81 |
+
"# ---- Step 5: Save to a JSON file ----\n",
|
| 82 |
+
"with open(sunburst_path, \"w\", encoding='utf-8') as f:\n",
|
| 83 |
+
" json.dump(sunburst_dict, f, ensure_ascii=False, indent=2)\n",
|
| 84 |
+
"\n",
|
| 85 |
+
"\n",
|
| 86 |
+
"print(f\"✅ Sunburst data saved to {sunburst_path}\")\n"
|
| 87 |
+
]
|
| 88 |
+
},
|
| 89 |
+
{
|
| 90 |
+
"cell_type": "markdown",
|
| 91 |
+
"metadata": {},
|
| 92 |
+
"source": [
|
| 93 |
+
"The resulting *.json file is the input for Figure_8.html."
|
| 94 |
+
]
|
| 95 |
+
},
|
| 96 |
+
{
|
| 97 |
+
"cell_type": "markdown",
|
| 98 |
+
"metadata": {},
|
| 99 |
+
"source": []
|
| 100 |
+
}
|
| 101 |
+
],
|
| 102 |
+
"metadata": {
|
| 103 |
+
"kernelspec": {
|
| 104 |
+
"display_name": "latm",
|
| 105 |
+
"language": "python",
|
| 106 |
+
"name": "python3"
|
| 107 |
+
},
|
| 108 |
+
"language_info": {
|
| 109 |
+
"codemirror_mode": {
|
| 110 |
+
"name": "ipython",
|
| 111 |
+
"version": 3
|
| 112 |
+
},
|
| 113 |
+
"file_extension": ".py",
|
| 114 |
+
"mimetype": "text/x-python",
|
| 115 |
+
"name": "python",
|
| 116 |
+
"nbconvert_exporter": "python",
|
| 117 |
+
"pygments_lexer": "ipython3",
|
| 118 |
+
"version": "3.10.15"
|
| 119 |
+
}
|
| 120 |
+
},
|
| 121 |
+
"nbformat": 4,
|
| 122 |
+
"nbformat_minor": 2
|
| 123 |
+
}
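The nesting step in the notebook above (group by country and profession, then build a `root` → country → profession hierarchy with "Other" sorted last) can be condensed into one function. This is a sketch under assumptions: the function name `build_sunburst` and the toy data are hypothetical, and only the column names and the output shape follow the notebook.

```python
from collections import defaultdict

import pandas as pd


def build_sunburst(df, country_col="country_limited", profession_col="profession_limited"):
    """Nest per-(country, profession) row counts into a D3-style hierarchy."""
    counts = df.groupby([country_col, profession_col]).size().reset_index(name="count")

    by_country = defaultdict(list)
    for country, profession, count in counts.itertuples(index=False, name=None):
        # int() converts the NumPy integer so json.dump can serialize it.
        by_country[country].append({"name": profession, "value": int(count)})

    children = []
    for country, professions in by_country.items():
        # False sorts before True, so "Other" moves to the end of each list.
        professions.sort(key=lambda d: (d["name"] == "Other", d["name"]))
        children.append({"name": country, "children": professions})
    return {"name": "root", "children": children}


# Toy input, for illustration only.
toy = pd.DataFrame({
    "country_limited": ["USA", "USA", "USA", "Japan"],
    "profession_limited": ["Other", "Actor", "Actor", "Singer"],
})
tree = build_sunburst(toy)
print(tree)
```

The returned dictionary can be passed to `json.dump` exactly as in Step 5 of the notebook.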
|
jupyter_notebooks/.ipynb_checkpoints/Section_2-3-4_compare-models-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,237 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "code",
|
| 5 |
+
"execution_count": null,
|
| 6 |
+
"id": "6cbaef9d-3058-4a59-a8ee-32fcc2062ed6",
|
| 7 |
+
"metadata": {},
|
| 8 |
+
"outputs": [],
|
| 9 |
+
"source": [
|
| 10 |
+
"# Jupyter Notebook Cell for Country Name Standardization\n",
|
| 11 |
+
"# Copy and paste this entire cell into your Jupyter notebook\n",
|
| 12 |
+
"\n",
|
| 13 |
+
"import pandas as pd\n",
|
| 14 |
+
"import os\n",
|
| 15 |
+
"import re\n",
|
| 16 |
+
"\n",
|
| 17 |
+
"# Define country standardization mappings\n",
|
| 18 |
+
"COUNTRY_MAPPINGS = {\n",
|
| 19 |
+
" # USA variations\n",
|
| 20 |
+
" 'united states': 'USA',\n",
|
| 21 |
+
" 'united states of america': 'USA',\n",
|
| 22 |
+
" 'america': 'USA',\n",
|
| 23 |
+
" 'us': 'USA',\n",
|
| 24 |
+
" 'u.s.': 'USA',\n",
|
| 25 |
+
" 'u.s.a.': 'USA',\n",
|
| 26 |
+
" 'states': 'USA',\n",
|
| 27 |
+
" \n",
|
| 28 |
+
" # UK variations\n",
|
| 29 |
+
" 'united kingdom': 'UK',\n",
|
| 30 |
+
" 'england': 'UK',\n",
|
| 31 |
+
" 'britain': 'UK',\n",
|
| 32 |
+
" 'great britain': 'UK',\n",
|
| 33 |
+
" 'uk': 'UK',\n",
|
| 34 |
+
" 'u.k.': 'UK',\n",
|
| 35 |
+
" \n",
|
| 36 |
+
" # Turkey -> Türkiye\n",
|
| 37 |
+
" 'turkey': 'Türkiye',\n",
|
| 38 |
+
" \n",
|
| 39 |
+
" # Czech Republic -> Czechia\n",
|
| 40 |
+
" 'czech republic': 'Czechia',\n",
|
| 41 |
+
" 'czechoslovakia': 'Czechia',\n",
|
| 42 |
+
" 'czechoslowakia': 'Czechia', # Common misspelling\n",
|
| 43 |
+
" \n",
|
| 44 |
+
" # USSR -> Russia\n",
|
| 45 |
+
" 'ussr': 'Russia',\n",
|
| 46 |
+
" 'udssr': 'Russia',\n",
|
| 47 |
+
" 'soviet union': 'Russia',\n",
|
| 48 |
+
" \n",
|
| 49 |
+
" # Korea variations (maintain distinction between North and South)\n",
|
| 50 |
+
" 'korea': 'South Korea', # If just \"Korea\", assume South Korea\n",
|
| 51 |
+
" 'south korea': 'South Korea',\n",
|
| 52 |
+
" 'republic of korea': 'South Korea',\n",
|
| 53 |
+
" 'north korea': 'North Korea',\n",
|
| 54 |
+
" 'dprk': 'North Korea',\n",
|
| 55 |
+
" \n",
|
| 56 |
+
" # China variations\n",
|
| 57 |
+
" 'china': 'China',\n",
|
| 58 |
+
" 'people\\'s republic of china': 'China',\n",
|
| 59 |
+
" 'prc': 'China',\n",
|
| 60 |
+
" 'mainland china': 'China',\n",
|
| 61 |
+
" \n",
|
| 62 |
+
" # Common standardizations\n",
|
| 63 |
+
" 'holland': 'Netherlands',\n",
|
| 64 |
+
" 'the netherlands': 'Netherlands',\n",
|
| 65 |
+
" 'deutschland': 'Germany',\n",
|
| 66 |
+
" 'nippon': 'Japan',\n",
|
| 67 |
+
" 'espana': 'Spain',\n",
|
| 68 |
+
" 'españa': 'Spain',\n",
|
| 69 |
+
" \n",
|
| 70 |
+
" # Keep these as-is but included for completeness\n",
|
| 71 |
+
" 'usa': 'USA',\n",
|
| 72 |
+
" 'russia': 'Russia',\n",
|
| 73 |
+
"}\n",
|
| 74 |
+
"\n",
|
| 75 |
+
"def standardize_country(country_value):\n",
|
| 76 |
+
" \"\"\"\n",
|
| 77 |
+
" Standardize a single country name based on the mapping.\n",
|
| 78 |
+
" \"\"\"\n",
|
| 79 |
+
" if pd.isna(country_value):\n",
|
| 80 |
+
" return country_value\n",
|
| 81 |
+
" \n",
|
| 82 |
+
" # Convert to string and strip whitespace\n",
|
| 83 |
+
" country_str = str(country_value).strip()\n",
|
| 84 |
+
" \n",
|
| 85 |
+
" # Return if empty or 'Unknown'\n",
|
| 86 |
+
" if not country_str or country_str.lower() == 'unknown':\n",
|
| 87 |
+
" return country_str\n",
|
| 88 |
+
" \n",
|
| 89 |
+
" # Convert to lowercase for matching\n",
|
| 90 |
+
" country_lower = country_str.lower()\n",
|
| 91 |
+
" \n",
|
| 92 |
+
" # Check if it matches any of our mappings\n",
|
| 93 |
+
" for pattern, replacement in COUNTRY_MAPPINGS.items():\n",
|
| 94 |
+
" if country_lower == pattern:\n",
|
| 95 |
+
" return replacement\n",
|
| 96 |
+
" \n",
|
| 97 |
+
"    # If no exact match is found, return the stripped original unchanged\n",
|
| 98 |
+
" # This preserves countries not in our mapping\n",
|
| 99 |
+
" return country_str\n",
|
| 100 |
+
"\n",
|
| 101 |
+
"def process_csv_file(input_file, output_file):\n",
|
| 102 |
+
" \"\"\"\n",
|
| 103 |
+
" Process a CSV file to standardize country names.\n",
|
| 104 |
+
" \"\"\"\n",
|
| 105 |
+
" print(f\"Processing: {input_file}\")\n",
|
| 106 |
+
" \n",
|
| 107 |
+
" # For mistral.csv which has no header, we need special handling\n",
|
| 108 |
+
" if 'mistral' in input_file.lower():\n",
|
| 109 |
+
" # Read without header\n",
|
| 110 |
+
" df = pd.read_csv(input_file, header=None)\n",
|
| 111 |
+
" \n",
|
| 112 |
+
" # Check if the last column contains country data\n",
|
| 113 |
+
" # Based on the structure, country should be in the last column\n",
|
| 114 |
+
" last_col = df.columns[-1]\n",
|
| 115 |
+
" \n",
|
| 116 |
+
" # Apply standardization to the last column\n",
|
| 117 |
+
" df[last_col] = df[last_col].apply(standardize_country)\n",
|
| 118 |
+
" \n",
|
| 119 |
+
" print(f\" - Standardized column {last_col} (assumed to be country)\")\n",
|
| 120 |
+
" print(f\" - Sample values after standardization: {df[last_col].dropna().head(10).tolist()}\")\n",
|
| 121 |
+
" else:\n",
|
| 122 |
+
" # Normal CSV with header\n",
|
| 123 |
+
" df = pd.read_csv(input_file)\n",
|
| 124 |
+
" \n",
|
| 125 |
+
" # Check if 'country' column exists\n",
|
| 126 |
+
" if 'country' in df.columns:\n",
|
| 127 |
+
" # Apply standardization\n",
|
| 128 |
+
" df['country'] = df['country'].apply(standardize_country)\n",
|
| 129 |
+
" \n",
|
| 130 |
+
" print(f\" - Found and standardized 'country' column\")\n",
|
| 131 |
+
" print(f\" - Unique countries after standardization: {sorted(df['country'].dropna().unique())}\")\n",
|
| 132 |
+
" else:\n",
|
| 133 |
+
" print(f\" - Warning: No 'country' column found in {input_file}\")\n",
|
| 134 |
+
" \n",
|
| 135 |
+
" # Save the processed file\n",
|
| 136 |
+
" df.to_csv(output_file, index=False)\n",
|
| 137 |
+
" print(f\" - Saved to: {output_file}\\n\")\n",
|
| 138 |
+
" \n",
|
| 139 |
+
" return df\n",
|
| 140 |
+
"\n",
|
| 141 |
+
"# ============================================================\n",
|
| 142 |
+
"# MAIN EXECUTION - Adjust paths as needed for your environment\n",
|
| 143 |
+
"# ============================================================\n",
|
| 144 |
+
"\n",
|
| 145 |
+
"# Input files - adjust these paths to match your file locations\n",
|
| 146 |
+
"input_files = [\n",
|
| 147 |
+
" 'gemma.csv',\n",
|
| 148 |
+
" 'mistral.csv',\n",
|
| 149 |
+
" 'qwen.csv'\n",
|
| 150 |
+
"]\n",
|
| 151 |
+
"\n",
|
| 152 |
+
"# Create a results dictionary to store processed dataframes\n",
|
| 153 |
+
"results = {}\n",
|
| 154 |
+
"\n",
|
| 155 |
+
"for filename in input_files:\n",
|
| 156 |
+
" # Adjust these paths based on where your files are located\n",
|
| 157 |
+
" # For example, if files are in current directory, use: input_path = filename\n",
|
| 158 |
+
" input_path = filename # Modify this based on your file location\n",
|
| 159 |
+
" \n",
|
| 160 |
+
" # Create output filename\n",
|
| 161 |
+
" base_name = filename.replace('.csv', '')\n",
|
| 162 |
+
" output_path = f'{base_name}_standardized_country.csv'\n",
|
| 163 |
+
" \n",
|
| 164 |
+
" # Check if input file exists\n",
|
| 165 |
+
" if os.path.exists(input_path):\n",
|
| 166 |
+
" df = process_csv_file(input_path, output_path)\n",
|
| 167 |
+
" results[base_name] = df\n",
|
| 168 |
+
" else:\n",
|
| 169 |
+
" print(f\"Warning: {input_path} not found! Please check the file path.\\n\")\n",
|
| 170 |
+
"\n",
|
| 171 |
+
"# Display summary statistics\n",
|
| 172 |
+
"print(\"=\" * 60)\n",
|
| 173 |
+
"print(\"SUMMARY OF COUNTRY STANDARDIZATION\")\n",
|
| 174 |
+
"print(\"=\" * 60)\n",
|
| 175 |
+
"\n",
|
| 176 |
+
"for name, df in results.items():\n",
|
| 177 |
+
" print(f\"\\n{name.upper()}:\")\n",
|
| 178 |
+
" print(f\" Total rows: {len(df)}\")\n",
|
| 179 |
+
" \n",
|
| 180 |
+
" # Find the country column\n",
|
| 181 |
+
" if name == 'mistral':\n",
|
| 182 |
+
" # For mistral, assume last column is country\n",
|
| 183 |
+
" country_col = df.columns[-1]\n",
|
| 184 |
+
" else:\n",
|
| 185 |
+
" country_col = 'country' if 'country' in df.columns else None\n",
|
| 186 |
+
" \n",
|
| 187 |
+
" if country_col is not None:\n",
|
| 188 |
+
" country_counts = df[country_col].value_counts()\n",
|
| 189 |
+
" print(f\" Top 10 countries:\")\n",
|
| 190 |
+
" for country, count in country_counts.head(10).items():\n",
|
| 191 |
+
" print(f\" - {country}: {count}\")\n",
|
| 192 |
+
"\n",
|
| 193 |
+
"print(\"\\n\" + \"=\" * 60)\n",
|
| 194 |
+
"print(\"Processing complete!\")\n",
|
| 195 |
+
"print(\"Standardized files have been created with '_standardized_country.csv' suffix\")\n",
|
| 196 |
+
"print(\"=\" * 60)\n",
|
| 197 |
+
"\n",
|
| 198 |
+
"# Optional: Display before/after comparison for a sample\n",
|
| 199 |
+
"print(\"\\n\" + \"=\" * 60)\n",
|
| 200 |
+
"print(\"EXAMPLE TRANSFORMATIONS APPLIED:\")\n",
|
| 201 |
+
"print(\"=\" * 60)\n",
|
| 202 |
+
"print(\"• 'United States' → 'USA'\")\n",
|
| 203 |
+
"print(\"• 'United States of America' → 'USA'\")\n",
|
| 204 |
+
"print(\"• 'States' → 'USA'\")\n",
|
| 205 |
+
"print(\"• 'England' → 'UK'\")\n",
|
| 206 |
+
"print(\"• 'United Kingdom' → 'UK'\")\n",
|
| 207 |
+
"print(\"• 'Britain' → 'UK'\")\n",
|
| 208 |
+
"print(\"• 'Turkey' → 'Türkiye'\")\n",
|
| 209 |
+
"print(\"• 'Czech Republic' → 'Czechia'\")\n",
|
| 210 |
+
"print(\"• 'Korea' → 'South Korea'\")\n",
|
| 211 |
+
"print(\"• 'UDSSR' → 'Russia'\")\n",
|
| 212 |
+
"print(\"=\" * 60)"
|
| 213 |
+
]
|
| 214 |
+
}
|
| 215 |
+
],
|
| 216 |
+
"metadata": {
|
| 217 |
+
"kernelspec": {
|
| 218 |
+
"display_name": "Python 3 (ipykernel)",
|
| 219 |
+
"language": "python",
|
| 220 |
+
"name": "python3"
|
| 221 |
+
},
|
| 222 |
+
"language_info": {
|
| 223 |
+
"codemirror_mode": {
|
| 224 |
+
"name": "ipython",
|
| 225 |
+
"version": 3
|
| 226 |
+
},
|
| 227 |
+
"file_extension": ".py",
|
| 228 |
+
"mimetype": "text/x-python",
|
| 229 |
+
"name": "python",
|
| 230 |
+
"nbconvert_exporter": "python",
|
| 231 |
+
"pygments_lexer": "ipython3",
|
| 232 |
+
"version": "3.12.10"
|
| 233 |
+
}
|
| 234 |
+
},
|
| 235 |
+
"nbformat": 4,
|
| 236 |
+
"nbformat_minor": 5
|
| 237 |
+
}
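Because `standardize_country` in the notebook above only does exact matches on the lowercased string, the loop over `COUNTRY_MAPPINGS.items()` can be replaced by a single dictionary lookup. The sketch below is one possible shape, assuming a small excerpt of the mapping; it should behave the same for empty strings and 'Unknown', since neither is a key in the mapping.

```python
import pandas as pd

# Small excerpt of the notebook's mapping; keys are lowercase.
COUNTRY_MAPPINGS = {
    "united states": "USA",
    "england": "UK",
    "turkey": "Türkiye",
    "korea": "South Korea",
}


def standardize_country(value):
    """Return the standardized country name, or the stripped input if unmapped."""
    if pd.isna(value):
        return value
    text = str(value).strip()
    # dict.get falls back to the stripped original when there is no exact match.
    return COUNTRY_MAPPINGS.get(text.lower(), text)


# Toy input, for illustration only.
countries = pd.Series([" United States ", "England", "Unknown", None, "France"])
result = countries.apply(standardize_country)
print(result.tolist())
```

A dictionary lookup is constant time per value, whereas the loop scans the whole mapping for every row.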
|
jupyter_notebooks/.ipynb_checkpoints/Section_2-3_Figure_5_co-occurence_promotional_tags-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,314 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# Create *.json for Figure 5 (co-occurrence network of tags)"
|
| 8 |
+
]
|
| 9 |
+
},
|
| 10 |
+
{
|
| 11 |
+
"cell_type": "code",
|
| 12 |
+
"execution_count": 5,
|
| 13 |
+
"metadata": {},
|
| 14 |
+
"outputs": [
|
| 15 |
+
{
|
| 16 |
+
"name": "stdout",
|
| 17 |
+
"output_type": "stream",
|
| 18 |
+
"text": [
|
| 19 |
+
"Processing: america\n",
|
| 20 |
+
" ✅ Saved to /home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/public/json/tags_america.json\n"
|
| 21 |
+
]
|
| 22 |
+
}
|
| 23 |
+
],
|
| 24 |
+
"source": [
|
| 25 |
+
"import pandas as pd\n",
|
| 26 |
+
"from itertools import combinations\n",
|
| 27 |
+
"from collections import Counter, defaultdict\n",
|
| 28 |
+
"import json\n",
|
| 29 |
+
"import re\n",
|
| 30 |
+
"import os\n",
|
| 31 |
+
"\n",
|
| 32 |
+
"from pathlib import Path\n",
|
| 33 |
+
"current_dir = Path.cwd()\n",
|
| 34 |
+
"\n",
|
| 35 |
+
"# === CONFIG ===\n",
|
| 36 |
+
"file_path = current_dir.parent / \"data/CSV/Models/Civi_models.csv\"\n",
|
| 37 |
+
"output_dir = current_dir.parent / \"public/json/\"\n",
|
| 38 |
+
"#target_terms = [\"asian\", \"indian\", \"man\", \"woman\", \"german\", \"korean\", \"american\", \"russian\", \"style\", \"japanese\", \"chinese\"] # Add any tags you want to process\n",
|
| 39 |
+
"#target_terms = [\"character\", \"instagram\", \"youtuber\", \"actor\", \"actress\", \"celebrity\", \"vtuber\", \"kpop\"] # Add any tags you want to process\n",
|
| 40 |
+
"target_terms = [\"america\"] # Add any tags you want to process\n",
|
| 41 |
+
"min_connections = 1 # minimum number of link connections per node\n",
|
| 42 |
+
"\n",
|
| 43 |
+
"# === LOAD DATA ===\n",
|
| 44 |
+
"df = pd.read_csv(file_path)\n",
|
| 45 |
+
"tag_columns = [f\"tag_{i}\" for i in range(1, 8)]\n",
|
| 46 |
+
"df_tags = df[tag_columns]\n",
|
| 47 |
+
"\n",
|
| 48 |
+
"# === MAIN LOOP ===\n",
|
| 49 |
+
"for target_term in target_terms:\n",
|
| 50 |
+
" print(f\"Processing: {target_term}\")\n",
|
| 51 |
+
" \n",
|
| 52 |
+
" pattern = re.compile(rf'\\b{re.escape(target_term)}\\b', flags=re.IGNORECASE)\n",
|
| 53 |
+
" df_filtered = df_tags[df_tags.apply(\n",
|
| 54 |
+
" lambda row: row.astype(str).apply(lambda x: bool(pattern.search(x))).any(),\n",
|
| 55 |
+
" axis=1\n",
|
| 56 |
+
" )]\n",
|
| 57 |
+
"\n",
|
| 58 |
+
" # Skip if no data matches\n",
|
| 59 |
+
" if df_filtered.empty:\n",
|
| 60 |
+
" print(f\" ⚠️ No matches for '{target_term}', skipping.\")\n",
|
| 61 |
+
" continue\n",
|
| 62 |
+
"\n",
|
| 63 |
+
" # === COUNT INDIVIDUAL TAGS ===\n",
|
| 64 |
+
" all_tags = df_filtered.values.flatten()\n",
|
| 65 |
+
" all_tags = [tag for tag in all_tags if pd.notna(tag)]\n",
|
| 66 |
+
" tag_counts = Counter(all_tags)\n",
|
| 67 |
+
"\n",
|
| 68 |
+
" # === CO-OCCURRENCE ===\n",
|
| 69 |
+
" co_occurrences = defaultdict(int)\n",
|
| 70 |
+
" for tags in df_filtered.itertuples(index=False, name=None):\n",
|
| 71 |
+
" tags = [tag for tag in tags if pd.notna(tag)]\n",
|
| 72 |
+
" for tag1, tag2 in combinations(tags, 2):\n",
|
| 73 |
+
" co_occurrences[frozenset([tag1, tag2])] += 1\n",
|
| 74 |
+
"\n",
|
| 75 |
+
"    edges = [(*pair, weight) for pair, weight in co_occurrences.items() if len(pair) == 2]\n",
|
| 76 |
+
"\n",
|
| 77 |
+
" # === FILTER BY CONNECTIONS ===\n",
|
| 78 |
+
" connected_tags = Counter()\n",
|
| 79 |
+
" for tag1, tag2, _ in edges:\n",
|
| 80 |
+
" connected_tags[tag1] += 1\n",
|
| 81 |
+
" connected_tags[tag2] += 1\n",
|
| 82 |
+
"\n",
|
| 83 |
+
" nodes = [{\"id\": tag, \"size\": tag_counts[tag]} for tag in tag_counts if connected_tags[tag] >= min_connections]\n",
|
| 84 |
+
" valid_ids = set(node[\"id\"] for node in nodes)\n",
|
| 85 |
+
" links = [{\"source\": tag1, \"target\": tag2, \"value\": weight}\n",
|
| 86 |
+
" for tag1, tag2, weight in edges\n",
|
| 87 |
+
" if tag1 in valid_ids and tag2 in valid_ids]\n",
|
| 88 |
+
"\n",
|
| 89 |
+
" if not nodes or not links:\n",
|
| 90 |
+
" print(f\" ⚠️ Not enough connections for '{target_term}', skipping.\")\n",
|
| 91 |
+
" continue\n",
|
| 92 |
+
"\n",
|
| 93 |
+
" # === EXPORT ===\n",
|
| 94 |
+
" d3_data = {\"nodes\": nodes, \"links\": links}\n",
|
| 95 |
+
" safe_term = re.sub(r'\\W+', '_', target_term.lower())\n",
|
| 96 |
+
" output_file = os.path.join(output_dir, f\"tags_{safe_term}.json\")\n",
|
| 97 |
+
" \n",
|
| 98 |
+
" with open(output_file, \"w\") as f:\n",
|
| 99 |
+
" json.dump(d3_data, f, indent=4)\n",
|
| 100 |
+
" \n",
|
| 101 |
+
" print(f\" ✅ Saved to {output_file}\")\n"
|
| 102 |
+
]
|
| 103 |
+
},
|
| 104 |
+
{
|
| 105 |
+
"cell_type": "markdown",
|
| 106 |
+
"metadata": {},
|
| 107 |
+
"source": [
|
| 108 |
+
"## Different Countries"
|
| 109 |
+
]
|
| 110 |
+
},
|
| 111 |
+
{
|
| 112 |
+
"cell_type": "code",
|
| 113 |
+
"execution_count": 2,
|
| 114 |
+
"metadata": {},
|
| 115 |
+
"outputs": [
|
| 116 |
+
{
|
| 117 |
+
"name": "stderr",
|
| 118 |
+
"output_type": "stream",
|
| 119 |
+
"text": [
|
| 120 |
+
"/tmp/ipykernel_68582/2797381217.py:15: DtypeWarning: Columns (50,51,54,55,56,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,458) have mixed types. Specify dtype option on import or set low_memory=False.\n",
|
| 121 |
+
" df = pd.read_csv(file_path)\n"
|
| 122 |
+
]
|
| 123 |
+
},
|
| 124 |
+
{
|
| 125 |
+
"name": "stdout",
|
| 126 |
+
"output_type": "stream",
|
| 127 |
+
"text": [
|
| 128 |
+
"Processing: united states\n",
|
| 129 |
+
" ⚠️ No matches for 'united states', skipping.\n",
|
| 130 |
+
"Processing: korea\n",
|
| 131 |
+
" ✅ Saved to data/json/tags_korea_poi.json\n",
|
| 132 |
+
"Processing: uk\n",
|
| 133 |
+
" ✅ Saved to data/json/tags_uk_poi.json\n",
|
| 134 |
+
"Processing: russia\n",
|
| 135 |
+
" ✅ Saved to data/json/tags_russia_poi.json\n",
|
| 136 |
+
"Processing: china\n",
|
| 137 |
+
" ✅ Saved to data/json/tags_china_poi.json\n",
|
| 138 |
+
"Processing: canada\n",
|
| 139 |
+
" ✅ Saved to data/json/tags_canada_poi.json\n",
|
| 140 |
+
"Processing: India\n",
|
| 141 |
+
" ✅ Saved to data/json/tags_india_poi.json\n",
|
| 142 |
+
"Processing: germany\n",
|
| 143 |
+
" ✅ Saved to data/json/tags_germany_poi.json\n"
|
| 144 |
+
]
|
| 145 |
+
}
|
| 146 |
+
],
|
| 147 |
+
"source": [
|
| 148 |
+
"import pandas as pd\n",
|
| 149 |
+
"from itertools import combinations\n",
|
| 150 |
+
"from collections import Counter, defaultdict\n",
|
| 151 |
+
"import json\n",
|
| 152 |
+
"import re\n",
|
| 153 |
+
"import os\n",
|
| 154 |
+
"\n",
|
| 155 |
+
"# === CONFIG ===\n",
|
| 156 |
+
"file_path = \"data/model_subsets/all_models_poi_true.csv\"\n",
|
| 157 |
+
"output_dir = \"data/json/\"\n",
|
| 158 |
+
"target_terms = [\"united states\", \"korea\", \"uk\", \"russia\", \"china\", \"canada\", \"India\", \"germany\"] # Add any tags you want to process\n",
|
| 159 |
+
"min_connections = 1 # minimum number of link connections per node\n",
|
| 160 |
+
"\n",
|
| 161 |
+
"# === LOAD DATA ===\n",
|
| 162 |
+
"df = pd.read_csv(file_path)\n",
|
| 163 |
+
"tag_columns = [f\"tag_{i}\" for i in range(1, 8)]\n",
|
| 164 |
+
"df_tags = df[tag_columns]\n",
|
| 165 |
+
"\n",
|
| 166 |
+
"# === MAIN LOOP ===\n",
|
| 167 |
+
"for target_term in target_terms:\n",
|
| 168 |
+
" print(f\"Processing: {target_term}\")\n",
|
| 169 |
+
" \n",
|
| 170 |
+
" pattern = re.compile(rf'\\b{re.escape(target_term)}\\b', flags=re.IGNORECASE)\n",
|
| 171 |
+
" df_filtered = df_tags[df_tags.apply(\n",
|
| 172 |
+
" lambda row: row.astype(str).apply(lambda x: bool(pattern.search(x))).any(),\n",
|
| 173 |
+
" axis=1\n",
|
| 174 |
+
" )]\n",
|
| 175 |
+
"\n",
|
| 176 |
+
" # Skip if no data matches\n",
|
| 177 |
+
" if df_filtered.empty:\n",
|
| 178 |
+
" print(f\" ⚠️ No matches for '{target_term}', skipping.\")\n",
|
| 179 |
+
" continue\n",
|
| 180 |
+
"\n",
|
| 181 |
+
" # === COUNT INDIVIDUAL TAGS ===\n",
|
| 182 |
+
" all_tags = df_filtered.values.flatten()\n",
|
| 183 |
+
" all_tags = [tag for tag in all_tags if pd.notna(tag)]\n",
|
| 184 |
+
" tag_counts = Counter(all_tags)\n",
|
| 185 |
+
"\n",
|
| 186 |
+
" # === CO-OCCURRENCE ===\n",
|
| 187 |
+
" co_occurrences = defaultdict(int)\n",
|
| 188 |
+
" for tags in df_filtered.itertuples(index=False, name=None):\n",
|
| 189 |
+
" tags = [tag for tag in tags if pd.notna(tag)]\n",
|
| 190 |
+
" for tag1, tag2 in combinations(tags, 2):\n",
|
| 191 |
+
" co_occurrences[frozenset([tag1, tag2])] += 1\n",
|
| 192 |
+
"\n",
|
| 193 |
+
"    edges = [(*pair, weight) for pair, weight in co_occurrences.items() if len(pair) == 2]\n",
|
| 194 |
+
"\n",
|
| 195 |
+
" # === FILTER BY CONNECTIONS ===\n",
|
| 196 |
+
" connected_tags = Counter()\n",
|
| 197 |
+
" for tag1, tag2, _ in edges:\n",
|
| 198 |
+
" connected_tags[tag1] += 1\n",
|
| 199 |
+
" connected_tags[tag2] += 1\n",
|
| 200 |
+
"\n",
|
| 201 |
+
" nodes = [{\"id\": tag, \"size\": tag_counts[tag]} for tag in tag_counts if connected_tags[tag] >= min_connections]\n",
|
| 202 |
+
" valid_ids = set(node[\"id\"] for node in nodes)\n",
|
| 203 |
+
" links = [{\"source\": tag1, \"target\": tag2, \"value\": weight}\n",
|
| 204 |
+
" for tag1, tag2, weight in edges\n",
|
| 205 |
+
" if tag1 in valid_ids and tag2 in valid_ids]\n",
|
| 206 |
+
"\n",
|
| 207 |
+
" if not nodes or not links:\n",
|
| 208 |
+
" print(f\" ⚠️ Not enough connections for '{target_term}', skipping.\")\n",
|
| 209 |
+
" continue\n",
|
| 210 |
+
"\n",
|
| 211 |
+
" # === EXPORT ===\n",
|
| 212 |
+
" d3_data = {\"nodes\": nodes, \"links\": links}\n",
|
| 213 |
+
" safe_term = re.sub(r'\\W+', '_', target_term.lower())\n",
|
| 214 |
+
" output_file = os.path.join(output_dir, f\"tags_{safe_term}_poi.json\")\n",
|
| 215 |
+
" \n",
|
| 216 |
+
" with open(output_file, \"w\") as f:\n",
|
| 217 |
+
" json.dump(d3_data, f, indent=4)\n",
|
| 218 |
+
" \n",
|
| 219 |
+
" print(f\" ✅ Saved to {output_file}\")\n"
|
| 220 |
+
]
|
| 221 |
+
},
|
| 222 |
+
{
|
| 223 |
+
"cell_type": "code",
|
| 224 |
+
"execution_count": 5,
|
| 225 |
+
"metadata": {},
|
| 226 |
+
"outputs": [
|
| 227 |
+
{
|
| 228 |
+
"name": "stderr",
|
| 229 |
+
"output_type": "stream",
|
| 230 |
+
"text": [
|
| 231 |
+
"/tmp/ipykernel_79893/1420003123.py:13: DtypeWarning: Columns (50,51,54,55,56,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,458) have mixed types. Specify dtype option on import or set low_memory=False.\n",
|
| 232 |
+
" df = pd.read_csv(file_path)\n"
|
| 233 |
+
]
|
| 234 |
+
},
|
| 235 |
+
{
|
| 236 |
+
"name": "stdout",
|
| 237 |
+
"output_type": "stream",
|
| 238 |
+
"text": [
|
| 239 |
+
"✅ Exported 60330 nodes and 16921 links to public/json/nodes_all.json\n"
|
| 240 |
+
]
|
| 241 |
+
}
|
| 242 |
+
],
|
| 243 |
+
"source": [
|
| 244 |
+
"import pandas as pd\n",
|
| 245 |
+
"from itertools import combinations\n",
|
| 246 |
+
"from collections import Counter, defaultdict\n",
|
| 247 |
+
"import json\n",
|
| 248 |
+
"import os\n",
|
| 249 |
+
"\n",
|
| 250 |
+
"# === CONFIG ===\n",
|
| 251 |
+
"file_path = \"data/model_subsets/all_models_poi_false.csv\"\n",
|
| 252 |
+
"output_file = \"public/json/nodes_all.json\"\n",
|
| 253 |
+
"min_link_threshold = 10 # Only keep edges with co-occurrence >= this\n",
|
| 254 |
+
"\n",
|
| 255 |
+
"# === LOAD DATA ===\n",
|
| 256 |
+
"df = pd.read_csv(file_path)\n",
|
| 257 |
+
"tag_columns = [f\"tag_{i}\" for i in range(1, 8)]\n",
|
| 258 |
+
"df_tags = df[tag_columns]\n",
|
| 259 |
+
"\n",
|
| 260 |
+
"# === COUNT INDIVIDUAL TAGS ===\n",
|
| 261 |
+
"all_tags = df_tags.values.flatten()\n",
|
| 262 |
+
"all_tags = [tag for tag in all_tags if pd.notna(tag)]\n",
|
| 263 |
+
"tag_counts = Counter(all_tags)\n",
|
| 264 |
+
"\n",
|
| 265 |
+
"# === CO-OCCURRENCE ===\n",
|
| 266 |
+
"co_occurrences = defaultdict(int)\n",
|
| 267 |
+
"for tags in df_tags.itertuples(index=False, name=None):\n",
|
| 268 |
+
" tags = [tag for tag in tags if pd.notna(tag)]\n",
|
| 269 |
+
" for tag1, tag2 in combinations(tags, 2):\n",
|
| 270 |
+
" co_occurrences[frozenset([tag1, tag2])] += 1\n",
|
| 271 |
+
"\n",
|
| 272 |
+
"# === BUILD EDGES (filtered by co-occurrence threshold) ===\n",
|
| 273 |
+
"edges = [\n",
|
| 274 |
+
"    {\"source\": sorted(pair, key=str)[0], \"target\": sorted(pair, key=str)[-1], \"value\": weight}\n",
|
| 275 |
+
" for pair, weight in co_occurrences.items()\n",
|
| 276 |
+
" if weight >= min_link_threshold\n",
|
| 277 |
+
"]\n",
|
| 278 |
+
"\n",
|
| 279 |
+
"# === BUILD NODES (all tags that appear, regardless of links) ===\n",
|
| 280 |
+
"nodes = [{\"id\": tag, \"size\": tag_counts[tag]} for tag in tag_counts]\n",
|
| 281 |
+
"\n",
|
| 282 |
+
"# === EXPORT ===\n",
|
| 283 |
+
"d3_data = {\"nodes\": nodes, \"links\": edges}\n",
|
| 284 |
+
"\n",
|
| 285 |
+
"os.makedirs(os.path.dirname(output_file), exist_ok=True)\n",
|
| 286 |
+
"with open(output_file, \"w\") as f:\n",
|
| 287 |
+
" json.dump(d3_data, f, indent=4)\n",
|
| 288 |
+
"\n",
|
| 289 |
+
"print(f\"✅ Exported {len(nodes)} nodes and {len(edges)} links to {output_file}\")\n"
|
| 290 |
+
]
|
| 291 |
+
}
|
| 292 |
+
],
|
| 293 |
+
"metadata": {
|
| 294 |
+
"kernelspec": {
|
| 295 |
+
"display_name": "latm",
|
| 296 |
+
"language": "python",
|
| 297 |
+
"name": "python3"
|
| 298 |
+
},
|
| 299 |
+
"language_info": {
|
| 300 |
+
"codemirror_mode": {
|
| 301 |
+
"name": "ipython",
|
| 302 |
+
"version": 3
|
| 303 |
+
},
|
| 304 |
+
"file_extension": ".py",
|
| 305 |
+
"mimetype": "text/x-python",
|
| 306 |
+
"name": "python",
|
| 307 |
+
"nbconvert_exporter": "python",
|
| 308 |
+
"pygments_lexer": "ipython3",
|
| 309 |
+
"version": "3.10.15"
|
| 310 |
+
}
|
| 311 |
+
},
|
| 312 |
+
"nbformat": 4,
|
| 313 |
+
"nbformat_minor": 2
|
| 314 |
+
}
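The co-occurrence cell above can be condensed into a small, testable function. The following is only a sketch under stated assumptions, not the notebook's code: `build_cooccurrence_graph` is a hypothetical name, each tag is counted once per row (the notebook counts every occurrence), and pairs are stored as sorted tuples so that `source` and `target` come out in a stable order.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(rows, min_link_threshold=1):
    """Build D3-style nodes/links from an iterable of tag lists (hypothetical helper)."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in rows:
        # Drop missing values (None or NaN) and keep each tag once per row.
        unique_tags = sorted({t for t in tags if t is not None and t == t}, key=str)
        tag_counts.update(unique_tags)
        # combinations() over a sorted list yields every pair in a stable order.
        pair_counts.update(combinations(unique_tags, 2))
    nodes = [{"id": tag, "size": count} for tag, count in tag_counts.items()]
    links = [
        {"source": a, "target": b, "value": weight}
        for (a, b), weight in pair_counts.items()
        if weight >= min_link_threshold
    ]
    return {"nodes": nodes, "links": links}


example = build_cooccurrence_graph(
    [["anime", "woman", None], ["woman", "anime"], ["anime", "style"]],
    min_link_threshold=2,
)
```

Sorted tuples avoid relying on the iteration order of a `frozenset`, and de-duplicating per row prevents a tag that is listed twice on one model from pairing with itself.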
|
jupyter_notebooks/.ipynb_checkpoints/Section_2-4_Figure_9_ectract_LoRA_metadata_v2-checkpoint.ipynb
ADDED
|
@@ -0,0 +1,400 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "f36422c8",
|
| 6 |
+
"metadata": {},
|
| 7 |
+
"source": [
|
| 8 |
+
"# LoRA metadata"
|
| 9 |
+
]
|
| 10 |
+
},
|
| 11 |
+
{
|
| 12 |
+
"cell_type": "raw",
|
| 13 |
+
"id": "8a2feb6e",
|
| 14 |
+
"metadata": {},
|
| 15 |
+
"source": [
|
| 16 |
+
"LoRA Metadata Processing Workflow\n",
|
| 17 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 18 |
+
"│ Load CSV File │ --> │ Read adapter metadata CSV file. │\n",
|
| 19 |
+
"│ Read Model Versions │ │ Extract model version IDs and relevant data. │\n",
|
| 20 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 21 |
+
" ↓\n",
|
| 22 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 23 |
+
"│ Download Adapter │ --> │ Use stored download URLs to fetch adapter files │\n",
|
| 24 |
+
"│ Files Using API │ │ using rotating API keys. │\n",
|
| 25 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 26 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 27 |
+
"│ Parse Metadata │ --> │ Extract safetensors metadata, such as training │\n",
|
| 28 |
+
"│ from SafeTensor │ │ images, model type, and architecture. │\n",
|
| 29 |
+
"│ Files │ │ │\n",
|
| 30 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 31 |
+
" ↓\n",
|
| 32 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 33 |
+
"│ Store Parsed │ --> │ Save extracted metadata into structured JSON │\n",
|
| 34 |
+
"│ Metadata as JSON │ │ files for later analysis. │\n",
|
| 35 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 36 |
+
" ↓\n",
|
| 37 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 38 |
+
"│ Process JSON Files │ --> │ Read saved JSON metadata, extract relevant │\n",
|
| 39 |
+
"│ for Consolidation │ │ details, and filter necessary attributes. │\n",
|
| 40 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 41 |
+
" ↓\n",
|
| 42 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 43 |
+
"│ Extract Training │ --> │ Identify most frequent training tags, architectures│\n",
|
| 44 |
+
"│ Tags & Model Info │ │ and systems used for model creation. │\n",
|
| 45 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 46 |
+
" ↓\n",
|
| 47 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 48 |
+
"│ Save Consolidated │ --> │ Store all processed metadata in a structured CSV │\n",
|
| 49 |
+
"│ Metadata to CSV │ │ format for final analysis. │\n",
|
| 50 |
+
"└──────────────────────┘ └───────────────────────────────────────────────────┘\n"
|
| 51 |
+
]
|
| 52 |
+
},
|
| 53 |
+
{
|
| 54 |
+
"cell_type": "code",
|
| 55 |
+
"execution_count": null,
|
| 56 |
+
"id": "efc9939d",
|
| 57 |
+
"metadata": {},
|
| 58 |
+
"outputs": [],
|
| 59 |
+
"source": [
|
| 60 |
+
"import os\n",
|
| 61 |
+
"import re\n",
|
| 62 |
+
"import json\n",
|
| 63 |
+
"import csv\n",
|
| 64 |
+
"import struct\n",
|
| 65 |
+
"import requests\n",
|
| 66 |
+
"from pathlib import Path\n",
|
| 67 |
+
"import pandas as pd\n",
|
| 68 |
+
"from collections import Counter\n",
|
| 69 |
+
"from concurrent.futures import ProcessPoolExecutor\n",
|
| 71 |
+
"import matplotlib.pyplot as plt\n",
|
| 72 |
+
"from matplotlib.font_manager import FontProperties\n",
|
| 73 |
+
"from matplotlib import font_manager\n",
|
| 77 |
+
"\n",
|
| 78 |
+
"# Define the current directory and important file paths\n",
|
| 79 |
+
"current_dir = Path.cwd()\n",
|
| 80 |
+
"\n",
|
| 81 |
+
"# Define frequently used directories\n",
|
| 82 |
+
"\n",
|
| 83 |
+
"data_dir = current_dir.parent / 'data/csv'\n",
|
| 84 |
+
"fonts_dir = current_dir.parent / 'misc/assets/fonts'\n",
|
| 85 |
+
"plots_dir = current_dir.parent / 'results/plots'\n",
|
| 86 |
+
"raw_data_dir = current_dir.parent / 'data/adapter_metadata/lora' ### location of the LoRA metadata (JSON)\n",
|
| 87 |
+
"temp_dir = current_dir.parent / 'data/raw/adapters_safetensors'\n",
|
| 88 |
+
"misc_dir = current_dir.parent / 'misc'\n",
|
| 89 |
+
"\n",
|
| 90 |
+
"# File paths\n",
|
| 91 |
+
"adapters_csv = current_dir.parent / 'data/csv/adapters.csv'\n",
|
| 92 |
+
"output_json_dir = raw_data_dir\n",
|
| 93 |
+
"api_keys_file = misc_dir / 'credentials/civit.txt'\n",
|
| 94 |
+
"\n",
|
| 95 |
+
"# Ensure directories exist\n",
|
| 96 |
+
"os.makedirs(output_json_dir, exist_ok=True)\n",
|
| 97 |
+
"os.makedirs(temp_dir, exist_ok=True)\n",
|
| 98 |
+
"\n",
|
| 99 |
+
"\n",
|
| 100 |
+
"# Load fonts into Matplotlib\n",
|
| 101 |
+
"for font_path in sorted(fonts_dir.glob('*.ttf')):  # assumption: the fonts are .ttf files in fonts_dir\n",
|
| 102 |
+
"    font_manager.fontManager.addfont(str(font_path))\n",
|
| 103 |
+
"\n",
|
| 104 |
+
"# Set default font family for plots\n",
|
| 105 |
+
"plt.rcParams['font.family'] = ['Noto Sans JP', 'Noto Sans SC', 'sans-serif']\n",
|
| 106 |
+
"\n",
|
| 107 |
+
"print('Paths and fonts initialized successfully.')"
|
| 110 |
+
]
|
| 111 |
+
},
|
| 112 |
+
{
|
| 113 |
+
"cell_type": "markdown",
|
| 114 |
+
"id": "87a58593",
|
| 115 |
+
"metadata": {},
|
| 116 |
+
"source": [
|
| 117 |
+
"## Step 2: Download LoRA and extract *.safetensors metadata\n",
|
| 118 |
+
"This script downloads LoRA adapters from the filtered Civiverse-Models dataset and extracts the metadata stored in the header of each *.safetensors file."
|
| 119 |
+
]
|
| 120 |
+
},
|
| 121 |
+
{
|
| 122 |
+
"cell_type": "code",
|
| 123 |
+
"execution_count": null,
|
| 124 |
+
"id": "abd3a0bc",
|
| 125 |
+
"metadata": {},
|
| 126 |
+
"outputs": [],
|
| 127 |
+
"source": [
|
| 128 |
+
"import os\n",
|
| 129 |
+
"import sys\n",
|
| 130 |
+
"import csv\n",
|
| 131 |
+
"import json\n",
|
| 132 |
+
"import struct\n",
|
| 133 |
+
"import time\n",
|
| 134 |
+
"import requests\n",
|
| 135 |
+
"import signal\n",
|
| 136 |
+
"import contextlib\n",
|
| 137 |
+
"from pathlib import Path\n",
|
| 138 |
+
"import re\n",
|
| 139 |
+
"\n",
|
| 140 |
+
"# === Paste your API keys here ===\n",
|
| 141 |
+
"API_KEYS = [\n",
|
| 142 |
+
" \"399c06ea6d1b7349556a115376ec346b\", #DISCORD\n",
|
| 143 |
+
" \"213be9d373130f86e394c6fea4d75162\", #ASDD 1\n",
|
| 144 |
+
" \"4f180c0c56334b74394b467c5e5b8201\", #ASDD 2\n",
|
| 145 |
+
" \"bdfba7ac53290f66bc76130f25b74336\", #BSDD \n",
|
| 146 |
+
" \"43294f4a27b388624a896db5a65f445a\"\n",
|
| 147 |
+
"]\n",
|
| 148 |
+
"if not API_KEYS or any(not isinstance(k, str) or not k.strip() for k in API_KEYS):\n",
|
| 149 |
+
" raise ValueError(\"Please paste at least one valid API key into API_KEYS.\")\n",
|
| 150 |
+
"\n",
|
| 151 |
+
"# === Config (adjust paths as needed) ===\n",
|
| 152 |
+
"current_dir = Path.cwd()\n",
|
| 153 |
+
"output_json_dir = current_dir.parent / \"data/adapter_metadata/lora\" # where JSON outputs go\n",
|
| 154 |
+
"temp_dir = current_dir.parent / \"data/raw/adapters_safetensors\" # where downloads go\n",
|
| 155 |
+
"csv_path = current_dir.parent / \"data/csv/adapters_poi_false_sfw.csv\"\n",
|
| 156 |
+
"\n",
|
| 157 |
+
"os.makedirs(output_json_dir, exist_ok=True)\n",
|
| 158 |
+
"os.makedirs(temp_dir, exist_ok=True)\n",
|
| 159 |
+
"\n",
|
| 160 |
+
"# === API key state ===\n",
|
| 161 |
+
"current_key_index = 0\n",
|
| 162 |
+
"\n",
|
| 163 |
+
"\n",
|
| 164 |
+
"\n",
|
| 165 |
+
"def safe_filename(name: str, max_length: int = 100) -> str:\n",
|
| 166 |
+
" # Replace unsafe chars\n",
|
| 167 |
+
" sanitized = re.sub(r'[^a-zA-Z0-9_\\-]', '_', name)\n",
|
| 168 |
+
" # Truncate if too long\n",
|
| 169 |
+
" if len(sanitized) > max_length:\n",
|
| 170 |
+
" sanitized = sanitized[:max_length]\n",
|
| 171 |
+
" return sanitized\n",
|
| 172 |
+
"\n",
|
| 173 |
+
"\n",
|
| 174 |
+
"def get_headers():\n",
|
| 175 |
+
" global current_key_index\n",
|
| 176 |
+
" return {\n",
|
| 177 |
+
" \"Accept\": \"application/json\",\n",
|
| 178 |
+
" \"Authorization\": f\"Bearer {API_KEYS[current_key_index].strip()}\"\n",
|
| 179 |
+
" }\n",
|
| 180 |
+
"\n",
|
| 181 |
+
"def rotate_api_key():\n",
|
| 182 |
+
" global current_key_index\n",
|
| 183 |
+
" if current_key_index < len(API_KEYS) - 1:\n",
|
| 184 |
+
" current_key_index += 1\n",
|
| 185 |
+
" print(f\"🔁 Rotated to API key #{current_key_index + 1}\")\n",
|
| 186 |
+
" else:\n",
|
| 187 |
+
" raise Exception(\"All API keys have been exhausted.\")\n",
|
| 188 |
+
"\n",
|
| 189 |
+
"# === Utilities ===\n",
|
| 190 |
+
"def save_json(data, filename):\n",
|
| 191 |
+
" with open(filename, 'w', encoding=\"utf-8\") as f:\n",
|
| 192 |
+
" json.dump(data, f, indent=4, ensure_ascii=False)\n",
|
| 193 |
+
"\n",
|
| 194 |
+
"def parse_safetensors(file_path):\n",
|
| 195 |
+
" # Minimal, tolerant metadata reader; returns {} on failure.\n",
|
| 196 |
+
" try:\n",
|
| 197 |
+
" with open(file_path, 'rb') as f:\n",
|
| 198 |
+
" file_data = f.read()\n",
|
| 199 |
+
"            # A safetensors file starts with an 8-byte little-endian unsigned integer\n",
|
| 200 |
+
"            # giving the byte length of the JSON header that follows it.\n",
|
| 201 |
+
"            metadata_size = struct.unpack('<Q', file_data[:8])[0]\n",
|
| 202 |
+
" metadata_bytes = file_data[8:8 + metadata_size]\n",
|
| 203 |
+
" metadata_str = metadata_bytes.decode('utf-8', errors='replace')\n",
|
| 204 |
+
" metadata = json.loads(metadata_str)\n",
|
| 205 |
+
" return metadata.get('__metadata__', {})\n",
|
| 206 |
+
" except Exception as e:\n",
|
| 207 |
+
" print(f\"Error parsing safetensors file: {e}\")\n",
|
| 208 |
+
" return {}\n",
|
| 209 |
+
"\n",
|
| 210 |
+
"# === Timeout context ===\n",
|
| 211 |
+
"class TimeoutException(Exception):\n",
|
| 212 |
+
" pass\n",
|
| 213 |
+
"\n",
|
| 214 |
+
"@contextlib.contextmanager\n",
|
| 215 |
+
"def time_limit(seconds):\n",
|
| 216 |
+
" def signal_handler(signum, frame):\n",
|
| 217 |
+
" raise TimeoutException(f\"Timed out after {seconds} seconds\")\n",
|
| 218 |
+
" # Note: SIGALRM works on Unix-like OS; on Windows this will be a no-op.\n",
|
| 219 |
+
" try:\n",
|
| 220 |
+
" signal.signal(signal.SIGALRM, signal_handler)\n",
|
| 221 |
+
" signal.alarm(seconds)\n",
|
| 222 |
+
" except Exception:\n",
|
| 223 |
+
" # Fallback: no hard alarm on non-Unix systems\n",
|
| 224 |
+
" pass\n",
|
| 225 |
+
" try:\n",
|
| 226 |
+
" yield\n",
|
| 227 |
+
" finally:\n",
|
| 228 |
+
" try:\n",
|
| 229 |
+
" signal.alarm(0)\n",
|
| 230 |
+
" except Exception:\n",
|
| 231 |
+
" pass\n",
|
| 232 |
+
"\n",
|
| 233 |
+
"# === Download with timeout, retries, backoff, and key rotation ===\n",
|
| 234 |
+
"def download_file(url, output_folder, timeout=30, overall_timeout=120, max_retries=3):\n",
|
| 235 |
+
" filename = url.split(\"/\")[-1]\n",
|
| 236 |
+
" output_path = os.path.join(output_folder, filename)\n",
|
| 237 |
+
"\n",
|
| 238 |
+
" global current_key_index\n",
|
| 239 |
+
" retries = 0\n",
|
| 240 |
+
" backoff = 2\n",
|
| 241 |
+
"\n",
|
| 242 |
+
" while current_key_index < len(API_KEYS):\n",
|
| 243 |
+
" try:\n",
|
| 244 |
+
" with time_limit(overall_timeout): # global cap per download\n",
|
| 245 |
+
" #print(f\"➡️ GET {url} using key #{current_key_index + 1}\")\n",
|
| 246 |
+
" resp = requests.get(\n",
|
| 247 |
+
" url,\n",
|
| 248 |
+
" headers=get_headers(),\n",
|
| 249 |
+
" stream=True,\n",
|
| 250 |
+
" timeout=(10, timeout), # (connect timeout, per-chunk read timeout)\n",
|
| 251 |
+
" )\n",
|
| 252 |
+
"\n",
|
| 253 |
+
" # Auth errors → rotate key\n",
|
| 254 |
+
" if resp.status_code in (401, 403):\n",
|
| 255 |
+
" print(f\"❌ Auth {resp.status_code} with key #{current_key_index + 1}. Rotating.\")\n",
|
| 256 |
+
" rotate_api_key()\n",
|
| 257 |
+
" retries = 0\n",
|
| 258 |
+
" backoff = 2\n",
|
| 259 |
+
" continue\n",
|
| 260 |
+
"\n",
|
| 261 |
+
" # Not found → bubble up as FileNotFoundError (do not rotate)\n",
|
| 262 |
+
" if resp.status_code == 404:\n",
|
| 263 |
+
" raise FileNotFoundError(f\"Model not found at {url}\")\n",
|
| 264 |
+
"\n",
|
| 265 |
+
" # Rate limit → either rotate or wait/backoff\n",
|
| 266 |
+
" if resp.status_code == 429:\n",
|
| 267 |
+
" print(\"⏳ Rate limited (429).\", end=\" \")\n",
|
| 268 |
+
" if current_key_index < len(API_KEYS) - 1:\n",
|
| 269 |
+
" print(\"Rotating key.\")\n",
|
| 270 |
+
" rotate_api_key()\n",
|
| 271 |
+
" retries = 0\n",
|
| 272 |
+
" backoff = 2\n",
|
| 273 |
+
" continue\n",
|
| 274 |
+
" else:\n",
|
| 275 |
+
" print(f\"Waiting {backoff}s (no other keys).\")\n",
|
| 276 |
+
" time.sleep(backoff)\n",
|
| 277 |
+
" backoff = min(backoff * 2, 60)\n",
|
| 278 |
+
" continue\n",
|
| 279 |
+
"\n",
|
| 280 |
+
" # Other HTTP errors → raise to RequestException path\n",
|
| 281 |
+
" resp.raise_for_status()\n",
|
| 282 |
+
"\n",
|
| 283 |
+
" # Save file\n",
|
| 284 |
+
" with open(output_path, 'wb') as fh:\n",
|
| 285 |
+
" for chunk in resp.iter_content(chunk_size=8192):\n",
|
| 286 |
+
" if chunk:\n",
|
| 287 |
+
" fh.write(chunk)\n",
|
| 288 |
+
"\n",
|
| 289 |
+
" return output_path, filename\n",
|
| 290 |
+
"\n",
|
| 291 |
+
" except TimeoutException as e:\n",
|
| 292 |
+
" # Hard overall timeout → propagate\n",
|
| 293 |
+
" raise e\n",
|
| 294 |
+
" except requests.exceptions.RequestException as e:\n",
|
| 295 |
+
" # Network-ish errors: retry same key with backoff up to max_retries\n",
|
| 296 |
+
" retries += 1\n",
|
| 297 |
+
" if retries <= max_retries:\n",
|
| 298 |
+
" print(f\"🌐 Network error (try {retries}/{max_retries}) with key #{current_key_index + 1}: {e}\")\n",
|
| 299 |
+
" time.sleep(backoff)\n",
|
| 300 |
+
" backoff = min(backoff * 2, 60)\n",
|
| 301 |
+
" continue\n",
|
| 302 |
+
" else:\n",
|
| 303 |
+
" raise Exception(f\"Failed to download {url} after {max_retries} retries: {e}\")\n",
|
| 304 |
+
"\n",
|
| 305 |
+
" # If we exit the loop, we truly ran out\n",
|
| 306 |
+
" raise Exception(\"All API keys have been exhausted or failed.\")\n",
|
| 307 |
+
"\n",
|
| 308 |
+
"# === Main processing ===\n",
|
| 309 |
+
"def process_csv(csv_path):\n",
|
| 310 |
+
" with open(csv_path, newline='', encoding='utf-8') as csvfile:\n",
|
| 311 |
+
" reader = csv.DictReader(csvfile)\n",
|
| 312 |
+
" for index, row in enumerate(reader):\n",
|
| 313 |
+
"            # Collect up to 20 version IDs; the highest ID is treated as the most recent\n",
|
| 314 |
+
" version_ids = []\n",
|
| 315 |
+
" for i in range(1, 21):\n",
|
| 316 |
+
" k = f'version_id_{i}'\n",
|
| 317 |
+
" if k in row and row[k]:\n",
|
| 318 |
+
" try:\n",
|
| 319 |
+
" version_ids.append(int(float(row[k])))\n",
|
| 320 |
+
" except ValueError:\n",
|
| 321 |
+
" print(f\"Invalid version_id value '{row[k]}' in row: {row}\")\n",
|
| 322 |
+
"\n",
|
| 323 |
+
" if not version_ids:\n",
|
| 324 |
+
" print(f\"No valid version IDs found in row: {row}\")\n",
|
| 325 |
+
" continue\n",
|
| 326 |
+
"\n",
|
| 327 |
+
" most_recent_version_id = str(max(version_ids))\n",
|
| 328 |
+
" name = row.get('name', 'unknown')\n",
|
| 329 |
+
" sanitized_name = safe_filename(name, max_length=100)\n",
|
| 330 |
+
" new_json_file = os.path.join(\n",
|
| 331 |
+
" output_json_dir,\n",
|
| 332 |
+
" f\"{index:08d}_{most_recent_version_id}_{sanitized_name}.json\"\n",
|
| 333 |
+
" )\n",
|
| 334 |
+
"\n",
|
| 335 |
+
" # Skip if JSON already exists\n",
|
| 336 |
+
" if os.path.exists(new_json_file):\n",
|
| 337 |
+
" #print(f\"↩️ Skipping versionID {most_recent_version_id} (JSON already exists)\")\n",
|
| 338 |
+
" continue\n",
|
| 339 |
+
"\n",
|
| 340 |
+
" try:\n",
|
| 341 |
+
" adapter_file, fname = download_file(\n",
|
| 342 |
+
" row['downloadUrl'], str(temp_dir),\n",
|
| 343 |
+
" timeout=30, overall_timeout=180\n",
|
| 344 |
+
" )\n",
|
| 345 |
+
" metadata = parse_safetensors(adapter_file)\n",
|
| 346 |
+
"\n",
|
| 347 |
+
" civitaidata = {\n",
|
| 348 |
+
" k: (int(v) if str(v).isdigit() else v)\n",
|
| 349 |
+
" for k, v in row.items()\n",
|
| 350 |
+
" }\n",
|
| 351 |
+
" new_json_data = {\n",
|
| 352 |
+
" \"civitaidata\": civitaidata,\n",
|
| 353 |
+
" \"metadata\": metadata,\n",
|
| 354 |
+
" \"versionID\": most_recent_version_id\n",
|
| 355 |
+
" }\n",
|
| 356 |
+
" save_json(new_json_data, new_json_file)\n",
|
| 357 |
+
" #print(f\"✅ Created JSON for versionID {most_recent_version_id} with file {fname}\")\n",
|
| 358 |
+
"\n",
|
| 359 |
+
" except FileNotFoundError as e:\n",
|
| 360 |
+
" print(f\"⚠️ {e} — saving empty metadata.\")\n",
|
| 361 |
+
" civitaidata = {\n",
|
| 362 |
+
" k: (int(v) if str(v).isdigit() else v)\n",
|
| 363 |
+
" for k, v in row.items()\n",
|
| 364 |
+
" }\n",
|
| 365 |
+
" empty_json = {\n",
|
| 366 |
+
" \"civitaidata\": civitaidata,\n",
|
| 367 |
+
" \"metadata\": {},\n",
|
| 368 |
+
" \"versionID\": most_recent_version_id,\n",
|
| 369 |
+
" \"error\": \"Model not found (404)\"\n",
|
| 370 |
+
" }\n",
|
| 371 |
+
" save_json(empty_json, new_json_file)\n",
|
| 372 |
+
" except Exception as e:\n",
|
| 373 |
+
" print(f\"⚠️ Error processing versionID {most_recent_version_id}: {e}\")\n",
|
| 374 |
+
" civitaidata = {\n",
|
| 375 |
+
" k: (int(v) if str(v).isdigit() else v)\n",
|
| 376 |
+
" for k, v in row.items()\n",
|
| 377 |
+
" }\n",
|
| 378 |
+
" empty_json = {\n",
|
| 379 |
+
" \"civitaidata\": civitaidata,\n",
|
| 380 |
+
" \"metadata\": {},\n",
|
| 381 |
+
" \"versionID\": most_recent_version_id,\n",
|
| 382 |
+
" \"error\": str(e)\n",
|
| 383 |
+
" }\n",
|
| 384 |
+
" save_json(empty_json, new_json_file)\n",
|
| 385 |
+
" print(f\"💾 Saved empty JSON for versionID {most_recent_version_id} due to failure.\")\n",
|
| 386 |
+
"\n",
|
| 387 |
+
"# === Run ===\n",
|
| 388 |
+
"if __name__ == \"__main__\":\n",
|
| 389 |
+
" process_csv(csv_path)\n"
|
| 390 |
+
]
|
| 391 |
+
}
|
| 392 |
+
],
|
| 393 |
+
"metadata": {
|
| 394 |
+
"language_info": {
|
| 395 |
+
"name": "python"
|
| 396 |
+
}
|
| 397 |
+
},
|
| 398 |
+
"nbformat": 4,
|
| 399 |
+
"nbformat_minor": 5
|
| 400 |
+
}
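The metadata extraction step in the notebook above reads the whole adapter file into memory before slicing out the header. As a hedged alternative, the sketch below reads only the header. It relies on the safetensors layout of an 8-byte little-endian length followed by a JSON header; the function name `read_safetensors_metadata` and the example key `ss_network_dim` are illustrative assumptions, not taken from the notebook.

```python
import json
import struct


def read_safetensors_metadata(file_path):
    """Return the '__metadata__' dict of a safetensors header, or {} if unavailable."""
    with open(file_path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        length_bytes = f.read(8)
        if len(length_bytes) < 8:
            return {}
        (header_len,) = struct.unpack("<Q", length_bytes)
        # Read only the header; the tensor data that follows is never loaded.
        header_bytes = f.read(header_len)
    if len(header_bytes) < header_len:
        return {}  # truncated or non-safetensors file
    try:
        header = json.loads(header_bytes.decode("utf-8"))
    except ValueError:
        # Covers both invalid UTF-8 and invalid JSON.
        return {}
    if not isinstance(header, dict):
        return {}
    return header.get("__metadata__", {})
```

Because only the first few kilobytes are read, this keeps memory use low even for large adapters, and a truncated download or an error page saved under a `.safetensors` name returns an empty dict instead of raising.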
|
jupyter_notebooks/0_Scraping_image_metadata.ipynb
ADDED
|
@@ -0,0 +1,1345 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "1111ea95-d385-49b9-a4d9-ef886ace5c7a",
|
| 6 |
+
"metadata": {
|
| 7 |
+
"execution": {
|
| 8 |
+
"iopub.execute_input": "2025-02-06T11:24:25.566747Z",
|
| 9 |
+
"iopub.status.busy": "2025-02-06T11:24:25.566066Z",
|
| 10 |
+
"iopub.status.idle": "2025-02-06T11:24:25.571748Z",
|
| 11 |
+
"shell.execute_reply": "2025-02-06T11:24:25.571305Z",
|
| 12 |
+
"shell.execute_reply.started": "2025-02-06T11:24:25.566705Z"
|
| 13 |
+
}
|
| 14 |
+
},
|
| 15 |
+
"source": [
|
| 16 |
+
"# 0 Scraping Metadata and Dataset consolidation\n"
|
| 17 |
+
]
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"cell_type": "markdown",
|
| 21 |
+
"id": "6632505a-e7ca-4463-9ffc-e36fad42235f",
|
| 22 |
+
"metadata": {},
|
| 23 |
+
"source": [
|
| 24 |
+
"## IMAGES\n",
|
| 25 |
+
"---"
|
| 26 |
+
]
|
| 27 |
+
},
|
| 28 |
+
{
|
| 29 |
+
"cell_type": "markdown",
|
| 30 |
+
"id": "e3388bac-bb71-40bc-a693-9ac7a2d5f32c",
|
| 31 |
+
"metadata": {
|
| 32 |
+
"execution": {
|
| 33 |
+
"iopub.execute_input": "2025-02-06T10:08:22.229784Z",
|
| 34 |
+
"iopub.status.busy": "2025-02-06T10:08:22.229287Z",
|
| 35 |
+
"iopub.status.idle": "2025-02-06T10:08:22.232210Z",
|
| 36 |
+
"shell.execute_reply": "2025-02-06T10:08:22.231793Z",
|
| 37 |
+
"shell.execute_reply.started": "2025-02-06T10:08:22.229766Z"
|
| 38 |
+
}
|
| 39 |
+
},
|
| 40 |
+
"source": [
|
| 41 |
+
"### Step 1: Image metadata scraping, sorting and CSV consolidation"
|
| 42 |
+
]
|
| 43 |
+
},
|
| 44 |
+
{
|
| 45 |
+
"cell_type": "raw",
|
| 46 |
+
"id": "8b5d623a-a647-42ae-9951-efa99048cdf3",
|
| 47 |
+
"metadata": {},
|
| 48 |
+
"source": [
|
| 49 |
+
"Dataset creation workflow\n",
|
| 50 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 51 |
+
"│ Scrape Data from API │ --> │ Use CivitAI API with a timestamp-based cursor to │\n",
|
| 52 |
+
"│ Paginate Using Cursor│ │ scrape image metadata and save in JSON batches. │\n",
|
| 53 |
+
"│ Save in JSON Format │ └───────────────────────────────────────────────────┘\n",
|
| 54 |
+
"└─────────┬────────────┘\n",
|
| 55 |
+
" ↓\n",
|
| 56 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 57 |
+
"│ Locate JSON Files │ --> │ Find all JSON files saved in the directory. │\n",
|
| 58 |
+
"│ in Directory │ │ │\n",
|
| 59 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 60 |
+
" ↓\n",
|
| 61 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 62 |
+
"│ Read Each JSON File │ --> │ Parse JSON files and extract metadata, reactions, │\n",
|
| 63 |
+
"│ Parse and Extract │ │ and resource details. │\n",
|
| 64 |
+
"│ Metadata, Reactions, │ │ │\n",
|
| 65 |
+
"│ and Resources │ │ │\n",
|
| 66 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 67 |
+
" ↓\n",
|
| 68 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 69 |
+
"│ Sort Items by │ --> │ Sort the extracted items chronologically using │\n",
|
| 70 |
+
"│ createdAt Timestamp │ │ their createdAt timestamp. │\n",
|
| 71 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 72 |
+
" ↓\n",
|
| 73 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 74 |
+
"│ Write Extracted Data │ --> │ Save the processed and sorted data into a │\n",
|
| 75 |
+
"│ to a Consolidated │ │ consolidated CSV file with structured columns. │\n",
|
| 76 |
+
"│ CSV File │ │ │\n",
|
| 77 |
+
"└──────────────────────┘ └───────────────────────────────────────────────────┘\n"
|
| 78 |
+
]
|
| 79 |
+
},
|
| 80 |
+
{
|
| 81 |
+
"cell_type": "code",
|
| 82 |
+
"execution_count": 1,
|
| 83 |
+
"id": "f8decb63-43f5-4731-823d-94632eee7618",
|
| 84 |
+
"metadata": {
|
| 85 |
+
"execution": {
|
| 86 |
+
"iopub.execute_input": "2025-02-08T19:37:51.115763Z",
|
| 87 |
+
"iopub.status.busy": "2025-02-08T19:37:51.114573Z",
|
| 88 |
+
"iopub.status.idle": "2025-02-08T19:37:51.170027Z",
|
| 89 |
+
"shell.execute_reply": "2025-02-08T19:37:51.169401Z",
|
| 90 |
+
"shell.execute_reply.started": "2025-02-08T19:37:51.115738Z"
|
| 91 |
+
}
|
| 92 |
+
},
|
| 93 |
+
"outputs": [],
|
| 94 |
+
"source": [
|
| 95 |
+
"import os\n",
|
| 96 |
+
"import json\n",
|
| 97 |
+
"import csv\n",
|
| 98 |
+
"import requests\n",
|
| 99 |
+
"from datetime import datetime\n",
|
| 100 |
+
"import time\n",
|
| 101 |
+
"from pathlib import Path\n",
|
| 102 |
+
"import hashlib\n",
|
| 103 |
+
"import pandas as pd\n",
|
| 104 |
+
"import sys\n",
|
| 105 |
+
"from datetime import datetime, timedelta\n",
|
| 106 |
+
"import shutil"
|
| 107 |
+
]
|
| 108 |
+
},
|
| 109 |
+
{
|
| 110 |
+
"cell_type": "code",
|
| 111 |
+
"execution_count": 2,
|
| 112 |
+
"id": "4b2426c3-96a0-468e-b6dc-78dea9c3e92b",
|
| 113 |
+
"metadata": {
|
| 114 |
+
"execution": {
|
| 115 |
+
"iopub.execute_input": "2025-02-08T19:37:51.809027Z",
|
| 116 |
+
"iopub.status.busy": "2025-02-08T19:37:51.808835Z",
|
| 117 |
+
"iopub.status.idle": "2025-02-08T19:37:51.812922Z",
|
| 118 |
+
"shell.execute_reply": "2025-02-08T19:37:51.812429Z",
|
| 119 |
+
"shell.execute_reply.started": "2025-02-08T19:37:51.809009Z"
|
| 120 |
+
}
|
| 121 |
+
},
|
| 122 |
+
"outputs": [],
|
| 123 |
+
"source": [
|
| 124 |
+
"current_dir = Path.cwd()"
|
| 125 |
+
]
|
| 126 |
+
},
|
| 127 |
+
{
|
| 128 |
+
"cell_type": "code",
|
| 129 |
+
"execution_count": 3,
|
| 130 |
+
"id": "dd0daa43-d988-4ac5-b9c7-2f9c65078325",
|
| 131 |
+
"metadata": {
|
| 132 |
+
"execution": {
|
| 133 |
+
"iopub.execute_input": "2025-02-07T20:59:30.413939Z",
|
| 134 |
+
"iopub.status.busy": "2025-02-07T20:59:30.412861Z",
|
| 135 |
+
"iopub.status.idle": "2025-02-07T20:59:30.418660Z",
|
| 136 |
+
"shell.execute_reply": "2025-02-07T20:59:30.418054Z",
|
| 137 |
+
"shell.execute_reply.started": "2025-02-07T20:59:30.413918Z"
|
| 138 |
+
}
|
| 139 |
+
},
|
| 140 |
+
"outputs": [],
|
| 141 |
+
"source": [
|
| 142 |
+
"# Define the input timestamp in ISO 8601 format\n",
|
| 143 |
+
"input_timestamp = \"2025-03-24T12:59:03.335Z\" # Point in time from which metadata is obtained; to cover a longer timespan, copy the timestamp from the last *.json batch obtained and run again\n",
|
| 144 |
+
"\n",
|
| 145 |
+
"# Function to convert an ISO 8601 date string to a Unix timestamp in milliseconds with a 2-hour offset \n",
|
| 146 |
+
"def iso_to_timestamp(iso_str):\n",
|
| 147 |
+
" # Parse the ISO 8601 date string (including milliseconds and 'Z' indicating UTC)\n",
|
| 148 |
+
" date = datetime.strptime(iso_str, \"%Y-%m-%dT%H:%M:%S.%fZ\") \n",
|
| 149 |
+
" # Add 2 hours to the parsed date\n",
|
| 150 |
+
" adjusted_date = date + timedelta(hours=2)\n",
|
| 151 |
+
" # Convert to Unix timestamp in milliseconds; the datetime is naive, so timestamp() interprets it in the machine's local time zone\n",
|
| 152 |
+
" timestamp = int(adjusted_date.timestamp() * 1000)\n",
|
| 153 |
+
" return timestamp\n",
|
| 154 |
+
"\n",
|
| 155 |
+
"# Convert the input timestamp to an initial cursor\n",
|
| 156 |
+
"initial_cursor = iso_to_timestamp(input_timestamp)\n"
|
| 157 |
+
]
|
| 158 |
+
},
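The conversion above parses the `Z`-suffixed string into a naive `datetime`, so the resulting cursor depends on the time zone of the machine running the notebook. A minimal sketch of a time-zone-independent variant follows. The name `iso_to_timestamp_utc` and the `offset_hours` parameter are hypothetical and not part of the notebook; whether the API expects the 2-hour shift is not confirmed here, so the offset defaults to 0.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def iso_to_timestamp_utc(iso_str, offset_hours=0):
    """Convert an ISO 8601 'Z' string to Unix milliseconds, treating it as UTC.

    offset_hours is a hypothetical knob that reproduces a fixed shift
    (the notebook uses 2) without relying on the local time zone.
    """
    parsed = datetime.strptime(iso_str, "%Y-%m-%dT%H:%M:%S.%fZ")
    aware = parsed.replace(tzinfo=timezone.utc) + timedelta(hours=offset_hours)
    # Integer division of timedeltas avoids float rounding.
    return (aware - EPOCH) // timedelta(milliseconds=1)

print(iso_to_timestamp_utc("1970-01-01T00:00:01.500Z"))  # → 1500
```

Passing `offset_hours=2` gives the same fixed shift as the notebook's function, independent of where the code runs.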
|
| 159 |
+
{
|
| 160 |
+
"cell_type": "code",
|
| 161 |
+
"execution_count": 4,
|
| 162 |
+
"id": "20b4bc72-80c4-4af6-bc76-aeecfa65f9dc",
|
| 163 |
+
"metadata": {
|
| 164 |
+
"execution": {
|
| 165 |
+
"iopub.execute_input": "2025-02-07T20:59:31.134620Z",
|
| 166 |
+
"iopub.status.busy": "2025-02-07T20:59:31.133548Z",
|
| 167 |
+
"iopub.status.idle": "2025-02-07T20:59:31.138081Z",
|
| 168 |
+
"shell.execute_reply": "2025-02-07T20:59:31.137555Z",
|
| 169 |
+
"shell.execute_reply.started": "2025-02-07T20:59:31.134599Z"
|
| 170 |
+
}
|
| 171 |
+
},
|
| 172 |
+
"outputs": [],
|
| 173 |
+
"source": [
|
| 174 |
+
"data_raw = current_dir.parent / f\"data/raw/image_metadata/001/{input_timestamp.replace(':', '').replace('T', '_').replace('.', '_').replace('Z', '')}\""
|
| 175 |
+
]
|
| 176 |
+
},
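The folder name is derived from the input timestamp by a chain of `replace()` calls. The sketch below isolates that chain in a helper so the resulting name is easy to check; `timestamp_to_folder_name` is a hypothetical name introduced here for illustration only.

```python
def timestamp_to_folder_name(iso_str):
    """Mirror the chained replace() calls used to build the raw-data folder name."""
    return (
        iso_str.replace(":", "")   # drop the colons in the time part
        .replace("T", "_")         # separate date and time with an underscore
        .replace(".", "_")         # separate seconds and milliseconds
        .replace("Z", "")          # drop the UTC marker
    )

print(timestamp_to_folder_name("2025-03-24T12:59:03.335Z"))  # → 2025-03-24_125903_335
```

The result has the shape `YYYY-MM-DD_HHMMSS_mmm`, which matches the `2025-02-01_000000_000` folder that appears in the recorded cell output further down.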
|
| 177 |
+
{
|
| 178 |
+
"cell_type": "markdown",
|
| 179 |
+
"id": "f69c921d-ca57-4746-8124-a0fa4fce80ff",
|
| 180 |
+
"metadata": {},
|
| 181 |
+
"source": [
|
| 182 |
+
"#### The *.json files will be saved in the following structure:"
|
| 183 |
+
]
|
| 184 |
+
},
|
| 185 |
+
{
|
| 186 |
+
"cell_type": "raw",
|
| 187 |
+
"id": "d731001e-32a4-4894-a2fe-835ea69214fc",
|
| 188 |
+
"metadata": {},
|
| 189 |
+
"source": [
|
| 190 |
+
"data/\n",
|
| 191 |
+
"└── raw/\n",
|
| 192 |
+
" └── image_metadata\n",
|
| 193 |
+
" └── 001/\n",
|
| 194 |
+
" └── YYYY-MM-DD_HHMMSS_mmm/ # Folder named based on the input timestamp\n",
|
| 195 |
+
" ├── most_recent_1.json # Sequential JSON files\n",
|
| 196 |
+
" ├── most_recent_2.json\n",
|
| 197 |
+
" ├── ...\n",
|
| 198 |
+
" ├── most_recent_50000.json\n",
|
| 199 |
+
" └── YYYY-MM-DD_HHMMSS_mmm_session_TIMESTAMP/ # New folder after API limit is reached\n",
|
| 200 |
+
" ├── most_recent_50001.json\n",
|
| 201 |
+
" ├── most_recent_50002.json\n",
|
| 202 |
+
" ├── ...\n"
|
| 203 |
+
]
|
| 204 |
+
},
|
| 205 |
+
{
|
| 206 |
+
"cell_type": "code",
|
| 207 |
+
"execution_count": 5,
|
| 208 |
+
"id": "4cb3c005-bf70-43fc-afe8-93bf30872fb4",
|
| 209 |
+
"metadata": {
|
| 210 |
+
"execution": {
|
| 211 |
+
"iopub.execute_input": "2025-02-06T16:51:57.402732Z",
|
| 212 |
+
"iopub.status.busy": "2025-02-06T16:51:57.402532Z",
|
| 213 |
+
"iopub.status.idle": "2025-02-06T16:52:00.855604Z",
|
| 214 |
+
"shell.execute_reply": "2025-02-06T16:52:00.854678Z",
|
| 215 |
+
"shell.execute_reply.started": "2025-02-06T16:51:57.402712Z"
|
| 216 |
+
}
|
| 217 |
+
},
|
| 218 |
+
"outputs": [],
|
| 219 |
+
"source": [
|
| 220 |
+
"max_images = 50000 # Threshold (compared against the batch-file counter below) for starting a new session folder; now largely redundant because the CivitAI API currently limits free API usage to 50'000\n",
|
| 221 |
+
"\n",
|
| 222 |
+
"def get_image_metadata():\n",
|
| 223 |
+
" base_url = \"https://civitai.com/api/v1/images\"\n",
|
| 224 |
+
" headers = {\n",
|
| 225 |
+
" \"Accept\": \"application/json\",\n",
|
| 226 |
+
" \"Authorization\": \"Bearer APITOKEN\" # Replace with your actual API token\n",
|
| 227 |
+
" }\n",
|
| 228 |
+
" params = {\n",
|
| 229 |
+
" \"sort\": \"Most Reactions\",\n",
|
| 230 |
+
" \"nsfw\": \"X\",\n",
|
| 231 |
+
" \"cursor\": f\"0|{initial_cursor}\"\n",
|
| 232 |
+
" }\n",
|
| 233 |
+
"\n",
|
| 234 |
+
" # Use pathlib to create the base directory for saving files\n",
|
| 235 |
+
"\n",
|
| 236 |
+
" file_counter = 0\n",
|
| 237 |
+
"\n",
|
| 238 |
+
" # Create a folder based on the input timestamp\n",
|
| 239 |
+
" data_raw.mkdir(parents=True, exist_ok=True)\n",
|
| 240 |
+
" sub_directory_path = data_raw\n",
|
| 241 |
+
"\n",
|
| 242 |
+
" retry_delay = 300 # 300 seconds / 5 minutes\n",
|
| 243 |
+
" retries_without_cursor = 0 # Track consecutive retries without a new cursor\n",
|
| 244 |
+
"\n",
|
| 245 |
+
" while True:\n",
|
| 246 |
+
" response = requests.get(base_url, headers=headers, params=params)\n",
|
| 247 |
+
" if response.status_code == 200:\n",
|
| 248 |
+
" data = response.json()\n",
|
| 249 |
+
" items = data.get('items', [])\n",
|
| 250 |
+
" if not items:\n",
|
| 251 |
+
" print(\"No more data available.\")\n",
|
| 252 |
+
" retries_without_cursor += 1\n",
|
| 253 |
+
" if retries_without_cursor > 5: # Allow up to 5 retries before stopping\n",
|
| 254 |
+
" print(\"Reached the end of the data after multiple retries.\")\n",
|
| 255 |
+
" break\n",
|
| 256 |
+
" time.sleep(retry_delay)\n",
|
| 257 |
+
" continue\n",
|
| 258 |
+
"\n",
|
| 259 |
+
" retries_without_cursor = 0 # Reset if we get data\n",
|
| 260 |
+
"\n",
|
| 261 |
+
" next_cursor = data.get('metadata', {}).get('nextCursor')\n",
|
| 262 |
+
" if next_cursor:\n",
|
| 263 |
+
" # Increment the cursor by 50 (e.g., \"0|1722470401000\" -> \"50|1722470401000\")\n",
|
| 264 |
+
" cursor_value = int(params['cursor'].split(\"|\")[0])\n",
|
| 265 |
+
" new_cursor_value = cursor_value + 50\n",
|
| 266 |
+
" params['cursor'] = f\"{new_cursor_value}|{params['cursor'].split('|')[1]}\"\n",
|
| 267 |
+
" else:\n",
|
| 268 |
+
" print(\"No new cursor returned, stopping.\")\n",
|
| 269 |
+
" break\n",
|
| 270 |
+
"\n",
|
| 271 |
+
" file_counter += 1\n",
|
| 272 |
+
" if file_counter % max_images == 0:\n",
|
| 273 |
+
" time_stamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n",
|
| 274 |
+
" sub_directory_path = data_raw / f\"{input_timestamp.replace(':', '').replace('T', '_').replace('.', '_').replace('Z', '')}_session_{time_stamp}\"\n",
|
| 275 |
+
" sub_directory_path.mkdir(parents=True, exist_ok=True)\n",
|
| 276 |
+
"\n",
|
| 277 |
+
" file_path = sub_directory_path / f'most_recent_{file_counter}.json'\n",
|
| 278 |
+
" with open(file_path, 'w', encoding='utf-8') as file:\n",
|
| 279 |
+
" json.dump(data, file, indent=4)\n",
|
| 280 |
+
"\n",
|
| 281 |
+
" elif response.status_code == 502:\n",
|
| 282 |
+
" print(f\"Received HTTP 502. Retrying in {retry_delay // 60} minutes.\")\n",
|
| 283 |
+
" time.sleep(retry_delay) # Wait for 5 minutes before retrying\n",
|
| 284 |
+
" else:\n",
|
| 285 |
+
" print(f\"Failed to fetch data: HTTP {response.status_code}\")\n",
|
| 286 |
+
" break\n"
|
| 287 |
+
]
|
| 288 |
+
},
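The scraping loop advances the cursor by adding 50 to the part before the `|` and keeping the timestamp part unchanged. The sketch below separates that arithmetic from the HTTP handling so it can be exercised without network access. `advance_cursor`, `paginate`, and the injected `fetch_page` callable are hypothetical names, and the sketch stores each page before deciding whether to continue, which is a simplification rather than a copy of the loop above.

```python
def advance_cursor(cursor, step=50):
    """Turn e.g. '0|1722470401000' into '50|1722470401000'."""
    offset, timestamp = cursor.split("|", 1)
    return f"{int(offset) + step}|{timestamp}"

def paginate(fetch_page, cursor, max_pages):
    """Collect pages until one is empty, no nextCursor is returned, or max_pages is hit.

    fetch_page is any callable that takes a cursor string and returns a dict
    shaped like the API response ({'items': [...], 'metadata': {...}}).
    """
    pages = []
    for _ in range(max_pages):
        data = fetch_page(cursor)
        if not data.get("items"):
            break  # nothing left to store
        pages.append(data)
        if not data.get("metadata", {}).get("nextCursor"):
            break  # the API signalled the end of the listing
        cursor = advance_cursor(cursor)
    return pages
```

Injecting `fetch_page` keeps the retry and rate-limit handling (the 5-minute sleep on HTTP 502) outside the pagination logic, so each part can be tested on its own.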
|
| 289 |
+
{
|
| 290 |
+
"cell_type": "markdown",
|
| 291 |
+
"id": "238044f1-544f-4821-a91e-57386d464f9e",
|
| 292 |
+
"metadata": {
|
| 293 |
+
"execution": {
|
| 294 |
+
"iopub.execute_input": "2025-02-06T16:13:44.748157Z",
|
| 295 |
+
"iopub.status.busy": "2025-02-06T16:13:44.747986Z",
|
| 296 |
+
"iopub.status.idle": "2025-02-06T16:13:44.752230Z",
|
| 297 |
+
"shell.execute_reply": "2025-02-06T16:13:44.751694Z",
|
| 298 |
+
"shell.execute_reply.started": "2025-02-06T16:13:44.748140Z"
|
| 299 |
+
}
|
| 300 |
+
},
|
| 301 |
+
"source": [
|
| 302 |
+
"Run the following cell to scrape image metadata:"
|
| 303 |
+
]
|
| 304 |
+
},
|
| 305 |
+
{
|
| 306 |
+
"cell_type": "code",
|
| 307 |
+
"execution_count": null,
|
| 308 |
+
"id": "7c89cb68-983b-47df-a028-e02d7ca0829d",
|
| 309 |
+
"metadata": {
|
| 310 |
+
"execution": {
|
| 311 |
+
"iopub.execute_input": "2025-02-06T16:52:00.857210Z",
|
| 312 |
+
"iopub.status.busy": "2025-02-06T16:52:00.857015Z",
|
| 313 |
+
"iopub.status.idle": "2025-02-06T16:52:04.901096Z",
|
| 314 |
+
"shell.execute_reply": "2025-02-06T16:52:04.900159Z",
|
| 315 |
+
"shell.execute_reply.started": "2025-02-06T16:52:00.857191Z"
|
| 316 |
+
}
|
| 317 |
+
},
|
| 318 |
+
"outputs": [],
|
| 319 |
+
"source": [
|
| 320 |
+
"get_image_metadata()"
|
| 321 |
+
]
|
| 322 |
+
},
|
| 323 |
+
{
|
| 324 |
+
"cell_type": "markdown",
|
| 325 |
+
"id": "06f5f27e-7278-44c9-ae34-9a2020cac2f2",
|
| 326 |
+
"metadata": {},
|
| 327 |
+
"source": [
|
| 328 |
+
"---\n",
|
| 329 |
+
"---"
|
| 330 |
+
]
|
| 331 |
+
},
|
| 332 |
+
{
|
| 333 |
+
"cell_type": "markdown",
|
| 334 |
+
"id": "648aee1b-9e01-4400-a6a8-3361c97c377a",
|
| 335 |
+
"metadata": {
|
| 336 |
+
"execution": {
|
| 337 |
+
"iopub.execute_input": "2025-02-06T11:33:11.524990Z",
|
| 338 |
+
"iopub.status.busy": "2025-02-06T11:33:11.523287Z",
|
| 339 |
+
"iopub.status.idle": "2025-02-06T11:33:11.531385Z",
|
| 340 |
+
"shell.execute_reply": "2025-02-06T11:33:11.530884Z",
|
| 341 |
+
"shell.execute_reply.started": "2025-02-06T11:33:11.524893Z"
|
| 342 |
+
}
|
| 343 |
+
},
|
| 344 |
+
"source": [
|
| 345 |
+
"### Step 2: Chronological sorting"
|
| 346 |
+
]
|
| 347 |
+
},
|
| 348 |
+
{
|
| 349 |
+
"cell_type": "code",
|
| 350 |
+
"execution_count": 7,
|
| 351 |
+
"id": "b73d7ea5-50e0-485d-8bc5-a4736f31c789",
|
| 352 |
+
"metadata": {
|
| 353 |
+
"execution": {
|
| 354 |
+
"iopub.execute_input": "2025-02-06T16:52:04.902663Z",
|
| 355 |
+
"iopub.status.busy": "2025-02-06T16:52:04.902478Z",
|
| 356 |
+
"iopub.status.idle": "2025-02-06T16:52:08.491068Z",
|
| 357 |
+
"shell.execute_reply": "2025-02-06T16:52:08.490105Z",
|
| 358 |
+
"shell.execute_reply.started": "2025-02-06T16:52:04.902646Z"
|
| 359 |
+
}
|
| 360 |
+
},
|
| 361 |
+
"outputs": [],
|
| 362 |
+
"source": [
|
| 363 |
+
"def organize_files(source_dir, target_dir, max_items_per_file=100):\n",
|
| 364 |
+
" print(f\"Starting to organize files from {source_dir} to {target_dir}\")\n",
|
| 365 |
+
" item_buffer = []\n",
|
| 366 |
+
" file_count = 0\n",
|
| 367 |
+
"\n",
|
| 368 |
+
" # Walk through all files in the source directory\n",
|
| 369 |
+
" for root, dirs, files in os.walk(source_dir):\n",
|
| 370 |
+
" if '.ipynb_checkpoints' in root:\n",
|
| 371 |
+
" continue # Skip .ipynb_checkpoints directories\n",
|
| 372 |
+
" print(f\"Checking directory: {root}\")\n",
|
| 373 |
+
" for filename in files:\n",
|
| 374 |
+
" if filename.lower().endswith('.json'):\n",
|
| 375 |
+
" file_path = os.path.join(root, filename)\n",
|
| 376 |
+
" try:\n",
|
| 377 |
+
" with open(file_path, 'r', encoding='utf-8') as file:\n",
|
| 378 |
+
" data = json.load(file)\n",
|
| 379 |
+
" items = data.get('items', []) # Get the list of items\n",
|
| 380 |
+
" for item in items:\n",
|
| 381 |
+
" item_buffer.append(item)\n",
|
| 382 |
+
" # Write out the buffer if it has reached the maximum size\n",
|
| 383 |
+
" if len(item_buffer) >= max_items_per_file:\n",
|
| 384 |
+
" write_items(item_buffer[:max_items_per_file], target_dir)\n",
|
| 385 |
+
" item_buffer = item_buffer[max_items_per_file:]\n",
|
| 386 |
+
" file_count += 1\n",
|
| 387 |
+
" except json.JSONDecodeError:\n",
|
| 388 |
+
" print(f\"Error decoding JSON from file {file_path}\")\n",
|
| 389 |
+
" except Exception as e:\n",
|
| 390 |
+
" print(f\"An error occurred with file {file_path}: {e}\")\n",
|
| 391 |
+
"\n",
|
| 392 |
+
" # Write any remaining items in the buffer\n",
|
| 393 |
+
" if item_buffer:\n",
|
| 394 |
+
" write_items(item_buffer, target_dir)\n",
|
| 395 |
+
" file_count += 1\n",
|
| 396 |
+
"\n",
|
| 397 |
+
" #print(f\"Processed {file_count} files.\")\n",
|
| 398 |
+
"\n",
|
| 399 |
+
"def write_items(items, target_dir):\n",
|
| 400 |
+
" # Use the createdAt from the first item to determine the directory\n",
|
| 401 |
+
" created_at = items[0].get('createdAt')\n",
|
| 402 |
+
" if created_at:\n",
|
| 403 |
+
" date_obj = datetime.fromisoformat(created_at.rstrip(\"Z\"))\n",
|
| 404 |
+
" new_dir = os.path.join(target_dir, f\"{date_obj.year}\", f\"{date_obj.year}-{date_obj.month:02d}\", f\"{date_obj.year}-{date_obj.month:02d}-{date_obj.day:02d}\")\n",
|
| 405 |
+
" os.makedirs(new_dir, exist_ok=True)\n",
|
| 406 |
+
" new_file_path = os.path.join(new_dir, f\"batch_{date_obj.strftime('%Y%m%dT%H%M%S')}.json\")\n",
|
| 407 |
+
" with open(new_file_path, 'w', encoding='utf-8') as new_file:\n",
|
| 408 |
+
" json.dump(items, new_file, indent=4)\n",
|
| 409 |
+
" #print(f\"Wrote {len(items)} items to {new_file_path}\")\n"
|
| 410 |
+
]
|
| 411 |
+
},
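`organize_files` buffers items in the order `os.walk` yields the files and names each batch after the `createdAt` of its first item. If a strictly chronological order within and across batches is wanted, an explicit sort could be applied before writing. The sketch below shows one way to do that and how the per-day directory is derived; `parse_created_at`, `sort_by_created_at`, and `day_directory` are hypothetical helpers, not functions from the notebook.

```python
from datetime import datetime
from pathlib import PurePosixPath

def parse_created_at(value):
    """Parse a 'Z'-suffixed ISO 8601 string the same way write_items does."""
    return datetime.fromisoformat(value.rstrip("Z"))

def sort_by_created_at(items):
    """Return the items ordered from oldest to newest."""
    return sorted(items, key=lambda item: parse_created_at(item["createdAt"]))

def day_directory(target_dir, created_at):
    """Build the year / year-month / year-month-day path used for sorted batches."""
    d = parse_created_at(created_at)
    return (
        PurePosixPath(target_dir)
        / f"{d.year}"
        / f"{d.year}-{d.month:02d}"
        / f"{d.year}-{d.month:02d}-{d.day:02d}"
    )

print(day_directory("sorted", "2024-07-29T19:28:00.123Z"))  # → sorted/2024/2024-07/2024-07-29
```

Sorting requires holding all items in memory at once, which may not be practical for very large scrapes; the buffered approach in the notebook trades strict ordering for a small memory footprint.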
|
| 412 |
+
{
|
| 413 |
+
"cell_type": "code",
|
| 414 |
+
"execution_count": 8,
|
| 415 |
+
"id": "07cebb8a-2955-4432-a0d1-7ee94975a34b",
|
| 416 |
+
"metadata": {
|
| 417 |
+
"execution": {
|
| 418 |
+
"iopub.execute_input": "2025-02-06T16:52:08.492517Z",
|
| 419 |
+
"iopub.status.busy": "2025-02-06T16:52:08.492328Z",
|
| 420 |
+
"iopub.status.idle": "2025-02-06T16:53:13.648165Z",
|
| 421 |
+
"shell.execute_reply": "2025-02-06T16:53:13.647511Z",
|
| 422 |
+
"shell.execute_reply.started": "2025-02-06T16:52:08.492499Z"
|
| 423 |
+
}
|
| 424 |
+
},
|
| 425 |
+
"outputs": [
|
| 426 |
+
{
|
| 427 |
+
"name": "stdout",
|
| 428 |
+
"output_type": "stream",
|
| 429 |
+
"text": [
|
| 430 |
+
"Starting to organize files from /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/image_metadata/001/2025-02-01_000000_000 to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/sorted/image_metadata\n",
|
| 431 |
+
"Checking directory: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/image_metadata/001/2025-02-01_000000_000\n"
|
| 432 |
+
]
|
| 433 |
+
}
|
| 434 |
+
],
|
| 435 |
+
"source": [
|
| 436 |
+
"data_sorted = current_dir.parent / 'data/sorted/image_metadata'\n",
|
| 437 |
+
"\n",
|
| 438 |
+
"origin = data_raw\n",
|
| 439 |
+
"target = data_sorted\n",
|
| 440 |
+
"\n",
|
| 441 |
+
"organize_files( origin, target)"
|
| 442 |
+
]
|
| 443 |
+
},
|
| 444 |
+
{
|
| 445 |
+
"cell_type": "markdown",
|
| 446 |
+
"id": "9255fd24-54ee-4235-bc44-a0246e2fd927",
|
| 447 |
+
"metadata": {},
|
| 448 |
+
"source": [
|
| 449 |
+
"---\n",
|
| 450 |
+
"---"
|
| 451 |
+
]
|
| 452 |
+
},
|
| 453 |
+
{
|
| 454 |
+
"cell_type": "markdown",
|
| 455 |
+
"id": "64639464-6dc4-4997-a805-3f9f41f305c7",
|
| 456 |
+
"metadata": {},
|
| 457 |
+
"source": [
|
| 458 |
+
"### Step 3: CSV consolidation"
|
| 459 |
+
]
|
| 460 |
+
},
|
| 461 |
+
{
|
| 462 |
+
"cell_type": "markdown",
|
| 463 |
+
"id": "add5f97b-562a-4af5-a8f2-635433a8158f",
|
| 464 |
+
"metadata": {
|
| 465 |
+
"execution": {
|
| 466 |
+
"iopub.execute_input": "2025-02-06T12:03:15.512133Z",
|
| 467 |
+
"iopub.status.busy": "2025-02-06T12:03:15.511399Z",
|
| 468 |
+
"iopub.status.idle": "2025-02-06T12:03:15.516297Z",
|
| 469 |
+
"shell.execute_reply": "2025-02-06T12:03:15.515815Z",
|
| 470 |
+
"shell.execute_reply.started": "2025-02-06T12:03:15.512114Z"
|
| 471 |
+
}
|
| 472 |
+
},
|
| 473 |
+
"source": [
|
| 474 |
+
"#### This script walks through the directories in **image_metadata/** and creates the dataset as a **.csv** file\n"
|
| 475 |
+
]
|
| 476 |
+
},
|
| 477 |
+
{
|
| 478 |
+
"cell_type": "markdown",
|
| 479 |
+
"id": "0f8c83aa-c143-4fac-b890-a8d4249067a0",
|
| 480 |
+
"metadata": {
|
| 481 |
+
"execution": {
|
| 482 |
+
"iopub.execute_input": "2025-02-06T12:02:29.925370Z",
|
| 483 |
+
"iopub.status.busy": "2025-02-06T12:02:29.923935Z",
|
| 484 |
+
"iopub.status.idle": "2025-02-06T12:02:31.421575Z",
|
| 485 |
+
"shell.execute_reply": "2025-02-06T12:02:31.420575Z",
|
| 486 |
+
"shell.execute_reply.started": "2025-02-06T12:02:29.925332Z"
|
| 487 |
+
}
|
| 488 |
+
},
|
| 489 |
+
"source": [
|
| 490 |
+
"\n",
|
| 491 |
+
"### Dataset Columns\n",
|
| 492 |
+
"\n",
|
| 493 |
+
"| **Column Name** | **Description** |\n",
|
| 494 |
+
"|------------------------|-----------------------------------------------------------------------------------------------|\n",
|
| 495 |
+
"| `createdAt` | Timestamp when the item was created. |\n",
|
| 496 |
+
"| `url` | URL associated with the item. |\n",
|
| 497 |
+
"| `positivePrompt` | Positive prompts used in the generation process. |\n",
|
| 498 |
+
"| `negativePrompt` | Negative prompts used in the generation process. |\n",
|
| 499 |
+
"| `nsfw` | Indicates whether the item is NSFW (Not Safe for Work). |\n",
|
| 500 |
+
"| `nsfwLevel` | Level of NSFW content (e.g., Soft, Mature). We only considered SFW content. |\n",
|
| 501 |
+
"| `browsingLevel` | Browsing level required to access the content. |\n",
|
| 502 |
+
"| `statsSummary` | All social reactions summed up: cryCount, likeCount, heartCount, commentCount |\n",
|
| 503 |
+
"| `commentCount` | Number of comments received. |\n",
|
| 504 |
+
"| `usernameHash` | Truncated SHA-256 hash of the creator's username (anonymized). |\n",
|
| 505 |
+
"| `Model` | Model used to generate the content. |\n",
|
| 506 |
+
"| `Meta` | Simplified meta details, including size, seed, steps, sampler, and version. |\n",
|
| 507 |
+
"| `VAE` | Variational Autoencoder (VAE) used, if any. |\n",
|
| 508 |
+
"| `resourceIDs` | Array of up to six resources (LoRA etc.), including name, type, and weight. |\n",
|
| 509 |
+
"\n"
|
| 510 |
+
]
|
| 511 |
+
},
|
| 512 |
+
{
|
| 513 |
+
"cell_type": "code",
|
| 514 |
+
"execution_count": 5,
|
| 515 |
+
"id": "f8db24f9-6678-4e7d-a3b7-453f33ad0f71",
|
| 516 |
+
"metadata": {
|
| 517 |
+
"execution": {
|
| 518 |
+
"iopub.execute_input": "2025-02-07T20:59:38.328530Z",
|
| 519 |
+
"iopub.status.busy": "2025-02-07T20:59:38.328049Z",
|
| 520 |
+
"iopub.status.idle": "2025-02-07T20:59:38.342970Z",
|
| 521 |
+
"shell.execute_reply": "2025-02-07T20:59:38.342400Z",
|
| 522 |
+
"shell.execute_reply.started": "2025-02-07T20:59:38.328508Z"
|
| 523 |
+
}
|
| 524 |
+
},
|
| 525 |
+
"outputs": [],
|
| 526 |
+
"source": [
|
| 527 |
+
"def find_json_files(directory):\n",
|
| 528 |
+
" \"\"\"Walk through the directory and its subdirectories to find all JSON files.\"\"\"\n",
|
| 529 |
+
" json_files = []\n",
|
| 530 |
+
" for root, _, files in os.walk(directory):\n",
|
| 531 |
+
" for file in files:\n",
|
| 532 |
+
" if file.endswith('.json'):\n",
|
| 533 |
+
" json_files.append(os.path.join(root, file))\n",
|
| 534 |
+
" return json_files\n",
|
| 535 |
+
"\n",
|
| 536 |
+
"def hash_username(username):\n",
|
| 537 |
+
" \"\"\"Convert a username into a truncated SHA-256 hash for anonymization.\"\"\"\n",
|
| 538 |
+
" return hashlib.sha256(username.encode('utf-8')).hexdigest()[:16] # Use first 16 characters for brevity\n",
|
| 539 |
+
"\n",
|
| 540 |
+
"def extract_resource_ids(meta, resources):\n",
|
| 541 |
+
" \"\"\"Extract resource IDs from metadata and resources.\"\"\"\n",
|
| 542 |
+
" resource_ids = set() # Use a set to avoid duplicates\n",
|
| 543 |
+
" hash_ids = set() # Separate set to track hashes specifically\n",
|
| 544 |
+
"\n",
|
| 545 |
+
" # Extract CivitAI model IDs\n",
|
| 546 |
+
" if \"civitaiResources\" in meta:\n",
|
| 547 |
+
" for resource in meta[\"civitaiResources\"]:\n",
|
| 548 |
+
" if \"modelVersionId\" in resource:\n",
|
| 549 |
+
" resource_ids.add(str(resource[\"modelVersionId\"])) # Ensure all IDs are strings\n",
|
| 550 |
+
"\n",
|
| 551 |
+
" # Extract model hashes\n",
|
| 552 |
+
" if \"Model hash\" in meta:\n",
|
| 553 |
+
" hash_ids.add(meta[\"Model hash\"])\n",
|
| 554 |
+
" if \"base_model_hash\" in meta:\n",
|
| 555 |
+
" hash_ids.add(meta[\"base_model_hash\"])\n",
|
| 556 |
+
" if \"models\" in meta and isinstance(meta[\"models\"], list):\n",
|
| 557 |
+
" hash_ids.update(str(model) for model in meta[\"models\"])\n",
|
| 558 |
+
"\n",
|
| 559 |
+
" # Extract identifiers from resources list\n",
|
| 560 |
+
" for resource in resources:\n",
|
| 561 |
+
" if isinstance(resource, dict): # Ensure resource is a dictionary\n",
|
| 562 |
+
" resource_hash = resource.get(\"hash\")\n",
|
| 563 |
+
" if resource_hash:\n",
|
| 564 |
+
" hash_ids.add(str(resource_hash)) # Add to hash set\n",
|
| 565 |
+
" else:\n",
|
| 566 |
+
" name = resource.get(\"name\")\n",
|
| 567 |
+
" if name:\n",
|
| 568 |
+
" resource_ids.add(str(name))\n",
|
| 569 |
+
"\n",
|
| 570 |
+
" # Remove any named IDs if their hash exists\n",
|
| 571 |
+
" resource_ids = {rid for rid in resource_ids if rid not in hash_ids}\n",
|
| 572 |
+
"\n",
|
| 573 |
+
" # Combine hashes and remaining resource IDs\n",
|
| 574 |
+
" final_ids = list(hash_ids) + list(resource_ids)\n",
|
| 575 |
+
" return final_ids\n",
|
| 576 |
+
"\n",
|
| 577 |
+
"def format_resource_ids(resource_ids):\n",
|
| 578 |
+
" \"\"\"Format resource IDs in a standardized list format.\"\"\"\n",
|
| 579 |
+
" formatted_ids = [str(rid).strip() for rid in resource_ids if rid]\n",
|
| 580 |
+
" return f\"[{', '.join(formatted_ids)}]\"\n",
|
| 581 |
+
"\n",
|
| 582 |
+
"def truncate_prompt(prompt, max_tokens=77):\n",
|
| 583 |
+
" \"\"\"Truncate a prompt to a maximum number of whitespace-separated words (a rough proxy for tokens).\"\"\"\n",
|
| 584 |
+
" return ' '.join(prompt.split()[:max_tokens]) if prompt else \"\"\n",
|
| 585 |
+
"\n",
|
| 586 |
+
"def write_to_csv(json_files, output_csv):\n",
|
| 587 |
+
" \"\"\"Read JSON files, extract data, and write to a CSV file.\"\"\"\n",
|
| 588 |
+
" with open(output_csv, 'w', newline='', encoding='utf-8') as csvfile:\n",
|
| 589 |
+
" fieldnames = [\n",
|
| 590 |
+
" 'id', 'createdAt', 'url', 'positivePrompt', 'negativePrompt', 'nsfw',\n",
|
| 591 |
+
" 'browsingLevel', 'statsSummary', 'usernameHash', 'Model', 'cfgScale',\n",
|
| 592 |
+
" 'sampler', 'Size', 'seed', 'VAE', 'generationSystem', 'resourceIDs'\n",
|
| 593 |
+
" ]\n",
|
| 594 |
+
" writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\n",
|
| 595 |
+
" writer.writeheader()\n",
|
| 596 |
+
"\n",
|
| 597 |
+
" for json_file in json_files:\n",
|
| 598 |
+
" with open(json_file, 'r', encoding='utf-8') as file:\n",
|
| 599 |
+
" try:\n",
|
| 600 |
+
" data = json.load(file)\n",
|
| 601 |
+
" for item in data:\n",
|
| 602 |
+
" meta = item.get('meta', {}) or {}\n",
|
| 603 |
+
" stats = item.get('stats', {}) or {}\n",
|
| 604 |
+
" resources = meta.get('resources', []) if isinstance(meta, dict) else []\n",
|
| 605 |
+
"\n",
|
| 606 |
+
" # Anonymize username\n",
|
| 607 |
+
" username = item.get('username', '')\n",
|
| 608 |
+
" username_hash = hash_username(username) if username else ''\n",
|
| 609 |
+
"\n",
|
| 610 |
+
" # Extract resource IDs\n",
|
| 611 |
+
" resource_ids = extract_resource_ids(meta, resources)\n",
|
| 612 |
+
" formatted_resource_ids = format_resource_ids(resource_ids)\n",
|
| 613 |
+
"\n",
|
| 614 |
+
" # Summarize stats counts\n",
|
| 615 |
+
" stats_summary = sum(stats.get(key, 0) for key in stats)\n",
|
| 616 |
+
"\n",
|
| 617 |
+
" # Truncate prompts\n",
|
| 618 |
+
" positive_prompt = truncate_prompt(meta.get('prompt', ''))\n",
|
| 619 |
+
" negative_prompt = truncate_prompt(meta.get('negativePrompt', ''))\n",
|
| 620 |
+
"\n",
|
| 621 |
+
" # Extract metadata fields\n",
|
| 622 |
+
" cfg_scale = meta.get('cfgScale', 'N/A')\n",
|
| 623 |
+
" sampler = meta.get('sampler', 'N/A')\n",
|
| 624 |
+
" size = meta.get('Size', 'N/A')\n",
|
| 625 |
+
" seed = meta.get('seed', 'N/A')\n",
|
| 626 |
+
" vae = meta.get('VAE', 'N/A')\n",
|
| 627 |
+
" generation_system = \"CivitAI\" if \"civitaiResources\" in meta else \"Undetermined\"\n",
|
| 628 |
+
"\n",
|
| 629 |
+
" # Process a single row\n",
|
| 630 |
+
" row = {\n",
|
| 631 |
+
" 'id': item.get('id', ''),\n",
|
| 632 |
+
" 'createdAt': item.get('createdAt', ''),\n",
|
| 633 |
+
" 'url': item.get('url', ''),\n",
|
| 634 |
+
" 'positivePrompt': positive_prompt,\n",
|
| 635 |
+
" 'negativePrompt': negative_prompt,\n",
|
| 636 |
+
" 'nsfw': item.get('nsfw', False),\n",
|
| 637 |
+
" 'browsingLevel': item.get('browsingLevel', 'N/A'),\n",
|
| 638 |
+
" 'statsSummary': stats_summary,\n",
|
| 639 |
+
" 'usernameHash': username_hash,\n",
|
| 640 |
+
" 'Model': meta.get('Model', ''),\n",
|
| 641 |
+
" 'cfgScale': cfg_scale,\n",
|
| 642 |
+
" 'sampler': sampler,\n",
|
| 643 |
+
" 'Size': size,\n",
|
| 644 |
+
" 'seed': seed,\n",
|
| 645 |
+
" 'VAE': vae,\n",
|
| 646 |
+
" 'generationSystem': generation_system,\n",
|
| 647 |
+
" 'resourceIDs': formatted_resource_ids\n",
|
| 648 |
+
" }\n",
|
| 649 |
+
"\n",
|
| 650 |
+
" writer.writerow(row)\n",
|
| 651 |
+
" except (json.JSONDecodeError, KeyError, TypeError) as e:\n",
|
| 652 |
+
" print(f\"Error processing file {json_file}: {e}\")\n"
|
| 653 |
+
]
|
| 654 |
+
},
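The per-row transformations above (username hashing, prompt truncation, reaction summing) can be checked in isolation on a small hand-made item. The sketch below restates the two helpers and adds `summarize_stats`, a hypothetical variant of the inline `sum(...)` that skips missing or non-integer counters instead of raising a `TypeError`; the sample values are made up for illustration.

```python
import hashlib

def hash_username(username):
    """First 16 hex characters of the SHA-256 digest, as in the notebook."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()[:16]

def truncate_prompt(prompt, max_words=77):
    """Keep at most max_words whitespace-separated words."""
    return " ".join(prompt.split()[:max_words]) if prompt else ""

def summarize_stats(stats):
    """Sum integer counters only; None or other non-integer values are ignored."""
    return sum(
        value
        for value in stats.values()
        if isinstance(value, int) and not isinstance(value, bool)
    )

# Made-up sample item, shaped like one entry of a sorted batch file.
item = {
    "username": "example_user",
    "stats": {"likeCount": 2, "heartCount": 3, "cryCount": None},
    "meta": {"prompt": "a very long prompt " * 50},
}
row = {
    "usernameHash": hash_username(item["username"]),
    "statsSummary": summarize_stats(item["stats"]),
    "positivePrompt": truncate_prompt(item["meta"]["prompt"]),
}
print(len(row["usernameHash"]), row["statsSummary"], len(row["positivePrompt"].split()))
```

Skipping non-integer counters matters because, in `write_to_csv`, an exception raised for one item is caught at file level and ends processing of the remaining items in that file.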
|
| 655 |
+
{
|
| 656 |
+
"cell_type": "markdown",
|
| 657 |
+
"id": "1c51189e-5eb2-40ef-a5dd-807cd9cded5f",
|
| 658 |
+
"metadata": {
|
| 659 |
+
"execution": {
|
| 660 |
+
"iopub.execute_input": "2025-02-06T12:49:11.129158Z",
|
| 661 |
+
"iopub.status.busy": "2025-02-06T12:49:11.128195Z",
|
| 662 |
+
"iopub.status.idle": "2025-02-06T12:49:11.132440Z",
|
| 663 |
+
"shell.execute_reply": "2025-02-06T12:49:11.131779Z",
|
| 664 |
+
"shell.execute_reply.started": "2025-02-06T12:49:11.129132Z"
|
| 665 |
+
}
|
| 666 |
+
},
|
| 667 |
+
"source": [
|
| 668 |
+
"### Save as 'Civiverse' dataset in data/CSV"
|
| 669 |
+
]
|
| 670 |
+
},
|
| 671 |
+
{
|
| 672 |
+
"cell_type": "code",
|
| 673 |
+
"execution_count": 6,
|
| 674 |
+
"id": "07c97ff9-7ce0-43b7-a156-f5fb79c49cd9",
|
| 675 |
+
"metadata": {
|
| 676 |
+
"execution": {
|
| 677 |
+
"iopub.execute_input": "2025-02-07T20:59:40.218975Z",
|
| 678 |
+
"iopub.status.busy": "2025-02-07T20:59:40.218224Z",
|
| 679 |
+
"iopub.status.idle": "2025-02-07T21:04:11.221872Z",
|
| 680 |
+
"shell.execute_reply": "2025-02-07T21:04:11.220974Z",
|
| 681 |
+
"shell.execute_reply.started": "2025-02-07T20:59:40.218926Z"
|
| 682 |
+
}
|
| 683 |
+
},
|
| 684 |
+
"outputs": [
|
| 685 |
+
{
|
| 686 |
+
"name": "stdout",
|
| 687 |
+
"output_type": "stream",
|
| 688 |
+
"text": [
|
| 689 |
+
"Error processing file /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/sorted/image_metadata/2024/2024-07/2024-07-29/batch_20240729T192800.json: Expecting value: line 1 column 1 (char 0)\n"
|
| 690 |
+
]
|
| 691 |
+
}
|
| 692 |
+
],
|
| 693 |
+
"source": [
|
| 694 |
+
"from pathlib import Path\n",
|
| 695 |
+
"\n",
|
| 696 |
+
"raw = current_dir.parent / 'data/sorted/image_metadata/'\n",
|
| 697 |
+
"csv_output = current_dir.parent / 'data/CSV/Civiverse.csv'\n",
|
| 698 |
+
"\n",
|
| 699 |
+
"csv_output.parent.mkdir(parents=True, exist_ok=True)\n",
|
| 700 |
+
"\n",
|
| 701 |
+
"json_files = find_json_files(raw)\n",
|
| 702 |
+
"write_to_csv(json_files, csv_output)"
|
| 703 |
+
]
|
| 704 |
+
},
|
| 705 |
+
{
|
| 706 |
+
"cell_type": "markdown",
|
| 707 |
+
"id": "b3e64b07-98d1-480d-b4e1-f624a17db848",
|
| 708 |
+
"metadata": {
|
| 709 |
+
"execution": {
|
| 710 |
+
"iopub.execute_input": "2025-02-06T13:50:08.120671Z",
|
| 711 |
+
"iopub.status.busy": "2025-02-06T13:50:08.119949Z",
|
| 712 |
+
"iopub.status.idle": "2025-02-06T13:50:08.124761Z",
|
| 713 |
+
"shell.execute_reply": "2025-02-06T13:50:08.124298Z",
|
| 714 |
+
"shell.execute_reply.started": "2025-02-06T13:50:08.120632Z"
|
| 715 |
+
}
|
| 716 |
+
},
|
| 717 |
+
"source": [
|
| 718 |
+
"### Step 4 (optional): Create a sampled subset"
|
| 719 |
+
]
|
| 720 |
+
},
|
| 721 |
+
{
|
| 722 |
+
"cell_type": "code",
|
| 723 |
+
"execution_count": 11,
|
| 724 |
+
"id": "b531b8cd-4653-451d-a6d5-386a2721827c",
|
| 725 |
+
"metadata": {
|
| 726 |
+
"execution": {
|
| 727 |
+
"iopub.execute_input": "2025-02-06T16:55:01.916121Z",
|
| 728 |
+
"iopub.status.busy": "2025-02-06T16:55:01.915489Z",
|
| 729 |
+
"iopub.status.idle": "2025-02-06T16:55:01.920043Z",
|
| 730 |
+
"shell.execute_reply": "2025-02-06T16:55:01.919655Z",
|
| 731 |
+
"shell.execute_reply.started": "2025-02-06T16:55:01.916103Z"
|
| 732 |
+
}
|
| 733 |
+
},
|
| 734 |
+
"outputs": [],
|
| 735 |
+
"source": [
|
| 736 |
+
"input_file = current_dir.parent / 'data/CSV/Civiverse.csv'\n",
|
| 737 |
+
"output_file = current_dir.parent / 'data/CSV/image_subsets/Civiverse-mini.csv'\n",
|
| 738 |
+
"output_file.parent.mkdir(parents=True, exist_ok=True)"
|
| 739 |
+
]
|
| 740 |
+
},
|
| 741 |
+
{
|
| 742 |
+
"cell_type": "code",
|
| 743 |
+
"execution_count": 12,
|
| 744 |
+
"id": "b82db8b3-310c-49e4-9d7a-aa34662f9556",
|
| 745 |
+
"metadata": {
|
| 746 |
+
"execution": {
|
| 747 |
+
"iopub.execute_input": "2025-02-06T16:55:01.920861Z",
|
| 748 |
+
"iopub.status.busy": "2025-02-06T16:55:01.920586Z",
|
| 749 |
+
"iopub.status.idle": "2025-02-06T16:55:06.790164Z",
|
| 750 |
+
"shell.execute_reply": "2025-02-06T16:55:06.789503Z",
|
| 751 |
+
"shell.execute_reply.started": "2025-02-06T16:55:01.920844Z"
|
| 752 |
+
}
|
| 753 |
+
},
|
| 754 |
+
"outputs": [
|
| 755 |
+
{
|
| 756 |
+
"name": "stderr",
|
| 757 |
+
"output_type": "stream",
|
| 758 |
+
"text": [
|
| 759 |
+
"/sctmp/lauwag/ipykernel_1043676/3764041865.py:4: DtypeWarning: Columns (13) have mixed types. Specify dtype option on import or set low_memory=False.\n",
|
| 760 |
+
" df = pd.read_csv(input_file)\n",
|
| 761 |
+
"/sctmp/lauwag/ipykernel_1043676/3764041865.py:10: SettingWithCopyWarning: \n",
|
| 762 |
+
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
|
| 763 |
+
"Try using .loc[row_indexer,col_indexer] = value instead\n",
|
| 764 |
+
"\n",
|
| 765 |
+
"See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
|
| 766 |
+
" df_sample['createdAt'] = pd.to_datetime(df_sample['createdAt'], errors='coerce')\n"
|
| 767 |
+
]
|
| 768 |
+
}
|
| 769 |
+
],
|
| 770 |
+
"source": [
|
| 771 |
+
"\n",
|
| 772 |
+
"# Define the sampling rate\n",
|
| 773 |
+
"n = 100\n",
|
| 774 |
+
"# Read the CSV file\n",
|
| 775 |
+
"df = pd.read_csv(input_file)\n",
|
| 776 |
+
"\n",
|
| 777 |
+
"# Sample every 15th row\n",
|
| 778 |
+
"df_sample = df.iloc[::n, :]\n",
|
| 779 |
+
"\n",
|
| 780 |
+
"# Convert the 'createdAt' column to datetime, inferring mixed formats\n",
|
| 781 |
+
"df_sample['createdAt'] = pd.to_datetime(df_sample['createdAt'], errors='coerce')\n",
|
| 782 |
+
"\n",
|
| 783 |
+
"# Sort the sampled data by the 'createdAt' column\n",
|
| 784 |
+
"df_sample_sorted = df_sample.sort_values(by='createdAt')\n",
|
| 785 |
+
"\n",
|
| 786 |
+
"# Save the sorted sample DataFrame to a new CSV file\n",
|
| 787 |
+
"df_sample_sorted.to_csv(output_file, index=False)"
|
| 788 |
+
]
|
| 789 |
+
},
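The sampling cell above loads the whole CSV into memory before slicing. As an alternative sketch using only the standard library, the same every-n-th-row selection can be streamed; `parse_created_at` and `sample_every_nth` are hypothetical names, and the in-memory CSV holds made-up values. One assumption to check: on Python versions before 3.11, `datetime.fromisoformat` rejects a trailing `Z`, so such timestamps would be treated as unparseable here.

```python
import csv
import io
from datetime import datetime
from itertools import islice


def parse_created_at(value):
    """Parse an ISO-8601 timestamp; return None if it cannot be parsed."""
    try:
        return datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def sample_every_nth(rows, n):
    """Keep rows 0, n, 2n, ... (the same positions as df.iloc[::n])."""
    return list(islice(rows, 0, None, n))


# Tiny in-memory CSV standing in for Civiverse.csv (made-up values)
raw = io.StringIO(
    "id,createdAt\n"
    "1,2024-07-03T10:00:00\n"
    "2,2024-07-01T09:00:00\n"
    "3,not-a-date\n"
    "4,2024-07-02T08:00:00\n"
    "5,2024-06-30T12:00:00\n"
)
rows = sample_every_nth(csv.DictReader(raw), 2)  # keeps ids 1, 3, 5
for row in rows:
    row["createdAt"] = parse_created_at(row["createdAt"])

# Sort by timestamp, placing unparseable rows last (as pandas does with NaT)
rows.sort(key=lambda r: (r["createdAt"] is None, r["createdAt"] or datetime.min))
sampled_ids = [r["id"] for r in rows]
```

Because the selection is purely positional, the sample is only representative if the row order of the source CSV is not correlated with the properties being studied.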
|
| 790 |
+
{
|
| 791 |
+
"cell_type": "markdown",
|
| 792 |
+
"id": "d915b7d6-937a-4fa9-9edc-ded67e3606f5",
|
| 793 |
+
"metadata": {},
|
| 794 |
+
"source": [
|
| 795 |
+
"---\n",
|
| 796 |
+
"---\n",
|
| 797 |
+
"---"
|
| 798 |
+
]
|
| 799 |
+
},
|
| 800 |
+
{
|
| 801 |
+
"cell_type": "markdown",
|
| 802 |
+
"id": "11647bb7-5ce9-414a-8486-5bdce8d9cfea",
|
| 803 |
+
"metadata": {
|
| 804 |
+
"execution": {
|
| 805 |
+
"iopub.execute_input": "2025-02-06T12:49:43.126762Z",
|
| 806 |
+
"iopub.status.busy": "2025-02-06T12:49:43.125797Z",
|
| 807 |
+
"iopub.status.idle": "2025-02-06T12:49:43.129759Z",
|
| 808 |
+
"shell.execute_reply": "2025-02-06T12:49:43.129176Z",
|
| 809 |
+
"shell.execute_reply.started": "2025-02-06T12:49:43.126736Z"
|
| 810 |
+
}
|
| 811 |
+
},
|
| 812 |
+
"source": [
|
| 813 |
+
"## MODELS"
|
| 814 |
+
]
|
| 815 |
+
},
|
| 816 |
+
{
|
| 817 |
+
"cell_type": "markdown",
|
| 818 |
+
"id": "5c9cb5e3-7cea-4574-9319-f3cd89354b1f",
|
| 819 |
+
"metadata": {
|
| 820 |
+
"execution": {
|
| 821 |
+
"iopub.execute_input": "2025-02-06T13:23:38.017078Z",
|
| 822 |
+
"iopub.status.busy": "2025-02-06T13:23:38.016639Z",
|
| 823 |
+
"iopub.status.idle": "2025-02-06T13:23:38.019993Z",
|
| 824 |
+
"shell.execute_reply": "2025-02-06T13:23:38.019549Z",
|
| 825 |
+
"shell.execute_reply.started": "2025-02-06T13:23:38.017053Z"
|
| 826 |
+
}
|
| 827 |
+
},
|
| 828 |
+
"source": [
|
| 829 |
+
"### Step 1: Scrape model metadata"
|
| 830 |
+
]
|
| 831 |
+
},
|
| 832 |
+
{
|
| 833 |
+
"cell_type": "markdown",
|
| 834 |
+
"id": "83e12b9f-5ae1-407d-9754-5979d837f787",
|
| 835 |
+
"metadata": {
|
| 836 |
+
"execution": {
|
| 837 |
+
"iopub.execute_input": "2025-02-06T14:06:04.857874Z",
|
| 838 |
+
"iopub.status.busy": "2025-02-06T14:06:04.857500Z",
|
| 839 |
+
"iopub.status.idle": "2025-02-06T14:06:04.860438Z",
|
| 840 |
+
"shell.execute_reply": "2025-02-06T14:06:04.860030Z",
|
| 841 |
+
"shell.execute_reply.started": "2025-02-06T14:06:04.857856Z"
|
| 842 |
+
}
|
| 843 |
+
},
|
| 844 |
+
"source": [
|
| 845 |
+
"#### the resulting files will appear in data/raw/model_metadata as *.json"
|
| 846 |
+
]
|
| 847 |
+
},
|
| 848 |
+
{
|
| 849 |
+
"cell_type": "code",
|
| 850 |
+
"execution_count": 5,
|
| 851 |
+
"id": "41ee14e0-fb78-4f91-aba1-13faf05af7d8",
|
| 852 |
+
"metadata": {
|
| 853 |
+
"execution": {
|
| 854 |
+
"iopub.execute_input": "2025-02-08T19:37:56.687832Z",
|
| 855 |
+
"iopub.status.busy": "2025-02-08T19:37:56.687251Z",
|
| 856 |
+
"iopub.status.idle": "2025-02-08T19:37:56.696572Z",
|
| 857 |
+
"shell.execute_reply": "2025-02-08T19:37:56.696059Z",
|
| 858 |
+
"shell.execute_reply.started": "2025-02-08T19:37:56.687809Z"
|
| 859 |
+
}
|
| 860 |
+
},
|
| 861 |
+
"outputs": [],
|
| 862 |
+
"source": [
|
| 863 |
+
"import datetime\n",
|
| 864 |
+
"\n",
|
| 865 |
+
"def load_api_keys():\n",
|
| 866 |
+
" \"\"\"Load API keys from a text file, one per line.\"\"\"\n",
|
| 867 |
+
" if not os.path.exists(key_karussell):\n",
|
| 868 |
+
" raise FileNotFoundError(f\"API key file '{API_KEYS_FILE}' not found!\")\n",
|
| 869 |
+
" \n",
|
| 870 |
+
" with open(key_karussell, 'r') as file:\n",
|
| 871 |
+
" keys = [line.strip() for line in file if line.strip()]\n",
|
| 872 |
+
" \n",
|
| 873 |
+
" if not keys:\n",
|
| 874 |
+
" raise ValueError(\"No API keys found in the file!\")\n",
|
| 875 |
+
" \n",
|
| 876 |
+
" return keys\n",
|
| 877 |
+
"\n",
|
| 878 |
+
"def get_model_metadata():\n",
|
| 879 |
+
" base_url = \"https://civitai.com/api/v1/models\"\n",
|
| 880 |
+
" params = {\"sort\": \"Newest\", \"nsfw\": True}\n",
|
| 881 |
+
"\n",
|
| 882 |
+
" # Load API keys\n",
|
| 883 |
+
" api_keys = load_api_keys()\n",
|
| 884 |
+
" key_index = 0 # Start with the first key\n",
|
| 885 |
+
"\n",
|
| 886 |
+
" page_counter = 0\n",
|
| 887 |
+
" max_pages = 300000000 # Adjust as needed\n",
|
| 888 |
+
" os.makedirs(directory_path, exist_ok=True)\n",
|
| 889 |
+
"\n",
|
| 890 |
+
" while True:\n",
|
| 891 |
+
" if page_counter >= max_pages:\n",
|
| 892 |
+
" print(f\"Reached the limit of {max_pages} pages.\")\n",
|
| 893 |
+
" break\n",
|
| 894 |
+
"\n",
|
| 895 |
+
" headers = {\n",
|
| 896 |
+
" \"Accept\": \"application/json\",\n",
|
| 897 |
+
" \"Authorization\": f\"Bearer {api_keys[key_index]}\"\n",
|
| 898 |
+
" }\n",
|
| 899 |
+
"\n",
|
| 900 |
+
" response = requests.get(base_url, headers=headers, params=params)\n",
|
| 901 |
+
"\n",
|
| 902 |
+
" if response.status_code == 200:\n",
|
| 903 |
+
" data = response.json()\n",
|
| 904 |
+
" page_counter += 1\n",
|
| 905 |
+
"\n",
|
| 906 |
+
" # Add timestamp\n",
|
| 907 |
+
" formatted_timestamp = datetime.datetime.now().strftime(\"data obtained on the %d.%m.%Y at %H:%M CEST\")\n",
|
| 908 |
+
"\n",
|
| 909 |
+
" data['timestamp'] = formatted_timestamp\n",
|
| 910 |
+
"\n",
|
| 911 |
+
" # Save data to file\n",
|
| 912 |
+
" file_path = os.path.join(directory_path, f'newest_models_{page_counter}.json')\n",
|
| 913 |
+
" with open(file_path, 'w', encoding='utf-8') as file:\n",
|
| 914 |
+
" json.dump(data, file, indent=4)\n",
|
| 915 |
+
"\n",
|
| 916 |
+
" # Check for nextCursor\n",
|
| 917 |
+
" next_cursor = data.get('metadata', {}).get('nextCursor')\n",
|
| 918 |
+
" if not next_cursor:\n",
|
| 919 |
+
" print(\"No more data available.\")\n",
|
| 920 |
+
" break\n",
|
| 921 |
+
" else:\n",
|
| 922 |
+
" params['cursor'] = next_cursor\n",
|
| 923 |
+
" \n",
|
| 924 |
+
" elif response.status_code in (401, 403): # Unauthorized or Forbidden\n",
|
| 925 |
+
" print(f\"API Key {key_index + 1} failed with status {response.status_code}. Trying next key...\")\n",
|
| 926 |
+
" key_index += 1\n",
|
| 927 |
+
"\n",
|
| 928 |
+
" if key_index >= len(api_keys):\n",
|
| 929 |
+
" print(\"All API keys failed. Exiting.\")\n",
|
| 930 |
+
" break # Stop if all keys fail\n",
|
| 931 |
+
" \n",
|
| 932 |
+
" else:\n",
|
| 933 |
+
" print(f\"Failed to fetch data: HTTP {response.status_code}\")\n",
|
| 934 |
+
" break # Stop on other errors\n"
|
| 935 |
+
]
|
| 936 |
+
},
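The control flow of the scraping cell above (cursor pagination, with a switch to the next API key on HTTP 401/403) can be exercised without network access. The sketch below is an illustration under stated assumptions: `paginate_with_key_rotation` is a hypothetical function, and `fetch_page` is a hypothetical callable that stands in for `requests.get` and returns a `(status_code, payload)` pair.

```python
def paginate_with_key_rotation(fetch_page, api_keys, max_pages=10):
    """Collect pages via cursor pagination, rotating API keys on 401/403.

    fetch_page(api_key, cursor) is a hypothetical stand-in for the HTTP call;
    it must return (status_code, payload_dict).
    """
    pages, cursor, key_index = [], None, 0
    while len(pages) < max_pages and key_index < len(api_keys):
        status, payload = fetch_page(api_keys[key_index], cursor)
        if status == 200:
            pages.append(payload)
            cursor = payload.get("metadata", {}).get("nextCursor")
            if not cursor:
                break  # no further pages
        elif status in (401, 403):
            key_index += 1  # retry the same cursor with the next key
        else:
            break  # stop on any other error
    return pages


# Fake API for illustration: "bad-key" is rejected, "good-key" serves two pages
def fake_fetch(api_key, cursor):
    if api_key != "good-key":
        return 401, {}
    if cursor is None:
        return 200, {"items": [1, 2], "metadata": {"nextCursor": "c2"}}
    return 200, {"items": [3], "metadata": {}}


pages = paginate_with_key_rotation(fake_fetch, ["bad-key", "good-key"])
item_count = sum(len(p["items"]) for p in pages)
```

Passing the fetch function in as a parameter keeps the loop testable; in the notebook the equivalent call is made directly with `requests.get`.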
|
| 937 |
+
{
|
| 938 |
+
"cell_type": "code",
|
| 939 |
+
"execution_count": 6,
|
| 940 |
+
"id": "0b574bc0-8bc5-48fd-9ced-8723d07e0d0e",
|
| 941 |
+
"metadata": {
|
| 942 |
+
"execution": {
|
| 943 |
+
"iopub.execute_input": "2025-02-08T19:37:57.395918Z",
|
| 944 |
+
"iopub.status.busy": "2025-02-08T19:37:57.395733Z",
|
| 945 |
+
"iopub.status.idle": "2025-02-08T19:37:57.399679Z",
|
| 946 |
+
"shell.execute_reply": "2025-02-08T19:37:57.399183Z",
|
| 947 |
+
"shell.execute_reply.started": "2025-02-08T19:37:57.395902Z"
|
| 948 |
+
}
|
| 949 |
+
},
|
| 950 |
+
"outputs": [],
|
| 951 |
+
"source": [
|
| 952 |
+
"key_karussell = current_dir.parent / 'misc/api_keys.txt'\n",
|
| 953 |
+
"directory_path = current_dir.parent / 'data/raw/model_metadata/'"
|
| 954 |
+
]
|
| 955 |
+
},
|
| 956 |
+
{
|
| 957 |
+
"cell_type": "markdown",
|
| 958 |
+
"id": "8f43ce23-5986-4b67-be2d-453831af9a6e",
|
| 959 |
+
"metadata": {},
|
| 960 |
+
"source": [
|
| 961 |
+
"uncomment this to get model metadata"
|
| 962 |
+
]
|
| 963 |
+
},
|
| 964 |
+
{
|
| 965 |
+
"cell_type": "code",
|
| 966 |
+
"execution_count": 7,
|
| 967 |
+
"id": "3f95b4ba-5742-4268-b2e8-de9145faf495",
|
| 968 |
+
"metadata": {
|
| 969 |
+
"execution": {
|
| 970 |
+
"iopub.execute_input": "2025-02-08T19:37:58.117386Z",
|
| 971 |
+
"iopub.status.busy": "2025-02-08T19:37:58.117158Z",
|
| 972 |
+
"iopub.status.idle": "2025-02-08T19:37:58.121162Z",
|
| 973 |
+
"shell.execute_reply": "2025-02-08T19:37:58.120580Z",
|
| 974 |
+
"shell.execute_reply.started": "2025-02-08T19:37:58.117369Z"
|
| 975 |
+
}
|
| 976 |
+
},
|
| 977 |
+
"outputs": [],
|
| 978 |
+
"source": [
|
| 979 |
+
"#get_model_metadata()"
|
| 980 |
+
]
|
| 981 |
+
},
|
| 982 |
+
{
|
| 983 |
+
"cell_type": "markdown",
|
| 984 |
+
"id": "1ed5f44e-612e-4b6d-8974-904fb3e058d6",
|
| 985 |
+
"metadata": {
|
| 986 |
+
"execution": {
|
| 987 |
+
"iopub.execute_input": "2025-02-06T12:52:40.162173Z",
|
| 988 |
+
"iopub.status.busy": "2025-02-06T12:52:40.159989Z",
|
| 989 |
+
"iopub.status.idle": "2025-02-06T12:52:40.170634Z",
|
| 990 |
+
"shell.execute_reply": "2025-02-06T12:52:40.169945Z",
|
| 991 |
+
"shell.execute_reply.started": "2025-02-06T12:52:40.162124Z"
|
| 992 |
+
}
|
| 993 |
+
},
|
| 994 |
+
"source": [
|
| 995 |
+
"### Step 2 Consolidate Model-dataset CSV"
|
| 996 |
+
]
|
| 997 |
+
},
|
| 998 |
+
{
|
| 999 |
+
"cell_type": "code",
|
| 1000 |
+
"execution_count": 8,
|
| 1001 |
+
"id": "464d82f5-c24e-4b53-9b50-682fa4bf3430",
|
| 1002 |
+
"metadata": {
|
| 1003 |
+
"execution": {
|
| 1004 |
+
"iopub.execute_input": "2025-02-08T19:37:59.052245Z",
|
| 1005 |
+
"iopub.status.busy": "2025-02-08T19:37:59.051645Z",
|
| 1006 |
+
"iopub.status.idle": "2025-02-08T19:37:59.056017Z",
|
| 1007 |
+
"shell.execute_reply": "2025-02-08T19:37:59.055598Z",
|
| 1008 |
+
"shell.execute_reply.started": "2025-02-08T19:37:59.052224Z"
|
| 1009 |
+
}
|
| 1010 |
+
},
|
| 1011 |
+
"outputs": [],
|
| 1012 |
+
"source": [
|
| 1013 |
+
"## path thingy\n",
|
| 1014 |
+
"try: #scripts\n",
|
| 1015 |
+
" current_dir = Path(__file__).resolve().parent\n",
|
| 1016 |
+
"except NameError:\n",
|
| 1017 |
+
" # jupyter\n",
|
| 1018 |
+
" current_dir = Path.cwd()"
|
| 1019 |
+
]
|
| 1020 |
+
},
|
| 1021 |
+
{
|
| 1022 |
+
"cell_type": "code",
|
| 1023 |
+
"execution_count": 9,
|
| 1024 |
+
"id": "14f3db4d-ef93-4629-8a27-971adb49248b",
|
| 1025 |
+
"metadata": {
|
| 1026 |
+
"execution": {
|
| 1027 |
+
"iopub.execute_input": "2025-02-08T19:37:59.443457Z",
|
| 1028 |
+
"iopub.status.busy": "2025-02-08T19:37:59.443236Z",
|
| 1029 |
+
"iopub.status.idle": "2025-02-08T19:37:59.890103Z",
|
| 1030 |
+
"shell.execute_reply": "2025-02-08T19:37:59.889571Z",
|
| 1031 |
+
"shell.execute_reply.started": "2025-02-08T19:37:59.443439Z"
|
| 1032 |
+
}
|
| 1033 |
+
},
|
| 1034 |
+
"outputs": [
|
| 1035 |
+
{
|
| 1036 |
+
"name": "stdout",
|
| 1037 |
+
"output_type": "stream",
|
| 1038 |
+
"text": [
|
| 1039 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_2.json\n",
|
| 1040 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_1.json\n",
|
| 1041 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_4.json\n",
|
| 1042 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_6.json\n",
|
| 1043 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_7.json\n",
|
| 1044 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_5.json\n",
|
| 1045 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_3.json\n"
|
| 1046 |
+
]
|
| 1047 |
+
}
|
| 1048 |
+
],
|
| 1049 |
+
"source": [
|
| 1050 |
+
"import os\n",
|
| 1051 |
+
"import json\n",
|
| 1052 |
+
"import pandas as pd\n",
|
| 1053 |
+
"import hashlib\n",
|
| 1054 |
+
"\n",
|
| 1055 |
+
"\n",
|
| 1056 |
+
"json_path = current_dir.parent / 'data/raw/model_metadata/'\n",
|
| 1057 |
+
"\n",
|
| 1058 |
+
"# Initialize a list to hold all the processed records\n",
|
| 1059 |
+
"data_records = []\n",
|
| 1060 |
+
"\n",
|
| 1061 |
+
"def hash_username(username):\n",
|
| 1062 |
+
" \"\"\"Convert a username into a unique hash.\"\"\"\n",
|
| 1063 |
+
" return hashlib.sha256(username.encode('utf-8')).hexdigest()[:16] # Use first 16 characters for brevity\n",
|
| 1064 |
+
"\n",
|
| 1065 |
+
"# Loop through each file in the directory\n",
|
| 1066 |
+
"for filename in os.listdir(json_path):\n",
|
| 1067 |
+
" # Construct the full path to the JSON file\n",
|
| 1068 |
+
" file_path = os.path.join(json_path, filename)\n",
|
| 1069 |
+
"\n",
|
| 1070 |
+
" # Only process files with .json extension\n",
|
| 1071 |
+
" if filename.endswith('.json') and os.path.isfile(file_path):\n",
|
| 1072 |
+
" print(f\"Processing file: {file_path}\")\n",
|
| 1073 |
+
"\n",
|
| 1074 |
+
" # Open and load the JSON file\n",
|
| 1075 |
+
" with open(file_path, 'r') as file:\n",
|
| 1076 |
+
" try:\n",
|
| 1077 |
+
" data = json.load(file)\n",
|
| 1078 |
+
" except json.JSONDecodeError as e:\n",
|
| 1079 |
+
" print(f\"Error decoding JSON in file {filename}: {e}\")\n",
|
| 1080 |
+
" continue\n",
|
| 1081 |
+
"\n",
|
| 1082 |
+
" # Determine how the items are structured within the dictionary\n",
|
| 1083 |
+
" if 'items' in data:\n",
|
| 1084 |
+
" items = data['items']\n",
|
| 1085 |
+
" elif 'data' in data:\n",
|
| 1086 |
+
" items = data['data']\n",
|
| 1087 |
+
" else:\n",
|
| 1088 |
+
" print(f\"No known item list key found in {filename}\")\n",
|
| 1089 |
+
" continue\n",
|
| 1090 |
+
"\n",
|
| 1091 |
+
" for item in items:\n",
|
| 1092 |
+
" if not isinstance(item, dict):\n",
|
| 1093 |
+
" print(f\"Expected a dictionary for each item, but found: {type(item)} in {filename}\")\n",
|
| 1094 |
+
" continue\n",
|
| 1095 |
+
"\n",
|
| 1096 |
+
" # Extract required information\n",
|
| 1097 |
+
" model_versions = item.get('modelVersions', [])\n",
|
| 1098 |
+
" download_url = model_versions[0]['files'][0]['downloadUrl'] if model_versions and 'files' in model_versions[0] and model_versions[0]['files'] else ''\n",
|
| 1099 |
+
" auto_hashes = model_versions[0]['files'][0]['hashes'] if model_versions and 'files' in model_versions[0] and model_versions[0]['files'] else {}\n",
|
| 1100 |
+
"\n",
|
| 1101 |
+
" # Collect up to 20 modelVersion IDs\n",
|
| 1102 |
+
" model_version_ids = [mv.get('id', '') for mv in model_versions[:20]]\n",
|
| 1103 |
+
"\n",
|
| 1104 |
+
" # Get preview images: first and latest\n",
|
| 1105 |
+
" first_image_url = model_versions[0]['images'][0]['url'] if model_versions and 'images' in model_versions[0] and model_versions[0]['images'] else ''\n",
|
| 1106 |
+
" latest_index = len(model_versions) - 1\n",
|
| 1107 |
+
" latest_image_url = model_versions[latest_index]['images'][0]['url'] if model_versions and 'images' in model_versions[latest_index] and model_versions[latest_index]['images'] else ''\n",
|
| 1108 |
+
"\n",
|
| 1109 |
+
" # Hash the username\n",
|
| 1110 |
+
" username = item.get('creator', {}).get('username', '')\n",
|
| 1111 |
+
" username_hash = hash_username(username) if username else ''\n",
|
| 1112 |
+
"\n",
|
| 1113 |
+
" # Extract additional fields\n",
|
| 1114 |
+
" record = {\n",
|
| 1115 |
+
" 'id': item.get('id', ''),\n",
|
| 1116 |
+
" 'name': item.get('name', ''),\n",
|
| 1117 |
+
" 'type': item.get('type', ''),\n",
|
| 1118 |
+
" 'baseModel': model_versions[0].get('baseModel', '') if model_versions else '',\n",
|
| 1119 |
+
" 'downloadCount': item.get('stats', {}).get('downloadCount', 0),\n",
|
| 1120 |
+
" 'nsfwLevel': item.get('nsfwLevel', 0),\n",
|
| 1121 |
+
" 'modelVersions': len(model_versions),\n",
|
| 1122 |
+
" 'publishedAt': model_versions[0].get('publishedAt', '') if model_versions else '',\n",
|
| 1123 |
+
" 'usernameHash': username_hash,\n",
|
| 1124 |
+
" 'downloadUrl': download_url,\n",
|
| 1125 |
+
" 'firstImageUrl': first_image_url,\n",
|
| 1126 |
+
" 'latestImageUrl': latest_image_url,\n",
|
| 1127 |
+
" 'poi': item.get('poi', False),\n",
|
| 1128 |
+
" 'AutoV1': auto_hashes.get('AutoV1', ''),\n",
|
| 1129 |
+
" 'AutoV2': auto_hashes.get('AutoV2', ''),\n",
|
| 1130 |
+
" 'AutoV3': auto_hashes.get('AutoV3', ''),\n",
|
| 1131 |
+
" 'SHA256': auto_hashes.get('SHA256', ''),\n",
|
| 1132 |
+
" 'CRC32': auto_hashes.get('CRC32', ''),\n",
|
| 1133 |
+
" 'BLAKE3': auto_hashes.get('BLAKE3', ''),\n",
|
| 1134 |
+
" 'previewImage': latest_image_url\n",
|
| 1135 |
+
" }\n",
|
| 1136 |
+
"\n",
|
| 1137 |
+
" # Add version IDs to the record\n",
|
| 1138 |
+
" for i in range(20): # Collect up to 20 version IDs\n",
|
| 1139 |
+
" version_key = f'version_id_{i+1}'\n",
|
| 1140 |
+
" record[version_key] = model_version_ids[i] if len(model_version_ids) > i else ''\n",
|
| 1141 |
+
"\n",
|
| 1142 |
+
" # Add tags\n",
|
| 1143 |
+
" tags = item.get('tags', [])\n",
|
| 1144 |
+
" for i in range(7): # To retrieve up to tag 6\n",
|
| 1145 |
+
" tag_key = f'tag_{i+1}'\n",
|
| 1146 |
+
" record[tag_key] = tags[i] if len(tags) > i else ''\n",
|
| 1147 |
+
"\n",
|
| 1148 |
+
" data_records.append(record)\n",
|
| 1149 |
+
"\n",
|
| 1150 |
+
"# Create a DataFrame and sort by download count\n",
|
| 1151 |
+
"df = pd.DataFrame(data_records)\n",
|
| 1152 |
+
"df_sorted = df.sort_values(by='downloadCount', ascending=False)"
|
| 1153 |
+
]
|
| 1154 |
+
},
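Two ideas from the consolidation cell above are easy to check in isolation: pseudonymizing usernames with a truncated SHA-256 digest, and spreading a variable-length list (tags, version IDs) over a fixed number of CSV columns. The following is a minimal sketch; `spread` and `flatten_model` are hypothetical helpers, and the example item is made up. Note that an unsalted, truncated hash is a pseudonym and not an anonymization: anyone who can guess a username can recompute its hash.

```python
import hashlib


def hash_username(username, length=16):
    """Pseudonymize a username with a truncated SHA-256 hex digest."""
    return hashlib.sha256(username.encode("utf-8")).hexdigest()[:length]


def spread(values, prefix, width):
    """Map a list onto prefix_1..prefix_width, padding missing entries with ''."""
    return {
        f"{prefix}_{i + 1}": values[i] if i < len(values) else ""
        for i in range(width)
    }


def flatten_model(item, tag_width=7):
    """Build one flat CSV row from a model item (only a subset of fields)."""
    username = (item.get("creator") or {}).get("username", "")
    row = {
        "id": item.get("id", ""),
        "usernameHash": hash_username(username) if username else "",
    }
    row.update(spread(item.get("tags", []), "tag", tag_width))
    return row


# Made-up item for illustration only
example = {"id": 1, "creator": {"username": "alice"}, "tags": ["style", "anime"]}
row = flatten_model(example)
```

Fixed-width columns keep the CSV rectangular, at the cost of silently dropping any tags beyond the chosen width.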
|
| 1155 |
+
{
|
| 1156 |
+
"cell_type": "markdown",
|
| 1157 |
+
"id": "2d6f822f-754f-4feb-9ed6-6f868421c454",
|
| 1158 |
+
"metadata": {},
|
| 1159 |
+
"source": [
|
| 1160 |
+
"### Save model-data to CSV"
|
| 1161 |
+
]
|
| 1162 |
+
},
|
| 1163 |
+
{
|
| 1164 |
+
"cell_type": "code",
|
| 1165 |
+
"execution_count": 10,
|
| 1166 |
+
"id": "f861cf7b-5eb3-46ad-ab2d-9e2fdf2169b4",
|
| 1167 |
+
"metadata": {
|
| 1168 |
+
"execution": {
|
| 1169 |
+
"iopub.execute_input": "2025-02-08T19:38:02.047421Z",
|
| 1170 |
+
"iopub.status.busy": "2025-02-08T19:38:02.046761Z",
|
| 1171 |
+
"iopub.status.idle": "2025-02-08T19:38:02.193381Z",
|
| 1172 |
+
"shell.execute_reply": "2025-02-08T19:38:02.192886Z",
|
| 1173 |
+
"shell.execute_reply.started": "2025-02-08T19:38:02.047397Z"
|
| 1174 |
+
}
|
| 1175 |
+
},
|
| 1176 |
+
"outputs": [
|
| 1177 |
+
{
|
| 1178 |
+
"name": "stdout",
|
| 1179 |
+
"output_type": "stream",
|
| 1180 |
+
"text": [
|
| 1181 |
+
"Data has been saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/Civiverse-Models.csv\n"
|
| 1182 |
+
]
|
| 1183 |
+
}
|
| 1184 |
+
],
|
| 1185 |
+
"source": [
|
| 1186 |
+
"output_csv = current_dir.parent / 'data/CSV/Civiverse-Models.csv'\n",
|
| 1187 |
+
"output_csv.parent.mkdir(parents=True, exist_ok=True)\n",
|
| 1188 |
+
"df_sorted.to_csv(output_csv, index=False)\n",
|
| 1189 |
+
"print(f\"Data has been saved to {output_csv}\")"
|
| 1190 |
+
]
|
| 1191 |
+
},
|
| 1192 |
+
{
|
| 1193 |
+
"cell_type": "markdown",
|
| 1194 |
+
"id": "9b948933-2a2f-42b0-9083-2d81586ae3f2",
|
| 1195 |
+
"metadata": {
|
| 1196 |
+
"execution": {
|
| 1197 |
+
"iopub.execute_input": "2025-02-06T13:42:32.447860Z",
|
| 1198 |
+
"iopub.status.busy": "2025-02-06T13:42:32.446832Z",
|
| 1199 |
+
"iopub.status.idle": "2025-02-06T13:42:32.453827Z",
|
| 1200 |
+
"shell.execute_reply": "2025-02-06T13:42:32.453271Z",
|
| 1201 |
+
"shell.execute_reply.started": "2025-02-06T13:42:32.447821Z"
|
| 1202 |
+
}
|
| 1203 |
+
},
|
| 1204 |
+
"source": [
|
| 1205 |
+
"### Step 3 Create Subsets: Checkpoint only, POI True, POI False"
|
| 1206 |
+
]
|
| 1207 |
+
},
|
| 1208 |
+
{
|
| 1209 |
+
"cell_type": "code",
|
| 1210 |
+
"execution_count": 11,
|
| 1211 |
+
"id": "ae894e26-4984-40e6-80a6-54fb6b61c873",
|
| 1212 |
+
"metadata": {
|
| 1213 |
+
"execution": {
|
| 1214 |
+
"iopub.execute_input": "2025-02-08T19:38:03.089160Z",
|
| 1215 |
+
"iopub.status.busy": "2025-02-08T19:38:03.088490Z",
|
| 1216 |
+
"iopub.status.idle": "2025-02-08T19:38:03.093067Z",
|
| 1217 |
+
"shell.execute_reply": "2025-02-08T19:38:03.092634Z",
|
| 1218 |
+
"shell.execute_reply.started": "2025-02-08T19:38:03.089138Z"
|
| 1219 |
+
}
|
| 1220 |
+
},
|
| 1221 |
+
"outputs": [],
|
| 1222 |
+
"source": [
|
| 1223 |
+
"file_path = current_dir.parent / 'data/CSV/Civiverse-Models.csv' # Update this with your actual file path\n",
|
| 1224 |
+
"(current_dir.parent / 'data/CSV/model_subsets').mkdir(parents=True, exist_ok=True)\n"
|
| 1225 |
+
]
|
| 1226 |
+
},
|
| 1227 |
+
{
|
| 1228 |
+
"cell_type": "code",
|
| 1229 |
+
"execution_count": 13,
|
| 1230 |
+
"id": "88438f88-c723-423e-a4e5-e07f59096b72",
|
| 1231 |
+
"metadata": {
|
| 1232 |
+
"execution": {
|
| 1233 |
+
"iopub.execute_input": "2025-02-08T19:41:39.279495Z",
|
| 1234 |
+
"iopub.status.busy": "2025-02-08T19:41:39.279039Z",
|
| 1235 |
+
"iopub.status.idle": "2025-02-08T19:41:39.674445Z",
|
| 1236 |
+
"shell.execute_reply": "2025-02-08T19:41:39.673981Z",
|
| 1237 |
+
"shell.execute_reply.started": "2025-02-08T19:41:39.279476Z"
|
| 1238 |
+
}
|
| 1239 |
+
},
|
| 1240 |
+
"outputs": [
|
| 1241 |
+
{
|
| 1242 |
+
"name": "stdout",
|
| 1243 |
+
"output_type": "stream",
|
| 1244 |
+
"text": [
|
| 1245 |
+
"Files saved successfully!\n"
|
| 1246 |
+
]
|
| 1247 |
+
}
|
| 1248 |
+
],
|
| 1249 |
+
"source": [
|
| 1250 |
+
"import pandas as pd\n",
|
| 1251 |
+
"\n",
|
| 1252 |
+
"# Load the dataset\n",
|
| 1253 |
+
"\n",
|
| 1254 |
+
"data = pd.read_csv(file_path)\n",
|
| 1255 |
+
"\n",
|
| 1256 |
+
"# Version 1: Only 'poi' true models\n",
|
| 1257 |
+
"poi_true_models = data[data['poi'] == True]\n",
|
| 1258 |
+
"\n",
|
| 1259 |
+
"# Version 2: Only types lora, dora, locon, textual inversion\n",
|
| 1260 |
+
"specific_types = ['LORA', 'DORA', 'LOCON', 'textualInversion']\n",
|
| 1261 |
+
"adapters = data[data['type'].isin(specific_types)]\n",
|
| 1262 |
+
"\n",
|
| 1263 |
+
"# Version 3: Only type checkpoint\n",
|
| 1264 |
+
"checkpoint_models = data[data['type'] == 'Checkpoint']\n",
|
| 1265 |
+
"\n",
|
| 1266 |
+
"# Version 4: All models apart from 'poi' true\n",
|
| 1267 |
+
"non_poi_models = data[data['poi'] != True]\n",
|
| 1268 |
+
"\n",
|
| 1269 |
+
"# Version 5: All models apart from 'poi' true and with nsfwLevel below 13\n",
|
| 1270 |
+
"non_poi_low_nsfw_models = data[(data['poi'] != True) & (data['nsfwLevel'] < 13)]\n",
|
| 1271 |
+
"\n",
|
| 1272 |
+
"# Save the versions as separate CSV files\n",
|
| 1273 |
+
"poi_true_models.to_csv(current_dir.parent / 'data/CSV/model_subsets/Civiverse_adapters_poi_true.csv', index=False)\n",
|
| 1274 |
+
"adapters.to_csv(current_dir.parent / 'data/CSV/adapters.csv', index=False)\n",
|
| 1275 |
+
"checkpoint_models.to_csv(current_dir.parent / 'data/CSV/model_subsets/Civiverse_checkpoint_only.csv', index=False)\n",
|
| 1276 |
+
"non_poi_models.to_csv(current_dir.parent / 'data/CSV/model_subsets/Civiverse_adapters_poi_false.csv', index=False)\n",
|
| 1277 |
+
"\n",
|
| 1278 |
+
"print(\"Files saved successfully!\")\n"
|
| 1279 |
+
]
|
| 1280 |
+
},
|
| 1281 |
+
{
|
| 1282 |
+
"cell_type": "code",
|
| 1283 |
+
"execution_count": null,
|
| 1284 |
+
"id": "49e35088-8b2e-4189-83e1-3098d55dcad2",
|
| 1285 |
+
"metadata": {},
|
| 1286 |
+
"outputs": [],
|
| 1287 |
+
"source": [
|
| 1288 |
+
"import pandas as pd\n",
|
| 1289 |
+
"import os\n",
|
| 1290 |
+
"\n",
|
| 1291 |
+
"# Load the dataset\n",
|
| 1292 |
+
"df = pd.read_csv('data/all_models_with_tags.csv')\n",
|
| 1293 |
+
"\n",
|
| 1294 |
+
"# Filter for rows where poi is True\n",
|
| 1295 |
+
"filtered_df = df[df['poi'] == True]\n",
|
| 1296 |
+
"os.makedirs('data/model_subsets', exist_ok=True)\n",
|
| 1297 |
+
"\n",
|
| 1298 |
+
"# Save the filtered DataFrame to a new CSV file\n",
|
| 1299 |
+
"filtered_df.to_csv('data/model_subsets/all_models_poi.csv', index=False)\n"
|
| 1300 |
+
]
|
| 1301 |
+
},
|
| 1302 |
+
{
|
| 1303 |
+
"cell_type": "code",
|
| 1304 |
+
"execution_count": null,
|
| 1305 |
+
"id": "06c15f2c",
|
| 1306 |
+
"metadata": {},
|
| 1307 |
+
"outputs": [],
|
| 1308 |
+
"source": [
|
| 1309 |
+
"import pandas as pd\n",
|
| 1310 |
+
"import os\n",
|
| 1311 |
+
"\n",
|
| 1312 |
+
"# Load the dataset\n",
|
| 1313 |
+
"df = pd.read_csv('data/all_models_with_tags.csv')\n",
|
| 1314 |
+
"\n",
|
| 1315 |
+
"# Filter for rows where poi is True\n",
|
| 1316 |
+
"filtered_df = df[df['poi'] == False]\n",
|
| 1317 |
+
"os.makedirs('data/model_subsets', exist_ok=True)\n",
|
| 1318 |
+
"\n",
|
| 1319 |
+
"# Save the filtered DataFrame to a new CSV file\n",
|
| 1320 |
+
"filtered_df.to_csv('data/model_subsets/all_models_poi_false.csv', index=False)\n"
|
| 1321 |
+
]
|
| 1322 |
+
}
|
| 1323 |
+
],
|
| 1324 |
+
"metadata": {
|
| 1325 |
+
"kernelspec": {
|
| 1326 |
+
"display_name": "latm",
|
| 1327 |
+
"language": "python",
|
| 1328 |
+
"name": "python3"
|
| 1329 |
+
},
|
| 1330 |
+
"language_info": {
|
| 1331 |
+
"codemirror_mode": {
|
| 1332 |
+
"name": "ipython",
|
| 1333 |
+
"version": 3
|
| 1334 |
+
},
|
| 1335 |
+
"file_extension": ".py",
|
| 1336 |
+
"mimetype": "text/x-python",
|
| 1337 |
+
"name": "python",
|
| 1338 |
+
"nbconvert_exporter": "python",
|
| 1339 |
+
"pygments_lexer": "ipython3",
|
| 1340 |
+
"version": "3.10.15"
|
| 1341 |
+
}
|
| 1342 |
+
},
|
| 1343 |
+
"nbformat": 4,
|
| 1344 |
+
"nbformat_minor": 5
|
| 1345 |
+
}
|
jupyter_notebooks/0_Scraping_model_metadata.ipynb
ADDED
|
@@ -0,0 +1,643 @@
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "1111ea95-d385-49b9-a4d9-ef886ace5c7a",
|
| 6 |
+
"metadata": {
|
| 7 |
+
"execution": {
|
| 8 |
+
"iopub.execute_input": "2025-02-06T11:24:25.566747Z",
|
| 9 |
+
"iopub.status.busy": "2025-02-06T11:24:25.566066Z",
|
| 10 |
+
"iopub.status.idle": "2025-02-06T11:24:25.571748Z",
|
| 11 |
+
"shell.execute_reply": "2025-02-06T11:24:25.571305Z",
|
| 12 |
+
"shell.execute_reply.started": "2025-02-06T11:24:25.566705Z"
|
| 13 |
+
}
|
| 14 |
+
},
|
| 15 |
+
"source": [
|
| 16 |
+
"# 0 Scraping Metadata and Dataset consolidation\n"
|
| 17 |
+
]
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"cell_type": "markdown",
|
| 21 |
+
"id": "6632505a-e7ca-4463-9ffc-e36fad42235f",
|
| 22 |
+
"metadata": {},
|
| 23 |
+
"source": [
|
| 24 |
+
"## IMAGES\n",
|
| 25 |
+
"---"
|
| 26 |
+
]
|
| 27 |
+
},
|
| 28 |
+
{
|
| 29 |
+
"cell_type": "markdown",
|
| 30 |
+
"id": "e3388bac-bb71-40bc-a693-9ac7a2d5f32c",
|
| 31 |
+
"metadata": {
|
| 32 |
+
"execution": {
|
| 33 |
+
"iopub.execute_input": "2025-02-06T10:08:22.229784Z",
|
| 34 |
+
"iopub.status.busy": "2025-02-06T10:08:22.229287Z",
|
| 35 |
+
"iopub.status.idle": "2025-02-06T10:08:22.232210Z",
|
| 36 |
+
"shell.execute_reply": "2025-02-06T10:08:22.231793Z",
|
| 37 |
+
"shell.execute_reply.started": "2025-02-06T10:08:22.229766Z"
|
| 38 |
+
}
|
| 39 |
+
},
|
| 40 |
+
"source": [
|
| 41 |
+
"### Step 1: Image metadata scraping, sorting and CSV consolidation"
|
| 42 |
+
]
|
| 43 |
+
},
|
| 44 |
+
{
|
| 45 |
+
"cell_type": "code",
|
| 46 |
+
"execution_count": 1,
|
| 47 |
+
"id": "f8decb63-43f5-4731-823d-94632eee7618",
|
| 48 |
+
"metadata": {
|
| 49 |
+
"execution": {
|
| 50 |
+
"iopub.execute_input": "2025-02-08T19:37:51.115763Z",
|
| 51 |
+
"iopub.status.busy": "2025-02-08T19:37:51.114573Z",
|
| 52 |
+
"iopub.status.idle": "2025-02-08T19:37:51.170027Z",
|
| 53 |
+
"shell.execute_reply": "2025-02-08T19:37:51.169401Z",
|
| 54 |
+
"shell.execute_reply.started": "2025-02-08T19:37:51.115738Z"
|
| 55 |
+
}
|
| 56 |
+
},
|
| 57 |
+
"outputs": [],
|
| 58 |
+
"source": [
|
| 59 |
+
"import os\n",
|
| 60 |
+
"import json\n",
|
| 61 |
+
"import csv\n",
|
| 62 |
+
"import requests\n",
|
| 63 |
+
"from datetime import datetime\n",
|
| 64 |
+
"import time\n",
|
| 65 |
+
"from pathlib import Path\n",
|
| 66 |
+
"import hashlib\n",
|
| 67 |
+
"import pandas as pd\n",
|
| 68 |
+
"import sys\n",
|
| 69 |
+
"from datetime import datetime, timedelta\n",
|
| 70 |
+
"import shutil"
|
| 71 |
+
]
|
| 72 |
+
},
|
| 73 |
+
{
|
| 74 |
+
"cell_type": "code",
|
| 75 |
+
"execution_count": 2,
|
| 76 |
+
"id": "4b2426c3-96a0-468e-b6dc-78dea9c3e92b",
|
| 77 |
+
"metadata": {
|
| 78 |
+
"execution": {
|
| 79 |
+
"iopub.execute_input": "2025-02-08T19:37:51.809027Z",
|
| 80 |
+
"iopub.status.busy": "2025-02-08T19:37:51.808835Z",
|
| 81 |
+
"iopub.status.idle": "2025-02-08T19:37:51.812922Z",
|
| 82 |
+
"shell.execute_reply": "2025-02-08T19:37:51.812429Z",
|
| 83 |
+
"shell.execute_reply.started": "2025-02-08T19:37:51.809009Z"
|
| 84 |
+
}
|
| 85 |
+
},
|
| 86 |
+
"outputs": [],
|
| 87 |
+
"source": [
|
| 88 |
+
"current_dir = Path.cwd()"
|
| 89 |
+
]
|
| 90 |
+
},
|
| 91 |
+
{
|
| 92 |
+
"cell_type": "markdown",
|
| 93 |
+
"id": "11647bb7-5ce9-414a-8486-5bdce8d9cfea",
|
| 94 |
+
"metadata": {
|
| 95 |
+
"execution": {
|
| 96 |
+
"iopub.execute_input": "2025-02-06T12:49:43.126762Z",
|
| 97 |
+
"iopub.status.busy": "2025-02-06T12:49:43.125797Z",
|
| 98 |
+
"iopub.status.idle": "2025-02-06T12:49:43.129759Z",
|
| 99 |
+
"shell.execute_reply": "2025-02-06T12:49:43.129176Z",
|
| 100 |
+
"shell.execute_reply.started": "2025-02-06T12:49:43.126736Z"
|
| 101 |
+
}
|
| 102 |
+
},
|
| 103 |
+
"source": [
|
| 104 |
+
"## MODELS"
|
| 105 |
+
]
|
| 106 |
+
},
|
| 107 |
+
{
|
| 108 |
+
"cell_type": "markdown",
|
| 109 |
+
"id": "5c9cb5e3-7cea-4574-9319-f3cd89354b1f",
|
| 110 |
+
"metadata": {
|
| 111 |
+
"execution": {
|
| 112 |
+
"iopub.execute_input": "2025-02-06T13:23:38.017078Z",
|
| 113 |
+
"iopub.status.busy": "2025-02-06T13:23:38.016639Z",
|
| 114 |
+
"iopub.status.idle": "2025-02-06T13:23:38.019993Z",
|
| 115 |
+
"shell.execute_reply": "2025-02-06T13:23:38.019549Z",
|
| 116 |
+
"shell.execute_reply.started": "2025-02-06T13:23:38.017053Z"
|
| 117 |
+
}
|
| 118 |
+
},
|
| 119 |
+
"source": [
|
| 120 |
+
"### Step 1: Scrape model metadata"
|
| 121 |
+
]
|
| 122 |
+
},
|
| 123 |
+
{
|
| 124 |
+
"cell_type": "markdown",
|
| 125 |
+
"id": "83e12b9f-5ae1-407d-9754-5979d837f787",
|
| 126 |
+
"metadata": {
|
| 127 |
+
"execution": {
|
| 128 |
+
"iopub.execute_input": "2025-02-06T14:06:04.857874Z",
|
| 129 |
+
"iopub.status.busy": "2025-02-06T14:06:04.857500Z",
|
| 130 |
+
"iopub.status.idle": "2025-02-06T14:06:04.860438Z",
|
| 131 |
+
"shell.execute_reply": "2025-02-06T14:06:04.860030Z",
|
| 132 |
+
"shell.execute_reply.started": "2025-02-06T14:06:04.857856Z"
|
| 133 |
+
}
|
| 134 |
+
},
|
| 135 |
+
"source": [
|
| 136 |
+
"#### The resulting files are written to data/raw/model_metadata as newest_models_N.json (one file per page)"
|
| 137 |
+
]
|
| 138 |
+
},
|
| 139 |
+
{
|
| 140 |
+
"cell_type": "code",
|
| 141 |
+
"execution_count": 12,
|
| 142 |
+
"id": "5db9c00e",
|
| 143 |
+
"metadata": {},
|
| 144 |
+
"outputs": [],
|
| 145 |
+
"source": [
|
| 146 |
+
"key_karussell = current_dir.parent / 'misc/credentials/civitai_api_keys.txt'\n",
|
| 147 |
+
"directory_path = current_dir.parent / 'data/raw/model_metadata/'"
|
| 148 |
+
]
|
| 149 |
+
},
|
| 150 |
+
{
|
| 151 |
+
"cell_type": "code",
|
| 152 |
+
"execution_count": 13,
|
| 153 |
+
"id": "41ee14e0-fb78-4f91-aba1-13faf05af7d8",
|
| 154 |
+
"metadata": {
|
| 155 |
+
"execution": {
|
| 156 |
+
"iopub.execute_input": "2025-02-08T19:37:56.687832Z",
|
| 157 |
+
"iopub.status.busy": "2025-02-08T19:37:56.687251Z",
|
| 158 |
+
"iopub.status.idle": "2025-02-08T19:37:56.696572Z",
|
| 159 |
+
"shell.execute_reply": "2025-02-08T19:37:56.696059Z",
|
| 160 |
+
"shell.execute_reply.started": "2025-02-08T19:37:56.687809Z"
|
| 161 |
+
}
|
| 162 |
+
},
|
| 163 |
+
"outputs": [],
|
| 164 |
+
"source": [
|
| 165 |
+
"import datetime\n",
|
| 166 |
+
"\n",
|
| 167 |
+
"def load_api_keys():\n",
|
| 168 |
+
" \"\"\"Load API keys from a text file, one per line.\"\"\"\n",
|
| 169 |
+
" if not os.path.exists(key_karussell):\n",
|
| 170 |
+
"        raise FileNotFoundError(f\"API key file '{key_karussell}' not found!\")\n",
|
| 171 |
+
" \n",
|
| 172 |
+
" with open(key_karussell, 'r') as file:\n",
|
| 173 |
+
" keys = [line.strip() for line in file if line.strip()]\n",
|
| 174 |
+
" \n",
|
| 175 |
+
" if not keys:\n",
|
| 176 |
+
" raise ValueError(\"No API keys found in the file!\")\n",
|
| 177 |
+
" \n",
|
| 178 |
+
" return keys\n",
|
| 179 |
+
"\n",
|
| 180 |
+
"def get_model_metadata():\n",
|
| 181 |
+
" base_url = \"https://civitai.com/api/v1/models\"\n",
|
| 182 |
+
" params = {\"sort\": \"Newest\", \"nsfw\": True}\n",
|
| 183 |
+
"\n",
|
| 184 |
+
" # Load API keys\n",
|
| 185 |
+
" api_keys = load_api_keys()\n",
|
| 186 |
+
" key_index = 0 # Start with the first key\n",
|
| 187 |
+
"\n",
|
| 188 |
+
" page_counter = 0\n",
|
| 189 |
+
" max_pages = 300000000 # Adjust as needed\n",
|
| 190 |
+
" os.makedirs(directory_path, exist_ok=True)\n",
|
| 191 |
+
"\n",
|
| 192 |
+
" while True:\n",
|
| 193 |
+
" if page_counter >= max_pages:\n",
|
| 194 |
+
" print(f\"Reached the limit of {max_pages} pages.\")\n",
|
| 195 |
+
" break\n",
|
| 196 |
+
"\n",
|
| 197 |
+
" headers = {\n",
|
| 198 |
+
" \"Accept\": \"application/json\",\n",
|
| 199 |
+
" \"Authorization\": f\"Bearer {api_keys[key_index]}\"\n",
|
| 200 |
+
" }\n",
|
| 201 |
+
"\n",
|
| 202 |
+
" response = requests.get(base_url, headers=headers, params=params)\n",
|
| 203 |
+
"\n",
|
| 204 |
+
" if response.status_code == 200:\n",
|
| 205 |
+
" data = response.json()\n",
|
| 206 |
+
" page_counter += 1\n",
|
| 207 |
+
"\n",
|
| 208 |
+
" # Add timestamp\n",
|
| 209 |
+
" formatted_timestamp = datetime.datetime.now().strftime(\"data obtained on the %d.%m.%Y at %H:%M CEST\")\n",
|
| 210 |
+
"\n",
|
| 211 |
+
" data['timestamp'] = formatted_timestamp\n",
|
| 212 |
+
"\n",
|
| 213 |
+
" # Save data to file\n",
|
| 214 |
+
" file_path = os.path.join(directory_path, f'newest_models_{page_counter}.json')\n",
|
| 215 |
+
" with open(file_path, 'w', encoding='utf-8') as file:\n",
|
| 216 |
+
" json.dump(data, file, indent=4)\n",
|
| 217 |
+
"\n",
|
| 218 |
+
" # Check for nextCursor\n",
|
| 219 |
+
" next_cursor = data.get('metadata', {}).get('nextCursor')\n",
|
| 220 |
+
" if not next_cursor:\n",
|
| 221 |
+
" print(\"No more data available.\")\n",
|
| 222 |
+
" break\n",
|
| 223 |
+
" else:\n",
|
| 224 |
+
" params['cursor'] = next_cursor\n",
|
| 225 |
+
" \n",
|
| 226 |
+
" elif response.status_code in (401, 403): # Unauthorized or Forbidden\n",
|
| 227 |
+
" print(f\"API Key {key_index + 1} failed with status {response.status_code}. Trying next key...\")\n",
|
| 228 |
+
" key_index += 1\n",
|
| 229 |
+
"\n",
|
| 230 |
+
" if key_index >= len(api_keys):\n",
|
| 231 |
+
" print(\"All API keys failed. Exiting.\")\n",
|
| 232 |
+
" break # Stop if all keys fail\n",
|
| 233 |
+
" \n",
|
| 234 |
+
" else:\n",
|
| 235 |
+
" print(f\"Failed to fetch data: HTTP {response.status_code}\")\n",
|
| 236 |
+
" break # Stop on other errors\n"
|
| 237 |
+
]
|
| 238 |
+
},
|
| 239 |
+
{
|
| 240 |
+
"cell_type": "markdown",
|
| 241 |
+
"id": "8f43ce23-5986-4b67-be2d-453831af9a6e",
|
| 242 |
+
"metadata": {},
|
| 243 |
+
"source": [
|
| 244 |
+
"Run the cell below to scrape the model metadata; comment it out to skip the API requests."
|
| 245 |
+
]
|
| 246 |
+
},
|
| 247 |
+
{
|
| 248 |
+
"cell_type": "code",
|
| 249 |
+
"execution_count": 14,
|
| 250 |
+
"id": "3f95b4ba-5742-4268-b2e8-de9145faf495",
|
| 251 |
+
"metadata": {
|
| 252 |
+
"execution": {
|
| 253 |
+
"iopub.execute_input": "2025-02-08T19:37:58.117386Z",
|
| 254 |
+
"iopub.status.busy": "2025-02-08T19:37:58.117158Z",
|
| 255 |
+
"iopub.status.idle": "2025-02-08T19:37:58.121162Z",
|
| 256 |
+
"shell.execute_reply": "2025-02-08T19:37:58.120580Z",
|
| 257 |
+
"shell.execute_reply.started": "2025-02-08T19:37:58.117369Z"
|
| 258 |
+
}
|
| 259 |
+
},
|
| 260 |
+
"outputs": [
|
| 261 |
+
{
|
| 262 |
+
"name": "stdout",
|
| 263 |
+
"output_type": "stream",
|
| 264 |
+
"text": [
|
| 265 |
+
"Failed to fetch data: HTTP 500\n"
|
| 266 |
+
]
|
| 267 |
+
}
|
| 268 |
+
],
|
| 269 |
+
"source": [
|
| 270 |
+
"get_model_metadata()"
|
| 271 |
+
]
|
| 272 |
+
},
|
| 273 |
+
{
|
| 274 |
+
"cell_type": "markdown",
|
| 275 |
+
"id": "1ed5f44e-612e-4b6d-8974-904fb3e058d6",
|
| 276 |
+
"metadata": {
|
| 277 |
+
"execution": {
|
| 278 |
+
"iopub.execute_input": "2025-02-06T12:52:40.162173Z",
|
| 279 |
+
"iopub.status.busy": "2025-02-06T12:52:40.159989Z",
|
| 280 |
+
"iopub.status.idle": "2025-02-06T12:52:40.170634Z",
|
| 281 |
+
"shell.execute_reply": "2025-02-06T12:52:40.169945Z",
|
| 282 |
+
"shell.execute_reply.started": "2025-02-06T12:52:40.162124Z"
|
| 283 |
+
}
|
| 284 |
+
},
|
| 285 |
+
"source": [
|
| 286 |
+
"### Step 2: Consolidate model dataset CSV"
|
| 287 |
+
]
|
| 288 |
+
},
|
| 289 |
+
{
|
| 290 |
+
"cell_type": "code",
|
| 291 |
+
"execution_count": 8,
|
| 292 |
+
"id": "464d82f5-c24e-4b53-9b50-682fa4bf3430",
|
| 293 |
+
"metadata": {
|
| 294 |
+
"execution": {
|
| 295 |
+
"iopub.execute_input": "2025-02-08T19:37:59.052245Z",
|
| 296 |
+
"iopub.status.busy": "2025-02-08T19:37:59.051645Z",
|
| 297 |
+
"iopub.status.idle": "2025-02-08T19:37:59.056017Z",
|
| 298 |
+
"shell.execute_reply": "2025-02-08T19:37:59.055598Z",
|
| 299 |
+
"shell.execute_reply.started": "2025-02-08T19:37:59.052224Z"
|
| 300 |
+
}
|
| 301 |
+
},
|
| 302 |
+
"outputs": [],
|
| 303 |
+
"source": [
|
| 304 |
+
"## path thingy\n",
|
| 305 |
+
"try: #scripts\n",
|
| 306 |
+
" current_dir = Path(__file__).resolve().parent\n",
|
| 307 |
+
"except NameError:\n",
|
| 308 |
+
" # jupyter\n",
|
| 309 |
+
" current_dir = Path.cwd()"
|
| 310 |
+
]
|
| 311 |
+
},
|
| 312 |
+
{
|
| 313 |
+
"cell_type": "code",
|
| 314 |
+
"execution_count": null,
|
| 315 |
+
"id": "14f3db4d-ef93-4629-8a27-971adb49248b",
|
| 316 |
+
"metadata": {
|
| 317 |
+
"execution": {
|
| 318 |
+
"iopub.execute_input": "2025-02-08T19:37:59.443457Z",
|
| 319 |
+
"iopub.status.busy": "2025-02-08T19:37:59.443236Z",
|
| 320 |
+
"iopub.status.idle": "2025-02-08T19:37:59.890103Z",
|
| 321 |
+
"shell.execute_reply": "2025-02-08T19:37:59.889571Z",
|
| 322 |
+
"shell.execute_reply.started": "2025-02-08T19:37:59.443439Z"
|
| 323 |
+
}
|
| 324 |
+
},
|
| 325 |
+
"outputs": [
|
| 326 |
+
{
|
| 327 |
+
"name": "stdout",
|
| 328 |
+
"output_type": "stream",
|
| 329 |
+
"text": [
|
| 330 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_2.json\n",
|
| 331 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_1.json\n",
|
| 332 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_4.json\n",
|
| 333 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_6.json\n",
|
| 334 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_7.json\n",
|
| 335 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_5.json\n",
|
| 336 |
+
"Processing file: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/model_metadata/newest_models_3.json\n"
|
| 337 |
+
]
|
| 338 |
+
}
|
| 339 |
+
],
|
| 340 |
+
"source": [
|
| 341 |
+
"import os\n",
|
| 342 |
+
"import json\n",
|
| 343 |
+
"import pandas as pd\n",
|
| 344 |
+
"import hashlib\n",
|
| 345 |
+
"from pathlib import Path\n",
|
| 346 |
+
"from datetime import datetime, timezone\n",
|
| 347 |
+
"\n",
|
| 348 |
+
"\n",
|
| 349 |
+
"\n",
|
| 350 |
+
"\n",
|
| 351 |
+
"def hash_username(username):\n",
|
| 352 |
+
" return hashlib.sha256(username.encode('utf-8')).hexdigest()[:16]\n",
|
| 353 |
+
"\n",
|
| 354 |
+
"def parse_date(date_str):\n",
|
| 355 |
+
" try:\n",
|
| 356 |
+
" return datetime.fromisoformat(date_str.replace('Z', '+00:00'))\n",
|
| 357 |
+
" except Exception:\n",
|
| 358 |
+
" return datetime.min.replace(tzinfo=timezone.utc) # Make it timezone-aware\n",
|
| 359 |
+
"\n",
|
| 360 |
+
"def get_latest_model_version(model_versions):\n",
|
| 361 |
+
" return max(model_versions, key=lambda mv: parse_date(mv.get('publishedAt', '')))\n",
|
| 362 |
+
"\n",
|
| 363 |
+
"def process_directory_recursively(root_dir):\n",
|
| 364 |
+
" root_path = Path(root_dir)\n",
|
| 365 |
+
" seen = {} # id -> (publishedAt, record)\n",
|
| 366 |
+
" data_records = []\n",
|
| 367 |
+
"\n",
|
| 368 |
+
" for json_file in root_path.rglob('*.json'):\n",
|
| 369 |
+
" if not json_file.is_file():\n",
|
| 370 |
+
" continue\n",
|
| 371 |
+
"\n",
|
| 372 |
+
" #print(f\"Processing file: {json_file}\")\n",
|
| 373 |
+
" try:\n",
|
| 374 |
+
" with open(json_file, 'r', encoding='utf-8') as f:\n",
|
| 375 |
+
" data = json.load(f)\n",
|
| 376 |
+
" except Exception as e:\n",
|
| 377 |
+
" print(f\"Failed to load {json_file}: {e}\")\n",
|
| 378 |
+
" continue\n",
|
| 379 |
+
"\n",
|
| 380 |
+
" items = data.get('items') or data.get('data') or []\n",
|
| 381 |
+
" for item in items:\n",
|
| 382 |
+
" if not isinstance(item, dict):\n",
|
| 383 |
+
" continue\n",
|
| 384 |
+
"\n",
|
| 385 |
+
" model_id = item.get('id')\n",
|
| 386 |
+
" model_versions = item.get('modelVersions', [])\n",
|
| 387 |
+
" if not model_versions:\n",
|
| 388 |
+
" continue\n",
|
| 389 |
+
"\n",
|
| 390 |
+
" latest_version = get_latest_model_version(model_versions)\n",
|
| 391 |
+
" published_at = latest_version.get('publishedAt', '')\n",
|
| 392 |
+
" current_dt = parse_date(published_at)\n",
|
| 393 |
+
"\n",
|
| 394 |
+
" if model_id in seen and current_dt <= seen[model_id][0]:\n",
|
| 395 |
+
" continue\n",
|
| 396 |
+
" seen[model_id] = (current_dt, item)\n",
|
| 397 |
+
"\n",
|
| 398 |
+
" for model_id, (_, item) in seen.items():\n",
|
| 399 |
+
" model_versions = item.get('modelVersions', [])\n",
|
| 400 |
+
" latest_version = get_latest_model_version(model_versions)\n",
|
| 401 |
+
" version_ids = [mv.get('id', '') for mv in model_versions[:20]]\n",
|
| 402 |
+
"\n",
|
| 403 |
+
" files = latest_version.get('files', [])\n",
|
| 404 |
+
" auto_hashes = files[0].get('hashes', {}) if files else {}\n",
|
| 405 |
+
" images = latest_version.get('images', [])\n",
|
| 406 |
+
" first_image_url = images[0]['url'] if images else ''\n",
|
| 407 |
+
" latest_image_url = images[-1]['url'] if images else ''\n",
|
| 408 |
+
"\n",
|
| 409 |
+
" username = item.get('creator', {}).get('username', '')\n",
|
| 410 |
+
" record = {\n",
|
| 411 |
+
" 'id': item.get('id', ''),\n",
|
| 412 |
+
" 'name': item.get('name', ''),\n",
|
| 413 |
+
" 'type': item.get('type', ''),\n",
|
| 414 |
+
" 'baseModel': latest_version.get('baseModel', ''),\n",
|
| 415 |
+
" 'downloadCount': item.get('stats', {}).get('downloadCount', 0),\n",
|
| 416 |
+
" 'nsfwLevel': item.get('nsfwLevel', 0),\n",
|
| 417 |
+
" 'modelVersions': len(model_versions),\n",
|
| 418 |
+
" 'publishedAt': latest_version.get('publishedAt', ''),\n",
|
| 419 |
+
" 'usernameHash': hash_username(username) if username else '',\n",
|
| 420 |
+
" 'downloadUrl': latest_version.get('downloadUrl', ''),\n",
|
| 421 |
+
" 'firstImageUrl': first_image_url,\n",
|
| 422 |
+
" 'latestImageUrl': latest_image_url,\n",
|
| 423 |
+
" 'poi': item.get('poi', False),\n",
|
| 424 |
+
" 'AutoV1': auto_hashes.get('AutoV1', ''),\n",
|
| 425 |
+
" 'AutoV2': auto_hashes.get('AutoV2', ''),\n",
|
| 426 |
+
" 'AutoV3': auto_hashes.get('AutoV3', ''),\n",
|
| 427 |
+
" 'SHA256': auto_hashes.get('SHA256', ''),\n",
|
| 428 |
+
" 'CRC32': auto_hashes.get('CRC32', ''),\n",
|
| 429 |
+
" 'BLAKE3': auto_hashes.get('BLAKE3', ''),\n",
|
| 430 |
+
" 'previewImage': latest_image_url\n",
|
| 431 |
+
" }\n",
|
| 432 |
+
"\n",
|
| 433 |
+
" for i in range(20):\n",
|
| 434 |
+
" record[f'version_id_{i+1}'] = version_ids[i] if i < len(version_ids) else ''\n",
|
| 435 |
+
"\n",
|
| 436 |
+
" tags = item.get('tags', [])\n",
|
| 437 |
+
" for i in range(7):\n",
|
| 438 |
+
" record[f'tag_{i+1}'] = tags[i] if i < len(tags) else ''\n",
|
| 439 |
+
"\n",
|
| 440 |
+
" data_records.append(record)\n",
|
| 441 |
+
"\n",
|
| 442 |
+
" return pd.DataFrame(data_records)\n",
|
| 443 |
+
"\n",
|
| 444 |
+
"# Usage Example\n",
|
| 445 |
+
"root_directory = current_dir.parent / 'data/raw/model_metadata/'\n",
|
| 446 |
+
"df = process_directory_recursively(root_directory)\n",
|
| 447 |
+
"df_sorted = df.sort_values(by='downloadCount', ascending=False)\n",
|
| 448 |
+
"\n",
|
| 449 |
+
"# Optionally save\n",
|
| 450 |
+
"# df_sorted.to_csv('combined_metadata.csv', index=False)\n"
|
| 451 |
+
]
|
| 452 |
+
},
|
| 453 |
+
{
|
| 454 |
+
"cell_type": "markdown",
|
| 455 |
+
"id": "2d6f822f-754f-4feb-9ed6-6f868421c454",
|
| 456 |
+
"metadata": {},
|
| 457 |
+
"source": [
|
| 458 |
+
"### Save model-data to CSV"
|
| 459 |
+
]
|
| 460 |
+
},
|
| 461 |
+
{
|
| 462 |
+
"cell_type": "code",
|
| 463 |
+
"execution_count": null,
|
| 464 |
+
"id": "f861cf7b-5eb3-46ad-ab2d-9e2fdf2169b4",
|
| 465 |
+
"metadata": {
|
| 466 |
+
"execution": {
|
| 467 |
+
"iopub.execute_input": "2025-02-08T19:38:02.047421Z",
|
| 468 |
+
"iopub.status.busy": "2025-02-08T19:38:02.046761Z",
|
| 469 |
+
"iopub.status.idle": "2025-02-08T19:38:02.193381Z",
|
| 470 |
+
"shell.execute_reply": "2025-02-08T19:38:02.192886Z",
|
| 471 |
+
"shell.execute_reply.started": "2025-02-08T19:38:02.047397Z"
|
| 472 |
+
}
|
| 473 |
+
},
|
| 474 |
+
"outputs": [
|
| 475 |
+
{
|
| 476 |
+
"name": "stdout",
|
| 477 |
+
"output_type": "stream",
|
| 478 |
+
"text": [
|
| 479 |
+
"Data has been saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/Civiverse-Models.csv\n"
|
| 480 |
+
]
|
| 481 |
+
}
|
| 482 |
+
],
|
| 483 |
+
"source": [
|
| 484 |
+
"output_csv = current_dir.parent / 'data/CSV/Civiverse-Models_2025.csv'\n",
|
| 485 |
+
"output_csv.parent.mkdir(parents=True, exist_ok=True)\n",
|
| 486 |
+
"df_sorted.to_csv(output_csv, index=False)\n",
|
| 487 |
+
"print(f\"Data has been saved to {output_csv}\")"
|
| 488 |
+
]
|
| 489 |
+
},
|
| 490 |
+
{
|
| 491 |
+
"cell_type": "markdown",
|
| 492 |
+
"id": "9b948933-2a2f-42b0-9083-2d81586ae3f2",
|
| 493 |
+
"metadata": {
|
| 494 |
+
"execution": {
|
| 495 |
+
"iopub.execute_input": "2025-02-06T13:42:32.447860Z",
|
| 496 |
+
"iopub.status.busy": "2025-02-06T13:42:32.446832Z",
|
| 497 |
+
"iopub.status.idle": "2025-02-06T13:42:32.453827Z",
|
| 498 |
+
"shell.execute_reply": "2025-02-06T13:42:32.453271Z",
|
| 499 |
+
"shell.execute_reply.started": "2025-02-06T13:42:32.447821Z"
|
| 500 |
+
}
|
| 501 |
+
},
|
| 502 |
+
"source": [
|
| 503 |
+
"### Step 3: Create subsets (checkpoint only, POI true, POI false)"
|
| 504 |
+
]
|
| 505 |
+
},
|
| 506 |
+
{
|
| 507 |
+
"cell_type": "code",
|
| 508 |
+
"execution_count": 11,
|
| 509 |
+
"id": "ae894e26-4984-40e6-80a6-54fb6b61c873",
|
| 510 |
+
"metadata": {
|
| 511 |
+
"execution": {
|
| 512 |
+
"iopub.execute_input": "2025-02-08T19:38:03.089160Z",
|
| 513 |
+
"iopub.status.busy": "2025-02-08T19:38:03.088490Z",
|
| 514 |
+
"iopub.status.idle": "2025-02-08T19:38:03.093067Z",
|
| 515 |
+
"shell.execute_reply": "2025-02-08T19:38:03.092634Z",
|
| 516 |
+
"shell.execute_reply.started": "2025-02-08T19:38:03.089138Z"
|
| 517 |
+
}
|
| 518 |
+
},
|
| 519 |
+
"outputs": [],
|
| 520 |
+
"source": [
|
| 521 |
+
"file_path = current_dir.parent / 'data/CSV/Civiverse-Models.csv' # Update this with your actual file path\n",
|
| 522 |
+
"(current_dir.parent / 'data/CSV/model_subsets').mkdir(parents=True, exist_ok=True)\n"
|
| 523 |
+
]
|
| 524 |
+
},
|
| 525 |
+
{
|
| 526 |
+
"cell_type": "code",
|
| 527 |
+
"execution_count": 13,
|
| 528 |
+
"id": "88438f88-c723-423e-a4e5-e07f59096b72",
|
| 529 |
+
"metadata": {
|
| 530 |
+
"execution": {
|
| 531 |
+
"iopub.execute_input": "2025-02-08T19:41:39.279495Z",
|
| 532 |
+
"iopub.status.busy": "2025-02-08T19:41:39.279039Z",
|
| 533 |
+
"iopub.status.idle": "2025-02-08T19:41:39.674445Z",
|
| 534 |
+
"shell.execute_reply": "2025-02-08T19:41:39.673981Z",
|
| 535 |
+
"shell.execute_reply.started": "2025-02-08T19:41:39.279476Z"
|
| 536 |
+
}
|
| 537 |
+
},
|
| 538 |
+
"outputs": [
|
| 539 |
+
{
|
| 540 |
+
"name": "stdout",
|
| 541 |
+
"output_type": "stream",
|
| 542 |
+
"text": [
|
| 543 |
+
"Files saved successfully!\n"
|
| 544 |
+
]
|
| 545 |
+
}
|
| 546 |
+
],
|
| 547 |
+
"source": [
|
| 548 |
+
"import pandas as pd\n",
|
| 549 |
+
"\n",
|
| 550 |
+
"# Load the dataset\n",
|
| 551 |
+
"\n",
|
| 552 |
+
"data = pd.read_csv(file_path)\n",
|
| 553 |
+
"\n",
|
| 554 |
+
"# Version 1: Only 'poi' true models\n",
|
| 555 |
+
"poi_true_models = data[data['poi'] == True]\n",
|
| 556 |
+
"\n",
|
| 557 |
+
"# Version 2: Only types lora, dora, locon, textual inversion\n",
|
| 558 |
+
"specific_types = ['LORA', 'DoRA', 'LoCon', 'TextualInversion']  # isin() is case-sensitive; spelling follows the Civitai model type names\n",
|
| 559 |
+
"adapters = data[data['type'].isin(specific_types)]\n",
|
| 560 |
+
"\n",
|
| 561 |
+
"# Version 3: Only type checkpoint\n",
|
| 562 |
+
"checkpoint_models = data[data['type'] == 'Checkpoint']\n",
|
| 563 |
+
"\n",
|
| 564 |
+
"# Version 4: All models apart from 'poi' true\n",
|
| 565 |
+
"non_poi_models = data[data['poi'] != True]\n",
|
| 566 |
+
"\n",
|
| 567 |
+
"# Version 5: All models apart from 'poi' true and with nsfwLevel below 13\n",
|
| 568 |
+
"non_poi_low_nsfw_models = data[(data['poi'] != True) & (data['nsfwLevel'] < 13)]\n",
|
| 569 |
+
"\n",
|
| 570 |
+
"# Save the versions as separate CSV files\n",
|
| 571 |
+
"poi_true_models.to_csv(current_dir.parent / 'data/CSV/model_subsets/Civiverse_adapters_poi_true.csv', index=False)\n",
|
| 572 |
+
"adapters.to_csv(current_dir.parent / 'data/CSV/adapters.csv', index=False)\n",
|
| 573 |
+
"checkpoint_models.to_csv(current_dir.parent / 'data/CSV/model_subsets/Civiverse_checkpoint_only.csv', index=False)\n",
|
| 574 |
+
"non_poi_models.to_csv(current_dir.parent / 'data/CSV/model_subsets/Civiverse_adapters_poi_false.csv', index=False)\n",
|
| 575 |
+
"\n",
|
| 576 |
+
"print(\"Files saved successfully!\")\n"
|
| 577 |
+
]
|
| 578 |
+
},
|
| 579 |
+
{
|
| 580 |
+
"cell_type": "code",
|
| 581 |
+
"execution_count": null,
|
| 582 |
+
"id": "49e35088-8b2e-4189-83e1-3098d55dcad2",
|
| 583 |
+
"metadata": {},
|
| 584 |
+
"outputs": [],
|
| 585 |
+
"source": [
|
| 586 |
+
"import pandas as pd\n",
|
| 587 |
+
"import os\n",
|
| 588 |
+
"\n",
|
| 589 |
+
"# Load the dataset\n",
|
| 590 |
+
"df = pd.read_csv('data/all_models_with_tags.csv')\n",
|
| 591 |
+
"\n",
|
| 592 |
+
"# Filter for rows where poi is True\n",
|
| 593 |
+
"filtered_df = df[df['poi'] == True]\n",
|
| 594 |
+
"os.makedirs('data/model_subsets', exist_ok=True)\n",
|
| 595 |
+
"\n",
|
| 596 |
+
"# Save the filtered DataFrame to a new CSV file\n",
|
| 597 |
+
"filtered_df.to_csv('data/model_subsets/all_models_poi.csv', index=False)\n"
|
| 598 |
+
]
|
| 599 |
+
},
|
| 600 |
+
{
|
| 601 |
+
"cell_type": "code",
|
| 602 |
+
"execution_count": null,
|
| 603 |
+
"id": "06c15f2c",
|
| 604 |
+
"metadata": {},
|
| 605 |
+
"outputs": [],
|
| 606 |
+
"source": [
|
| 607 |
+
"import pandas as pd\n",
|
| 608 |
+
"import os\n",
|
| 609 |
+
"\n",
|
| 610 |
+
"# Load the dataset\n",
|
| 611 |
+
"df = pd.read_csv('data/all_models_with_tags.csv')\n",
|
| 612 |
+
"\n",
|
| 613 |
+
"# Filter for rows where poi is False\n",
|
| 614 |
+
"filtered_df = df[df['poi'] == False]\n",
|
| 615 |
+
"os.makedirs('data/model_subsets', exist_ok=True)\n",
|
| 616 |
+
"\n",
|
| 617 |
+
"# Save the filtered DataFrame to a new CSV file\n",
|
| 618 |
+
"filtered_df.to_csv('data/model_subsets/all_models_poi_false.csv', index=False)\n"
|
| 619 |
+
]
|
| 620 |
+
}
|
| 621 |
+
],
|
| 622 |
+
"metadata": {
|
| 623 |
+
"kernelspec": {
|
| 624 |
+
"display_name": "latm",
|
| 625 |
+
"language": "python",
|
| 626 |
+
"name": "python3"
|
| 627 |
+
},
|
| 628 |
+
"language_info": {
|
| 629 |
+
"codemirror_mode": {
|
| 630 |
+
"name": "ipython",
|
| 631 |
+
"version": 3
|
| 632 |
+
},
|
| 633 |
+
"file_extension": ".py",
|
| 634 |
+
"mimetype": "text/x-python",
|
| 635 |
+
"name": "python",
|
| 636 |
+
"nbconvert_exporter": "python",
|
| 637 |
+
"pygments_lexer": "ipython3",
|
| 638 |
+
"version": "3.10.15"
|
| 639 |
+
}
|
| 640 |
+
},
|
| 641 |
+
"nbformat": 4,
|
| 642 |
+
"nbformat_minor": 5
|
| 643 |
+
}
|
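Note on the consolidation step in `0_Scraping_model_metadata.ipynb`: the notebook keeps one record per model `id`, preferring the entry whose latest version has the newest `publishedAt`. The block below is a minimal standalone sketch of that rule using only the standard library, not the notebook's exact code. The helper name `newest_per_model` and the sample records are hypothetical, and treating naive timestamps as UTC is an assumption added here so that comparisons never mix naive and aware datetimes.

```python
from datetime import datetime, timezone


def parse_date(date_str):
    """Parse an ISO-8601 timestamp; fall back to the earliest aware datetime."""
    try:
        parsed = datetime.fromisoformat(date_str.replace("Z", "+00:00"))
    except (AttributeError, ValueError):
        return datetime.min.replace(tzinfo=timezone.utc)
    # Assumption: naive timestamps are treated as UTC.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def newest_per_model(items):
    """Keep one item per model id: the one whose latest version is newest."""
    seen = {}  # model id -> (latest publishedAt, item)
    for item in items:
        versions = item.get("modelVersions") or []
        if not versions:
            continue  # models without versions are skipped, as in the notebook
        latest = max(parse_date(v.get("publishedAt") or "") for v in versions)
        model_id = item.get("id")
        if model_id not in seen or latest > seen[model_id][0]:
            seen[model_id] = (latest, item)
    return [item for _, item in seen.values()]


# Hypothetical sample data: model 1 appears on two scraped pages.
sample = [
    {"id": 1, "name": "old", "modelVersions": [{"publishedAt": "2024-01-01T00:00:00Z"}]},
    {"id": 1, "name": "new", "modelVersions": [{"publishedAt": "2025-01-01T00:00:00Z"}]},
    {"id": 2, "name": "no-versions", "modelVersions": []},
    {"id": 3, "name": "undated", "modelVersions": [{}]},
]
kept = newest_per_model(sample)
print([m["name"] for m in kept])
```

Because duplicates are resolved per `id` before any DataFrame is built, re-scraping the same pages should not inflate the row count of the resulting CSV.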
jupyter_notebooks/Section_1_Figure_1_image_grid.ipynb
ADDED
|
@@ -0,0 +1,417 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "e8752a29-06b6-49a2-a724-a7b1c8f3d23b",
|
| 6 |
+
"metadata": {
|
| 7 |
+
"execution": {
|
| 8 |
+
"iopub.execute_input": "2025-02-06T14:16:50.014780Z",
|
| 9 |
+
"iopub.status.busy": "2025-02-06T14:16:50.013936Z",
|
| 10 |
+
"iopub.status.idle": "2025-02-06T14:16:50.018030Z",
|
| 11 |
+
"shell.execute_reply": "2025-02-06T14:16:50.017506Z",
|
| 12 |
+
"shell.execute_reply.started": "2025-02-06T14:16:50.014757Z"
|
| 13 |
+
}
|
| 14 |
+
},
|
| 15 |
+
"source": [
|
| 16 |
+
"# Section 2. Download Images and create image Grid"
|
| 17 |
+
]
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"cell_type": "markdown",
|
| 21 |
+
"id": "6f81ff2d-7763-4ac5-a9e3-d9a74541e143",
|
| 22 |
+
"metadata": {
|
| 23 |
+
"execution": {
|
| 24 |
+
"iopub.execute_input": "2025-02-06T16:47:42.502304Z",
|
| 25 |
+
"iopub.status.busy": "2025-02-06T16:47:42.501200Z",
|
| 26 |
+
"iopub.status.idle": "2025-02-06T16:47:42.746600Z",
|
| 27 |
+
"shell.execute_reply": "2025-02-06T16:47:42.745736Z",
|
| 28 |
+
"shell.execute_reply.started": "2025-02-06T16:47:42.502269Z"
|
| 29 |
+
}
|
| 30 |
+
},
|
| 31 |
+
"source": [
|
| 32 |
+
"This script downloads images and stores them under data/sorted/YYYY/YYYY-MM/YYYY-MM-DD. A grid is then created in which all images flagged as NSFW are blurred and a color overlay is applied according to the color coding of the BrowsingLevel."
|
| 33 |
+
]
|
| 34 |
+
},
|
| 35 |
+
{
|
| 36 |
+
"cell_type": "markdown",
|
| 37 |
+
"id": "d6d7d826-353c-4c1b-9d69-fa67f07e5245",
|
| 38 |
+
"metadata": {
|
| 39 |
+
"execution": {
|
| 40 |
+
"iopub.execute_input": "2025-02-06T16:17:00.985310Z",
|
| 41 |
+
"iopub.status.busy": "2025-02-06T16:17:00.984540Z",
|
| 42 |
+
"iopub.status.idle": "2025-02-06T16:17:01.124348Z",
|
| 43 |
+
"shell.execute_reply": "2025-02-06T16:17:01.123740Z",
|
| 44 |
+
"shell.execute_reply.started": "2025-02-06T16:17:00.985282Z"
|
| 45 |
+
}
|
| 46 |
+
},
|
| 47 |
+
"source": [
|
| 48 |
+
""
|
| 49 |
+
]
|
| 50 |
+
},
|
| 51 |
+
{
|
| 52 |
+
"cell_type": "code",
|
| 53 |
+
"execution_count": 1,
|
| 54 |
+
"id": "6e5227f5-925c-4155-8015-53045794b986",
|
| 55 |
+
"metadata": {
|
| 56 |
+
"execution": {
|
| 57 |
+
"iopub.execute_input": "2025-02-08T21:58:48.202108Z",
|
| 58 |
+
"iopub.status.busy": "2025-02-08T21:58:48.201629Z",
|
| 59 |
+
"iopub.status.idle": "2025-02-08T21:58:48.485464Z",
|
| 60 |
+
"shell.execute_reply": "2025-02-08T21:58:48.484783Z",
|
| 61 |
+
"shell.execute_reply.started": "2025-02-08T21:58:48.202077Z"
|
| 62 |
+
}
|
| 63 |
+
},
|
| 64 |
+
"outputs": [],
|
| 65 |
+
"source": [
|
| 66 |
+
"from PIL import Image, ImageFilter, ImageDraw\n",
|
| 67 |
+
"import os\n",
|
| 68 |
+
"from pathlib import Path\n",
|
| 69 |
+
"import json\n",
|
| 70 |
+
"import matplotlib.colors as mcolors\n",
|
| 71 |
+
"import os\n",
|
| 72 |
+
"import json\n",
|
| 73 |
+
"import requests\n",
|
| 74 |
+
"from datetime import datetime\n",
|
| 75 |
+
"import argparse"
|
| 76 |
+
]
|
| 77 |
+
},
|
| 78 |
+
{
|
| 79 |
+
"cell_type": "code",
|
| 80 |
+
"execution_count": 2,
|
| 81 |
+
"id": "25cb523d-6d22-4d13-84f1-f9e38534d08e",
|
| 82 |
+
"metadata": {
|
| 83 |
+
"execution": {
|
| 84 |
+
"iopub.execute_input": "2025-02-08T21:58:49.153040Z",
|
| 85 |
+
"iopub.status.busy": "2025-02-08T21:58:49.152592Z",
|
| 86 |
+
"iopub.status.idle": "2025-02-08T21:58:49.156982Z",
|
| 87 |
+
"shell.execute_reply": "2025-02-08T21:58:49.156561Z",
|
| 88 |
+
"shell.execute_reply.started": "2025-02-08T21:58:49.153017Z"
|
| 89 |
+
}
|
| 90 |
+
},
|
| 91 |
+
"outputs": [],
|
| 92 |
+
"source": [
|
| 93 |
+
"current_dir = Path.cwd()"
|
| 94 |
+
]
|
| 95 |
+
},
|
| 96 |
+
{
|
| 97 |
+
"cell_type": "markdown",
|
| 98 |
+
"id": "037ecae3-a060-4d90-beaf-040dfb018696",
|
| 99 |
+
"metadata": {},
|
| 100 |
+
"source": [
|
| 101 |
+
"## Step 1: Download images "
|
| 102 |
+
]
|
| 103 |
+
},
|
| 104 |
+
{
|
| 105 |
+
"cell_type": "code",
|
| 106 |
+
"execution_count": 3,
|
| 107 |
+
"id": "948809b0-e78a-4821-86d0-fe4eb156265a",
|
| 108 |
+
"metadata": {
|
| 109 |
+
"execution": {
|
| 110 |
+
"iopub.execute_input": "2025-02-08T21:58:50.811934Z",
|
| 111 |
+
"iopub.status.busy": "2025-02-08T21:58:50.811459Z",
|
| 112 |
+
"iopub.status.idle": "2025-02-08T21:58:50.823112Z",
|
| 113 |
+
"shell.execute_reply": "2025-02-08T21:58:50.822600Z",
|
| 114 |
+
"shell.execute_reply.started": "2025-02-08T21:58:50.811911Z"
|
| 115 |
+
}
|
| 116 |
+
},
|
| 117 |
+
"outputs": [],
|
| 118 |
+
"source": [
|
| 119 |
+
"import os\n",
|
| 120 |
+
"import json\n",
|
| 121 |
+
"import requests\n",
|
| 122 |
+
"from datetime import datetime\n",
|
| 123 |
+
"from PIL import Image\n",
|
| 124 |
+
"\n",
|
| 125 |
+
"def download_and_save_data(input_dir, output_dir):\n",
|
| 126 |
+
"    print(f\"Scanning directory: {input_dir}\")\n",
|
| 127 |
+
"    found_json = False\n",
|
| 128 |
+
"    for root, dirs, files in os.walk(input_dir):\n",
|
| 129 |
+
"        for file in files:\n",
|
| 130 |
+
"            if file.endswith('.json'):\n",
|
| 131 |
+
"                found_json = True\n",
|
| 132 |
+
"                file_path = os.path.join(root, file)\n",
|
| 133 |
+
"                print(f\"Processing JSON file: {file_path}\")\n",
|
| 134 |
+
"                try:\n",
|
| 135 |
+
"                    with open(file_path, 'r', encoding='utf-8') as json_file:\n",
|
| 136 |
+
"                        items = json.load(json_file)\n",
|
| 137 |
+
"                        for item in items:\n",
|
| 138 |
+
"                            if isinstance(item, dict):\n",
|
| 139 |
+
"                                process_item(item, root, output_dir, input_dir)\n",
|
| 140 |
+
"                except json.JSONDecodeError:\n",
|
| 141 |
+
"                    print(f\"Error decoding JSON from file {file_path}\")\n",
|
| 142 |
+
"                except Exception as e:\n",
|
| 143 |
+
"                    print(f\"An error occurred with file {file_path}: {e}\")\n",
|
| 144 |
+
"    if not found_json:\n",
|
| 145 |
+
"        print(\"No JSON files found in the directory.\")\n",
|
| 146 |
+
"\n",
|
| 147 |
+
"def process_item(item, root, output_dir, input_dir):\n",
|
| 148 |
+
"    if 'createdAt' in item and 'url' in item:\n",
|
| 149 |
+
"        created_at = datetime.strptime(item['createdAt'], \"%Y-%m-%dT%H:%M:%S.%fZ\")\n",
|
| 150 |
+
"        time_str = created_at.strftime(\"%Y%m%dT%H%M%S%f\")  # Include microseconds in the filename\n",
|
| 151 |
+
"        relative_path = os.path.relpath(root, input_dir)\n",
|
| 152 |
+
"        save_path = os.path.join(output_dir, relative_path)\n",
|
| 153 |
+
"        os.makedirs(save_path, exist_ok=True)\n",
|
| 154 |
+
"        image_path = os.path.join(save_path, f\"{time_str}.jpeg\")\n",
|
| 155 |
+
"        download_image(item['url'], image_path)\n",
|
| 156 |
+
"        resize_image(image_path, 512)  # Resize if necessary\n",
|
| 157 |
+
"        save_prompt(item.get('meta', {}).get('prompt', 'No prompt available'), os.path.join(save_path, f\"{time_str}_positive.txt\"))\n",
|
| 158 |
+
"        save_prompt(item.get('meta', {}).get('negativePrompt', 'No negative prompt available'), os.path.join(save_path, f\"{time_str}_negative.txt\"))\n",
|
| 159 |
+
"        save_json(item, os.path.join(save_path, f\"{time_str}.json\"))\n",
|
| 160 |
+
"    else:\n",
|
| 161 |
+
"        print(\"Item missing 'createdAt' or 'url', skipping...\")\n",
|
| 162 |
+
"\n",
|
| 163 |
+
"def download_image(url, path):\n",
|
| 164 |
+
"    try:\n",
|
| 165 |
+
"        response = requests.get(url, timeout=30)  # Timeout (assumed value) avoids hanging on unresponsive hosts\n",
|
| 166 |
+
"        if response.status_code == 200:\n",
|
| 167 |
+
"            with open(path, 'wb') as f:\n",
|
| 168 |
+
"                f.write(response.content)\n",
|
| 169 |
+
"    except requests.RequestException as e:\n",
|
| 170 |
+
"        print(f\"Request failed for {url}: {e}\")\n",
|
| 171 |
+
"\n",
|
| 172 |
+
"def resize_image(image_path, max_size):\n",
|
| 173 |
+
"    with Image.open(image_path) as img:\n",
|
| 174 |
+
"        if img.width > max_size or img.height > max_size:\n",
|
| 175 |
+
"            img.thumbnail((max_size, max_size))\n",
|
| 176 |
+
"            img.save(image_path)\n",
|
| 177 |
+
"\n",
|
| 178 |
+
"def save_prompt(prompt, path):\n",
|
| 179 |
+
"    with open(path, 'w', encoding='utf-8') as f:\n",
|
| 180 |
+
"        f.write(prompt)\n",
|
| 181 |
+
"\n",
|
| 182 |
+
"def save_json(data, path):\n",
|
| 183 |
+
"    with open(path, 'w', encoding='utf-8') as f:\n",
|
| 184 |
+
"        json.dump(data, f, indent=4)\n",
|
| 185 |
+
"\n"
|
| 186 |
+
]
|
| 187 |
+
},
|
| 188 |
+
{
|
| 189 |
+
"cell_type": "code",
|
| 190 |
+
"execution_count": 4,
|
| 191 |
+
"id": "133e143d-1bb5-4145-8d59-323570bb6e95",
|
| 192 |
+
"metadata": {
|
| 193 |
+
"execution": {
|
| 194 |
+
"iopub.execute_input": "2025-02-08T21:58:51.566234Z",
|
| 195 |
+
"iopub.status.busy": "2025-02-08T21:58:51.565711Z",
|
| 196 |
+
"iopub.status.idle": "2025-02-08T21:58:51.568973Z",
|
| 197 |
+
"shell.execute_reply": "2025-02-08T21:58:51.568474Z",
|
| 198 |
+
"shell.execute_reply.started": "2025-02-08T21:58:51.566214Z"
|
| 199 |
+
}
|
| 200 |
+
},
|
| 201 |
+
"outputs": [],
|
| 202 |
+
"source": [
|
| 203 |
+
"input_dir = current_dir.parent / 'data/sorted/image_metadata/'\n",
|
| 204 |
+
"images = current_dir.parent / 'data/sorted/images'"
|
| 205 |
+
]
|
| 206 |
+
},
|
| 207 |
+
{
|
| 208 |
+
"cell_type": "markdown",
|
| 209 |
+
"id": "e11c4ad5-c040-42b7-a1b9-a0e21b33ce20",
|
| 210 |
+
"metadata": {},
|
| 211 |
+
"source": [
|
| 212 |
+
"Uncomment this to download the images; otherwise, proceed with grid creation."
|
| 213 |
+
]
|
| 214 |
+
},
|
| 215 |
+
{
|
| 216 |
+
"cell_type": "code",
|
| 217 |
+
"execution_count": 6,
|
| 218 |
+
"id": "88a6ca8e-51ee-4d27-ba0c-5f026d5750df",
|
| 219 |
+
"metadata": {
|
| 220 |
+
"execution": {
|
| 221 |
+
"iopub.execute_input": "2025-02-08T21:58:52.986899Z",
|
| 222 |
+
"iopub.status.busy": "2025-02-08T21:58:52.986456Z",
|
| 223 |
+
"iopub.status.idle": "2025-02-08T21:58:52.989508Z",
|
| 224 |
+
"shell.execute_reply": "2025-02-08T21:58:52.988892Z",
|
| 225 |
+
"shell.execute_reply.started": "2025-02-08T21:58:52.986878Z"
|
| 226 |
+
}
|
| 227 |
+
},
|
| 228 |
+
"outputs": [
|
| 229 |
+
{
|
| 230 |
+
"name": "stdout",
|
| 231 |
+
"output_type": "stream",
|
| 232 |
+
"text": [
|
| 233 |
+
"Scanning directory: /home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/sorted/image_metadata\n",
|
| 234 |
+
"No JSON files found in the directory.\n"
|
| 235 |
+
]
|
| 236 |
+
}
|
| 237 |
+
],
|
| 238 |
+
"source": [
|
| 239 |
+
"download_and_save_data(input_dir, images) "
|
| 240 |
+
]
|
| 241 |
+
},
|
| 242 |
+
{
|
| 243 |
+
"cell_type": "code",
|
| 244 |
+
"execution_count": null,
|
| 245 |
+
"id": "f247091b-7b45-4cd4-a6ca-3d4dea414e19",
|
| 246 |
+
"metadata": {
|
| 247 |
+
"execution": {
|
| 248 |
+
"iopub.execute_input": "2025-02-08T21:58:54.127935Z",
|
| 249 |
+
"iopub.status.busy": "2025-02-08T21:58:54.127626Z",
|
| 250 |
+
"iopub.status.idle": "2025-02-08T21:58:54.254290Z",
|
| 251 |
+
"shell.execute_reply": "2025-02-08T21:58:54.253243Z",
|
| 252 |
+
"shell.execute_reply.started": "2025-02-08T21:58:54.127914Z"
|
| 253 |
+
}
|
| 254 |
+
},
|
| 255 |
+
"outputs": [
|
| 256 |
+
{
|
| 257 |
+
"name": "stdout",
|
| 258 |
+
"output_type": "stream",
|
| 259 |
+
"text": [
|
| 260 |
+
"Grid image saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/plots/grid10x120.png\n"
|
| 261 |
+
]
|
| 262 |
+
}
|
| 263 |
+
],
|
| 264 |
+
"source": [
|
| 265 |
+
"import random\n",
|
| 266 |
+
"\n",
|
| 267 |
+
"# For creating the figure grid\n",
|
| 268 |
+
"images = []\n",
|
| 269 |
+
"all_valid_images = []\n",
|
| 270 |
+
"\n",
|
| 271 |
+
"\n",
|
| 272 |
+
"directory = current_dir.parent / 'data/sorted/images/'\n",
|
| 273 |
+
"#directory = '/home/lauwag/shares/laura_wagner/Civitai_page_analysis/Civitai_dataset/dataset/chronological/full/prompts-images/2024/2024-05/2024-05-31/'\n",
|
| 274 |
+
"output = current_dir.parent / 'plots/grid10x120.png'\n",
|
| 275 |
+
"\n",
|
| 276 |
+
"cell_size = 64  # Resize each cell to 64x64 pixels\n",
|
| 277 |
+
"grid_size = (10, 120)\n",
|
| 278 |
+
"\n",
|
| 279 |
+
"\n",
|
| 280 |
+
"def apply_pixelation(img, level):\n",
|
| 281 |
+
"    \"\"\" Apply pixelation based on the level. \"\"\"\n",
|
| 282 |
+
"    if level > 1:\n",
|
| 283 |
+
"        pixel_size = 8  # Adjust this value for more or less pixelation\n",
|
| 284 |
+
"        img_small = img.resize(\n",
|
| 285 |
+
"            (img.width // pixel_size, img.height // pixel_size), Image.NEAREST\n",
|
| 286 |
+
"        )\n",
|
| 287 |
+
"        img_pixelated = img_small.resize(img.size, Image.NEAREST)\n",
|
| 288 |
+
"        return img_pixelated\n",
|
| 289 |
+
"    return img\n",
|
| 290 |
+
"\n",
|
| 291 |
+
"def apply_blur(img, level):\n",
|
| 292 |
+
"    \"\"\" Apply Gaussian blur based on the level of browsing. \"\"\"\n",
|
| 293 |
+
"    if level > 1:\n",
|
| 294 |
+
"        return img.filter(ImageFilter.GaussianBlur(radius=1.75))  # Adjust radius as needed\n",
|
| 295 |
+
"    return img\n",
|
| 296 |
+
"\n",
|
| 297 |
+
"def apply_border_and_overlay(img, color):\n",
|
| 298 |
+
"    \"\"\" Apply a colored border and a slight color overlay. \"\"\"\n",
|
| 299 |
+
"    # Convert Matplotlib color name to RGB\n",
|
| 300 |
+
"    rgb_color = tuple(int(x * 255) for x in mcolors.to_rgb(color))\n",
|
| 301 |
+
"    # Create a border\n",
|
| 302 |
+
"    border_size = 0  # Adjust border size as needed (0 disables the border)\n",
|
| 303 |
+
"    border_img = Image.new('RGB', (img.width + 2 * border_size, img.height + 2 * border_size), rgb_color)\n",
|
| 304 |
+
"    border_img.paste(img, (border_size, border_size))\n",
|
| 305 |
+
"    # Create an overlay\n",
|
| 306 |
+
"    overlay = Image.new('RGBA', border_img.size, (*rgb_color, 0))  # Alpha 0 is fully transparent, so the overlay is currently disabled\n",
|
| 307 |
+
"    #overlay = Image.new('RGBA', border_img.size, (*rgb_color, 128))  # Semi-transparent overlay\n",
|
| 308 |
+
"    final_img = Image.alpha_composite(border_img.convert('RGBA'), overlay)\n",
|
| 309 |
+
"    return final_img.convert('RGB')\n",
|
| 310 |
+
"\n",
|
| 311 |
+
"def process_image(image_path, json_path, cell_size):\n",
|
| 312 |
+
"    \"\"\" Process each image: resize, crop, pixelate, and add border and overlay based on JSON metadata. \"\"\"\n",
|
| 313 |
+
"    with open(json_path, 'r') as f:\n",
|
| 314 |
+
"        metadata = json.load(f)\n",
|
| 315 |
+
"        browsing_level = metadata.get('browsingLevel', 1)\n",
|
| 316 |
+
"\n",
|
| 317 |
+
"    with Image.open(image_path) as img:\n",
|
| 318 |
+
"        img = resize_and_crop_image(img, cell_size)\n",
|
| 319 |
+
"        img = apply_pixelation(img, browsing_level)\n",
|
| 320 |
+
"\n",
|
| 321 |
+
"        color_map = {\n",
|
| 322 |
+
"            2: 'rosybrown',  # Matplotlib color name\n",
|
| 323 |
+
"            4: 'coral',\n",
|
| 324 |
+
"            8: 'red',\n",
|
| 325 |
+
"            16: 'magenta'\n",
|
| 326 |
+
"        }\n",
|
| 327 |
+
"        if browsing_level in color_map:\n",
|
| 328 |
+
"            img = apply_border_and_overlay(img, color_map[browsing_level])\n",
|
| 329 |
+
"\n",
|
| 330 |
+
"        return img\n",
|
| 331 |
+
"\n",
|
| 332 |
+
"def resize_and_crop_image(img, output_size):\n",
|
| 333 |
+
"    \"\"\" Resize and crop the image to a square of the specified size. \"\"\"\n",
|
| 334 |
+
"    ratio = min(img.width / output_size, img.height / output_size)\n",
|
| 335 |
+
"    new_size = (int(img.width / ratio), int(img.height / ratio))\n",
|
| 336 |
+
"    img = img.resize(new_size, Image.Resampling.LANCZOS)\n",
|
| 337 |
+
"    left = (img.width - output_size) // 2\n",
|
| 338 |
+
"    top = (img.height - output_size) // 2\n",
|
| 339 |
+
"    right = left + output_size\n",
|
| 340 |
+
"    bottom = top + output_size\n",
|
| 341 |
+
"    return img.crop((left, top, right, bottom))\n",
|
| 342 |
+
"\n",
|
| 343 |
+
"# Grid assembly\n",
|
| 344 |
+
"\n",
|
| 345 |
+
"# grid_size is (rows, columns), so the figure has 10 rows of 120 cells\n",
|
| 346 |
+
"\n",
|
| 347 |
+
"\n",
|
| 348 |
+
"file_types = ('png', 'jpg', 'jpeg') # Define acceptable image file types\n",
|
| 349 |
+
"\n",
|
| 350 |
+
"\n",
|
| 351 |
+
"\n",
|
| 352 |
+
"\n",
|
| 353 |
+
"\n",
|
| 354 |
+
"# First, collect all valid image paths\n",
|
| 355 |
+
"for root, dirs, files in os.walk(directory):\n",
|
| 356 |
+
"    for file in files:\n",
|
| 357 |
+
"        if file.lower().endswith(file_types):\n",
|
| 358 |
+
"            image_path = os.path.join(root, file)\n",
|
| 359 |
+
"            json_path = image_path.rsplit('.', 1)[0] + '.json'\n",
|
| 360 |
+
"            if os.path.exists(json_path):\n",
|
| 361 |
+
"                all_valid_images.append((image_path, json_path))\n",
|
| 362 |
+
"\n",
|
| 363 |
+
"# Randomly sample from valid images\n",
|
| 364 |
+
"num_needed = grid_size[0] * grid_size[1]\n",
|
| 365 |
+
"if len(all_valid_images) >= num_needed:\n",
|
| 366 |
+
"    random.seed(42)  # For reproducibility\n",
|
| 367 |
+
"    sampled_images = random.sample(all_valid_images, num_needed)\n",
|
| 368 |
+
"    \n",
|
| 369 |
+
"    for image_path, json_path in sampled_images:\n",
|
| 370 |
+
"        img = process_image(image_path, json_path, cell_size)\n",
|
| 371 |
+
"        images.append(img)\n",
|
| 372 |
+
"else:\n",
|
| 373 |
+
"    print(f\"Warning: Only {len(all_valid_images)} valid images found, need {num_needed}\")\n",
|
| 374 |
+
"\n",
|
| 375 |
+
"\n",
|
| 376 |
+
"# Create the grid image\n",
|
| 377 |
+
"grid_img = Image.new('RGB', (grid_size[1] * cell_size, grid_size[0] * cell_size))\n",
|
| 378 |
+
"for index, img in enumerate(images):\n",
|
| 379 |
+
"    x = (index % grid_size[1]) * cell_size\n",
|
| 380 |
+
"    y = (index // grid_size[1]) * cell_size\n",
|
| 381 |
+
"    grid_img.paste(img, (x, y))\n",
|
| 382 |
+
"\n",
|
| 383 |
+
"grid_img.save(output)\n",
|
| 384 |
+
"print(f\"Grid image saved to {output}\")"
|
| 385 |
+
]
|
| 386 |
+
},
|
| 387 |
+
{
|
| 388 |
+
"cell_type": "code",
|
| 389 |
+
"execution_count": null,
|
| 390 |
+
"id": "28bba299-c5ef-449b-a7fa-1afdc5e26262",
|
| 391 |
+
"metadata": {},
|
| 392 |
+
"outputs": [],
|
| 393 |
+
"source": []
|
| 394 |
+
}
|
| 395 |
+
],
|
| 396 |
+
"metadata": {
|
| 397 |
+
"kernelspec": {
|
| 398 |
+
"display_name": "latm",
|
| 399 |
+
"language": "python",
|
| 400 |
+
"name": "python3"
|
| 401 |
+
},
|
| 402 |
+
"language_info": {
|
| 403 |
+
"codemirror_mode": {
|
| 404 |
+
"name": "ipython",
|
| 405 |
+
"version": 3
|
| 406 |
+
},
|
| 407 |
+
"file_extension": ".py",
|
| 408 |
+
"mimetype": "text/x-python",
|
| 409 |
+
"name": "python",
|
| 410 |
+
"nbconvert_exporter": "python",
|
| 411 |
+
"pygments_lexer": "ipython3",
|
| 412 |
+
"version": "3.10.15"
|
| 413 |
+
}
|
| 414 |
+
},
|
| 415 |
+
"nbformat": 4,
|
| 416 |
+
"nbformat_minor": 5
|
| 417 |
+
}
|
jupyter_notebooks/Section_2-2-2_Figure_3_histogram_monthly_images_nsfw_levels.ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
jupyter_notebooks/Section_2-2-2_Figure_4_Demographic_patterns_in_gen_images.ipynb
ADDED
|
@@ -0,0 +1,1795 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "d8161f06-4fd2-436d-9e59-3f68b5a67f2c",
|
| 6 |
+
"metadata": {
|
| 7 |
+
"execution": {
|
| 8 |
+
"iopub.execute_input": "2025-02-06T18:30:16.974712Z",
|
| 9 |
+
"iopub.status.busy": "2025-02-06T18:30:16.974296Z",
|
| 10 |
+
"iopub.status.idle": "2025-02-06T18:30:16.976909Z",
|
| 11 |
+
"shell.execute_reply": "2025-02-06T18:30:16.976526Z",
|
| 12 |
+
"shell.execute_reply.started": "2025-02-06T18:30:16.974692Z"
|
| 13 |
+
}
|
| 14 |
+
},
|
| 15 |
+
"source": [
|
| 16 |
+
"# Section 6.2: Age and Gender Estimation using MiVOLO"
|
| 17 |
+
]
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"cell_type": "markdown",
|
| 21 |
+
"id": "64aaedec-ef56-4a62-b61e-12de2675a1ae",
|
| 22 |
+
"metadata": {
|
| 23 |
+
"execution": {
|
| 24 |
+
"iopub.execute_input": "2025-02-06T19:52:51.171282Z",
|
| 25 |
+
"iopub.status.busy": "2025-02-06T19:52:51.170711Z",
|
| 26 |
+
"iopub.status.idle": "2025-02-06T19:52:55.405039Z",
|
| 27 |
+
"shell.execute_reply": "2025-02-06T19:52:55.404308Z",
|
| 28 |
+
"shell.execute_reply.started": "2025-02-06T19:52:51.171245Z"
|
| 29 |
+
}
|
| 30 |
+
},
|
| 31 |
+
"source": [
|
| 32 |
+
""
|
| 33 |
+
]
|
| 34 |
+
},
|
| 35 |
+
{
|
| 36 |
+
"cell_type": "code",
|
| 37 |
+
"execution_count": 1,
|
| 38 |
+
"id": "4293f307-44fd-455e-90fe-6e6928be9af5",
|
| 39 |
+
"metadata": {
|
| 40 |
+
"execution": {
|
| 41 |
+
"iopub.execute_input": "2025-02-08T21:59:21.970807Z",
|
| 42 |
+
"iopub.status.busy": "2025-02-08T21:59:21.969931Z",
|
| 43 |
+
"iopub.status.idle": "2025-02-08T22:00:09.724295Z",
|
| 44 |
+
"shell.execute_reply": "2025-02-08T22:00:09.723583Z",
|
| 45 |
+
"shell.execute_reply.started": "2025-02-08T21:59:21.970784Z"
|
| 46 |
+
}
|
| 47 |
+
},
|
| 48 |
+
"outputs": [
|
| 49 |
+
{
|
| 50 |
+
"name": "stderr",
|
| 51 |
+
"output_type": "stream",
|
| 52 |
+
"text": [
|
| 53 |
+
"/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
|
| 54 |
+
" from .autonotebook import tqdm as notebook_tqdm\n"
|
| 55 |
+
]
|
| 56 |
+
}
|
| 57 |
+
],
|
| 58 |
+
"source": [
|
| 59 |
+
"import csv\n",
|
| 60 |
+
"from pathlib import Path\n",
|
| 61 |
+
"import logging\n",
|
| 62 |
+
"import os\n",
|
| 63 |
+
"import pandas as pd\n",
|
| 64 |
+
"import requests\n",
|
| 65 |
+
"import numpy as np\n",
|
| 66 |
+
"import torch\n",
|
| 67 |
+
"import cv2\n",
|
| 68 |
+
"from io import BytesIO\n",
|
| 69 |
+
"from PIL import Image, UnidentifiedImageError\n",
|
| 70 |
+
"from datetime import datetime, timedelta\n",
|
| 71 |
+
"from dateutil.relativedelta import relativedelta\n",
|
| 72 |
+
"from mivolo.predictor import Predictor\n",
|
| 73 |
+
"import matplotlib.pyplot as plt\n",
|
| 74 |
+
"import matplotlib.patches as mpatches"
|
| 75 |
+
]
|
| 76 |
+
},
|
| 77 |
+
{
|
| 78 |
+
"cell_type": "code",
|
| 79 |
+
"execution_count": 2,
|
| 80 |
+
"id": "63a54f5f-900c-48dd-8932-2632e56c5670",
|
| 81 |
+
"metadata": {
|
| 82 |
+
"execution": {
|
| 83 |
+
"iopub.execute_input": "2025-02-08T22:00:09.726069Z",
|
| 84 |
+
"iopub.status.busy": "2025-02-08T22:00:09.725699Z",
|
| 85 |
+
"iopub.status.idle": "2025-02-08T22:00:09.730626Z",
|
| 86 |
+
"shell.execute_reply": "2025-02-08T22:00:09.730099Z",
|
| 87 |
+
"shell.execute_reply.started": "2025-02-08T22:00:09.726050Z"
|
| 88 |
+
}
|
| 89 |
+
},
|
| 90 |
+
"outputs": [],
|
| 91 |
+
"source": [
|
| 92 |
+
"current_dir = Path.cwd()\n",
|
| 93 |
+
"mini = current_dir.parent / 'data/CSV/image_subsets/Civiverse-mini.csv'\n",
|
| 94 |
+
"mivolo_in = current_dir.parent / 'data/CSV/image_subsets/Civiverse-mini-by-month/'\n",
|
| 95 |
+
"mivolo_in.mkdir(parents=True, exist_ok=True)  # Create the input directory if it doesn't exist"
|
| 96 |
+
]
|
| 97 |
+
},
|
| 98 |
+
{
|
| 99 |
+
"cell_type": "code",
|
| 100 |
+
"execution_count": 3,
|
| 101 |
+
"id": "1fdfb89a-6094-4382-8755-fae213221ea5",
|
| 102 |
+
"metadata": {
|
| 103 |
+
"execution": {
|
| 104 |
+
"iopub.execute_input": "2025-02-08T22:00:09.731540Z",
|
| 105 |
+
"iopub.status.busy": "2025-02-08T22:00:09.731359Z",
|
| 106 |
+
"iopub.status.idle": "2025-02-08T22:00:09.825738Z",
|
| 107 |
+
"shell.execute_reply": "2025-02-08T22:00:09.825258Z",
|
| 108 |
+
"shell.execute_reply.started": "2025-02-08T22:00:09.731524Z"
|
| 109 |
+
}
|
| 110 |
+
},
|
| 111 |
+
"outputs": [],
|
| 112 |
+
"source": [
|
| 113 |
+
"def split_by_month(input_path, output_dir):\n",
|
| 114 |
+
"    # Load the dataset\n",
|
| 115 |
+
"    df = pd.read_csv(input_path)\n",
|
| 116 |
+
"    \n",
|
| 117 |
+
"    # Convert the 'createdAt' column to datetime\n",
|
| 118 |
+
"    df['createdAt'] = pd.to_datetime(df['createdAt'], errors='coerce')\n",
|
| 119 |
+
"    \n",
|
| 120 |
+
"    # Extract year and month\n",
|
| 121 |
+
"    df['year_month'] = df['createdAt'].dt.to_period('M')\n",
|
| 122 |
+
"    \n",
|
| 123 |
+
"    # Group the data by year and month and save each group as a CSV file\n",
|
| 124 |
+
"    unique_months = df['year_month'].dropna().unique()  # Skip rows whose date failed to parse\n",
|
| 125 |
+
"\n",
|
| 126 |
+
"    for month in unique_months:\n",
|
| 127 |
+
"        # Filter data for the specific month\n",
|
| 128 |
+
"        df_month = df[df['year_month'] == month]\n",
|
| 129 |
+
"        \n",
|
| 130 |
+
"        # Define the file name based on the year and month\n",
|
| 131 |
+
"        file_name = f'{output_dir}/Civiverse-{month}.csv'\n",
|
| 132 |
+
"        \n",
|
| 133 |
+
"        # Save the file\n",
|
| 134 |
+
"        df_month.to_csv(file_name, index=False)\n",
|
| 135 |
+
"\n",
|
| 136 |
+
"    print(f\"Data has been split and saved to {output_dir}\")"
|
| 137 |
+
]
|
| 138 |
+
},
|
| 139 |
+
{
|
| 140 |
+
"cell_type": "code",
|
| 141 |
+
"execution_count": 4,
|
| 142 |
+
"id": "2c909d7c-7d16-4dc7-8364-7f1c0784414c",
|
| 143 |
+
"metadata": {
|
| 144 |
+
"execution": {
|
| 145 |
+
"iopub.execute_input": "2025-02-08T22:00:09.827095Z",
|
| 146 |
+
"iopub.status.busy": "2025-02-08T22:00:09.826919Z",
|
| 147 |
+
"iopub.status.idle": "2025-02-08T22:00:10.479484Z",
|
| 148 |
+
"shell.execute_reply": "2025-02-08T22:00:10.478777Z",
|
| 149 |
+
"shell.execute_reply.started": "2025-02-08T22:00:09.827079Z"
|
| 150 |
+
}
|
| 151 |
+
},
|
| 152 |
+
"outputs": [
|
| 153 |
+
{
|
| 154 |
+
"name": "stderr",
|
| 155 |
+
"output_type": "stream",
|
| 156 |
+
"text": [
|
| 157 |
+
"/sctmp/lauwag/ipykernel_1497673/1825509207.py:9: UserWarning: Converting to PeriodArray/Index representation will drop timezone information.\n",
|
| 158 |
+
" df['year_month'] = df['createdAt'].dt.to_period('M')\n"
|
| 159 |
+
]
|
| 160 |
+
},
|
| 161 |
+
{
|
| 162 |
+
"name": "stdout",
|
| 163 |
+
"output_type": "stream",
|
| 164 |
+
"text": [
|
| 165 |
+
"Data has been split and saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month\n"
|
| 166 |
+
]
|
| 167 |
+
}
|
| 168 |
+
],
|
| 169 |
+
"source": [
|
| 170 |
+
"split_by_month(mini, mivolo_in)"
|
| 171 |
+
]
|
| 172 |
+
},
|
| 173 |
+
{
|
| 174 |
+
"cell_type": "code",
|
| 175 |
+
"execution_count": 5,
|
| 176 |
+
"id": "4c543306-ffc8-4b9c-a3df-b49b2271caa9",
|
| 177 |
+
"metadata": {
|
| 178 |
+
"execution": {
|
| 179 |
+
"iopub.execute_input": "2025-02-08T22:00:10.480505Z",
|
| 180 |
+
"iopub.status.busy": "2025-02-08T22:00:10.480310Z",
|
| 181 |
+
"iopub.status.idle": "2025-02-08T22:00:10.483961Z",
|
| 182 |
+
"shell.execute_reply": "2025-02-08T22:00:10.483400Z",
|
| 183 |
+
"shell.execute_reply.started": "2025-02-08T22:00:10.480486Z"
|
| 184 |
+
}
|
| 185 |
+
},
|
| 186 |
+
"outputs": [],
|
| 187 |
+
"source": [
|
| 188 |
+
"mivolo_out = current_dir.parent / 'data/CSV/image_subsets/MiVOLO-results/'\n",
|
| 189 |
+
"mivolo_out.mkdir(parents=True, exist_ok=True) # Create the output directory if it doesn't exist"
|
| 190 |
+
]
|
| 191 |
+
},
|
| 192 |
+
{
|
| 193 |
+
"cell_type": "markdown",
|
| 194 |
+
"id": "ffb7dd23",
|
| 195 |
+
"metadata": {},
|
| 196 |
+
"source": [
|
| 197 |
+
"## MiVOLO gender and age inference"
|
| 198 |
+
]
|
| 199 |
+
},
|
| 200 |
+
{
|
| 201 |
+
"cell_type": "code",
|
| 202 |
+
"execution_count": null,
|
| 203 |
+
"id": "304ed12f-c7b6-4129-b24d-7ccc793a62c7",
|
| 204 |
+
"metadata": {
|
| 205 |
+
"execution": {
|
| 206 |
+
"iopub.execute_input": "2025-02-08T22:00:10.484802Z",
|
| 207 |
+
"iopub.status.busy": "2025-02-08T22:00:10.484639Z"
|
| 208 |
+
}
|
| 209 |
+
},
|
| 210 |
+
"outputs": [
|
| 211 |
+
{
|
| 212 |
+
"name": "stderr",
|
| 213 |
+
"output_type": "stream",
|
| 214 |
+
"text": [
|
| 215 |
+
"/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/ultralytics/nn/tasks.py:634: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
|
| 216 |
+
" return torch.load(file, map_location=\"cpu\"), file # load\n"
|
| 217 |
+
]
|
| 218 |
+
},
|
| 219 |
+
{
|
| 220 |
+
"name": "stdout",
|
| 221 |
+
"output_type": "stream",
|
| 222 |
+
"text": [
|
| 223 |
+
"Model summary (fused): 268 layers, 68125494 parameters, 0 gradients, 257.4 GFLOPs\n"
|
| 224 |
+
]
|
| 225 |
+
},
|
| 226 |
+
{
|
| 227 |
+
"name": "stderr",
|
| 228 |
+
"output_type": "stream",
|
| 229 |
+
"text": [
|
| 230 |
+
"[W208 23:00:15.738708520 NNPACK.cpp:61] Could not initialize NNPACK! Reason: Unsupported hardware.\n",
|
| 231 |
+
"/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/mivolo/model/mi_volo.py:33: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
|
| 232 |
+
" state = torch.load(ckpt_path, map_location=\"cpu\")\n",
|
| 233 |
+
"INFO:MiVOLO:Model meta:\n",
|
| 234 |
+
"min_age: 1, max_age: 95, avg_age: 48.0, num_classes: 3, in_chans: 6, with_persons_model: True, disable_faces: False, use_persons: True, only_age: False, num_classes_gender: 2, input_size: 224, use_person_crops: True, use_face_crops: True\n",
|
| 235 |
+
"/home/lauwag/data/conda/envs/horde/lib/python3.12/site-packages/timm/models/_helpers.py:39: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
|
| 236 |
+
" checkpoint = torch.load(checkpoint_path, map_location='cpu')\n",
|
| 237 |
+
"INFO:timm.models._helpers:Loaded state_dict from checkpoint '/shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/ext/MiVOLO/models/mivolo_imbd.pth.tar'\n",
|
| 238 |
+
"INFO:MiVOLO:Model mivolo_d1_224 created, param count: 27432414\n",
|
| 239 |
+
"INFO:timm.data.config:Data processing configuration for current model + dataset:\n",
|
| 240 |
+
"INFO:timm.data.config:\tinput_size: (3, 224, 224)\n",
|
| 241 |
+
"INFO:timm.data.config:\tinterpolation: bicubic\n",
|
| 242 |
+
"INFO:timm.data.config:\tmean: (0.485, 0.456, 0.406)\n",
|
| 243 |
+
"INFO:timm.data.config:\tstd: (0.229, 0.224, 0.225)\n",
|
| 244 |
+
"INFO:timm.data.config:\tcrop_pct: 0.96\n",
|
| 245 |
+
"INFO:timm.data.config:\tcrop_mode: center\n"
|
| 246 |
+
]
|
| 247 |
+
},
|
| 248 |
+
{
|
| 249 |
+
"name": "stdout",
|
| 250 |
+
"output_type": "stream",
|
| 251 |
+
"text": [
|
| 252 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2022-11.csv\n",
|
| 253 |
+
"\n",
|
| 254 |
+
"0: 640x640 (no detections), 723.9ms\n",
|
| 255 |
+
"Speed: 12.1ms preprocess, 723.9ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 256 |
+
"Processed and saved 1 images so far.\n",
|
| 257 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2022-11.csv\n",
|
| 258 |
+
"File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2022-12.csv\n",
|
| 259 |
+
"File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-01.csv\n",
|
| 260 |
+
"File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-02.csv\n",
|
| 261 |
+
"File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-03.csv\n",
|
| 262 |
+
"File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-04.csv\n",
|
| 263 |
+
"File not found: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-05.csv\n",
|
| 264 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-06.csv\n",
|
| 265 |
+
"\n",
|
| 266 |
+
"0: 416x640 1 person, 455.1ms\n",
|
| 267 |
+
"Speed: 3.5ms preprocess, 455.1ms inference, 33.5ms postprocess per image at shape (1, 3, 416, 640)\n"
|
| 268 |
+
]
|
| 269 |
+
},
|
| 270 |
+
{
|
| 271 |
+
"name": "stderr",
|
| 272 |
+
"output_type": "stream",
|
| 273 |
+
"text": [
|
| 274 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 275 |
+
"INFO:MiVOLO:\tage: 32.89\n",
|
| 276 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 277 |
+
]
|
| 278 |
+
},
|
| 279 |
+
{
|
| 280 |
+
"name": "stdout",
|
| 281 |
+
"output_type": "stream",
|
| 282 |
+
"text": [
|
| 283 |
+
"Processed and saved 1 images so far.\n",
|
| 284 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-06.csv\n",
|
| 285 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-07.csv\n",
|
| 286 |
+
"\n",
|
| 287 |
+
"0: 640x320 1 person, 395.7ms\n",
|
| 288 |
+
"Speed: 2.9ms preprocess, 395.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 320)\n"
|
| 289 |
+
]
|
| 290 |
+
},
|
| 291 |
+
{
|
| 292 |
+
"name": "stderr",
|
| 293 |
+
"output_type": "stream",
|
| 294 |
+
"text": [
|
| 295 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 296 |
+
"INFO:MiVOLO:\tage: 33.49\n",
|
| 297 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 298 |
+
]
|
| 299 |
+
},
|
| 300 |
+
{
|
| 301 |
+
"name": "stdout",
|
| 302 |
+
"output_type": "stream",
|
| 303 |
+
"text": [
|
| 304 |
+
"Processed and saved 1 images so far.\n",
|
| 305 |
+
"\n",
|
| 306 |
+
"0: 640x448 1 person, 1 face, 478.5ms\n",
|
| 307 |
+
"Speed: 1.9ms preprocess, 478.5ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 308 |
+
]
|
| 309 |
+
},
|
| 310 |
+
{
|
| 311 |
+
"name": "stderr",
|
| 312 |
+
"output_type": "stream",
|
| 313 |
+
"text": [
|
| 314 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 315 |
+
"INFO:MiVOLO:\tage: 17.81\n",
|
| 316 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 317 |
+
]
|
| 318 |
+
},
|
| 319 |
+
{
|
| 320 |
+
"name": "stdout",
|
| 321 |
+
"output_type": "stream",
|
| 322 |
+
"text": [
|
| 323 |
+
"Processed and saved 2 images so far.\n",
|
| 324 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-07.csv\n",
|
| 325 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-08.csv\n",
|
| 326 |
+
"\n",
|
| 327 |
+
"0: 640x448 1 person, 478.0ms\n",
|
| 328 |
+
"Speed: 2.9ms preprocess, 478.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 329 |
+
]
|
| 330 |
+
},
|
| 331 |
+
{
|
| 332 |
+
"name": "stderr",
|
| 333 |
+
"output_type": "stream",
|
| 334 |
+
"text": [
|
| 335 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 336 |
+
"INFO:MiVOLO:\tage: 40.62\n",
|
| 337 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 338 |
+
]
|
| 339 |
+
},
|
| 340 |
+
{
|
| 341 |
+
"name": "stdout",
|
| 342 |
+
"output_type": "stream",
|
| 343 |
+
"text": [
|
| 344 |
+
"Processed and saved 1 images so far.\n",
|
| 345 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-08.csv\n",
|
| 346 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-09.csv\n",
|
| 347 |
+
"\n",
|
| 348 |
+
"0: 416x640 (no detections), 567.1ms\n",
|
| 349 |
+
"Speed: 2.4ms preprocess, 567.1ms inference, 0.4ms postprocess per image at shape (1, 3, 416, 640)\n",
|
| 350 |
+
"Processed and saved 1 images so far.\n",
|
| 351 |
+
"\n",
|
| 352 |
+
"0: 320x640 (no detections), 393.6ms\n",
|
| 353 |
+
"Speed: 1.7ms preprocess, 393.6ms inference, 0.4ms postprocess per image at shape (1, 3, 320, 640)\n",
|
| 354 |
+
"\n",
|
| 355 |
+
"0: 640x640 (no detections), 711.9ms\n",
|
| 356 |
+
"Speed: 3.4ms preprocess, 711.9ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 357 |
+
"\n",
|
| 358 |
+
"0: 640x640 (no detections), 699.8ms\n",
|
| 359 |
+
"Speed: 2.3ms preprocess, 699.8ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 360 |
+
"\n",
|
| 361 |
+
"0: 640x576 1 person, 629.6ms\n",
|
| 362 |
+
"Speed: 2.4ms preprocess, 629.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 576)\n"
|
| 363 |
+
]
|
| 364 |
+
},
|
| 365 |
+
{
|
| 366 |
+
"name": "stderr",
|
| 367 |
+
"output_type": "stream",
|
| 368 |
+
"text": [
|
| 369 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 370 |
+
"INFO:MiVOLO:\tage: 28.65\n",
|
| 371 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 372 |
+
]
|
| 373 |
+
},
|
| 374 |
+
{
|
| 375 |
+
"name": "stdout",
|
| 376 |
+
"output_type": "stream",
|
| 377 |
+
"text": [
|
| 378 |
+
"\n",
|
| 379 |
+
"0: 640x448 1 person, 1 face, 598.3ms\n",
|
| 380 |
+
"Speed: 2.1ms preprocess, 598.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 381 |
+
]
|
| 382 |
+
},
|
| 383 |
+
{
|
| 384 |
+
"name": "stderr",
|
| 385 |
+
"output_type": "stream",
|
| 386 |
+
"text": [
|
| 387 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 388 |
+
"INFO:MiVOLO:\tage: 25.85\n",
|
| 389 |
+
"INFO:MiVOLO:\tgender: female [99%]\n",
|
| 390 |
+
"ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/98848c97-3d1e-4b52-9967-aeeca354a30e/width=656/98848c97-3d1e-4b52-9967-aeeca354a30e.mp4: cannot identify image file <_io.BytesIO object at 0x14cb00133740>\n"
|
| 391 |
+
]
|
| 392 |
+
},
|
| 393 |
+
{
|
| 394 |
+
"name": "stdout",
|
| 395 |
+
"output_type": "stream",
|
| 396 |
+
"text": [
|
| 397 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-09.csv\n",
|
| 398 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-10.csv\n"
|
| 399 |
+
]
|
| 400 |
+
},
|
| 401 |
+
{
|
| 402 |
+
"name": "stderr",
|
| 403 |
+
"output_type": "stream",
|
| 404 |
+
"text": [
|
| 405 |
+
"ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/e6469288-b487-4a06-99c1-59e7ac22fa77/width=1024/e6469288-b487-4a06-99c1-59e7ac22fa77.mp4: cannot identify image file <_io.BytesIO object at 0x14cb00ecbdd0>\n"
|
| 406 |
+
]
|
| 407 |
+
},
|
| 408 |
+
{
|
| 409 |
+
"name": "stdout",
|
| 410 |
+
"output_type": "stream",
|
| 411 |
+
"text": [
|
| 412 |
+
"\n",
|
| 413 |
+
"0: 448x640 (no detections), 536.6ms\n",
|
| 414 |
+
"Speed: 10.1ms preprocess, 536.6ms inference, 0.4ms postprocess per image at shape (1, 3, 448, 640)\n",
|
| 415 |
+
"Processed and saved 2 images so far.\n",
|
| 416 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-10.csv\n",
|
| 417 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-11.csv\n",
|
| 418 |
+
"\n",
|
| 419 |
+
"0: 640x448 1 person, 1 face, 662.9ms\n",
|
| 420 |
+
"Speed: 2.6ms preprocess, 662.9ms inference, 0.9ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 421 |
+
]
|
| 422 |
+
},
|
| 423 |
+
{
|
| 424 |
+
"name": "stderr",
|
| 425 |
+
"output_type": "stream",
|
| 426 |
+
"text": [
|
| 427 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 428 |
+
"INFO:MiVOLO:\tage: 17.0\n",
|
| 429 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 430 |
+
]
|
| 431 |
+
},
|
| 432 |
+
{
|
| 433 |
+
"name": "stdout",
|
| 434 |
+
"output_type": "stream",
|
| 435 |
+
"text": [
|
| 436 |
+
"Processed and saved 1 images so far.\n",
|
| 437 |
+
"\n",
|
| 438 |
+
"0: 640x384 1 person, 895.9ms\n",
|
| 439 |
+
"Speed: 2.0ms preprocess, 895.9ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 384)\n"
|
| 440 |
+
]
|
| 441 |
+
},
|
| 442 |
+
{
|
| 443 |
+
"name": "stderr",
|
| 444 |
+
"output_type": "stream",
|
| 445 |
+
"text": [
|
| 446 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 447 |
+
"INFO:MiVOLO:\tage: 43.33\n",
|
| 448 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 449 |
+
]
|
| 450 |
+
},
|
| 451 |
+
{
|
| 452 |
+
"name": "stdout",
|
| 453 |
+
"output_type": "stream",
|
| 454 |
+
"text": [
|
| 455 |
+
"\n",
|
| 456 |
+
"0: 640x448 (no detections), 529.4ms\n",
|
| 457 |
+
"Speed: 2.6ms preprocess, 529.4ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 458 |
+
"\n",
|
| 459 |
+
"0: 640x448 1 person, 539.3ms\n",
|
| 460 |
+
"Speed: 2.8ms preprocess, 539.3ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 461 |
+
]
|
| 462 |
+
},
|
| 463 |
+
{
|
| 464 |
+
"name": "stderr",
|
| 465 |
+
"output_type": "stream",
|
| 466 |
+
"text": [
|
| 467 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 468 |
+
"INFO:MiVOLO:\tage: 39.15\n",
|
| 469 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 470 |
+
]
|
| 471 |
+
},
|
| 472 |
+
{
|
| 473 |
+
"name": "stdout",
|
| 474 |
+
"output_type": "stream",
|
| 475 |
+
"text": [
|
| 476 |
+
"\n",
|
| 477 |
+
"0: 640x448 1 person, 1 face, 708.6ms\n",
|
| 478 |
+
"Speed: 2.5ms preprocess, 708.6ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 479 |
+
]
|
| 480 |
+
},
|
| 481 |
+
{
|
| 482 |
+
"name": "stderr",
|
| 483 |
+
"output_type": "stream",
|
| 484 |
+
"text": [
|
| 485 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 486 |
+
"INFO:MiVOLO:\tage: 29.64\n",
|
| 487 |
+
"INFO:MiVOLO:\tgender: female [99%]\n",
|
| 488 |
+
"ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/ad4b5481-0a58-4196-a3e8-fca2fe22a3cc/width=1080/ad4b5481-0a58-4196-a3e8-fca2fe22a3cc.mp4: cannot identify image file <_io.BytesIO object at 0x14cb010c24d0>\n"
|
| 489 |
+
]
|
| 490 |
+
},
|
| 491 |
+
{
|
| 492 |
+
"name": "stdout",
|
| 493 |
+
"output_type": "stream",
|
| 494 |
+
"text": [
|
| 495 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-11.csv\n",
|
| 496 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2023-12.csv\n",
|
| 497 |
+
"\n",
|
| 498 |
+
"0: 640x384 1 person, 1 face, 461.0ms\n",
|
| 499 |
+
"Speed: 2.4ms preprocess, 461.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 384)\n"
|
| 500 |
+
]
|
| 501 |
+
},
|
| 502 |
+
{
|
| 503 |
+
"name": "stderr",
|
| 504 |
+
"output_type": "stream",
|
| 505 |
+
"text": [
|
| 506 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 507 |
+
"INFO:MiVOLO:\tage: 19.61\n",
|
| 508 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 509 |
+
]
|
| 510 |
+
},
|
| 511 |
+
{
|
| 512 |
+
"name": "stdout",
|
| 513 |
+
"output_type": "stream",
|
| 514 |
+
"text": [
|
| 515 |
+
"Processed and saved 1 images so far.\n",
|
| 516 |
+
"\n",
|
| 517 |
+
"0: 640x448 1 person, 1 face, 501.3ms\n",
|
| 518 |
+
"Speed: 3.1ms preprocess, 501.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 519 |
+
]
|
| 520 |
+
},
|
| 521 |
+
{
|
| 522 |
+
"name": "stderr",
|
| 523 |
+
"output_type": "stream",
|
| 524 |
+
"text": [
|
| 525 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 526 |
+
"INFO:MiVOLO:\tage: 22.58\n",
|
| 527 |
+
"INFO:MiVOLO:\tgender: female [99%]\n",
|
| 528 |
+
"ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3004b5fa-af81-4de7-829d-1d809d70b878/width=512/3004b5fa-af81-4de7-829d-1d809d70b878.mp4: cannot identify image file <_io.BytesIO object at 0x14cb0020ad40>\n"
|
| 529 |
+
]
|
| 530 |
+
},
|
| 531 |
+
{
|
| 532 |
+
"name": "stdout",
|
| 533 |
+
"output_type": "stream",
|
| 534 |
+
"text": [
|
| 535 |
+
"\n",
|
| 536 |
+
"0: 640x640 (no detections), 842.5ms\n",
|
| 537 |
+
"Speed: 4.5ms preprocess, 842.5ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 538 |
+
"\n",
|
| 539 |
+
"0: 640x416 (no detections), 446.8ms\n",
|
| 540 |
+
"Speed: 2.5ms preprocess, 446.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 416)\n",
|
| 541 |
+
"Processed and saved 5 images so far.\n",
|
| 542 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2023-12.csv\n",
|
| 543 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-01.csv\n",
|
| 544 |
+
"\n",
|
| 545 |
+
"0: 640x448 (no detections), 638.5ms\n",
|
| 546 |
+
"Speed: 2.3ms preprocess, 638.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 547 |
+
"Processed and saved 1 images so far.\n",
|
| 548 |
+
"\n",
|
| 549 |
+
"0: 640x416 (no detections), 441.7ms\n",
|
| 550 |
+
"Speed: 2.5ms preprocess, 441.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 416)\n",
|
| 551 |
+
"\n",
|
| 552 |
+
"0: 640x448 (no detections), 470.3ms\n",
|
| 553 |
+
"Speed: 2.3ms preprocess, 470.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 554 |
+
"\n",
|
| 555 |
+
"0: 640x448 (no detections), 693.9ms\n",
|
| 556 |
+
"Speed: 2.5ms preprocess, 693.9ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 557 |
+
"\n",
|
| 558 |
+
"0: 640x512 1 person, 1 face, 808.6ms\n",
|
| 559 |
+
"Speed: 3.2ms preprocess, 808.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
|
| 560 |
+
]
|
| 561 |
+
},
|
| 562 |
+
{
|
| 563 |
+
"name": "stderr",
|
| 564 |
+
"output_type": "stream",
|
| 565 |
+
"text": [
|
| 566 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 567 |
+
"INFO:MiVOLO:\tage: 15.01\n",
|
| 568 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 569 |
+
]
|
| 570 |
+
},
|
| 571 |
+
{
|
| 572 |
+
"name": "stdout",
|
| 573 |
+
"output_type": "stream",
|
| 574 |
+
"text": [
|
| 575 |
+
"\n",
|
| 576 |
+
"0: 640x320 1 person, 1 face, 345.6ms\n",
|
| 577 |
+
"Speed: 2.0ms preprocess, 345.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 320)\n"
|
| 578 |
+
]
|
| 579 |
+
},
|
| 580 |
+
{
|
| 581 |
+
"name": "stderr",
|
| 582 |
+
"output_type": "stream",
|
| 583 |
+
"text": [
|
| 584 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 585 |
+
"INFO:MiVOLO:\tage: 20.86\n",
|
| 586 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 587 |
+
]
|
| 588 |
+
},
|
| 589 |
+
{
|
| 590 |
+
"name": "stdout",
|
| 591 |
+
"output_type": "stream",
|
| 592 |
+
"text": [
|
| 593 |
+
"Processed and saved 6 images so far.\n",
|
| 594 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-01.csv\n",
|
| 595 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-02.csv\n",
|
| 596 |
+
"\n",
|
| 597 |
+
"0: 640x384 1 person, 1 face, 387.8ms\n",
|
| 598 |
+
"Speed: 1.9ms preprocess, 387.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 384)\n"
|
| 599 |
+
]
|
| 600 |
+
},
|
| 601 |
+
{
|
| 602 |
+
"name": "stderr",
|
| 603 |
+
"output_type": "stream",
|
| 604 |
+
"text": [
|
| 605 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 606 |
+
"INFO:MiVOLO:\tage: 17.31\n",
|
| 607 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 608 |
+
]
|
| 609 |
+
},
|
| 610 |
+
{
|
| 611 |
+
"name": "stdout",
|
| 612 |
+
"output_type": "stream",
|
| 613 |
+
"text": [
|
| 614 |
+
"Processed and saved 1 images so far.\n",
|
| 615 |
+
"\n",
|
| 616 |
+
"0: 640x480 1 person, 1 face, 540.4ms\n",
|
| 617 |
+
"Speed: 2.5ms preprocess, 540.4ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 480)\n"
|
| 618 |
+
]
|
| 619 |
+
},
|
| 620 |
+
{
|
| 621 |
+
"name": "stderr",
|
| 622 |
+
"output_type": "stream",
|
| 623 |
+
"text": [
|
| 624 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 625 |
+
"INFO:MiVOLO:\tage: 17.47\n",
|
| 626 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 627 |
+
]
|
| 628 |
+
},
|
| 629 |
+
{
|
| 630 |
+
"name": "stdout",
|
| 631 |
+
"output_type": "stream",
|
| 632 |
+
"text": [
|
| 633 |
+
"\n",
|
| 634 |
+
"0: 640x640 1 person, 1 face, 713.1ms\n",
|
| 635 |
+
"Speed: 3.8ms preprocess, 713.1ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 640)\n"
|
| 636 |
+
]
|
| 637 |
+
},
|
| 638 |
+
{
|
| 639 |
+
"name": "stderr",
|
| 640 |
+
"output_type": "stream",
|
| 641 |
+
"text": [
|
| 642 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 643 |
+
"INFO:MiVOLO:\tage: 17.85\n",
|
| 644 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 645 |
+
]
|
| 646 |
+
},
|
| 647 |
+
{
|
| 648 |
+
"name": "stdout",
|
| 649 |
+
"output_type": "stream",
|
| 650 |
+
"text": [
|
| 651 |
+
"\n",
|
| 652 |
+
"0: 640x640 (no detections), 778.8ms\n",
|
| 653 |
+
"Speed: 28.7ms preprocess, 778.8ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 654 |
+
"\n",
|
| 655 |
+
"0: 640x448 1 person, 1 face, 528.2ms\n",
|
| 656 |
+
"Speed: 2.3ms preprocess, 528.2ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 657 |
+
]
|
| 658 |
+
},
|
| 659 |
+
{
|
| 660 |
+
"name": "stderr",
|
| 661 |
+
"output_type": "stream",
|
| 662 |
+
"text": [
|
| 663 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 664 |
+
"INFO:MiVOLO:\tage: 21.63\n",
|
| 665 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 666 |
+
]
|
| 667 |
+
},
|
| 668 |
+
{
|
| 669 |
+
"name": "stdout",
|
| 670 |
+
"output_type": "stream",
|
| 671 |
+
"text": [
|
| 672 |
+
"\n",
|
| 673 |
+
"0: 640x448 1 person, 1 face, 518.4ms\n",
|
| 674 |
+
"Speed: 3.9ms preprocess, 518.4ms inference, 0.9ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 675 |
+
]
|
| 676 |
+
},
|
| 677 |
+
{
|
| 678 |
+
"name": "stderr",
|
| 679 |
+
"output_type": "stream",
|
| 680 |
+
"text": [
|
| 681 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 682 |
+
"INFO:MiVOLO:\tage: 18.25\n",
|
| 683 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 684 |
+
]
|
| 685 |
+
},
|
| 686 |
+
{
|
| 687 |
+
"name": "stdout",
|
| 688 |
+
"output_type": "stream",
|
| 689 |
+
"text": [
|
| 690 |
+
"\n",
|
| 691 |
+
"0: 640x448 1 person, 1 face, 470.7ms\n",
|
| 692 |
+
"Speed: 2.5ms preprocess, 470.7ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 693 |
+
]
|
| 694 |
+
},
|
| 695 |
+
{
|
| 696 |
+
"name": "stderr",
|
| 697 |
+
"output_type": "stream",
|
| 698 |
+
"text": [
|
| 699 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 700 |
+
"INFO:MiVOLO:\tage: 20.51\n",
|
| 701 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 702 |
+
]
|
| 703 |
+
},
|
| 704 |
+
{
|
| 705 |
+
"name": "stdout",
|
| 706 |
+
"output_type": "stream",
|
| 707 |
+
"text": [
|
| 708 |
+
"\n",
|
| 709 |
+
"0: 640x480 1 person, 1 face, 647.1ms\n",
|
| 710 |
+
"Speed: 2.4ms preprocess, 647.1ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 480)\n"
|
| 711 |
+
]
|
| 712 |
+
},
|
| 713 |
+
{
|
| 714 |
+
"name": "stderr",
|
| 715 |
+
"output_type": "stream",
|
| 716 |
+
"text": [
|
| 717 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 718 |
+
"INFO:MiVOLO:\tage: 58.87\n",
|
| 719 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 720 |
+
]
|
| 721 |
+
},
|
| 722 |
+
{
|
| 723 |
+
"name": "stdout",
|
| 724 |
+
"output_type": "stream",
|
| 725 |
+
"text": [
|
| 726 |
+
"\n",
|
| 727 |
+
"0: 640x448 (no detections), 469.8ms\n",
|
| 728 |
+
"Speed: 2.6ms preprocess, 469.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 729 |
+
"\n",
|
| 730 |
+
"0: 640x448 1 person, 1 face, 477.5ms\n",
|
| 731 |
+
"Speed: 2.3ms preprocess, 477.5ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 732 |
+
]
|
| 733 |
+
},
|
| 734 |
+
{
|
| 735 |
+
"name": "stderr",
|
| 736 |
+
"output_type": "stream",
|
| 737 |
+
"text": [
|
| 738 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 739 |
+
"INFO:MiVOLO:\tage: 23.79\n",
|
| 740 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 741 |
+
]
|
| 742 |
+
},
|
| 743 |
+
{
|
| 744 |
+
"name": "stdout",
|
| 745 |
+
"output_type": "stream",
|
| 746 |
+
"text": [
|
| 747 |
+
"Processed and saved 10 images so far.\n",
|
| 748 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-02.csv\n",
|
| 749 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-03.csv\n",
|
| 750 |
+
"\n",
|
| 751 |
+
"0: 640x448 1 face, 511.4ms\n",
|
| 752 |
+
"Speed: 2.5ms preprocess, 511.4ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 753 |
+
]
|
| 754 |
+
},
|
| 755 |
+
{
|
| 756 |
+
"name": "stderr",
|
| 757 |
+
"output_type": "stream",
|
| 758 |
+
"text": [
|
| 759 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 760 |
+
"INFO:MiVOLO:\tage: 24.87\n",
|
| 761 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 762 |
+
]
|
| 763 |
+
},
|
| 764 |
+
{
|
| 765 |
+
"name": "stdout",
|
| 766 |
+
"output_type": "stream",
|
| 767 |
+
"text": [
|
| 768 |
+
"Processed and saved 1 images so far.\n",
|
| 769 |
+
"\n",
|
| 770 |
+
"0: 640x544 (no detections), 576.5ms\n",
|
| 771 |
+
"Speed: 2.9ms preprocess, 576.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 544)\n",
|
| 772 |
+
"\n",
|
| 773 |
+
"0: 640x448 1 person, 1 face, 687.1ms\n",
|
| 774 |
+
"Speed: 9.9ms preprocess, 687.1ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 775 |
+
]
|
| 776 |
+
},
|
| 777 |
+
{
|
| 778 |
+
"name": "stderr",
|
| 779 |
+
"output_type": "stream",
|
| 780 |
+
"text": [
|
| 781 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 782 |
+
"INFO:MiVOLO:\tage: 25.76\n",
|
| 783 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 784 |
+
]
|
| 785 |
+
},
|
| 786 |
+
{
|
| 787 |
+
"name": "stdout",
|
| 788 |
+
"output_type": "stream",
|
| 789 |
+
"text": [
|
| 790 |
+
"\n",
|
| 791 |
+
"0: 640x448 (no detections), 498.3ms\n",
|
| 792 |
+
"Speed: 2.3ms preprocess, 498.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 793 |
+
"\n",
|
| 794 |
+
"0: 640x512 (no detections), 573.2ms\n",
|
| 795 |
+
"Speed: 3.0ms preprocess, 573.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
|
| 796 |
+
"Processed and saved 5 images so far.\n",
|
| 797 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-03.csv\n",
|
| 798 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-04.csv\n",
|
| 799 |
+
"\n",
|
| 800 |
+
"0: 640x384 (no detections), 518.2ms\n",
|
| 801 |
+
"Speed: 2.7ms preprocess, 518.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 384)\n",
|
| 802 |
+
"Processed and saved 1 images so far.\n",
|
| 803 |
+
"\n",
|
| 804 |
+
"0: 640x512 (no detections), 707.7ms\n",
|
| 805 |
+
"Speed: 3.6ms preprocess, 707.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
|
| 806 |
+
"\n",
|
| 807 |
+
"0: 640x416 (no detections), 453.7ms\n",
|
| 808 |
+
"Speed: 2.4ms preprocess, 453.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 416)\n",
|
| 809 |
+
"\n",
|
| 810 |
+
"0: 640x384 (no detections), 391.0ms\n",
|
| 811 |
+
"Speed: 2.0ms preprocess, 391.0ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 384)\n",
|
| 812 |
+
"\n",
|
| 813 |
+
"0: 640x448 1 person, 1 face, 449.8ms\n",
|
| 814 |
+
"Speed: 2.3ms preprocess, 449.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 815 |
+
]
|
| 816 |
+
},
|
| 817 |
+
{
|
| 818 |
+
"name": "stderr",
|
| 819 |
+
"output_type": "stream",
|
| 820 |
+
"text": [
|
| 821 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 822 |
+
"INFO:MiVOLO:\tage: 22.39\n",
|
| 823 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 824 |
+
]
|
| 825 |
+
},
|
| 826 |
+
{
|
| 827 |
+
"name": "stdout",
|
| 828 |
+
"output_type": "stream",
|
| 829 |
+
"text": [
|
| 830 |
+
"\n",
|
| 831 |
+
"0: 640x448 (no detections), 618.4ms\n",
|
| 832 |
+
"Speed: 2.3ms preprocess, 618.4ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 833 |
+
"\n",
|
| 834 |
+
"0: 640x448 1 person, 1 face, 631.0ms\n",
|
| 835 |
+
"Speed: 2.2ms preprocess, 631.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 836 |
+
]
|
| 837 |
+
},
|
| 838 |
+
{
|
| 839 |
+
"name": "stderr",
|
| 840 |
+
"output_type": "stream",
|
| 841 |
+
"text": [
|
| 842 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 843 |
+
"INFO:MiVOLO:\tage: 24.05\n",
|
| 844 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 845 |
+
]
|
| 846 |
+
},
|
| 847 |
+
{
|
| 848 |
+
"name": "stdout",
|
| 849 |
+
"output_type": "stream",
|
| 850 |
+
"text": [
|
| 851 |
+
"\n",
|
| 852 |
+
"0: 640x512 1 person, 496.4ms\n",
|
| 853 |
+
"Speed: 2.6ms preprocess, 496.4ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
|
| 854 |
+
]
|
| 855 |
+
},
|
| 856 |
+
{
|
| 857 |
+
"name": "stderr",
|
| 858 |
+
"output_type": "stream",
|
| 859 |
+
"text": [
|
| 860 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 861 |
+
"INFO:MiVOLO:\tage: 22.81\n",
|
| 862 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 863 |
+
]
|
| 864 |
+
},
|
| 865 |
+
{
|
| 866 |
+
"name": "stdout",
|
| 867 |
+
"output_type": "stream",
|
| 868 |
+
"text": [
|
| 869 |
+
"\n",
|
| 870 |
+
"0: 640x448 (no detections), 442.8ms\n",
|
| 871 |
+
"Speed: 2.3ms preprocess, 442.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 872 |
+
"\n",
|
| 873 |
+
"0: 640x448 1 person, 1 face, 477.7ms\n",
|
| 874 |
+
"Speed: 2.4ms preprocess, 477.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 875 |
+
]
|
| 876 |
+
},
|
| 877 |
+
{
|
| 878 |
+
"name": "stderr",
|
| 879 |
+
"output_type": "stream",
|
| 880 |
+
"text": [
|
| 881 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 882 |
+
"INFO:MiVOLO:\tage: 21.62\n",
|
| 883 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 884 |
+
]
|
| 885 |
+
},
|
| 886 |
+
{
|
| 887 |
+
"name": "stdout",
|
| 888 |
+
"output_type": "stream",
|
| 889 |
+
"text": [
|
| 890 |
+
"\n",
|
| 891 |
+
"0: 640x448 1 person, 1 face, 447.0ms\n",
|
| 892 |
+
"Speed: 2.2ms preprocess, 447.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 893 |
+
]
|
| 894 |
+
},
|
| 895 |
+
{
|
| 896 |
+
"name": "stderr",
|
| 897 |
+
"output_type": "stream",
|
| 898 |
+
"text": [
|
| 899 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 900 |
+
"INFO:MiVOLO:\tage: 54.31\n",
|
| 901 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 902 |
+
]
|
| 903 |
+
},
|
| 904 |
+
{
|
| 905 |
+
"name": "stdout",
|
| 906 |
+
"output_type": "stream",
|
| 907 |
+
"text": [
|
| 908 |
+
"\n",
|
| 909 |
+
"0: 640x640 (no detections), 819.0ms\n",
|
| 910 |
+
"Speed: 3.6ms preprocess, 819.0ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 911 |
+
"\n",
|
| 912 |
+
"0: 640x448 1 person, 1 face, 478.2ms\n",
|
| 913 |
+
"Speed: 1.8ms preprocess, 478.2ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 914 |
+
]
|
| 915 |
+
},
|
| 916 |
+
{
|
| 917 |
+
"name": "stderr",
|
| 918 |
+
"output_type": "stream",
|
| 919 |
+
"text": [
|
| 920 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 921 |
+
"INFO:MiVOLO:\tage: 20.56\n",
|
| 922 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 923 |
+
]
|
| 924 |
+
},
|
| 925 |
+
{
|
| 926 |
+
"name": "stdout",
|
| 927 |
+
"output_type": "stream",
|
| 928 |
+
"text": [
|
| 929 |
+
"\n",
|
| 930 |
+
"0: 640x448 1 person, 1 face, 471.2ms\n",
|
| 931 |
+
"Speed: 2.7ms preprocess, 471.2ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 932 |
+
]
|
| 933 |
+
},
|
| 934 |
+
{
|
| 935 |
+
"name": "stderr",
|
| 936 |
+
"output_type": "stream",
|
| 937 |
+
"text": [
|
| 938 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 939 |
+
"INFO:MiVOLO:\tage: 21.31\n",
|
| 940 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 941 |
+
]
|
| 942 |
+
},
|
| 943 |
+
{
|
| 944 |
+
"name": "stdout",
|
| 945 |
+
"output_type": "stream",
|
| 946 |
+
"text": [
|
| 947 |
+
"\n",
|
| 948 |
+
"0: 640x448 (no detections), 484.0ms\n",
|
| 949 |
+
"Speed: 2.2ms preprocess, 484.0ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 950 |
+
"\n",
|
| 951 |
+
"0: 640x640 (no detections), 832.6ms\n",
|
| 952 |
+
"Speed: 3.0ms preprocess, 832.6ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 640)\n",
|
| 953 |
+
"\n",
|
| 954 |
+
"0: 640x448 1 person, 1 face, 508.9ms\n",
|
| 955 |
+
"Speed: 2.5ms preprocess, 508.9ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 956 |
+
]
|
| 957 |
+
},
|
| 958 |
+
{
|
| 959 |
+
"name": "stderr",
|
| 960 |
+
"output_type": "stream",
|
| 961 |
+
"text": [
|
| 962 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 963 |
+
"INFO:MiVOLO:\tage: 27.19\n",
|
| 964 |
+
"INFO:MiVOLO:\tgender: female [99%]\n",
|
| 965 |
+
"ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/6879c7b9-5cb3-42db-b409-30b4e2f71945/width=1080/6879c7b9-5cb3-42db-b409-30b4e2f71945.mp4: cannot identify image file <_io.BytesIO object at 0x14cb0020ad40>\n"
|
| 966 |
+
]
|
| 967 |
+
},
|
| 968 |
+
{
|
| 969 |
+
"name": "stdout",
|
| 970 |
+
"output_type": "stream",
|
| 971 |
+
"text": [
|
| 972 |
+
"\n",
|
| 973 |
+
"0: 640x448 9 persons, 461.8ms\n",
|
| 974 |
+
"Speed: 2.4ms preprocess, 461.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 975 |
+
]
|
| 976 |
+
},
|
| 977 |
+
{
|
| 978 |
+
"name": "stderr",
|
| 979 |
+
"output_type": "stream",
|
| 980 |
+
"text": [
|
| 981 |
+
"INFO:MiVOLO:faces_input: torch.Size([9, 3, 224, 224]), person_input: torch.Size([9, 3, 224, 224])\n",
|
| 982 |
+
"INFO:MiVOLO:\tage: 30.4\n",
|
| 983 |
+
"INFO:MiVOLO:\tgender: male [55%]\n",
|
| 984 |
+
"INFO:MiVOLO:\tage: 28.89\n",
|
| 985 |
+
"INFO:MiVOLO:\tgender: female [63%]\n",
|
| 986 |
+
"INFO:MiVOLO:\tage: 30.31\n",
|
| 987 |
+
"INFO:MiVOLO:\tgender: female [68%]\n",
|
| 988 |
+
"INFO:MiVOLO:\tage: 31.62\n",
|
| 989 |
+
"INFO:MiVOLO:\tgender: female [53%]\n",
|
| 990 |
+
"INFO:MiVOLO:\tage: 35.17\n",
|
| 991 |
+
"INFO:MiVOLO:\tgender: male [53%]\n",
|
| 992 |
+
"INFO:MiVOLO:\tage: 33.02\n",
|
| 993 |
+
"INFO:MiVOLO:\tgender: male [95%]\n",
|
| 994 |
+
"INFO:MiVOLO:\tage: 35.17\n",
|
| 995 |
+
"INFO:MiVOLO:\tgender: male [53%]\n",
|
| 996 |
+
"INFO:MiVOLO:\tage: 35.17\n",
|
| 997 |
+
"INFO:MiVOLO:\tgender: male [53%]\n",
|
| 998 |
+
"INFO:MiVOLO:\tage: 35.17\n",
|
| 999 |
+
"INFO:MiVOLO:\tgender: male [53%]\n"
|
| 1000 |
+
]
|
| 1001 |
+
},
|
| 1002 |
+
{
|
| 1003 |
+
"name": "stdout",
|
| 1004 |
+
"output_type": "stream",
|
| 1005 |
+
"text": [
|
| 1006 |
+
"Processed and saved 19 images so far.\n",
|
| 1007 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-04.csv\n",
|
| 1008 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-05.csv\n",
|
| 1009 |
+
"\n",
|
| 1010 |
+
"0: 640x448 1 person, 455.5ms\n",
|
| 1011 |
+
"Speed: 2.2ms preprocess, 455.5ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1012 |
+
]
|
| 1013 |
+
},
|
| 1014 |
+
{
|
| 1015 |
+
"name": "stderr",
|
| 1016 |
+
"output_type": "stream",
|
| 1017 |
+
"text": [
|
| 1018 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1019 |
+
"INFO:MiVOLO:\tage: 37.57\n",
|
| 1020 |
+
"INFO:MiVOLO:\tgender: male [95%]\n"
|
| 1021 |
+
]
|
| 1022 |
+
},
|
| 1023 |
+
{
|
| 1024 |
+
"name": "stdout",
|
| 1025 |
+
"output_type": "stream",
|
| 1026 |
+
"text": [
|
| 1027 |
+
"Processed and saved 1 images so far.\n",
|
| 1028 |
+
"\n",
|
| 1029 |
+
"0: 640x448 1 person, 1 face, 438.7ms\n",
|
| 1030 |
+
"Speed: 2.2ms preprocess, 438.7ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1031 |
+
]
|
| 1032 |
+
},
|
| 1033 |
+
{
|
| 1034 |
+
"name": "stderr",
|
| 1035 |
+
"output_type": "stream",
|
| 1036 |
+
"text": [
|
| 1037 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1038 |
+
"INFO:MiVOLO:\tage: 15.62\n",
|
| 1039 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1040 |
+
]
|
| 1041 |
+
},
|
| 1042 |
+
{
|
| 1043 |
+
"name": "stdout",
|
| 1044 |
+
"output_type": "stream",
|
| 1045 |
+
"text": [
|
| 1046 |
+
"\n",
|
| 1047 |
+
"0: 640x448 (no detections), 444.8ms\n",
|
| 1048 |
+
"Speed: 2.3ms preprocess, 444.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1049 |
+
]
|
| 1050 |
+
},
|
| 1051 |
+
{
|
| 1052 |
+
"name": "stderr",
|
| 1053 |
+
"output_type": "stream",
|
| 1054 |
+
"text": [
|
| 1055 |
+
"ERROR:inference:Unidentified image error for URL https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/6032bd70-6d53-4007-9e89-e69d4748efb5/width=528/6032bd70-6d53-4007-9e89-e69d4748efb5.mp4: cannot identify image file <_io.BytesIO object at 0x14cb00ed0950>\n"
|
| 1056 |
+
]
|
| 1057 |
+
},
|
| 1058 |
+
{
|
| 1059 |
+
"name": "stdout",
|
| 1060 |
+
"output_type": "stream",
|
| 1061 |
+
"text": [
|
| 1062 |
+
"\n",
|
| 1063 |
+
"0: 640x448 (no detections), 453.9ms\n",
|
| 1064 |
+
"Speed: 2.3ms preprocess, 453.9ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1065 |
+
"\n",
|
| 1066 |
+
"0: 640x448 1 person, 1 face, 475.0ms\n",
|
| 1067 |
+
"Speed: 1.6ms preprocess, 475.0ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1068 |
+
]
|
| 1069 |
+
},
|
| 1070 |
+
{
|
| 1071 |
+
"name": "stderr",
|
| 1072 |
+
"output_type": "stream",
|
| 1073 |
+
"text": [
|
| 1074 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1075 |
+
"INFO:MiVOLO:\tage: 22.5\n",
|
| 1076 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1077 |
+
]
|
| 1078 |
+
},
|
| 1079 |
+
{
|
| 1080 |
+
"name": "stdout",
|
| 1081 |
+
"output_type": "stream",
|
| 1082 |
+
"text": [
|
| 1083 |
+
"\n",
|
| 1084 |
+
"0: 640x448 1 person, 1 face, 447.6ms\n",
|
| 1085 |
+
"Speed: 2.5ms preprocess, 447.6ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1086 |
+
]
|
| 1087 |
+
},
|
| 1088 |
+
{
|
| 1089 |
+
"name": "stderr",
|
| 1090 |
+
"output_type": "stream",
|
| 1091 |
+
"text": [
|
| 1092 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1093 |
+
"INFO:MiVOLO:\tage: 23.46\n",
|
| 1094 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1095 |
+
]
|
| 1096 |
+
},
|
| 1097 |
+
{
|
| 1098 |
+
"name": "stdout",
|
| 1099 |
+
"output_type": "stream",
|
| 1100 |
+
"text": [
|
| 1101 |
+
"\n",
|
| 1102 |
+
"0: 640x512 (no detections), 528.5ms\n",
|
| 1103 |
+
"Speed: 3.2ms preprocess, 528.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
|
| 1104 |
+
"\n",
|
| 1105 |
+
"0: 640x448 1 person, 1 face, 449.8ms\n",
|
| 1106 |
+
"Speed: 2.3ms preprocess, 449.8ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1107 |
+
]
|
| 1108 |
+
},
|
| 1109 |
+
{
|
| 1110 |
+
"name": "stderr",
|
| 1111 |
+
"output_type": "stream",
|
| 1112 |
+
"text": [
|
| 1113 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1114 |
+
"INFO:MiVOLO:\tage: 29.32\n",
|
| 1115 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1116 |
+
]
|
| 1117 |
+
},
|
| 1118 |
+
{
|
| 1119 |
+
"name": "stdout",
|
| 1120 |
+
"output_type": "stream",
|
| 1121 |
+
"text": [
|
| 1122 |
+
"\n",
|
| 1123 |
+
"0: 640x448 1 person, 1 face, 617.7ms\n",
|
| 1124 |
+
"Speed: 2.4ms preprocess, 617.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1125 |
+
]
|
| 1126 |
+
},
|
| 1127 |
+
{
|
| 1128 |
+
"name": "stderr",
|
| 1129 |
+
"output_type": "stream",
|
| 1130 |
+
"text": [
|
| 1131 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1132 |
+
"INFO:MiVOLO:\tage: 21.32\n",
|
| 1133 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1134 |
+
]
|
| 1135 |
+
},
|
| 1136 |
+
{
|
| 1137 |
+
"name": "stdout",
|
| 1138 |
+
"output_type": "stream",
|
| 1139 |
+
"text": [
|
| 1140 |
+
"\n",
|
| 1141 |
+
"0: 640x448 (no detections), 609.1ms\n",
|
| 1142 |
+
"Speed: 2.3ms preprocess, 609.1ms inference, 0.5ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1143 |
+
"\n",
|
| 1144 |
+
"0: 640x448 (no detections), 436.2ms\n",
|
| 1145 |
+
"Speed: 2.5ms preprocess, 436.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1146 |
+
"\n",
|
| 1147 |
+
"0: 640x512 1 person, 1 face, 585.6ms\n",
|
| 1148 |
+
"Speed: 3.1ms preprocess, 585.6ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
|
| 1149 |
+
]
|
| 1150 |
+
},
|
| 1151 |
+
{
|
| 1152 |
+
"name": "stderr",
|
| 1153 |
+
"output_type": "stream",
|
| 1154 |
+
"text": [
|
| 1155 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1156 |
+
"INFO:MiVOLO:\tage: 20.5\n",
|
| 1157 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1158 |
+
]
|
| 1159 |
+
},
|
| 1160 |
+
{
|
| 1161 |
+
"name": "stdout",
|
| 1162 |
+
"output_type": "stream",
|
| 1163 |
+
"text": [
|
| 1164 |
+
"\n",
|
| 1165 |
+
"0: 640x448 1 person, 457.3ms\n",
|
| 1166 |
+
"Speed: 2.1ms preprocess, 457.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1167 |
+
]
|
| 1168 |
+
},
|
| 1169 |
+
{
|
| 1170 |
+
"name": "stderr",
|
| 1171 |
+
"output_type": "stream",
|
| 1172 |
+
"text": [
|
| 1173 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1174 |
+
"INFO:MiVOLO:\tage: 25.19\n",
|
| 1175 |
+
"INFO:MiVOLO:\tgender: male [81%]\n"
|
| 1176 |
+
]
|
| 1177 |
+
},
|
| 1178 |
+
{
|
| 1179 |
+
"name": "stdout",
|
| 1180 |
+
"output_type": "stream",
|
| 1181 |
+
"text": [
|
| 1182 |
+
"Processed and saved 14 images so far.\n",
|
| 1183 |
+
"Processed and saved to: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/MiVOLO-results/2024-05.csv\n",
|
| 1184 |
+
"Processing: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/image_subsets/Civiverse-mini-by-month/Civiverse-2024-06.csv\n",
|
| 1185 |
+
"\n",
|
| 1186 |
+
"0: 640x448 (no detections), 484.5ms\n",
|
| 1187 |
+
"Speed: 2.8ms preprocess, 484.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1188 |
+
"Processed and saved 1 images so far.\n",
|
| 1189 |
+
"\n",
|
| 1190 |
+
"0: 640x512 (no detections), 524.8ms\n",
|
| 1191 |
+
"Speed: 2.9ms preprocess, 524.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 512)\n",
|
| 1192 |
+
"\n",
|
| 1193 |
+
"0: 640x480 1 person, 478.0ms\n",
|
| 1194 |
+
"Speed: 2.6ms preprocess, 478.0ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 480)\n"
|
| 1195 |
+
]
|
| 1196 |
+
},
|
| 1197 |
+
{
|
| 1198 |
+
"name": "stderr",
|
| 1199 |
+
"output_type": "stream",
|
| 1200 |
+
"text": [
|
| 1201 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1202 |
+
"INFO:MiVOLO:\tage: 39.4\n",
|
| 1203 |
+
"INFO:MiVOLO:\tgender: male [99%]\n"
|
| 1204 |
+
]
|
| 1205 |
+
},
|
| 1206 |
+
{
|
| 1207 |
+
"name": "stdout",
|
| 1208 |
+
"output_type": "stream",
|
| 1209 |
+
"text": [
|
| 1210 |
+
"\n",
|
| 1211 |
+
"0: 640x512 1 person, 1 face, 539.8ms\n",
|
| 1212 |
+
"Speed: 2.6ms preprocess, 539.8ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 512)\n"
|
| 1213 |
+
]
|
| 1214 |
+
},
|
| 1215 |
+
{
|
| 1216 |
+
"name": "stderr",
|
| 1217 |
+
"output_type": "stream",
|
| 1218 |
+
"text": [
|
| 1219 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1220 |
+
"INFO:MiVOLO:\tage: 21.33\n",
|
| 1221 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1222 |
+
]
|
| 1223 |
+
},
|
| 1224 |
+
{
|
| 1225 |
+
"name": "stdout",
|
| 1226 |
+
"output_type": "stream",
|
| 1227 |
+
"text": [
|
| 1228 |
+
"\n",
|
| 1229 |
+
"0: 640x448 1 person, 2 faces, 446.7ms\n",
|
| 1230 |
+
"Speed: 2.4ms preprocess, 446.7ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1231 |
+
]
|
| 1232 |
+
},
|
| 1233 |
+
{
|
| 1234 |
+
"name": "stderr",
|
| 1235 |
+
"output_type": "stream",
|
| 1236 |
+
"text": [
|
| 1237 |
+
"INFO:MiVOLO:faces_input: torch.Size([2, 3, 224, 224]), person_input: torch.Size([2, 3, 224, 224])\n",
|
| 1238 |
+
"INFO:MiVOLO:\tage: 20.65\n",
|
| 1239 |
+
"INFO:MiVOLO:\tgender: female [99%]\n",
|
| 1240 |
+
"INFO:MiVOLO:\tage: 20.53\n",
|
| 1241 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1242 |
+
]
|
| 1243 |
+
},
|
| 1244 |
+
{
|
| 1245 |
+
"name": "stdout",
|
| 1246 |
+
"output_type": "stream",
|
| 1247 |
+
"text": [
|
| 1248 |
+
"\n",
|
| 1249 |
+
"0: 640x640 1 person, 1 face, 655.0ms\n",
|
| 1250 |
+
"Speed: 3.3ms preprocess, 655.0ms inference, 0.8ms postprocess per image at shape (1, 3, 640, 640)\n"
|
| 1251 |
+
]
|
| 1252 |
+
},
|
| 1253 |
+
{
|
| 1254 |
+
"name": "stderr",
|
| 1255 |
+
"output_type": "stream",
|
| 1256 |
+
"text": [
|
| 1257 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1258 |
+
"INFO:MiVOLO:\tage: 26.34\n",
|
| 1259 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1260 |
+
]
|
| 1261 |
+
},
|
| 1262 |
+
{
|
| 1263 |
+
"name": "stdout",
|
| 1264 |
+
"output_type": "stream",
|
| 1265 |
+
"text": [
|
| 1266 |
+
"\n",
|
| 1267 |
+
"0: 640x384 (no detections), 400.6ms\n",
|
| 1268 |
+
"Speed: 2.1ms preprocess, 400.6ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 384)\n",
|
| 1269 |
+
"\n",
|
| 1270 |
+
"0: 640x448 1 person, 587.9ms\n",
|
| 1271 |
+
"Speed: 2.2ms preprocess, 587.9ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 448)\n"
|
| 1272 |
+
]
|
| 1273 |
+
},
|
| 1274 |
+
{
|
| 1275 |
+
"name": "stderr",
|
| 1276 |
+
"output_type": "stream",
|
| 1277 |
+
"text": [
|
| 1278 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1279 |
+
"INFO:MiVOLO:\tage: 30.4\n",
|
| 1280 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1281 |
+
]
|
| 1282 |
+
},
|
| 1283 |
+
{
|
| 1284 |
+
"name": "stdout",
|
| 1285 |
+
"output_type": "stream",
|
| 1286 |
+
"text": [
|
| 1287 |
+
"\n",
|
| 1288 |
+
"0: 640x448 (no detections), 610.3ms\n",
|
| 1289 |
+
"Speed: 2.3ms preprocess, 610.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1290 |
+
"\n",
|
| 1291 |
+
"0: 640x448 (no detections), 453.6ms\n",
|
| 1292 |
+
"Speed: 2.3ms preprocess, 453.6ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1293 |
+
"\n",
|
| 1294 |
+
"0: 640x512 1 person, 1 face, 511.3ms\n",
|
| 1295 |
+
"Speed: 2.8ms preprocess, 511.3ms inference, 0.7ms postprocess per image at shape (1, 3, 640, 512)\n"
|
| 1296 |
+
]
|
| 1297 |
+
},
|
| 1298 |
+
{
|
| 1299 |
+
"name": "stderr",
|
| 1300 |
+
"output_type": "stream",
|
| 1301 |
+
"text": [
|
| 1302 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1303 |
+
"INFO:MiVOLO:\tage: 34.28\n",
|
| 1304 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1305 |
+
]
|
| 1306 |
+
},
|
| 1307 |
+
{
|
| 1308 |
+
"name": "stdout",
|
| 1309 |
+
"output_type": "stream",
|
| 1310 |
+
"text": [
|
| 1311 |
+
"\n",
|
| 1312 |
+
"0: 640x448 (no detections), 441.2ms\n",
|
| 1313 |
+
"Speed: 2.3ms preprocess, 441.2ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1314 |
+
"\n",
|
| 1315 |
+
"0: 640x448 (no detections), 586.3ms\n",
|
| 1316 |
+
"Speed: 2.3ms preprocess, 586.3ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1317 |
+
"\n",
|
| 1318 |
+
"0: 640x448 (no detections), 437.5ms\n",
|
| 1319 |
+
"Speed: 2.4ms preprocess, 437.5ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1320 |
+
"\n",
|
| 1321 |
+
"0: 448x640 1 person, 1 face, 437.4ms\n",
|
| 1322 |
+
"Speed: 2.4ms preprocess, 437.4ms inference, 0.7ms postprocess per image at shape (1, 3, 448, 640)\n"
|
| 1323 |
+
]
|
| 1324 |
+
},
|
| 1325 |
+
{
|
| 1326 |
+
"name": "stderr",
|
| 1327 |
+
"output_type": "stream",
|
| 1328 |
+
"text": [
|
| 1329 |
+
"INFO:MiVOLO:faces_input: torch.Size([1, 3, 224, 224]), person_input: torch.Size([1, 3, 224, 224])\n",
|
| 1330 |
+
"INFO:MiVOLO:\tage: 22.81\n",
|
| 1331 |
+
"INFO:MiVOLO:\tgender: female [99%]\n"
|
| 1332 |
+
]
|
| 1333 |
+
},
|
| 1334 |
+
{
|
| 1335 |
+
"name": "stdout",
|
| 1336 |
+
"output_type": "stream",
|
| 1337 |
+
"text": [
|
| 1338 |
+
"\n",
|
| 1339 |
+
"0: 640x448 (no detections), 436.8ms\n",
|
| 1340 |
+
"Speed: 2.6ms preprocess, 436.8ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1341 |
+
"\n",
|
| 1342 |
+
"0: 448x640 (no detections), 433.0ms\n",
|
| 1343 |
+
"Speed: 1.9ms preprocess, 433.0ms inference, 0.4ms postprocess per image at shape (1, 3, 448, 640)\n",
|
| 1344 |
+
"\n",
|
| 1345 |
+
"0: 640x448 (no detections), 599.7ms\n",
|
| 1346 |
+
"Speed: 2.5ms preprocess, 599.7ms inference, 0.4ms postprocess per image at shape (1, 3, 640, 448)\n",
|
| 1347 |
+
"\n"
|
| 1348 |
+
]
|
| 1349 |
+
}
|
| 1350 |
+
],
|
| 1351 |
+
"source": [
|
| 1352 |
+
"# Paths to the MiVOLO detector weights and checkpoint\n",
|
| 1353 |
+
"detector_weights = current_dir.parent / 'ext/MiVOLO/models/yolov8x_person_face.pt'\n",
|
| 1354 |
+
"checkpoint = current_dir.parent / 'ext/MiVOLO/models/mivolo_imbd.pth.tar'\n",
|
| 1355 |
+
"\n",
|
| 1356 |
+
"_logger = logging.getLogger(\"inference\")\n",
|
| 1357 |
+
"logging.basicConfig(level=logging.INFO)\n",
|
| 1358 |
+
"\n",
|
| 1359 |
+
"# Placeholder configuration and predictor initialization for MiVOLO\n",
|
| 1360 |
+
"class Config:\n",
|
| 1361 |
+
"    def __init__(self, detector_weights, checkpoint, device, with_persons=True, disable_faces=False, draw=False):\n",
|
| 1362 |
+
"        self.detector_weights = detector_weights\n",
|
| 1363 |
+
"        self.checkpoint = checkpoint\n",
|
| 1364 |
+
"        self.device = device\n",
|
| 1365 |
+
"        self.with_persons = with_persons\n",
|
| 1366 |
+
"        self.disable_faces = disable_faces\n",
|
| 1367 |
+
"        self.draw = draw\n",
|
| 1368 |
+
"\n",
|
| 1369 |
+
"\n",
|
| 1370 |
+
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
|
| 1371 |
+
"config = Config(detector_weights=detector_weights, checkpoint=checkpoint, device=device)\n",
|
| 1372 |
+
"predictor = Predictor(config, verbose=True)\n",
|
| 1373 |
+
"\n",
|
| 1374 |
+
"def download_image(url):\n",
|
| 1375 |
+
"    try:\n",
|
| 1376 |
+
"        response = requests.get(url, timeout=30)  # timeout (seconds) is an assumed value; avoids hanging on stalled downloads\n",
|
| 1377 |
+
"        response.raise_for_status()\n",
|
| 1378 |
+
"        return Image.open(BytesIO(response.content)).convert(\"RGB\")\n",
|
| 1379 |
+
"    except requests.RequestException as e:\n",
|
| 1380 |
+
"        _logger.error(f\"Failed to download image from {url}: {e}\")\n",
|
| 1381 |
+
"        return None\n",
|
| 1382 |
+
"    except UnidentifiedImageError as e:\n",
|
| 1383 |
+
"        _logger.error(f\"Unidentified image error for URL {url}: {e}\")\n",
|
| 1384 |
+
"        return None\n",
|
| 1385 |
+
"\n",
|
| 1386 |
+
"def process_images_with_progress(data, predictor, output_file, start_idx=0):\n",
|
| 1387 |
+
"    results = []\n",
|
| 1388 |
+
"    total_images = len(data)\n",
|
| 1389 |
+
"\n",
|
| 1390 |
+
"    for idx, row in data.iterrows():\n",
|
| 1391 |
+
"        if idx < start_idx:\n",
|
| 1392 |
+
"            continue\n",
|
| 1393 |
+
"\n",
|
| 1394 |
+
"        img_url = row[\"url\"]\n",
|
| 1395 |
+
"        pil_image = download_image(img_url)\n",
|
| 1396 |
+
"        if pil_image is None:\n",
|
| 1397 |
+
"            continue\n",
|
| 1398 |
+
"\n",
|
| 1399 |
+
"        np_image = np.array(pil_image)\n",
|
| 1400 |
+
"        np_image = cv2.cvtColor(np_image, cv2.COLOR_RGB2BGR)\n",
|
| 1401 |
+
"        detected_objects, _ = predictor.recognize(np_image)\n",
|
| 1402 |
+
"\n",
|
| 1403 |
+
"        row_result = row.to_dict()  # Start with the original row's data\n",
|
| 1404 |
+
"\n",
|
| 1405 |
+
"        if detected_objects and detected_objects.ages:\n",
|
| 1406 |
+
"            for i in range(len(detected_objects.ages)):\n",
|
| 1407 |
+
"                age = detected_objects.ages[i]\n",
|
| 1408 |
+
"                gender = detected_objects.genders[i]\n",
|
| 1409 |
+
"                gender_confidence = detected_objects.gender_scores[i]\n",
|
| 1410 |
+
"\n",
|
| 1411 |
+
"                if gender_confidence >= 0.83:\n",
|
| 1412 |
+
"                    detection = {\n",
|
| 1413 |
+
"                        \"detection_type\": 'face' if i in detected_objects.face_to_person_map else 'person',\n",
|
| 1414 |
+
"                        \"gender\": gender,\n",
|
| 1415 |
+
"                        \"gender_confidence\": gender_confidence,\n",
|
| 1416 |
+
"                        \"age\": age,\n",
|
| 1417 |
+
"                        \"n_persons\": detected_objects.n_persons,\n",
|
| 1418 |
+
"                        \"n_faces\": detected_objects.n_faces,\n",
|
| 1419 |
+
"                        \"detected\": True\n",
|
| 1420 |
+
"                    }\n",
|
| 1421 |
+
"                else:\n",
|
| 1422 |
+
"                    detection = {\n",
|
| 1423 |
+
"                        \"detection_type\": \"N/A\",\n",
|
| 1424 |
+
"                        \"gender\": \"N/A\",\n",
|
| 1425 |
+
"                        \"gender_confidence\": 0,\n",
|
| 1426 |
+
"                        \"age\": 0,\n",
|
| 1427 |
+
"                        \"n_persons\": detected_objects.n_persons if detected_objects else 0,\n",
|
| 1428 |
+
"                        \"n_faces\": detected_objects.n_faces if detected_objects else 0,\n",
|
| 1429 |
+
"                        \"detected\": False\n",
|
| 1430 |
+
"                    }\n",
|
| 1431 |
+
"\n",
|
| 1432 |
+
"                results.append({**row_result, **detection})\n",
|
| 1433 |
+
"        else:\n",
|
| 1434 |
+
"            detection = {\n",
|
| 1435 |
+
"                \"detection_type\": \"N/A\",\n",
|
| 1436 |
+
"                \"gender\": \"N/A\",\n",
|
| 1437 |
+
"                \"gender_confidence\": 0,\n",
|
| 1438 |
+
"                \"age\": 0,\n",
|
| 1439 |
+
"                \"n_persons\": detected_objects.n_persons if detected_objects else 0,\n",
|
| 1440 |
+
"                \"n_faces\": detected_objects.n_faces if detected_objects else 0,\n",
|
| 1441 |
+
"                \"detected\": False\n",
|
| 1442 |
+
"            }\n",
|
| 1443 |
+
"            results.append({**row_result, **detection})\n",
|
| 1444 |
+
"\n",
|
| 1445 |
+
"        if idx % 100 == 0 or idx == total_images - 1:\n",
|
| 1446 |
+
"            df = pd.DataFrame(results)\n",
|
| 1447 |
+
"            if os.path.exists(output_file):\n",
|
| 1448 |
+
"                df.to_csv(output_file, mode='a', header=False, index=False)\n",
|
| 1449 |
+
"            else:\n",
|
| 1450 |
+
"                df.to_csv(output_file, mode='w', header=True, index=False)\n",
|
| 1451 |
+
"            results = []\n",
|
| 1452 |
+
"            print(f\"Processed and saved {idx + 1} images so far.\")\n",
|
| 1453 |
+
"\n",
|
| 1454 |
+
"def generate_months(start, end):\n",
|
| 1455 |
+
"    start_date = datetime.strptime(start, '%Y-%m')\n",
|
| 1456 |
+
"    end_date = datetime.strptime(end, '%Y-%m')\n",
|
| 1457 |
+
"    while start_date <= end_date:\n",
|
| 1458 |
+
"        yield start_date.strftime('%Y-%m')\n",
|
| 1459 |
+
"        start_date += relativedelta(months=1)  # Increment by calendar months\n",
|
| 1460 |
+
"\n",
|
| 1461 |
+
"\n",
|
| 1462 |
+
"start_month = '2022-11'\n",
|
| 1463 |
+
"end_month = '2024-12'\n",
|
| 1464 |
+
"\n",
|
| 1465 |
+
"for month in generate_months(start_month, end_month):\n",
|
| 1466 |
+
"    input_file_path = mivolo_in / f'Civiverse-{month}.csv'\n",
|
| 1467 |
+
"    output_file_path = mivolo_out / f'{month}.csv'\n",
|
| 1468 |
+
"\n",
|
| 1469 |
+
"    if input_file_path.exists():\n",
|
| 1470 |
+
"        print(f\"Processing: {input_file_path}\")\n",
|
| 1471 |
+
"\n",
|
| 1472 |
+
"        data = pd.read_csv(input_file_path)\n",
|
| 1473 |
+
"        start_index = 0\n",
|
| 1474 |
+
"        process_images_with_progress(data, predictor, output_file_path, start_idx=start_index)\n",
|
| 1475 |
+
"\n",
|
| 1476 |
+
"        print(f\"Processed and saved to: {output_file_path}\")\n",
|
| 1477 |
+
"    else:\n",
|
| 1478 |
+
"        print(f\"File not found: {input_file_path}\")"
|
| 1479 |
+
]
|
| 1480 |
+
},
|
| 1481 |
+
{
|
| 1482 |
+
"cell_type": "markdown",
|
| 1483 |
+
"id": "26aeeef7",
|
| 1484 |
+
"metadata": {},
|
| 1485 |
+
"source": [
|
| 1486 |
+
"## Visualization code"
|
| 1487 |
+
]
|
| 1488 |
+
},
|
| 1489 |
+
{
|
| 1490 |
+
"cell_type": "code",
|
| 1491 |
+
"execution_count": null,
|
| 1492 |
+
"id": "88ec896a-bf9b-4cc6-a787-c1343f8acb41",
|
| 1493 |
+
"metadata": {},
|
| 1494 |
+
"outputs": [],
|
| 1495 |
+
"source": [
|
| 1496 |
+
"import matplotlib.pyplot as plt\n",
|
| 1497 |
+
"import matplotlib.patches as mpatches\n",
|
| 1498 |
+
"\n",
|
| 1499 |
+
"input_dir = current_dir.parent / 'data/CSV/image_subsets/MiVOLO-results/'\n",
|
| 1500 |
+
"plot_dir = current_dir.parent / 'plots/'\n",
|
| 1501 |
+
"\n",
|
| 1502 |
+
"\n",
|
| 1503 |
+
"all_data = pd.DataFrame()\n",
|
| 1504 |
+
"for file_path in input_dir.glob('*.csv'): # Reads all CSV files in the folder\n",
|
| 1505 |
+
" #print(f\"Loading: {file_path}\")\n",
|
| 1506 |
+
" data = pd.read_csv(file_path)\n",
|
| 1507 |
+
" all_data = pd.concat([all_data, data], ignore_index=True)\n",
|
| 1508 |
+
"\n",
|
| 1509 |
+
"# Filter rows where detection_type equals \"person\"\n",
|
| 1510 |
+
"person_data = all_data[all_data['detection_type'] == 'person']\n",
|
| 1511 |
+
"\n",
|
| 1512 |
+
"# Count unique images and categorize by persons detected\n",
|
| 1513 |
+
"n_images = all_data['id'].nunique()\n",
|
| 1514 |
+
"images_with_zero_persons = all_data[all_data['n_persons'] == 0]['id'].nunique()\n",
|
| 1515 |
+
"images_with_one_or_more_persons = all_data[all_data['n_persons'] > 0]['id'].nunique()\n",
|
| 1516 |
+
"\n",
|
| 1517 |
+
"n_persons_detected = person_data['id'].nunique() # Unique images with at least one detected person\n",
|
| 1518 |
+
"total_persons_detected = person_data.shape[0] # Total number of persons detected\n",
|
| 1519 |
+
"\n",
|
| 1520 |
+
"\n",
|
| 1521 |
+
"\n",
|
| 1522 |
+
"n_total_female = person_data[person_data['gender'] == 'female']['id'].nunique()\n",
|
| 1523 |
+
"n_total_male = person_data[person_data['gender'] == 'male']['id'].nunique()\n",
|
| 1524 |
+
"\n",
|
| 1525 |
+
"# Filter the data further for non-missing age and gender\n",
|
| 1526 |
+
"filtered_data = person_data.dropna(subset=['age', 'gender']).copy()  # Copy so the column assignments below do not target a slice\n",
|
| 1527 |
+
"\n",
|
| 1528 |
+
"# Round ages for consistent plotting\n",
|
| 1529 |
+
"filtered_data['rounded_age'] = np.round(filtered_data['age'] * 4) / 4\n",
|
| 1530 |
+
"\n",
|
| 1531 |
+
"# Map browsingLevel to colors\n",
|
| 1532 |
+
"def get_browsing_color(browsing_level):\n",
|
| 1533 |
+
" color_mapping = {\n",
|
| 1534 |
+
" 1: 'silver',\n",
|
| 1535 |
+
" 2: 'rosybrown',\n",
|
| 1536 |
+
" 4: 'coral',\n",
|
| 1537 |
+
" 8: 'crimson',\n",
|
| 1538 |
+
" 16: 'blueviolet'\n",
|
| 1539 |
+
" }\n",
|
| 1540 |
+
" return color_mapping.get(browsing_level, 'black') # Default to black for unknown values\n",
|
| 1541 |
+
"\n",
|
| 1542 |
+
"filtered_data['color'] = filtered_data['browsingLevel'].apply(get_browsing_color)\n",
|
| 1543 |
+
"\n",
|
| 1544 |
+
"# Aggregate data for plotting\n",
|
| 1545 |
+
"aggregated_data = (\n",
|
| 1546 |
+
" filtered_data.groupby(['rounded_age', 'gender', 'color'])\n",
|
| 1547 |
+
" .size()\n",
|
| 1548 |
+
" .unstack(fill_value=0)\n",
|
| 1549 |
+
")\n",
|
| 1550 |
+
"\n",
|
| 1551 |
+
"# Define NSFW colors\n",
|
| 1552 |
+
"nsfw_colors = ['blueviolet', 'crimson', 'coral', 'rosybrown', 'silver']\n",
|
| 1553 |
+
"\n",
|
| 1554 |
+
"# Plotting function\n",
|
| 1555 |
+
"def plot_gender_data(ax, data, gender_label):\n",
|
| 1556 |
+
" ages = data.index\n",
|
| 1557 |
+
" bottom = np.zeros(len(ages))\n",
|
| 1558 |
+
" \n",
|
| 1559 |
+
" for color in nsfw_colors:\n",
|
| 1560 |
+
" counts = data[color] if color in data.columns else np.zeros(len(ages))\n",
|
| 1561 |
+
" ax.bar(\n",
|
| 1562 |
+
" ages,\n",
|
| 1563 |
+
" counts,\n",
|
| 1564 |
+
" color=color,\n",
|
| 1565 |
+
" edgecolor=color,\n",
|
| 1566 |
+
" linewidth=1,\n",
|
| 1567 |
+
" width=0.2,\n",
|
| 1568 |
+
" bottom=bottom,\n",
|
| 1569 |
+
" alpha=0.5\n",
|
| 1570 |
+
" )\n",
|
| 1571 |
+
" bottom += counts\n",
|
| 1572 |
+
"\n",
|
| 1573 |
+
" x_min = 5\n",
|
| 1574 |
+
" x_max = filtered_data['rounded_age'].max()\n",
|
| 1575 |
+
" ax.set_xticks(np.arange(x_min, x_max + 0.5, 5))\n",
|
| 1576 |
+
" ax.set_xticklabels([f'{int(age)}' for age in np.arange(x_min, x_max + 0.5, 5)], fontsize=12, fontweight='bold')\n",
|
| 1577 |
+
" ax.set_xticks(np.arange(x_min, x_max + 0.5, 0.5), minor=True)\n",
|
| 1578 |
+
"\n",
|
| 1579 |
+
" y_min = 0\n",
|
| 1580 |
+
" y_max = bottom.max() + 100\n",
|
| 1581 |
+
" y_ticks = np.arange(y_min, y_max + 1, 100) # Fine-grained steps of 100\n",
|
| 1582 |
+
" ax.set_yticks(y_ticks)\n",
|
| 1583 |
+
" ax.set_yticklabels([str(int(y)) for y in y_ticks], fontsize=12, fontweight='bold')\n",
|
| 1584 |
+
"\n",
|
| 1585 |
+
" ax.grid(True, which='major', color='lightgrey', linestyle='-', linewidth=0.5)\n",
|
| 1586 |
+
" ax.grid(True, which='minor', color='lightgrey', linestyle=':', linewidth=0.5)\n",
|
| 1587 |
+
"\n",
|
| 1588 |
+
" ax.spines['top'].set_visible(False)\n",
|
| 1589 |
+
" ax.spines['right'].set_visible(False)\n",
|
| 1590 |
+
" ax.spines['left'].set_visible(False)\n",
|
| 1591 |
+
" ax.spines['bottom'].set_visible(False)\n",
|
| 1592 |
+
" \n",
|
| 1593 |
+
" ax.set_xlabel('Age', fontsize=12, fontweight='bold')\n",
|
| 1594 |
+
" if gender_label == 'Female':\n",
|
| 1595 |
+
" ax.set_ylabel('Number of Subjects', fontsize=14, fontweight='bold')\n",
|
| 1596 |
+
" ax.set_title(f'{gender_label} Read', fontsize=14, fontweight='bold')\n",
|
| 1597 |
+
"\n",
|
| 1598 |
+
"# Set up the subplots\n",
|
| 1599 |
+
"fig, axes = plt.subplots(1, 2, figsize=(14, 6.5), sharey=True)\n",
|
| 1600 |
+
"\n",
|
| 1601 |
+
"plot_gender_data(axes[0], aggregated_data.xs('male', level='gender'), 'Male')\n",
|
| 1602 |
+
"plot_gender_data(axes[1], aggregated_data.xs('female', level='gender'), 'Female')\n",
|
| 1603 |
+
"\n",
|
| 1604 |
+
"legend_patches = [\n",
|
| 1605 |
+
" mpatches.Patch(facecolor='blueviolet', edgecolor='blueviolet', linewidth=2, label='Level 16: XXX'),\n",
|
| 1606 |
+
" mpatches.Patch(facecolor='crimson', edgecolor='crimson', linewidth=2, label='Level 8: X'),\n",
|
| 1607 |
+
" mpatches.Patch(facecolor='coral', edgecolor='coral', linewidth=2, label='Level 4: Mature'),\n",
|
| 1608 |
+
" mpatches.Patch(facecolor='rosybrown', edgecolor='rosybrown', linewidth=2, label='Level 2: Soft'),\n",
|
| 1609 |
+
" mpatches.Patch(facecolor='silver', edgecolor='silver', linewidth=2, label='Level 1: SFW'),\n",
|
| 1610 |
+
" mpatches.Patch(facecolor='none', edgecolor='none', label=f'n images: {n_images}', alpha=0),\n",
|
| 1611 |
+
" mpatches.Patch(facecolor='none', edgecolor='none', label=f'Total persons detected: {total_persons_detected}', alpha=0),\n",
|
| 1612 |
+
" mpatches.Patch(facecolor='none', edgecolor='none', label=f'Unique images containing persons: {n_persons_detected}', alpha=0),\n",
|
| 1613 |
+
"]\n",
|
| 1614 |
+
"\n",
|
| 1615 |
+
"axes[0].legend(handles=legend_patches, title=\"Browsing Levels\", loc='upper left', fontsize=12, title_fontsize=12, frameon=True)\n",
|
| 1616 |
+
"plt.tight_layout()  # Apply the layout before saving so the SVG reflects it\n",
|
| 1617 |
+
"plt.savefig(f'{plot_dir}/mivolo.svg', format='svg', bbox_inches='tight')\n",
|
| 1618 |
+
"\n",
|
| 1619 |
+
"# Count images with at least one person\n",
|
| 1620 |
+
"images_with_one_or_more_persons = all_data[all_data['n_persons'] > 0]['id'].nunique()\n",
|
| 1621 |
+
"\n",
|
| 1622 |
+
"# Count unique images in `person_data`\n",
|
| 1623 |
+
"n_persons_detected = person_data['id'].nunique()\n",
|
| 1624 |
+
"\n",
|
| 1625 |
+
"# Count total persons detected\n",
|
| 1626 |
+
"total_persons = person_data.shape[0] # This counts all detected persons\n",
|
| 1627 |
+
"\n",
|
| 1628 |
+
"# Display potential inconsistencies\n",
|
| 1629 |
+
"print(f\"Total images: {n_images}\")\n",
|
| 1630 |
+
"print(f\"Images with at least one person: {images_with_one_or_more_persons}\")\n",
|
| 1631 |
+
"print(f\"Unique images in `person_data`: {n_persons_detected}\")\n",
|
| 1632 |
+
"print(f\"Total number of persons detected: {total_persons}\")\n",
|
| 1633 |
+
"\n",
|
| 1634 |
+
"\n",
|
| 1635 |
+
"\n",
|
| 1636 |
+
"plt.show()"
|
| 1637 |
+
]
|
| 1638 |
+
},
|
| 1639 |
+
{
|
| 1640 |
+
"cell_type": "markdown",
|
| 1641 |
+
"id": "42b2a557-b8f4-4d0f-8907-98e3012a1b34",
|
| 1642 |
+
"metadata": {
|
| 1643 |
+
"execution": {
|
| 1644 |
+
"iopub.execute_input": "2025-02-06T20:01:54.848400Z",
|
| 1645 |
+
"iopub.status.busy": "2025-02-06T20:01:54.847713Z",
|
| 1646 |
+
"iopub.status.idle": "2025-02-06T20:01:54.851533Z",
|
| 1647 |
+
"shell.execute_reply": "2025-02-06T20:01:54.851102Z",
|
| 1648 |
+
"shell.execute_reply.started": "2025-02-06T20:01:54.848376Z"
|
| 1649 |
+
}
|
| 1650 |
+
},
|
| 1651 |
+
"source": [
|
| 1652 |
+
"### LaTeX Table"
|
| 1653 |
+
]
|
| 1654 |
+
},
|
| 1655 |
+
{
|
| 1656 |
+
"cell_type": "code",
|
| 1657 |
+
"execution_count": null,
|
| 1658 |
+
"id": "3e506c41-6497-4ece-99f3-73f09fe1129e",
|
| 1659 |
+
"metadata": {},
|
| 1660 |
+
"outputs": [],
|
| 1661 |
+
"source": [
|
| 1662 |
+
"import os\n",
|
| 1663 |
+
"import pandas as pd\n",
|
| 1664 |
+
"\n",
|
| 1665 |
+
"# Define the directory containing CSV files\n",
|
| 1666 |
+
"directory_path = current_dir.parent / 'data/CSV/image_subsets/MiVOLO-results/' # Update this path with the actual directory path\n",
|
| 1667 |
+
"\n",
|
| 1668 |
+
"# Prepare data for LaTeX table\n",
|
| 1669 |
+
"table_rows = []\n",
|
| 1670 |
+
"\n",
|
| 1671 |
+
"# Loop through each file in the directory\n",
|
| 1672 |
+
"for file_name in os.listdir(directory_path):\n",
|
| 1673 |
+
" if file_name.endswith('.csv'):\n",
|
| 1674 |
+
" file_path = os.path.join(directory_path, file_name)\n",
|
| 1675 |
+
" print(f\"Processing file: {file_name}\")\n",
|
| 1676 |
+
"\n",
|
| 1677 |
+
" # Load the data\n",
|
| 1678 |
+
" data = pd.read_csv(file_path)\n",
|
| 1679 |
+
"\n",
|
| 1680 |
+
" # Total images analyzed\n",
|
| 1681 |
+
" total_images = data['id'].nunique()\n",
|
| 1682 |
+
"\n",
|
| 1683 |
+
" # Count of images with no persons detected\n",
|
| 1684 |
+
" images_no_persons = data[data['n_persons'] == 0]['id'].nunique()\n",
|
| 1685 |
+
"\n",
|
| 1686 |
+
" # Total persons detected (only using \"person\" detection type)\n",
|
| 1687 |
+
" total_persons_count = data[data['detection_type'] == 'person'].shape[0]\n",
|
| 1688 |
+
"\n",
|
| 1689 |
+
" # Average age and standard deviation for male and female individuals\n",
|
| 1690 |
+
" male_age_stats = data[data['gender'] == 'male']['age'].agg(['mean', 'std']).fillna(0)\n",
|
| 1691 |
+
" female_age_stats = data[data['gender'] == 'female']['age'].agg(['mean', 'std']).fillna(0)\n",
|
| 1692 |
+
"\n",
|
| 1693 |
+
"        # Count of unique images containing female-read and male-read subjects\n",
|
| 1694 |
+
" female_images_count = data[data['gender'] == 'female']['id'].nunique()\n",
|
| 1695 |
+
" male_images_count = data[data['gender'] == 'male']['id'].nunique()\n",
|
| 1696 |
+
"\n",
|
| 1697 |
+
" # Female to male ratio\n",
|
| 1698 |
+
" female_to_male_ratio = female_images_count / male_images_count if male_images_count else None\n",
|
| 1699 |
+
"\n",
|
| 1700 |
+
" # Browsing level analysis for females\n",
|
| 1701 |
+
" female_browsing_level_1 = data[(data['gender'] == 'female') & (data['browsingLevel'] == 1)]['id'].nunique()\n",
|
| 1702 |
+
" female_browsing_level_2_16 = data[(data['gender'] == 'female') & (data['browsingLevel'].isin([2, 4, 8, 16]))]['id'].nunique()\n",
|
| 1703 |
+
" \n",
|
| 1704 |
+
" female_browsing_level_1_percentage = (female_browsing_level_1 / female_images_count * 100) if female_images_count else 0\n",
|
| 1705 |
+
" female_browsing_level_2_16_percentage = (female_browsing_level_2_16 / female_images_count * 100) if female_images_count else 0\n",
|
| 1706 |
+
"\n",
|
| 1707 |
+
" # Browsing level analysis for males\n",
|
| 1708 |
+
" male_browsing_level_1 = data[(data['gender'] == 'male') & (data['browsingLevel'] == 1)]['id'].nunique()\n",
|
| 1709 |
+
" male_browsing_level_2_16 = data[(data['gender'] == 'male') & (data['browsingLevel'].isin([2, 4, 8, 16]))]['id'].nunique()\n",
|
| 1710 |
+
"\n",
|
| 1711 |
+
" male_browsing_level_1_percentage = (male_browsing_level_1 / male_images_count * 100) if male_images_count else 0\n",
|
| 1712 |
+
" male_browsing_level_2_16_percentage = (male_browsing_level_2_16 / male_images_count * 100) if male_images_count else 0\n",
|
| 1713 |
+
"\n",
|
| 1714 |
+
" # Add row to table data\n",
|
| 1715 |
+
" table_rows.append([\n",
|
| 1716 |
+
" file_name.replace('.csv', ''), # Remove file extension\n",
|
| 1717 |
+
" total_images,\n",
|
| 1718 |
+
" total_persons_count,\n",
|
| 1719 |
+
" images_no_persons,\n",
|
| 1720 |
+
" f\"{female_browsing_level_1_percentage:.2f}\",\n",
|
| 1721 |
+
" f\"{female_browsing_level_2_16_percentage:.2f}\",\n",
|
| 1722 |
+
" f\"{male_browsing_level_1_percentage:.2f}\",\n",
|
| 1723 |
+
" f\"{male_browsing_level_2_16_percentage:.2f}\",\n",
|
| 1724 |
+
" f\"{female_to_male_ratio:.2f}\" if female_to_male_ratio is not None else \"N/A\",\n",
|
| 1725 |
+
" f\"{female_age_stats['mean']:.2f} ({female_age_stats['std']:.2f})\",\n",
|
| 1726 |
+
" f\"{male_age_stats['mean']:.2f} ({male_age_stats['std']:.2f})\"\n",
|
| 1727 |
+
" ])\n",
|
| 1728 |
+
"\n",
|
| 1729 |
+
"# Sort table rows by the filename (assumes filenames are formatted with sortable dates)\n",
|
| 1730 |
+
"table_rows = sorted(table_rows, key=lambda x: x[0])\n",
|
| 1731 |
+
"\n",
|
| 1732 |
+
"# Generate LaTeX table\n",
|
| 1733 |
+
"latex_table = r\"\"\"\n",
|
| 1734 |
+
"\\begin{table}[H]\n",
|
| 1735 |
+
"\\centering\n",
|
| 1736 |
+
"\\scriptsize\n",
|
| 1737 |
+
"\\renewcommand{\\arraystretch}{0.9}\n",
|
| 1738 |
+
"\\caption{Summary of Image Classification for 2023-2024}\n",
|
| 1739 |
+
"\\label{table:image_classification_2023_2024}\n",
|
| 1740 |
+
"\\begin{tabular}{lrrrrrrrrrr}\n",
|
| 1741 |
+
"\\toprule\n",
|
| 1742 |
+
"File Name & Total Images & Total Persons & No Persons & \\multicolumn{2}{c}{Female (\\%)} & \\multicolumn{2}{c}{Male (\\%)} & Female:Male & Female Age (Mean ± SD) & Male Age (Mean ± SD) \\\\\n",
|
| 1743 |
+
" & & & & L1 & L2-16 & L1 & L2-16 & & & \\\\\n",
|
| 1744 |
+
"\\midrule\n",
|
| 1745 |
+
"\"\"\"\n",
|
| 1746 |
+
"for row in table_rows:\n",
|
| 1747 |
+
"    latex_table += \" & \".join(map(str, row)) + \" \\\\\\\\\\n\"  # Non-raw string: emits the LaTeX row break plus a real newline\n",
|
| 1748 |
+
"\n",
|
| 1749 |
+
"latex_table += r\"\"\"\n",
|
| 1750 |
+
"\\bottomrule\n",
|
| 1751 |
+
"\\end{tabular}\n",
|
| 1752 |
+
"\\vspace{1em}\n",
|
| 1753 |
+
"\\noindent\n",
|
| 1754 |
+
"\\textbf{Disclaimer:} \\(\\female\\) and \\(\\male\\) refer to female-read and male-read classifications as determined by the MiVOLO system's weights. \n",
|
| 1755 |
+
"We acknowledge the complexities of gender presentations and stress that these terms do not necessarily correspond to biological sex.\n",
|
| 1756 |
+
"\\end{table}\n",
|
| 1757 |
+
"\"\"\"\n",
|
| 1758 |
+
"\n",
|
| 1759 |
+
"# Output LaTeX table\n",
|
| 1760 |
+
"print(\"\\nGenerated LaTeX Table:\")\n",
|
| 1761 |
+
"print(latex_table)\n",
|
| 1762 |
+
"\n"
|
| 1763 |
+
]
|
| 1764 |
+
},
|
| 1765 |
+
{
|
| 1766 |
+
"cell_type": "code",
|
| 1767 |
+
"execution_count": null,
|
| 1768 |
+
"id": "3ef428be-856b-4c4c-b1b0-a052d181d03b",
|
| 1769 |
+
"metadata": {},
|
| 1770 |
+
"outputs": [],
|
| 1771 |
+
"source": []
|
| 1772 |
+
}
|
| 1773 |
+
],
|
| 1774 |
+
"metadata": {
|
| 1775 |
+
"kernelspec": {
|
| 1776 |
+
"display_name": "Python 3 (ipykernel)",
|
| 1777 |
+
"language": "python",
|
| 1778 |
+
"name": "python3"
|
| 1779 |
+
},
|
| 1780 |
+
"language_info": {
|
| 1781 |
+
"codemirror_mode": {
|
| 1782 |
+
"name": "ipython",
|
| 1783 |
+
"version": 3
|
| 1784 |
+
},
|
| 1785 |
+
"file_extension": ".py",
|
| 1786 |
+
"mimetype": "text/x-python",
|
| 1787 |
+
"name": "python",
|
| 1788 |
+
"nbconvert_exporter": "python",
|
| 1789 |
+
"pygments_lexer": "ipython3",
|
| 1790 |
+
"version": "3.12.10"
|
| 1791 |
+
}
|
| 1792 |
+
},
|
| 1793 |
+
"nbformat": 4,
|
| 1794 |
+
"nbformat_minor": 5
|
| 1795 |
+
}
|
jupyter_notebooks/Section_2-3-1_Tag_occurences.ipynb
ADDED
|
@@ -0,0 +1,801 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "71008ae2-4465-45d1-9ad0-6e6d54c99a69",
|
| 6 |
+
"metadata": {
|
| 7 |
+
"execution": {
|
| 8 |
+
"iopub.execute_input": "2025-12-09T20:15:54.294327Z",
|
| 9 |
+
"iopub.status.busy": "2025-12-09T20:15:54.294119Z",
|
| 10 |
+
"iopub.status.idle": "2025-12-09T20:15:54.296418Z",
|
| 11 |
+
"shell.execute_reply": "2025-12-09T20:15:54.295943Z",
|
| 12 |
+
"shell.execute_reply.started": "2025-12-09T20:15:54.294313Z"
|
| 13 |
+
}
|
| 14 |
+
},
|
| 15 |
+
"source": [
|
| 16 |
+
"# Tag occurrence percentages"
|
| 17 |
+
]
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"cell_type": "code",
|
| 21 |
+
"execution_count": 31,
|
| 22 |
+
"id": "1577b529-19b4-471a-af8b-bc331087bb61",
|
| 23 |
+
"metadata": {
|
| 24 |
+
"execution": {
|
| 25 |
+
"iopub.execute_input": "2025-12-09T20:33:24.090495Z",
|
| 26 |
+
"iopub.status.busy": "2025-12-09T20:33:24.090266Z",
|
| 27 |
+
"iopub.status.idle": "2025-12-09T20:33:35.733409Z",
|
| 28 |
+
"shell.execute_reply": "2025-12-09T20:33:35.732815Z",
|
| 29 |
+
"shell.execute_reply.started": "2025-12-09T20:33:24.090478Z"
|
| 30 |
+
}
|
| 31 |
+
},
|
| 32 |
+
"outputs": [
|
| 33 |
+
{
|
| 34 |
+
"name": "stdout",
|
| 35 |
+
"output_type": "stream",
|
| 36 |
+
"text": [
|
| 37 |
+
"\n",
|
| 38 |
+
"==================================================\n",
|
| 39 |
+
"Tag Analysis for 'anime'\n",
|
| 40 |
+
"==================================================\n",
|
| 41 |
+
"Models with tag: 74721\n",
|
| 42 |
+
"Total models: 232164\n",
|
| 43 |
+
"Percentage: 32.18%\n",
|
| 44 |
+
"==================================================\n",
|
| 45 |
+
"\n"
|
| 46 |
+
]
|
| 47 |
+
}
|
| 48 |
+
],
|
| 49 |
+
"source": [
|
| 50 |
+
"from pathlib import Path\n",
|
| 51 |
+
"import pandas as pd\n",
|
| 52 |
+
"import sys\n",
|
| 53 |
+
"\n",
|
| 54 |
+
"current_dir = Path.cwd()\n",
|
| 55 |
+
"\n",
|
| 56 |
+
"# ============================================\n",
|
| 57 |
+
"# INPUT: Change these values\n",
|
| 58 |
+
"# ============================================\n",
|
| 59 |
+
"csv_file = current_dir.parent / \"data/CSV/models/Civi_models.csv\" # Your CSV file path\n",
|
| 60 |
+
"tag_to_find = \"anime\" # Tag to search for\n",
|
| 61 |
+
"# ============================================\n",
|
| 62 |
+
"\n",
|
| 63 |
+
"def calculate_tag_percentage(csv_file, tag_to_find):\n",
|
| 64 |
+
" \"\"\"\n",
|
| 65 |
+
" Calculate what percentage of models contain a specific tag.\n",
|
| 66 |
+
" \"\"\"\n",
|
| 67 |
+
" # Read the CSV file\n",
|
| 68 |
+
" df = pd.read_csv(csv_file)\n",
|
| 69 |
+
" \n",
|
| 70 |
+
" # Get all tag columns\n",
|
| 71 |
+
" tag_columns = [col for col in df.columns if col.startswith('tag_')]\n",
|
| 72 |
+
" \n",
|
| 73 |
+
" # Count total models\n",
|
| 74 |
+
" total_models = len(df)\n",
|
| 75 |
+
" \n",
|
| 76 |
+
" # Count models containing the tag (case-insensitive search)\n",
|
| 77 |
+
" tag_lower = tag_to_find.lower()\n",
|
| 78 |
+
" models_with_tag = 0\n",
|
| 79 |
+
" \n",
|
| 80 |
+
" for idx, row in df.iterrows():\n",
|
| 81 |
+
" # Check if the tag appears in any of the tag columns\n",
|
| 82 |
+
" for tag_col in tag_columns:\n",
|
| 83 |
+
" tag_value = str(row[tag_col]).lower().strip()\n",
|
| 84 |
+
" if tag_value == tag_lower:\n",
|
| 85 |
+
" models_with_tag += 1\n",
|
| 86 |
+
" break # Count each model only once\n",
|
| 87 |
+
" \n",
|
| 88 |
+
" # Calculate percentage\n",
|
| 89 |
+
" percentage = (models_with_tag / total_models * 100) if total_models > 0 else 0\n",
|
| 90 |
+
" \n",
|
| 91 |
+
" return {\n",
|
| 92 |
+
" 'tag': tag_to_find,\n",
|
| 93 |
+
" 'count': models_with_tag,\n",
|
| 94 |
+
" 'total': total_models,\n",
|
| 95 |
+
" 'percentage': percentage\n",
|
| 96 |
+
" }\n",
|
| 97 |
+
"\n",
|
| 98 |
+
"# Calculate and display results\n",
|
| 99 |
+
"result = calculate_tag_percentage(csv_file, tag_to_find)\n",
|
| 100 |
+
"\n",
|
| 101 |
+
"print(f\"\\n{'='*50}\")\n",
|
| 102 |
+
"print(f\"Tag Analysis for '{result['tag']}'\")\n",
|
| 103 |
+
"print(f\"{'='*50}\")\n",
|
| 104 |
+
"print(f\"Models with tag: {result['count']}\")\n",
|
| 105 |
+
"print(f\"Total models: {result['total']}\")\n",
|
| 106 |
+
"print(f\"Percentage: {result['percentage']:.2f}%\")\n",
|
| 107 |
+
"print(f\"{'='*50}\\n\")"
|
| 108 |
+
]
|
| 109 |
+
},
|
| 110 |
+
{
|
| 111 |
+
"cell_type": "code",
|
| 112 |
+
"execution_count": 28,
|
| 113 |
+
"id": "825dc9b8-ff0e-4afd-b2f1-e4bca1036aea",
|
| 114 |
+
"metadata": {
|
| 115 |
+
"execution": {
|
| 116 |
+
"iopub.execute_input": "2025-12-09T20:22:12.641296Z",
|
| 117 |
+
"iopub.status.busy": "2025-12-09T20:22:12.641105Z",
|
| 118 |
+
"iopub.status.idle": "2025-12-09T20:22:25.293171Z",
|
| 119 |
+
"shell.execute_reply": "2025-12-09T20:22:25.292530Z",
|
| 120 |
+
"shell.execute_reply.started": "2025-12-09T20:22:12.641281Z"
|
| 121 |
+
}
|
| 122 |
+
},
|
| 123 |
+
"outputs": [
|
| 124 |
+
{
|
| 125 |
+
"name": "stdout",
|
| 126 |
+
"output_type": "stream",
|
| 127 |
+
"text": [
|
| 128 |
+
"\n",
|
| 129 |
+
"=== Tag Analysis for '-f' ===\n",
|
| 130 |
+
"Models with tag: 0\n",
|
| 131 |
+
"Total models: 232164\n",
|
| 132 |
+
"Percentage: 0.00%\n",
|
| 133 |
+
"\n"
|
| 134 |
+
]
|
| 135 |
+
}
|
| 136 |
+
],
|
| 137 |
+
"source": [
|
| 138 |
+
"from pathlib import Path\n",
|
| 139 |
+
"import pandas as pd\n",
|
| 140 |
+
"import sys\n",
|
| 141 |
+
"\n",
|
| 142 |
+
"current_dir = Path.cwd()\n",
|
| 143 |
+
"\n",
|
| 144 |
+
"\n",
|
| 145 |
+
"def calculate_tag_percentage(csv_file, tag_to_find):\n",
|
| 146 |
+
" # Read the CSV file\n",
|
| 147 |
+
" df = pd.read_csv(csv_file)\n",
|
| 148 |
+
" \n",
|
| 149 |
+
" # Get all tag columns\n",
|
| 150 |
+
" tag_columns = [col for col in df.columns if col.startswith('tag_')]\n",
|
| 151 |
+
" \n",
|
| 152 |
+
" # Count total models\n",
|
| 153 |
+
" total_models = len(df)\n",
|
| 154 |
+
" \n",
|
| 155 |
+
" # Count models containing the tag (case-insensitive search)\n",
|
| 156 |
+
" tag_lower = tag_to_find.lower()\n",
|
| 157 |
+
" models_with_tag = 0\n",
|
| 158 |
+
" \n",
|
| 159 |
+
" for idx, row in df.iterrows():\n",
|
| 160 |
+
" # Check if the tag appears in any of the tag columns\n",
|
| 161 |
+
" for tag_col in tag_columns:\n",
|
| 162 |
+
" tag_value = str(row[tag_col]).lower().strip()\n",
|
| 163 |
+
" if tag_value == tag_lower:\n",
|
| 164 |
+
" models_with_tag += 1\n",
|
| 165 |
+
" break # Count each model only once\n",
|
| 166 |
+
" \n",
|
| 167 |
+
" # Calculate percentage\n",
|
| 168 |
+
" percentage = (models_with_tag / total_models * 100) if total_models > 0 else 0\n",
|
| 169 |
+
" \n",
|
| 170 |
+
" return {\n",
|
| 171 |
+
" 'tag': tag_to_find,\n",
|
| 172 |
+
" 'count': models_with_tag,\n",
|
| 173 |
+
" 'total': total_models,\n",
|
| 174 |
+
" 'percentage': percentage\n",
|
| 175 |
+
" }\n",
|
| 176 |
+
"\n",
|
| 177 |
+
"\n",
|
| 178 |
+
"def analyze_all_tags(csv_file):\n",
|
| 179 |
+
"\n",
|
| 180 |
+
" df = pd.read_csv(csv_file)\n",
|
| 181 |
+
" tag_columns = [col for col in df.columns if col.startswith('tag_')]\n",
|
| 182 |
+
" total_models = len(df)\n",
|
| 183 |
+
" \n",
|
| 184 |
+
" # Collect all tags and count occurrences\n",
|
| 185 |
+
" tag_counts = {}\n",
|
| 186 |
+
" for tag_col in tag_columns:\n",
|
| 187 |
+
" for tag in df[tag_col].dropna():\n",
|
| 188 |
+
" tag = str(tag).strip()\n",
|
| 189 |
+
" if tag: # Ignore empty strings\n",
|
| 190 |
+
" tag_counts[tag] = tag_counts.get(tag, 0) + 1\n",
|
| 191 |
+
" \n",
|
| 192 |
+
" # Create results DataFrame\n",
|
| 193 |
+
" results = []\n",
|
| 194 |
+
" for tag, count in tag_counts.items():\n",
|
| 195 |
+
" percentage = (count / total_models * 100)\n",
|
| 196 |
+
" results.append({\n",
|
| 197 |
+
" 'tag': tag,\n",
|
| 198 |
+
" 'count': count,\n",
|
| 199 |
+
" 'percentage': round(percentage, 2)\n",
|
| 200 |
+
" })\n",
|
| 201 |
+
" \n",
|
| 202 |
+
" results_df = pd.DataFrame(results)\n",
|
| 203 |
+
" results_df = results_df.sort_values('count', ascending=False)\n",
|
| 204 |
+
" \n",
|
| 205 |
+
" return results_df\n",
|
| 206 |
+
"\n",
|
| 207 |
+
"\n",
|
| 208 |
+
"if __name__ == \"__main__\":\n",
|
| 209 |
+
" # Default CSV file path\n",
|
| 210 |
+
" csv_file = current_dir.parent / \"data/CSV/models/Civi_models.csv\"\n",
|
| 211 |
+
" \n",
|
| 212 |
+
"    # Check if a specific tag is provided as argument (inside Jupyter, sys.argv[1] is the kernel's '-f' flag, so run this as a script)\n",
|
| 213 |
+
" if len(sys.argv) > 1:\n",
|
| 214 |
+
" tag = sys.argv[1]\n",
|
| 215 |
+
" result = calculate_tag_percentage(csv_file, tag)\n",
|
| 216 |
+
" \n",
|
| 217 |
+
" print(f\"\\n=== Tag Analysis for '{result['tag']}' ===\")\n",
|
| 218 |
+
" print(f\"Models with tag: {result['count']}\")\n",
|
| 219 |
+
" print(f\"Total models: {result['total']}\")\n",
|
| 220 |
+
" print(f\"Percentage: {result['percentage']:.2f}%\\n\")\n",
|
| 221 |
+
" else:\n",
|
| 222 |
+
" # If no specific tag provided, show all tags\n",
|
| 223 |
+
" print(\"\\n=== All Tags Analysis ===\\n\")\n",
|
| 224 |
+
" results_df = analyze_all_tags(csv_file)\n",
|
| 225 |
+
" print(results_df.to_string(index=False))\n",
|
| 226 |
+
" print(f\"\\nTotal unique tags: {len(results_df)}\")\n",
|
| 227 |
+
" print(f\"Total models: {len(pd.read_csv(csv_file))}\\n\")\n",
|
| 228 |
+
" \n",
|
| 229 |
+
" # Show example usage\n",
|
| 230 |
+
" print(\"\\nTo search for a specific tag, run:\")\n",
|
| 231 |
+
" print(\" python tag_percentage_calculator.py <tag_name>\")\n",
|
| 232 |
+
" print(\"\\nExample:\")\n",
|
| 233 |
+
" print(\" python tag_percentage_calculator.py anime\")"
|
| 234 |
+
]
|
| 235 |
+
},
|
| 236 |
+
{
|
| 237 |
+
"cell_type": "code",
|
| 238 |
+
"execution_count": null,
|
| 239 |
+
"id": "63187f58-9777-4ffb-bdf9-93b191a60241",
|
| 240 |
+
"metadata": {
|
| 241 |
+
"execution": {
|
| 242 |
+
"iopub.execute_input": "2025-12-09T19:08:49.572641Z",
|
| 243 |
+
"iopub.status.busy": "2025-12-09T19:08:49.572453Z",
|
| 244 |
+
"iopub.status.idle": "2025-12-09T19:08:49.634561Z",
|
| 245 |
+
"shell.execute_reply": "2025-12-09T19:08:49.634109Z",
|
| 246 |
+
"shell.execute_reply.started": "2025-12-09T19:08:49.572627Z"
|
| 247 |
+
}
|
| 248 |
+
},
|
| 249 |
+
"outputs": [],
|
| 250 |
+
"source": [
|
| 251 |
+
"from pathlib import Path\n",
|
| 252 |
+
"import json\n",
|
| 253 |
+
"from collections import defaultdict\n",
|
| 254 |
+
"\n",
|
| 255 |
+
"current_dir = Path.cwd()\n",
|
| 256 |
+
"\n",
|
| 257 |
+
"\n",
|
| 258 |
+
"\n",
|
| 259 |
+
"def load_data(filepath):\n",
|
| 260 |
+
" \"\"\"Load the JSON data from file.\"\"\"\n",
|
| 261 |
+
" with open(filepath, 'r', encoding='utf-8') as f:\n",
|
| 262 |
+
" return json.load(f)\n",
|
| 263 |
+
"\n",
|
| 264 |
+
"def calculate_cooccurrence_rate(data, target_tag, cooccurring_tags):\n",
|
| 265 |
+
" \"\"\"\n",
|
| 266 |
+
" Calculate what percentage of target_tag occurrences co-occur with each tag in cooccurring_tags.\n",
|
| 267 |
+
" \n",
|
| 268 |
+
" Args:\n",
|
| 269 |
+
" data: Dictionary with 'nodes' and 'links'\n",
|
| 270 |
+
" target_tag: The main tag to analyze (e.g., \"woman\")\n",
|
| 271 |
+
" cooccurring_tags: List of tags to check co-occurrence with (e.g., [\"sexy\", \"pose\"])\n",
|
| 272 |
+
" \n",
|
| 273 |
+
" Returns:\n",
|
| 274 |
+
" Dictionary with results\n",
|
| 275 |
+
" \"\"\"\n",
|
| 276 |
+
" # Find the target tag's total occurrences\n",
|
| 277 |
+
" target_size = None\n",
|
| 278 |
+
" for node in data['nodes']:\n",
|
| 279 |
+
" if node['id'] == target_tag:\n",
|
| 280 |
+
" target_size = node['size']\n",
|
| 281 |
+
" break\n",
|
| 282 |
+
" \n",
|
| 283 |
+
" if target_size is None:\n",
|
| 284 |
+
" print(f\"Warning: Tag '{target_tag}' not found in nodes!\")\n",
|
| 285 |
+
" return None\n",
|
| 286 |
+
" \n",
|
| 287 |
+
" print(f\"\\n{'='*60}\")\n",
|
| 288 |
+
" print(f\"Analysis for tag: '{target_tag}'\")\n",
|
| 289 |
+
" print(f\"{'='*60}\")\n",
|
| 290 |
+
" print(f\"Total occurrences of '{target_tag}': {target_size:,}\")\n",
|
| 291 |
+
" print()\n",
|
| 292 |
+
" \n",
|
| 293 |
+
" # Find co-occurrences in links\n",
|
| 294 |
+
" results = {}\n",
|
| 295 |
+
" for cooccurring_tag in cooccurring_tags:\n",
|
| 296 |
+
" cooccurrence_count = 0\n",
|
| 297 |
+
" \n",
|
| 298 |
+
" # Check both directions in links\n",
|
| 299 |
+
" for link in data['links']:\n",
|
| 300 |
+
" if (link['source'] == target_tag and link['target'] == cooccurring_tag) or \\\n",
|
| 301 |
+
" (link['source'] == cooccurring_tag and link['target'] == target_tag):\n",
|
| 302 |
+
" cooccurrence_count = link['value']\n",
|
| 303 |
+
" break\n",
|
| 304 |
+
" \n",
|
| 305 |
+
" if cooccurrence_count > 0:\n",
|
| 306 |
+
" percentage = (cooccurrence_count / target_size) * 100\n",
|
| 307 |
+
" results[cooccurring_tag] = {\n",
|
| 308 |
+
" 'count': cooccurrence_count,\n",
|
| 309 |
+
" 'percentage': percentage\n",
|
| 310 |
+
" }\n",
|
| 311 |
+
" print(f\"Tag: '{cooccurring_tag}'\")\n",
|
| 312 |
+
" print(f\" Co-occurrences: {cooccurrence_count:,}\")\n",
|
| 313 |
+
" print(f\" Percentage: {percentage:.2f}%\")\n",
|
| 314 |
+
" print(f\" (i.e., {percentage:.2f}% of '{target_tag}' occurrences also have '{cooccurring_tag}')\")\n",
|
| 315 |
+
" else:\n",
|
| 316 |
+
" results[cooccurring_tag] = {\n",
|
| 317 |
+
" 'count': 0,\n",
|
| 318 |
+
" 'percentage': 0.0\n",
|
| 319 |
+
" }\n",
|
| 320 |
+
" print(f\"Tag: '{cooccurring_tag}'\")\n",
|
| 321 |
+
" print(f\" No co-occurrences found\")\n",
|
| 322 |
+
" print()\n",
|
| 323 |
+
" \n",
|
| 324 |
+
" # Calculate combined co-occurrence (both tags together)\n",
|
| 325 |
+
" print(f\"\\n{'='*60}\")\n",
|
| 326 |
+
" print(\"Combined Analysis\")\n",
|
| 327 |
+
" print(f\"{'='*60}\")\n",
|
| 328 |
+
" \n",
|
| 329 |
+
" # To find items with ALL tags, we'd need to look at the underlying data\n",
|
| 330 |
+
" # With just the graph structure, we can only report individual co-occurrences\n",
|
| 331 |
+
" print(f\"Individual co-occurrence rates calculated above.\")\n",
|
| 332 |
+
" print(f\"Note: To calculate how often ALL tags appear together,\")\n",
|
| 333 |
+
" print(f\"we would need access to the raw item-level data.\")\n",
|
| 334 |
+
" \n",
|
| 335 |
+
" return results\n",
|
| 336 |
+
"\n",
|
| 337 |
+
"def main():\n",
|
| 338 |
+
" # Load the data\n",
|
| 339 |
+
" filepath = current_dir.parent / \"public/json/nodes_all.json\"\n",
|
| 340 |
+
" print(\"Loading data...\")\n",
|
| 341 |
+
" data = load_data(filepath)\n",
|
| 342 |
+
" print(f\"Loaded {len(data['nodes']):,} nodes and {len(data['links']):,} links\")\n",
|
| 343 |
+
" \n",
|
| 344 |
+
" # Calculate co-occurrence rates\n",
|
| 345 |
+
" target_tag = \"woman\"\n",
|
| 346 |
+
" cooccurring_tags = [\"sexy\", \"pose\"]\n",
|
| 347 |
+
" \n",
|
| 348 |
+
" results = calculate_cooccurrence_rate(data, target_tag, cooccurring_tags)\n",
|
| 349 |
+
" \n",
|
| 350 |
+
" # Summary\n",
|
| 351 |
+
" print(f\"\\n{'='*60}\")\n",
|
| 352 |
+
" print(\"SUMMARY\")\n",
|
| 353 |
+
" print(f\"{'='*60}\")\n",
|
| 354 |
+
" if results:\n",
|
| 355 |
+
" for tag, stats in results.items():\n",
|
| 356 |
+
" print(f\"'{target_tag}' + '{tag}': {stats['percentage']:.2f}% ({stats['count']:,} occurrences)\")\n",
|
| 357 |
+
"\n",
|
| 358 |
+
"if __name__ == \"__main__\":\n",
|
| 359 |
+
" main()"
|
| 360 |
+
]
|
| 361 |
+
},
|
| 362 |
+
{
|
| 363 |
+
"cell_type": "code",
|
| 364 |
+
"execution_count": 26,
|
| 365 |
+
"id": "6e0e8c6b-547e-4899-b001-1d4c6b31476f",
|
| 366 |
+
"metadata": {
|
| 367 |
+
"execution": {
|
| 368 |
+
"iopub.execute_input": "2025-12-09T19:48:36.009465Z",
|
| 369 |
+
"iopub.status.busy": "2025-12-09T19:48:36.009249Z",
|
| 370 |
+
"iopub.status.idle": "2025-12-09T19:48:36.063251Z",
|
| 371 |
+
"shell.execute_reply": "2025-12-09T19:48:36.062750Z",
|
| 372 |
+
"shell.execute_reply.started": "2025-12-09T19:48:36.009449Z"
|
| 373 |
+
}
|
| 374 |
+
},
|
| 375 |
+
"outputs": [
|
| 376 |
+
{
|
| 377 |
+
"name": "stdout",
|
| 378 |
+
"output_type": "stream",
|
| 379 |
+
"text": [
|
| 380 |
+
"Loading data...\n",
|
| 381 |
+
"Loaded 60,330 nodes and 16,921 links\n",
|
| 382 |
+
"\n",
|
| 383 |
+
"================================================================================\n",
|
| 384 |
+
"Top 100 Co-occurring Tags for: 'anime'\n",
|
| 385 |
+
"================================================================================\n",
|
| 386 |
+
"Total occurrences of 'anime': 74,187\n",
|
| 387 |
+
"\n",
|
| 388 |
+
"Rank Tag Count Percentage \n",
|
| 389 |
+
"------ ------------------------------ ------------ ------------\n",
|
| 390 |
+
"1 character 53,792 72.51%\n",
|
| 391 |
+
"2 woman 30,731 41.42%\n",
|
| 392 |
+
"3 girls 21,434 28.89%\n",
|
| 393 |
+
"4 female 14,286 19.26%\n",
|
| 394 |
+
"5 style 11,593 15.63%\n",
|
| 395 |
+
"6 game character 9,309 12.55%\n",
|
| 396 |
+
"7 sexy 8,876 11.96%\n",
|
| 397 |
+
"8 male 4,476 6.03%\n",
|
| 398 |
+
"9 video game 3,411 4.60%\n",
|
| 399 |
+
"10 man 3,333 4.49%\n",
|
| 400 |
+
"11 lora 3,260 4.39%\n",
|
| 401 |
+
"12 concept 2,568 3.46%\n",
|
| 402 |
+
"13 girl 2,472 3.33%\n",
|
| 403 |
+
"14 base model 2,197 2.96%\n",
|
| 404 |
+
"15 photorealistic 2,191 2.95%\n",
|
| 405 |
+
"16 manga 2,015 2.72%\n",
|
| 406 |
+
"17 boys 2,006 2.70%\n",
|
| 407 |
+
"18 anime character 1,841 2.48%\n",
|
| 408 |
+
"19 cartoon 1,759 2.37%\n",
|
| 409 |
+
"20 game 1,536 2.07%\n",
|
| 410 |
+
"21 men 1,431 1.93%\n",
|
| 411 |
+
"22 cute 1,325 1.79%\n",
|
| 412 |
+
"23 hentai 1,222 1.65%\n",
|
| 413 |
+
"24 clothing 1,131 1.52%\n",
|
| 414 |
+
"25 furry 1,114 1.50%\n",
|
| 415 |
+
"26 realistic 1,074 1.45%\n",
|
| 416 |
+
"27 styles 1,057 1.42%\n",
|
| 417 |
+
"28 illustration 971 1.31%\n",
|
| 418 |
+
"29 characters 967 1.30%\n",
|
| 419 |
+
"30 art style 949 1.28%\n",
|
| 420 |
+
"31 pokemon 932 1.26%\n",
|
| 421 |
+
"32 vtuber 890 1.20%\n",
|
| 422 |
+
"33 person 872 1.18%\n",
|
| 423 |
+
"34 artstyle 801 1.08%\n",
|
| 424 |
+
"35 anime girl 756 1.02%\n",
|
| 425 |
+
"36 blue archive 684 0.92%\n",
|
| 426 |
+
"37 2d 666 0.90%\n",
|
| 427 |
+
"38 fantasy 643 0.87%\n",
|
| 428 |
+
"39 art 633 0.85%\n",
|
| 429 |
+
"40 poses 619 0.83%\n",
|
| 430 |
+
"41 3d 578 0.78%\n",
|
| 431 |
+
"42 nsfw 550 0.74%\n",
|
| 432 |
+
"43 artist 544 0.73%\n",
|
| 433 |
+
"44 genshin impact 519 0.70%\n",
|
| 434 |
+
"45 idolmaster 483 0.65%\n",
|
| 435 |
+
"46 fire emblem 481 0.65%\n",
|
| 436 |
+
"47 fate 432 0.58%\n",
|
| 437 |
+
"48 waifu 406 0.55%\n",
|
| 438 |
+
"49 azur lane 399 0.54%\n",
|
| 439 |
+
"50 dragon ball 388 0.52%\n",
|
| 440 |
+
"51 ponyxl 379 0.51%\n",
|
| 441 |
+
"52 naruto 379 0.51%\n",
|
| 442 |
+
"53 precure 379 0.51%\n",
|
| 443 |
+
"54 videogame 372 0.50%\n",
|
| 444 |
+
"55 retro 356 0.48%\n",
|
| 445 |
+
"56 meme 355 0.48%\n",
|
| 446 |
+
"57 arknights 354 0.48%\n",
|
| 447 |
+
"58 hololive 347 0.47%\n",
|
| 448 |
+
"59 virtual youtuber 347 0.47%\n",
|
| 449 |
+
"60 umamusume 333 0.45%\n",
|
| 450 |
+
"61 falcom 326 0.44%\n",
|
| 451 |
+
"62 one piece 325 0.44%\n",
|
| 452 |
+
"63 boy 321 0.43%\n",
|
| 453 |
+
"64 chibi 315 0.42%\n",
|
| 454 |
+
"65 comics 303 0.41%\n",
|
| 455 |
+
"66 idolm@ster 295 0.40%\n",
|
| 456 |
+
"67 gundam 294 0.40%\n",
|
| 457 |
+
"68 bleach 294 0.40%\n",
|
| 458 |
+
"69 pose 288 0.39%\n",
|
| 459 |
+
"70 guy 284 0.38%\n",
|
| 460 |
+
"71 milf 281 0.38%\n",
|
| 461 |
+
"72 my hero academia 279 0.38%\n",
|
| 462 |
+
"73 genshin 276 0.37%\n",
|
| 463 |
+
"74 porn 268 0.36%\n",
|
| 464 |
+
"75 kawaii 257 0.35%\n",
|
| 465 |
+
"76 kantai collection 255 0.34%\n",
|
| 466 |
+
"77 galgame 250 0.34%\n",
|
| 467 |
+
"78 eiyuu densetsu 246 0.33%\n",
|
| 468 |
+
"79 animals 239 0.32%\n",
|
| 469 |
+
"80 yu-gi-oh! 236 0.32%\n",
|
| 470 |
+
"81 comic 232 0.31%\n",
|
| 471 |
+
"82 sex 228 0.31%\n",
|
| 472 |
+
"83 cinderella girls 227 0.31%\n",
|
| 473 |
+
"84 kancolle 223 0.30%\n",
|
| 474 |
+
"85 huge breasts 220 0.30%\n",
|
| 475 |
+
"86 clothes 219 0.30%\n",
|
| 476 |
+
"87 digital art 217 0.29%\n",
|
| 477 |
+
"88 oc 216 0.29%\n",
|
| 478 |
+
"89 scenery 215 0.29%\n",
|
| 479 |
+
"90 nintendo 215 0.29%\n",
|
| 480 |
+
"91 manhwa 214 0.29%\n",
|
| 481 |
+
"92 final fantasy 211 0.28%\n",
|
| 482 |
+
"93 nikke 211 0.28%\n",
|
| 483 |
+
"94 cosplay 208 0.28%\n",
|
| 484 |
+
"95 beautiful 207 0.28%\n",
|
| 485 |
+
"96 dragon ball z 207 0.28%\n",
|
| 486 |
+
"97 concepts 207 0.28%\n",
|
| 487 |
+
"98 videogame character 207 0.28%\n",
|
| 488 |
+
"99 thick thighs 205 0.28%\n",
|
| 489 |
+
"100 wide hips 202 0.27%\n",
|
| 490 |
+
"\n",
|
| 491 |
+
"================================================================================\n",
|
| 492 |
+
"\n"
|
| 493 |
+
]
|
| 494 |
+
}
|
| 495 |
+
],
|
| 496 |
+
"source": [
|
| 497 |
+
"from pathlib import Path\n",
|
| 498 |
+
"import json\n",
|
| 499 |
+
"from collections import defaultdict\n",
|
| 500 |
+
"\n",
|
| 501 |
+
"current_dir = Path.cwd()\n",
|
| 502 |
+
"\n",
|
| 503 |
+
"def load_data(filepath):\n",
|
| 504 |
+
" \"\"\"Load the JSON data from file.\"\"\"\n",
|
| 505 |
+
" with open(filepath, 'r', encoding='utf-8') as f:\n",
|
| 506 |
+
" return json.load(f)\n",
|
| 507 |
+
"\n",
|
| 508 |
+
"def get_top_cooccurrences(data, target_tag, top_n=10):\n",
|
| 509 |
+
" \"\"\"\n",
|
| 510 |
+
" Find the top N tags that co-occur with the target tag.\n",
|
| 511 |
+
" \n",
|
| 512 |
+
" Args:\n",
|
| 513 |
+
" data: Dictionary with 'nodes' and 'links'\n",
|
| 514 |
+
" target_tag: The main tag to analyze (e.g., \"woman\")\n",
|
| 515 |
+
" top_n: Number of top co-occurring tags to return (default: 10)\n",
|
| 516 |
+
" \n",
|
| 517 |
+
" Returns:\n",
|
| 518 |
+
" List of tuples (tag, count, percentage) sorted by count\n",
|
| 519 |
+
" \"\"\"\n",
|
| 520 |
+
" # Find the target tag's total occurrences\n",
|
| 521 |
+
" target_size = None\n",
|
| 522 |
+
" for node in data['nodes']:\n",
|
| 523 |
+
" if node['id'] == target_tag:\n",
|
| 524 |
+
" target_size = node['size']\n",
|
| 525 |
+
" break\n",
|
| 526 |
+
" \n",
|
| 527 |
+
" if target_size is None:\n",
|
| 528 |
+
" print(f\"Error: Tag '{target_tag}' not found in nodes!\")\n",
|
| 529 |
+
" return None, None\n",
|
| 530 |
+
" \n",
|
| 531 |
+
" # Find all co-occurrences in links\n",
|
| 532 |
+
" cooccurrences = []\n",
|
| 533 |
+
" \n",
|
| 534 |
+
" for link in data['links']:\n",
|
| 535 |
+
" if link['source'] == target_tag:\n",
|
| 536 |
+
" cooccurrences.append({\n",
|
| 537 |
+
" 'tag': link['target'],\n",
|
| 538 |
+
" 'count': link['value']\n",
|
| 539 |
+
" })\n",
|
| 540 |
+
" elif link['target'] == target_tag:\n",
|
| 541 |
+
" cooccurrences.append({\n",
|
| 542 |
+
" 'tag': link['source'],\n",
|
| 543 |
+
" 'count': link['value']\n",
|
| 544 |
+
" })\n",
|
| 545 |
+
" \n",
|
| 546 |
+
" # Sort by count (descending) and take top N\n",
|
| 547 |
+
" cooccurrences.sort(key=lambda x: x['count'], reverse=True)\n",
|
| 548 |
+
" top_cooccurrences = cooccurrences[:top_n]\n",
|
| 549 |
+
" \n",
|
| 550 |
+
" # Calculate percentages\n",
|
| 551 |
+
" results = []\n",
|
| 552 |
+
" for item in top_cooccurrences:\n",
|
| 553 |
+
" percentage = (item['count'] / target_size) * 100\n",
|
| 554 |
+
" results.append((item['tag'], item['count'], percentage))\n",
|
| 555 |
+
" \n",
|
| 556 |
+
" return results, target_size\n",
|
| 557 |
+
"\n",
|
| 558 |
+
"def display_results(target_tag, results, target_size, top_n):\n",
|
| 559 |
+
" \"\"\"Display the results in a formatted table.\"\"\"\n",
|
| 560 |
+
" if results is None:\n",
|
| 561 |
+
" return\n",
|
| 562 |
+
" \n",
|
| 563 |
+
" print(f\"\\n{'='*80}\")\n",
|
| 564 |
+
" print(f\"Top {top_n} Co-occurring Tags for: '{target_tag}'\")\n",
|
| 565 |
+
" print(f\"{'='*80}\")\n",
|
| 566 |
+
" print(f\"Total occurrences of '{target_tag}': {target_size:,}\\n\")\n",
|
| 567 |
+
" \n",
|
| 568 |
+
" if not results:\n",
|
| 569 |
+
" print(f\"No co-occurrences found for '{target_tag}'\")\n",
|
| 570 |
+
" return\n",
|
| 571 |
+
" \n",
|
| 572 |
+
" # Print header\n",
|
| 573 |
+
" print(f\"{'Rank':<6} {'Tag':<30} {'Count':<12} {'Percentage':<12}\")\n",
|
| 574 |
+
" print(f\"{'-'*6} {'-'*30} {'-'*12} {'-'*12}\")\n",
|
| 575 |
+
" \n",
|
| 576 |
+
" # Print results\n",
|
| 577 |
+
" for i, (tag, count, percentage) in enumerate(results, 1):\n",
|
| 578 |
+
" print(f\"{i:<6} {tag:<30} {count:<12,} {percentage:>10.2f}%\")\n",
|
| 579 |
+
" \n",
|
| 580 |
+
" print(f\"\\n{'='*80}\\n\")\n",
|
| 581 |
+
"\n",
|
| 582 |
+
"def main():\n",
|
| 583 |
+
" # Load the data\n",
|
| 584 |
+
" filepath = current_dir.parent / \"public/json/nodes_all.json\"\n",
|
| 585 |
+
" print(\"Loading data...\")\n",
|
| 586 |
+
" data = load_data(filepath)\n",
|
| 587 |
+
" print(f\"Loaded {len(data['nodes']):,} nodes and {len(data['links']):,} links\")\n",
|
| 588 |
+
" \n",
|
| 589 |
+
" # Analyze different tags\n",
|
| 590 |
+
" target_tags = [\"anime\"] # Add more tags here to analyze multiple\n",
|
| 591 |
+
" top_n = 100\n",
|
| 592 |
+
" \n",
|
| 593 |
+
" for target_tag in target_tags:\n",
|
| 594 |
+
" results, target_size = get_top_cooccurrences(data, target_tag, top_n)\n",
|
| 595 |
+
" display_results(target_tag, results, target_size, top_n)\n",
|
| 596 |
+
"\n",
|
| 597 |
+
"if __name__ == \"__main__\":\n",
|
| 598 |
+
" main()"
|
| 599 |
+
]
|
| 600 |
+
},
|
| 601 |
+
{
|
| 602 |
+
"cell_type": "code",
|
| 603 |
+
"execution_count": 25,
|
| 604 |
+
"id": "e8af35af-a4b9-4011-b8c3-5ec8e75ce6c1",
|
| 605 |
+
"metadata": {
|
| 606 |
+
"execution": {
|
| 607 |
+
"iopub.execute_input": "2025-12-09T19:47:44.592679Z",
|
| 608 |
+
"iopub.status.busy": "2025-12-09T19:47:44.592464Z",
|
| 609 |
+
"iopub.status.idle": "2025-12-09T19:47:44.649077Z",
|
| 610 |
+
"shell.execute_reply": "2025-12-09T19:47:44.648523Z",
|
| 611 |
+
"shell.execute_reply.started": "2025-12-09T19:47:44.592664Z"
|
| 612 |
+
}
|
| 613 |
+
},
|
| 614 |
+
"outputs": [
|
| 615 |
+
{
|
| 616 |
+
"name": "stdout",
|
| 617 |
+
"output_type": "stream",
|
| 618 |
+
"text": [
|
| 619 |
+
"Loading data...\n",
|
| 620 |
+
"Loaded 60,330 nodes and 16,921 links\n",
|
| 621 |
+
"\n",
|
| 622 |
+
"\n",
|
| 623 |
+
"================================================================================\n",
|
| 624 |
+
"Co-occurrence Analysis: 'anime' + 'dragon ball'\n",
|
| 625 |
+
"================================================================================\n",
|
| 626 |
+
"\n",
|
| 627 |
+
"Total occurrences of 'anime': 74,187\n",
|
| 628 |
+
"Total occurrences of 'dragon ball': 479\n",
|
| 629 |
+
"\n",
|
| 630 |
+
"Items with BOTH tags: 388\n",
|
| 631 |
+
"\n",
|
| 632 |
+
">>> 0.52% of 'anime' occurrences also have 'dragon ball'\n",
|
| 633 |
+
"\n",
|
| 634 |
+
"================================================================================\n",
|
| 635 |
+
"\n"
|
| 636 |
+
]
|
| 637 |
+
}
|
| 638 |
+
],
|
| 639 |
+
"source": [
|
| 640 |
+
"from pathlib import Path\n",
|
| 641 |
+
"import json\n",
|
| 642 |
+
"from collections import defaultdict\n",
|
| 643 |
+
"\n",
|
| 644 |
+
"current_dir = Path.cwd()\n",
|
| 645 |
+
"\n",
|
| 646 |
+
"def load_data(filepath):\n",
|
| 647 |
+
" \"\"\"Load the JSON data from file.\"\"\"\n",
|
| 648 |
+
" with open(filepath, 'r', encoding='utf-8') as f:\n",
|
| 649 |
+
" return json.load(f)\n",
|
| 650 |
+
"\n",
|
| 651 |
+
"def get_tag_cooccurrence(data, tag1, tag2):\n",
|
| 652 |
+
" \"\"\"\n",
|
| 653 |
+
" Find what percentage of tag1 occurrences also have tag2.\n",
|
| 654 |
+
" \n",
|
| 655 |
+
" Args:\n",
|
| 656 |
+
" data: Dictionary with 'nodes' and 'links'\n",
|
| 657 |
+
" tag1: Primary tag to analyze (e.g., \"cat\")\n",
|
| 658 |
+
" tag2: Secondary tag to check for (e.g., \"dog\")\n",
|
| 659 |
+
" \n",
|
| 660 |
+
" Returns:\n",
|
| 661 |
+
" Dictionary with co-occurrence information\n",
|
| 662 |
+
" \"\"\"\n",
|
| 663 |
+
" # Find the tags' total occurrences\n",
|
| 664 |
+
" tag1_size = None\n",
|
| 665 |
+
" tag2_size = None\n",
|
| 666 |
+
" \n",
|
| 667 |
+
" for node in data['nodes']:\n",
|
| 668 |
+
" if node['id'] == tag1:\n",
|
| 669 |
+
" tag1_size = node['size']\n",
|
| 670 |
+
" if node['id'] == tag2:\n",
|
| 671 |
+
" tag2_size = node['size']\n",
|
| 672 |
+
" \n",
|
| 673 |
+
" if tag1_size is None:\n",
|
| 674 |
+
" print(f\"Error: Tag '{tag1}' not found in nodes!\")\n",
|
| 675 |
+
" return None\n",
|
| 676 |
+
" \n",
|
| 677 |
+
" if tag2_size is None:\n",
|
| 678 |
+
" print(f\"Error: Tag '{tag2}' not found in nodes!\")\n",
|
| 679 |
+
" return None\n",
|
| 680 |
+
" \n",
|
| 681 |
+
" # Find co-occurrence count in links\n",
|
| 682 |
+
" # This represents how many items have BOTH tag1 AND tag2\n",
|
| 683 |
+
" cooccurrence_count = 0\n",
|
| 684 |
+
" \n",
|
| 685 |
+
" for link in data['links']:\n",
|
| 686 |
+
" if (link['source'] == tag1 and link['target'] == tag2) or \\\n",
|
| 687 |
+
" (link['source'] == tag2 and link['target'] == tag1):\n",
|
| 688 |
+
" cooccurrence_count = link['value']\n",
|
| 689 |
+
" break\n",
|
| 690 |
+
" \n",
|
| 691 |
+
" # Calculate percentage: what % of tag1 items also have tag2\n",
|
| 692 |
+
" percentage_with_tag2 = (cooccurrence_count / tag1_size) * 100 if tag1_size > 0 else 0\n",
|
| 693 |
+
" \n",
|
| 694 |
+
" return {\n",
|
| 695 |
+
" 'primary_tag': tag1,\n",
|
| 696 |
+
" 'secondary_tag': tag2,\n",
|
| 697 |
+
" 'primary_tag_total': tag1_size,\n",
|
| 698 |
+
" 'secondary_tag_total': tag2_size,\n",
|
| 699 |
+
" 'cooccurrence_count': cooccurrence_count,\n",
|
| 700 |
+
" 'percentage_with_secondary': percentage_with_tag2\n",
|
| 701 |
+
" }\n",
|
| 702 |
+
"\n",
|
| 703 |
+
"def display_cooccurrence_results(result):\n",
|
| 704 |
+
" \"\"\"Display the co-occurrence results in a formatted way.\"\"\"\n",
|
| 705 |
+
" if result is None:\n",
|
| 706 |
+
" return\n",
|
| 707 |
+
" \n",
|
| 708 |
+
" print(f\"\\n{'='*80}\")\n",
|
| 709 |
+
" print(f\"Co-occurrence Analysis: '{result['primary_tag']}' + '{result['secondary_tag']}'\")\n",
|
| 710 |
+
" print(f\"{'='*80}\\n\")\n",
|
| 711 |
+
" \n",
|
| 712 |
+
" print(f\"Total occurrences of '{result['primary_tag']}': {result['primary_tag_total']:,}\")\n",
|
| 713 |
+
" print(f\"Total occurrences of '{result['secondary_tag']}': {result['secondary_tag_total']:,}\")\n",
|
| 714 |
+
" print(f\"\\nItems with BOTH tags: {result['cooccurrence_count']:,}\")\n",
|
| 715 |
+
" print(f\"\\n>>> {result['percentage_with_secondary']:.2f}% of '{result['primary_tag']}' occurrences also have '{result['secondary_tag']}'\")\n",
|
| 716 |
+
" \n",
|
| 717 |
+
" print(f\"\\n{'='*80}\\n\")\n",
|
| 718 |
+
"\n",
|
| 719 |
+
"def analyze_multiple_pairs(data, tag_pairs):\n",
|
| 720 |
+
" \"\"\"\n",
|
| 721 |
+
" Analyze multiple tag pairs at once.\n",
|
| 722 |
+
" \n",
|
| 723 |
+
" Args:\n",
|
| 724 |
+
" data: Dictionary with 'nodes' and 'links'\n",
|
| 725 |
+
" tag_pairs: List of tuples, each containing two tags to compare\n",
|
| 726 |
+
" \"\"\"\n",
|
| 727 |
+
" results = []\n",
|
| 728 |
+
" \n",
|
| 729 |
+
" for tag1, tag2 in tag_pairs:\n",
|
| 730 |
+
" result = get_tag_cooccurrence(data, tag1, tag2)\n",
|
| 731 |
+
" if result:\n",
|
| 732 |
+
" results.append(result)\n",
|
| 733 |
+
" display_cooccurrence_results(result)\n",
|
| 734 |
+
" \n",
|
| 735 |
+
" return results\n",
|
| 736 |
+
"\n",
|
| 737 |
+
"def main():\n",
|
| 738 |
+
" # Load the data\n",
|
| 739 |
+
" filepath = current_dir.parent / \"public/json/nodes_all.json\"\n",
|
| 740 |
+
" print(\"Loading data...\")\n",
|
| 741 |
+
" data = load_data(filepath)\n",
|
| 742 |
+
" print(f\"Loaded {len(data['nodes']):,} nodes and {len(data['links']):,} links\\n\")\n",
|
| 743 |
+
" \n",
|
| 744 |
+
" # Analyze: What percentage of \"cat\" occurrences also have \"dog\"?\n",
|
| 745 |
+
" primary_tag = \"anime\" # The main tag you're interested in\n",
|
| 746 |
+
" secondary_tag = \"dragon ball\" # The tag you want to check for\n",
|
| 747 |
+
" \n",
|
| 748 |
+
" result = get_tag_cooccurrence(data, primary_tag, secondary_tag)\n",
|
| 749 |
+
" display_cooccurrence_results(result)\n",
|
| 750 |
+
" \n",
|
| 751 |
+
" # You can also check the reverse: What percentage of \"dog\" occurrences also have \"cat\"?\n",
|
| 752 |
+
" # result_reverse = get_tag_cooccurrence(data, \"dog\", \"cat\")\n",
|
| 753 |
+
" # display_cooccurrence_results(result_reverse)\n",
|
| 754 |
+
" \n",
|
| 755 |
+
" # Option: Analyze multiple pairs at once\n",
|
| 756 |
+
" # Remove the enclosing triple quotes below to analyze multiple pairs\n",
|
| 757 |
+
" \"\"\"\n",
|
| 758 |
+
" tag_pairs = [\n",
|
| 759 |
+
" (\"cat\", \"dog\"),\n",
|
| 760 |
+
" (\"boy\", \"anime\"),\n",
|
| 761 |
+
" (\"girl\", \"anime\"),\n",
|
| 762 |
+
" (\"man\", \"photorealistic\")\n",
|
| 763 |
+
" ]\n",
|
| 764 |
+
" results = analyze_multiple_pairs(data, tag_pairs)\n",
|
| 765 |
+
" \"\"\"\n",
|
| 766 |
+
"\n",
|
| 767 |
+
"if __name__ == \"__main__\":\n",
|
| 768 |
+
" main()"
|
| 769 |
+
]
|
| 770 |
+
},
|
| 771 |
+
{
|
| 772 |
+
"cell_type": "code",
|
| 773 |
+
"execution_count": null,
|
| 774 |
+
"id": "a79ee96c-a060-4cc5-82d8-5642ccbef328",
|
| 775 |
+
"metadata": {},
|
| 776 |
+
"outputs": [],
|
| 777 |
+
"source": []
|
| 778 |
+
}
|
| 779 |
+
],
|
| 780 |
+
"metadata": {
|
| 781 |
+
"kernelspec": {
|
| 782 |
+
"display_name": "Python 3 (ipykernel)",
|
| 783 |
+
"language": "python",
|
| 784 |
+
"name": "python3"
|
| 785 |
+
},
|
| 786 |
+
"language_info": {
|
| 787 |
+
"codemirror_mode": {
|
| 788 |
+
"name": "ipython",
|
| 789 |
+
"version": 3
|
| 790 |
+
},
|
| 791 |
+
"file_extension": ".py",
|
| 792 |
+
"mimetype": "text/x-python",
|
| 793 |
+
"name": "python",
|
| 794 |
+
"nbconvert_exporter": "python",
|
| 795 |
+
"pygments_lexer": "ipython3",
|
| 796 |
+
"version": "3.13.9"
|
| 797 |
+
}
|
| 798 |
+
},
|
| 799 |
+
"nbformat": 4,
|
| 800 |
+
"nbformat_minor": 5
|
| 801 |
+
}
|
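Review note on the co-occurrence cells above: each lookup rescans `data['nodes']` and `data['links']` linearly. A minimal sketch of an indexed alternative follows, assuming the same `nodes` (`id`, `size`) and `links` (`source`, `target`, `value`) layout that the notebook reads from `nodes_all.json`. The helper names `build_cooccurrence_index` and `cooccurrence_percentage` are hypothetical and the toy values are invented for illustration.

```python
from collections import defaultdict


def build_cooccurrence_index(data):
    """Index node sizes and link values so later lookups avoid rescanning."""
    sizes = {node["id"]: node["size"] for node in data["nodes"]}
    neighbors = defaultdict(dict)
    for link in data["links"]:
        # Links are treated as undirected, so store the count both ways.
        neighbors[link["source"]][link["target"]] = link["value"]
        neighbors[link["target"]][link["source"]] = link["value"]
    return sizes, neighbors


def cooccurrence_percentage(sizes, neighbors, tag1, tag2):
    """Return the percentage of tag1 occurrences that also carry tag2."""
    total = sizes.get(tag1)
    if not total:
        return None  # unknown tag or zero occurrences
    return 100 * neighbors[tag1].get(tag2, 0) / total


# Toy data in the assumed shape (values invented for illustration).
toy = {
    "nodes": [{"id": "anime", "size": 200}, {"id": "manga", "size": 50}],
    "links": [{"source": "manga", "target": "anime", "value": 40}],
}
sizes, neighbors = build_cooccurrence_index(toy)
print(cooccurrence_percentage(sizes, neighbors, "anime", "manga"))  # 20.0
print(cooccurrence_percentage(sizes, neighbors, "manga", "anime"))  # 80.0
```

The percentage is directional: the same link value is divided by a different node size depending on which tag is treated as primary, which matches how the notebook reports "X% of tag1 occurrences also have tag2".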
jupyter_notebooks/Section_2-3-2_top_10_most_popular_checkpoints.ipynb
ADDED
|
@@ -0,0 +1,210 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "14f6b6e3-5edb-458a-9553-09616455ba9f",
|
| 6 |
+
"metadata": {},
|
| 7 |
+
"source": [
|
| 8 |
+
"# Section 6.3: Download Models"
|
| 9 |
+
]
|
| 10 |
+
},
|
| 11 |
+
{
|
| 12 |
+
"cell_type": "markdown",
|
| 13 |
+
"id": "64d0907d-780d-48a0-b2c4-a9de650a3760",
|
| 14 |
+
"metadata": {},
|
| 15 |
+
"source": [
|
| 16 |
+
"### Download models from Civitai into their respective folders under ~/ComfyUI/models\n"
|
| 17 |
+
]
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"cell_type": "code",
|
| 21 |
+
"execution_count": 1,
|
| 22 |
+
"id": "2f788cc1-d8ff-43b8-9051-a3d5d8d79086",
|
| 23 |
+
"metadata": {
|
| 24 |
+
"execution": {
|
| 25 |
+
"iopub.execute_input": "2025-02-06T18:25:06.875713Z",
|
| 26 |
+
"iopub.status.busy": "2025-02-06T18:25:06.875579Z",
|
| 27 |
+
"iopub.status.idle": "2025-02-06T18:25:07.380814Z",
|
| 28 |
+
"shell.execute_reply": "2025-02-06T18:25:07.380184Z",
|
| 29 |
+
"shell.execute_reply.started": "2025-02-06T18:25:06.875698Z"
|
| 30 |
+
}
|
| 31 |
+
},
|
| 32 |
+
"outputs": [],
|
| 33 |
+
"source": [
|
| 34 |
+
"import os\n",
|
| 35 |
+
"import re\n",
|
| 36 |
+
"import csv\n",
|
| 37 |
+
"import json\n",
|
| 38 |
+
"import time\n",
|
| 39 |
+
"import requests\n",
|
| 40 |
+
"from itertools import cycle\n",
|
| 41 |
+
"import pandas as pd\n",
|
| 42 |
+
"from requests.adapters import HTTPAdapter\n",
|
| 43 |
+
"from urllib3.util.retry import Retry\n",
|
| 44 |
+
"from pathlib import Path"
|
| 45 |
+
]
|
| 46 |
+
},
|
| 47 |
+
{
|
| 48 |
+
"cell_type": "code",
|
| 49 |
+
"execution_count": 2,
|
| 50 |
+
"id": "7b6349d2-b839-4b8e-895e-cc7eb3fba88f",
|
| 51 |
+
"metadata": {
|
| 52 |
+
"execution": {
|
| 53 |
+
"iopub.execute_input": "2025-02-06T18:25:07.848340Z",
|
| 54 |
+
"iopub.status.busy": "2025-02-06T18:25:07.848183Z",
|
| 55 |
+
"iopub.status.idle": "2025-02-06T18:25:07.851420Z",
|
| 56 |
+
"shell.execute_reply": "2025-02-06T18:25:07.850952Z",
|
| 57 |
+
"shell.execute_reply.started": "2025-02-06T18:25:07.848325Z"
|
| 58 |
+
}
|
| 59 |
+
},
|
| 60 |
+
"outputs": [],
|
| 61 |
+
"source": [
|
| 62 |
+
"current_dir = Path.cwd()\n",
|
| 63 |
+
"api_karussell = current_dir.parent / 'misc/api_keys.txt'\n",
|
| 64 |
+
"target_dir = current_dir.parent / 'data/models/checkpoints/' \n",
|
| 65 |
+
"target_dir.mkdir(parents=True, exist_ok=True)"
|
| 66 |
+
]
|
| 67 |
+
},
|
| 68 |
+
{
|
| 69 |
+
"cell_type": "markdown",
|
| 70 |
+
"id": "3c2a9244-5070-4159-821f-39bb92051daf",
|
| 71 |
+
"metadata": {},
|
| 72 |
+
"source": [
|
| 73 |
+
"## function definition:"
|
| 74 |
+
]
|
| 75 |
+
},
|
| 76 |
+
{
|
| 77 |
+
"cell_type": "code",
|
| 78 |
+
"execution_count": 3,
|
| 79 |
+
"id": "aeaaf1eb-9768-47ad-9298-7d7a8be59094",
|
| 80 |
+
"metadata": {
|
| 81 |
+
"execution": {
|
| 82 |
+
"iopub.execute_input": "2025-02-06T18:25:09.379460Z",
|
| 83 |
+
"iopub.status.busy": "2025-02-06T18:25:09.379303Z",
|
| 84 |
+
"iopub.status.idle": "2025-02-06T18:25:09.382034Z",
|
| 85 |
+
"shell.execute_reply": "2025-02-06T18:25:09.381557Z",
|
| 86 |
+
"shell.execute_reply.started": "2025-02-06T18:25:09.379446Z"
|
| 87 |
+
}
|
| 88 |
+
},
|
| 89 |
+
"outputs": [],
|
| 90 |
+
"source": [
|
| 91 |
+
"csv_file_path = current_dir.parent / 'data/CSV/model_subsets/Civiverse_checkpoint_only.csv'"
|
| 92 |
+
]
|
| 93 |
+
},
|
| 94 |
+
{
|
| 95 |
+
"cell_type": "code",
|
| 96 |
+
"execution_count": null,
|
| 97 |
+
"id": "2cdc7051-af24-439c-b985-b5088bb13cc0",
|
| 98 |
+
"metadata": {
|
| 99 |
+
"execution": {
|
| 100 |
+
"iopub.execute_input": "2025-02-06T18:25:10.192171Z",
|
| 101 |
+
"iopub.status.busy": "2025-02-06T18:25:10.191997Z"
|
| 102 |
+
}
|
| 103 |
+
},
|
| 104 |
+
"outputs": [
|
| 105 |
+
{
|
| 106 |
+
"name": "stdout",
|
| 107 |
+
"output_type": "stream",
|
| 108 |
+
"text": [
|
| 109 |
+
"Downloaded: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/models/checkpoints/Pony/realDream_sdxlPony14.safetensors\n",
|
| 110 |
+
"Downloaded: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/models/checkpoints/Flux.1 D/acornIsSpinningFLUX_aisFluxV15.safetensors\n",
|
| 111 |
+
"Failed to download https://civitai.com/api/download/models/1376998: 401 Client Error: Unauthorized for url: https://civitai.com/api/download/models/1376998\n",
|
| 112 |
+
"Failed to download https://civitai.com/api/download/models/1379842: 401 Client Error: Unauthorized for url: https://civitai.com/api/download/models/1379842\n",
|
| 113 |
+
"Downloaded: /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/models/checkpoints/Illustrious/miaomiaoHarem_v15a.safetensors\n"
|
| 114 |
+
]
|
| 115 |
+
}
|
| 116 |
+
],
|
| 117 |
+
"source": [
|
| 118 |
+
"with open(api_karussell, 'r') as f:\n",
|
| 119 |
+
" api_keys = [line.strip() for line in f if line.strip()]\n",
|
| 120 |
+
"api_key_cycle = cycle(api_keys) # Rotate API keys\n",
|
| 121 |
+
"\n",
|
| 122 |
+
"# Load CSV file\n",
|
| 123 |
+
"\n",
|
| 124 |
+
"df = pd.read_csv(csv_file_path)\n",
|
| 125 |
+
"\n",
|
| 126 |
+
"\n",
|
| 127 |
+
"\n",
|
| 128 |
+
"def get_filename_from_response(response, url):\n",
|
| 129 |
+
" if 'content-disposition' in response.headers:\n",
|
| 130 |
+
" content_disposition = response.headers['content-disposition']\n",
|
| 131 |
+
" filename = content_disposition.split(\"filename=\")[-1].strip(\"\\\"\")\n",
|
| 132 |
+
" else:\n",
|
| 133 |
+
" filename = url.split(\"/\")[-1]\n",
|
| 134 |
+
" return filename\n",
|
| 135 |
+
"\n",
|
| 136 |
+
"def download_file(url, dest_folder, api_key):\n",
|
| 137 |
+
" headers = {'Authorization': f'Bearer {api_key}'} # If authentication is needed\n",
|
| 138 |
+
" \n",
|
| 139 |
+
" try:\n",
|
| 140 |
+
" response = requests.get(url, headers=headers, stream=True, timeout=60) # timeout avoids hanging on a stalled connection\n",
|
| 141 |
+
" response.raise_for_status()\n",
|
| 142 |
+
" \n",
|
| 143 |
+
" filename = get_filename_from_response(response, url)\n",
|
| 144 |
+
" dest_path = dest_folder / filename\n",
|
| 145 |
+
" \n",
|
| 146 |
+
" with open(dest_path, 'wb') as f:\n",
|
| 147 |
+
" for chunk in response.iter_content(chunk_size=8192):\n",
|
| 148 |
+
" f.write(chunk)\n",
|
| 149 |
+
" \n",
|
| 150 |
+
" print(f\"Downloaded: {dest_path}\")\n",
|
| 151 |
+
" except requests.exceptions.RequestException as e:\n",
|
| 152 |
+
" print(f\"Failed to download {url}: {e}\")\n",
|
| 153 |
+
"\n",
|
| 154 |
+
"def main():\n",
|
| 155 |
+
" for _, row in df.iterrows():\n",
|
| 156 |
+
" url = row['downloadUrl']\n",
|
| 157 |
+
" base_model = row['baseModel']\n",
|
| 158 |
+
" \n",
|
| 159 |
+
" if pd.isna(url) or pd.isna(base_model):\n",
|
| 160 |
+
" continue # Skip missing values\n",
|
| 161 |
+
" \n",
|
| 162 |
+
" # Create model-specific folder\n",
|
| 163 |
+
" model_folder = target_dir / base_model\n",
|
| 164 |
+
" model_folder.mkdir(parents=True, exist_ok=True)\n",
|
| 165 |
+
" \n",
|
| 166 |
+
" # Rotate API keys\n",
|
| 167 |
+
" api_key = next(api_key_cycle)\n",
|
| 168 |
+
" \n",
|
| 169 |
+
" # Download file\n",
|
| 170 |
+
" download_file(url, model_folder, api_key)\n",
|
| 171 |
+
" \n",
|
| 172 |
+
" # Sleep to avoid rate limits\n",
|
| 173 |
+
" time.sleep(2) # Adjust based on API limits\n",
|
| 174 |
+
"\n",
|
| 175 |
+
"\n",
|
| 176 |
+
"if __name__ == \"__main__\":\n",
|
| 177 |
+
" main()\n"
|
| 178 |
+
]
|
| 179 |
+
},
|
| 180 |
+
{
|
| 181 |
+
"cell_type": "code",
|
| 182 |
+
"execution_count": null,
|
| 183 |
+
"id": "c99e2fee-eef2-4530-b69a-9a3db3bda84f",
|
| 184 |
+
"metadata": {},
|
| 185 |
+
"outputs": [],
|
| 186 |
+
"source": []
|
| 187 |
+
}
|
| 188 |
+
],
|
| 189 |
+
"metadata": {
|
| 190 |
+
"kernelspec": {
|
| 191 |
+
"display_name": "Python 3 (ipykernel)",
|
| 192 |
+
"language": "python",
|
| 193 |
+
"name": "python3"
|
| 194 |
+
},
|
| 195 |
+
"language_info": {
|
| 196 |
+
"codemirror_mode": {
|
| 197 |
+
"name": "ipython",
|
| 198 |
+
"version": 3
|
| 199 |
+
},
|
| 200 |
+
"file_extension": ".py",
|
| 201 |
+
"mimetype": "text/x-python",
|
| 202 |
+
"name": "python",
|
| 203 |
+
"nbconvert_exporter": "python",
|
| 204 |
+
"pygments_lexer": "ipython3",
|
| 205 |
+
"version": "3.11.9"
|
| 206 |
+
}
|
| 207 |
+
},
|
| 208 |
+
"nbformat": 4,
|
| 209 |
+
"nbformat_minor": 5
|
| 210 |
+
}
|
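Review note on `get_filename_from_response` in the notebook above: it splits the `content-disposition` header on `filename=` by hand. A minimal sketch of a more defensive variant follows, using only the standard library. The name `filename_from_headers` and the example URLs are hypothetical. This is an alternative under those assumptions, not the notebook's method.

```python
from email.message import Message
from pathlib import Path


def filename_from_headers(content_disposition, url):
    """Pick a safe local filename from a Content-Disposition value or the URL."""
    name = None
    if content_disposition:
        # Let the standard library handle quoting and header parameters.
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        name = msg.get_filename()
    if not name:
        # Fall back to the last URL path segment, as the notebook does.
        name = url.rstrip("/").split("/")[-1]
    # Keep only the final path component so the name cannot leave dest_folder.
    return Path(name).name


print(filename_from_headers(
    'attachment; filename="model_v1.safetensors"',
    "https://example.com/api/download/models/123",
))  # model_v1.safetensors
print(filename_from_headers(None, "https://example.com/api/download/models/123"))  # 123
```

Reducing the result to `Path(name).name` is a deliberate choice: a server-supplied name such as `../evil.bin` would otherwise be joined onto the destination folder unchanged.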
jupyter_notebooks/Section_2-3-3_Figure_7_top_30_adapters.ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
jupyter_notebooks/Section_2-3-4_Figure_8_Step_1_LLM_annotation.ipynb
ADDED
|
@@ -0,0 +1,1941 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "23d0ae58",
|
| 6 |
+
"metadata": {},
|
| 7 |
+
"source": [
|
| 8 |
+
"# Deepfake Adapter Dataset - LLM Annotation Pipeline"
|
| 9 |
+
]
|
| 10 |
+
},
|
| 11 |
+
{
|
| 12 |
+
"cell_type": "markdown",
|
| 13 |
+
"id": "e4407358",
|
| 14 |
+
"metadata": {},
|
| 15 |
+
"source": [
|
| 16 |
+
"### Unified Model Loading & Inference\n",
|
| 17 |
+
"Code for querying Mistral, Gemma, and Qwen models."
|
| 18 |
+
]
|
| 19 |
+
},
|
| 20 |
+
{
|
| 21 |
+
"cell_type": "markdown",
|
| 22 |
+
"id": "1a1b9d0e",
|
| 23 |
+
"metadata": {},
|
| 24 |
+
"source": [
|
| 25 |
+
"## CLEANING & PREPROCESSING"
|
| 26 |
+
]
|
| 27 |
+
},
|
| 28 |
+
{
|
| 29 |
+
"cell_type": "markdown",
|
| 30 |
+
"id": "3df42c46",
|
| 31 |
+
"metadata": {},
|
| 32 |
+
"source": [
|
| 33 |
+
"#### Named Entity Recognition (NER) using spaCy"
|
| 34 |
+
]
|
| 35 |
+
},
|
| 36 |
+
{
|
| 37 |
+
"cell_type": "code",
|
| 38 |
+
"execution_count": null,
|
| 39 |
+
"id": "a287eef4",
|
| 40 |
+
"metadata": {},
|
| 41 |
+
"outputs": [],
|
| 42 |
+
"source": [
|
| 43 |
+
"import pandas as pd\n",
|
| 44 |
+
"import re\n",
|
| 45 |
+
"from pathlib import Path\n",
|
| 46 |
+
"import emoji\n",
|
| 47 |
+
"import spacy\n",
|
| 48 |
+
"\n",
|
| 49 |
+
"# Load spaCy model\n",
|
| 50 |
+
"# You may need to download it first: python -m spacy download en_core_web_sm\n",
|
| 51 |
+
"try:\n",
|
| 52 |
+
" nlp = spacy.load(\"en_core_web_sm\")\n",
|
| 53 |
+
" print(\"✅ spaCy model loaded: en_core_web_sm\")\n",
|
| 54 |
+
"except OSError:\n",
|
| 55 |
+
" print(\"❌ spaCy model not found. Downloading...\")\n",
|
| 56 |
+
"    import subprocess, sys\n",
|
| 57 |
+
"    subprocess.run([sys.executable, \"-m\", \"spacy\", \"download\", \"en_core_web_sm\"], check=True)\n",
|
| 58 |
+
" nlp = spacy.load(\"en_core_web_sm\")\n",
|
| 59 |
+
" print(\"✅ spaCy model downloaded and loaded\")\n",
|
| 60 |
+
"\n",
|
| 61 |
+
"# Set up paths\n",
|
| 62 |
+
"current_dir = Path.cwd()\n",
|
| 63 |
+
"#input_file = current_dir.parent / \"data/CSV/real_person_adapters.csv\"\n",
|
| 64 |
+
"input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter.csv\"\n",
|
| 65 |
+
"\n",
|
| 66 |
+
"# Load dataset\n",
|
| 67 |
+
"df = pd.read_csv(input_file)\n",
|
| 68 |
+
"print(f\"Loaded {len(df)} rows\")\n",
|
| 69 |
+
"\n",
|
| 70 |
+
"def translate_leetspeak(text: str) -> str:\n",
|
| 71 |
+
" \"\"\"\n",
|
| 72 |
+
" Translate common leetspeak patterns to normal letters.\n",
|
| 73 |
+
" Examples: 4kira -> Akira, 3mma -> Emma, 1rene -> Irene\n",
|
| 74 |
+
" \"\"\"\n",
|
| 75 |
+
" if not text:\n",
|
| 76 |
+
" return text\n",
|
| 77 |
+
" \n",
|
| 78 |
+
" # Common leetspeak mappings (order matters!)\n",
|
| 79 |
+
" leetspeak_map = {\n",
|
| 80 |
+
" '4': 'a',\n",
|
| 81 |
+
" '3': 'e', \n",
|
| 82 |
+
" '1': 'i',\n",
|
| 83 |
+
" '0': 'o',\n",
|
| 84 |
+
" '7': 't',\n",
|
| 85 |
+
" '5': 's',\n",
|
| 86 |
+
" '8': 'b',\n",
|
| 87 |
+
" '9': 'g',\n",
|
| 88 |
+
" '@': 'a',\n",
|
| 89 |
+
" '$': 's',\n",
|
| 90 |
+
" '!': 'i',\n",
|
| 91 |
+
" }\n",
|
| 92 |
+
" \n",
|
| 93 |
+
" result = text\n",
|
| 94 |
+
" # Apply mappings at word boundaries or start of string\n",
|
| 95 |
+
" for leet, normal in leetspeak_map.items():\n",
|
| 96 |
+
" # Replace at start of word\n",
|
| 97 |
+
" result = re.sub(rf'\\b{re.escape(leet)}', normal, result, flags=re.IGNORECASE)\n",
|
| 98 |
+
" # Replace standalone numbers that look like letters in context\n",
|
| 99 |
+
" result = re.sub(rf'(?<=[a-z]){re.escape(leet)}(?=[a-z])', normal, result, flags=re.IGNORECASE)\n",
|
| 100 |
+
" \n",
|
| 101 |
+
" return result\n",
|
| 102 |
+
"\n",
|
| 103 |
+
"def preprocess_for_ner(name: str) -> str:\n",
|
| 104 |
+
" \"\"\"\n",
|
| 105 |
+
" Preprocess the name before spaCy NER.\n",
|
| 106 |
+
" Remove noise but keep the actual name parts.\n",
|
| 107 |
+
" \"\"\"\n",
|
| 108 |
+
" if pd.isna(name):\n",
|
| 109 |
+
" return \"\"\n",
|
| 110 |
+
" \n",
|
| 111 |
+
" name = str(name)\n",
|
| 112 |
+
" \n",
|
| 113 |
+
" # FIRST: Translate leetspeak\n",
|
| 114 |
+
" name = translate_leetspeak(name)\n",
|
| 115 |
+
" \n",
|
| 116 |
+
" # Remove emoji\n",
|
| 117 |
+
" name = emoji.replace_emoji(name, replace=' ')\n",
|
| 118 |
+
" \n",
|
| 119 |
+
" # Remove version indicators (v1, v2, v1.0, etc.)\n",
|
| 120 |
+
" name = re.sub(r'\\s*[vV]\\d+(\\.\\d+)?\\s*', ' ', name)\n",
|
| 121 |
+
" \n",
|
| 122 |
+
" # Remove LoRA-related terms (case insensitive)\n",
|
| 123 |
+
" lora_terms = ['lora', 'loha', 'lycoris', 'controlnet', 'textual inversion', \n",
|
| 124 |
+
"                  'embedding', 'ti', 'checkpoint', 'model', 'adapter', 'pony', 'sdxl', 'flux', 'illustrious', 'sd14', 'sd2', 'sd3', 'diffusion', 'stable', 'hunyuan']\n",
|
| 125 |
+
" for term in lora_terms:\n",
|
| 126 |
+
" name = re.sub(rf'\\b{term}\\b', '', name, flags=re.IGNORECASE)\n",
|
| 127 |
+
" \n",
|
| 128 |
+
" # Remove content in parentheses or brackets (often metadata)\n",
|
| 129 |
+
" name = re.sub(r'\\([^)]*\\)', '', name)\n",
|
| 130 |
+
" name = re.sub(r'\\[[^\\]]*\\]', '', name)\n",
|
| 131 |
+
" \n",
|
| 132 |
+
" # Remove special characters like 「」\n",
|
| 133 |
+
" name = re.sub(r'[「」『』【】〈〉《》]', '', name)\n",
|
| 134 |
+
" \n",
|
| 135 |
+
" # Handle pipe - keep first part\n",
|
| 136 |
+
" if '|' in name:\n",
|
| 137 |
+
" name = name.split('|')[0]\n",
|
| 138 |
+
" \n",
|
| 139 |
+
" # Handle forward slash - keep first part\n",
|
| 140 |
+
" if '/' in name:\n",
|
| 141 |
+
" name = name.split('/')[0]\n",
|
| 142 |
+
" \n",
|
| 143 |
+
" # Replace underscores with spaces\n",
|
| 144 |
+
" name = name.replace('_', ' ')\n",
|
| 145 |
+
" \n",
|
| 146 |
+
" # Remove multiple spaces\n",
|
| 147 |
+
" name = re.sub(r'\\s+', ' ', name)\n",
|
| 148 |
+
" \n",
|
| 149 |
+
" # Strip\n",
|
| 150 |
+
" name = name.strip()\n",
|
| 151 |
+
" \n",
|
| 152 |
+
" return name\n",
|
| 153 |
+
"\n",
|
| 154 |
+
"def extract_person_name(text: str) -> str:\n",
|
| 155 |
+
" \"\"\"\n",
|
| 156 |
+
" Use spaCy NER to extract person names from text.\n",
|
| 157 |
+
" Falls back to cleaned text if no PERSON entity found.\n",
|
| 158 |
+
" \"\"\"\n",
|
| 159 |
+
" if not text:\n",
|
| 160 |
+
" return \"\"\n",
|
| 161 |
+
" \n",
|
| 162 |
+
" # Run spaCy NER\n",
|
| 163 |
+
" doc = nlp(text)\n",
|
| 164 |
+
" \n",
|
| 165 |
+
" # Extract PERSON entities\n",
|
| 166 |
+
" person_entities = [ent.text for ent in doc.ents if ent.label_ == \"PERSON\"]\n",
|
| 167 |
+
" \n",
|
| 168 |
+
" if person_entities:\n",
|
| 169 |
+
" # Return the first (usually longest/best) person name\n",
|
| 170 |
+
" return person_entities[0].strip()\n",
|
| 171 |
+
" \n",
|
| 172 |
+
" # If no PERSON entity found, try to extract capitalized words (likely names)\n",
|
| 173 |
+
" # This helps with names spaCy might miss\n",
|
| 174 |
+
" words = text.split()\n",
|
| 175 |
+
" capitalized_words = [w for w in words if w and w[0].isupper() and len(w) > 1]\n",
|
| 176 |
+
" \n",
|
| 177 |
+
" if capitalized_words:\n",
|
| 178 |
+
" # Join first few capitalized words (likely the name)\n",
|
| 179 |
+
" return ' '.join(capitalized_words[:3]).strip()\n",
|
| 180 |
+
" \n",
|
| 181 |
+
" # Last resort: return cleaned text\n",
|
| 182 |
+
" return text.strip()\n",
|
| 183 |
+
"\n",
|
| 184 |
+
"def clean_name_with_spacy(name: str) -> str:\n",
|
| 185 |
+
" \"\"\"\n",
|
| 186 |
+
" Complete name cleaning pipeline with spaCy NER.\n",
|
| 187 |
+
" \n",
|
| 188 |
+
" Pipeline:\n",
|
| 189 |
+
" 1. Translate leetspeak (4→a, 3→e, 1→i, etc.)\n",
|
| 190 |
+
" 2. Remove noise (emoji, version tags, LoRA terms)\n",
|
| 191 |
+
" 3. Use spaCy to extract PERSON entities\n",
|
| 192 |
+
" 4. Fallback to capitalized words or cleaned text\n",
|
| 193 |
+
" \"\"\"\n",
|
| 194 |
+
" # Step 1 & 2: Preprocess (leetspeak + noise removal)\n",
|
| 195 |
+
" preprocessed = preprocess_for_ner(name)\n",
|
| 196 |
+
" \n",
|
| 197 |
+
" if not preprocessed:\n",
|
| 198 |
+
" return \"\"\n",
|
| 199 |
+
" \n",
|
| 200 |
+
" # Step 3: Extract person name using spaCy NER\n",
|
| 201 |
+
" person_name = extract_person_name(preprocessed)\n",
|
| 202 |
+
" \n",
|
| 203 |
+
" return person_name\n",
|
| 204 |
+
"\n",
|
| 205 |
+
"# Apply name cleaning with spaCy\n",
|
| 206 |
+
"print(\"\\n🔄 Processing names with spaCy NER...\")\n",
|
| 207 |
+
"df['real_name'] = df['name'].apply(clean_name_with_spacy)\n",
|
| 208 |
+
"\n",
|
| 209 |
+
"# Show examples with detailed comparison\n",
|
| 210 |
+
"print(\"\\n📊 Name cleaning examples (with spaCy NER):\")\n",
|
| 211 |
+
"print(\"=\" * 100)\n",
|
| 212 |
+
"print(f\"{'Original Name':<50} | {'Cleaned Name':<30}\")\n",
|
| 213 |
+
"print(\"=\" * 100)\n",
|
| 214 |
+
"\n",
|
| 215 |
+
"examples = df[['name', 'real_name']].head(30)\n",
|
| 216 |
+
"shown = 0\n",
|
| 217 |
+
"for idx, row in examples.iterrows():\n",
|
| 218 |
+
" if row['name'] != row['real_name'] and shown < 20:\n",
|
| 219 |
+
" print(f\"{row['name']:<50} | {row['real_name']:<30}\")\n",
|
| 220 |
+
" shown += 1\n",
|
| 221 |
+
"\n",
|
| 222 |
+
"print(\"=\" * 100)\n",
|
| 223 |
+
"\n",
|
| 224 |
+
"# Show specific test cases\n",
|
| 225 |
+
"print(\"\\n🧪 Leetspeak translation examples:\")\n",
|
| 226 |
+
"test_names = ['4kira LoRA', '3mma Watson v2', '1rene LORA', 'L3vi Ackerman']\n",
|
| 227 |
+
"for test in test_names:\n",
|
| 228 |
+
" result = clean_name_with_spacy(test)\n",
|
| 229 |
+
" print(f\" {test:<30} -> {result}\")\n",
|
| 230 |
+
"\n",
|
| 231 |
+
"# Statistics\n",
|
| 232 |
+
"print(f\"\\n📈 Statistics:\")\n",
|
| 233 |
+
"print(f\" Total rows: {len(df)}\")\n",
|
| 234 |
+
"print(f\" Non-empty names: {(df['real_name'] != '').sum()}\")\n",
|
| 235 |
+
"print(f\" Empty names: {(df['real_name'] == '').sum()}\")\n",
|
| 236 |
+
"\n",
|
| 237 |
+
"# Show some examples of what spaCy identified\n",
|
| 238 |
+
"print(\"\\n🎯 Sample spaCy NER results:\")\n",
|
| 239 |
+
"sample_names = df['real_name'].head(20).tolist()\n",
|
| 240 |
+
"for i, name in enumerate(sample_names[:10], 1):\n",
|
| 241 |
+
" if name:\n",
|
| 242 |
+
" print(f\" {i}. {name}\")\n",
|
| 243 |
+
"\n",
|
| 244 |
+
"print(f\"\\n✅ Cleaned {len(df)} names using spaCy NER\")\n",
|
| 245 |
+
"\n",
|
| 246 |
+
"# Save intermediate result\n",
|
| 247 |
+
"output_step1 = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_01_NER.csv\"\n",
|
| 248 |
+
"df.to_csv(output_step1, index=False)\n",
|
| 249 |
+
"print(f\"💾 Saved to {output_step1}\")\n"
|
| 250 |
+
]
|
| 251 |
+
},
|
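Review note: the docstring of `translate_leetspeak` gives `4kira -> Akira`, but the mapping values are lowercase, and a digit that follows punctuation (the `0` in `v1.0`) also sits at a word boundary. The following is a minimal sketch of a more conservative variant, not the notebook's implementation. It assumes that only digits directly attached to letters should be translated and that a leading digit stands for a capital letter. `translate_leetspeak_cased` and `LEET_DIGITS` are hypothetical names, and the symbol mappings (`@`, `$`, `!`) are left out for brevity.

```python
import re

# Digit-to-letter mapping taken from the notebook's leetspeak_map (digits only).
LEET_DIGITS = {"4": "a", "3": "e", "1": "i", "0": "o",
               "7": "t", "5": "s", "8": "b", "9": "g"}

# Group 1: a digit that starts a token and is followed by a letter ("4kira").
# Group 2: a digit enclosed by letters on both sides ("L3vi").
_LEET_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])([0-9])(?=[A-Za-z])|(?<=[A-Za-z])([0-9])(?=[A-Za-z])"
)


def _replace(match):
    digit = match.group(1) or match.group(2)
    letter = LEET_DIGITS.get(digit)
    if letter is None:
        return digit  # digits without a mapping (e.g. 2, 6) stay unchanged
    # A token-initial digit becomes a capital letter, an inner one lowercase.
    return letter.upper() if match.group(1) else letter


def translate_leetspeak_cased(text: str) -> str:
    """Translate leetspeak digits that are attached to letters; keep other digits."""
    if not text:
        return text
    return _LEET_PATTERN.sub(_replace, text)


if __name__ == "__main__":
    for name in ["4kira LoRA", "3mma Watson v2", "L3vi Ackerman", "Emma Watson v1.0"]:
        print(f"{name:<20} -> {translate_leetspeak_cased(name)}")
```

Because version tags and years are left untouched here, the later version-stripping regex in `preprocess_for_ner` would still see `v1.0` in its original form.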
| 252 |
+
{
|
| 253 |
+
"cell_type": "markdown",
|
| 254 |
+
"id": "64687c72",
|
| 255 |
+
"metadata": {},
|
| 256 |
+
"source": [
|
| 257 |
+
"#### STEP 02: Nationality tag to Country hint\n",
|
| 258 |
+
"Here, nationality-related tags are converted to the corresponding country, and profession tags are mapped to a canonical profession."
|
| 259 |
+
]
|
| 260 |
+
},
|
| 261 |
+
{
|
| 262 |
+
"cell_type": "code",
|
| 263 |
+
"execution_count": null,
|
| 264 |
+
"id": "d6eaef5b",
|
| 265 |
+
"metadata": {},
|
| 266 |
+
"outputs": [],
|
| 267 |
+
"source": [
|
| 268 |
+
"import pandas as pd\n",
|
| 269 |
+
"from pathlib import Path\n",
|
| 270 |
+
"\n",
|
| 271 |
+
"# Set up paths\n",
|
| 272 |
+
"current_dir = Path.cwd()\n",
|
| 273 |
+
"countries_file = current_dir.parent / \"misc/lists/countries.csv\"\n",
|
| 274 |
+
"professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
|
| 275 |
+
"input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_01_NER.csv\"\n",
|
| 276 |
+
"output_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
|
| 277 |
+
"\n",
|
| 278 |
+
"# Load datasets\n",
|
| 279 |
+
"poi_df = pd.read_csv(input_file)\n",
|
| 280 |
+
"countries_df = pd.read_csv(countries_file)\n",
|
| 281 |
+
"professions_df = pd.read_csv(professions_file)\n",
|
| 282 |
+
"\n",
|
| 283 |
+
"# Define uninhabited or non-relevant territories to exclude\n",
|
| 284 |
+
"excluded_territories = {\n",
|
| 285 |
+
" 'isle of man', 'bouvet island', 'heard island and mcdonald islands',\n",
|
| 286 |
+
" 'french southern territories', 'south georgia and the south sandwich islands',\n",
|
| 287 |
+
" 'svalbard and jan mayen', 'british indian ocean territory', 'antarctica',\n",
|
| 288 |
+
" 'christmas island', 'cocos (keeling) islands', 'norfolk island',\n",
|
| 289 |
+
" 'pitcairn', 'tokelau', 'united states minor outlying islands',\n",
|
| 290 |
+
" 'wallis and futuna', 'western sahara'\n",
|
| 291 |
+
"}\n",
|
| 292 |
+
"\n",
|
| 293 |
+
"# Step 1: Combine tags into one lowercase list\n",
|
| 294 |
+
"def combine_tags(row):\n",
|
| 295 |
+
" return [str(row[f\"tag_{i}\"]).strip().lower() for i in range(1, 8) if pd.notna(row.get(f\"tag_{i}\"))]\n",
|
| 296 |
+
"\n",
|
| 297 |
+
"poi_df[\"tags\"] = poi_df.apply(combine_tags, axis=1)\n",
|
| 298 |
+
"\n",
|
| 299 |
+
"# Step 2: Build tag → (country, nationality) mapping with PRIORITIES\n",
|
| 300 |
+
"tag_to_country_nationality = {}\n",
|
| 301 |
+
"# We'll use a priority score: direct country name = 3, nationality = 2, word parts = 1\n",
|
| 302 |
+
"\n",
|
| 303 |
+
"for _, row in countries_df.iterrows():\n",
|
| 304 |
+
" country = str(row[\"en_short_name\"]).strip()\n",
|
| 305 |
+
" nationality = str(row[\"nationality\"]).strip()\n",
|
| 306 |
+
" \n",
|
| 307 |
+
" # Skip excluded territories\n",
|
| 308 |
+
" if country.lower() in excluded_territories:\n",
|
| 309 |
+
" continue\n",
|
| 310 |
+
"\n",
|
| 311 |
+
" country_lc = country.lower()\n",
|
| 312 |
+
" nationality_lc = nationality.lower()\n",
|
| 313 |
+
"\n",
|
| 314 |
+
" # Store as (country, nationality, priority)\n",
|
| 315 |
+
" # Exact country name match = highest priority\n",
|
| 316 |
+
" if country_lc not in tag_to_country_nationality:\n",
|
| 317 |
+
"        tag_to_country_nationality[country_lc] = (country, nationality, 3)\n",
|
| 318 |
+
" \n",
|
| 319 |
+
" # Exact nationality match = medium priority \n",
|
| 320 |
+
" if nationality_lc not in tag_to_country_nationality:\n",
|
| 321 |
+
"        tag_to_country_nationality[nationality_lc] = (country, nationality, 2)\n",
|
| 322 |
+
" \n",
|
| 323 |
+
" # No-space versions\n",
|
| 324 |
+
" country_no_space = country_lc.replace(\" \", \"\")\n",
|
| 325 |
+
" nationality_no_space = nationality_lc.replace(\" \", \"\")\n",
|
| 326 |
+
" \n",
|
| 327 |
+
" if country_no_space not in tag_to_country_nationality:\n",
|
| 328 |
+
"        tag_to_country_nationality[country_no_space] = (country, nationality, 3)\n",
|
| 329 |
+
" if nationality_no_space not in tag_to_country_nationality:\n",
|
| 330 |
+
"        tag_to_country_nationality[nationality_no_space] = (country, nationality, 2)\n",
|
| 331 |
+
"\n",
|
| 332 |
+
" # Word parts = lowest priority (only for longer words to avoid false matches)\n",
|
| 333 |
+
" for part in country_lc.split():\n",
|
| 334 |
+
" if len(part) > 4: # Only words longer than 4 chars\n",
|
| 335 |
+
" if part not in tag_to_country_nationality:\n",
|
| 336 |
+
"                tag_to_country_nationality[part] = (country, nationality, 1)\n",
|
| 337 |
+
" for part in nationality_lc.split():\n",
|
| 338 |
+
" if len(part) > 4:\n",
|
| 339 |
+
" if part not in tag_to_country_nationality:\n",
|
| 340 |
+
"                tag_to_country_nationality[part] = (country, nationality, 1)\n",
|
| 341 |
+
"\n",
|
| 342 |
+
"print(f\"Built country/nationality mapping with {len(tag_to_country_nationality)} entries\")\n",
|
| 343 |
+
"\n",
|
| 344 |
+
"# Step 3: Infer likely_country and likely_nationality by checking ALL tags\n",
|
| 345 |
+
"def infer_country_and_nationality(tags):\n",
|
| 346 |
+
" \"\"\"\n",
|
| 347 |
+
" Check ALL tags and return the best match based on priority.\n",
|
| 348 |
+
" Priority: exact country name > nationality > word parts\n",
|
| 349 |
+
" \"\"\"\n",
|
| 350 |
+
" best_match = None\n",
|
| 351 |
+
" best_priority = 0\n",
|
| 352 |
+
" \n",
|
| 353 |
+
" for tag in tags:\n",
|
| 354 |
+
" # Try cleaned version (no spaces)\n",
|
| 355 |
+
" cleaned = tag.replace(\" \", \"\").lower()\n",
|
| 356 |
+
" \n",
|
| 357 |
+
" # Check cleaned version\n",
|
| 358 |
+
" if cleaned in tag_to_country_nationality:\n",
|
| 359 |
+
" country, nationality, priority = tag_to_country_nationality[cleaned]\n",
|
| 360 |
+
" if priority > best_priority and country and country.lower() not in excluded_territories:\n",
|
| 361 |
+
" best_match = (country, nationality)\n",
|
| 362 |
+
" best_priority = priority\n",
|
| 363 |
+
" \n",
|
| 364 |
+
" # Also check original tag\n",
|
| 365 |
+
" if tag in tag_to_country_nationality:\n",
|
| 366 |
+
" country, nationality, priority = tag_to_country_nationality[tag]\n",
|
| 367 |
+
" if priority > best_priority and country and country.lower() not in excluded_territories:\n",
|
| 368 |
+
" best_match = (country, nationality)\n",
|
| 369 |
+
" best_priority = priority\n",
|
| 370 |
+
" \n",
|
| 371 |
+
" if best_match:\n",
|
| 372 |
+
" return pd.Series(best_match)\n",
|
| 373 |
+
" return pd.Series([\"\", \"\"])\n",
|
| 374 |
+
"\n",
|
| 375 |
+
"poi_df[[\"likely_country\", \"likely_nationality\"]] = poi_df[\"tags\"].apply(infer_country_and_nationality)\n",
|
| 376 |
+
"\n",
|
| 377 |
+
"# Step 4: Build tag → profession mapping\n",
|
| 378 |
+
"profession_alias_map = {}\n",
|
| 379 |
+
"\n",
|
| 380 |
+
"for _, row in professions_df.iterrows():\n",
|
| 381 |
+
" canonical = str(row['profession']).strip().lower()\n",
|
| 382 |
+
" profession_alias_map[canonical] = canonical\n",
|
| 383 |
+
" for alias_col in ['alias_1', 'alias_2', 'alias_3']:\n",
|
| 384 |
+
" alias = row.get(alias_col)\n",
|
| 385 |
+
" if pd.notna(alias):\n",
|
| 386 |
+
" profession_alias_map[str(alias).strip().lower()] = canonical\n",
|
| 387 |
+
"\n",
|
| 388 |
+
"# Step 5: Infer likely profession from tags\n",
|
| 389 |
+
"def infer_profession_from_tags(tags):\n",
|
| 390 |
+
" matched = []\n",
|
| 391 |
+
" for tag in tags:\n",
|
| 392 |
+
" cleaned = tag.strip().lower()\n",
|
| 393 |
+
" if cleaned in profession_alias_map:\n",
|
| 394 |
+
" matched.append(profession_alias_map[cleaned])\n",
|
| 395 |
+
"\n",
|
| 396 |
+
" if not matched:\n",
|
| 397 |
+
" return \"\"\n",
|
| 398 |
+
" if \"celebrity\" in matched and len(set(matched)) > 1:\n",
|
| 399 |
+
" # Drop 'celebrity' if other professions are present\n",
|
| 400 |
+
" matched = [m for m in matched if m != \"celebrity\"]\n",
|
| 401 |
+
"\n",
|
| 402 |
+
" return matched[0] # Return the first specific match\n",
|
| 403 |
+
"\n",
|
| 404 |
+
"\n",
|
| 405 |
+
"poi_df[\"likely_profession\"] = poi_df[\"tags\"].apply(infer_profession_from_tags)\n",
|
| 406 |
+
"\n",
|
| 407 |
+
"# Step 6: Save enriched dataset\n",
|
| 408 |
+
"poi_df.to_csv(output_file, index=False)\n",
|
| 409 |
+
"\n",
|
| 410 |
+
"# Preview results\n",
|
| 411 |
+
"print(f\"\\nProcessed {len(poi_df)} rows\")\n",
|
| 412 |
+
"print(f\"Rows with country: {(poi_df['likely_country'] != '').sum()}\")\n",
|
| 413 |
+
"print(f\"Rows with nationality: {(poi_df['likely_nationality'] != '').sum()}\")\n",
|
| 414 |
+
"print(f\"Rows with profession: {(poi_df['likely_profession'] != '').sum()}\")\n",
|
| 415 |
+
"\n",
|
| 416 |
+
"print(f\"\\nTop 10 countries:\")\n",
|
| 417 |
+
"print(poi_df[poi_df['likely_country'] != '']['likely_country'].value_counts().head(10))\n"
|
| 418 |
+
]
|
| 419 |
+
},
|
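Review note: the cell above resolves tags to a country by priority (exact country name over nationality over word parts). The sketch below shows that lookup on a tiny hand-written table, under two assumptions that differ from the notebook: every entry stores both the country and the nationality, and a higher-priority entry may overwrite a lower-priority one instead of the first entry always winning. `build_tag_map`, `infer_country` and the two-row table are hypothetical and stand in for `countries.csv`; the word-part fallback is omitted.

```python
def build_tag_map(countries):
    """Map lowercase tags to (country, nationality, priority) tuples."""
    tag_map = {}

    def register(key, country, nationality, priority):
        current = tag_map.get(key)
        # Keep the entry with the highest priority for each key.
        if key and (current is None or priority > current[2]):
            tag_map[key] = (country, nationality, priority)

    for country, nationality in countries:
        country_lc, nationality_lc = country.lower(), nationality.lower()
        register(country_lc, country, nationality, 3)
        register(country_lc.replace(" ", ""), country, nationality, 3)
        register(nationality_lc, country, nationality, 2)
        register(nationality_lc.replace(" ", ""), country, nationality, 2)
    return tag_map


def infer_country(tags, tag_map):
    """Return (country, nationality) for the best-priority tag, or two empty strings."""
    best = ("", "", 0)
    for tag in tags:
        # Try the tag as written and with spaces removed.
        for key in (tag.strip().lower(), tag.replace(" ", "").lower()):
            hit = tag_map.get(key)
            if hit and hit[2] > best[2]:
                best = hit
    return best[0], best[1]


if __name__ == "__main__":
    demo_map = build_tag_map([("Japan", "Japanese"), ("South Korea", "South Korean")])
    print(infer_country(["actress", "japanese"], demo_map))
    print(infer_country(["southkorea", "japanese"], demo_map))
```

Storing the country alongside each nationality key is what lets a tag such as `japanese` resolve to a country at all; an entry with an empty country string cannot pass a check that requires a non-empty country.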
| 420 |
+
{
|
| 421 |
+
"cell_type": "markdown",
|
| 422 |
+
"id": "4a4a58b3",
|
| 423 |
+
"metadata": {},
|
| 424 |
+
"source": [
|
| 425 |
+
"## LLM ANNOTATION"
|
| 426 |
+
]
|
| 427 |
+
},
|
| 428 |
+
{
|
| 429 |
+
"cell_type": "markdown",
|
| 430 |
+
"id": "b298844d",
|
| 431 |
+
"metadata": {},
|
| 432 |
+
"source": [
|
| 433 |
+
"#### Model Configurations"
|
| 434 |
+
]
|
| 435 |
+
},
|
| 436 |
+
{
|
| 437 |
+
"cell_type": "code",
|
| 438 |
+
"execution_count": null,
|
| 439 |
+
"id": "39f3d65e",
|
| 440 |
+
"metadata": {},
|
| 441 |
+
"outputs": [],
|
| 442 |
+
"source": [
|
| 443 |
+
"import pandas as pd\n",
|
| 444 |
+
"import json\n",
|
| 445 |
+
"import time\n",
|
| 446 |
+
"import re\n",
|
| 447 |
+
"from pathlib import Path\n",
|
| 448 |
+
"from tqdm import tqdm\n",
|
| 449 |
+
"import torch\n",
|
| 450 |
+
"from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
|
| 451 |
+
"import signal\n",
|
| 452 |
+
"from contextlib import contextmanager\n",
|
| 453 |
+
"\n",
|
| 454 |
+
"# Configuration\n",
|
| 455 |
+
"current_dir = Path.cwd()\n",
|
| 456 |
+
"CACHE_DIR = current_dir.parent / \"data/models\"\n",
|
| 457 |
+
"CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
|
| 458 |
+
"\n",
|
| 459 |
+
"# Model configurations\n",
|
| 460 |
+
"MODEL_CONFIGS = {\n",
|
| 461 |
+
" 'mistral': {\n",
|
| 462 |
+
" 'name': 'mistralai/Mistral-7B-Instruct-v0.3',\n",
|
| 463 |
+
" 'dtype': torch.bfloat16,\n",
|
| 464 |
+
" 'quantization': None,\n",
|
| 465 |
+
" 'generation_params': {\n",
|
| 466 |
+
" 'max_new_tokens': 512,\n",
|
| 467 |
+
" 'temperature': 0.05,\n",
|
| 468 |
+
" 'do_sample': True,\n",
|
| 469 |
+
" 'top_p': 1.0,\n",
|
| 470 |
+
" }\n",
|
| 471 |
+
" },\n",
|
| 472 |
+
" 'gemma': {\n",
|
| 473 |
+
" 'name': 'google/gemma-3-27b-it',\n",
|
| 474 |
+
" 'dtype': torch.bfloat16,\n",
|
| 475 |
+
" 'quantization': None,\n",
|
| 476 |
+
" 'generation_params': {\n",
|
| 477 |
+
" 'max_new_tokens': 512,\n",
|
| 478 |
+
" 'temperature': 0.1,\n",
|
| 479 |
+
" 'do_sample': True,\n",
|
| 480 |
+
" 'top_p': 1.0,\n",
|
| 481 |
+
" }\n",
|
| 482 |
+
" },\n",
|
| 483 |
+
" 'qwen': {\n",
|
| 484 |
+
" 'name': 'Qwen/Qwen2.5-32B-Instruct',\n",
|
| 485 |
+
" 'dtype': None, # Will use quantization\n",
|
| 486 |
+
" 'quantization': BitsAndBytesConfig(\n",
|
| 487 |
+
" load_in_8bit=True,\n",
|
| 488 |
+
" llm_int8_threshold=6.0,\n",
|
| 489 |
+
" llm_int8_has_fp16_weight=False\n",
|
| 490 |
+
" ),\n",
|
| 491 |
+
" 'generation_params': {\n",
|
| 492 |
+
" 'max_new_tokens': 512,\n",
|
| 493 |
+
" 'temperature': 0.1, # ignored while do_sample is False (greedy decoding)\n",
|
| 494 |
+
" 'do_sample': False,\n",
|
| 495 |
+
" }\n",
|
| 496 |
+
" }\n",
|
| 497 |
+
"}\n",
|
| 498 |
+
"\n",
|
| 499 |
+
"PROFESSION_CATEGORIES = [\n",
|
| 500 |
+
" \"actor\",\n",
|
| 501 |
+
" \"adult performer\",\n",
|
| 502 |
+
" \"singer/musician\",\n",
|
| 503 |
+
" \"model\",\n",
|
| 504 |
+
" \"online personality\",\n",
|
| 505 |
+
" \"public figure\",\n",
|
| 506 |
+
" \"voice actor/ASMR\",\n",
|
| 507 |
+
" \"sports professional\",\n",
|
| 508 |
+
" \"tv personality\"\n",
|
| 509 |
+
"]\n"
|
| 510 |
+
]
|
| 511 |
+
},
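The `qwen` entry above sets `temperature` together with `do_sample: False`. With sampling disabled, decoding is greedy, so sampling-only settings should have no effect, and recent `transformers` versions tend to warn about the combination. A minimal sketch of filtering such keys before calling `generate` follows; the helper name `effective_generation_params` and the set of sampling-only keys are assumptions for illustration, not part of the notebook.

```python
def effective_generation_params(params):
    """Return a copy of params, dropping sampling-only keys when do_sample is False."""
    # Assumed list of keys that only matter when sampling is enabled.
    sampling_only = {"temperature", "top_p", "top_k"}
    if params.get("do_sample", False):
        return dict(params)
    return {k: v for k, v in params.items() if k not in sampling_only}


# Same values as the 'qwen' generation_params in MODEL_CONFIGS.
qwen_params = {"max_new_tokens": 512, "temperature": 0.1, "do_sample": False}
print(effective_generation_params(qwen_params))
```

The sampling configurations for `mistral` and `gemma` would pass through unchanged, since they set `do_sample: True`.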
|
| 512 |
+
{
|
| 513 |
+
"cell_type": "markdown",
|
| 514 |
+
"id": "c215b38c",
|
| 515 |
+
"metadata": {},
|
| 516 |
+
"source": [
|
| 517 |
+
"#### Load Model Function"
|
| 518 |
+
]
|
| 519 |
+
},
|
| 520 |
+
{
|
| 521 |
+
"cell_type": "code",
|
| 522 |
+
"execution_count": null,
|
| 523 |
+
"id": "cfb5b13e",
|
| 524 |
+
"metadata": {},
|
| 525 |
+
"outputs": [],
|
| 526 |
+
"source": [
|
| 527 |
+
"def load_model(model_type='mistral'):\n",
|
| 528 |
+
" \"\"\"\n",
|
| 529 |
+
" Load model and tokenizer based on type.\n",
|
| 530 |
+
" \n",
|
| 531 |
+
" Args:\n",
|
| 532 |
+
" model_type: 'mistral', 'gemma', or 'qwen'\n",
|
| 533 |
+
" \n",
|
| 534 |
+
" Returns:\n",
|
| 535 |
+
" tuple: (model, tokenizer, config)\n",
|
| 536 |
+
" \"\"\"\n",
|
| 537 |
+
" if model_type not in MODEL_CONFIGS:\n",
|
| 538 |
+
" raise ValueError(f\"Unknown model type: {model_type}. Choose from {list(MODEL_CONFIGS.keys())}\")\n",
|
| 539 |
+
" \n",
|
| 540 |
+
" config = MODEL_CONFIGS[model_type]\n",
|
| 541 |
+
" model_name = config['name']\n",
|
| 542 |
+
" \n",
|
| 543 |
+
" device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
|
| 544 |
+
" print(f\"Loading model: {model_name}\")\n",
|
| 545 |
+
" print(f\"Cache directory: {CACHE_DIR}\")\n",
|
| 546 |
+
" print(f\"Device: {device}\\n\")\n",
|
| 547 |
+
" \n",
|
| 548 |
+
" if device == \"cpu\":\n",
|
| 549 |
+
" print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
|
| 550 |
+
" \n",
|
| 551 |
+
" # Load tokenizer\n",
|
| 552 |
+
" try:\n",
|
| 553 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 554 |
+
" model_name,\n",
|
| 555 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 556 |
+
" use_fast=True\n",
|
| 557 |
+
" )\n",
|
| 558 |
+
" except Exception:\n",
|
| 559 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 560 |
+
" model_name,\n",
|
| 561 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 562 |
+
" use_fast=False\n",
|
| 563 |
+
" )\n",
|
| 564 |
+
" \n",
|
| 565 |
+
" if tokenizer.pad_token is None:\n",
|
| 566 |
+
" tokenizer.pad_token = tokenizer.eos_token\n",
|
| 567 |
+
" \n",
|
| 568 |
+
" # Load model\n",
|
| 569 |
+
" model_kwargs = {\n",
|
| 570 |
+
" 'cache_dir': str(CACHE_DIR),\n",
|
| 571 |
+
" 'device_map': 'auto',\n",
|
| 572 |
+
" 'trust_remote_code': False\n",
|
| 573 |
+
" }\n",
|
| 574 |
+
" \n",
|
| 575 |
+
" if config['quantization']:\n",
|
| 576 |
+
" model_kwargs['quantization_config'] = config['quantization']\n",
|
| 577 |
+
" else:\n",
|
| 578 |
+
" model_kwargs['torch_dtype'] = config['dtype']\n",
|
| 579 |
+
" \n",
|
| 580 |
+
" model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)\n",
|
| 581 |
+
" model.eval()\n",
|
| 582 |
+
" \n",
|
| 583 |
+
" # Check VRAM\n",
|
| 584 |
+
" if torch.cuda.is_available():\n",
|
| 585 |
+
" vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
|
| 586 |
+
" print(f\"Peak VRAM allocated: {vram_gb:.2f} GB\\n\")\n",
|
| 587 |
+
" \n",
|
| 588 |
+
" return model, tokenizer, config\n"
|
| 589 |
+
]
|
| 590 |
+
},
|
| 591 |
+
{
|
| 592 |
+
"cell_type": "markdown",
|
| 593 |
+
"id": "11b2221a",
|
| 594 |
+
"metadata": {},
|
| 595 |
+
"source": [
|
| 596 |
+
"#### Inference Code"
|
| 597 |
+
]
|
| 598 |
+
},
|
| 599 |
+
{
|
| 600 |
+
"cell_type": "code",
|
| 601 |
+
"execution_count": null,
|
| 602 |
+
"id": "229f96bd",
|
| 603 |
+
"metadata": {},
|
| 604 |
+
"outputs": [],
|
| 605 |
+
"source": [
|
| 606 |
+
"@contextmanager\n",
|
| 607 |
+
"def timeout(duration):\n",
|
| 608 |
+
" \"\"\"Raise TimeoutError if the block runs longer than `duration` seconds (uses SIGALRM: Unix, main thread only).\"\"\"\n",
|
| 609 |
+
" def handler(signum, frame):\n",
|
| 610 |
+
" raise TimeoutError(\"Operation timed out\")\n",
|
| 611 |
+
" \n",
|
| 612 |
+
" signal.signal(signal.SIGALRM, handler)\n",
|
| 613 |
+
" signal.alarm(duration)\n",
|
| 614 |
+
" try:\n",
|
| 615 |
+
" yield\n",
|
| 616 |
+
" finally:\n",
|
| 617 |
+
" signal.alarm(0)\n",
|
| 618 |
+
"\n",
|
| 619 |
+
"def query_model(prompt, model, tokenizer, config, use_timeout=False):\n",
|
| 620 |
+
" \"\"\"\n",
|
| 621 |
+
" Query model with given prompt.\n",
|
| 622 |
+
" \n",
|
| 623 |
+
" Args:\n",
|
| 624 |
+
" prompt: Input prompt string\n",
|
| 625 |
+
" model: Loaded model\n",
|
| 626 |
+
" tokenizer: Loaded tokenizer\n",
|
| 627 |
+
" config: Model configuration dict\n",
|
| 628 |
+
" use_timeout: Whether to use 60s timeout (for Qwen)\n",
|
| 629 |
+
" \n",
|
| 630 |
+
" Returns:\n",
|
| 631 |
+
" str: Model response or None on error\n",
|
| 632 |
+
" \"\"\"\n",
|
| 633 |
+
" try:\n",
|
| 634 |
+
" device = next(model.parameters()).device\n",
|
| 635 |
+
" \n",
|
| 636 |
+
" # Format as chat message\n",
|
| 637 |
+
" messages = [\n",
|
| 638 |
+
" {\"role\": \"system\", \"content\": \"You are a data extraction assistant. Respond with exactly 5 numbered lines containing ONLY values. No labels, no explanations, no prefixes. Follow the format precisely.\"},\n",
|
| 639 |
+
" {\"role\": \"user\", \"content\": prompt}\n",
|
| 640 |
+
" ]\n",
|
| 641 |
+
" \n",
|
| 642 |
+
" # Tokenize\n",
|
| 643 |
+
" if getattr(tokenizer, 'chat_template', None) is not None:\n",
|
| 644 |
+
" text = tokenizer.apply_chat_template(\n",
|
| 645 |
+
" messages,\n",
|
| 646 |
+
" tokenize=False,\n",
|
| 647 |
+
" add_generation_prompt=True\n",
|
| 648 |
+
" )\n",
|
| 649 |
+
" else:\n",
|
| 650 |
+
" text = f\"[INST] {prompt} [/INST]\"\n",
|
| 651 |
+
" \n",
|
| 652 |
+
" inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
|
| 653 |
+
" \n",
|
| 654 |
+
" # Generation parameters\n",
|
| 655 |
+
" gen_kwargs = config['generation_params'].copy()\n",
|
| 656 |
+
" gen_kwargs['pad_token_id'] = tokenizer.eos_token_id\n",
|
| 657 |
+
" \n",
|
| 658 |
+
" # Generate\n",
|
| 659 |
+
" generation_fn = lambda: model.generate(**inputs, **gen_kwargs)\n",
|
| 660 |
+
" \n",
|
| 661 |
+
" if use_timeout:\n",
|
| 662 |
+
" with timeout(60):\n",
|
| 663 |
+
" with torch.no_grad():\n",
|
| 664 |
+
" outputs = generation_fn()\n",
|
| 665 |
+
" else:\n",
|
| 666 |
+
" with torch.no_grad():\n",
|
| 667 |
+
" outputs = generation_fn()\n",
|
| 668 |
+
" \n",
|
| 669 |
+
" # Decode\n",
|
| 670 |
+
" generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
|
| 671 |
+
" response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
|
| 672 |
+
" \n",
|
| 673 |
+
" return response.strip()\n",
|
| 674 |
+
" \n",
|
| 675 |
+
" except TimeoutError:\n",
|
| 676 |
+
" print(f\"[ERROR] Generation timed out after 60 seconds\")\n",
|
| 677 |
+
" return None\n",
|
| 678 |
+
" except Exception as e:\n",
|
| 679 |
+
" print(f\"[ERROR] Generation failed: {e}\")\n",
|
| 680 |
+
" return None\n"
|
| 681 |
+
]
|
| 682 |
+
},
|
| 683 |
+
{
|
| 684 |
+
"cell_type": "markdown",
|
| 685 |
+
"id": "88f005f8",
|
| 686 |
+
"metadata": {},
|
| 687 |
+
"source": [
|
| 688 |
+
"#### Prompt creation"
|
| 689 |
+
]
|
| 690 |
+
},
|
| 691 |
+
{
|
| 692 |
+
"cell_type": "code",
|
| 693 |
+
"execution_count": null,
|
| 694 |
+
"id": "dfe05463",
|
| 695 |
+
"metadata": {},
|
| 696 |
+
"outputs": [],
|
| 697 |
+
"source": [
|
| 698 |
+
"def create_prompt(row):\n",
|
| 699 |
+
" \"\"\"Create annotation prompt from row data.\"\"\"\n",
|
| 700 |
+
" name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
|
| 701 |
+
" \n",
|
| 702 |
+
" # Gather hints\n",
|
| 703 |
+
" hints = []\n",
|
| 704 |
+
" if pd.notna(row.get('likely_profession')):\n",
|
| 705 |
+
" hints.append(str(row['likely_profession']))\n",
|
| 706 |
+
" if pd.notna(row.get('likely_nationality')):\n",
|
| 707 |
+
" hints.append(str(row['likely_nationality']))\n",
|
| 708 |
+
" if pd.notna(row.get('likely_country')):\n",
|
| 709 |
+
" hints.append(str(row['likely_country']))\n",
|
| 710 |
+
" \n",
|
| 711 |
+
" # Add tags if needed\n",
|
| 712 |
+
" if len(hints) < 3:\n",
|
| 713 |
+
" for i in range(1, 8):\n",
|
| 714 |
+
" tag_col = f'tag_{i}'\n",
|
| 715 |
+
" if tag_col in row and pd.notna(row[tag_col]):\n",
|
| 716 |
+
" tag_val = str(row[tag_col])\n",
|
| 717 |
+
" if tag_val not in hints:\n",
|
| 718 |
+
" hints.append(tag_val)\n",
|
| 719 |
+
" if len(hints) >= 5:\n",
|
| 720 |
+
" break\n",
|
| 721 |
+
" \n",
|
| 722 |
+
" hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
|
| 723 |
+
" \n",
|
| 724 |
+
" return f\"\"\"Extract information about '{name}' ({hint_text}).\n",
|
| 725 |
+
"\n",
|
| 726 |
+
"Respond with EXACTLY 5 numbered lines. Each line must contain ONLY the value, no labels or extra text.\n",
|
| 727 |
+
"\n",
|
| 728 |
+
"FORMAT REQUIREMENTS:\n",
|
| 729 |
+
"1. Full legal name in Western order (first last). VALUE ONLY.\n",
|
| 730 |
+
"2. Stage names/aliases, comma-separated. If none, write \"None\". VALUE ONLY.\n",
|
| 731 |
+
"3. Gender: MUST be exactly one word: Male, Female, Other, or Unknown. VALUE ONLY.\n",
|
| 732 |
+
"4. Professions: Choose up to 3 from this list ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality. Comma-separated. VALUE ONLY.\n",
|
| 733 |
+
"5. Primary country: Country name only (e.g., \"China\", \"United States\", \"Colombia\"). VALUE ONLY.\n",
|
| 734 |
+
"\n",
|
| 735 |
+
"RULES:\n",
|
| 736 |
+
"- Professions MUST match the exact categories listed (actress = actor)\n",
|
| 737 |
+
"- \"online personality\" includes streamers, cosplayers, YouTubers, influencers\n",
|
| 738 |
+
"- \"public figure\" includes politicians, activists, journalists, authors\n",
|
| 739 |
+
"- Use \"Unknown\" when uncertain or for fictional characters\n",
|
| 740 |
+
"- NO explanatory text, NO labels like \"Gender:\", NO prefixes\n",
|
| 741 |
+
"- For multi-role people, list up to 3 categories by relevance\n",
|
| 742 |
+
"\n",
|
| 743 |
+
"EXAMPLE FORMAT:\n",
|
| 744 |
+
"1. Taylor Swift\n",
|
| 745 |
+
"2. None\n",
|
| 746 |
+
"3. Female\n",
|
| 747 |
+
"4. singer/musician, public figure\n",
|
| 748 |
+
"5. United States\"\"\"\n"
|
| 749 |
+
]
|
| 750 |
+
},
|
| 751 |
+
{
|
| 752 |
+
"cell_type": "markdown",
|
| 753 |
+
"id": "854fa668",
|
| 754 |
+
"metadata": {},
|
| 755 |
+
"source": [
|
| 756 |
+
"#### Response parsing code"
|
| 757 |
+
]
|
| 758 |
+
},
|
| 759 |
+
{
|
| 760 |
+
"cell_type": "code",
|
| 761 |
+
"execution_count": null,
|
| 762 |
+
"id": "1a4be2ee",
|
| 763 |
+
"metadata": {},
|
| 764 |
+
"outputs": [],
|
| 765 |
+
"source": [
|
| 766 |
+
"def parse_response(response):\n",
|
| 767 |
+
" \"\"\"Parse model response into structured fields.\"\"\"\n",
|
| 768 |
+
" if not response:\n",
|
| 769 |
+
" return {\n",
|
| 770 |
+
" 'full_name': 'Unknown',\n",
|
| 771 |
+
" 'aliases': 'Unknown',\n",
|
| 772 |
+
" 'gender': 'Unknown',\n",
|
| 773 |
+
" 'profession_llm': 'Unknown',\n",
|
| 774 |
+
" 'country': 'Unknown'\n",
|
| 775 |
+
" }\n",
|
| 776 |
+
" \n",
|
| 777 |
+
" lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
|
| 778 |
+
" \n",
|
| 779 |
+
" fields = {\n",
|
| 780 |
+
" 'full_name': 'Unknown',\n",
|
| 781 |
+
" 'aliases': 'Unknown',\n",
|
| 782 |
+
" 'gender': 'Unknown',\n",
|
| 783 |
+
" 'profession_llm': 'Unknown',\n",
|
| 784 |
+
" 'country': 'Unknown'\n",
|
| 785 |
+
" }\n",
|
| 786 |
+
" \n",
|
| 787 |
+
" for line in lines:\n",
|
| 788 |
+
" if line.startswith('1.'):\n",
|
| 789 |
+
" fields['full_name'] = line[2:].strip()\n",
|
| 790 |
+
" elif line.startswith('2.'):\n",
|
| 791 |
+
" fields['aliases'] = line[2:].strip()\n",
|
| 792 |
+
" elif line.startswith('3.'):\n",
|
| 793 |
+
" gender_raw = line[2:].strip()\n",
|
| 794 |
+
" gender_raw = re.sub(r'^(Gender:|gender:)\\s*', '', gender_raw, flags=re.IGNORECASE)\n",
|
| 795 |
+
" gender_match = re.search(r'\\b(Male|Female|Other|Unknown)\\b', gender_raw, re.IGNORECASE)\n",
|
| 796 |
+
" fields['gender'] = gender_match.group(1).capitalize() if gender_match else gender_raw\n",
|
| 797 |
+
" elif line.startswith('4.'):\n",
|
| 798 |
+
" fields['profession_llm'] = line[2:].strip()\n",
|
| 799 |
+
" elif line.startswith('5.'):\n",
|
| 800 |
+
" country_raw = line[2:].strip()\n",
|
| 801 |
+
" country_raw = re.sub(r'^(Primary\\s+)?(associated\\s+)?country:\\s*', '', country_raw, flags=re.IGNORECASE)\n",
|
| 802 |
+
" fields['country'] = country_raw\n",
|
| 803 |
+
" \n",
|
| 804 |
+
" return fields\n"
|
| 805 |
+
]
|
| 806 |
+
},
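`parse_response` stores line 4 as returned, while the prompt asks the model to pick from a fixed list. A post-processing step could enforce that list. The sketch below is one possible approach, not the notebook's method: `normalize_professions` and its small synonym table are hypothetical names introduced here, and the category list is copied from `PROFESSION_CATEGORIES`.

```python
PROFESSION_CATEGORIES = [
    "actor", "adult performer", "singer/musician", "model",
    "online personality", "public figure", "voice actor/ASMR",
    "sports professional", "tv personality",
]


def normalize_professions(raw, max_items=3):
    """Keep only allowed categories (case-insensitive), in order, without duplicates."""
    allowed = {c.lower(): c for c in PROFESSION_CATEGORIES}
    # Hypothetical synonym table mirroring the prompt rule "actress = actor".
    synonyms = {"actress": "actor"}
    kept = []
    for part in raw.split(","):
        key = part.strip().lower()
        key = synonyms.get(key, key)
        canonical = allowed.get(key)
        if canonical and canonical not in kept:
            kept.append(canonical)
    return ", ".join(kept[:max_items]) if kept else "Unknown"


print(normalize_professions("Actress, sexy, model"))
print(normalize_professions("celebrity"))
```

Values outside the list (for example descriptive tags copied from the hints) are dropped, and a line with no valid category falls back to "Unknown", matching the default used by `parse_response`.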
|
| 807 |
+
{
|
| 808 |
+
"cell_type": "markdown",
|
| 809 |
+
"id": "7e2f7a86",
|
| 810 |
+
"metadata": {},
|
| 811 |
+
"source": [
|
| 812 |
+
"#### CSV annotation"
|
| 813 |
+
]
|
| 814 |
+
},
|
| 815 |
+
{
|
| 816 |
+
"cell_type": "code",
|
| 817 |
+
"execution_count": null,
|
| 818 |
+
"id": "5f3dd5d6",
|
| 819 |
+
"metadata": {},
|
| 820 |
+
"outputs": [],
|
| 821 |
+
"source": [
|
| 822 |
+
"def annotate_dataset(model_type='mistral', test_mode=False, test_size=100, max_rows=50862, save_interval=10):\n",
|
| 823 |
+
" \"\"\"\n",
|
| 824 |
+
" Annotate dataset using specified model.\n",
|
| 825 |
+
" \n",
|
| 826 |
+
" Args:\n",
|
| 827 |
+
" model_type: 'mistral', 'gemma', or 'qwen'\n",
|
| 828 |
+
" test_mode: If True, only process test_size rows\n",
|
| 829 |
+
" test_size: Number of rows to process in test mode\n",
|
| 830 |
+
" max_rows: Maximum rows to process\n",
|
| 831 |
+
" save_interval: Save progress every N rows\n",
|
| 832 |
+
" \"\"\"\n",
|
| 833 |
+
" # Setup paths\n",
|
| 834 |
+
" input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
|
| 835 |
+
" output_file = current_dir.parent / f\"data/CSV/{model_type}_local_annotated_POI{'_test' if test_mode else ''}.csv\"\n",
|
| 836 |
+
" index_file = current_dir.parent / f\"misc/query_indicies/{model_type}_local_query_index.txt\"\n",
|
| 837 |
+
" index_file.parent.mkdir(parents=True, exist_ok=True)\n",
|
| 838 |
+
" \n",
|
| 839 |
+
" # Load model\n",
|
| 840 |
+
" model, tokenizer, config = load_model(model_type)\n",
|
| 841 |
+
" \n",
|
| 842 |
+
" # Load data\n",
|
| 843 |
+
" df = pd.read_csv(input_file)\n",
|
| 843 |
+
" print(f\"Loaded {len(df)} rows from input file\")\n",
|
| 845 |
+
" \n",
|
| 846 |
+
" # Merge existing annotations if available\n",
|
| 847 |
+
" if output_file.exists():\n",
|
| 848 |
+
" existing_df = pd.read_csv(output_file)\n",
|
| 849 |
+
" annotation_cols = ['full_name', 'aliases', 'gender', 'profession_llm', 'country']\n",
|
| 850 |
+
" for col in annotation_cols:\n",
|
| 851 |
+
" if col in existing_df.columns:\n",
|
| 852 |
+
" df[col] = existing_df[col][:len(df)]\n",
|
| 853 |
+
" \n",
|
| 854 |
+
" # Apply limits\n",
|
| 855 |
+
" if test_mode:\n",
|
| 856 |
+
" df = df.head(test_size).copy()\n",
|
| 857 |
+
" elif max_rows:\n",
|
| 858 |
+
" df = df.head(max_rows).copy()\n",
|
| 859 |
+
" \n",
|
| 860 |
+
" # Create prompts\n",
|
| 861 |
+
" df['prompt'] = df.apply(create_prompt, axis=1)\n",
|
| 862 |
+
" \n",
|
| 863 |
+
" # Load progress index\n",
|
| 864 |
+
" current_index = 0\n",
|
| 865 |
+
" if index_file.exists():\n",
|
| 866 |
+
" try:\n",
|
| 867 |
+
" current_index = int(index_file.read_text().strip())\n",
|
| 868 |
+
" except (OSError, ValueError):\n",
|
| 869 |
+
" current_index = 0\n",
|
| 870 |
+
" \n",
|
| 871 |
+
" print(f\"Resuming from index {current_index}\")\n",
|
| 872 |
+
" \n",
|
| 873 |
+
" # Process rows\n",
|
| 874 |
+
" use_timeout = (model_type == 'qwen')\n",
|
| 875 |
+
" \n",
|
| 876 |
+
" for i in tqdm(range(current_index, len(df)), desc=f\"{model_type.capitalize()} Annotation\"):\n",
|
| 877 |
+
" prompt = df.at[i, \"prompt\"]\n",
|
| 878 |
+
" \n",
|
| 879 |
+
" # Query with retries\n",
|
| 880 |
+
" response = None\n",
|
| 881 |
+
" for attempt in range(3):\n",
|
| 882 |
+
" response = query_model(prompt, model, tokenizer, config, use_timeout)\n",
|
| 883 |
+
" \n",
|
| 884 |
+
" if response and len(response.strip()) > 10:\n",
|
| 885 |
+
" break\n",
|
| 886 |
+
" \n",
|
| 887 |
+
" print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
|
| 888 |
+
" time.sleep(0.5)\n",
|
| 889 |
+
" \n",
|
| 890 |
+
" # Skip if invalid\n",
|
| 891 |
+
" if not response or len(response.strip()) <= 10:\n",
|
| 892 |
+
" print(f\"❌ Row {i}: failed after retries, skipping\")\n",
|
| 893 |
+
" continue\n",
|
| 894 |
+
" \n",
|
| 895 |
+
" # Parse and validate\n",
|
| 896 |
+
" parsed = parse_response(response)\n",
|
| 897 |
+
" \n",
|
| 898 |
+
" if all(v == \"Unknown\" for v in parsed.values()):\n",
|
| 899 |
+
" print(f\"❌ Row {i}: parsed as all Unknown, skipping\")\n",
|
| 900 |
+
" continue\n",
|
| 901 |
+
" \n",
|
| 902 |
+
" # Write fields\n",
|
| 903 |
+
" for key, value in parsed.items():\n",
|
| 904 |
+
" df.at[i, key] = value\n",
|
| 905 |
+
" \n",
|
| 906 |
+
" current_index = i + 1\n",
|
| 907 |
+
" \n",
|
| 908 |
+
" # GPU cleanup\n",
|
| 909 |
+
" if torch.cuda.is_available():\n",
|
| 910 |
+
" torch.cuda.empty_cache()\n",
|
| 911 |
+
" torch.cuda.synchronize()\n",
|
| 912 |
+
" \n",
|
| 913 |
+
" # Save progress\n",
|
| 914 |
+
" if (i + 1) % save_interval == 0 or (i + 1) == len(df):\n",
|
| 915 |
+
" df.to_csv(output_file, index=False)\n",
|
| 916 |
+
" index_file.write_text(str(current_index))\n",
|
| 917 |
+
" print(f\"💾 Progress saved after row {i+1}\")\n",
|
| 918 |
+
" \n",
|
| 919 |
+
" # Final save\n",
|
| 920 |
+
" df.to_csv(output_file, index=False)\n",
|
| 921 |
+
" index_file.write_text(str(current_index))\n",
|
| 922 |
+
" print(f\"✓ Finished annotation with {model_type}\")\n"
|
| 923 |
+
]
|
| 924 |
+
},
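`annotate_dataset` resumes from a row index stored in a text file and rewrites that file every `save_interval` rows. The pattern can be tried in isolation with the simplified sketch below. The names `read_index` and `process_with_checkpoint` are hypothetical, and unlike the notebook this version advances the index for every row rather than only for rows that were annotated successfully.

```python
import tempfile
from pathlib import Path


def read_index(index_file):
    """Return the saved row index, or 0 if the file is missing or unreadable."""
    try:
        return int(index_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return 0


def process_with_checkpoint(items, index_file, handle, save_interval=10):
    """Process items from the saved index onward, checkpointing periodically."""
    start = read_index(index_file)
    for i in range(start, len(items)):
        handle(items[i])
        # Persist progress every save_interval rows and after the last row.
        if (i + 1) % save_interval == 0 or (i + 1) == len(items):
            index_file.write_text(str(i + 1))
    return start


with tempfile.TemporaryDirectory() as tmp:
    idx = Path(tmp) / "query_index.txt"
    seen = []
    process_with_checkpoint(list(range(25)), idx, seen.append)
    idx.write_text("20")  # simulate an interrupted run that stopped at row 20
    resumed = []
    process_with_checkpoint(list(range(25)), idx, resumed.append)
    print(resumed)
```

Because the index is only written at checkpoints, an interruption can repeat up to `save_interval - 1` rows on the next run; repeating work is the safer failure mode compared with skipping rows.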
|
| 925 |
+
{
|
| 926 |
+
"cell_type": "markdown",
|
| 927 |
+
"id": "55da2f4c",
|
| 928 |
+
"metadata": {},
|
| 929 |
+
"source": [
|
| 930 |
+
"### Usage Examples\n",
|
| 931 |
+
"Run annotation with your chosen model."
|
| 932 |
+
]
|
| 933 |
+
},
|
| 934 |
+
{
|
| 935 |
+
"cell_type": "code",
|
| 936 |
+
"execution_count": null,
|
| 937 |
+
"id": "351ea40c",
|
| 938 |
+
"metadata": {},
|
| 939 |
+
"outputs": [],
|
| 940 |
+
"source": [
|
| 941 |
+
"# Example 1: Annotate with Mistral (13.5 GB VRAM)\n",
|
| 942 |
+
"# annotate_dataset(model_type='mistral', test_mode=False)\n",
|
| 943 |
+
"\n",
|
| 944 |
+
"# Example 2: Annotate with Gemma (56.3 GB VRAM)\n",
|
| 945 |
+
"# annotate_dataset(model_type='gemma', test_mode=False)\n",
|
| 946 |
+
"\n",
|
| 947 |
+
"# Example 3: Annotate with Qwen (32.7 GB VRAM, 8-bit)\n",
|
| 948 |
+
"# annotate_dataset(model_type='qwen', test_mode=False)\n",
|
| 949 |
+
"\n",
|
| 950 |
+
"# Test mode (first 100 rows)\n",
|
| 951 |
+
"# annotate_dataset(model_type='mistral', test_mode=True, test_size=100)\n"
|
| 952 |
+
]
|
| 953 |
+
},
|
| 954 |
+
{
|
| 955 |
+
"cell_type": "markdown",
|
| 956 |
+
"id": "6431d347-d80c-4e8b-83a7-531e5df95a72",
|
| 957 |
+
"metadata": {},
|
| 958 |
+
"source": [
|
| 959 |
+
"## EuroLLM-9B-Instruct"
|
| 960 |
+
]
|
| 961 |
+
},
|
| 962 |
+
{
|
| 963 |
+
"cell_type": "code",
|
| 964 |
+
"execution_count": null,
|
| 965 |
+
"id": "e8203abc-e7c3-4cb6-aaeb-fdc6933981fc",
|
| 966 |
+
"metadata": {},
|
| 967 |
+
"outputs": [],
|
| 968 |
+
"source": [
|
| 969 |
+
"import pandas as pd\n",
|
| 970 |
+
"import json\n",
|
| 971 |
+
"import time\n",
|
| 972 |
+
"import re\n",
|
| 973 |
+
"from pathlib import Path\n",
|
| 974 |
+
"from tqdm import tqdm\n",
|
| 975 |
+
"import torch\n",
|
| 976 |
+
"from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
|
| 977 |
+
"import signal\n",
|
| 978 |
+
"from contextlib import contextmanager\n",
|
| 979 |
+
"\n",
|
| 980 |
+
"current_dir = Path.cwd()\n",
|
| 981 |
+
"input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
|
| 982 |
+
"professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
|
| 983 |
+
"professions_mapped_file = current_dir.parent / \"misc/lists/professions_mapped.csv\"\n",
|
| 984 |
+
"# === PROCESS DATA ===\n",
|
| 985 |
+
"\n",
|
| 986 |
+
"\n",
|
| 987 |
+
"# === CONFIGURATION ===\n",
|
| 988 |
+
"TEST_MODE = False\n",
|
| 989 |
+
"TEST_SIZE = 100\n",
|
| 990 |
+
"MAX_ROWS = 50862\n",
|
| 991 |
+
"SAVE_INTERVAL = 10\n",
|
| 992 |
+
"\n",
|
| 993 |
+
"\n",
|
| 994 |
+
"index_file = current_dir.parent / \"misc/query_indicies/eurollm_local_query_index.txt\"\n",
|
| 995 |
+
"output_file = current_dir.parent / f\"data/CSV/eurollm_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
|
| 996 |
+
"\n",
|
| 997 |
+
"# Model settings\n",
|
| 998 |
+
"MODEL_NAME = \"utter-project/EuroLLM-9B-Instruct\"\n",
|
| 999 |
+
"#MODEL_NAME = \"Qwen/Qwen2.5-32B-Instruct\"\n",
|
| 1000 |
+
"#MODEL_NAME = \"Qwen/Qwen2.5-14B-Instruct\"\n",
|
| 1001 |
+
"#MODEL_NAME = \"Qwen/Qwen3-235B-A22B-Instruct-2507-FP8\"\n",
|
| 1002 |
+
"#MODEL_NAME = \"mistralai/Mistral-7B-Instruct-v0.3\"\n",
|
| 1003 |
+
"CACHE_DIR = current_dir.parent / \"data/models\"\n",
|
| 1004 |
+
"CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
|
| 1005 |
+
"\n",
|
| 1006 |
+
"# Define the SPECIFIC profession categories\n",
|
| 1007 |
+
"PROFESSION_CATEGORIES = [\n",
|
| 1008 |
+
" \"actor\",\n",
|
| 1009 |
+
" \"adult performer\",\n",
|
| 1010 |
+
" \"singer/musician\",\n",
|
| 1011 |
+
" \"model\",\n",
|
| 1012 |
+
" \"online personality\",\n",
|
| 1013 |
+
" \"public figure\",\n",
|
| 1014 |
+
" \"voice actor/ASMR\",\n",
|
| 1015 |
+
" \"sports professional\",\n",
|
| 1016 |
+
" \"tv personality\"\n",
|
| 1017 |
+
"]\n",
|
| 1018 |
+
"\n",
|
| 1019 |
+
"# === LOAD MODEL ===\n",
|
| 1020 |
+
"print(f\"Loading model: {MODEL_NAME}\")\n",
|
| 1021 |
+
"print(f\"Cache directory: {CACHE_DIR}\")\n",
|
| 1022 |
+
"print(f\"This may take a while on first run...\\n\")\n",
|
| 1023 |
+
"\n",
|
| 1024 |
+
"# Check GPU availability\n",
|
| 1025 |
+
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
|
| 1026 |
+
"print(f\"Device: {device}\")\n",
|
| 1027 |
+
"\n",
|
| 1028 |
+
"if device == \"cpu\":\n",
|
| 1029 |
+
" print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
|
| 1030 |
+
" print(\" Consider using a GPU or reducing model size.\")\n",
|
| 1031 |
+
"\n",
|
| 1032 |
+
"# Get HF token from credentials file\n",
|
| 1033 |
+
"import os\n",
|
| 1034 |
+
"credentials_dir = current_dir.parent / \"misc/credentials\"\n",
|
| 1035 |
+
"hf_token_file = credentials_dir / \"hf_token.txt\"\n",
|
| 1036 |
+
"\n",
|
| 1037 |
+
"HF_TOKEN = None\n",
|
| 1038 |
+
"if hf_token_file.exists():\n",
|
| 1039 |
+
" HF_TOKEN = hf_token_file.read_text().strip()\n",
|
| 1040 |
+
" print(\"✅ HF token loaded from credentials file\")\n",
|
| 1041 |
+
"else:\n",
|
| 1042 |
+
" print(\"⚠️ HF token file not found at:\", hf_token_file)\n",
|
| 1043 |
+
" print(\" The script will try to use cached credentials from 'huggingface-cli login'\")\n",
|
| 1044 |
+
" print(\" Or create the file: misc/credentials/hf_token.txt with your token\")\n",
|
| 1045 |
+
" HF_TOKEN = None # Will use cached token if available\n",
|
| 1046 |
+
"\n",
|
| 1047 |
+
"# Load tokenizer\n",
|
| 1048 |
+
"print(\"Loading tokenizer...\")\n",
|
| 1049 |
+
"try:\n",
|
| 1050 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 1051 |
+
" MODEL_NAME,\n",
|
| 1052 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 1053 |
+
" use_fast=True,\n",
|
| 1054 |
+
" token=HF_TOKEN\n",
|
| 1055 |
+
" )\n",
|
| 1056 |
+
"except Exception as e:\n",
|
| 1057 |
+
" print(f\"Failed with use_fast=True, trying use_fast=False...\")\n",
|
| 1058 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 1059 |
+
" MODEL_NAME,\n",
|
| 1060 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 1061 |
+
" use_fast=False,\n",
|
| 1062 |
+
" token=HF_TOKEN\n",
|
| 1063 |
+
" )\n",
|
| 1064 |
+
"\n",
|
| 1065 |
+
"# Ensure pad token is set\n",
|
| 1066 |
+
"if tokenizer.pad_token is None:\n",
|
| 1067 |
+
" tokenizer.pad_token = tokenizer.eos_token\n",
|
| 1068 |
+
"\n",
|
| 1069 |
+
"print(\"✅ Tokenizer loaded\")\n",
|
| 1070 |
+
"\n",
|
| 1071 |
+
"# Configure 8-bit quantization for A100\n",
|
| 1072 |
+
"print(\"Configuring 8-bit quantization...\")\n",
|
| 1073 |
+
"quantization_config = BitsAndBytesConfig(\n",
|
| 1074 |
+
" load_in_8bit=True,\n",
|
| 1075 |
+
" llm_int8_threshold=6.0,\n",
|
| 1076 |
+
" llm_int8_has_fp16_weight=False\n",
|
| 1077 |
+
")\n",
|
| 1078 |
+
"\n",
|
| 1079 |
+
"# Load model with 8-bit quantization\n",
|
| 1080 |
+
"print(\"Loading model with 8-bit quantization (this may take several minutes)...\")\n",
|
| 1081 |
+
"model = AutoModelForCausalLM.from_pretrained(\n",
|
| 1082 |
+
" MODEL_NAME,\n",
|
| 1083 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 1084 |
+
" quantization_config=quantization_config,\n",
|
| 1085 |
+
" device_map=\"auto\",\n",
|
| 1086 |
+
" trust_remote_code=False,\n",
|
| 1087 |
+
" token=HF_TOKEN\n",
|
| 1088 |
+
")\n",
|
| 1089 |
+
"model.eval()\n",
|
| 1090 |
+
"print(\"✅ Model loaded with 8-bit quantization\")\n",
|
| 1091 |
+
"\n",
|
| 1092 |
+
"# Check VRAM usage\n",
|
| 1093 |
+
"if torch.cuda.is_available():\n",
|
| 1094 |
+
" vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
|
| 1095 |
+
" print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
|
| 1096 |
+
"\n",
|
| 1097 |
+
"# === LOAD DATA ===\n",
|
| 1098 |
+
"if output_file.exists():\n",
|
| 1099 |
+
" print(\"Loading annotated CSV...\")\n",
|
| 1100 |
+
" df = pd.read_csv(output_file)\n",
|
| 1101 |
+
"else:\n",
|
| 1102 |
+
" print(\"Loading raw input CSV...\")\n",
|
| 1103 |
+
" df = pd.read_csv(input_file)\n",
|
| 1104 |
+
"\n",
|
| 1105 |
+
"\n",
|
| 1106 |
+
"# Try to load profession mapping files\n",
|
| 1107 |
+
"try:\n",
|
| 1108 |
+
" professions_df = pd.read_csv(professions_file)\n",
|
| 1109 |
+
" print(f\"✅ Loaded professions.csv\")\n",
|
| 1110 |
+
"except FileNotFoundError:\n",
|
| 1111 |
+
" print(\"⚠️ Warning: professions.csv not found\")\n",
|
| 1112 |
+
"\n",
|
| 1113 |
+
"try:\n",
|
| 1114 |
+
" prof_mapped_df = pd.read_csv(professions_mapped_file)\n",
|
| 1115 |
+
" print(f\"✅ Loaded profession mapping with {len(prof_mapped_df)} categories\")\n",
|
| 1116 |
+
"except FileNotFoundError:\n",
|
| 1117 |
+
" print(\"⚠️ Warning: professions_mapped.csv not found, using default categories\")\n",
|
| 1118 |
+
"\n",
|
| 1119 |
+
"profession_str = \", \".join(PROFESSION_CATEGORIES)\n",
|
| 1120 |
+
"\n",
|
| 1121 |
+
"print(f\"Loaded {len(df)} rows\")\n",
|
| 1122 |
+
"print(f\"\\nProfession categories ({len(PROFESSION_CATEGORIES)}):\")\n",
|
| 1123 |
+
"for cat in PROFESSION_CATEGORIES:\n",
|
| 1124 |
+
" print(f\" - {cat}\")\n",
|
| 1125 |
+
"\n",
|
| 1126 |
+
"if TEST_MODE:\n",
|
| 1127 |
+
" print(f\"\\nRunning in TEST MODE with {TEST_SIZE} samples\")\n",
|
| 1128 |
+
" df = df.head(TEST_SIZE).copy()\n",
|
| 1129 |
+
"elif MAX_ROWS:\n",
|
| 1130 |
+
" df = df.head(MAX_ROWS).copy()\n",
|
| 1131 |
+
"\n",
|
| 1132 |
+
"# === CREATE PROMPTS (OPTIMIZED FOR CLEAN OUTPUTS) ===\n",
|
| 1133 |
+
"def create_prompt(row):\n",
|
| 1134 |
+
" \"\"\"Create prompt for EuroLLM annotation with strict formatting requirements.\"\"\"\n",
|
| 1135 |
+
" name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
|
| 1136 |
+
" \n",
|
| 1137 |
+
" # Gather hints\n",
|
| 1138 |
+
" hints = []\n",
|
| 1139 |
+
" if pd.notna(row.get('likely_profession')):\n",
|
| 1140 |
+
" hints.append(str(row['likely_profession']))\n",
|
| 1141 |
+
" if pd.notna(row.get('likely_nationality')):\n",
|
| 1142 |
+
" hints.append(str(row['likely_nationality']))\n",
|
| 1143 |
+
" if pd.notna(row.get('likely_country')):\n",
|
| 1144 |
+
" hints.append(str(row['likely_country']))\n",
|
| 1145 |
+
" \n",
|
| 1146 |
+
" # Add tags if we don't have enough hints\n",
|
| 1147 |
+
" if len(hints) < 3:\n",
|
| 1148 |
+
" for i in range(1, 8):\n",
|
| 1149 |
+
" tag_col = f'tag_{i}'\n",
|
| 1150 |
+
" if tag_col in row and pd.notna(row[tag_col]):\n",
|
| 1151 |
+
" tag_val = str(row[tag_col])\n",
|
| 1152 |
+
" if tag_val not in hints:\n",
|
| 1153 |
+
" hints.append(tag_val)\n",
|
| 1154 |
+
" if len(hints) >= 5:\n",
|
| 1155 |
+
" break\n",
|
| 1156 |
+
" \n",
|
| 1157 |
+
" hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
|
| 1158 |
+
" \n",
|
| 1159 |
+
" return f\"\"\"Extract information about '{name}'. \n",
|
| 1160 |
+
"Context hints (DO NOT copy these as professions): {hint_text}\n",
|
| 1161 |
+
"\n",
|
| 1162 |
+
"Respond with EXACTLY 5 numbered lines. Each line must contain ONLY the value, no labels or extra text.\n",
|
| 1163 |
+
"\n",
|
| 1164 |
+
"FORMAT REQUIREMENTS:\n",
|
| 1165 |
+
"1. Full legal name in Western order (first last). VALUE ONLY.\n",
|
| 1166 |
+
"2. Stage names/aliases, comma-separated. If none, write \"None\". VALUE ONLY.\n",
|
| 1167 |
+
"3. Gender: MUST be exactly one word: Male, Female, Other, or Unknown. VALUE ONLY.\n",
|
| 1168 |
+
"4. Professions: Choose up to 3 from this list ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality. Comma-separated. VALUE ONLY.\n",
|
| 1169 |
+
"5. Primary country: Country name only (e.g., \"China\", \"United States\", \"Colombia\"). VALUE ONLY.\n",
|
| 1170 |
+
"\n",
|
| 1171 |
+
"CRITICAL RULES FOR PROFESSIONS (Line 4):\n",
|
| 1172 |
+
"- ONLY use the exact profession categories listed above\n",
|
| 1173 |
+
"- DO NOT use descriptive words like \"sexy\", \"photorealistic\", \"celebrity\"\n",
|
| 1174 |
+
"- DO NOT copy the hint words as professions\n",
|
| 1175 |
+
"- If uncertain about profession, write \"Unknown\"\n",
|
| 1176 |
+
"- Valid professions are ONLY: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality\n",
|
| 1177 |
+
"- Actress = actor, streamer = online personality, YouTuber = online personality\n",
|
| 1178 |
+
"\n",
|
| 1179 |
+
"OTHER RULES:\n",
|
| 1180 |
+
"- Use \"Unknown\" when uncertain or for fictional characters\n",
|
| 1181 |
+
"- NO explanatory text, NO labels like \"Gender:\", NO prefixes\n",
|
| 1182 |
+
"- For multi-role people, list up to 3 categories by relevance\"\"\"\n",
|
| 1183 |
+
"\n",
|
| 1184 |
+
"# Create prompts\n",
|
| 1185 |
+
"print(\"\\nCreating prompts...\")\n",
|
| 1186 |
+
"df['prompt'] = df.apply(create_prompt, axis=1)\n",
|
| 1187 |
+
"print(\"✅ Prompts created\")\n",
|
| 1188 |
+
"\n",
|
| 1189 |
+
"@contextmanager\n",
|
| 1190 |
+
"def timeout(duration):\n",
|
| 1191 |
+
" def handler(signum, frame):\n",
|
| 1192 |
+
" raise TimeoutError(\"Operation timed out\")\n",
|
| 1193 |
+
" \n",
|
| 1194 |
+
" # Set the signal handler and alarm\n",
|
| 1195 |
+
" signal.signal(signal.SIGALRM, handler)\n",
|
| 1196 |
+
" signal.alarm(duration)\n",
|
| 1197 |
+
" try:\n",
|
| 1198 |
+
" yield\n",
|
| 1199 |
+
" finally:\n",
|
| 1200 |
+
" signal.alarm(0) # Disable the alarm\n",
|
| 1201 |
+
"\n",
|
| 1202 |
+
"\n",
|
| 1203 |
+
"def query_eurollm_local(prompt: str) -> str:\n",
|
| 1204 |
+
" \"\"\"Query EuroLLM locally via transformers with very low temperature.\"\"\"\n",
|
| 1205 |
+
" try:\n",
|
| 1206 |
+
" # Format as chat message for EuroLLM with strict system prompt\n",
|
| 1207 |
+
" messages = [\n",
|
| 1208 |
+
" {\"role\": \"system\", \"content\": \"You are a data extraction assistant. Respond with exactly 5 numbered lines containing ONLY values. No labels, no explanations, no prefixes. Follow the format precisely.\"},\n",
|
| 1209 |
+
" {\"role\": \"user\", \"content\": prompt}\n",
|
| 1210 |
+
" ]\n",
|
| 1211 |
+
" \n",
|
| 1212 |
+
" # Tokenize\n",
|
| 1213 |
+
" if hasattr(tokenizer, 'apply_chat_template') and tokenizer.chat_template is not None:\n",
|
| 1214 |
+
" text = tokenizer.apply_chat_template(\n",
|
| 1215 |
+
" messages,\n",
|
| 1216 |
+
" tokenize=False,\n",
|
| 1217 |
+
" add_generation_prompt=True\n",
|
| 1218 |
+
" )\n",
|
| 1219 |
+
" else:\n",
|
| 1220 |
+
" # Fallback for models without chat template\n",
|
| 1221 |
+
" text = f\"[INST] {prompt} [/INST]\"\n",
|
| 1222 |
+
" \n",
|
| 1223 |
+
" inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
|
| 1224 |
+
" \n",
|
| 1225 |
+
" # Generate with timeout and very low temperature\n",
|
| 1226 |
+
" with timeout(60):\n",
|
| 1227 |
+
" with torch.no_grad():\n",
|
| 1228 |
+
" outputs = model.generate(\n",
|
| 1229 |
+
" **inputs,\n",
|
| 1230 |
+
" max_new_tokens=100,\n",
|
| 1231 |
+
" temperature=0.01, # Very low temperature for more deterministic outputs\n",
|
| 1232 |
+
" do_sample=True, # Must be True when temperature is set\n",
|
| 1233 |
+
" pad_token_id=tokenizer.eos_token_id\n",
|
| 1234 |
+
" )\n",
|
| 1235 |
+
" \n",
|
| 1236 |
+
" # Decode\n",
|
| 1237 |
+
" generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
|
| 1238 |
+
" response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
|
| 1239 |
+
" \n",
|
| 1240 |
+
" return response.strip()\n",
|
| 1241 |
+
" \n",
|
| 1242 |
+
" except TimeoutError:\n",
|
| 1243 |
+
" print(f\"[ERROR] Generation timed out after 60 seconds\")\n",
|
| 1244 |
+
" return None\n",
|
| 1245 |
+
" except Exception as e:\n",
|
| 1246 |
+
" print(f\"Generation error: {e}\")\n",
|
| 1247 |
+
" import traceback\n",
|
| 1248 |
+
" traceback.print_exc()\n",
|
| 1249 |
+
" return None\n",
|
| 1250 |
+
"\n",
|
| 1251 |
+
" \n",
|
| 1252 |
+
"# === PARSE RESPONSE WITH CLEANING ===\n",
|
| 1253 |
+
"def parse_response(response):\n",
|
| 1254 |
+
" \"\"\"Parse EuroLLM response into structured fields with cleaning.\"\"\"\n",
|
| 1255 |
+
" if not response:\n",
|
| 1256 |
+
" return {\n",
|
| 1257 |
+
" 'full_name': 'Unknown',\n",
|
| 1258 |
+
" 'aliases': 'Unknown',\n",
|
| 1259 |
+
" 'gender': 'Unknown',\n",
|
| 1260 |
+
" 'profession_llm': 'Unknown',\n",
|
| 1261 |
+
" 'country': 'Unknown'\n",
|
| 1262 |
+
" }\n",
|
| 1263 |
+
" \n",
|
| 1264 |
+
" # Valid profession categories\n",
|
| 1265 |
+
" VALID_PROFESSIONS = {\n",
|
| 1266 |
+
" \"actor\", \"adult performer\", \"singer/musician\", \"model\", \n",
|
| 1267 |
+
" \"online personality\", \"public figure\", \"voice actor/asmr\", \n",
|
| 1268 |
+
" \"sports professional\", \"tv personality\"\n",
|
| 1269 |
+
" }\n",
|
| 1270 |
+
" \n",
|
| 1271 |
+
" # Split into lines and clean\n",
|
| 1272 |
+
" lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
|
| 1273 |
+
" \n",
|
| 1274 |
+
" # Initialize with Unknown values\n",
|
| 1275 |
+
" fields = {\n",
|
| 1276 |
+
" 'full_name': 'Unknown',\n",
|
| 1277 |
+
" 'aliases': 'Unknown',\n",
|
| 1278 |
+
" 'gender': 'Unknown',\n",
|
| 1279 |
+
" 'profession_llm': 'Unknown',\n",
|
| 1280 |
+
" 'country': 'Unknown'\n",
|
| 1281 |
+
" }\n",
|
| 1282 |
+
" \n",
|
| 1283 |
+
" # Extract information from each numbered line\n",
|
| 1284 |
+
" for line in lines:\n",
|
| 1285 |
+
" if line.startswith('1.'):\n",
|
| 1286 |
+
" fields['full_name'] = line[2:].strip()\n",
|
| 1287 |
+
" elif line.startswith('2.'):\n",
|
| 1288 |
+
" fields['aliases'] = line[2:].strip()\n",
|
| 1289 |
+
" elif line.startswith('3.'):\n",
|
| 1290 |
+
" # Clean gender field - remove any labels\n",
|
| 1291 |
+
" gender_raw = line[2:].strip()\n",
|
| 1292 |
+
" # Remove common prefixes\n",
|
| 1293 |
+
" gender_raw = re.sub(r'^(Gender:|gender:)\\s*', '', gender_raw, flags=re.IGNORECASE)\n",
|
| 1294 |
+
" # Extract just the gender word\n",
|
| 1295 |
+
" gender_match = re.search(r'\\b(Male|Female|Other|Unknown)\\b', gender_raw, re.IGNORECASE)\n",
|
| 1296 |
+
" fields['gender'] = gender_match.group(1).capitalize() if gender_match else gender_raw\n",
|
| 1297 |
+
" elif line.startswith('4.'):\n",
|
| 1298 |
+
" # Clean and validate profession field\n",
|
| 1299 |
+
" profession_raw = line[2:].strip()\n",
|
| 1300 |
+
" \n",
|
| 1301 |
+
" # Split by comma and validate each profession\n",
|
| 1302 |
+
" professions = [p.strip().lower() for p in profession_raw.split(',')]\n",
|
| 1303 |
+
" valid_profs = []\n",
|
| 1304 |
+
" \n",
|
| 1305 |
+
" for prof in professions:\n",
|
| 1306 |
+
" # Check if it's a valid profession\n",
|
| 1307 |
+
" if prof in VALID_PROFESSIONS:\n",
|
| 1308 |
+
" valid_profs.append(prof)\n",
|
| 1309 |
+
" # Check for common invalid entries\n",
|
| 1310 |
+
" elif prof in ['unknown', '']:\n",
|
| 1311 |
+
" continue\n",
|
| 1312 |
+
" # Reject descriptive words that aren't professions\n",
|
| 1313 |
+
" elif prof in ['sexy', 'photorealistic', 'celebrity', 'famous', 'popular', \n",
|
| 1314 |
+
" 'beautiful', 'attractive', 'hot', 'gorgeous']:\n",
|
| 1315 |
+
" continue\n",
|
| 1316 |
+
" # If it looks like it might be close to a valid profession, keep it\n",
|
| 1317 |
+
" elif any(valid in prof for valid in VALID_PROFESSIONS):\n",
|
| 1318 |
+
" # Try to extract the valid part\n",
|
| 1319 |
+
" for valid in VALID_PROFESSIONS:\n",
|
| 1320 |
+
" if valid in prof:\n",
|
| 1321 |
+
" valid_profs.append(valid)\n",
|
| 1322 |
+
" break\n",
|
| 1323 |
+
" \n",
|
| 1324 |
+
" # Set the cleaned professions or Unknown if none are valid\n",
|
| 1325 |
+
" if valid_profs:\n",
|
| 1326 |
+
" fields['profession_llm'] = ', '.join(valid_profs)\n",
|
| 1327 |
+
" else:\n",
|
| 1328 |
+
" fields['profession_llm'] = 'Unknown'\n",
|
| 1329 |
+
" \n",
|
| 1330 |
+
" elif line.startswith('5.'):\n",
|
| 1331 |
+
" # Clean country field - remove any labels\n",
|
| 1332 |
+
" country_raw = line[2:].strip()\n",
|
| 1333 |
+
" # Remove common prefixes like \"Primary country:\", \"Country:\", etc.\n",
|
| 1334 |
+
" country_raw = re.sub(r'^(Primary\\s+)?(associated\\s+)?country:\\s*', '', country_raw, flags=re.IGNORECASE)\n",
|
| 1335 |
+
" fields['country'] = country_raw\n",
|
| 1336 |
+
" \n",
|
| 1337 |
+
" return fields\n",
|
| 1338 |
+
"\n",
|
| 1339 |
+
"# === PROCESS DATA ===\n",
|
| 1340 |
+
"index_file.parent.mkdir(parents=True, exist_ok=True)\n",
|
| 1341 |
+
"\n",
|
| 1342 |
+
"# Load index\n",
|
| 1343 |
+
"current_index = 0\n",
|
| 1344 |
+
"if index_file.exists():\n",
|
| 1345 |
+
" try:\n",
|
| 1346 |
+
" current_index = int(index_file.read_text().strip())\n",
|
| 1347 |
+
" except (ValueError, OSError):\n",
|
| 1348 |
+
" current_index = 0\n",
|
| 1349 |
+
"\n",
|
| 1350 |
+
"print(f\"Resuming from index {current_index}\")\n",
|
| 1351 |
+
"\n",
|
| 1352 |
+
"start_time = time.time()\n",
|
| 1353 |
+
"\n",
|
| 1354 |
+
"for i in tqdm(range(current_index, len(df)), desc=\"EuroLLM Local\"):\n",
|
| 1355 |
+
"\n",
|
| 1356 |
+
" prompt = df.at[i, \"prompt\"]\n",
|
| 1357 |
+
"\n",
|
| 1358 |
+
" # -------- MODEL QUERY WITH RETRIES --------\n",
|
| 1359 |
+
" response = None\n",
|
| 1360 |
+
" for attempt in range(3):\n",
|
| 1361 |
+
" response = query_eurollm_local(prompt)\n",
|
| 1362 |
+
" \n",
|
| 1363 |
+
" # DEBUG: Print first few responses to see what's happening\n",
|
| 1364 |
+
" if i < 5:\n",
|
| 1365 |
+
" print(f\"\\n=== DEBUG Row {i}, Attempt {attempt+1} ===\")\n",
|
| 1366 |
+
" print(f\"Response length: {len(response) if response else 0}\")\n",
|
| 1367 |
+
" print(f\"Response: {response[:500] if response else 'None'}\")\n",
|
| 1368 |
+
" print(\"=\" * 50)\n",
|
| 1369 |
+
" \n",
|
| 1370 |
+
" # Valid response?\n",
|
| 1371 |
+
" if response and len(response.strip()) > 10:\n",
|
| 1372 |
+
" break\n",
|
| 1373 |
+
" \n",
|
| 1374 |
+
" print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
|
| 1375 |
+
" time.sleep(0.5)\n",
|
| 1376 |
+
"\n",
|
| 1377 |
+
" # If still invalid → DO NOT overwrite previous data\n",
|
| 1378 |
+
" if not response or len(response.strip()) <= 10:\n",
|
| 1379 |
+
" print(f\"❌ Row {i}: failed after retries; skipping without writing\")\n",
|
| 1380 |
+
" continue\n",
|
| 1381 |
+
"\n",
|
| 1382 |
+
" parsed = parse_response(response)\n",
|
| 1383 |
+
"\n",
|
| 1384 |
+
" # DEBUG: Print first few parsed results\n",
|
| 1385 |
+
" if i < 5:\n",
|
| 1386 |
+
" print(f\"\\n=== PARSED Row {i} ===\")\n",
|
| 1387 |
+
" for key, value in parsed.items():\n",
|
| 1388 |
+
" print(f\" {key}: {value}\")\n",
|
| 1389 |
+
" print(\"=\" * 50)\n",
|
| 1390 |
+
"\n",
|
| 1391 |
+
" # Additional safety: skip rows that parsed as all 'Unknown'\n",
|
| 1392 |
+
" if all(v == \"Unknown\" for v in parsed.values()):\n",
|
| 1393 |
+
" print(f\"❌ Row {i}: parsed as all Unknown (likely model crash); skipping.\")\n",
|
| 1394 |
+
" continue\n",
|
| 1395 |
+
"\n",
|
| 1396 |
+
" # -------- WRITE PARSED FIELDS SAFELY --------\n",
|
| 1397 |
+
" for key, value in parsed.items():\n",
|
| 1398 |
+
" df.at[i, key] = value\n",
|
| 1399 |
+
"\n",
|
| 1400 |
+
" # Advance progress ONLY after successful write\n",
|
| 1401 |
+
" current_index = i + 1\n",
|
| 1402 |
+
"\n",
|
| 1403 |
+
" # -------- GPU MEMORY CLEANUP --------\n",
|
| 1404 |
+
" if torch.cuda.is_available():\n",
|
| 1405 |
+
" torch.cuda.empty_cache()\n",
|
| 1406 |
+
" torch.cuda.synchronize()\n",
|
| 1407 |
+
"\n",
|
| 1408 |
+
" # -------- SAVE PROGRESS PERIODICALLY --------\n",
|
| 1409 |
+
" if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
|
| 1410 |
+
" df.to_csv(output_file, index=False)\n",
|
| 1411 |
+
" with open(index_file, \"w\") as f:\n",
|
| 1412 |
+
" f.write(str(current_index))\n",
|
| 1413 |
+
" print(f\"💾 Progress saved after row {i+1}\")\n",
|
| 1414 |
+
"\n",
|
| 1415 |
+
"# Final save\n",
|
| 1416 |
+
"df.to_csv(output_file, index=False)\n",
|
| 1417 |
+
"index_file.write_text(str(current_index))\n",
|
| 1418 |
+
"print(\"✅ Finished full dataset.\")"
|
| 1419 |
+
]
|
| 1420 |
+
},
|
| 1421 |
+
{
|
| 1422 |
+
"cell_type": "markdown",
|
| 1423 |
+
"id": "472e5ac2-ec04-4bfa-8a67-116277238c15",
|
| 1424 |
+
"metadata": {},
|
| 1425 |
+
"source": [
|
| 1426 |
+
"## Mistral 24b instruct"
|
| 1427 |
+
]
|
| 1428 |
+
},
|
| 1429 |
+
{
|
| 1430 |
+
"cell_type": "code",
|
| 1431 |
+
"execution_count": null,
|
| 1432 |
+
"id": "a55a5e30-83f3-4f7c-a537-b1216d4e8a07",
|
| 1433 |
+
"metadata": {
|
| 1434 |
+
"execution": {
|
| 1435 |
+
"iopub.execute_input": "2025-12-09T22:16:21.002786Z",
|
| 1436 |
+
"iopub.status.busy": "2025-12-09T22:16:21.002337Z"
|
| 1437 |
+
}
|
| 1438 |
+
},
|
| 1439 |
+
"outputs": [
|
| 1440 |
+
{
|
| 1441 |
+
"name": "stderr",
|
| 1442 |
+
"output_type": "stream",
|
| 1443 |
+
"text": [
|
| 1444 |
+
"/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
|
| 1445 |
+
" from .autonotebook import tqdm as notebook_tqdm\n"
|
| 1446 |
+
]
|
| 1447 |
+
},
|
| 1448 |
+
{
|
| 1449 |
+
"name": "stdout",
|
| 1450 |
+
"output_type": "stream",
|
| 1451 |
+
"text": [
|
| 1452 |
+
"Loading model: mistralai/Mistral-Small-Instruct-2409\n",
|
| 1453 |
+
"Cache directory: /shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/data/models\n",
|
| 1454 |
+
"This may take a while on first run (~65GB download)...\n",
|
| 1455 |
+
"\n",
|
| 1456 |
+
"Device: cuda\n",
|
| 1457 |
+
"Loading tokenizer...\n",
|
| 1458 |
+
"✅ Tokenizer loaded\n",
|
| 1459 |
+
"Loading model (this may take several minutes)...\n"
|
| 1460 |
+
]
|
| 1461 |
+
},
|
| 1462 |
+
{
|
| 1463 |
+
"name": "stderr",
|
| 1464 |
+
"output_type": "stream",
|
| 1465 |
+
"text": [
|
| 1466 |
+
"`torch_dtype` is deprecated! Use `dtype` instead!\n",
|
| 1467 |
+
"Loading checkpoint shards: 100%|██████████| 9/9 [02:42<00:00, 18.06s/it]\n"
|
| 1468 |
+
]
|
| 1469 |
+
},
|
| 1470 |
+
{
|
| 1471 |
+
"name": "stdout",
|
| 1472 |
+
"output_type": "stream",
|
| 1473 |
+
"text": [
|
| 1474 |
+
"✅ Model loaded\n",
|
| 1475 |
+
"VRAM used: 21.40 GB\n",
|
| 1476 |
+
"\n",
|
| 1477 |
+
"Loading raw input CSV...\n",
|
| 1478 |
+
"Loaded 50861 rows from input file\n",
|
| 1479 |
+
"Found existing annotations, merging...\n"
|
| 1480 |
+
]
|
| 1481 |
+
},
|
| 1482 |
+
{
|
| 1483 |
+
"name": "stderr",
|
| 1484 |
+
"output_type": "stream",
|
| 1485 |
+
"text": [
|
| 1486 |
+
"/tmp/ipykernel_3104208/1997558719.py:113: DtypeWarning: Columns (52,53,54,55,56) have mixed types. Specify dtype option on import or set low_memory=False.\n",
|
| 1487 |
+
" existing_df = pd.read_csv(output_file)\n"
|
| 1488 |
+
]
|
| 1489 |
+
},
|
| 1490 |
+
{
|
| 1491 |
+
"name": "stdout",
|
| 1492 |
+
"output_type": "stream",
|
| 1493 |
+
"text": [
|
| 1494 |
+
"Existing annotations has 50861 rows\n",
|
| 1495 |
+
"Merged annotations, continuing with 50861 total rows\n",
|
| 1496 |
+
"✅ Loaded professions.csv\n",
|
| 1497 |
+
"✅ Loaded profession mapping with 9 categories\n",
|
| 1498 |
+
"Loaded 50861 rows\n",
|
| 1499 |
+
"\n",
|
| 1500 |
+
"Profession categories (9):\n",
|
| 1501 |
+
" - actor\n",
|
| 1502 |
+
" - adult performer\n",
|
| 1503 |
+
" - singer/musician\n",
|
| 1504 |
+
" - model\n",
|
| 1505 |
+
" - online personality\n",
|
| 1506 |
+
" - public figure\n",
|
| 1507 |
+
" - voice actor/ASMR\n",
|
| 1508 |
+
" - sports professional\n",
|
| 1509 |
+
" - tv personality\n",
|
| 1510 |
+
"\n",
|
| 1511 |
+
"Creating prompts...\n",
|
| 1512 |
+
"✅ Prompts created\n",
|
| 1513 |
+
"Resuming from index 8810\n"
|
| 1514 |
+
]
|
| 1515 |
+
},
|
| 1516 |
+
{
|
| 1517 |
+
"name": "stderr",
|
| 1518 |
+
"output_type": "stream",
|
| 1519 |
+
"text": [
|
| 1520 |
+
"Mistral Local: 0%| | 0/42051 [00:00<?, ?it/s]/shares/weddigen.ki.uzh/laura_wagner/phase_01/pm-paper/.venv/lib/python3.11/site-packages/bitsandbytes/autograd/_functions.py:181: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization\n",
|
| 1521 |
+
" warnings.warn(f\"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization\")\n",
|
| 1522 |
+
"Mistral Local: 0%| | 7/42051 [00:57<93:01:03, 7.96s/it] "
|
| 1523 |
+
]
|
| 1524 |
+
}
|
| 1525 |
+
],
|
| 1526 |
+
"source": [
|
| 1527 |
+
"import pandas as pd\n",
|
| 1528 |
+
"import json\n",
|
| 1529 |
+
"import time\n",
|
| 1530 |
+
"import re\n",
|
| 1531 |
+
"from pathlib import Path\n",
|
| 1532 |
+
"from tqdm import tqdm\n",
|
| 1533 |
+
"import torch\n",
|
| 1534 |
+
"from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
|
| 1535 |
+
"\n",
|
| 1536 |
+
"current_dir = Path.cwd()\n",
|
| 1537 |
+
"input_file = current_dir.parent / \"data/CSV/model_adapter/real_person_adapter_step_02_NER.csv\"\n",
|
| 1538 |
+
"professions_file = current_dir.parent / \"misc/lists/professions.csv\"\n",
|
| 1539 |
+
"professions_mapped_file = current_dir.parent / \"misc/lists/professions_mapped.csv\"\n",
|
| 1540 |
+
"# === PATHS ===\n",
|
| 1541 |
+
"\n",
|
| 1542 |
+
"\n",
|
| 1543 |
+
"# === CONFIGURATION ===\n",
|
| 1544 |
+
"TEST_MODE = False\n",
|
| 1545 |
+
"TEST_SIZE = 100\n",
|
| 1546 |
+
"MAX_ROWS = 50862\n",
|
| 1547 |
+
"SAVE_INTERVAL = 10\n",
|
| 1548 |
+
"\n",
|
| 1549 |
+
"output_file = current_dir.parent / f\"data/CSV/mistral24_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
|
| 1550 |
+
"index_file = current_dir.parent / \"misc/query_indicies/mistral24_local_query_index.txt\"\n",
|
| 1551 |
+
"\n",
|
| 1552 |
+
"\n",
|
| 1553 |
+
"# Model settings\n",
|
| 1554 |
+
"#MODEL_NAME = \"mistralai/Mistral-Small-3.1-24B-Instruct-2503\"\n",
|
| 1555 |
+
"MODEL_NAME = \"mistralai/Mistral-Small-Instruct-2409\"\n",
|
| 1556 |
+
"#MODEL_NAME = \"mistralai/Mistral-7B-Instruct-v0.3\"\n",
|
| 1557 |
+
"CACHE_DIR = current_dir.parent / \"data/models\"\n",
|
| 1558 |
+
"CACHE_DIR.mkdir(parents=True, exist_ok=True)\n",
|
| 1559 |
+
"\n",
|
| 1560 |
+
"# Define the SPECIFIC profession categories\n",
|
| 1561 |
+
"PROFESSION_CATEGORIES = [\n",
|
| 1562 |
+
" \"actor\",\n",
|
| 1563 |
+
" \"adult performer\",\n",
|
| 1564 |
+
" \"singer/musician\",\n",
|
| 1565 |
+
" \"model\",\n",
|
| 1566 |
+
" \"online personality\",\n",
|
| 1567 |
+
" \"public figure\",\n",
|
| 1568 |
+
" \"voice actor/ASMR\",\n",
|
| 1569 |
+
" \"sports professional\",\n",
|
| 1570 |
+
" \"tv personality\"\n",
|
| 1571 |
+
"]\n",
|
| 1572 |
+
"\n",
|
| 1573 |
+
"# === LOAD MODEL ===\n",
|
| 1574 |
+
"print(f\"Loading model: {MODEL_NAME}\")\n",
|
| 1575 |
+
"print(f\"Cache directory: {CACHE_DIR}\")\n",
|
| 1576 |
+
"print(f\"This may take a while on first run (~65GB download)...\\n\")\n",
|
| 1577 |
+
"\n",
|
| 1578 |
+
"# Check GPU availability\n",
|
| 1579 |
+
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
|
| 1580 |
+
"print(f\"Device: {device}\")\n",
|
| 1581 |
+
"\n",
|
| 1582 |
+
"if device == \"cpu\":\n",
|
| 1583 |
+
" print(\"⚠️ WARNING: No GPU detected! Inference will be VERY slow.\")\n",
|
| 1584 |
+
" print(\" Consider using a GPU or reducing model size.\")\n",
|
| 1585 |
+
"\n",
|
| 1586 |
+
"# Load tokenizer\n",
|
| 1587 |
+
"print(\"Loading tokenizer...\")\n",
|
| 1588 |
+
"try:\n",
|
| 1589 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 1590 |
+
" MODEL_NAME,\n",
|
| 1591 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 1592 |
+
" use_fast=True\n",
|
| 1593 |
+
" )\n",
|
| 1594 |
+
"except Exception as e:\n",
|
| 1595 |
+
" print(f\"Failed with use_fast=True, trying use_fast=False...\")\n",
|
| 1596 |
+
" tokenizer = AutoTokenizer.from_pretrained(\n",
|
| 1597 |
+
" MODEL_NAME,\n",
|
| 1598 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 1599 |
+
" use_fast=False\n",
|
| 1600 |
+
" )\n",
|
| 1601 |
+
"\n",
|
| 1602 |
+
"# Ensure pad token is set\n",
|
| 1603 |
+
"if tokenizer.pad_token is None:\n",
|
| 1604 |
+
" tokenizer.pad_token = tokenizer.eos_token\n",
|
| 1605 |
+
"\n",
|
| 1606 |
+
"print(\"✅ Tokenizer loaded\")\n",
|
| 1607 |
+
"\n",
|
| 1608 |
+
"quantization_config = BitsAndBytesConfig(\n",
|
| 1609 |
+
" load_in_8bit=True\n",
|
| 1610 |
+
")\n",
|
| 1611 |
+
"\n",
|
| 1612 |
+
"\n",
|
| 1613 |
+
"# Load model with optimizations\n",
|
| 1614 |
+
"print(\"Loading model (this may take several minutes)...\")\n",
|
| 1615 |
+
"model = AutoModelForCausalLM.from_pretrained(\n",
|
| 1616 |
+
" MODEL_NAME,\n",
|
| 1617 |
+
" cache_dir=str(CACHE_DIR),\n",
|
| 1618 |
+
" torch_dtype=torch.bfloat16,\n",
|
| 1619 |
+
" quantization_config=quantization_config,\n",
|
| 1620 |
+
" device_map=\"auto\",\n",
|
| 1621 |
+
" trust_remote_code=False\n",
|
| 1622 |
+
")\n",
|
| 1623 |
+
"model.eval()\n",
|
| 1624 |
+
"print(\"✅ Model loaded\")\n",
|
| 1625 |
+
"\n",
|
| 1626 |
+
"# Check VRAM usage\n",
|
| 1627 |
+
"if torch.cuda.is_available():\n",
|
| 1628 |
+
" vram_gb = torch.cuda.max_memory_allocated() / 1024**3\n",
|
| 1629 |
+
" print(f\"VRAM used: {vram_gb:.2f} GB\\n\")\n",
|
| 1630 |
+
"\n",
|
| 1631 |
+
"# === LOAD DATA ===\n",
|
| 1632 |
+
"print(\"Loading raw input CSV...\")\n",
|
| 1633 |
+
"df = pd.read_csv(input_file) # ALWAYS load the full input\n",
|
| 1634 |
+
"print(f\"Loaded {len(df)} rows from input file\")\n",
|
| 1635 |
+
"\n",
|
| 1636 |
+
"# If we have previous annotations, merge them\n",
|
| 1637 |
+
"if output_file.exists():\n",
|
| 1638 |
+
" print(\"Found existing annotations, merging...\")\n",
|
| 1639 |
+
" existing_df = pd.read_csv(output_file)\n",
|
| 1640 |
+
" print(f\"Existing annotations has {len(existing_df)} rows\")\n",
|
| 1641 |
+
" \n",
|
| 1642 |
+
" # Update df with existing annotations\n",
|
| 1643 |
+
" # Only update the columns that were annotated\n",
|
| 1644 |
+
" annotation_cols = ['full_name', 'aliases', 'gender', 'profession_llm', 'country']\n",
|
| 1645 |
+
" for col in annotation_cols:\n",
|
| 1646 |
+
" if col in existing_df.columns:\n",
|
| 1647 |
+
" df[col] = existing_df[col][:len(df)] # Make sure we don't exceed df length\n",
|
| 1648 |
+
" \n",
|
| 1649 |
+
" print(f\"Merged annotations, continuing with {len(df)} total rows\")\n",
|
| 1650 |
+
"\n",
|
| 1651 |
+
"\n",
|
| 1652 |
+
"# Try to load profession mapping files\n",
|
| 1653 |
+
"try:\n",
|
| 1654 |
+
" professions_df = pd.read_csv(professions_file)\n",
|
| 1655 |
+
" print(f\"✅ Loaded professions.csv\")\n",
|
| 1656 |
+
"except FileNotFoundError:\n",
|
| 1657 |
+
" print(\"⚠️ Warning: professions.csv not found\")\n",
|
| 1658 |
+
"\n",
|
| 1659 |
+
"try:\n",
|
| 1660 |
+
" prof_mapped_df = pd.read_csv(professions_mapped_file)\n",
|
| 1661 |
+
" print(f\"✅ Loaded profession mapping with {len(prof_mapped_df)} categories\")\n",
|
| 1662 |
+
"except FileNotFoundError:\n",
|
| 1663 |
+
" print(\"⚠️ Warning: professions_mapped.csv not found, using default categories\")\n",
|
| 1664 |
+
"\n",
|
| 1665 |
+
"profession_str = \", \".join(PROFESSION_CATEGORIES)\n",
|
| 1666 |
+
"\n",
|
| 1667 |
+
"print(f\"Loaded {len(df)} rows\")\n",
|
| 1668 |
+
"print(f\"\\nProfession categories ({len(PROFESSION_CATEGORIES)}):\")\n",
|
| 1669 |
+
"for cat in PROFESSION_CATEGORIES:\n",
|
| 1670 |
+
" print(f\" - {cat}\")\n",
|
| 1671 |
+
"\n",
|
| 1672 |
+
"if TEST_MODE:\n",
|
| 1673 |
+
" print(f\"\\nRunning in TEST MODE with {TEST_SIZE} samples\")\n",
|
| 1674 |
+
" df = df.head(TEST_SIZE).copy()\n",
|
| 1675 |
+
"elif MAX_ROWS:\n",
|
| 1676 |
+
" df = df.head(MAX_ROWS).copy()\n",
|
| 1677 |
+
"\n",
|
| 1678 |
+
"# === CREATE PROMPTS (DEEPSEEK STYLE) ===\n",
|
| 1679 |
+
"def create_prompt(row):\n",
|
| 1680 |
+
" \"\"\"Create prompt for Mistral annotation with specific profession categories.\"\"\"\n",
|
| 1681 |
+
" name = row['real_name'] if pd.notna(row.get('real_name')) else row.get('name', '')\n",
|
| 1682 |
+
" \n",
|
| 1683 |
+
" # Gather hints\n",
|
| 1684 |
+
" hints = []\n",
|
| 1685 |
+
" if pd.notna(row.get('likely_profession')):\n",
|
| 1686 |
+
" hints.append(str(row['likely_profession']))\n",
|
| 1687 |
+
" if pd.notna(row.get('likely_nationality')):\n",
|
| 1688 |
+
" hints.append(str(row['likely_nationality']))\n",
|
| 1689 |
+
" if pd.notna(row.get('likely_country')):\n",
|
| 1690 |
+
" hints.append(str(row['likely_country']))\n",
|
| 1691 |
+
" \n",
|
| 1692 |
+
" # Add tags if we don't have enough hints\n",
|
| 1693 |
+
" if len(hints) < 3:\n",
|
| 1694 |
+
" for i in range(1, 8):\n",
|
| 1695 |
+
" tag_col = f'tag_{i}'\n",
|
| 1696 |
+
" if tag_col in row and pd.notna(row[tag_col]):\n",
|
| 1697 |
+
" tag_val = str(row[tag_col])\n",
|
| 1698 |
+
" if tag_val not in hints:\n",
|
| 1699 |
+
" hints.append(tag_val)\n",
|
| 1700 |
+
" if len(hints) >= 5:\n",
|
| 1701 |
+
" break\n",
|
| 1702 |
+
" \n",
|
| 1703 |
+
" hint_text = \", \".join(hints[:5]) if hints else \"none\"\n",
|
| 1704 |
+
" \n",
|
| 1705 |
+
" return f\"\"\"Given '{name}' ({hint_text}), provide:\n",
|
| 1706 |
+
"1. Full legal name (Western order if non-Latin script)\n",
|
| 1707 |
+
"2. Any stage names/aliases (comma separated)\n",
|
| 1708 |
+
"3. Gender (Male/Female/Other/Unknown)\n",
|
| 1709 |
+
"4. Top 3 most likely professions from ONLY these categories:\n",
|
| 1710 |
+
" - actor\n",
|
| 1711 |
+
" - adult performer\n",
|
| 1712 |
+
" - singer/musician\n",
|
| 1713 |
+
" - model\n",
|
| 1714 |
+
" - online personality (includes streamers, cosplayers, influencers)\n",
|
| 1715 |
+
" - public figure (includes politicians, activists, journalists, authors)\n",
|
| 1716 |
+
" - voice actor/ASMR\n",
|
| 1717 |
+
" - sports professional\n",
|
| 1718 |
+
" - tv personality (includes hosts, presenters, reality TV)\n",
|
| 1719 |
+
"\n",
|
| 1720 |
+
"5. Primary country associated\n",
|
| 1721 |
+
"\n",
|
| 1722 |
+
"IMPORTANT:\n",
|
| 1723 |
+
"- Choose professions ONLY from the 9 categories above\n",
|
| 1724 |
+
"- Provide up to 3 professions, comma-separated, ordered by relevance\n",
|
| 1725 |
+
"- Be SPECIFIC: choose the most accurate category for each role\n",
|
| 1726 |
+
"- \"online personality\" includes: streamers, cosplayers, YouTubers, influencers, content creators\n",
|
| 1727 |
+
"- Use 'Unknown' when uncertain or for fictional characters/places\n",
|
| 1728 |
+
"- For multi-role people, list all relevant categories (e.g., \"actor, singer/musician, online personality\")\n",
|
| 1729 |
+
"- For country respond with one word only, for example China or Colombia\n",
|
| 1730 |
+
"- actress = actor\n",
|
| 1731 |
+
"\n",
|
| 1732 |
+
"Respond with exactly 5 numbered lines.\"\"\"\n",
|
| 1733 |
+
"\n",
|
| 1734 |
+
"# Create prompts\n",
|
| 1735 |
+
"print(\"\\nCreating prompts...\")\n",
|
| 1736 |
+
"df['prompt'] = df.apply(create_prompt, axis=1)\n",
|
| 1737 |
+
"print(\"✅ Prompts created\")\n",
|
| 1738 |
+
"\n",
|
| 1739 |
+
"# === QUERY MISTRAL LOCAL ===\n",
|
| 1740 |
+
"def query_mistral_local(prompt: str) -> str:\n",
|
| 1741 |
+
" \"\"\"Query Mistral locally via transformers.\"\"\"\n",
|
| 1742 |
+
" try:\n",
|
| 1743 |
+
" # Format as chat message for Mistral\n",
|
| 1744 |
+
" messages = [\n",
|
| 1745 |
+
" {\"role\": \"system\", \"content\": \"You are an assistant that extracts key data on a person based on the name. Respond with exactly 5 numbered lines. For professions, choose ONLY from these categories: actor, adult performer, singer/musician, model, online personality, public figure, voice actor/ASMR, sports professional, tv personality.\"},\n",
|
| 1746 |
+
" {\"role\": \"user\", \"content\": prompt}\n",
|
| 1747 |
+
" ]\n",
|
| 1748 |
+
" \n",
|
| 1749 |
+
" # Tokenize\n",
|
| 1750 |
+
" if hasattr(tokenizer, 'apply_chat_template') and getattr(tokenizer, 'chat_template', None) is not None:\n",
|
| 1751 |
+
" text = tokenizer.apply_chat_template(\n",
|
| 1752 |
+
" messages,\n",
|
| 1753 |
+
" tokenize=False,\n",
|
| 1754 |
+
" add_generation_prompt=True\n",
|
| 1755 |
+
" )\n",
|
| 1756 |
+
" else:\n",
|
| 1757 |
+
" # Fallback for older tokenizers\n",
|
| 1758 |
+
" text = f\"[INST] {prompt} [/INST]\"\n",
|
| 1759 |
+
" \n",
|
| 1760 |
+
" inputs = tokenizer([text], return_tensors=\"pt\", padding=True).to(device)\n",
|
| 1761 |
+
" \n",
|
| 1762 |
+
" # Generate\n",
|
| 1763 |
+
" with torch.no_grad():\n",
|
| 1764 |
+
" outputs = model.generate(\n",
|
| 1765 |
+
" **inputs,\n",
|
| 1766 |
+
" max_new_tokens=512,\n",
|
| 1767 |
+
" temperature=0.05,\n",
|
| 1768 |
+
" do_sample=True,\n",
|
| 1769 |
+
" top_p=0.8,\n",
|
| 1770 |
+
" pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id\n",
|
| 1771 |
+
" )\n",
|
| 1772 |
+
" \n",
|
| 1773 |
+
" # Decode\n",
|
| 1774 |
+
" generated_ids = outputs[0][inputs['input_ids'].shape[1]:]\n",
|
| 1775 |
+
" response = tokenizer.decode(generated_ids, skip_special_tokens=True)\n",
|
| 1776 |
+
" \n",
|
| 1777 |
+
" return response.strip()\n",
|
| 1778 |
+
" \n",
|
| 1779 |
+
" except Exception as e:\n",
|
| 1780 |
+
" print(f\"Generation error: {e}\")\n",
|
| 1781 |
+
" return None\n",
|
| 1782 |
+
"\n",
|
| 1783 |
+
"# === PARSE RESPONSE (DEEPSEEK STYLE) ===\n",
|
| 1784 |
+
"def parse_response(response):\n",
|
| 1785 |
+
" \"\"\"Parse Mistral response into structured fields.\"\"\"\n",
|
| 1786 |
+
" if not response:\n",
|
| 1787 |
+
" return {\n",
|
| 1788 |
+
" 'full_name': 'Unknown',\n",
|
| 1789 |
+
" 'aliases': 'Unknown',\n",
|
| 1790 |
+
" 'gender': 'Unknown',\n",
|
| 1791 |
+
" 'profession_llm': 'Unknown',\n",
|
| 1792 |
+
" 'country': 'Unknown'\n",
|
| 1793 |
+
" }\n",
|
| 1794 |
+
" \n",
|
| 1795 |
+
" # Split into lines and clean\n",
|
| 1796 |
+
" lines = [line.strip() for line in response.split('\\n') if line.strip()]\n",
|
| 1797 |
+
" \n",
|
| 1798 |
+
" # Initialize with Unknown values\n",
|
| 1799 |
+
" fields = {\n",
|
| 1800 |
+
" 'full_name': 'Unknown',\n",
|
| 1801 |
+
" 'aliases': 'Unknown',\n",
|
| 1802 |
+
" 'gender': 'Unknown',\n",
|
| 1803 |
+
" 'profession_llm': 'Unknown',\n",
|
| 1804 |
+
" 'country': 'Unknown'\n",
|
| 1805 |
+
" }\n",
|
| 1806 |
+
" \n",
|
| 1807 |
+
" # Extract information from each numbered line\n",
|
| 1808 |
+
" for line in lines:\n",
|
| 1809 |
+
" if line.startswith('1.'):\n",
|
| 1810 |
+
" fields['full_name'] = line[2:].strip()\n",
|
| 1811 |
+
" elif line.startswith('2.'):\n",
|
| 1812 |
+
" fields['aliases'] = line[2:].strip()\n",
|
| 1813 |
+
" elif line.startswith('3.'):\n",
|
| 1814 |
+
" fields['gender'] = line[2:].strip()\n",
|
| 1815 |
+
" elif line.startswith('4.'):\n",
|
| 1816 |
+
" fields['profession_llm'] = line[2:].strip()\n",
|
| 1817 |
+
" elif line.startswith('5.'):\n",
|
| 1818 |
+
" fields['country'] = line[2:].strip()\n",
|
| 1819 |
+
" \n",
|
| 1820 |
+
" return fields\n",
|
| 1821 |
+
"\n",
|
| 1822 |
+
"# === PROCESS DATA ===\n",
|
| 1823 |
+
"output_file = current_dir.parent / f\"data/CSV/mistral24_local_annotated_POI{'_test' if TEST_MODE else ''}.csv\"\n",
|
| 1824 |
+
"index_file = current_dir.parent / \"misc/query_indicies/mistral24_local_query_index.txt\"\n",
|
| 1825 |
+
"\n",
|
| 1826 |
+
"index_file.parent.mkdir(parents=True, exist_ok=True)\n",
|
| 1827 |
+
"\n",
|
| 1828 |
+
"# Load index\n",
|
| 1829 |
+
"current_index = 0\n",
|
| 1830 |
+
"if index_file.exists():\n",
|
| 1831 |
+
" try:\n",
|
| 1832 |
+
" current_index = int(index_file.read_text().strip())\n",
|
| 1833 |
+
" except (ValueError, OSError):\n",
|
| 1834 |
+
" current_index = 0\n",
|
| 1835 |
+
"\n",
|
| 1836 |
+
"print(f\"Resuming from index {current_index}\")\n",
|
| 1837 |
+
"\n",
|
| 1838 |
+
"start_time = time.time()\n",
|
| 1839 |
+
"\n",
|
| 1840 |
+
"for i in tqdm(range(current_index, len(df)), desc=\"Mistral Local\"):\n",
|
| 1841 |
+
"\n",
|
| 1842 |
+
" prompt = df.at[i, \"prompt\"]\n",
|
| 1843 |
+
"\n",
|
| 1844 |
+
" # -------- MODEL QUERY WITH RETRIES --------\n",
|
| 1845 |
+
" response = None\n",
|
| 1846 |
+
" for attempt in range(3):\n",
|
| 1847 |
+
" response = query_mistral_local(prompt)\n",
|
| 1848 |
+
" \n",
|
| 1849 |
+
" # Valid response?\n",
|
| 1850 |
+
" if response and len(response.strip()) > 10:\n",
|
| 1851 |
+
" break\n",
|
| 1852 |
+
" \n",
|
| 1853 |
+
" print(f\"⚠️ Row {i}: Empty or invalid response, retry {attempt+1}/3\")\n",
|
| 1854 |
+
" time.sleep(0.5)\n",
|
| 1855 |
+
"\n",
|
| 1856 |
+
" # If still invalid → DO NOT overwrite previous data\n",
|
| 1857 |
+
" if not response or len(response.strip()) <= 10:\n",
|
| 1858 |
+
" print(f\"❌ Row {i}: failed after retries; skipping without writing\")\n",
|
| 1859 |
+
" continue\n",
|
| 1860 |
+
"\n",
|
| 1861 |
+
" parsed = parse_response(response)\n",
|
| 1862 |
+
"\n",
|
| 1863 |
+
" # Additional safety: skip rows that parsed as all 'Unknown'\n",
|
| 1864 |
+
" if all(v == \"Unknown\" for v in parsed.values()):\n",
|
| 1865 |
+
" print(f\"❌ Row {i}: parsed as all Unknown (likely model crash); skipping.\")\n",
|
| 1866 |
+
" continue\n",
|
| 1867 |
+
"\n",
|
| 1868 |
+
" # -------- WRITE PARSED FIELDS SAFELY --------\n",
|
| 1869 |
+
" for key, value in parsed.items():\n",
|
| 1870 |
+
" df.at[i, key] = value\n",
|
| 1871 |
+
"\n",
|
| 1872 |
+
" # Advance progress ONLY after successful write\n",
|
| 1873 |
+
" current_index = i + 1\n",
|
| 1874 |
+
"\n",
|
| 1875 |
+
" # -------- GPU MEMORY CLEANUP --------\n",
|
| 1876 |
+
" if torch.cuda.is_available():\n",
|
| 1877 |
+
" torch.cuda.empty_cache()\n",
|
| 1878 |
+
" torch.cuda.synchronize()\n",
|
| 1879 |
+
"\n",
|
| 1880 |
+
" # -------- SAVE PROGRESS PERIODICALLY --------\n",
|
| 1881 |
+
" if (i + 1) % SAVE_INTERVAL == 0 or (i + 1) == len(df):\n",
|
| 1882 |
+
" df.to_csv(output_file, index=False)\n",
|
| 1883 |
+
" with open(index_file, \"w\") as f:\n",
|
| 1884 |
+
" f.write(str(current_index))\n",
|
| 1885 |
+
" print(f\"💾 Progress saved after row {i+1}\")\n",
|
| 1886 |
+
"\n",
|
| 1887 |
+
"# Final save\n",
|
| 1888 |
+
"df.to_csv(output_file, index=False)\n",
|
| 1889 |
+
"index_file.write_text(str(current_index))\n",
|
| 1890 |
+
"print(\"✅ Finished full dataset.\")\n"
|
| 1891 |
+
]
|
| 1892 |
+
},
|
| 1893 |
+
{
|
| 1894 |
+
"cell_type": "code",
|
| 1895 |
+
"execution_count": null,
|
| 1896 |
+
"id": "d7212e75-0ff6-45a0-8695-c4a3d3e02818",
|
| 1897 |
+
"metadata": {},
|
| 1898 |
+
"outputs": [],
|
| 1899 |
+
"source": [
|
| 1900 |
+
"import transformers\n",
|
| 1901 |
+
"print(f\"Transformers version: {transformers.__version__}\")\n",
|
| 1902 |
+
"\n",
|
| 1903 |
+
"# Check if Mistral3 is available\n",
|
| 1904 |
+
"try:\n",
|
| 1905 |
+
" from transformers import Mistral3ForCausalLM\n",
|
| 1906 |
+
" print(\"✅ Mistral3 is available\")\n",
|
| 1907 |
+
"except ImportError:\n",
|
| 1908 |
+
" print(\"❌ Mistral3 not available in this transformers version\")"
|
| 1909 |
+
]
|
| 1910 |
+
},
|
| 1911 |
+
{
|
| 1912 |
+
"cell_type": "code",
|
| 1913 |
+
"execution_count": null,
|
| 1914 |
+
"id": "a6ab032e-246e-4c4e-9776-ff0bfbf6fd9c",
|
| 1915 |
+
"metadata": {},
|
| 1916 |
+
"outputs": [],
|
| 1917 |
+
"source": []
|
| 1918 |
+
}
|
| 1919 |
+
],
|
| 1920 |
+
"metadata": {
|
| 1921 |
+
"kernelspec": {
|
| 1922 |
+
"display_name": "pm-paper",
|
| 1923 |
+
"language": "python",
|
| 1924 |
+
"name": "pm-paper"
|
| 1925 |
+
},
|
| 1926 |
+
"language_info": {
|
| 1927 |
+
"codemirror_mode": {
|
| 1928 |
+
"name": "ipython",
|
| 1929 |
+
"version": 3
|
| 1930 |
+
},
|
| 1931 |
+
"file_extension": ".py",
|
| 1932 |
+
"mimetype": "text/x-python",
|
| 1933 |
+
"name": "python",
|
| 1934 |
+
"nbconvert_exporter": "python",
|
| 1935 |
+
"pygments_lexer": "ipython3",
|
| 1936 |
+
"version": "3.11.13"
|
| 1937 |
+
}
|
| 1938 |
+
},
|
| 1939 |
+
"nbformat": 4,
|
| 1940 |
+
"nbformat_minor": 5
|
| 1941 |
+
}
|
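Note on the annotation loop in the notebook above: a row that fails all retries is passed over with `continue`, and the next successful row sets `current_index = i + 1`, so a resumed run starts after the failed row. The following is a minimal sketch of one alternative, not the authors' method. It assumes a hypothetical `annotate` callable standing in for `query_mistral_local` plus `parse_response`, and a hypothetical JSON checkpoint file in place of the plain-text index file. Failed row indices are stored explicitly so a later run can retry them.

```python
import json
import tempfile
from pathlib import Path


def run_resumable(rows, annotate, checkpoint_path, save_interval=2, max_attempts=3):
    """Annotate rows, keeping results and failed indices in a JSON checkpoint.

    `annotate` is a hypothetical stand-in for the model query plus parsing
    step; it should return a dict on success and None on failure.
    """
    checkpoint_path = Path(checkpoint_path)
    state = {"results": {}, "failed": []}
    if checkpoint_path.exists():
        try:
            state = json.loads(checkpoint_path.read_text(encoding="utf-8"))
        except (ValueError, OSError):
            pass  # unreadable checkpoint: start over rather than crash

    def save():
        checkpoint_path.write_text(json.dumps(state), encoding="utf-8")

    failed = set(state["failed"])
    for i, row in enumerate(rows):
        key = str(i)  # JSON object keys are always strings
        if key in state["results"]:
            continue  # already annotated in an earlier run
        parsed = None
        for _ in range(max_attempts):
            parsed = annotate(row)
            if parsed:
                break
        if parsed:
            state["results"][key] = parsed
            failed.discard(i)
        else:
            failed.add(i)
        state["failed"] = sorted(failed)
        # Save on the interval even when the current row failed.
        if (i + 1) % save_interval == 0 or (i + 1) == len(rows):
            save()
    save()
    return state
```

Calling `run_resumable` a second time with the same checkpoint path skips rows that already have results and retries only the rows listed under `failed`.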
jupyter_notebooks/Section_2-3-4_Figure_8_Step_2_response_comparison_and_consensus_extraction.ipynb
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
jupyter_notebooks/Section_2-3-4__Figure_8a_sunburst_gender.ipynb
ADDED
|
@@ -0,0 +1,129 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# Prepare *.json for Figure 8a"
|
| 8 |
+
]
|
| 9 |
+
},
|
| 10 |
+
{
|
| 11 |
+
"cell_type": "code",
|
| 12 |
+
"execution_count": 1,
|
| 13 |
+
"metadata": {},
|
| 14 |
+
"outputs": [
|
| 15 |
+
{
|
| 16 |
+
"name": "stdout",
|
| 17 |
+
"output_type": "stream",
|
| 18 |
+
"text": [
|
| 19 |
+
"✓ Saved 8a.json\n"
|
| 20 |
+
]
|
| 21 |
+
}
|
| 22 |
+
],
|
| 23 |
+
"source": [
|
| 24 |
+
"import pandas as pd\n",
|
| 25 |
+
"from collections import defaultdict\n",
|
| 26 |
+
"import json\n",
|
| 27 |
+
"from pathlib import Path \n",
|
| 28 |
+
"\n",
|
| 29 |
+
"current_dir = Path.cwd()\n",
|
| 30 |
+
"sunburst_json = current_dir.parent / \"public/json/8a.json\"\n",
|
| 31 |
+
"\n",
|
| 32 |
+
"# Load consensus CSV\n",
|
| 33 |
+
"consensus_file = current_dir.parent / \"data/CSV/analyzed_llm_agreement_consensus.csv\"\n",
|
| 34 |
+
"df = pd.read_csv(consensus_file)\n",
|
| 35 |
+
"\n",
|
| 36 |
+
"# ---- Normalize Gender (group Non-binary and Unknown into 'Other') ----\n",
|
| 37 |
+
"def normalize_gender(g):\n",
|
| 38 |
+
" g = str(g).strip().lower()\n",
|
| 39 |
+
" if g in [\"female\", \"woman\", \"female (group)\", \"female (transgender)\", \"female (virtual persona)\", \"female (group members)\"]:\n",
|
| 40 |
+
" return \"Female\"\n",
|
| 41 |
+
" elif g in [\"male\", \"male (android)\", \"male (character)\"]:\n",
|
| 42 |
+
" return \"Male\"\n",
|
| 43 |
+
" else:\n",
|
| 44 |
+
" return \"Other\"\n",
|
| 45 |
+
"\n",
|
| 46 |
+
"df['gender_normalized'] = df['consensus_gender'].apply(normalize_gender)\n",
|
| 47 |
+
"\n",
|
| 48 |
+
"# ---- Step 1: Limit to top 12 countries ----\n",
|
| 49 |
+
"df['country_cleaned'] = df['consensus_country'].apply(lambda x: x if pd.notna(x) and x not in ['Unknown', ''] else 'Other')\n",
|
| 50 |
+
"top_countries = df['country_cleaned'].value_counts().nlargest(12).index.tolist()\n",
|
| 51 |
+
"df['country_limited'] = df['country_cleaned'].apply(lambda x: x if x in top_countries else 'Other')\n",
|
| 52 |
+
"\n",
|
| 53 |
+
"# ---- Step 2: Normalize and limit professions ----\n",
|
| 54 |
+
"valid_categories = [\n",
|
| 55 |
+
" \"actor\", \"adult performer\", \"singer/musician\", \"model\",\n",
|
| 56 |
+
" \"online personality\", \"tv personality\", \"voice actor/asmr\", \"public figure\", \"sports professional\"\n",
|
| 57 |
+
"]\n",
|
| 58 |
+
"\n",
|
| 59 |
+
"def remap_profession(profession):\n",
|
| 60 |
+
" profession_lower = str(profession).strip().lower()\n",
|
| 61 |
+
" if profession_lower == 'fictional character':\n",
|
| 62 |
+
" return 'actor'\n",
|
| 63 |
+
" elif profession_lower in ['voice actor', 'voice actor/asmr']:\n",
|
| 64 |
+
" return 'voice actor/ASMR'\n",
|
| 65 |
+
" elif profession_lower == 'unknown' or profession_lower not in valid_categories:\n",
|
| 66 |
+
" return 'Other'\n",
|
| 67 |
+
" return profession_lower\n",
|
| 68 |
+
"\n",
|
| 69 |
+
"df['profession_limited'] = df['consensus_primary_profession'].apply(remap_profession)\n",
|
| 70 |
+
"\n",
|
| 71 |
+
"# ---- Step 3: Group by gender and profession ----\n",
|
| 72 |
+
"sunburst_data = df.groupby(['gender_normalized', 'profession_limited']).size().reset_index(name='count')\n",
|
| 73 |
+
"\n",
|
| 74 |
+
"# ---- Step 4: Create nested structure for D3.js ----\n",
|
| 75 |
+
"sunburst_dict = {\"name\": \"root\", \"children\": []}\n",
|
| 76 |
+
"gender_map = defaultdict(list)\n",
|
| 77 |
+
"\n",
|
| 78 |
+
"for _, row in sunburst_data.iterrows():\n",
|
| 79 |
+
" gender = row['gender_normalized']\n",
|
| 80 |
+
" profession = row['profession_limited']\n",
|
| 81 |
+
" count = int(row['count'])\n",
|
| 82 |
+
" gender_map[gender].append({\"name\": profession, \"value\": count})\n",
|
| 83 |
+
"\n",
|
| 84 |
+
"# Sort 'Other' professions to appear last\n",
|
| 85 |
+
"for gender, professions in gender_map.items():\n",
|
| 86 |
+
" professions_sorted = sorted(professions, key=lambda d: (d[\"name\"] == \"Other\", d[\"name\"]))\n",
|
| 87 |
+
" gender_map[gender] = professions_sorted\n",
|
| 88 |
+
"\n",
|
| 89 |
+
"# Build the final nested structure\n",
|
| 90 |
+
"for gender, professions in gender_map.items():\n",
|
| 91 |
+
" sunburst_dict[\"children\"].append({\"name\": gender, \"children\": professions})\n",
|
| 92 |
+
"\n",
|
| 93 |
+
"# ---- Step 5: Save to a JSON file ----\n",
|
| 94 |
+
"with open(sunburst_json, \"w\", encoding='utf-8') as f:\n",
|
| 95 |
+
" json.dump(sunburst_dict, f, ensure_ascii=False, indent=2)\n",
|
| 96 |
+
"\n",
|
| 97 |
+
"print(\"✓ Saved 8a.json\")\n"
|
| 98 |
+
]
|
| 99 |
+
},
|
| 100 |
+
{
|
| 101 |
+
"cell_type": "code",
|
| 102 |
+
"execution_count": null,
|
| 103 |
+
"metadata": {},
|
| 104 |
+
"outputs": [],
|
| 105 |
+
"source": []
|
| 106 |
+
}
|
| 107 |
+
],
|
| 108 |
+
"metadata": {
|
| 109 |
+
"kernelspec": {
|
| 110 |
+
"display_name": "latm",
|
| 111 |
+
"language": "python",
|
| 112 |
+
"name": "python3"
|
| 113 |
+
},
|
| 114 |
+
"language_info": {
|
| 115 |
+
"codemirror_mode": {
|
| 116 |
+
"name": "ipython",
|
| 117 |
+
"version": 3
|
| 118 |
+
},
|
| 119 |
+
"file_extension": ".py",
|
| 120 |
+
"mimetype": "text/x-python",
|
| 121 |
+
"name": "python",
|
| 122 |
+
"nbconvert_exporter": "python",
|
| 123 |
+
"pygments_lexer": "ipython3",
|
| 124 |
+
"version": "3.10.15"
|
| 125 |
+
}
|
| 126 |
+
},
|
| 127 |
+
"nbformat": 4,
|
| 128 |
+
"nbformat_minor": 2
|
| 129 |
+
}
|
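Note on Steps 3 to 5 of the notebook above: they count (gender, profession) pairs and nest them under a root node, producing the `name`/`children`/`value` hierarchy that the notebook saves for D3.js. The following is a minimal sketch of that nesting using only the standard library. The function name `build_sunburst` and the sample records are hypothetical and stand in for the pandas `groupby` on the consensus CSV.

```python
from collections import Counter, defaultdict


def build_sunburst(pairs, last="Other"):
    """Nest (parent, child) pairs into a name/children/value hierarchy."""
    counts = Counter(pairs)
    by_parent = defaultdict(list)
    for (parent, child), n in counts.items():
        by_parent[parent].append({"name": child, "value": n})

    children = []
    for parent in sorted(by_parent):
        # Alphabetical order, with the catch-all bucket sorted to the end.
        leaves = sorted(by_parent[parent], key=lambda d: (d["name"] == last, d["name"]))
        children.append({"name": parent, "children": leaves})
    return {"name": "root", "children": children}


# Hypothetical sample records, not taken from the dataset.
records = [
    ("Female", "actor"),
    ("Female", "Other"),
    ("Female", "actor"),
    ("Male", "model"),
]
tree = build_sunburst(records)
```

The sort key `(name == last, name)` is the same one the notebook uses to push the "Other" profession to the end of each ring.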
jupyter_notebooks/Section_2-3-4__Figure_8b_sunburst_profession.ipynb
ADDED
|
@@ -0,0 +1,332 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# Prepare *.json for Figure 8b"
|
| 8 |
+
]
|
| 9 |
+
},
|
| 10 |
+
{
|
| 11 |
+
"cell_type": "code",
|
| 12 |
+
"execution_count": 9,
|
| 13 |
+
"metadata": {},
|
| 14 |
+
"outputs": [
|
| 15 |
+
{
|
| 16 |
+
"name": "stdout",
|
| 17 |
+
"output_type": "stream",
|
| 18 |
+
"text": [
|
| 19 |
+
"✓ Saved sunburst_countries_A.json\n"
|
| 20 |
+
]
|
| 21 |
+
}
|
| 22 |
+
],
|
| 23 |
+
"source": [
|
| 24 |
+
"import pandas as pd\n",
|
| 25 |
+
"from collections import defaultdict\n",
|
| 26 |
+
"import json\n",
|
| 27 |
+
"from pathlib import Path\n",
|
| 28 |
+
"\n",
|
| 29 |
+
"current_dir = Path.cwd()\n",
|
| 30 |
+
"\n",
|
| 31 |
+
"sunburst_path = current_dir.parent / \"public/json/sunburst_countries_A.json\"\n",
|
| 32 |
+
"\n",
|
| 33 |
+
"# Load consensus CSV\n",
|
| 34 |
+
"consensus_file = current_dir.parent / \"data/CSV/analyzed_llm_agreement_consensus_va.csv\"\n",
|
| 35 |
+
"df = pd.read_csv(consensus_file)\n",
|
| 36 |
+
"\n",
|
| 37 |
+
"# ============================================================\n",
|
| 38 |
+
"# 1. NORMALIZE COUNTRIES\n",
|
| 39 |
+
"# ============================================================\n",
|
| 40 |
+
"\n",
|
| 41 |
+
"def normalize_country(x: str):\n",
|
| 42 |
+
" if not isinstance(x, str) or x.strip() == \"\" or x.lower() == \"unknown\":\n",
|
| 43 |
+
" return \"Other\"\n",
|
| 44 |
+
" x = x.strip()\n",
|
| 45 |
+
"\n",
|
| 46 |
+
" # Ensure matching with your HTML manualOrder\n",
|
| 47 |
+
" replacements = {\n",
|
| 48 |
+
" \"USA\": \"United States\",\n",
|
| 49 |
+
" \"US\": \"United States\",\n",
|
| 50 |
+
" \"U.S.\": \"United States\",\n",
|
| 51 |
+
" \"UK\": \"United Kingdom\",\n",
|
| 52 |
+
" \"U.K.\": \"United Kingdom\"\n",
|
| 53 |
+
" }\n",
|
| 54 |
+
" return replacements.get(x, x)\n",
|
| 55 |
+
"\n",
|
| 56 |
+
"df[\"country_clean\"] = df[\"consensus_country\"].apply(normalize_country)\n",
|
| 57 |
+
"\n",
|
| 58 |
+
"# Limit to the top 25 countries, merge the rest into \"Other\"\n",
|
| 59 |
+
"top_countries = df[\"country_clean\"].value_counts().nlargest(25).index.tolist()\n",
|
| 60 |
+
"\n",
|
| 61 |
+
"df[\"country_limited\"] = df[\"country_clean\"].apply(\n",
|
| 62 |
+
" lambda x: x if x in top_countries else \"Other\"\n",
|
| 63 |
+
")\n",
|
| 64 |
+
"\n",
|
| 65 |
+
"# ============================================================\n",
|
| 66 |
+
"# 2. NORMALIZE PROFESSIONS\n",
|
| 67 |
+
"# ============================================================\n",
|
| 68 |
+
"\n",
|
| 69 |
+
"def normalize_profession(x: str):\n",
|
| 70 |
+
" if not isinstance(x, str) or x.strip() == \"\" or x.lower() == \"unknown\":\n",
|
| 71 |
+
" return \"Other\"\n",
|
| 72 |
+
" x = x.strip().lower()\n",
|
| 73 |
+
" \n",
|
| 74 |
+
" mapping = {\n",
|
| 75 |
+
" \"actor\": \"Actor\",\n",
|
| 76 |
+
" \"model\": \"Model\",\n",
|
| 77 |
+
" \"adult performer\": \"Adult Performer\",\n",
|
| 78 |
+
" \"singer/musician\": \"Singer, Musician\",\n",
|
| 79 |
+
" \"online personality\": \"Online Personality\",\n",
|
| 80 |
+
" \"sports professional\": \"Sports Professional\",\n",
|
| 81 |
+
" \"voice actor/asmr\": \"Voice Actor\", # ← fixed key\n",
|
| 82 |
+
" \"public figure\": \"Public Figure\", # ← now its own category\n",
|
| 83 |
+
" \"tv personality\": \"Other\",\n",
|
| 84 |
+
" }\n",
|
| 85 |
+
" return mapping.get(x, \"Other\")\n",
|
| 86 |
+
"\n",
|
| 87 |
+
"df[\"profession_clean\"] = df[\"consensus_primary_profession\"].apply(normalize_profession)\n",
|
| 88 |
+
"\n",
|
| 89 |
+
"\n",
|
| 90 |
+
"df[\"profession_limited\"] = df[\"profession_clean\"]\n",
|
| 91 |
+
"\n",
|
| 92 |
+
"\n",
|
| 93 |
+
"# ============================================================\n",
|
| 94 |
+
"# 3. GROUP INTO SUNBURST STRUCTURE\n",
|
| 95 |
+
"# ============================================================\n",
|
| 96 |
+
"\n",
|
| 97 |
+
"sunburst_data = (\n",
|
| 98 |
+
" df.groupby([\"country_limited\", \"profession_limited\"])\n",
|
| 99 |
+
" .size()\n",
|
| 100 |
+
" .reset_index(name=\"count\")\n",
|
| 101 |
+
")\n",
|
| 102 |
+
"\n",
|
| 103 |
+
"sunburst_dict = {\"name\": \"root\", \"children\": []}\n",
|
| 104 |
+
"country_map = defaultdict(list)\n",
|
| 105 |
+
"\n",
|
| 106 |
+
"for _, row in sunburst_data.iterrows():\n",
|
| 107 |
+
" c = row[\"country_limited\"]\n",
|
| 108 |
+
" p = row[\"profession_limited\"]\n",
|
| 109 |
+
" v = int(row[\"count\"])\n",
|
| 110 |
+
"\n",
|
| 111 |
+
" country_map[c].append({\"name\": p, \"value\": v})\n",
|
| 112 |
+
"\n",
|
| 113 |
+
"# Sort professions inside each country so \"Other\" is last\n",
|
| 114 |
+
"for c, profs in country_map.items():\n",
|
| 115 |
+
" profs_sorted = sorted(profs, key=lambda d: (d[\"name\"] == \"Other\", d[\"name\"]))\n",
|
| 116 |
+
" country_map[c] = profs_sorted\n",
|
| 117 |
+
"\n",
|
| 118 |
+
"# Calculate total datapoints per country and sort\n",
|
| 119 |
+
"country_totals = []\n",
|
| 120 |
+
"for c, profs in country_map.items():\n",
|
| 121 |
+
" total = sum(p[\"value\"] for p in profs)\n",
|
| 122 |
+
" country_totals.append((c, total, profs))\n",
|
| 123 |
+
"\n",
|
| 124 |
+
"# Sort by total (descending), but put \"Other\" last\n",
|
| 125 |
+
"country_totals.sort(key=lambda x: (x[0] == \"Other\", -x[1]))\n",
|
| 126 |
+
"\n",
|
| 127 |
+
"# Build final JSON with sorted countries\n",
|
| 128 |
+
"for c, total, profs in country_totals:\n",
|
| 129 |
+
" sunburst_dict[\"children\"].append({\"name\": c, \"children\": profs})\n",
|
| 130 |
+
"\n",
|
| 131 |
+
"# ============================================================\n",
|
| 132 |
+
"# 4. SAVE JSON\n",
|
| 133 |
+
"# ============================================================\n",
|
| 134 |
+
"\n",
|
| 135 |
+
"with open(sunburst_path, \"w\", encoding=\"utf-8\") as f:\n",
|
| 136 |
+
" json.dump(sunburst_dict, f, ensure_ascii=False, indent=2)\n",
|
| 137 |
+
"\n",
|
| 138 |
+
"print(\"✓ Saved sunburst_countries_A.json\")"
|
| 139 |
+
]
|
| 140 |
+
},
|
| 141 |
+
{
|
| 142 |
+
"cell_type": "markdown",
|
| 143 |
+
"metadata": {},
|
| 144 |
+
"source": [
|
| 145 |
+
"# Version that only considers data up to December 31, 2024"
|
| 146 |
+
]
|
| 147 |
+
},
|
| 148 |
+
{
|
| 149 |
+
"cell_type": "code",
|
| 150 |
+
"execution_count": 10,
|
| 151 |
+
"metadata": {},
|
| 152 |
+
"outputs": [
|
| 153 |
+
{
|
| 154 |
+
"name": "stdout",
|
| 155 |
+
"output_type": "stream",
|
| 156 |
+
"text": [
|
| 157 |
+
"✓ Filtered to 16059 records published on or before December 31, 2024\n",
|
| 158 |
+
"✓ Saved sunburst_countries_A.json (2024 data only)\n"
|
| 159 |
+
]
|
| 160 |
+
}
|
| 161 |
+
],
|
| 162 |
+
"source": [
|
| 163 |
+
"import pandas as pd\n",
|
| 164 |
+
"from collections import defaultdict\n",
|
| 165 |
+
"import json\n",
|
| 166 |
+
"from pathlib import Path\n",
|
| 167 |
+
"\n",
|
| 168 |
+
"current_dir = Path.cwd()\n",
|
| 169 |
+
"sunburst_path = current_dir.parent / \"public/json/sunburst_countries_A.json\"\n",
|
| 170 |
+
"\n",
|
| 171 |
+
"# Load consensus CSV\n",
|
| 172 |
+
"consensus_file = current_dir.parent / \"data/CSV/analyzed_llm_agreement_consensus_va.csv\"\n",
|
| 173 |
+
"df = pd.read_csv(consensus_file)\n",
|
| 174 |
+
"\n",
|
| 175 |
+
"# ============================================================\n",
|
| 176 |
+
"# FILTER DATA UP TO DECEMBER 31, 2024\n",
|
| 177 |
+
"# ============================================================\n",
|
| 178 |
+
"# Convert publishedAt to datetime\n",
|
| 179 |
+
"df[\"publishedAt\"] = pd.to_datetime(df[\"publishedAt\"], errors=\"coerce\", utc=True)\n",
|
| 180 |
+
"\n",
|
| 181 |
+
"# Filter to only include data up to December 31, 2024\n",
|
| 182 |
+
"# Make cutoff_date timezone-aware (UTC) to match publishedAt\n",
|
| 183 |
+
"cutoff_date = pd.Timestamp(\"2024-12-31 23:59:59\", tz=\"UTC\")\n",
|
| 184 |
+
"df = df[df[\"publishedAt\"] <= cutoff_date]\n",
|
| 185 |
+
"\n",
|
| 186 |
+
"print(f\"✓ Filtered to {len(df)} records published on or before December 31, 2024\")\n",
|
| 187 |
+
"\n",
|
| 188 |
+
"# ============================================================\n",
|
| 189 |
+
"# 1. NORMALIZE COUNTRIES\n",
|
| 190 |
+
"# ============================================================\n",
|
| 191 |
+
"def normalize_country(x: str):\n",
|
| 192 |
+
" if not isinstance(x, str) or x.strip() == \"\" or x.lower() == \"unknown\":\n",
|
| 193 |
+
" return \"Other\"\n",
|
| 194 |
+
" x = x.strip()\n",
|
| 195 |
+
" # Ensure matching with your HTML manualOrder\n",
|
| 196 |
+
" replacements = {\n",
|
| 197 |
+
" \"USA\": \"United States\",\n",
|
| 198 |
+
" \"US\": \"United States\",\n",
|
| 199 |
+
" \"U.S.\": \"United States\",\n",
|
| 200 |
+
" \"UK\": \"United Kingdom\",\n",
|
| 201 |
+
" \"U.K.\": \"United Kingdom\"\n",
|
| 202 |
+
" }\n",
|
| 203 |
+
" return replacements.get(x, x)\n",
|
| 204 |
+
"\n",
|
| 205 |
+
"df[\"country_clean\"] = df[\"consensus_country\"].apply(normalize_country)\n",
|
| 206 |
+
"\n",
|
| 207 |
+
"# Limit to the top 25 countries, merge the rest into \"Other\"\n",
|
| 208 |
+
"top_countries = df[\"country_clean\"].value_counts().nlargest(25).index.tolist()\n",
|
| 209 |
+
"df[\"country_limited\"] = df[\"country_clean\"].apply(\n",
|
| 210 |
+
" lambda x: x if x in top_countries else \"Other\"\n",
|
| 211 |
+
")\n",
|
| 212 |
+
"\n",
|
| 213 |
+
"# ============================================================\n",
|
| 214 |
+
"# 2. NORMALIZE PROFESSIONS\n",
|
| 215 |
+
"# ============================================================\n",
|
| 216 |
+
"def normalize_profession(x: str):\n",
|
| 217 |
+
" if not isinstance(x, str) or x.strip() == \"\" or x.lower() == \"unknown\":\n",
|
| 218 |
+
" return \"Other\"\n",
|
| 219 |
+
" x = x.strip().lower()\n",
|
| 220 |
+
" mapping = {\n",
|
| 221 |
+
" \"actor\": \"Actor\",\n",
|
| 222 |
+
" \"model\": \"Model\",\n",
|
| 223 |
+
" \"adult performer\": \"Adult Performer\",\n",
|
| 224 |
+
" \"singer/musician\": \"Singer, Musician\",\n",
|
| 225 |
+
" \"online personality\": \"Online Personality\",\n",
|
| 226 |
+
" \"sports professional\": \"Sports Professional\",\n",
|
| 227 |
+
" \"voice actor/asmr\": \"Voice Actor\",\n",
|
| 228 |
+
" \"public figure\": \"Public Figure\",\n",
|
| 229 |
+
" \"tv personality\": \"Other\",\n",
|
| 230 |
+
" }\n",
|
| 231 |
+
" return mapping.get(x, \"Other\")\n",
|
| 232 |
+
"\n",
|
| 233 |
+
"df[\"profession_clean\"] = df[\"consensus_primary_profession\"].apply(normalize_profession)\n",
|
| 234 |
+
"\n",
|
| 235 |
+
"# Define top professions to keep\n",
|
| 236 |
+
"top_prof = [\n",
|
| 237 |
+
" \"Actor\",\n",
|
| 238 |
+
" \"Model\", \n",
|
| 239 |
+
" \"Adult Performer\",\n",
|
| 240 |
+
" \"Singer, Musician\",\n",
|
| 241 |
+
" \"Online Personality\",\n",
|
| 242 |
+
" \"Sports Professional\",\n",
|
| 243 |
+
" \"Voice Actor\",\n",
|
| 244 |
+
" \"Public Figure\",\n",
|
| 245 |
+
" \"Other\"\n",
|
| 246 |
+
"]\n",
|
| 247 |
+
"\n",
|
| 248 |
+
"# Re-limit professions, everything else → Other\n",
|
| 249 |
+
"df[\"profession_limited\"] = df[\"profession_clean\"].apply(\n",
|
| 250 |
+
" lambda x: x if x in top_prof else \"Other\"\n",
|
| 251 |
+
")\n",
|
| 252 |
+
"\n",
|
| 253 |
+
"# ============================================================\n",
|
| 254 |
+
"# 3. GROUP INTO SUNBURST STRUCTURE\n",
|
| 255 |
+
"# ============================================================\n",
|
| 256 |
+
"sunburst_data = (\n",
|
| 257 |
+
" df.groupby([\"country_limited\", \"profession_limited\"])\n",
|
| 258 |
+
" .size()\n",
|
| 259 |
+
" .reset_index(name=\"count\")\n",
|
| 260 |
+
")\n",
|
| 261 |
+
"\n",
|
| 262 |
+
"sunburst_dict = {\"name\": \"root\", \"children\": []}\n",
|
| 263 |
+
"country_map = defaultdict(list)\n",
|
| 264 |
+
"\n",
|
| 265 |
+
"for _, row in sunburst_data.iterrows():\n",
|
| 266 |
+
" c = row[\"country_limited\"]\n",
|
| 267 |
+
" p = row[\"profession_limited\"]\n",
|
| 268 |
+
" v = int(row[\"count\"])\n",
|
| 269 |
+
" country_map[c].append({\"name\": p, \"value\": v})\n",
|
| 270 |
+
"\n",
|
| 271 |
+
"# Sort professions inside each country so \"Other\" is last\n",
|
| 272 |
+
"for c, profs in country_map.items():\n",
|
| 273 |
+
" profs_sorted = sorted(profs, key=lambda d: (d[\"name\"] == \"Other\", d[\"name\"]))\n",
|
| 274 |
+
" country_map[c] = profs_sorted\n",
|
| 275 |
+
"\n",
|
| 276 |
+
"# Calculate total datapoints per country and sort\n",
|
| 277 |
+
"country_totals = []\n",
|
| 278 |
+
"for c, profs in country_map.items():\n",
|
| 279 |
+
" total = sum(p[\"value\"] for p in profs)\n",
|
| 280 |
+
" country_totals.append((c, total, profs))\n",
|
| 281 |
+
"\n",
|
| 282 |
+
"# Sort by total (descending), but put \"Other\" last\n",
|
| 283 |
+
"country_totals.sort(key=lambda x: (x[0] == \"Other\", -x[1]))\n",
|
| 284 |
+
"\n",
|
| 285 |
+
"# Build final JSON with sorted countries\n",
|
| 286 |
+
"for c, total, profs in country_totals:\n",
|
| 287 |
+
" sunburst_dict[\"children\"].append({\"name\": c, \"children\": profs})\n",
|
| 288 |
+
"\n",
|
| 289 |
+
"# ============================================================\n",
|
| 290 |
+
"# 4. SAVE JSON\n",
|
| 291 |
+
"# ============================================================\n",
|
| 292 |
+
"with open(sunburst_path, \"w\", encoding=\"utf-8\") as f:\n",
|
| 293 |
+
" json.dump(sunburst_dict, f, ensure_ascii=False, indent=2)\n",
|
| 294 |
+
"\n",
|
| 295 |
+
"print(\"✓ Saved sunburst_countries_A.json (2024 data only)\")"
|
| 296 |
+
]
|
| 297 |
+
},
|
| 298 |
+
{
|
| 299 |
+
"cell_type": "markdown",
|
| 300 |
+
"metadata": {},
|
| 301 |
+
"source": [
|
| 302 |
+
"The resulting *.json is the input for Figure_8.html."
|
| 303 |
+
]
|
| 304 |
+
},
|
| 305 |
+
{
|
| 306 |
+
"cell_type": "markdown",
|
| 307 |
+
"metadata": {},
|
| 308 |
+
"source": []
|
| 309 |
+
}
|
| 310 |
+
],
|
| 311 |
+
"metadata": {
|
| 312 |
+
"kernelspec": {
|
| 313 |
+
"display_name": "latm",
|
| 314 |
+
"language": "python",
|
| 315 |
+
"name": "python3"
|
| 316 |
+
},
|
| 317 |
+
"language_info": {
|
| 318 |
+
"codemirror_mode": {
|
| 319 |
+
"name": "ipython",
|
| 320 |
+
"version": 3
|
| 321 |
+
},
|
| 322 |
+
"file_extension": ".py",
|
| 323 |
+
"mimetype": "text/x-python",
|
| 324 |
+
"name": "python",
|
| 325 |
+
"nbconvert_exporter": "python",
|
| 326 |
+
"pygments_lexer": "ipython3",
|
| 327 |
+
"version": "3.10.15"
|
| 328 |
+
}
|
| 329 |
+
},
|
| 330 |
+
"nbformat": 4,
|
| 331 |
+
"nbformat_minor": 4
|
| 332 |
+
}
|
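Note on the country step in the notebook above: both cells keep the most frequent countries and fold every other country into "Other". The following is a minimal sketch of that bucketing with the cutoff as an explicit parameter, using only the standard library. The function name `limit_to_top` and the sample list are hypothetical, and tie-breaking between equally frequent labels may differ from pandas `nlargest`.

```python
from collections import Counter


def limit_to_top(values, n, other="Other"):
    """Keep the n most frequent labels and map every other label to `other`.

    Labels already equal to `other` take part in the ranking, as they do in
    the notebook's `value_counts().nlargest(...)` call.
    """
    top = {label for label, _ in Counter(values).most_common(n)}
    return [v if v in top else other for v in values]


# Hypothetical sample values, not taken from the dataset.
countries = ["United States", "Japan", "United States", "Brazil", "Japan", "Chile"]
limited = limit_to_top(countries, n=2)
```

Passing the cutoff as `n` keeps the number in one place, so a comment describing the limit cannot drift away from the value the code uses.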
jupyter_notebooks/Section_2-3_Figure_5_co-occurence_promotional_tags.ipynb
ADDED
|
@@ -0,0 +1,314 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# Create *.json for Figure 5 (co-occurrence network of tags)"
|
| 8 |
+
]
|
| 9 |
+
},
|
| 10 |
+
{
|
| 11 |
+
"cell_type": "code",
|
| 12 |
+
"execution_count": 5,
|
| 13 |
+
"metadata": {},
|
| 14 |
+
"outputs": [
|
| 15 |
+
{
|
| 16 |
+
"name": "stdout",
|
| 17 |
+
"output_type": "stream",
|
| 18 |
+
"text": [
|
| 19 |
+
"Processing: america\n",
|
| 20 |
+
" ✅ Saved to /home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/public/json/tags_america.json\n"
|
| 21 |
+
]
|
| 22 |
+
}
|
| 23 |
+
],
|
| 24 |
+
"source": [
|
| 25 |
+
"import pandas as pd\n",
|
| 26 |
+
"from itertools import combinations\n",
|
| 27 |
+
"from collections import Counter, defaultdict\n",
|
| 28 |
+
"import json\n",
|
| 29 |
+
"import re\n",
|
| 30 |
+
"import os\n",
|
| 31 |
+
"\n",
|
| 32 |
+
"from pathlib import Path\n",
|
| 33 |
+
"current_dir = Path.cwd()\n",
|
| 34 |
+
"\n",
|
| 35 |
+
"# === CONFIG ===\n",
|
| 36 |
+
"file_path = current_dir.parent / \"data/CSV/Models/Civi_models.csv\"\n",
|
| 37 |
+
"output_dir = current_dir.parent / \"public/json/\"\n",
|
| 38 |
+
"#target_terms = [\"asian\", \"indian\", \"man\", \"woman\", \"german\", \"korean\", \"american\", \"russian\", \"style\", \"japanese\", \"chinese\"] # Add any tags you want to process\n",
|
| 39 |
+
"#target_terms = [\"character\", \"instagram\", \"youtuber\", \"actor\", \"actress\", \"celebrity\", \"vtuber\", \"kpop\"] # Add any tags you want to process\n",
|
| 40 |
+
"target_terms = [\"america\"] # Add any tags you want to process\n",
|
| 41 |
+
"min_connections = 1 # minimum number of link connections per node\n",
|
| 42 |
+
"\n",
|
| 43 |
+
"# === LOAD DATA ===\n",
|
| 44 |
+
"df = pd.read_csv(file_path)\n",
|
| 45 |
+
"tag_columns = [f\"tag_{i}\" for i in range(1, 8)]\n",
|
| 46 |
+
"df_tags = df[tag_columns]\n",
|
| 47 |
+
"\n",
|
| 48 |
+
"# === MAIN LOOP ===\n",
|
| 49 |
+
"for target_term in target_terms:\n",
|
| 50 |
+
" print(f\"Processing: {target_term}\")\n",
|
| 51 |
+
" \n",
|
| 52 |
+
" pattern = re.compile(rf'\\b{re.escape(target_term)}\\b', flags=re.IGNORECASE)\n",
|
| 53 |
+
" df_filtered = df_tags[df_tags.apply(\n",
|
| 54 |
+
" lambda row: row.astype(str).apply(lambda x: bool(pattern.search(x))).any(),\n",
|
| 55 |
+
" axis=1\n",
|
| 56 |
+
" )]\n",
|
| 57 |
+
"\n",
|
| 58 |
+
" # Skip if no data matches\n",
|
| 59 |
+
" if df_filtered.empty:\n",
|
| 60 |
+
" print(f\" ⚠️ No matches for '{target_term}', skipping.\")\n",
|
| 61 |
+
" continue\n",
|
| 62 |
+
"\n",
|
| 63 |
+
" # === COUNT INDIVIDUAL TAGS ===\n",
|
| 64 |
+
" all_tags = df_filtered.values.flatten()\n",
|
| 65 |
+
" all_tags = [tag for tag in all_tags if pd.notna(tag)]\n",
|
| 66 |
+
" tag_counts = Counter(all_tags)\n",
|
| 67 |
+
"\n",
|
| 68 |
+
" # === CO-OCCURRENCE ===\n",
|
| 69 |
+
" co_occurrences = defaultdict(int)\n",
|
| 70 |
+
" for tags in df_filtered.itertuples(index=False, name=None):\n",
|
| 71 |
+
" tags = [tag for tag in tags if pd.notna(tag)]\n",
|
| 72 |
+
" for tag1, tag2 in combinations(tags, 2):\n",
|
| 73 |
+
" co_occurrences[frozenset([tag1, tag2])] += 1\n",
|
| 74 |
+
"\n",
|
| 75 |
+
" edges = [(*pair, weight) for pair, weight in co_occurrences.items() if len(pair) == 2]\n",
|
| 76 |
+
"\n",
|
| 77 |
+
" # === FILTER BY CONNECTIONS ===\n",
|
| 78 |
+
" connected_tags = Counter()\n",
|
| 79 |
+
" for tag1, tag2, _ in edges:\n",
|
| 80 |
+
" connected_tags[tag1] += 1\n",
|
| 81 |
+
" connected_tags[tag2] += 1\n",
|
| 82 |
+
"\n",
|
| 83 |
+
" nodes = [{\"id\": tag, \"size\": tag_counts[tag]} for tag in tag_counts if connected_tags[tag] >= min_connections]\n",
|
| 84 |
+
" valid_ids = set(node[\"id\"] for node in nodes)\n",
|
| 85 |
+
" links = [{\"source\": tag1, \"target\": tag2, \"value\": weight}\n",
|
| 86 |
+
" for tag1, tag2, weight in edges\n",
|
| 87 |
+
" if tag1 in valid_ids and tag2 in valid_ids]\n",
|
| 88 |
+
"\n",
|
| 89 |
+
" if not nodes or not links:\n",
|
| 90 |
+
" print(f\" ⚠️ Not enough connections for '{target_term}', skipping.\")\n",
|
| 91 |
+
" continue\n",
|
| 92 |
+
"\n",
|
| 93 |
+
" # === EXPORT ===\n",
|
| 94 |
+
" d3_data = {\"nodes\": nodes, \"links\": links}\n",
|
| 95 |
+
" safe_term = re.sub(r'\\W+', '_', target_term.lower())\n",
|
| 96 |
+
" output_file = os.path.join(output_dir, f\"tags_{safe_term}.json\")\n",
|
| 97 |
+
" \n",
|
| 98 |
+
" with open(output_file, \"w\") as f:\n",
|
| 99 |
+
" json.dump(d3_data, f, indent=4)\n",
|
| 100 |
+
" \n",
|
| 101 |
+
" print(f\" ✅ Saved to {output_file}\")\n"
|
| 102 |
+
]
|
| 103 |
+
},
|
| 104 |
+
{
|
| 105 |
+
"cell_type": "markdown",
|
| 106 |
+
"metadata": {},
|
| 107 |
+
"source": [
|
| 108 |
+
"## Different Countries"
|
| 109 |
+
]
|
| 110 |
+
},
|
| 111 |
+
{
|
| 112 |
+
"cell_type": "code",
|
| 113 |
+
"execution_count": 2,
|
| 114 |
+
"metadata": {},
|
| 115 |
+
"outputs": [
|
| 116 |
+
{
|
| 117 |
+
"name": "stderr",
|
| 118 |
+
"output_type": "stream",
|
| 119 |
+
"text": [
|
| 120 |
+
"/tmp/ipykernel_68582/2797381217.py:15: DtypeWarning: Columns (50,51,54,55,56,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,458) have mixed types. Specify dtype option on import or set low_memory=False.\n",
|
| 121 |
+
" df = pd.read_csv(file_path)\n"
|
| 122 |
+
]
|
| 123 |
+
},
|
| 124 |
+
{
|
| 125 |
+
"name": "stdout",
|
| 126 |
+
"output_type": "stream",
|
| 127 |
+
"text": [
|
| 128 |
+
"Processing: united states\n",
|
| 129 |
+
" ⚠️ No matches for 'united states', skipping.\n",
|
| 130 |
+
"Processing: korea\n",
|
| 131 |
+
" ✅ Saved to data/json/tags_korea_poi.json\n",
|
| 132 |
+
"Processing: uk\n",
|
| 133 |
+
" ✅ Saved to data/json/tags_uk_poi.json\n",
|
| 134 |
+
"Processing: russia\n",
|
| 135 |
+
" ✅ Saved to data/json/tags_russia_poi.json\n",
|
| 136 |
+
"Processing: china\n",
|
| 137 |
+
" ✅ Saved to data/json/tags_china_poi.json\n",
|
| 138 |
+
"Processing: canada\n",
|
| 139 |
+
" ✅ Saved to data/json/tags_canada_poi.json\n",
|
| 140 |
+
"Processing: India\n",
|
| 141 |
+
" ✅ Saved to data/json/tags_india_poi.json\n",
|
| 142 |
+
"Processing: germany\n",
|
| 143 |
+
" ✅ Saved to data/json/tags_germany_poi.json\n"
|
| 144 |
+
]
|
| 145 |
+
}
|
| 146 |
+
],
|
| 147 |
+
"source": [
|
| 148 |
+
"import pandas as pd\n",
|
| 149 |
+
"from itertools import combinations\n",
|
| 150 |
+
"from collections import Counter, defaultdict\n",
|
| 151 |
+
"import json\n",
|
| 152 |
+
"import re\n",
|
| 153 |
+
"import os\n",
|
| 154 |
+
"\n",
|
| 155 |
+
"# === CONFIG ===\n",
|
| 156 |
+
"file_path = \"data/model_subsets/all_models_poi_true.csv\"\n",
|
| 157 |
+
"output_dir = \"data/json/\"\n",
|
| 158 |
+
"target_terms = [\"united states\", \"korea\", \"uk\", \"russia\", \"china\", \"canada\", \"India\", \"germany\"] # Add any tags you want to process\n",
|
| 159 |
+
"min_connections = 1 # minimum number of link connections per node\n",
|
| 160 |
+
"\n",
|
| 161 |
+
"# === LOAD DATA ===\n",
|
| 162 |
+
"df = pd.read_csv(file_path)\n",
|
| 163 |
+
"tag_columns = [f\"tag_{i}\" for i in range(1, 8)]\n",
|
| 164 |
+
"df_tags = df[tag_columns]\n",
|
| 165 |
+
"\n",
|
| 166 |
+
"# === MAIN LOOP ===\n",
|
| 167 |
+
"for target_term in target_terms:\n",
|
| 168 |
+
" print(f\"Processing: {target_term}\")\n",
|
| 169 |
+
" \n",
|
| 170 |
+
" pattern = re.compile(rf'\\b{re.escape(target_term)}\\b', flags=re.IGNORECASE)\n",
|
| 171 |
+
" df_filtered = df_tags[df_tags.apply(\n",
|
| 172 |
+
" lambda row: row.astype(str).apply(lambda x: bool(pattern.search(x))).any(),\n",
|
| 173 |
+
" axis=1\n",
|
| 174 |
+
" )]\n",
|
| 175 |
+
"\n",
|
| 176 |
+
" # Skip if no data matches\n",
|
| 177 |
+
" if df_filtered.empty:\n",
|
| 178 |
+
" print(f\" ⚠️ No matches for '{target_term}', skipping.\")\n",
|
| 179 |
+
" continue\n",
|
| 180 |
+
"\n",
|
| 181 |
+
" # === COUNT INDIVIDUAL TAGS ===\n",
|
| 182 |
+
" all_tags = df_filtered.values.flatten()\n",
|
| 183 |
+
" all_tags = [tag for tag in all_tags if pd.notna(tag)]\n",
|
| 184 |
+
" tag_counts = Counter(all_tags)\n",
|
| 185 |
+
"\n",
|
| 186 |
+
" # === CO-OCCURRENCE ===\n",
|
| 187 |
+
" co_occurrences = defaultdict(int)\n",
|
| 188 |
+
" for tags in df_filtered.itertuples(index=False, name=None):\n",
|
| 189 |
+
" tags = [tag for tag in tags if pd.notna(tag)]\n",
|
| 190 |
+
" for tag1, tag2 in combinations(tags, 2):\n",
|
| 191 |
+
" co_occurrences[frozenset([tag1, tag2])] += 1\n",
|
| 192 |
+
"\n",
|
| 193 |
+
" edges = [(list(pair)[0], list(pair)[1], weight) for pair, weight in co_occurrences.items()]\n",
|
| 194 |
+
"\n",
|
| 195 |
+
" # === FILTER BY CONNECTIONS ===\n",
|
| 196 |
+
" connected_tags = Counter()\n",
|
| 197 |
+
" for tag1, tag2, _ in edges:\n",
|
| 198 |
+
" connected_tags[tag1] += 1\n",
|
| 199 |
+
" connected_tags[tag2] += 1\n",
|
| 200 |
+
"\n",
|
| 201 |
+
" nodes = [{\"id\": tag, \"size\": tag_counts[tag]} for tag in tag_counts if connected_tags[tag] >= min_connections]\n",
|
| 202 |
+
" valid_ids = set(node[\"id\"] for node in nodes)\n",
|
| 203 |
+
" links = [{\"source\": tag1, \"target\": tag2, \"value\": weight}\n",
|
| 204 |
+
" for tag1, tag2, weight in edges\n",
|
| 205 |
+
" if tag1 in valid_ids and tag2 in valid_ids]\n",
|
| 206 |
+
"\n",
|
| 207 |
+
" if not nodes or not links:\n",
|
| 208 |
+
" print(f\" ⚠️ Not enough connections for '{target_term}', skipping.\")\n",
|
| 209 |
+
" continue\n",
|
| 210 |
+
"\n",
|
| 211 |
+
" # === EXPORT ===\n",
|
| 212 |
+
" d3_data = {\"nodes\": nodes, \"links\": links}\n",
|
| 213 |
+
" safe_term = re.sub(r'\\W+', '_', target_term.lower())\n",
|
| 214 |
+
" output_file = os.path.join(output_dir, f\"tags_{safe_term}_poi.json\")\n",
|
| 215 |
+
" \n",
|
| 216 |
+
" with open(output_file, \"w\") as f:\n",
|
| 217 |
+
" json.dump(d3_data, f, indent=4)\n",
|
| 218 |
+
" \n",
|
| 219 |
+
" print(f\" ✅ Saved to {output_file}\")\n"
|
| 220 |
+
]
|
| 221 |
+
},
|
| 222 |
+
{
|
| 223 |
+
"cell_type": "code",
|
| 224 |
+
"execution_count": 5,
|
| 225 |
+
"metadata": {},
|
| 226 |
+
"outputs": [
|
| 227 |
+
{
|
| 228 |
+
"name": "stderr",
|
| 229 |
+
"output_type": "stream",
|
| 230 |
+
"text": [
|
| 231 |
+
"/tmp/ipykernel_79893/1420003123.py:13: DtypeWarning: Columns (50,51,54,55,56,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,458) have mixed types. Specify dtype option on import or set low_memory=False.\n",
|
| 232 |
+
" df = pd.read_csv(file_path)\n"
|
| 233 |
+
]
|
| 234 |
+
},
|
| 235 |
+
{
|
| 236 |
+
"name": "stdout",
|
| 237 |
+
"output_type": "stream",
|
| 238 |
+
"text": [
|
| 239 |
+
"✅ Exported 60330 nodes and 16921 links to public/json/nodes_all.json\n"
|
| 240 |
+
]
|
| 241 |
+
}
|
| 242 |
+
],
|
| 243 |
+
"source": [
|
| 244 |
+
"import pandas as pd\n",
|
| 245 |
+
"from itertools import combinations\n",
|
| 246 |
+
"from collections import Counter, defaultdict\n",
|
| 247 |
+
"import json\n",
|
| 248 |
+
"import os\n",
|
| 249 |
+
"\n",
|
| 250 |
+
"# === CONFIG ===\n",
|
| 251 |
+
"file_path = \"data/model_subsets/all_models_poi_false.csv\"\n",
|
| 252 |
+
"output_file = \"public/json/nodes_all.json\"\n",
|
| 253 |
+
"min_link_threshold = 10 # Only keep edges with co-occurrence >= this\n",
|
| 254 |
+
"\n",
|
| 255 |
+
"# === LOAD DATA ===\n",
|
| 256 |
+
"df = pd.read_csv(file_path)\n",
|
| 257 |
+
"tag_columns = [f\"tag_{i}\" for i in range(1, 8)]\n",
|
| 258 |
+
"df_tags = df[tag_columns]\n",
|
| 259 |
+
"\n",
|
| 260 |
+
"# === COUNT INDIVIDUAL TAGS ===\n",
|
| 261 |
+
"all_tags = df_tags.values.flatten()\n",
|
| 262 |
+
"all_tags = [tag for tag in all_tags if pd.notna(tag)]\n",
|
| 263 |
+
"tag_counts = Counter(all_tags)\n",
|
| 264 |
+
"\n",
|
| 265 |
+
"# === CO-OCCURRENCE ===\n",
|
| 266 |
+
"co_occurrences = defaultdict(int)\n",
|
| 267 |
+
"for tags in df_tags.itertuples(index=False, name=None):\n",
|
| 268 |
+
" tags = [tag for tag in tags if pd.notna(tag)]\n",
|
| 269 |
+
" for tag1, tag2 in combinations(tags, 2):\n",
|
| 270 |
+
" co_occurrences[frozenset([tag1, tag2])] += 1\n",
|
| 271 |
+
"\n",
|
| 272 |
+
"# === Build Edges (Filtered by co-occurrence threshold)\n",
|
| 273 |
+
"edges = [\n",
|
| 274 |
+
" {\"source\": list(pair)[0], \"target\": list(pair)[1], \"value\": weight}\n",
|
| 275 |
+
" for pair, weight in co_occurrences.items()\n",
|
| 276 |
+
" if weight >= min_link_threshold\n",
|
| 277 |
+
"]\n",
|
| 278 |
+
"\n",
|
| 279 |
+
"# === Build Nodes (All tags that appear, regardless of links)\n",
|
| 280 |
+
"nodes = [{\"id\": tag, \"size\": tag_counts[tag]} for tag in tag_counts]\n",
|
| 281 |
+
"\n",
|
| 282 |
+
"# === EXPORT ===\n",
|
| 283 |
+
"d3_data = {\"nodes\": nodes, \"links\": edges}\n",
|
| 284 |
+
"\n",
|
| 285 |
+
"os.makedirs(os.path.dirname(output_file), exist_ok=True)\n",
|
| 286 |
+
"with open(output_file, \"w\") as f:\n",
|
| 287 |
+
" json.dump(d3_data, f, indent=4)\n",
|
| 288 |
+
"\n",
|
| 289 |
+
"print(f\"✅ Exported {len(nodes)} nodes and {len(edges)} links to {output_file}\")\n"
|
| 290 |
+
]
|
| 291 |
+
}
|
| 292 |
+
],
|
| 293 |
+
"metadata": {
|
| 294 |
+
"kernelspec": {
|
| 295 |
+
"display_name": "Python 3 (ipykernel)",
|
| 296 |
+
"language": "python",
|
| 297 |
+
"name": "python3"
|
| 298 |
+
},
|
| 299 |
+
"language_info": {
|
| 300 |
+
"codemirror_mode": {
|
| 301 |
+
"name": "ipython",
|
| 302 |
+
"version": 3
|
| 303 |
+
},
|
| 304 |
+
"file_extension": ".py",
|
| 305 |
+
"mimetype": "text/x-python",
|
| 306 |
+
"name": "python",
|
| 307 |
+
"nbconvert_exporter": "python",
|
| 308 |
+
"pygments_lexer": "ipython3",
|
| 309 |
+
"version": "3.12.10"
|
| 310 |
+
}
|
| 311 |
+
},
|
| 312 |
+
"nbformat": 4,
|
| 313 |
+
"nbformat_minor": 4
|
| 314 |
+
}
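The co-occurrence cells in the notebook above key each tag pair by a `frozenset` and then index `list(pair)[1]`, which would fail if a row repeated the same tag. The sketch below shows one alternative that avoids this, assuming tags can be compared as strings. The name `build_tag_graph` and the `min_weight` parameter are hypothetical and do not appear in the notebook. Unlike the notebook, it counts a tag at most once per row.

```python
from collections import Counter
from itertools import combinations


def _present(tag) -> bool:
    # Treat None and NaN as missing (NaN is the only value not equal to itself).
    return tag is not None and tag == tag


def build_tag_graph(rows, min_weight: int = 1) -> dict:
    """Build a D3-style {"nodes": [...], "links": [...]} dict from rows of tags."""
    tag_counts = Counter()
    pair_counts = Counter()
    for row in rows:
        # Drop missing values and repeated tags within the same row.
        tags = sorted({str(tag) for tag in row if _present(tag)})
        tag_counts.update(tags)
        # Sorted input makes every pair a stable (a, b) tuple with a < b.
        pair_counts.update(combinations(tags, 2))
    links = [
        {"source": a, "target": b, "value": weight}
        for (a, b), weight in pair_counts.items()
        if weight >= min_weight
    ]
    nodes = [{"id": tag, "size": count} for tag, count in tag_counts.items()]
    return {"nodes": nodes, "links": links}
```

With a DataFrame such as `df_tags`, the rows could be supplied as `df_tags.itertuples(index=False, name=None)`.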
|
jupyter_notebooks/Section_2-4_Figure_9_Training_tags_Sankey.ipynb
ADDED
|
@@ -0,0 +1,203 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# Sankey Diagram from model dataset"
|
| 8 |
+
]
|
| 9 |
+
},
|
| 10 |
+
{
|
| 11 |
+
"cell_type": "code",
|
| 12 |
+
"execution_count": null,
|
| 13 |
+
"metadata": {},
|
| 14 |
+
"outputs": [],
|
| 15 |
+
"source": [
|
| 16 |
+
"import pandas as pd\n",
|
| 17 |
+
"from pathlib import Path\n",
|
| 18 |
+
"current_dir = Path.cwd()"
|
| 19 |
+
]
|
| 20 |
+
},
|
| 21 |
+
{
|
| 22 |
+
"cell_type": "code",
|
| 23 |
+
"execution_count": null,
|
| 24 |
+
"metadata": {},
|
| 25 |
+
"outputs": [
|
| 26 |
+
{
|
| 27 |
+
"name": "stderr",
|
| 28 |
+
"output_type": "stream",
|
| 29 |
+
"text": [
|
| 30 |
+
"/tmp/ipykernel_9667/181225218.py:8: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
|
| 31 |
+
" df['has_tags'] = df['has_tags'].fillna(False)\n",
|
| 32 |
+
"/tmp/ipykernel_9667/181225218.py:9: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
|
| 33 |
+
" df['has_sex_tags'] = df['has_sex_tags'].fillna(False)\n"
|
| 34 |
+
]
|
| 35 |
+
},
|
| 36 |
+
{
|
| 37 |
+
"name": "stdout",
|
| 38 |
+
"output_type": "stream",
|
| 39 |
+
"text": [
|
| 40 |
+
"🟢 All Adapters with Tags → Explicit: 4970\n",
|
| 41 |
+
"🔵 All Adapters with Tags → Non-Explicit: 6181\n",
|
| 42 |
+
"✅ Sankey structure saved to sankey_tags_focus_output.csv\n"
|
| 43 |
+
]
|
| 44 |
+
}
|
| 45 |
+
],
|
| 46 |
+
"source": [
|
| 47 |
+
"import pandas as pd\n",
|
| 48 |
+
"# Load your dataset\n",
|
| 49 |
+
"df = pd.read_csv(\"data/CSV/all_models_with_tags.csv\", nrows=40000)\n",
|
| 50 |
+
"\n",
|
| 51 |
+
"\n",
|
| 52 |
+
"# Fill missing values\n",
|
| 53 |
+
"df['has_tags'] = df['has_tags'].fillna(False)\n",
|
| 54 |
+
"df['has_sex_tags'] = df['has_sex_tags'].fillna(False)\n",
|
| 55 |
+
"df['poi'] = df['poi'].fillna(False)\n",
|
| 56 |
+
"\n",
|
| 57 |
+
"# Normalize and map types\n",
|
| 58 |
+
"df['type_normalized'] = df['type'].str.lower().str.strip()\n",
|
| 59 |
+
"type_mapping = {\n",
|
| 60 |
+
" 'checkpoint': 'Checkpoint',\n",
|
| 61 |
+
" 'lora': 'LoRA',\n",
|
| 62 |
+
" 'locon': 'LOCON',\n",
|
| 63 |
+
" 'textualinversion': 'TextualInversion',\n",
|
| 64 |
+
" 'controlnet': 'Other',\n",
|
| 65 |
+
" 'vae': 'Other',\n",
|
| 66 |
+
" 'upscaler': 'Other',\n",
|
| 67 |
+
" 'poses': 'Other',\n",
|
| 68 |
+
" 'workflows': 'Other',\n",
|
| 69 |
+
" 'other': 'Other'\n",
|
| 70 |
+
"}\n",
|
| 71 |
+
"df['type_mapped'] = df['type_normalized'].map(type_mapping).fillna('Other')\n",
|
| 72 |
+
"\n",
|
| 73 |
+
"\n",
|
| 74 |
+
"\n",
|
| 75 |
+
"\n",
|
| 76 |
+
"\n",
|
| 77 |
+
"# Identify Adapters\n",
|
| 78 |
+
"adapter_types = ['LoRA', 'LOCON', 'TextualInversion', 'Controlnet']\n",
|
| 79 |
+
"df['is_adapter'] = df['type_mapped'].isin(adapter_types)\n",
|
| 80 |
+
"adapter_df = df[df['is_adapter']]\n",
|
| 81 |
+
"\n",
|
| 82 |
+
"# Build Sankey links\n",
|
| 83 |
+
"sankey_links = []\n",
|
| 84 |
+
"\n",
|
| 85 |
+
"# Total → each model type\n",
|
| 86 |
+
"for t in df['type_mapped'].unique():\n",
|
| 87 |
+
" count = df[df['type_mapped'] == t].shape[0]\n",
|
| 88 |
+
" sankey_links.append({'source': 'Total', 'target': t, 'value': count})\n",
|
| 89 |
+
"\n",
|
| 90 |
+
"# Adapter types → Adapters\n",
|
| 91 |
+
"for t in adapter_types:\n",
|
| 92 |
+
" count = df[df['type_mapped'] == t].shape[0]\n",
|
| 93 |
+
" if count > 0:\n",
|
| 94 |
+
" sankey_links.append({'source': t, 'target': 'Adapters', 'value': count})\n",
|
| 95 |
+
"# Redefine adapter_types to only include those relevant for POI analysis\n",
|
| 96 |
+
"adapter_types = ['LoRA', 'LOCON', 'TextualInversion']\n",
|
| 97 |
+
"df['is_adapter'] = df['type_mapped'].isin(adapter_types)\n",
|
| 98 |
+
"adapter_df = df[df['is_adapter']]\n",
|
| 99 |
+
"\n",
|
| 100 |
+
"# Adapters → has_tags / has_no_tags\n",
|
| 101 |
+
"tagged_df = adapter_df[adapter_df['has_tags'] == True]\n",
|
| 102 |
+
"no_tag_df = adapter_df[adapter_df['has_tags'] == False]\n",
|
| 103 |
+
"\n",
|
| 104 |
+
"# Add has_no_tags count\n",
|
| 105 |
+
"no_tag_count = no_tag_df.shape[0]\n",
|
| 106 |
+
"if no_tag_count > 0:\n",
|
| 107 |
+
" sankey_links.append({'source': 'Adapters', 'target': 'has_no_tags', 'value': no_tag_count})\n",
|
| 108 |
+
"\n",
|
| 109 |
+
"# Now subdivide has_tags group\n",
|
| 110 |
+
"has_tag_count = tagged_df.shape[0]\n",
|
| 111 |
+
"if has_tag_count > 0:\n",
|
| 112 |
+
" sankey_links.append({'source': 'Adapters', 'target': 'has_tags', 'value': has_tag_count})\n",
|
| 113 |
+
"\n",
|
| 114 |
+
" for poi_val in [True, False]:\n",
|
| 115 |
+
" poi_label = f'has_tags + POI {\"True\" if poi_val else \"False\"}'\n",
|
| 116 |
+
" subset = tagged_df[tagged_df['poi'] == poi_val]\n",
|
| 117 |
+
" poi_count = subset.shape[0] # ✅ Fix: missing in original code\n",
|
| 118 |
+
"\n",
|
| 119 |
+
" if poi_count > 0:\n",
|
| 120 |
+
" sankey_links.append({'source': 'has_tags', 'target': poi_label, 'value': poi_count})\n",
|
| 121 |
+
"\n",
|
| 122 |
+
" for explicit in [True, False]:\n",
|
| 123 |
+
" explicit_label = f'{poi_label} + {\"explicit\" if explicit else \"non-explicit\"}'\n",
|
| 124 |
+
" exp_count = subset[subset['has_sex_tags'] == explicit].shape[0]\n",
|
| 125 |
+
" if exp_count > 0:\n",
|
| 126 |
+
" sankey_links.append({'source': poi_label, 'target': explicit_label, 'value': exp_count})\n",
|
| 127 |
+
"\n",
|
| 128 |
+
"# --- New Section: Count Models (Rows) Containing Specific Tags ---\n",
|
| 129 |
+
"\n",
|
| 130 |
+
"# --- Output total explicit and non-explicit from all adapters with tags ---\n",
|
| 131 |
+
"\n",
|
| 132 |
+
"explicit_total = tagged_df[tagged_df['has_sex_tags'] == True].shape[0]\n",
|
| 133 |
+
"non_explicit_total = tagged_df[tagged_df['has_sex_tags'] == False].shape[0]\n",
|
| 134 |
+
"\n",
|
| 135 |
+
"# Add to sankey_links (optional)\n",
|
| 136 |
+
"if explicit_total > 0:\n",
|
| 137 |
+
" sankey_links.append({'source': 'has_tags', 'target': 'All Adapters has_tags + explicit', 'value': explicit_total})\n",
|
| 138 |
+
"\n",
|
| 139 |
+
"if non_explicit_total > 0:\n",
|
| 140 |
+
" sankey_links.append({'source': 'has_tags', 'target': 'All Adapters has_tags + non-explicit', 'value': non_explicit_total})\n",
|
| 141 |
+
"\n",
|
| 142 |
+
"# Also print them clearly\n",
|
| 143 |
+
"print(f\"🟢 All Adapters with Tags → Explicit: {explicit_total}\")\n",
|
| 144 |
+
"print(f\"🔵 All Adapters with Tags → Non-Explicit: {non_explicit_total}\")\n",
|
| 145 |
+
"\n",
|
| 146 |
+
"\n",
|
| 147 |
+
"# Define target tags to check\n",
|
| 148 |
+
"target_tags = ['rape', 'loli', 'shota', 'lolicon']\n",
|
| 149 |
+
"\n",
|
| 150 |
+
"# Identify all tag columns\n",
|
| 151 |
+
"tag_columns = [col for col in df.columns if col.startswith('tag') and col[3:].isdigit()]\n",
|
| 152 |
+
"\n",
|
| 153 |
+
"# Lowercase tag values for matching\n",
|
| 154 |
+
"df_tags_lower = df[tag_columns].astype(str).apply(lambda col: col.str.lower().str.strip())\n",
|
| 155 |
+
"\n",
|
| 156 |
+
"# For each target tag, check how many rows contain it\n",
|
| 157 |
+
"for tag in target_tags:\n",
|
| 158 |
+
" contains_tag = df_tags_lower.isin([tag])\n",
|
| 159 |
+
" \n",
|
| 160 |
+
" # Only count rows that are both explicit and contain the tag\n",
|
| 161 |
+
" matching_rows = df[(df['has_sex_tags'] == True) & contains_tag.any(axis=1)]\n",
|
| 162 |
+
" rows_with_tag = matching_rows.shape[0]\n",
|
| 163 |
+
" \n",
|
| 164 |
+
" if rows_with_tag > 0:\n",
|
| 165 |
+
" sankey_links.append({'source': 'has_tags', 'target': f'has_tag_{tag}', 'value': int(rows_with_tag)})\n",
|
| 166 |
+
"\n",
|
| 167 |
+
"\n",
|
| 168 |
+
"# Save to CSV\n",
|
| 169 |
+
"sankey_df = pd.DataFrame(sankey_links)\n",
|
| 170 |
+
"sankey_df.to_csv(\"data/CSV/sankey_data.csv\", index=False)\n",
|
| 171 |
+
"print(\"✅ Sankey structure saved to data/CSV/sankey_data.csv\")\n"
|
| 172 |
+
]
|
| 173 |
+
},
|
| 174 |
+
{
|
| 175 |
+
"cell_type": "code",
|
| 176 |
+
"execution_count": null,
|
| 177 |
+
"metadata": {},
|
| 178 |
+
"outputs": [],
|
| 179 |
+
"source": []
|
| 180 |
+
}
|
| 181 |
+
],
|
| 182 |
+
"metadata": {
|
| 183 |
+
"kernelspec": {
|
| 184 |
+
"display_name": "latm",
|
| 185 |
+
"language": "python",
|
| 186 |
+
"name": "python3"
|
| 187 |
+
},
|
| 188 |
+
"language_info": {
|
| 189 |
+
"codemirror_mode": {
|
| 190 |
+
"name": "ipython",
|
| 191 |
+
"version": 3
|
| 192 |
+
},
|
| 193 |
+
"file_extension": ".py",
|
| 194 |
+
"mimetype": "text/x-python",
|
| 195 |
+
"name": "python",
|
| 196 |
+
"nbconvert_exporter": "python",
|
| 197 |
+
"pygments_lexer": "ipython3",
|
| 198 |
+
"version": "3.10.15"
|
| 199 |
+
}
|
| 200 |
+
},
|
| 201 |
+
"nbformat": 4,
|
| 202 |
+
"nbformat_minor": 2
|
| 203 |
+
}
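The Sankey notebook above builds its links by filtering the DataFrame once per branch. As a rough illustration of the same flow structure (Adapters → tag status → POI status → explicitness), the sketch below counts every link in a single pass. The function name `sankey_links` and the plain-dict record format are assumptions made for this example. Missing fields are treated as `False`, mirroring the notebook's `fillna(False)`, so NaN values would need to be converted first.

```python
from collections import Counter


def sankey_links(records):
    """Count flows Adapters -> tag status -> POI status -> explicitness."""
    flows = Counter()
    for rec in records:
        if not rec.get("has_tags"):
            flows[("Adapters", "has_no_tags")] += 1
            continue
        flows[("Adapters", "has_tags")] += 1
        # Same label scheme as the notebook: 'has_tags + POI True' / '... False'.
        poi_label = f"has_tags + POI {bool(rec.get('poi'))}"
        flows[("has_tags", poi_label)] += 1
        kind = "explicit" if rec.get("has_sex_tags") else "non-explicit"
        flows[(poi_label, f"{poi_label} + {kind}")] += 1
    return [
        {"source": source, "target": target, "value": value}
        for (source, target), value in flows.items()
    ]
```

Because each record contributes exactly one unit to every level it reaches, the outgoing values of a node always sum to its incoming value, which is what a Sankey layout expects.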
|
jupyter_notebooks/Section_2-4_Figure_9_ectract_LoRA_metadata_v2.ipynb
ADDED
|
@@ -0,0 +1,414 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "f36422c8",
|
| 6 |
+
"metadata": {},
|
| 7 |
+
"source": [
|
| 8 |
+
"# LoRA metadata"
|
| 9 |
+
]
|
| 10 |
+
},
|
| 11 |
+
{
|
| 12 |
+
"cell_type": "raw",
|
| 13 |
+
"id": "8a2feb6e",
|
| 14 |
+
"metadata": {},
|
| 15 |
+
"source": [
|
| 16 |
+
"LoRA Metadata Processing Workflow\n",
|
| 17 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 18 |
+
"│ Load CSV File │ --> │ Read adapter metadata CSV file. │\n",
|
| 19 |
+
"│ Read Model Versions │ │ Extract model version IDs and relevant data. │\n",
|
| 20 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 21 |
+
" ↓\n",
|
| 22 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 23 |
+
"│ Download Adapter │ --> │ Use stored download URLs to fetch adapter files │\n",
|
| 24 |
+
"│ Files Using API │ │ using rotating API keys. │\n",
|
| 25 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 26 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 27 |
+
"│ Parse Metadata │ --> │ Extract safetensors metadata, such as training │\n",
|
| 28 |
+
"│ from SafeTensor │ │ images, model type, and architecture. │\n",
|
| 29 |
+
"│ Files │ │ │\n",
|
| 30 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 31 |
+
" ↓\n",
|
| 32 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 33 |
+
"│ Store Parsed │ --> │ Save extracted metadata into structured JSON │\n",
|
| 34 |
+
"│ Metadata as JSON │ │ files for later analysis. │\n",
|
| 35 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 36 |
+
"\n",
|
| 37 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 38 |
+
"│ Process JSON Files │ --> │ Read saved JSON metadata, extract relevant │\n",
|
| 39 |
+
"│ for Consolidation │ │ details, and filter necessary attributes. │\n",
|
| 40 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 41 |
+
" ↓\n",
|
| 42 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 43 |
+
"│ Extract Training │ --> │ Identify most frequent training tags, architectures│\n",
|
| 44 |
+
"│ Tags & Model Info │ │ and systems used for model creation. │\n",
|
| 45 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 46 |
+
" ↓\n",
|
| 47 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 48 |
+
"│ Save Consolidated │ --> │ Store all processed metadata in a structured CSV │\n",
|
| 49 |
+
"│ Metadata to CSV │ │ format for final analysis. │\n",
|
| 50 |
+
"└──────────────────────┘ └───────────────────────────────────────────────────┘\n"
|
| 51 |
+
]
|
| 52 |
+
},
|
| 53 |
+
{
|
| 54 |
+
"cell_type": "code",
|
| 55 |
+
"execution_count": null,
|
| 56 |
+
"id": "efc9939d",
|
| 57 |
+
"metadata": {},
|
| 58 |
+
"outputs": [],
|
| 59 |
+
"source": [
|
| 60 |
+
"import os\n",
|
| 61 |
+
"import re\n",
|
| 62 |
+
"import json\n",
|
| 63 |
+
"import csv\n",
|
| 64 |
+
"import struct\n",
|
| 65 |
+
"import requests\n",
|
| 66 |
+
"from pathlib import Path\n",
|
| 67 |
+
"import pandas as pd\n",
|
| 68 |
+
"from collections import Counter\n",
|
| 69 |
+
"from concurrent.futures import ProcessPoolExecutor\n",
|
| 70 |
+
"from pathlib import Path\n",
|
| 71 |
+
"import matplotlib.pyplot as plt\n",
|
| 72 |
+
"from matplotlib.font_manager import FontProperties\n",
|
| 73 |
+
"from matplotlib import font_manager\n",
|
| 74 |
+
"import pandas as pd\n",
|
| 75 |
+
"from collections import Counter\n",
|
| 76 |
+
"from concurrent.futures import ProcessPoolExecutor\n",
|
| 77 |
+
"\n",
|
| 78 |
+
"# Define the current directory and important file paths\n",
|
| 79 |
+
"current_dir = Path.cwd()\n",
|
| 80 |
+
"\n",
|
| 81 |
+
"# Define frequently used directories\n",
|
| 82 |
+
"\n",
|
| 83 |
+
"data_dir = current_dir.parent / 'data/csv/adapters.csv'\n",
|
| 84 |
+
"fonts_dir = current_dir.parent / 'misc/assets/fonts'\n",
|
| 85 |
+
"plots_dir = current_dir.parent / 'results/plots'\n",
|
| 86 |
+
"raw_data_dir = current_dir.parent / 'data/adapter_metadata/lora' ### location of the LoRA metadata (JSON)\n",
|
| 87 |
+
"temp_dir = current_dir.parent / 'data/raw/adapters_safetensors'\n",
|
| 88 |
+
"misc_dir = current_dir.parent / 'misc'\n",
|
| 89 |
+
"\n",
|
| 90 |
+
"# File paths\n",
|
| 91 |
+
"adapters_csv = current_dir.parent / 'data/csv/adapters.csv'\n",
|
| 92 |
+
"output_json_dir = raw_data_dir\n",
|
| 93 |
+
"api_keys_file = misc_dir / 'credentials/civit.txt'\n",
|
| 94 |
+
"\n",
|
| 95 |
+
"# Ensure directories exist\n",
|
| 96 |
+
"os.makedirs(output_json_dir, exist_ok=True)\n",
|
| 97 |
+
"os.makedirs(temp_dir, exist_ok=True)\n",
|
| 98 |
+
"\n",
|
| 99 |
+
"\n",
|
| 100 |
+
"# Load fonts into Matplotlib\n",
|
| 101 |
+
"for font_path in font_manager.findSystemFonts(fontpaths=[str(fonts_dir)]):\n",
|
| 102 |
+
" font_manager.fontManager.addfont(font_path)\n",
|
| 103 |
+
"\n",
|
| 104 |
+
"# Set default font family for plots\n",
|
| 105 |
+
"plt.rcParams['font.family'] = ['Noto Sans JP', 'Noto Sans SC', 'sans-serif']\n",
|
| 106 |
+
"\n",
|
| 107 |
+
"print('Paths and fonts initialized successfully.')\n",
|
| 108 |
+
"\n",
|
| 109 |
+
"print('Paths initialized successfully.')"
|
| 110 |
+
]
|
| 111 |
+
},
|
| 112 |
+
{
|
| 113 |
+
"cell_type": "markdown",
|
| 114 |
+
"id": "87a58593",
|
| 115 |
+
"metadata": {},
|
| 116 |
+
"source": [
|
| 117 |
+
"## Step 2: Download LoRA and extract *.safetensors metadata\n",
|
| 118 |
+
"This script downloads LoRA adapters from the filtered Civiverse-Models dataset and extracts the metadata stored in the header of each *.safetensors file."
|
| 119 |
+
]
|
| 120 |
+
},
|
| 121 |
+
{
|
| 122 |
+
"cell_type": "code",
|
| 123 |
+
"execution_count": null,
|
| 124 |
+
"id": "abd3a0bc",
|
| 125 |
+
"metadata": {},
|
| 126 |
+
"outputs": [],
|
| 127 |
+
"source": [
|
| 128 |
+
"import os\n",
|
| 129 |
+
"import sys\n",
|
| 130 |
+
"import csv\n",
|
| 131 |
+
"import json\n",
|
| 132 |
+
"import struct\n",
|
| 133 |
+
"import time\n",
|
| 134 |
+
"import requests\n",
|
| 135 |
+
"import signal\n",
|
| 136 |
+
"import contextlib\n",
|
| 137 |
+
"from pathlib import Path\n",
|
| 138 |
+
"import re\n",
|
| 139 |
+
"\n",
|
| 140 |
+
"# === Paste your API keys here ===\n",
|
| 141 |
+
"API_KEYS = [\n",
|
| 142 |
+
"    \"<YOUR_API_KEY_1>\",\n",
|
| 143 |
+
"    \"<YOUR_API_KEY_2>\",\n",
|
| 144 |
+
"    \"<YOUR_API_KEY_3>\",\n",
|
| 145 |
+
"    \"<YOUR_API_KEY_4>\",\n",
|
| 146 |
+
"    \"<YOUR_API_KEY_5>\"\n",
|
| 147 |
+
"]\n",
|
| 148 |
+
"if not API_KEYS or any(not isinstance(k, str) or not k.strip() for k in API_KEYS):\n",
|
| 149 |
+
" raise ValueError(\"Please paste at least one valid API key into API_KEYS.\")\n",
|
| 150 |
+
"\n",
|
| 151 |
+
"# === Config (adjust paths as needed) ===\n",
|
| 152 |
+
"current_dir = Path.cwd()\n",
|
| 153 |
+
"output_json_dir = current_dir.parent / \"data/adapter_metadata/lora\" # where JSON outputs go\n",
|
| 154 |
+
"temp_dir = current_dir.parent / \"data/raw/adapters_safetensors\" # where downloads go\n",
|
| 155 |
+
"csv_path = current_dir.parent / \"data/csv/adapters_poi_false_sfw.csv\"\n",
|
| 156 |
+
"\n",
|
| 157 |
+
"os.makedirs(output_json_dir, exist_ok=True)\n",
|
| 158 |
+
"os.makedirs(temp_dir, exist_ok=True)\n",
|
| 159 |
+
"\n",
|
| 160 |
+
"# === API key state ===\n",
|
| 161 |
+
"current_key_index = 0\n",
|
| 162 |
+
"\n",
|
| 163 |
+
"\n",
|
| 164 |
+
"\n",
|
| 165 |
+
"def safe_filename(name: str, max_length: int = 100) -> str:\n",
|
| 166 |
+
" # Replace unsafe chars\n",
|
| 167 |
+
" sanitized = re.sub(r'[^a-zA-Z0-9_\\-]', '_', name)\n",
|
| 168 |
+
" # Truncate if too long\n",
|
| 169 |
+
" if len(sanitized) > max_length:\n",
|
| 170 |
+
" sanitized = sanitized[:max_length]\n",
|
| 171 |
+
" return sanitized\n",
|
| 172 |
+
"\n",
|
| 173 |
+
"\n",
|
| 174 |
+
"def get_headers():\n",
|
| 175 |
+
" global current_key_index\n",
|
| 176 |
+
" return {\n",
|
| 177 |
+
" \"Accept\": \"application/json\",\n",
|
| 178 |
+
" \"Authorization\": f\"Bearer {API_KEYS[current_key_index].strip()}\"\n",
|
| 179 |
+
" }\n",
|
| 180 |
+
"\n",
|
| 181 |
+
"def rotate_api_key():\n",
|
| 182 |
+
" global current_key_index\n",
|
| 183 |
+
" if current_key_index < len(API_KEYS) - 1:\n",
|
| 184 |
+
" current_key_index += 1\n",
|
| 185 |
+
" print(f\"🔁 Rotated to API key #{current_key_index + 1}\")\n",
|
| 186 |
+
" else:\n",
|
| 187 |
+
" raise Exception(\"All API keys have been exhausted.\")\n",
|
| 188 |
+
"\n",
|
| 189 |
+
"# === Utilities ===\n",
|
| 190 |
+
"def save_json(data, filename):\n",
|
| 191 |
+
" with open(filename, 'w', encoding=\"utf-8\") as f:\n",
|
| 192 |
+
" json.dump(data, f, indent=4, ensure_ascii=False)\n",
|
| 193 |
+
"\n",
|
| 194 |
+
"def parse_safetensors(file_path):\n",
|
| 195 |
+
" # Minimal, tolerant metadata reader; returns {} on failure.\n",
|
| 196 |
+
" try:\n",
|
| 197 |
+
" with open(file_path, 'rb') as f:\n",
|
| 198 |
+
" file_data = f.read()\n",
|
| 199 |
+
" # The safetensors header begins with an 8-byte little-endian unsigned integer\n",
|
| 200 |
+
" # that gives the byte length of the JSON header stored directly after it.\n",
|
| 201 |
+
" metadata_size = struct.unpack('<Q', file_data[:8])[0]\n",
|
| 202 |
+
" metadata_bytes = file_data[8:8 + metadata_size]\n",
|
| 203 |
+
" metadata_str = metadata_bytes.decode('utf-8', errors='replace')\n",
|
| 204 |
+
" metadata = json.loads(metadata_str)\n",
|
| 205 |
+
" return metadata.get('__metadata__', {})\n",
|
| 206 |
+
" except Exception as e:\n",
|
| 207 |
+
" print(f\"Error parsing safetensors file: {e}\")\n",
|
| 208 |
+
" return {}\n",
|
| 209 |
+
"\n",
|
| 210 |
+
"# === Timeout context ===\n",
|
| 211 |
+
"class TimeoutException(Exception):\n",
|
| 212 |
+
" pass\n",
|
| 213 |
+
"\n",
|
| 214 |
+
"@contextlib.contextmanager\n",
|
| 215 |
+
"def time_limit(seconds):\n",
|
| 216 |
+
" def signal_handler(signum, frame):\n",
|
| 217 |
+
" raise TimeoutException(f\"Timed out after {seconds} seconds\")\n",
|
| 218 |
+
" # Note: SIGALRM works on Unix-like OS; on Windows this will be a no-op.\n",
|
| 219 |
+
" try:\n",
|
| 220 |
+
" signal.signal(signal.SIGALRM, signal_handler)\n",
|
| 221 |
+
" signal.alarm(seconds)\n",
|
| 222 |
+
" except Exception:\n",
|
| 223 |
+
" # Fallback: no hard alarm on non-Unix systems\n",
|
| 224 |
+
" pass\n",
|
| 225 |
+
" try:\n",
|
| 226 |
+
" yield\n",
|
| 227 |
+
" finally:\n",
|
| 228 |
+
" try:\n",
|
| 229 |
+
" signal.alarm(0)\n",
|
| 230 |
+
" except Exception:\n",
|
| 231 |
+
" pass\n",
|
| 232 |
+
"\n",
|
| 233 |
+
"# === Download with timeout, retries, backoff, and key rotation ===\n",
|
| 234 |
+
"def download_file(url, output_folder, timeout=30, overall_timeout=120, max_retries=3):\n",
|
| 235 |
+
" filename = url.split(\"/\")[-1]\n",
|
| 236 |
+
" output_path = os.path.join(output_folder, filename)\n",
|
| 237 |
+
"\n",
|
| 238 |
+
" global current_key_index\n",
|
| 239 |
+
" retries = 0\n",
|
| 240 |
+
" backoff = 2\n",
|
| 241 |
+
"\n",
|
| 242 |
+
" while current_key_index < len(API_KEYS):\n",
|
| 243 |
+
" try:\n",
|
| 244 |
+
" with time_limit(overall_timeout): # global cap per download\n",
|
| 245 |
+
" #print(f\"➡️ GET {url} using key #{current_key_index + 1}\")\n",
|
| 246 |
+
" resp = requests.get(\n",
|
| 247 |
+
" url,\n",
|
| 248 |
+
" headers=get_headers(),\n",
|
| 249 |
+
" stream=True,\n",
|
| 250 |
+
" timeout=(10, timeout), # (connect timeout, per-chunk read timeout)\n",
|
| 251 |
+
" )\n",
|
| 252 |
+
"\n",
|
| 253 |
+
" # Auth errors → rotate key\n",
|
| 254 |
+
" if resp.status_code in (401, 403):\n",
|
| 255 |
+
" print(f\"❌ Auth {resp.status_code} with key #{current_key_index + 1}. Rotating.\")\n",
|
| 256 |
+
" rotate_api_key()\n",
|
| 257 |
+
" retries = 0\n",
|
| 258 |
+
" backoff = 2\n",
|
| 259 |
+
" continue\n",
|
| 260 |
+
"\n",
|
| 261 |
+
" # Not found → bubble up as FileNotFoundError (do not rotate)\n",
|
| 262 |
+
" if resp.status_code == 404:\n",
|
| 263 |
+
" raise FileNotFoundError(f\"Model not found at {url}\")\n",
|
| 264 |
+
"\n",
|
| 265 |
+
" # Rate limit → either rotate or wait/backoff\n",
|
| 266 |
+
" if resp.status_code == 429:\n",
|
| 267 |
+
" print(\"⏳ Rate limited (429).\", end=\" \")\n",
|
| 268 |
+
" if current_key_index < len(API_KEYS) - 1:\n",
|
| 269 |
+
" print(\"Rotating key.\")\n",
|
| 270 |
+
" rotate_api_key()\n",
|
| 271 |
+
" retries = 0\n",
|
| 272 |
+
" backoff = 2\n",
|
| 273 |
+
" continue\n",
|
| 274 |
+
" else:\n",
|
| 275 |
+
" print(f\"Waiting {backoff}s (no other keys).\")\n",
|
| 276 |
+
" time.sleep(backoff)\n",
|
| 277 |
+
" backoff = min(backoff * 2, 60)\n",
|
| 278 |
+
" continue\n",
|
| 279 |
+
"\n",
|
| 280 |
+
" # Other HTTP errors → raise to RequestException path\n",
|
| 281 |
+
" resp.raise_for_status()\n",
|
| 282 |
+
"\n",
|
| 283 |
+
" # Save file\n",
|
| 284 |
+
" with open(output_path, 'wb') as fh:\n",
|
| 285 |
+
" for chunk in resp.iter_content(chunk_size=8192):\n",
|
| 286 |
+
" if chunk:\n",
|
| 287 |
+
" fh.write(chunk)\n",
|
| 288 |
+
"\n",
|
| 289 |
+
" return output_path, filename\n",
|
| 290 |
+
"\n",
|
| 291 |
+
" except TimeoutException as e:\n",
|
| 292 |
+
" # Hard overall timeout → propagate\n",
|
| 293 |
+
" raise e\n",
|
| 294 |
+
" except requests.exceptions.RequestException as e:\n",
|
| 295 |
+
" # Network-ish errors: retry same key with backoff up to max_retries\n",
|
| 296 |
+
" retries += 1\n",
|
| 297 |
+
" if retries <= max_retries:\n",
|
| 298 |
+
" print(f\"🌐 Network error (try {retries}/{max_retries}) with key #{current_key_index + 1}: {e}\")\n",
|
| 299 |
+
" time.sleep(backoff)\n",
|
| 300 |
+
" backoff = min(backoff * 2, 60)\n",
|
| 301 |
+
" continue\n",
|
| 302 |
+
" else:\n",
|
| 303 |
+
" raise Exception(f\"Failed to download {url} after {max_retries} retries: {e}\")\n",
|
| 304 |
+
"\n",
|
| 305 |
+
" # If we exit the loop, we truly ran out\n",
|
| 306 |
+
" raise Exception(\"All API keys have been exhausted or failed.\")\n",
|
| 307 |
+
"\n",
|
| 308 |
+
"# === Main processing ===\n",
|
| 309 |
+
"def process_csv(csv_path):\n",
|
| 310 |
+
" with open(csv_path, newline='', encoding='utf-8') as csvfile:\n",
|
| 311 |
+
" reader = csv.DictReader(csvfile)\n",
|
| 312 |
+
" for index, row in enumerate(reader):\n",
|
| 313 |
+
" # Collect up to 20 version IDs; use the most recent\n",
|
| 314 |
+
" version_ids = []\n",
|
| 315 |
+
" for i in range(1, 21):\n",
|
| 316 |
+
" k = f'version_id_{i}'\n",
|
| 317 |
+
" if k in row and row[k]:\n",
|
| 318 |
+
" try:\n",
|
| 319 |
+
" version_ids.append(int(float(row[k])))\n",
|
| 320 |
+
" except ValueError:\n",
|
| 321 |
+
" print(f\"Invalid version_id value '{row[k]}' in row: {row}\")\n",
|
| 322 |
+
"\n",
|
| 323 |
+
" if not version_ids:\n",
|
| 324 |
+
" print(f\"No valid version IDs found in row: {row}\")\n",
|
| 325 |
+
" continue\n",
|
| 326 |
+
"\n",
|
| 327 |
+
" most_recent_version_id = str(max(version_ids))\n",
|
| 328 |
+
" name = row.get('name', 'unknown')\n",
|
| 329 |
+
" sanitized_name = safe_filename(name, max_length=100)\n",
|
| 330 |
+
" new_json_file = os.path.join(\n",
|
| 331 |
+
" output_json_dir,\n",
|
| 332 |
+
" f\"{index:08d}_{most_recent_version_id}_{sanitized_name}.json\"\n",
|
| 333 |
+
" )\n",
|
| 334 |
+
"\n",
|
| 335 |
+
" # Skip if JSON already exists\n",
|
| 336 |
+
" if os.path.exists(new_json_file):\n",
|
| 337 |
+
" #print(f\"↩️ Skipping versionID {most_recent_version_id} (JSON already exists)\")\n",
|
| 338 |
+
" continue\n",
|
| 339 |
+
"\n",
|
| 340 |
+
" try:\n",
|
| 341 |
+
" adapter_file, fname = download_file(\n",
|
| 342 |
+
" row['downloadUrl'], str(temp_dir),\n",
|
| 343 |
+
" timeout=30, overall_timeout=180\n",
|
| 344 |
+
" )\n",
|
| 345 |
+
" metadata = parse_safetensors(adapter_file)\n",
|
| 346 |
+
"\n",
|
| 347 |
+
" civitaidata = {\n",
|
| 348 |
+
" k: (int(v) if str(v).isdigit() else v)\n",
|
| 349 |
+
" for k, v in row.items()\n",
|
| 350 |
+
" }\n",
|
| 351 |
+
" new_json_data = {\n",
|
| 352 |
+
" \"civitaidata\": civitaidata,\n",
|
| 353 |
+
" \"metadata\": metadata,\n",
|
| 354 |
+
" \"versionID\": most_recent_version_id\n",
|
| 355 |
+
" }\n",
|
| 356 |
+
" save_json(new_json_data, new_json_file)\n",
|
| 357 |
+
" #print(f\"✅ Created JSON for versionID {most_recent_version_id} with file {fname}\")\n",
|
| 358 |
+
"\n",
|
| 359 |
+
" except FileNotFoundError as e:\n",
|
| 360 |
+
" print(f\"⚠️ {e} — saving empty metadata.\")\n",
|
| 361 |
+
" civitaidata = {\n",
|
| 362 |
+
" k: (int(v) if str(v).isdigit() else v)\n",
|
| 363 |
+
" for k, v in row.items()\n",
|
| 364 |
+
" }\n",
|
| 365 |
+
" empty_json = {\n",
|
| 366 |
+
" \"civitaidata\": civitaidata,\n",
|
| 367 |
+
" \"metadata\": {},\n",
|
| 368 |
+
" \"versionID\": most_recent_version_id,\n",
|
| 369 |
+
" \"error\": \"Model not found (404)\"\n",
|
| 370 |
+
" }\n",
|
| 371 |
+
" save_json(empty_json, new_json_file)\n",
|
| 372 |
+
" except Exception as e:\n",
|
| 373 |
+
" print(f\"⚠️ Error processing versionID {most_recent_version_id}: {e}\")\n",
|
| 374 |
+
" civitaidata = {\n",
|
| 375 |
+
" k: (int(v) if str(v).isdigit() else v)\n",
|
| 376 |
+
" for k, v in row.items()\n",
|
| 377 |
+
" }\n",
|
| 378 |
+
" empty_json = {\n",
|
| 379 |
+
" \"civitaidata\": civitaidata,\n",
|
| 380 |
+
" \"metadata\": {},\n",
|
| 381 |
+
" \"versionID\": most_recent_version_id,\n",
|
| 382 |
+
" \"error\": str(e)\n",
|
| 383 |
+
" }\n",
|
| 384 |
+
" save_json(empty_json, new_json_file)\n",
|
| 385 |
+
" print(f\"💾 Saved empty JSON for versionID {most_recent_version_id} due to failure.\")\n",
|
| 386 |
+
"\n",
|
| 387 |
+
"# === Run ===\n",
|
| 388 |
+
"if __name__ == \"__main__\":\n",
|
| 389 |
+
" process_csv(csv_path)\n"
|
| 390 |
+
]
|
| 391 |
+
}
|
| 392 |
+
],
|
| 393 |
+
"metadata": {
|
| 394 |
+
"kernelspec": {
|
| 395 |
+
"display_name": "Python 3 (ipykernel)",
|
| 396 |
+
"language": "python",
|
| 397 |
+
"name": "python3"
|
| 398 |
+
},
|
| 399 |
+
"language_info": {
|
| 400 |
+
"codemirror_mode": {
|
| 401 |
+
"name": "ipython",
|
| 402 |
+
"version": 3
|
| 403 |
+
},
|
| 404 |
+
"file_extension": ".py",
|
| 405 |
+
"mimetype": "text/x-python",
|
| 406 |
+
"name": "python",
|
| 407 |
+
"nbconvert_exporter": "python",
|
| 408 |
+
"pygments_lexer": "ipython3",
|
| 409 |
+
"version": "3.13.9"
|
| 410 |
+
}
|
| 411 |
+
},
|
| 412 |
+
"nbformat": 4,
|
| 413 |
+
"nbformat_minor": 5
|
| 414 |
+
}
|
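Note on parse_safetensors in the notebook above: it loads the whole adapter file into memory before slicing out the header. The sketch below is a hedged alternative, not the author's method. It reads only the 8-byte length prefix and the JSON header that follows. The names read_safetensors_metadata, write_demo_file and MAX_HEADER_BYTES are hypothetical, and the demo file is a stand-in rather than a valid model file.

```python
import json
import struct
import tempfile
from pathlib import Path

# Hypothetical sanity cap on header size; not taken from the notebook.
MAX_HEADER_BYTES = 100 * 1024 * 1024


def read_safetensors_metadata(file_path):
    """Return the `__metadata__` dict from a .safetensors header, or {} if absent."""
    with open(file_path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) < 8:
            raise ValueError("File too short to contain a safetensors header")
        # 8-byte little-endian unsigned integer: length of the JSON header.
        (header_len,) = struct.unpack("<Q", prefix)
        if header_len > MAX_HEADER_BYTES:
            raise ValueError(f"Implausible header length: {header_len}")
        # Read only the header; the tensor data after it is never loaded.
        header_bytes = f.read(header_len)
    if len(header_bytes) < header_len:
        raise ValueError("Truncated header")
    header = json.loads(header_bytes.decode("utf-8"))
    return header.get("__metadata__", {})


def write_demo_file(path, header_dict):
    """Write a stand-in file with a safetensors-style prefix and header (demo only)."""
    header = json.dumps(header_dict).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)))
        f.write(header)
        f.write(b"\x00" * 16)  # placeholder bytes where tensor data would sit


if __name__ == "__main__":
    demo_path = Path(tempfile.mkdtemp()) / "demo.safetensors"
    write_demo_file(demo_path, {"__metadata__": {"ss_network_dim": "32"}})
    print(read_safetensors_metadata(demo_path))
```

Unlike the notebook's version, this sketch raises on malformed input instead of returning {}. A caller that wants the tolerant behaviour can wrap the call in try/except and fall back to an empty dict.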
jupyter_notebooks/Section_2-4_Figure_9_extract_LoRA_metadata.ipynb
ADDED
|
@@ -0,0 +1,557 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "781346f0-fab6-45f9-9193-dd547f201864",
|
| 6 |
+
"metadata": {
|
| 7 |
+
"execution": {
|
| 8 |
+
"iopub.execute_input": "2025-02-08T11:14:38.839305Z",
|
| 9 |
+
"iopub.status.busy": "2025-02-08T11:14:38.837963Z",
|
| 10 |
+
"iopub.status.idle": "2025-02-08T11:14:38.845161Z",
|
| 11 |
+
"shell.execute_reply": "2025-02-08T11:14:38.844121Z",
|
| 12 |
+
"shell.execute_reply.started": "2025-02-08T11:14:38.839221Z"
|
| 13 |
+
}
|
| 14 |
+
},
|
| 15 |
+
"source": [
|
| 16 |
+
"# Section 3-4: LoRA metadata"
|
| 17 |
+
]
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"cell_type": "markdown",
|
| 21 |
+
"id": "90b42e45-31bc-4cbd-b820-cf2daccbd712",
|
| 22 |
+
"metadata": {
|
| 23 |
+
"execution": {
|
| 24 |
+
"iopub.execute_input": "2025-02-08T21:10:00.955922Z",
|
| 25 |
+
"iopub.status.busy": "2025-02-08T21:10:00.955285Z",
|
| 26 |
+
"iopub.status.idle": "2025-02-08T21:10:00.959160Z",
|
| 27 |
+
"shell.execute_reply": "2025-02-08T21:10:00.958709Z",
|
| 28 |
+
"shell.execute_reply.started": "2025-02-08T21:10:00.955902Z"
|
| 29 |
+
}
|
| 30 |
+
},
|
| 31 |
+
"source": [
|
| 32 |
+
"### How are models trained? \n",
|
| 33 |
+
"### What are the most common training tags? \n",
|
| 34 |
+
"### What tagging systems are used?"
|
| 35 |
+
]
|
| 36 |
+
},
|
| 37 |
+
{
|
| 38 |
+
"cell_type": "raw",
|
| 39 |
+
"id": "79b10960-c7c6-4aa6-b91e-f0caf5c2a7dd",
|
| 40 |
+
"metadata": {},
|
| 41 |
+
"source": [
|
| 42 |
+
"LoRA Metadata Processing Workflow\n",
|
| 43 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 44 |
+
"│ Load CSV File │ --> │ Read adapter metadata CSV file. │\n",
|
| 45 |
+
"│ Read Model Versions │ │ Extract model version IDs and relevant data. │\n",
|
| 46 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 47 |
+
" ↓\n",
|
| 48 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 49 |
+
"│ Download Adapter │ --> │ Use stored download URLs to fetch adapter files │\n",
|
| 50 |
+
"│ Files Using API │ │ using rotating API keys. │\n",
|
| 51 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 52 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 53 |
+
"│ Parse Metadata │ --> │ Extract safetensors metadata, such as training │\n",
|
| 54 |
+
"│ from SafeTensor │ │ images, model type, and architecture. │\n",
|
| 55 |
+
"│ Files │ │ │\n",
|
| 56 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 57 |
+
" ↓\n",
|
| 58 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 59 |
+
"│ Store Parsed │ --> │ Save extracted metadata into structured JSON │\n",
|
| 60 |
+
"│ Metadata as JSON │ │ files for later analysis. │\n",
|
| 61 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 62 |
+
" ↓\n",
|
| 63 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 64 |
+
"│ Process JSON Files │ --> │ Read saved JSON metadata, extract relevant │\n",
|
| 65 |
+
"│ for Consolidation │ │ details, and filter necessary attributes. │\n",
|
| 66 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 67 |
+
" ↓\n",
|
| 68 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 69 |
+
"│ Extract Training │ --> │ Identify most frequent training tags, architectures│\n",
|
| 70 |
+
"│ Tags & Model Info │ │ and systems used for model creation. │\n",
|
| 71 |
+
"└─────────┬────────────┘ └───────────────────────────────────────────────────┘\n",
|
| 72 |
+
" ↓\n",
|
| 73 |
+
"┌──────────────────────┐ ┌───────────────────────────────────────────────────┐\n",
|
| 74 |
+
"│ Save Consolidated │ --> │ Store all processed metadata in a structured CSV │\n",
|
| 75 |
+
"│ Metadata to CSV │ │ format for final analysis. │\n",
|
| 76 |
+
"└──────────────────────┘ └───────────────────────────────────────────────────┘\n"
|
| 77 |
+
]
|
| 78 |
+
},
|
| 79 |
+
{
|
| 80 |
+
"cell_type": "markdown",
|
| 81 |
+
"id": "fd202dbd-fc0f-45f1-9d88-94c7d3635207",
|
| 82 |
+
"metadata": {
|
| 83 |
+
"execution": {
|
| 84 |
+
"iopub.execute_input": "2025-02-08T20:54:00.478266Z",
|
| 85 |
+
"iopub.status.busy": "2025-02-08T20:54:00.477080Z",
|
| 86 |
+
"iopub.status.idle": "2025-02-08T20:54:00.481464Z",
|
| 87 |
+
"shell.execute_reply": "2025-02-08T20:54:00.480830Z",
|
| 88 |
+
"shell.execute_reply.started": "2025-02-08T20:54:00.478244Z"
|
| 89 |
+
}
|
| 90 |
+
},
|
| 91 |
+
"source": [
|
| 92 |
+
"### Define Common Paths, Fonts etc."
|
| 93 |
+
]
|
| 94 |
+
},
|
| 95 |
+
{
|
| 96 |
+
"cell_type": "code",
|
| 97 |
+
"execution_count": 24,
|
| 98 |
+
"id": "9da03b00",
|
| 99 |
+
"metadata": {
|
| 100 |
+
"execution": {
|
| 101 |
+
"iopub.execute_input": "2025-02-08T21:11:15.085872Z",
|
| 102 |
+
"iopub.status.busy": "2025-02-08T21:11:15.084865Z",
|
| 103 |
+
"iopub.status.idle": "2025-02-08T21:11:15.117479Z",
|
| 104 |
+
"shell.execute_reply": "2025-02-08T21:11:15.116955Z",
|
| 105 |
+
"shell.execute_reply.started": "2025-02-08T21:11:15.085846Z"
|
| 106 |
+
}
|
| 107 |
+
},
|
| 108 |
+
"outputs": [
|
| 109 |
+
{
|
| 110 |
+
"name": "stdout",
|
| 111 |
+
"output_type": "stream",
|
| 112 |
+
"text": [
|
| 113 |
+
"Paths and fonts initialized successfully.\n",
|
| 114 |
+
"Paths initialized successfully.\n"
|
| 115 |
+
]
|
| 116 |
+
}
|
| 117 |
+
],
|
| 118 |
+
"source": [
|
| 119 |
+
"import os\n",
|
| 120 |
+
"import re\n",
|
| 121 |
+
"import json\n",
|
| 122 |
+
"import csv\n",
|
| 123 |
+
"import struct\n",
|
| 124 |
+
"import requests\n",
|
| 125 |
+
"from pathlib import Path\n",
|
| 126 |
+
"import pandas as pd\n",
|
| 127 |
+
"from collections import Counter\n",
|
| 128 |
+
"from concurrent.futures import ProcessPoolExecutor\n",
|
| 129 |
+
"from pathlib import Path\n",
|
| 130 |
+
"import matplotlib.pyplot as plt\n",
|
| 131 |
+
"from matplotlib.font_manager import FontProperties\n",
|
| 132 |
+
"from matplotlib import font_manager\n",
|
| 133 |
+
"import pandas as pd\n",
|
| 134 |
+
"from collections import Counter\n",
|
| 135 |
+
"from concurrent.futures import ProcessPoolExecutor\n",
|
| 136 |
+
"\n",
|
| 137 |
+
"# Define the current directory and important file paths\n",
|
| 138 |
+
"current_dir = Path.cwd()\n",
|
| 139 |
+
"\n",
|
| 140 |
+
"# Define frequently used directories\n",
|
| 141 |
+
"data_dir = current_dir.parent / 'data/CSV'\n",
|
| 142 |
+
"fonts_dir = current_dir.parent / 'misc/assets/fonts'\n",
|
| 143 |
+
"plots_dir = current_dir.parent / 'plots'\n",
|
| 144 |
+
"raw_data_dir = current_dir.parent / 'data/raw/adapter_metadata'\n",
|
| 145 |
+
"temp_dir = current_dir.parent / 'data/models/adapter_temp'\n",
|
| 146 |
+
"misc_dir = current_dir.parent / 'misc'\n",
|
| 147 |
+
"\n",
|
| 148 |
+
"# File paths\n",
|
| 149 |
+
"adapters_csv = data_dir / 'model_subsets/Civiverse_adapters_poi_false.csv'\n",
|
| 150 |
+
"output_json_dir = raw_data_dir\n",
|
| 151 |
+
"api_keys_file = misc_dir / 'api_keys.txt'\n",
|
| 152 |
+
"\n",
|
| 153 |
+
"# Ensure directories exist\n",
|
| 154 |
+
"os.makedirs(output_json_dir, exist_ok=True)\n",
|
| 155 |
+
"os.makedirs(temp_dir, exist_ok=True)\n",
|
| 156 |
+
"\n",
|
| 157 |
+
"# Font paths for Matplotlib\n",
|
| 158 |
+
"font_paths = [\n",
|
| 159 |
+
" fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-ExtraBold.ttf',\n",
|
| 160 |
+
" fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-Bold.ttf',\n",
|
| 161 |
+
" fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-ExtraLight.ttf',\n",
|
| 162 |
+
" fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-Light.ttf',\n",
|
| 163 |
+
" fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-Medium.ttf',\n",
|
| 164 |
+
" fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-SemiBold.ttf',\n",
|
| 165 |
+
" fonts_dir / 'Noto_Sans_JP/static/NotoSansJP-Regular.ttf',\n",
|
| 166 |
+
" fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-ExtraBold.ttf',\n",
|
| 167 |
+
" fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-Bold.ttf',\n",
|
| 168 |
+
" fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-ExtraLight.ttf',\n",
|
| 169 |
+
" fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-Light.ttf',\n",
|
| 170 |
+
" fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-Medium.ttf',\n",
|
| 171 |
+
" fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-SemiBold.ttf',\n",
|
| 172 |
+
" fonts_dir / 'Noto_Sans_SC/static/NotoSansSC-Regular.ttf'\n",
|
| 173 |
+
"]\n",
|
| 174 |
+
"\n",
|
| 175 |
+
"# Load fonts into Matplotlib\n",
|
| 176 |
+
"for font_path in font_paths:\n",
|
| 177 |
+
" font_manager.fontManager.addfont(font_path)\n",
|
| 178 |
+
"\n",
|
| 179 |
+
"# Set default font family for plots\n",
|
| 180 |
+
"plt.rcParams['font.family'] = ['Noto Sans JP', 'Noto Sans SC', 'sans-serif']\n",
|
| 181 |
+
"\n",
|
| 182 |
+
"print('Paths and fonts initialized successfully.')\n",
|
| 183 |
+
"\n",
|
| 184 |
+
"print('Paths initialized successfully.')"
|
| 185 |
+
]
|
| 186 |
+
},
|
| 187 |
+
{
|
| 188 |
+
"cell_type": "markdown",
|
| 189 |
+
"id": "9c6b84ec-c23e-4047-a288-cfab2f5d8fdb",
|
| 190 |
+
"metadata": {},
|
| 191 |
+
"source": [
|
| 192 |
+
"## Step 1: Download LoRA and extract *.safetensors metadata"
|
| 193 |
+
]
|
| 194 |
+
},
|
| 195 |
+
{
|
| 196 |
+
"cell_type": "markdown",
|
| 197 |
+
"id": "e484c678-70a8-4b70-b0aa-9ce69414c43b",
|
| 198 |
+
"metadata": {},
|
| 199 |
+
"source": [
|
| 200 |
+
"This script downloads LoRA adapters from the filtered Civiverse-Models dataset and extracts the metadata embedded in each *.safetensors file."
|
| 201 |
+
]
|
| 202 |
+
},
|
| 203 |
+
{
|
| 204 |
+
"cell_type": "code",
|
| 205 |
+
"execution_count": 3,
|
| 206 |
+
"id": "ecf31f5c",
|
| 207 |
+
"metadata": {
|
| 208 |
+
"execution": {
|
| 209 |
+
"iopub.execute_input": "2025-02-08T20:49:20.487791Z",
|
| 210 |
+
"iopub.status.busy": "2025-02-08T20:49:20.487464Z",
|
| 211 |
+
"iopub.status.idle": "2025-02-08T20:49:20.499881Z",
|
| 212 |
+
"shell.execute_reply": "2025-02-08T20:49:20.499471Z",
|
| 213 |
+
"shell.execute_reply.started": "2025-02-08T20:49:20.487771Z"
|
| 214 |
+
}
|
| 215 |
+
},
|
| 216 |
+
"outputs": [],
|
| 217 |
+
"source": [
|
| 218 |
+
"output_json_dir = output_json_dir\n",
|
| 219 |
+
"temp_dir = temp_dir # downloaded adapter files can be deleted afterwards to save disk space\n",
|
| 220 |
+
"api_karussell = api_keys_file\n",
|
| 221 |
+
"os.makedirs(output_json_dir, exist_ok=True)\n",
|
| 222 |
+
"os.makedirs(temp_dir, exist_ok=True)\n",
|
| 223 |
+
"\n",
|
| 224 |
+
"# Load API keys\n",
|
| 225 |
+
"def load_api_keys(api_path):\n",
|
| 226 |
+
" if not os.path.exists(api_path):\n",
|
| 227 |
+
" raise FileNotFoundError(f\"API keys file does not exist: {api_path}\")\n",
|
| 228 |
+
" with open(api_path, 'r') as file:\n",
|
| 229 |
+
" return [line.strip() for line in file if line.strip()]\n",
|
| 230 |
+
"\n",
|
| 231 |
+
"api_keys = load_api_keys(api_karussell) # Replace with the path to your API keys file\n",
|
| 232 |
+
"current_key_index = 0\n",
|
| 233 |
+
"\n",
|
| 234 |
+
"def get_headers():\n",
|
| 235 |
+
" \"\"\"Generate request headers with the current API key.\"\"\"\n",
|
| 236 |
+
" global current_key_index\n",
|
| 237 |
+
" return {\n",
|
| 238 |
+
" \"Accept\": \"application/json\",\n",
|
| 239 |
+
" \"Authorization\": f\"Bearer {api_keys[current_key_index]}\"\n",
|
| 240 |
+
" }\n",
|
| 241 |
+
"\n",
|
| 242 |
+
"def rotate_api_key():\n",
|
| 243 |
+
" \"\"\"Rotate to the next API key.\"\"\"\n",
|
| 244 |
+
" global current_key_index\n",
|
| 245 |
+
" if current_key_index < len(api_keys) - 1:\n",
|
| 246 |
+
" current_key_index += 1\n",
|
| 247 |
+
" else:\n",
|
| 248 |
+
" raise Exception(\"All API keys have been exhausted.\")\n",
|
| 249 |
+
"\n",
|
| 250 |
+
"# Function to parse .safetensors metadata\n",
|
| 251 |
+
"def parse_safetensors(file_path):\n",
|
| 252 |
+
" try:\n",
|
| 253 |
+
" with open(file_path, 'rb') as f:\n",
|
| 254 |
+
" file_data = f.read()\n",
|
| 255 |
+
" metadata_size = struct.unpack('<Q', file_data[:8])[0]\n",
|
| 256 |
+
" metadata_bytes = file_data[8:8 + metadata_size]\n",
|
| 257 |
+
" metadata_str = metadata_bytes.decode('utf-8')\n",
|
| 258 |
+
" metadata = json.loads(metadata_str)\n",
|
| 259 |
+
" return metadata.get('__metadata__', {})\n",
|
| 260 |
+
" except Exception as e:\n",
|
| 261 |
+
" print(f\"Error parsing safetensors file: {e}\")\n",
|
| 262 |
+
" return {}\n",
|
| 263 |
+
"\n",
|
| 264 |
+
"def save_json(data, filename):\n",
|
| 265 |
+
" with open(filename, 'w') as f:\n",
|
| 266 |
+
" json.dump(data, f, indent=4)\n",
|
| 267 |
+
"\n",
|
| 268 |
+
"def download_file(url, output_folder):\n",
|
| 269 |
+
" \"\"\"Download file with API key rotation.\"\"\"\n",
|
| 270 |
+
" filename = url.split(\"/\")[-1]\n",
|
| 271 |
+
" output_path = os.path.join(output_folder, filename)\n",
|
| 272 |
+
"\n",
|
| 273 |
+
" global current_key_index\n",
|
| 274 |
+
" while current_key_index < len(api_keys):\n",
|
| 275 |
+
" try:\n",
|
| 276 |
+
" response = requests.get(url, headers=get_headers(), stream=True)\n",
|
| 277 |
+
" if response.status_code == 401: # Unauthorized\n",
|
| 278 |
+
" print(f\"API key {current_key_index + 1} failed. Trying next key.\")\n",
|
| 279 |
+
" rotate_api_key()\n",
|
| 280 |
+
" continue\n",
|
| 281 |
+
" elif response.status_code == 403: # Forbidden\n",
|
| 282 |
+
" print(f\"Access forbidden for API key {current_key_index + 1}. Rotating to the next key.\")\n",
|
| 283 |
+
" rotate_api_key()\n",
|
| 284 |
+
" continue\n",
|
| 285 |
+
" response.raise_for_status()\n",
|
| 286 |
+
"\n",
|
| 287 |
+
" # Save the file to the specified output folder\n",
|
| 288 |
+
" with open(output_path, 'wb') as file:\n",
|
| 289 |
+
" for chunk in response.iter_content(chunk_size=8192):\n",
|
| 290 |
+
" file.write(chunk)\n",
|
| 291 |
+
" return output_path, filename\n",
|
| 292 |
+
" except requests.exceptions.RequestException as e:\n",
|
| 293 |
+
" print(f\"Error downloading file: {e}\")\n",
|
| 294 |
+
" rotate_api_key()\n",
|
| 295 |
+
" raise Exception(\"All API keys failed.\")\n",
|
| 296 |
+
"\n",
|
| 297 |
+
"def process_csv(csv_path):\n",
|
| 298 |
+
" # Read the CSV and process each row\n",
|
| 299 |
+
" with open(csv_path, newline='', encoding='utf-8') as csvfile:\n",
|
| 300 |
+
" reader = csv.DictReader(csvfile)\n",
|
| 301 |
+
" for index, row in enumerate(reader):\n",
|
| 302 |
+
" version_ids = []\n",
|
| 303 |
+
" for i in range(1, 21):\n",
|
| 304 |
+
" key = f'version_id_{i}'\n",
|
| 305 |
+
" if key in row and row[key]:\n",
|
| 306 |
+
" try:\n",
|
| 307 |
+
" version_ids.append(int(float(row[key])))\n",
|
| 308 |
+
" except ValueError:\n",
|
| 309 |
+
" print(f\"Invalid version_id value '{row[key]}' in row: {row}\")\n",
|
| 310 |
+
"\n",
|
| 311 |
+
" if not version_ids:\n",
|
| 312 |
+
" print(f\"No valid version IDs found in row: {row}\")\n",
|
| 313 |
+
" continue\n",
|
| 314 |
+
"\n",
|
| 315 |
+
" most_recent_version_id = str(max(version_ids))\n",
|
| 316 |
+
"\n",
|
| 317 |
+
" try:\n",
|
| 318 |
+
" adapter_file, filename = download_file(row['downloadUrl'], temp_dir)\n",
|
| 319 |
+
" metadata = parse_safetensors(adapter_file)\n",
|
| 320 |
+
" # Add all CSV data under 'civitaidata'\n",
|
| 321 |
+
" civitaidata = {key: int(value) if value.isdigit() else value for key, value in row.items()}\n",
|
| 322 |
+
" new_json_data = {\n",
|
| 323 |
+
" \"civitaidata\": civitaidata,\n",
|
| 324 |
+
" \"metadata\": metadata,\n",
|
| 325 |
+
" \"versionID\": most_recent_version_id\n",
|
| 326 |
+
" }\n",
|
| 327 |
+
" sanitized_name = row['name'].replace(\" \", \"_\").replace(\"/\", \"_\")\n",
|
| 328 |
+
" new_json_file = os.path.join(\n",
|
| 329 |
+
" output_json_dir,\n",
|
| 330 |
+
" f\"{index:08d}_{most_recent_version_id}_{sanitized_name}.json\"\n",
|
| 331 |
+
" )\n",
|
| 332 |
+
" save_json(new_json_data, new_json_file)\n",
|
| 333 |
+
" print(f\"Created new JSON for versionID {most_recent_version_id} with filename {filename}\")\n",
|
| 334 |
+
" except Exception as e:\n",
|
| 335 |
+
" print(f\"Error processing versionID {most_recent_version_id}: {e}\")"
|
| 336 |
+
]
|
| 337 |
+
},
|
| 338 |
+
{
|
| 339 |
+
"cell_type": "markdown",
|
| 340 |
+
"id": "708131db-95d8-4427-9397-74c72d3edf48",
|
| 341 |
+
"metadata": {},
|
| 342 |
+
"source": [
|
| 343 |
+
"Uncomment the following to download and process model adapters"
|
| 344 |
+
]
|
| 345 |
+
},
|
| 346 |
+
{
|
| 347 |
+
"cell_type": "code",
|
| 348 |
+
"execution_count": 4,
|
| 349 |
+
"id": "51ce1943-e509-442d-8b92-1a67bc686471",
|
| 350 |
+
"metadata": {
|
| 351 |
+
"execution": {
|
| 352 |
+
"iopub.execute_input": "2025-02-08T20:49:21.547880Z",
|
| 353 |
+
"iopub.status.busy": "2025-02-08T20:49:21.547383Z",
|
| 354 |
+
"iopub.status.idle": "2025-02-08T20:49:21.550529Z",
|
| 355 |
+
"shell.execute_reply": "2025-02-08T20:49:21.550087Z",
|
| 356 |
+
"shell.execute_reply.started": "2025-02-08T20:49:21.547860Z"
|
| 357 |
+
}
|
| 358 |
+
},
|
| 359 |
+
"outputs": [],
|
| 360 |
+
"source": [
|
| 361 |
+
"csv_path = adapters_csv \n",
|
| 362 |
+
"#process_csv(csv_path) #### UNCOMMENT HERE"
|
| 363 |
+
]
|
| 364 |
+
},
|
| 365 |
+
{
|
| 366 |
+
"cell_type": "markdown",
|
| 367 |
+
"id": "4c2c52ee-83ad-4269-a7bf-16225b71f427",
|
| 368 |
+
"metadata": {},
|
| 369 |
+
"source": [
|
| 370 |
+
"## Step 2: Accumulate training tags and compare with auto-tagging vocabulary"
|
| 371 |
+
]
|
| 372 |
+
},
|
| 373 |
+
{
|
| 374 |
+
"cell_type": "code",
|
| 375 |
+
"execution_count": 5,
|
| 376 |
+
"id": "6de2d9a6",
|
| 377 |
+
"metadata": {
|
| 378 |
+
"execution": {
|
| 379 |
+
"iopub.execute_input": "2025-02-08T20:49:31.218029Z",
|
| 380 |
+
"iopub.status.busy": "2025-02-08T20:49:31.217597Z",
|
| 381 |
+
"iopub.status.idle": "2025-02-08T20:49:32.646108Z",
|
| 382 |
+
"shell.execute_reply": "2025-02-08T20:49:32.645340Z",
|
| 383 |
+
"shell.execute_reply.started": "2025-02-08T20:49:31.218005Z"
|
| 384 |
+
}
|
| 385 |
+
},
|
| 386 |
+
"outputs": [
|
| 387 |
+
{
|
| 388 |
+
"name": "stdout",
|
| 389 |
+
"output_type": "stream",
|
| 390 |
+
"text": [
|
| 391 |
+
"Processed JSON files and summary saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/model_subsets/Section_6-5/Lora_metadata.csv\n",
|
| 392 |
+
"Tag summary saved to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/CSV/model_subsets/Section_6-5/LoRA_training_tags_acc.csv\n"
|
| 393 |
+
]
|
| 394 |
+
}
|
| 395 |
+
],
|
| 396 |
+
"source": [
|
| 397 |
+
"interrogator_vocab = current_dir.parent / 'misc/autotagging-vocabularies/clip_interrogator.csv'\n",
|
| 398 |
+
"deepbooru_vocab = current_dir.parent / 'misc/autotagging-vocabularies/deepbooru_tags.txt'\n",
|
| 399 |
+
"\n",
|
| 400 |
+
"\n",
|
| 401 |
+
"def load_tags():\n",
|
| 402 |
+
" try:\n",
|
| 403 |
+
" clip_interrogator_tags = pd.read_csv(\n",
|
| 404 |
+
" interrogator_vocab, header=None, usecols=[0], on_bad_lines='skip'\n",
|
| 405 |
+
" )[0].tolist()\n",
|
| 406 |
+
" except pd.errors.ParserError as e:\n",
|
| 407 |
+
" print(f\"Error reading 'clip_interrogator.csv': {e}\")\n",
|
| 408 |
+
" clip_interrogator_tags = []\n",
|
| 409 |
+
"\n",
|
| 410 |
+
" try:\n",
|
| 411 |
+
" with open(deepbooru_vocab, 'r', encoding='utf-8') as file:\n",
|
| 412 |
+
" deepbooru_tags = {line.strip() for line in file}\n",
|
| 413 |
+
" except FileNotFoundError as e:\n",
|
| 414 |
+
" print(f\"Error reading 'deepbooru_tags.txt': {e}\")\n",
|
| 415 |
+
" deepbooru_tags = set()\n",
|
| 416 |
+
"\n",
|
| 417 |
+
" return clip_interrogator_tags, deepbooru_tags\n",
|
| 418 |
+
"\n",
|
| 419 |
+
"clip_interrogator_tags, deepbooru_tags = load_tags()\n",
|
| 420 |
+
"\n",
|
| 421 |
+
"# Determines tagging system based on tag matches\n",
|
| 422 |
+
"def determine_tagging_system(tags):\n",
|
| 423 |
+
" clip_interrogator_matches = sum(1 for tag in tags if tag.strip() in clip_interrogator_tags)\n",
|
| 424 |
+
" deepbooru_matches = sum(1 for tag in tags if tag.strip() in deepbooru_tags or tag.strip().replace(' ', '_') in deepbooru_tags)\n",
|
| 425 |
+
" \n",
|
| 426 |
+
" if deepbooru_matches > clip_interrogator_matches:\n",
|
| 427 |
+
" return 'deepbooru'\n",
|
| 428 |
+
" elif clip_interrogator_matches > deepbooru_matches:\n",
|
| 429 |
+
" return 'clip-interrogator'\n",
|
| 430 |
+
" elif clip_interrogator_matches == 0 and deepbooru_matches == 0 and tags:\n",
|
| 431 |
+
" return 'other'\n",
|
| 432 |
+
" else:\n",
|
| 433 |
+
" return 'no tag metadata'\n",
|
| 434 |
+
"\n",
|
| 435 |
+
"def process_json_file(file_path):\n",
|
| 436 |
+
" try:\n",
|
| 437 |
+
" with open(file_path, 'r', encoding='utf-8') as file:\n",
|
| 438 |
+
" data = json.load(file)\n",
|
| 439 |
+
" except (json.JSONDecodeError, IOError) as e:\n",
|
| 440 |
+
" print(f\"Error processing file {file_path}: {e}\")\n",
|
| 441 |
+
" return None\n",
|
| 442 |
+
"\n",
|
| 443 |
+
" # Handle ss_tag_frequency - Ensure it's parsed correctly\n",
|
| 444 |
+
" tag_frequency_raw = data.get('metadata', {}).get('ss_tag_frequency', {})\n",
|
| 445 |
+
" if isinstance(tag_frequency_raw, str):\n",
|
| 446 |
+
" try:\n",
|
| 447 |
+
" tag_frequency_raw = json.loads(tag_frequency_raw)\n",
|
| 448 |
+
" except json.JSONDecodeError:\n",
|
| 449 |
+
" print(f\"Error decoding ss_tag_frequency in {file_path}\")\n",
|
| 450 |
+
" return None # Skip files with unreadable tag data\n",
|
| 451 |
+
"\n",
|
| 452 |
+
" if not isinstance(tag_frequency_raw, dict) or not tag_frequency_raw:\n",
|
| 453 |
+
" return None # Skip files with empty or nonexistent ss_tag_frequency\n",
|
| 454 |
+
"\n",
|
| 455 |
+
" filename = os.path.basename(file_path).replace('.json', '')\n",
|
| 456 |
+
" modelspec_title = data.get('metadata', {}).get('ss_output_name', '') # Using ss_output_name instead\n",
|
| 457 |
+
" modelspec_architecture = data.get('metadata', {}).get('ss_network_module', '')\n",
|
| 458 |
+
" ss_num_train_images = data.get('metadata', {}).get('ss_num_train_images', 0)\n",
|
| 459 |
+
" ss_steps = data.get('metadata', {}).get('ss_steps', 0)\n",
|
| 460 |
+
" ss_sd_model_name = data.get('metadata', {}).get('ss_sd_model_name', '')\n",
|
| 461 |
+
"\n",
|
| 462 |
+
" # Extract first-level tag frequencies\n",
|
| 463 |
+
" tag_frequency = next(iter(tag_frequency_raw.values()), {}) # Get the first nested dict\n",
|
| 464 |
+
" tags = list(tag_frequency.keys())\n",
|
| 465 |
+
" training_system = determine_tagging_system(tags) if tags else 'undetermined'\n",
|
| 466 |
+
"\n",
|
| 467 |
+
" tag_items = list(tag_frequency.items())[:20]\n",
|
| 468 |
+
" tag_data = {f'tag{i+1:02d}': tag_items[i][0] if i < len(tag_items) else None for i in range(20)}\n",
|
| 469 |
+
" tag_data.update({f'tag{i+1:02d}_no': tag_items[i][1] if i < len(tag_items) else None for i in range(20)})\n",
|
| 470 |
+
"\n",
|
| 471 |
+
" row = {\n",
|
| 472 |
+
" 'filename': filename,\n",
|
| 473 |
+
" 'modelspec_title': modelspec_title,\n",
|
| 474 |
+
" 'modelspec_architecture': modelspec_architecture,\n",
|
| 475 |
+
" 'ss_num_train_images': ss_num_train_images,\n",
|
| 476 |
+
" 'ss_steps': ss_steps,\n",
|
| 477 |
+
" 'ss_sd_model_name': ss_sd_model_name,\n",
|
| 478 |
+
" 'training_system': training_system\n",
|
| 479 |
+
" }\n",
|
| 480 |
+
" row.update(tag_data)\n",
|
| 481 |
+
"\n",
|
| 482 |
+
" return row, tag_frequency\n",
|
| 483 |
+
"\n",
|
| 484 |
+
"def parallel_process_files(folder_path):\n",
|
| 485 |
+
" rows = []\n",
|
| 486 |
+
" tag_occurrences = Counter()\n",
|
| 487 |
+
" json_files = [os.path.join(root, file) for root, _, files in os.walk(folder_path) for file in files if file.endswith('.json')]\n",
|
| 488 |
+
"\n",
|
| 489 |
+
" with ProcessPoolExecutor(max_workers=4) as executor:\n",
|
| 490 |
+
" results = executor.map(process_json_file, json_files, chunksize=1) # Using small chunksize for debugging\n",
|
| 491 |
+
"\n",
|
| 492 |
+
" for result in results:\n",
|
| 493 |
+
" if result:\n",
|
| 494 |
+
" row, tag_frequency = result\n",
|
| 495 |
+
" rows.append(row)\n",
|
| 496 |
+
" tag_occurrences.update(tag_frequency)\n",
|
| 497 |
+
"\n",
|
| 498 |
+
" total_tags_count = sum(tag_occurrences.values())\n",
|
| 499 |
+
" unique_tags_count = len(tag_occurrences)\n",
|
| 500 |
+
"\n",
|
| 501 |
+
" return rows, total_tags_count, unique_tags_count, tag_occurrences\n",
|
| 502 |
+
"\n",
|
| 503 |
+
"def write_summary_and_csv(rows, total_tags_count, unique_tags_count, tag_occurrences, output_file, tag_summary_file):\n",
|
| 504 |
+
" if not rows:\n",
|
| 505 |
+
" print(\"No valid data to write to CSV.\")\n",
|
| 506 |
+
" return\n",
|
| 507 |
+
"\n",
|
| 508 |
+
" df = pd.DataFrame(rows)\n",
|
| 509 |
+
" df.to_csv(output_file, index=False, encoding='utf-8')\n",
|
| 510 |
+
"\n",
|
| 511 |
+
" tag_summary_df = pd.DataFrame(list(tag_occurrences.items()), columns=['Tag (ss_tag_frequency/tag_frequency)', 'No. of Occurrences'])\n",
|
| 512 |
+
" tag_summary_df = tag_summary_df.sort_values(by='No. of Occurrences', ascending=False)\n",
|
| 513 |
+
" tag_summary_df.to_csv(tag_summary_file, index=False, encoding='utf-8')\n",
|
| 514 |
+
"\n",
|
| 515 |
+
" print(f\"Processed JSON files and summary saved to {output_file}\")\n",
|
| 516 |
+
" print(f\"Tag summary saved to {tag_summary_file}\")\n",
|
| 517 |
+
"\n",
|
| 518 |
+
"\n",
|
| 519 |
+
"def main():\n",
|
| 520 |
+
" folder_path = '/home/lauwag/shares/laura_wagner/Civitai_page_analysis/Civitai_visualizations/data/raw/adapter_metadata' # Path to your JSON files\n",
|
| 521 |
+
" \n",
|
| 522 |
+
" output_dir = current_dir.parent / 'data/CSV/model_subsets/Section_6-5/'\n",
|
| 523 |
+
" os.makedirs(output_dir, exist_ok=True) # Ensure the directory exists\n",
|
| 524 |
+
"\n",
|
| 525 |
+
" output_file = output_dir / 'Lora_metadata.csv'\n",
|
| 526 |
+
" tag_summary_file = output_dir / 'LoRA_training_tags_acc.csv'\n",
|
| 527 |
+
"\n",
|
| 528 |
+
" rows, total_tags_count, unique_tags_count, tag_occurrences = parallel_process_files(folder_path)\n",
|
| 529 |
+
" write_summary_and_csv(rows, total_tags_count, unique_tags_count, tag_occurrences, output_file, tag_summary_file)\n",
|
| 530 |
+
"\n",
|
| 531 |
+
"if __name__ == \"__main__\":\n",
|
| 532 |
+
" main()\n"
|
| 533 |
+
]
|
| 534 |
+
}
|
| 535 |
+
],
|
| 536 |
+
"metadata": {
|
| 537 |
+
"kernelspec": {
|
| 538 |
+
"display_name": "Python 3 (ipykernel)",
|
| 539 |
+
"language": "python",
|
| 540 |
+
"name": "python3"
|
| 541 |
+
},
|
| 542 |
+
"language_info": {
|
| 543 |
+
"codemirror_mode": {
|
| 544 |
+
"name": "ipython",
|
| 545 |
+
"version": 3
|
| 546 |
+
},
|
| 547 |
+
"file_extension": ".py",
|
| 548 |
+
"mimetype": "text/x-python",
|
| 549 |
+
"name": "python",
|
| 550 |
+
"nbconvert_exporter": "python",
|
| 551 |
+
"pygments_lexer": "ipython3",
|
| 552 |
+
"version": "3.11.9"
|
| 553 |
+
}
|
| 554 |
+
},
|
| 555 |
+
"nbformat": 4,
|
| 556 |
+
"nbformat_minor": 5
|
| 557 |
+
}
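Note on Step 2 above: the notebook classifies each adapter by counting how many of its training tags appear in each auto-tagging vocabulary, then sums per-adapter tag frequencies into one table. The following is a minimal sketch of that idea, not the notebook's code. The two small vocabularies and the names `classify_tags` and `accumulate` are hypothetical stand-ins for the real `clip_interrogator.csv` / `deepbooru_tags.txt` contents and for `determine_tagging_system`. Sets are used here for constant-time membership tests.

```python
from collections import Counter

# Hypothetical miniature vocabularies (assumption); the notebook loads the
# real ones from clip_interrogator.csv and deepbooru_tags.txt.
CLIP_VOCAB = {"a photo of a woman", "highly detailed", "studio lighting"}
DEEPBOORU_VOCAB = {"1girl", "long_hair", "looking_at_viewer"}


def classify_tags(tags, clip_vocab=CLIP_VOCAB, booru_vocab=DEEPBOORU_VOCAB):
    """Return the label of the vocabulary that matches more of the given tags."""
    cleaned = [t.strip() for t in tags]
    clip_hits = sum(1 for t in cleaned if t in clip_vocab)
    # DeepBooru tags use underscores, so also try the underscore form.
    booru_hits = sum(
        1 for t in cleaned
        if t in booru_vocab or t.replace(" ", "_") in booru_vocab
    )
    if booru_hits > clip_hits:
        return "deepbooru"
    if clip_hits > booru_hits:
        return "clip-interrogator"
    if clip_hits == 0 and booru_hits == 0 and cleaned:
        return "other"
    # Reached for empty input and for a non-zero tie, mirroring the notebook.
    return "no tag metadata"


def accumulate(tag_frequencies):
    """Sum several {tag: count} dicts into one Counter."""
    total = Counter()
    for freq in tag_frequencies:
        total.update(freq)  # Counter.update adds counts for mapping input
    return total
```

One consequence of this rule, which the sketch shares with the notebook, is that a non-zero tie between the two vocabularies is labeled 'no tag metadata' even though tags are present.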
|
jupyter_notebooks/SuppM_Figure_13_Danbooru_categories.ipynb
ADDED
|
@@ -0,0 +1,141 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "code",
|
| 5 |
+
"execution_count": 2,
|
| 6 |
+
"metadata": {},
|
| 7 |
+
"outputs": [],
|
| 8 |
+
"source": [
|
| 9 |
+
"import json\n",
|
| 10 |
+
"from pathlib import Path\n",
|
| 11 |
+
"\n",
|
| 12 |
+
"current_dir = Path.cwd()\n",
|
| 13 |
+
"input_json = current_dir.parent / \"misc/lists/danbooru.json\"\n",
|
| 14 |
+
"output_json = current_dir.parent / \"public/json/danbooru_flat.json\"\n",
|
| 15 |
+
"\n",
|
| 16 |
+
"# ---- CATEGORY COLORS ----\n",
|
| 17 |
+
"CATEGORY_COLORS = {\n",
|
| 18 |
+
" \"attire_and_body_accessories\": \"#DC143C\",\n",
|
| 19 |
+
" \"body\": \"coral\",\n",
|
| 20 |
+
" \"characters\": \"silver\",\n",
|
| 21 |
+
" \"copyrights\": \"#264653\",\n",
|
| 22 |
+
" \"creatures\": \"silver\",\n",
|
| 23 |
+
" \"drawing software\": \"#219ebc\",\n",
|
| 24 |
+
" \"games\": \"silver\",\n",
|
| 25 |
+
" \"metatags\": \"silver\",\n",
|
| 26 |
+
" \"more\": \"silver\",\n",
|
| 27 |
+
" \"objects\": \"#6d6875\",\n",
|
| 28 |
+
" \"plant\": \"#7cb518\",\n",
|
| 29 |
+
" \"real_world\": \"#a5a58d\",\n",
|
| 30 |
+
" \"sex\": \"#ef476f\",\n",
|
| 31 |
+
" \"visual_characteristics\": \"#06d6a0\",\n",
|
| 32 |
+
" \"subject\": \"#ffd166\",\n",
|
| 33 |
+
" \"uncategorized\": \"#adb5bd\",\n",
|
| 34 |
+
" \"actions_and_expressions\": \"#d00000\",\n",
|
| 35 |
+
" \"objects_and_backgrounds\": \"#118ab2\"\n",
|
| 36 |
+
"}\n",
|
| 37 |
+
"\n",
|
| 38 |
+
"# ---- STEP 1: Load the original nested JSON ----\n",
|
| 39 |
+
"with open(input_json, \"r\", encoding=\"utf-8\") as f:\n",
|
| 40 |
+
" nested_data = json.load(f)\n",
|
| 41 |
+
"\n",
|
| 42 |
+
"# ---- STEP 2: Recursively extract paths and limit tags ----\n",
|
| 43 |
+
"def extract_tags(data, path=None, result=None):\n",
|
| 44 |
+
" if path is None:\n",
|
| 45 |
+
" path = []\n",
|
| 46 |
+
" if result is None:\n",
|
| 47 |
+
" result = []\n",
|
| 48 |
+
"\n",
|
| 49 |
+
" if isinstance(data, dict):\n",
|
| 50 |
+
" for key, value in data.items():\n",
|
| 51 |
+
" extract_tags(value, path + [key], result)\n",
|
| 52 |
+
" elif isinstance(data, list) and len(data) > 0:\n",
|
| 53 |
+
" limited_tags = [data[0]] + [\"...\"] if len(data) > 1 else data\n",
|
| 54 |
+
" result.append({\n",
|
| 55 |
+
" \"level_path\": \" > \".join(path),\n",
|
| 56 |
+
" \"top_tags\": limited_tags,\n",
|
| 57 |
+
" \"tag_count\": len(data)\n",
|
| 58 |
+
" })\n",
|
| 59 |
+
"\n",
|
| 60 |
+
" return result\n",
|
| 61 |
+
"\n",
|
| 62 |
+
"# ---- STEP 3: Build the nested structure ----\n",
|
| 63 |
+
"def build_tree_from_flat_list(flat_data):\n",
|
| 64 |
+
" tree = {\"name\": \"root\", \"children\": []}\n",
|
| 65 |
+
"\n",
|
| 66 |
+
" for row in flat_data:\n",
|
| 67 |
+
" parts = row['level_path'].split(\" > \")\n",
|
| 68 |
+
" current = tree\n",
|
| 69 |
+
" for i, part in enumerate(parts):\n",
|
| 70 |
+
" match = next((child for child in current.get(\"children\", []) if child[\"name\"] == part), None)\n",
|
| 71 |
+
" if not match:\n",
|
| 72 |
+
" match = {\"name\": part, \"children\": []}\n",
|
| 73 |
+
" current.setdefault(\"children\", []).append(match)\n",
|
| 74 |
+
"\n",
|
| 75 |
+
" # If this is the top-level category, set its color\n",
|
| 76 |
+
" if current[\"name\"] == \"root\":\n",
|
| 77 |
+
" color = CATEGORY_COLORS.get(part.lower(), \"#888888\")\n",
|
| 78 |
+
" match[\"color\"] = color\n",
|
| 79 |
+
" else:\n",
|
| 80 |
+
" # Inherit from parent\n",
|
| 81 |
+
" match[\"color\"] = current.get(\"color\", \"#888888\")\n",
|
| 82 |
+
"\n",
|
| 83 |
+
" current = match\n",
|
| 84 |
+
"\n",
|
| 85 |
+
" current[\"tags\"] = row[\"top_tags\"]\n",
|
| 86 |
+
" current[\"direct_tag_count\"] = row[\"tag_count\"]\n",
|
| 87 |
+
"\n",
|
| 88 |
+
" return tree\n",
|
| 89 |
+
"\n",
|
| 90 |
+
"# ---- STEP 4: Recursively compute tag and category sizes ----\n",
|
| 91 |
+
"def compute_sizes(node):\n",
|
| 92 |
+
" tag_count = node.get(\"direct_tag_count\", 0)\n",
|
| 93 |
+
" category_count = 0\n",
|
| 94 |
+
"\n",
|
| 95 |
+
" for child in node.get(\"children\", []):\n",
|
| 96 |
+
" compute_sizes(child)\n",
|
| 97 |
+
" tag_count += child.get(\"tag_count\", 0)\n",
|
| 98 |
+
" category_count += 1 + child.get(\"category_count\", 0)\n",
|
| 99 |
+
"\n",
|
| 100 |
+
" node[\"tag_count\"] = tag_count\n",
|
| 101 |
+
" node[\"category_count\"] = category_count\n",
|
| 102 |
+
"\n",
|
| 103 |
+
"# ---- STEP 5: Generate and save final JSON ----\n",
|
| 104 |
+
"flat_data = extract_tags(nested_data)\n",
|
| 105 |
+
"tree_data = build_tree_from_flat_list(flat_data)\n",
|
| 106 |
+
"compute_sizes(tree_data)\n",
|
| 107 |
+
"\n",
|
| 108 |
+
"with open(output_json, \"w\", encoding=\"utf-8\") as f:\n",
|
| 109 |
+
" json.dump(tree_data, f, ensure_ascii=False, indent=2)\n"
|
| 110 |
+
]
|
| 111 |
+
},
|
| 112 |
+
{
|
| 113 |
+
"cell_type": "code",
|
| 114 |
+
"execution_count": null,
|
| 115 |
+
"metadata": {},
|
| 116 |
+
"outputs": [],
|
| 117 |
+
"source": []
|
| 118 |
+
}
|
| 119 |
+
],
|
| 120 |
+
"metadata": {
|
| 121 |
+
"kernelspec": {
|
| 122 |
+
"display_name": "latm",
|
| 123 |
+
"language": "python",
|
| 124 |
+
"name": "python3"
|
| 125 |
+
},
|
| 126 |
+
"language_info": {
|
| 127 |
+
"codemirror_mode": {
|
| 128 |
+
"name": "ipython",
|
| 129 |
+
"version": 3
|
| 130 |
+
},
|
| 131 |
+
"file_extension": ".py",
|
| 132 |
+
"mimetype": "text/x-python",
|
| 133 |
+
"name": "python",
|
| 134 |
+
"nbconvert_exporter": "python",
|
| 135 |
+
"pygments_lexer": "ipython3",
|
| 136 |
+
"version": "3.10.15"
|
| 137 |
+
}
|
| 138 |
+
},
|
| 139 |
+
"nbformat": 4,
|
| 140 |
+
"nbformat_minor": 2
|
| 141 |
+
}
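Note on STEP 4 above: `compute_sizes` rolls tag and category counts up the tree bottom-up, so each node ends with the total number of tags and sub-categories beneath it. The following is a small sketch of the same recursion on a hand-made tree. The tree contents are invented for illustration and are not taken from `danbooru.json`.

```python
def compute_sizes(node):
    """Annotate node (in place) with cumulative tag_count and category_count."""
    tag_count = node.get("direct_tag_count", 0)
    category_count = 0
    for child in node.get("children", []):
        compute_sizes(child)  # children first, so their totals are available
        tag_count += child["tag_count"]
        category_count += 1 + child["category_count"]
    node["tag_count"] = tag_count
    node["category_count"] = category_count
    return node


# Hypothetical example tree (assumption): two top-level categories, one nested.
tree = {
    "name": "root",
    "children": [
        {
            "name": "body",
            "children": [
                {"name": "hair", "children": [], "direct_tag_count": 3},
                {"name": "eyes", "children": [], "direct_tag_count": 2},
            ],
        },
        {"name": "plant", "children": [], "direct_tag_count": 4},
    ],
}

compute_sizes(tree)
print(tree["tag_count"], tree["category_count"])
```

In this example the root collects 3 + 2 + 4 tags and counts four descendant categories (body, hair, eyes, plant).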
|
jupyter_notebooks/SuppM_Figure_S12_asset_types.ipynb
ADDED
|
@@ -0,0 +1,129 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "code",
|
| 5 |
+
"execution_count": 1,
|
| 6 |
+
"id": "c0d18a6a",
|
| 7 |
+
"metadata": {},
|
| 8 |
+
"outputs": [],
|
| 9 |
+
"source": [
|
| 10 |
+
"import pandas as pd\n",
|
| 11 |
+
"import matplotlib.pyplot as plt\n",
|
| 12 |
+
"from matplotlib.ticker import FuncFormatter\n",
|
| 13 |
+
"from pathlib import Path"
|
| 14 |
+
]
|
| 15 |
+
},
|
| 16 |
+
{
|
| 17 |
+
"cell_type": "code",
|
| 18 |
+
"execution_count": 2,
|
| 19 |
+
"id": "98d25755",
|
| 20 |
+
"metadata": {},
|
| 21 |
+
"outputs": [],
|
| 22 |
+
"source": [
|
| 23 |
+
"current_dir = Path.cwd()\n",
|
| 24 |
+
"\n",
|
| 25 |
+
"def sortByFrequency_model_types_csv(csv_path, output_svg_path):\n",
|
| 26 |
+
" hatch_pattern = '\\\\\\\\\\\\\\\\\\\\\\\\' # Hatch pattern for the bars\n",
|
| 27 |
+
"\n",
|
| 28 |
+
" # Read the CSV file\n",
|
| 29 |
+
" df = pd.read_csv(csv_path)\n",
|
| 30 |
+
"\n",
|
| 31 |
+
" if 'type' not in df.columns:\n",
|
| 32 |
+
" return \"The CSV file does not contain a 'type' column.\"\n",
|
| 33 |
+
"\n",
|
| 34 |
+
" # Count the occurrences of each model type\n",
|
| 35 |
+
" type_counts = df['type'].value_counts().reset_index()\n",
|
| 36 |
+
" type_counts.columns = ['Type', 'Count']\n",
|
| 37 |
+
" total = type_counts['Count'].sum()\n",
|
| 38 |
+
" type_counts['Percentage'] = (type_counts['Count'] / total * 100).round(2)\n",
|
| 39 |
+
"\n",
|
| 40 |
+
" # Sort the data in ascending order\n",
|
| 41 |
+
" type_counts = type_counts.sort_values(by='Count', ascending=True)\n",
|
| 42 |
+
"\n",
|
| 43 |
+
" # Plotting\n",
|
| 44 |
+
" plt.figure(figsize=(10, 3.5))\n",
|
| 45 |
+
" bars = plt.barh(type_counts['Type'], type_counts['Count'], color='white', hatch=hatch_pattern, edgecolor='coral')\n",
|
| 46 |
+
" plt.xlabel('Counts', fontweight='bold')\n",
|
| 47 |
+
" plt.ylabel('Asset Type', fontweight='bold')\n",
|
| 48 |
+
"\n",
|
| 49 |
+
" ax = plt.gca()\n",
|
| 50 |
+
"\n",
|
| 51 |
+
" # Hide all axis spines\n",
|
| 52 |
+
" for spine in ax.spines.values():\n",
|
| 53 |
+
" spine.set_visible(False)\n",
|
| 54 |
+
"\n",
|
| 55 |
+
" # Keep ticks visible\n",
|
| 56 |
+
" ax.xaxis.set_ticks_position('bottom')\n",
|
| 57 |
+
" ax.yaxis.set_ticks_position('left')\n",
|
| 58 |
+
"\n",
|
| 59 |
+
" # Bold tick labels\n",
|
| 60 |
+
" for label in ax.get_xticklabels() + ax.get_yticklabels():\n",
|
| 61 |
+
" label.set_fontweight('bold')\n",
|
| 62 |
+
"\n",
|
| 63 |
+
" # Format x-axis ticks: 25000 → 25 k\n",
|
| 64 |
+
" ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _: f'{int(x/1000)} k' if x >= 1000 else f'{int(x)}'))\n",
|
| 65 |
+
"\n",
|
| 66 |
+
" # Add percentage labels to the bars\n",
|
| 67 |
+
" for bar, percentage in zip(bars, type_counts['Percentage']):\n",
|
| 68 |
+
" plt.text(bar.get_width() + 5, bar.get_y() + bar.get_height()/2,\n",
|
| 69 |
+
" f' {percentage}%', va='center', color='blueviolet', fontweight='bold')\n",
|
| 70 |
+
"\n",
|
| 71 |
+
" plt.tight_layout()\n",
|
| 72 |
+
"    plt.savefig(output_svg_path, format='svg')\n",
|
| 73 |
+
" plt.show()\n"
|
| 74 |
+
]
|
| 75 |
+
},
|
| 76 |
+
{
|
| 77 |
+
"cell_type": "code",
|
| 78 |
+
"execution_count": 4,
|
| 79 |
+
"id": "9e610dec",
|
| 80 |
+
"metadata": {},
|
| 81 |
+
"outputs": [
|
| 82 |
+
{
|
| 83 |
+
"data": {
|
| 84 |
+
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA94AAAFUCAYAAADS2eS8AAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQAAw3xJREFUeJzs3Xd4FNX6wPHv7G6y6Q0CIUBCDQQQkGJAWlBpIgoIVhREf3hRrFe9otJURFEsV0UvwqWKBcSChQ6hhg6GGkognZbeNlvO74+52bAmQIKEIL6f55knkzNnzpyzIXl45zRNKaUQQgghhBBCCCFElTBUdwWEEEIIIYQQQojrmQTeQgghhBBCCCFEFZLAWwghhBBCCCGEqEISeAshhBBCCCGEEFVIAm8hhBBCCCGEEKIKSeAthBBCCCGEEEJUIQm8hRBCCCGEEEKIKiSBtxBCCCGEEEIIUYUk8BZCCCGEEEIIIaqQBN5CCCGEEEIIIUQVksBbCCGEEEIIIYSoQhJ4CyGEEEIIIYQQVUgCbyGEEEIIIYQQogpJ4C2EEEIIIYQQQlQhCbyFEEIIIYQQQogqJIG3EEIIIYQQQghRhSTwFkIIIYQQQgghqpAE3kIIIYQQQgghRBWSwFsIIYQQQgghhKhCEngLIYQQQgghhBBVSAJvIYQQQgghhBCiCpmquwJCVImsM1CQU921EEIIIYQQ4vrj5QcBwdVdi78UTSmlqrsSQlxxk+8Dq6W6ayGEEEIIIcT1x+QOYz6pUPCdmWhj06fZJO+wUJBhx93bQI0mbnR42Jemt3i65D2yuoAd8/I4dbAYZQef2kZaD/Ym6jG/ClXrWEwhS5486/z+uZ31MJk1AOxWxe6v84hbkk92sg2TWaNBFw96POePb4jeH11c4GDlG5kcjynCYILI272IfjEAg1EvIzvFxuyB6fSeGEiL/t4VqlMJ6fEW1yerBQY/C2dTYf23ENkZug4GTdOv/foFZKRD/1FQK0y/J/EArJgLYZFw6zAwmsBug9ULIPEg9B4OYS30vKcT4ZcZEBQCt/8fuJlBKdi4BA5uge73QPOb9LxZZ+Dnz8DdA+4YDV6+evquVbBjGXToC+1u09MKcvW8xUV63pI/Zoe2STukHdIOaYe0Q9oh7ZB2SDukHdXfjlXzICFOH116icBbKcWiUafJTrZjdIcaTdzITrGRvMNC8k4LwxfVplZzdwC2z8lh3XvZAHjXNOAdbKQgw8HJ2KIKBd75Z+0sG5dxwevLJ2aw/8cCAGo2MZF/1sHBXwpI2W1hxHchmH0NbP0ihwNLC7hnVjDZSTaWT8ykRhM32gzxAWDFpEzqdzBXOugu+TCEUEop1aNHDwWo8PDwC+b58ssvVXR0tAoICFBubm6qfv366uGHH1b79+93yTdhwgQFOA9N01RAQIDq0aOHWrVqVbllz54925nfYDCoxMTEy2/MhIFKpRzVz3euVGrCIKWWfqaU3a6nFRUoNfNlpd56QKmkw6X3Hd6u1OtDlPpqilLWYj3NWqx///oQ/XqJpMP6/TNf1stTSi9/6Wf683auLM17Jlmp90Yq9fEYpXIyStPXfavXdd23pWk5GXq+90bq95WQdkg7pB3SDmmHtEPaIe2Qdkg7qrsdyUdc/699ETlpVjW1ZaKa2jJRxc7MVkopdXJroTPt6Fq9zOxUq3qvjZ62c0GOcjgczjIsefZLPkcppRb947Sa1jZRLXnqjLN8a5HDWca7N+hpa9/NVEopVZhtVx90SFJTWyaqLf/R67b4H6fV1JaJylbsUOcSitXUlolq5Zv6z2Tfj3nqg45JKjvVWqH6/JEE3sLpUoH3o48+6gyM/fz8VPPmzZXRaFSA8vT0VMuXL3fmPT/wbtu2rWrTpo1L3vKC6pLnlxxvvPHG5TdmwkCl9m8p/V7+qEo7pB3SDmmHtEPaIe2Qdkg7pB1/vh0pRysceNttDjWjX6qa2jJRTbsxUc0ZkqY+6p
yk3muTqH577Zyy2/TAeMe8HDW1ZaL6oEOSWvrSWfXvm5PVpz2S1c//Oqtyz9gu+ZydC/T7t83JURs/ySoTeBfl2tXUVnraummZelpOaeD99SOnlFJKrf8wU01tmahOxBaqvYty1dSWiWrPolyVn2FTH3dNVjvm51yyLhcigbdwuljg/d133zkD4rvuuksVFRUppZT6/fffVc2aNRWgateurfLz85VSroF3QkKCUkqpWbNmOdMWL17sUv7x48eVpmkKUB06dFCAatKkyeU3ZsJApd55WP6oSjukHdIOaYe0Q9oh7ZB2SDukHVeyHSf2VzjwVkrv9Z47NM0ZDE9tmag+6Zasdi4sDWJXvH7Oee29Nonqv3elOnvA5wxJU7ZixwXLP3OkWL3fLkl9+3+nlcPhKDfwVkrvES9J/+/ANPVx12Tn91/0T1VKKWXJt6ufXz6rPuqcpD7umqxWT8lQdptDLX3prJr/QLo6d7xYfT3ylPqoU5Kad2+6Sv29qEKfgVISeIvzXCzwHjhwoDNoPnHihMu184PsH374oUxaeYH3tm3byi0jJCRE7d6925lvw4YNl6x3UVGRys7OdjmKXh2g1Aej5I+qtEPaIe2Qdkg7pB3SDmmHtEPacSXbMf3ZCgfeDrtDLR6tB7yrp2QoS75dHVqe7wx441fpnXbLJpQG3vt+ylNK6UO7S9JObi284DNmD0pTn3RLdvaMXyjwLsyyqxVvZKjPbklRH3RIUguHn1Lz7tFfCPz3rtQLln98Q4Ga1jZRnY63qPn3pat/d05WCZsL1Yy+qeqzW1Mu+lLgfBJ4C6eLBd6RkZEKUAEBAWWuff/9985A+Z133lFKuQbebdu2VW3btlVGo1G5ubmpV155xeV+h8OhGjZsqAD1/PPPK6WUat26tQLUo48+esl6n/+skmNCj2ZKHdktf1SlHdIOaYe0Q9oh7ZB2SDukHdKOK9mON++rcOCdsKl0Pnf6AYsz/cOoJJf50xs/LQ2Wzx3X63DueLEzLe77vAs+o6SX/IMOSeqDDknOnvKSoeu7vip/eLjD4VAz79CHwX//zJly81jy7erzXilqw8dZypJnd8m7ZmqGmtoyUZ0+bCn33j+SwFs4VSTwDgwMLHPtxx9/vGjgff4RFhamtmzZ4nL/2rVrndd3796tlFLq3Xffdc4lLxm+fiEX7PFOOSp/VKUd0g5ph7RD2iHtkHZIO6Qd0o4r2Y7dayoceJ/fu71nUa5SSumLlv1hvnXSziJnvv1L9SB7/9LSHu+knfqQ7p1f5qiZd6SqmXeU9lCfP4S9vGP7PD3wPnO0WOWfK50vHjsr25nn0LLy443Vb2eoWQNSldXi0OeJt0xUP71wVv9RvJ8pgbe4PJc71HzixImXHGoeHx+vGjdurABVp04dlZdX+tZq+PDhzrz+/v7K399feXt7O9Pmz59f+cac/8dA/qhKO6Qd0g5ph7RD2iHtkHZIO6QdV6YdlVhcrSDTpv7dWZ9L/e4N+tzqkgXN3muTqNIPlgatJauRv9dGz1fSc/31o6ecec4fRn4hFxpqvvW/2eq9Nolq1oBUNb1nijPP90+fcVlFvUTq70XqvTalQb9SSs27N119fluKyjtjU/PuSVOf3SJDzcVlKAm8w8LCVGFhocuxePFiZyA8cOBA5+JqcXFxFV5c7aeffnKmTZ06VSmlVG5urkuQXd5x6623Vr4xEwYqtfTz0u/lj6q0Q9oh7ZB2SDukHdIOaYe0Q9rx59tRicBbKaXOHi1WS186qz6/LUVNuzFRfdojWS3+x2mVvLvIJZ+1yKHWvZ+pPrtVz/dF/1S14eMsVVxod+b5M4H38Q0Fas6QNPXhTUlqWlt9XvfWWdnKbi0bONutDjV7UJpa8fo5l/Rzx4vVwodPqQ86JKk5Q9JUyp6iMvdeiATewumP23mdf3zwwQdq5MiRLj3TkZGRzi3CPDw8LridWEng7XA4VKtWrZyLqBUWFrrs3b1v3z6X+nz44Y
eXv6f3hIHyR1XaIe2Qdkg7pB3SDmmHtEPaIe240u3Y8lOlAm+hk8BbOF0q8FZKqQULFqgePXoof39/ZTKZVN26ddVDDz1UJmguL/BWSqn58+c70z/99FPnMyMiIsrUJzEx8fL39C7p8ZY/qtIOaYe0Q9oh7ZB2SDukHdIOaceVa8ekuyXwvgyaUkohxPVm4iAY/CwkHoIdy6BDX2h3m36tIBd+/gyKi+CO0RAQrKcf2gbrv4XIztB1MGgaWC3w6xeQkQ79R0GtMD1v4gFYMRfCIuHWYWA0gd0GqxdA4kHoPRzCWuh5TyfCLzMgKARu/z9wM4NSsHEJHNwC3e+B5jfpebPO6HVz99Dr5uWrp+9aJe2Qdkg7pB3SDmmHtEPaIe2QdlR/O379AtKOwaj3ILQxomIk8BbXp8n36X9EhBBCCCGEEFeWyR3GfFIa0ItLksBbXJ+yzkBBTnXXQgghhBBCiOuPl58E3ZUkgbcQQgghhBBCCFGFTNVdASGqxLXW4y1vBYUQQgghhPjbkh5vcX261uZ4m9xgzKeVCr6L8x3MuTud7GQ7AL3GBdL2Xp+L5t/4cTbJuyzkpNqxFip8Q4w07+vFTSN9cfc2OPOejC1i68wczsRbseQ68AgwULetmZtH+xEc4Q7AsZhCYt7PIjvFTs3GJm59NZDQ1mZnGSvfyCB5p4WHF4VgdNMq+4kIIYQQQgjxtyE93tVoxIgRzJ07lx49erBu3boqf56m6cHR7NmzGTFiRJU/b86cOTzyyCMAXPX3O1aLvqq5yVz9qz+eOgk/fqz3wFci8F41OdMZdFdEYZaDnQvyMLpDUEM38k7byTxpY8t/ckg/UMyQz/RnZ5yw8t3oM9it4OFnoEYTN84esRK/spDknRZGrw2lOF+x9IVz1Gnjzv3zarFw2Gl+fO4co1eHApC8y8Lv3+Vz/9xaEnQLIYQQQghxCYZLZxGXq6ioiPfff5+oqCj8/Pzw8vIiIiKCxx9/nOPHj1d39apccHAwUVFRREVFVfreESNGoGka0dHRl1+BmvWgRScY+Za+9cGyWeAbpG97cMfj0PMBPfg+sktPa9IWHn1bHxb+2xd6sB7aGG65H+58Eg7Gwu5VENIQwlvo5YY0gN9mgsOu5+00AO57GZIOwabv9YC8dnilq35oWQH7fyqgWR/PCt9jNGv0+Kc/YzbUZcR3IfxjVSh12ui91wkbiijKdgCQFleM3arfc/fnNRm+KISox/wAPXi3FigyT1qxFipCb3DH099I7Uh38k7ZKci0Y7cqVkzMoO09PoS2MZdbFyGEEEIIIUQpCbyrSGZmJjfffDP//Oc/2bZtGwCNGzfm1KlTzJgxg/Xr11dzDate//79iY2NJTY2tnoqcEj/3KlZF4a/DpYCmDsecjP19B5D9eB77UKIWaSn+Qbqec1eet6zKXp6u9vgzidgxwr4dQY4HGD2hGHj9eB6/iRIjtfzRnSAe/8FR3bC4ml60F8JOWk2VryeQe0WbnR72r/C9/nUNHLTI37OIeUms0adlnrgrRnA8L/xLaGt3TG66effjT7L3KHpbJ2Zg9lX45axAZh9DQSGueHmqZEaV0xhtp1TB4vxqW3EK9DIlv/kYC1UdHu24nUTQgghhBDi70wC7yoyZswYdu/eDcCLL75IRkYGcXFxZGdnExMTQ7NmzVzyz5w5k4YNG+Lr68sdd9xBenq6y/UFCxbQsWNHvLy88PX1pW/fvuzZs8clT3p6OqNGjaJ+/fq4u7tTu3ZtHnjggQvW8fvvv8fNzQ1N05g8eTIADRo0QNM0Xn75ZcaMGUNQUBD+/v488cQTWCylc6YLCwt59dVXadKkCe7u7gQFBTFw4EDi4uKceebMmYOmac4h7gDR0dFomsbDDz/MhAkTqFOnDoGBgQwbNozc3FxnHebOnQtATEyMs4xKD8df/60+pByqN/hevaDCVVYOxa9jM3BY4Y6pNTCYLn8Yd/45O/GrCgFo3s/LGZAHhrtxz8
xaeAUZKMp2cPqgFYcNfGsbqdlYj8g9/A0MeK8G+aftfH5rGm6eGne9X4Ozx6xsm5VDr3GB7P4qj89vS2V6dApr383EYZPlIoQQQgghhCiXEldcVlaWMplMClBt2rRRDoej3HzDhw9XgPL09FQeHh6qadOmClCAeuCBB5z53nnnHWd6RESECg0NVYDy9vZWBw4cUEopdfbsWRUeHu7M17RpUxUWFqYCAgKc5ZRcmz17tvrtt9+Uu7u7AtSUKVOceUrKMJvNqkaNGqphw4bO+5577jlnvttuu00BStM01bx5c+Xj46MA5ePjow4ePKiUUmr27NnOe0v06NFDAcrNzU35+vq6lP/KK68opZQaOHCgqlmzpgKUr6+vioqKUlFRUWrnzp3lfo5FRUUqOzvb5Sh6dYBSX7+j1IRBSu1cWZr5TLJS741U6uMxSuVklKav+1apCQP1ryVyMvR8743U7yuxc6Ve7tLPlLLb/1eJAqVmvqzUWw8olXS4NO/h7UpNulsvO+VoufU/3/a5OWpqy0S1d3GuUkqprGSrmtoyUU1tmah2f517yftLZJy0qi9uT1VTWyaqL4elK0uevbRZ6VY1o59+7eBv+cqSb1er385QU1smqvfbJ6nc07Zyy3TYHWrBg+lq6Ytn1bH1BWpqy0S14o0MteU/2ZWunxBCCCGEEH8n0uNdBeLj47HZ9OHF3bp1c+nxLY/FYiE2Npb4+HgGDRoEwOrVqwEoKChg0qRJAEyaNInDhw9z8uRJOnToQH5+Pm+99RYAn376KSdPngTg22+/JT4+npMnT7JmzZoyz1u/fj2DBw+muLiYd955h5dffrlMnrCwMBISEjh+/Dj333+/8xnZ2dmsXbuWVav0nuT333+fgwcPcvDgQXx8fMjLy2PKlCmX/Iw8PDw4ePAgR48epX379i5t/v777+nfvz8A7dq1cw5Xb9euXbllTZkyBX9/f5djysZ4fSG0Dr3hp+nV2/Pde/glP48Spw8XA7Dm7Sw+7JjM7IGlIx/WvJPJlw+eumQZKXssfPngKTJP2mgc7cHQGcEuK5rv+TqPrEQb7j4azft64e5loOWd3gDYihQpu8tfDX7313lknrBxy8sBnIwtAqDtPd60e1Bfaf3ElqIKt1MIIYQQQoi/Ewm8q4A6bwXvSwXdADfccANt2rQBoEWLFgCcOqUHWPv376egoACACRMmoGkabm5u7NixA8A5f3rr1q0ANGnShKFDhzrLvvHGG8s8b/bs2RQWFvLcc8/x0ksvlVunO+64A19ffQXw++67D4Di4mLi4+PZvn27M1/JUPZ69erRrVs3AGfdLuaWW26hbt26GAwGmjdv7tLmyho7dizZ2dkux9iuEfrq47ePqv7gO6xFpdtkLVTOo4S9GKxF+ve5p2zMGpDGrAFpxK8qcOY5vKKAbx89Q2Gmg3YP+DDo3zVx83T9Nbfk6WUU5ysyTuirrKXvL3Zed/Ms+282N93Gho+yiX4xAK8gIyX/xI1u2p8aDi+EEEIIIcTfgWwnVgWaNWuGyWTCZrOxceNGlFIXDcADAgKc5ybThX8kkZGR+Pn5uaTVqFGj0vUr6Zn+6quvePLJJ2ncuHGly/izymuzuswtx8xmM2bzH1bXNhn1Lb8MBj34Bj34Bj1oLgm+547Xj+Gv60F2j/+9tFi7UP/aY2hp8H1+3pp1S7chKyn39lGlwfeC1/Xg+6EJYDBWuC23T67B7ZNLf6bZKTZm9EkDXPfxdtggI0EfVVH8v0A677Sdn/55DhQY3SBtXzFfDjvtLKvXa4HUbuFO01s92f11HiiYN/QU/vVMnDumB+B+oUbqdyy7UvnKNzMJbWOm1V16z3iDTh7snJfH8Y1F+Ibo7QvvJCucCyGEEEIIUR7p8a4C/v7+3HPPPQDs3r2bV155xTn0HGDVqlVs3ry5QmW1bNkST099S6m+ffuyZcsW59Drzz77jFdffRXAuWXX0aNHWbJkifP+Py7ABvDGG2/Qql
Ur0tPT6dWrF2lpaWXy/PLLL+Tl5QH60HUAd3d3IiIi6NixozPfwoV6gJqcnMyGDRsA6NChQ4XadjFeXl4A5OfnX34hv34BlsLS4Lu6er5PJ15+GyrBblX6bHnAboW034tdDkuevp1YeCcPhnxWk/BOZty8NDJPWvGrY6T13d7cP7cWbh6ufxYOLSsgcauFXuMDnWmNunvS9Sl/ts3KYeWkTNo96EObIT5XpZ1CCCGEEEL85VTzHPPr1rlz51Tbtm2dC4f5+fmp1q1bq8DAQOcCZyWLq/Xo0cN534QJE8osSPbWW28500JDQ1WbNm1UUFCQAtSECROUUmUXV4uIiFANGjS44OJqSUlJql69egpQN9xwg8rI0BcaKynD29tb1axZUzVq1Mh53zPPPOMs6/zF1SIjI5Wvr2+lFlcbPny4M63kcwgPD3emffTRR857W7VqpaKiolRBQUHFfwATBir15n36gmdF/7vPbtcXRLvaC669eV+FF1cTQgghhBBCXH+kx7uKBAUFsWXLFt577z06duyIw+Hg8OHDBAYG8thjj9G9e/cKlzV27Fjmzp1Lx44dyczM5OjRo9SqVYt//OMfDB48GNCHnMfGxvJ///d/1K1bl+PHj1NQUEDfvn3LLbNevXosW7aMgIAA4uLi6N+/v3MuOcAzzzzDsGHDyMzMxNfXl8cff5y3337bef2nn37ilVdeoWHDhhw5cgSTycRdd93F5s2bnXO2/4yRI0dy99134+/vz759+9i6dSt2u71yhfQfpfc2L3i9enu+g0L+9OchhBBCCCGE+OvSlLrMibXiutSgQQNOnjzJhAkTmDhxYnVX5/JNHASDn9UD4F9m6MHv7f8HbmZQCjYugYNboPs90Pwm/Z6sM/DzZ+DuAXeMBi99cTl2rYIdy6BD39J53QW5et7iIj1vQLCefmibvn94ZGd9VXVNg7TjsHQ6jHoPQq/+fHohhBBCCCFE9ZLAW7i4bgLvyfeCtfjS+a4WNzM8+XFpgC6EEEIIIYT425BVzcX16clPoCCnumtRystPgm4hhBBCCCH+pqTHWwghhBBCCCGEqELS4y2uT1lnrp0eb+ntFkIIIYQQ4m9NerzF9WnyfWC1VHctdCZ3GPNJpYLv4nwHc+5OJztZX8m917hA2t578X2y93ybx4Gl+Zw+ZMVaqP9aj/wphBqN3MrNf+pgMV8+cAq7lTJ5j8UUEvN+Ftkpdmo2NnHrq4GEtjY77135RgbJOy08vCgEo5tW4XYJIYQQQgjxdyQ93tVkxIgRzJ07lx49erBu3brqrs4VMWfOHB555BEAqv19jtWir2oeGAKrF0DiQeg9HMJa6NdPJ16d1c5tFljyod77XonAe9XkTGfQXVEJG4s4fciKZ6ABa+HF77UWOfj5pXPOoPt8RTkOlr5wjjpt3Ll/Xi0WDjvNj8+dY/TqUACSd1n4/bt87p9bS4JuIYQQQgghKuAvs493gwYN0DTtoseVWIV7zpw5zvKutpI2RkdHX/VnXwnBwcFERUURFRVV3VXRBYZA/Wb6XtoRHWDlPMjL1Lf0atsThk+CrNOwaj7UCIW6TWDoC9ChD6xfBOkJet4WnWDkW2C3wbJZ4Bukp9/xuL7P945lcGSXntakLTz6tj68/LcvwGS+ZDX/6NCyAvb/VECzPp6Vuq/Xa4E8HVuXLk/4XzLv2qlZZCTYyn1G5km9xzz0Bnc8/Y3UjnQn75Sdgkw7dqtixcQM2t7jQ2ibyrdNCCGEEEKIv6O/TOB94403OoO6unXrOtPbtm3rTK9Xr1411vD6Zbfbsdsv3fvav39/YmNjiY2NvQq1qoDVC8BmBZMbDPknNG0P37wD8Tv06/Ui4KEJeu/3gtfBUggGA9w+Cjr0hp+m673aADXrwvDXwVIAc8dDbqae3mOoHnyvXQgxi/Q030A9r9lL7/2uhJw0Gytez6B2Cze6PX3pAPp8PrWMGIyXfmF0dF0he7/Np90DPjTqVjbwDgxzw8
1TIzWumMJsO6cOFuNT24hXoJEt/8nBWqjo9mzl6iaEEEIIIcTf2V8m8P7++++dQd1jjz1WJn358uXExcURHh6Ou7s79erV4/nnn6egoACAxMREAgIC0DSNSZMmAZCSkuJMGz9+PCNGjHAOlQbK9KSXfD9nzhxnnujoaDRNY8SIEc604cOH07RpU3x9fXF3dyc8PJynn36anJzKL/ZV0gv+r3/9izFjxlCjRg1q1arFM888g81mA6Bv375omsagQYOc9ymlCAsLQ9M0Xn75ZQAsFgsTJkygadOmuLu7U6tWLUaOHMnZs2ed902cOBFN02jQoAHz5s2jcePGuLu7k5SUxOHDh7nzzjupVasWZrOZevXq0a9fP7Zt2wZceLTA7Nmzad++PZ6ennh7e9OlSxd+/PFH5/UTJ064fLZ33HEHXl5eNGzYkFmzZlX6M3NKPAiLp1Vv8O3uUeHqKofi17EZOKxwx9QaGExXftRF3lk7y8dnULOpGz3+GVBuHg9/AwPeq0H+aTuf35qGm6fGXe/X4OwxK9tm5dBrXCC7v8rj89tSmR6dwtp3M3HYZKkIIYQQQgghLuQvE3hfTHFxMdHR0fz73//m9OnTREZGcu7cOT744AMGDBjgDEI//fRTAN566y3279/P448/TnZ2NjfddBPjx4+ncePGNGrUyFnu5fak//jjj2RmZtK4cWPq169PYmIiH3/8MY8++uhlt/GDDz7gq6++wtPTkzNnzvDvf/+b2bNnA3qgD/Dbb785g/stW7aQlJQE4HwpMHjwYF5//XUSEhKIjIzEYrEwe/ZsevToQWFhocvzUlNTGTFiBCaTidq1awNw//33s3TpUmw2Gy1btsThcLBs2TIOHDhwwXq/+eabjBw5kl27dlGrVi38/PzYvHkzAwcOZMGCBWXyjxo1iv379+Pm5saJEycYNWoUhw4duuhnY7FYyMnJcTksNrs+p/vIzuoNvu8YfdG6n2/ngjySdli45eUAghqUvyDan7VyUgbF+Yo7pgZhMl84sG/cw5ORP9XhuR31ePjbEOrc4M7yCRlE9PICDdZ/kE3jaE/aPeDLjrl5/P5dfpXUVwghhBBCiOvBdRF4f/XVV+zZswd3d3d+//139u7d6xzuvGbNGtasWQPAgw8+yL333ktxcTG33norv/zyC97e3ixYsACTycS4ceMYN26cs9zyetgrIiYmhrNnz7Jnzx6OHTvGq6++CsAPP/xAUVHRZbWxXr16HD9+nKNHjxIaqi9ytXr1agAGDhyIn58fFouFH374AYBvvvkGgJtuuonmzZsTExPDr7/+6vxM9u7dy6FDh/D09OTAgQMsXLjQ5XlWq5Xp06dz+PBhUlJSCAsL48iRIwAsXbqUXbt2kZqayvHjxy84Jz0/P5+33noLgEGDBpGQkMCJEye46SZ90bLXXnutzD133XUXx48fZ8OGDQA4HI5LLj43ZcoU/P39XY4pG+P1hdTu/Vf1Bt8lC7FVwOnDxQCseTuLDzsmM3tguvPamncy+fLBUxUu68LPsGK3Kr584DQfdkxmxesZzmvz7z1FzPtZ5d63++s8Mk/YuOXlAE7G6v+G297jTbsH9ZXWT2y5vH/XQgghhBBC/B1cF4F3yVDn4uJiIiIi0DSNtm3bOq+fP+f4s88+IzQ0lFOn9CDmvffeo2nTple0PqtWraJVq1Z4enqiaRqTJ08GwGazcebMmcsq884778Tf3x8PDw8aNmwI4GyDp6cn99xzDwBff/01DoeDRYv0wK+kt7vkMwLo0aMHmqYRGhrq7On+47xsT09PRo0aBehD7A0GAwMGDACgZ8+eREZGcvfdd7Ns2TLq1KlTbp3379/vLP++++7DYDBgNpu5++67ATh58mSZz+PBBx9E0zRatGjhTCtp54WMHTuW7Oxsl2Ns1wj9YkSH6g++K8laqJxHCXsxWIv073NP2Zg1II1ZA9KIX1VQ6fKVo/QZ9mLX59qLyw4Zz023seGjbKJfDM
AryEjJgvVGN61KhsMLIYQQQghxvbmuthNzd3fnxhtvLJMeGBjoPM/IyHCZa3306NFKP+f8hcays7Ndrn355Ze88MILANSpU4f69etz9uxZjh8/XubeyggICHCem0z6j+38LbuGDx/OzJkzWbVqFT/++CNpaWmYzWbuu+++MmWVt+p4SEiIy/fBwcEYDK7vZebNm8edd97JunXrOHDgAL/++itLlixh3759zmH8f1ZJO0vaCJfemsxsNmM2/2GFbZNRD6BDG5cG39+8owffQ/5ZGnwvnqan3/svPV9J8D1/kh58DxsPZk89+AY9+AZ9+7CS4HvueP0Y/ro+vLzHUD3P2oWQm0FF3T65BrdPruH8PjvFxow+aYDrPt4OG2Qk6PP7i/NKP5uY97OIX1lIcb7Dmbb48TMYTBrtHvSh/TBfHl8R6vLMfT/k89treh0vtOf3yjczCW1jptVd3gA06OTBznl5HN9YhG+IEYDwTrLCuRBCCCGEEBdyXfR4d+zYEdCD2unTpzuHiK9bt44XX3yRBx54wHn9oYceIi8vjzZt2qBpGh988AExMTHOsry8vJzn+fmu81Zr1aoFQHx8PACHDh0iLi7OJU9Jz7Gvry8JCQls3bqV3r17X+EWl9W1a1caN26M1WrliSeeAPRe8pKXDiWfEeg9xCWf0caNG5k4cWKZ+eflbae2YcMGBg0axOeff8769euZMGECAOvXry+3Ti1btsTTU181+5tvvsHhcGCxWFiyZAkA4eHhBAdXfG/rSvtlBiTrP6tq6/nesazq2vcH+efsZCXZKMgoDbxz0vS0omzHRe68sEPLCkjcaqHX+NKXV426e9L1KX+2zcph5aRM2j3oQ5shPn+6/kIIIYQQQlyvrovA+/7776d169bY7XY6duxIq1ataNasGQEBAQwZMoSsrCxAnwu8ZcsWAgMD+e2333j88cdxOBwMHz7c2QvevHlzZ7ktWrSgU6dObNq0CYBbb70VgGnTptGzZ086d+5cpje2devWAOTm5tKoUSMaNWrEt99+W9UfAQAPP/wwAOnp+tzgkkXXQF99vU+fPoA+J7x58+a0bNmSgIAA+vXrx4kTJy5Z/kMPPURgYCDNmjXjxhtvZPz48UBpm//I29ubV155BYAlS5bQsGFDGjRowNatWwF94bUqFRSi91xXZ/Ddoe9lV9+/rokX99XnxX31nb3df0xvNdDbmX775BrO9D8eXZ4sf/uvVgO9nXnK6+1u3teLZ7fXI6Ce6+CYzo/78cS6uozZWJdbxwbKkHMhhBBCCCEu4roIvM1mMzExMTz99NPUr1+f+Ph4MjMz6dChA5MnT6Z27drs3LmT119/HYCPPvqIOnXq8O6779KwYUNOnjzJmDFjAD2IHDduHLVr1yYxMZGtW7eSmakHUu+//z79+/fH09OTY8eO8corr9C1a1eXujz66KM8//zz1KxZk9zcXKKjo53PrWoPP/yws6c6JCSEvn1dg74ffviB8ePH07RpU44fP056ejqRkZG89tprtGrV6pLlP/LII7Rs2ZKzZ89y4MABQkJCGDVqFJ988skF73nttdeYNWsW7dq14/Tp02RnZ9O5c2d++OEHhg0b9ucafCm3/x/UCqve4LvdbVXbRiGEEEIIIcQ1T1OXmkArxF/RxEEw+FnwD4Zfv4CMdOg/Sg/EARIPwIq5EBYJtw4DownsNli9QN//u/dwfVV00APvX2boPei3/x+4mUEp2LgEDm6B7vdAc32ldrLOwM+f6ft33zEaCrJhyYcw6j19vrkQQgghhBDib0cCb3F9evNesBVfOt/V4GaGJz+GgCqczy6EEEIIIYS4ZkngLa5PWWegIOfS+a4GLz8JuoUQQgghhPgbk8BbCCGEEEIIIYSoQtfVPt5COF3NHm/p0RZCCCGEEEJchPR4i+vT5PvAark6z6rEHO7fXssgeZeF/DN2ALxqGGjc3ZObn/TD0994wfty021s+U8OKXuKyT1lw2EF/7
pGWt7lTfthvhjd9NXsbRbF8okZpO8rJuOEDRTUae3OsIW1XcqL+z6PLf/JoeCcg5BW7vSeEEhQg9LtxL4bfQaHHYbOkBcKQgghhBBC/FnS4y2uT1aLvqp5zXpwaBus/xYiO0PXwaBp+vUrsdr5z5/rZRXkVCjwPrq2ELOvRlBDE4WZDrKT7examEfGSRtD/3Ph+zMTbexdlI+bl0ZgmImsZBtnj9qImZZNdrKNXuOCAD3wPrC0AJ/aRsw+Gpbcsu/Vzh23snxCJi3v9KLb0wHMHpTOb69l8OACPTg/8Es+STssjPg+pHKfuRBCCCGEEKJc18U+3qJyoqOj0TSNBg0a/Omydu3axbBhwwgLC8NsNlO7dm2io6OZOXPmn6/on2Uy61t43XI/3PkkHIyF3asgpCGEt4CRb0FIA/htJjjset5OA+C+lyHpEGz6Xg/I6zeDYeP1PcBXzoO8TD1v255wxz8qVaXRa0IZtSyUh78N4fGVodRt5w5Ayu6L9857+BvoMzGQMRvrMnxxCI8vD8W/nt5DfuCXAmc+d2+N0WtDGb06lFrN3Mst6+wRK8oBoW3N+NQyEtTAxJnDVgAKs+ysfSeLLmP8Cagn7+WEEEIIIYS4EiTwFpdt5syZ3HTTTXz55ZckJydTr149fHx82LBhA2+++WZ1V0/fT/tsin7e7ja48wnYsQJ+nQEOB5g99YC6VhjMnwTJ8XreiA5w77/gyE5YPA1sVjC5wZB/QtP28M07EL9Dz1vSU15BJrPGxo+zWXD/Kf7TO5WUXfqWZ/XamS96X61m7rQe4oPJXR9S7uFvoGYTfWh4SRqAwajhE3zhIesANZu6oRkgdY+FvNN2Mk7YCG6ml7V2ahZ+oSbaD/OpVLuEEEIIIYQQFyaBtyijsLCQV199lSZNmuDu7k5QUBADBw4kLi7OmefQoUP84x//wG63Ex4ezu7duzl27BjHjh3j7NmzvPLKK868GRkZPPnkk9SvXx83Nzdq167NsGHDSExMdOaZOHGisxd+0aJFNG/eHG9vb7p3787hw4cvryHuHjB3fNUH35WUedJKWlwxOan6PO/wTmbunFajUmVkJFhJ3Kr3kre+27tS99Zo5EafSYEk7bAws38awU3d6PdGECe2FHHw1wJ6jQsk5v0spken8PltqWyddY1syyaEEEIIIcRflATeoow777yTt956i+PHj9O4cWOsVis//vgjN998M4cOHQJg1qxZ2O164PjBBx/Qpk0b5/2BgYGMGjUKgKKiInr06MH06dNJT08nIiKCnJwcvvzySzp37syZM2dcnp2SksKDDz6IpmkUFhayYcMGRo4cedH6WiwWcnJyXA6LzQ53jAazV9UG34kHKv35DnivJs/vrsfDi2tTs6kbJ2MtrHwzs8L3p8VZ+GrEaayFiqa3edLlSf9K1+GGQT6MWhbKs9vrcd+cWviGGFkxKYOOI3xJiytmx9w82j3gS5Oenqz/IJuEjYWVfoYQQgghhBBCJ4G3cLF27VpWrVoFwPvvv8/Bgwc5ePAgPj4+5OXlMWXKFAAOHCgNOLt3737B8r766iv27dsHwKJFi9i/fz+bNm3CYDCQmprKJ5984pLfZrPx3XffcfDgQZ599lkANm/eTGHhhQO/KVOm4O/v73JM2RgPXr4w/PWqDb5XzK3U51vC6KZRu7m7s7f6wNICMk5YL3nfkTWFfDPyDAXnHLQe6s2d02pgMGmXvO9SNn6Sg8GkcfNof07GFgHQ7kEf2gzV63diS9GffoYQQgghhBB/VxJ4Cxfbt293nj/wwAMA1KtXj27dugGwY4c+vPr8Xeg07cKBX0l5Xl5eDBw4EIB27drRrFkzl/JK+Pv7M2DAAABatGjhTD99+vQFnzF27Fiys7NdjrFdI/SLvoFVG3yHRV6wXn+UFmchcVtpAGu3KmeQC2AtVM58swakMWtAGmlxpYuu7Zyfy4/PnsVapOj+vD99JgRhMP75oPvUgWJ2fZlLnwmBmM
wa/O9Ha3DTrkhQL4QQQgghxN+dBN7isrRs2dJ5vmHDhitWbkBAgPPcZCpdVfti282bzWb8/PxcDrPJCLv0nvsqDb5vHVbhtp07ZuObkWf4+OYU5tydzvToVI6t0wPvWs3dqPW/Bc6shYqMBBsZCTZnMJ6yx8Kad7JQDnD30jiyqpAFD5xyHnn/2xcc4It+aXzRL420OH3httOHip1puadsLnVy2BTLJmTQaqA39Tt6ABDeWV/o7fj6Qo6v10cahEd5VLidQgghhBBCCFcSeP+NKaUoKipyOdq3b++8vnDhQgCSk5OdwXWHDh0AGDlyJEajvnr2c88957LwWkZGhnMIeceOHQEoKCjghx9+APQtyEoWTCspr0rsWAYxi/Tzqgq+jRXfcqtmEzcadvXAaIZzx6zYihQ1GpnoOMKXe2fVQjNcuHfZXlz64qE4X5H2e7HLcf71rCQbWUk2bBb1v3tL0xyucTfb5+WSf9ZO9D8DnGlthvjQ7kEfVkzMZPucXLo+5U+j7p4VbqcQQgghhBDClaYu1pUorkvR0dHExMSUe+2DDz7gl19+YdWqVWiaRvPmzUlOTiY3NxcfHx+2b99O8+bNAX07sZKVzQ0GAw0bNkTTNE6cOEHdunU5ceIERUVFdOzYkX379mEymYiIiOD48eMUFRURGhrKnj17CA4OZuLEiUyaNInw8HBOnDgBwJw5c3jkkUcASEhIqNy+4xMHQYe+evDd8wHoMVRPz83UA29LgR6I16yrp+9aBT9Nhw694fZRYDCApRAWvA6nE+GhCVDvf8PX43foC6s1bQ9dBsGsl2HUe/re3kIIIYQQQgjxB9LjLcr46aefeOWVV2jYsCFHjhzBZDJx1113sXnzZmfQDfDYY4+xdetWHnjgAUJDQ0lMTCQzM5OoqCheffVVADw8PIiJieGJJ54gJCSE+Ph4fH19efDBB9myZQvBwcFV15B2t+lB99qFVdfzvXpB1dVfCCGEEEIIcV2QHm9xfZo4CAY/CzXr6b3ZO5bpPeDtbtOvF+TCz59BcZG+7VjA/14AHNoG67+FyM7QdTBoGlgt8OsXkJEO/UfpgTjoW4ktnwPKIT3eQgghhBBCiAuSwFtcn968R5+DfTWY3GDMp6XBuxBCCCGEEEKcRwJvcX3KOgMFOVfnWV5+EnQLIYQQQgghLkgCbyGEEEIIIYQQogpVfC8kIf5Kyuvxlp5pIYQQQgghRDWQHm9xfZp8n74o2vnczPDkxxUKvu1WRewXOez/KZ/cdDteNYw06+1J16f8cfe68GYAmz7NZvNnFx7iPmp5HfzrmjgWU0jM+1lkp9ip2djEra8GEtra7My38o0MkndaeHhRCEa3C+/vLYQQQgghhLj2yXZifwMjRoxA0zSio6Mvmm/58uW0bt0aDw8PNE1j4sSJTJw4EU3TKreH9rXAatFXNR/1HtzzL/Dw1tMqOO972bgMNk/PISfVTkB9EwXn7Oycn8eSJ86iHBd+V+Vb20id1u4uh4e//mtmdAcPPwNFOQ6WvnAO72Aj/1hdh+ICxY/PnXOWkbzLwu/f5dNnUpAE3UIIIYQQQlwHJPC+St599100TcNkMpGbm+tM79WrF5qm4ebmRn5+fpn0Hj16XJX6ORwO7rvvPuLi4vD19SUqKop69epdlWdXmbOp+hZfLTrBnWMqfNupA8Uc+LkAgFteDuDRpXW468OaACTtsHBkdeEF7209xIdhC2s7j3v/G4zBqF9reac3Zl8DmSetWAsVoTe44+lvpHakO3mn7BRk2rFbFSsmZtD2Hh9C25gv+BwhhBBCCCHEX4cE3ldJt27dALDb7WzevBkAm83Gli1bnOexsbFl0rt3737Zz7Tb7djt9grlTU1NJSsrC4AFCxYQGxvLY489dtnPvias/1bfwxsqNbf7+IYi53lELy8AGnf3wGTWe58TNhWVe1959v9YQEGGAzToONwXgMAwN9w8NVLjiinMtnPqYDE+tY14BRrZ8p8crI
WKbs/6V/gZQgghhBBCiGubBN5XSfv27fH09ARgw4YNAOzevZv8/Hxq1arlkr5r1y5n73e3bt3IyMjgySefpH79+ri5uVG7dm2GDRtGYmKis/zzh4TPmzePxo0b4+7uTlJSUpm6nD59msjISDRN46abbuLDDz+kfv36zut9+/ZF0zTmzJlTblvsdjvTpk2jRYsWmM1m/P396dWrl7P+AF27dkXTNJ588knnMzVNQ9M0Dhw4AMD48ePRNI3mzZsDkJeXx+jRo6lfvz5ms5ng4GC6dOnC3LlzK/+BA0R2hp+mlwbfFZSbbnOeewXpvyKaQcMzQD/PSavYywzlUOyYp49uaBLtQVBDNwA8/A0MeK8G+aftfH5rGm6eGne9X4Ozx6xsm5VDr3GB7P4qj89vS2V6dApr383EYZOlGIQQQgghhPirksD7KnFzc6NTp05AaYBd8vX55593+X79+vUAGI1GOnfuTI8ePZg+fTrp6elERESQk5PDl19+SefOnTlz5ozLc1JTUxkxYgQmk4natWuXqUdmZia9e/fm0KFDREVFsXLlSpo2bUrbtm2deSIjI4mKiiI4uPxe4scff5wXXniBgwcPEhYWhslkYtWqVdxyyy3ExMQAOOeTb9q0yeUrwMaNG13aW5J3/PjxfP7555w5c4aWLVvi6+vL1q1bWbt27UU/W4vFQk5Ojsthsdmh62Do0FsPvg9tu2gZFVHZ0PfImkIyT+pBfMdH/FyuNe7hycif6vDcjno8/G0IdW5wZ/mEDL2HXYP1H2TTONqTdg/4smNuHr9/l1/eI4QQQgghhBB/ARJ4X0Ulw823bdtGcXGxM/AcOnQoTZs2JTY2FqvV6gy827Zty+LFi9m3bx8AixYtYv/+/WzatAmDwUBqaiqffPKJyzOsVivTp0/n8OHDpKSkEBYW5ryWl5dHv3792Lt3L506dWLFihX4+/vTv39/vv/+e2e+6dOnExsbS//+/cu04dixY/z3v/8F4JlnnuHIkSMcP36c8PBwbDYb48ePB0qD6bi4OHJycti4cSNGoxEvLy82btyI1Wpl27ZtLnmPHDkCwLhx49i1axfHjx/n9OnTPPfccxf9XKdMmYK/v7/LMWVjPGga3D5KD77Xf3uJn04p35DSXfYKMhyA3ntdlKWf+9UxVqic7XP03u46bdyp1+7i87V3f51H5gkbt7wcwMlYfSh723u8afegDwAntlR8eLsQQgghhBDi2iKB91VUMl+7qKiIbdu2sXHjRkJDQ2nUqBHdu3enoKCAHTt2OHuHu3Xrxvbt2wHw8vJi4MCBALRr145mzZoBsGPHDpdneHp6MmrUKAA0TcNgKP0R79y5k61btxIeHs7y5cvx83Ptha2InTt3UrID3QMPPACAv78/t99+u0t9br75Ztzd3XE4HGzZsoVNmzbRtm1bOnXqxMaNG9m1axcFBfoCZiWB94ABAwA98A4PD6dPnz58/PHH5fbcn2/s2LFkZ2e7HGO7RuirmBsMevAd2bnCbWzY1cN5Hr9Sr+Ox9UXYLHq7G3bRr6fFWZg1II1ZA9JIi3Pduixlt4XUPcUAdBzhe9Hn5abb2PBRNtEvBuAVZKRkgz+jm4bBJKuaCyGEEEII8VcngfdV1KlTJ0wmvTd1xowZnD171tkLXhKUf/7552RkZLikVUZwcLBLsH0+b29vAE6ePMn8+fMrXXZleHl50bFjRwBWrlzJrl276NKlC127diUhIYFvv9V7oJs1a0ZISAgAo0aNIiYmhueff57mzZuzc+dOJk6cyG233XbRZ5nNZvz8/FwOs8kIv34BlkI9+O46uMJ1D2npTuTt+qJqa97OYtaANH589iwA9dqbaXqrPlffWqjISLCRkWDDWug6EL2ktzsgzETE//JfyMo3MwltY6bVXfrPp0EnPbA/vrGIY+v1FdTDO8kK50IIIYQQQvxVSeB9FXl7e9OuXTsAvvrqK4AygffChQud+bt27eoMXgsKCvjhhx8Aff
G1w4cPA9ChQweXZ2jahXtIO3TowGuvvQbAU0895axDZbRv3975jJK6Zmdn8+uvv5apT0lP9syZM7Farc7AG/QXD+fnAX0IfsuWLXnvvfdYvnw5P//8MwD79+/n3LnSfa4rLCMdFryuB98X+VzK029yEJ3/4YdfHSNZSTa8goy0e9CHu6fXRDNcvKzMRCtH1+oBc4eHfS6a/9CyAhK3Wug1PtCZ1qi7J12f8mfbrBxWTsqk3YM+tBniU6n6CyGEEEIIIa4dpktnEVdSt27d2LZtGzabzfk9QIMGDahfv75zFfLmzZsTHBzM/fffz/vvv8++ffsYOnQoERERHD9+HIfDQWhoKGPGVHx/aoA33niDlJQUZs+ezfDhwwkICKBfv34Vvr9x48aMHDmSWbNm8dFHH/HLL7+QkZFBRkYGJpOJSZMmOfNGR0czefJksrOzAejSpQt+fn4YjUby8vKceUr8+9//5ptvvqFevXoEBQVx9OhRAOrWrUtQUFCl2glA/1Hw20w9+L7toUrdanTT6DrGn65jLrytV9hNHry4r36Z9MAwN174vWx6eZr39aJ5X68y6Z0f96Pz45WfCiCEEEIIIYS49kiP91VWEmgDBAQE0KpVK+f3PXr0KJPPw8ODmJgYnnjiCUJCQoiPj8fX15cHH3yQLVu2XHDl8YuZMWMG/fr1w2q1MmTIEJcVxyviP//5D++++y6RkZEkJiZitVq57bbbWLNmjUsgXTLPGyAsLIy6devi6+tL69atnXnOz9+/f3+6detGYWEhcXFxeHh4MGDAAH799deL9uRfUK0weGgCnE7Uh50LIYQQQgghRDXQVMlKWUJcTyYOgsHPQs16euD98+dgK4ZR70Fo4+qunRBCCCGEEOJvRIaai+uTyQ2WfFg2zUuGbwshhBBCCCGuLunxFtenrDNQkOOa5uUHAZUfmi+EEEIIIYQQf4YE3kIIIYQQQgghRBWSoebi+lTS4y293EIIIYQQQohqdlk93ocPH+att94iNjaWiIgIXn31VVasWMHgwYNdVukWotpMvg+sFnAzw5MfVyj4tlsVsV/ksP+nfHLT7XjVMNKst76ntrvXhTcAyE23seU/OaTsKSb3lA2HFfzrGml5lzfth/lidHNdkf33xXnsXZTHueP6lnL+dY10GO7LDYP0vbpjv8hh99d5FOc5CLvJTK8JQfjUNALgsCnm3XOKOq3d6TPxMrZYE0IIIYQQQlx1ld5ObO/evXTs2JEFCxZw5MgRzp07h4eHBxMnTuTzzz+vijqKcowYMQJN01y249I0DU3TmDNnzgXvW7dunTPfunXrqryeFxMdHV2mDVeM1QJdB+tf/zjX+wKWjctg8/QcclLtBNQ3UXDOzs75eSx54izKceH3U5mJNvYuyic7xYZ/qAnNCGeP2oiZls2atzNd8q56K5PlEzNJ32/FM8BAYLiJggwHKbuLATixuYgNH2VzwyBvhi2szbH1Rax7N8t5/7b/5lKQaSf6nwGV/kiEEEIIIYQQ1aPSgffLL79MXl4e7du3d6a1bduWoKAg1q5de0Urd71799130TQNk8lEbm6uM71Xr15omoabmxv5+fll0nv06EHjxo2JioqiRYsW1VH1v4ajuyuc9dSBYg78XADALS8H8OjSOtz1YU0AknZYOLK68IL3evgb6DMxkDEb6zJ8cQiPLw/Fv57eQ33glwJnvpQ9FnYvzEMzwF0f1uDxlaEMXxzCk+vrcsu/AgA4fUgPwOu1M1OjsRteQQbOHNbTMk9a2fKfHG57JRCzb6V/dYUQQgghhBDVpNL/e9+0aRN169Zly5YtLun169cnKSnpilXs76Bbt24A2O12Nm/eDIDNZnN+tjabjdjY2DLp3bt3Z9y4ccTGxjJ9+vRqqHnFFRcXV9/DT52scNbjG4qc5xG9vABo3N0Dk1kfJp6wqajc+wBqNXOn9RAfTO56Xg9/AzWbuAE40wAOL9eDcJ9aRvZ9n89HnZL5/LZUVr
2VScmEj1rN3QFI3mXh3DErBRkOgpu5o5Ri+cRMGnb1cNZPCCGEEEII8ddQ6cDbbrfj4+OD0Wh0ST9z5gwOh+OKVezvoH379nh6egKwYcMGAHbv3k1+fj61atVySd+1a5ez97tbt27lDjUvz7fffkujRo3w9PTk9ttvJyUlpdx8O3bs4K677qJGjRqYzWYaNWrEtGnTACgsLGTgwIE0bNgQb29vzGYzTZs2Zfz48S6BdcnQ8YceeogXX3yRWrVq0axZMwAyMzO555578PLyIiws7ILTEubNm0fbtm3x9fXF19eXyMhIHnrooYp8nGV17FvhrLnpNue5V5D+a6EZNDwD9POcNHuFy8pIsJK41QJA67u9S9NP2P73LDsnt1rwCzGSe8rO7oV5/PKvcwA0uNmDbs/4E7cknwUPnKJRNw+iXwwg7rt8Th8qpvtz/vz22jk+6ZbCF/3S2PdjftkKCCGEEEIIIa4plQ68W7RoQXx8PG+++SYAOTk5vPDCC6SmpsrCapXk5uZGp06dgNIAu+Tr888/7/L9+vXrATAajXTu3LlC5e/Zs4f777+fhIQEzGYz8fHxPP7442Xybd68mS5duvDTTz+Rl5dH06ZNycnJcT7bYrHw448/UlhYSEREBLVq1eLo0aO88cYbvPrqq2XK+/bbb/noo4+oXbs2fn5+ADz22GMsWrSIwsJCvLy8eOGFF9ixY4fLfXv37mXEiBHs3buXkJAQGjRoQHJyMgsWLLhoOy0WCzk5OS6HxWaH2g0q9DldTGVXHkyLs/DViNNYCxVNb/Oky5P+zmsOW2lpQ2cE88gPdejypP75HIspIjtFD8w7/Z8fo9eE8szWegz+RF8Ubt37WfR4PoC4Jfns+6GAHs/7E9zMjWXjMjh71PrnGimEEEIIIYSoUpUOvJ955hmUUkyYMAFN0zh48CAffPABmqYxZsyYqqjjda1kuPm2bdsoLi52BrtDhw6ladOmxMbGYrVanYF3SW9wRUybNg2Hw4G/vz+HDx/m6NGjDB48uEy+1157jeLiYgICAoiLi2Pfvn2cPn2aSZMmAeDt7c3+/ftJT09n9+7dJCUlMWzYMAC+/vrrcp+9fft24uLi2LVrF8eOHWPJkiUA/Otf/+LQoUPs3LkTi8Xics/Ro0dRShEREcHhw4eJi4sjKyuLmJiYi7ZzypQp+Pv7uxxTNsZX6DMq4RtSurNeQYY+ckM5FEVZ+rlfHWO5953vyJpCvhl5hoJzDloP9ebOaTUwmEqHmvvWLi0jpJU+pLzODe7OtJLA+49WT84kOMKd1kO8ORlbhIe/gRsG+dBqoDfKAYlbLzwMXgghhBBCCFH9Kh14Dxs2jLfffhtPT0+UUiil8PDwYPLkyc5gTFRc9+7dASgqKmLbtm1s3LiR0NBQGjVqRPfu3SkoKGDHjh1s2rQJKA3UK2L//v0AdOnShdq1awN6QP9HW7duBWDIkCFEREQAYDAYaNOmjfN8wYIFREREYDab0TTN2QudmppapryePXs67zUajc56ANx9990ANGvWjNatW7vc16VLFwIDA4mPj6dGjRpERUXxxBNPXLKdY8eOJTs72+UY2zUCMk9d8t4SDbt6OM/jV+pzsY+tL8Jm0XupG3bRr6fFWZg1II1ZA9JIiyt9cbBzfi4/PnsWa5Gi+/P+9JkQhMHouo1YeKfSZ6TvK/7f1//1VmsQGGbij46sKeT4+iL6TAxE0zSUAqM+fRxj2exCCCGEEEKIa9Bl/df9pZde4qmnnnIGVC1btnTOVRaV06lTJ0wmEzabjRkzZnD27FnuvfdeQA/KZ82axeeff05GRoYz7Wp7++23mTJlCgDh4eGEhISQnJxMSkpKufP6S4L8ygoJCWH//v3Mnz+fnTt3EhcXx4wZM5g5cyabN28mKiqq3PvMZjNms9k10WSE2KUVf3ZLdyJv9+LgrwWseTuL3V/lkZWk90DXa2+m6a36v29roSIjweY8B3218jXvZAHg7q1xZF
UhR1aVroI+8KOa+AQbadbHi53zc0nfb2Xx42fwr2dyDhO/YaC3S687gCXPwarJmXR+3I+gBnq0Hd7Jg+2zc0nfX8zx9UVoBqh/0x/aLoQQQgghhLimXPaeRBs2bCAmJoaYmBg2btx4Jev0t+Lt7U27du0A+Oqrr4DSXu2SIHvhwoXO/F27dq1w2S1btgT0lehPnz4NwOLFi8vkKwlov/vuO44ePQqAUorff/8dwLmyekREBCdOnGDTpk3OHu3yaJprT+/5W559//33AMTHxzvLL5GamsqZM2d46aWX+Oabbzhw4ADNmzfH4XBc3r8x36BKZe83OYjO//DDr46RrCQbXkFG2j3ow93Ta6IZtAveZy8unbtdnK9I+73Y5Si5bnTTGDqjFm2GeuPurZGVaKNmEzdu+VcAvScGlil3/QdZePobuGlk6dSCm//hR2R/L7597DRH1xXSZ2IgwU3dy9wrhBBCCCGEuHZUusc7OTmZQYMGsWvXLpf0G2+8ke+//5769etfscr9XXTr1o1t27Zhs9mc3wM0aNDAZZu25s2bExwcXOFyn3/+eb788kuys7OJiIggODi43C3f3nzzTXr27ElmZiYtW7YkIiKC9PR0unTpwg8//EDr1q35+eefiY+Pp2HDhlitVgoLL7yv9R81adKEgQMH8sMPPzBlyhS+//57kpKSMBqNzjYDHDhwgF69ehEcHExoaCg5OTkkJCQAcMMNN1T4eU6dBsBvX1Q4u9FNo+sYf7qO8b9gnrCbPHhxX/1Lpl2Ih7+B3hOC6D3h0nl7jSv74sDd28Ad79So0LOEEEIIIYQQ14ZK93iPGjWKnTt3Oud3lxy7d+8ud8VscWnnz9sOCAhwWR2+R48e5eariBtvvJGFCxfSoEEDioqKCA8P57PPPiuT7+abb2bTpk0MGDAAHx8fDh8+jI+Pj7N3/ZVXXmH48OEEBASQk5PDfffdV6G51+ebNWsWd999Nx4eHmRnZ/P66687V3Qv0ahRI+677z78/PyIj4/nzJkztGnThhkzZtC7d+9KPQ8Ak1vl7xFCCCGEEEKIK0xTSlVqxyRPT09sNhuffvop999/P6CvbD169Gjc3d0pKCiokooKUSkTB0G3IbBhMYx6D0IbV3eNhBBCCCGEEH9TlQ68w8PD8fX1Zd++fS7prVq1oqCggOPHj1/RCgpxWd68B2xWcDPDkx9DQMWH6AshhBBCCCHElVTpOd7/+te/eOmllzh06BDNmzcH4NChQyQkJPDxxx9f8QoKcVnGfAoFOeDlJ0G3EEIIIYQQolpVuse7Z8+exMbG4nA4nAtexcXFYTab6dChQ2nBmsbq1auvbG2FEEIIIYQQQoi/mEoH3gZDxdZj0zQNu91+WZUS4k/LOiM93kIIIYQQQohrQqWHmj/88MNl9mkW4prz6VNgtVRqjrfdqoj9Iof9P+WTm27Hq4aRZr096fqUP+5eF37hlJtuY8t/ckjZU0zuKRsOK/jXNdLyLm/aD/PF6Kb/vmSn2JjRJ63cMvpMDKT1EB8A4r7PY8t/cig45yCklTu9JwQS1KB0hfbvRp/BYYehM+SFghBCCCGEEH8FlQ6858yZUwXVEH91c+bM4ZFHHgGgkoMoqobVAh36wo5les93BQLvZeMyOPBzAZoBAsNNZCXZ2Dk/j9MHrdz732A0Q/kvnDITbexdlI+bl0ZgmImsZBtnj9qImZZNdrKt3P2467R2d/neq4YRgHPHrSyfkEnLO73o9nQAswel89trGTy4oDYAB37JJ2mHhRHfh1T2ExFCCCGEEEJUk0rv492iRQumTp1KSkpKVdRHVJOMjAxefvllIiMj8fT0xMfHh7Zt2zJ58mSXLeKio6PRNI0RI0ZUX2UrKm59hbOeOlDMgZ/1dt7ycgCPLq3DXR/WBCBph4UjqwsveK+Hv4E+EwMZs7EuwxeH8PjyUPzr6YH0gV/K315v2MLaLkeTnp4AnD1iRTkgtK0Zn1pGghqYOHPYCk
Bhlp2172TRZYw/AfUq/c5MCCGEEEIIUU0qHXgfOnSIsWPH0qBBA/r06cPChQspLLxwUCKufcnJydx444288847HDp0iJCQEPz9/dm7dy+vvfYaXbp0ITc3t7qrSXFxceVucDNXOOvxDUXO84heXgA07u6Byaz3cidsKir3PoBazdxpPcQHk7ue18PfQM0m+tDwkrQ/+qRbCh92TGbukHT2LspDOfRRAjWbuqEZIHWPhbzTdjJO2Ahuppe1dmoWfqEm2g/zqXC7hBBCCCGEENWv0oH3c889R3h4OHa7nZUrV/LQQw8REhLCo48+yrp166qgiqKqPfHEEyQmJgLw1VdfkZCQQEpKClOmTAFgz549vPrqq2iaRkxMDABz585F0zQ0TePEiRMu5W3evJmOHTvi5eVFu3btiI2Ndbm+detWbr/9dgICAvDw8KBdu3YsXrzYJU9J2VOnTmXw4MH4+PgwatSoyjWs810VzpqbbnOeewXpvxaaQcMzQD/PSav4QoEZCVYSt1oAaH23d5nrXkEGfIL1HvHTh6ysmJTJ+g+zAajRyI0+kwJJ2mFhZv80gpu60e+NIE5sKeLgrwX0GhdIzPtZTI9O4fPbUtk6K6fC9RJCCCGEEEJUj0qval5i165dLF68mCVLlhAfH+9ccK1Bgwb885//5IknnriiFRVVIzMzk5o1a+JwOIiOjmbt2rXOaw6HgyZNmpCQkEBQUBBNmjTh4MGD5ObmUrNmTRo3bgzA999/z/Lly51zvL28vKhfvz7Hjh3DZrMRHh7O0aNHMZlMbNq0iZ49e2K1Wp0964cPHwb0YP7hhx8GcP57cnd3x8PDg7CwMDp37syMGTPKtMFisWCxWFzSzFOHYR70FCydDqPeg9DGF/0cVkzKYO+ifAD+ubceBqP+/M9uTSXvlJ0GXTwY+p9LzxNPi7OwZMxZCs45aHqbJ3e+VwODSS+ruMBBdrKN4Ah9fndhtp2vHj7NuWM2TB4aT2+p61yI7XzWQgezB6XTvK8XviEmVr2ZSben/ck7Y2f3V3kM+bwmDbt6XrJuQgghhBBCiOpR6R7vEu3atePhhx/mzjvvxNtb79VTSpGQkMBTTz3Fc889d8UqKarOkSNHcDgcALRt29blmsFgoHXr1oA+B3zp0qW0a9cOgP79+xMbG0tsbCx16tRxue/tt9/m0KFDTJs2DYCTJ09y9OhRAF577TWsViu9evUiKSmJQ4cO8eyzzwLw6quvlqlfo0aNOHHiBHFxcXz22WfltmHKlCn4+/u7HFM2xlfqc/ANKZ0zXZChfx7KoSjK0s/96hgvWcaRNYV8M/IMBecctB7qzZ3TSoNuAHcvgzPoBvD0NzoDZluRojDTUW65Gz/JwWDSuHm0Pydj9SHv7R70oc1Q/ffuxJYLD4MXQgghhBBCVL9KB955eXnMnDmTm2++mZYtWzJt2jTy8/MJCQlh3LhxfPXVVwQGBjJv3ryqqK+oQuVtE1fRfdvP99BDDwH6QnwlTp06BcC2bdsAWLlyJW5ubmiaxocffgjoc83/uGjf8OHDCQwMBMBoLD/4HTt2LNnZ2S7H2K4REL+jwnVu2NXDeR6/Ul8Q7dj6ImwWfUBIwy769bQ4C7MGpDFrQBppcaW97Dvn5/Ljs2exFim6P+9PnwlBzl7zEkfWFLrMFS/KcXBik74+gpunhmdg2c/61IFidn2ZS58Jgfp88/+NTzG4aS5BvRBCCCGEEOLaVemlkevUqUNBQYFzy6iePXsyevRoBg4ciMmkF/f999+zaNGiK1tTUSWaNGmCwWDA4XCwe/dul2sOh4O9e/cCEBQURHBwxfaNDggIAHD+e4CyW4zVrVuXevXqlbnXZrO5fF+7du1LPs9sNmM2/2EhNZMRDm+rUH0BQlq6E3m7Fwd/LWDN21ns/iqPrCS9LvXam2l6q94zbS1UZCTYnOcAKXssrHknCwB3b40jqwo5sqp0wcGBH9XEJ9jI6YPFbP4sB7Ovhl8dfd
sxa4Fexk0jfcsMM3fYFMsmZNBqoDf1O+qBf3hnM0dWF3J8fSHZyXo9wqM8EEIIIYQQQly7Khx433LLLbRo0YL8/Hz8/f0ZPnw4o0ePplmzZmXyjhkzhn79+l3RioqqERQURP/+/Vm6dCnr1q3j66+/5r777gNg6tSpHD9+HIAHH3wQTdPw8tJX/M7Pz7+s53Xs2JGYmBjCw8NZtWoVnp56QJucnMzOnTsJDw93yV9eL3yFNbupUsF3v8lBBISZOLA0n6wkG15BRiJ6edLtaf8L7uENYC8ufalQnK9I+7243OuNoz3JTrWRsruYrCQbJrNGcIQb7Yf50ryvV5lyt8/LJf+sneh/BjjT2gzxISPBxoqJmRhM0PUpfxp1l/ndQgghhBBCXMsqvLiawWCgU6dOPPbYY9x///3OgEn89SUlJdG1a1fnyuYNGjSguLiY1NRUQJ/7HRMTg5+fH88//zwffPABBoOBNm3aUKtWLZYtW8acOXOci6uV/JNat24dPXv2BGDt2rVER0ezfv16br31Vmw2G/7+/jRs2JAzZ86QmppK9+7dnSvjlwTcs2fPvrw9wycOggFPVHhxNSGEEEIIIYSoKpWewDty5EgJuq8z9evXZ9euXbz00ks0a9aMtLQ0MjMzad26NW+++SabNm3Cz88PgBdeeIHbbrsNLy8vdu/ezY4dFZ9HDdC9e3fWr19Pv3790DSNAwcO4Obmxt13380LL7xQFc0TQgghhBBCiGpVqR7vxo0bM27cuIvmK9kOSohqNXEQdLoDYn+WHm8hhBBCCCFEtapU4H2p+baappVZHEuIavHmPWCzgpsZnvwYAiq2MJwQQgghhBBCXGmVDrwvll3TNOx2+xWrnBCXLesMFOSAl58E3UIIIYQQQohqVantxCIjI/n000+rqi5CXDkBwRJwCyGEEEIIIa4JlQq8/fz86NGjR1XVRYgrJ+uM/lWCbyGEEEIIIUQ1q1TgLcRfxidjQNMqPL/bblXEfpHD/p/yyU2341XDSLPennR9yh93rwsv/p+bbmPLf3JI2VNM7ikbDiv41zXS8i5v2g/zxeimr4tgsyiWT8wgfV8xGSdsoKBOa3eGLaztUl7c93ls+U8OBecchLRyp/eEQIIauDmvfzf6DA47DJ0hLxSEEEIIIYT4q6jwdmLDhw/n9ttvr8q6CHHl2IrBatHneVfAsnEZbJ6eQ06qnYD6JgrO2dk5P48lT5xFOS68rkFmoo29i/LJTrHhH2pCM8LZozZipmWz5u3M0upYFAeWFlBcoDD7lL9I4bnjVpZPyKR+BzOP/VKHM/FWfnstw3n9wC/5JO2w0Gt8YAU/BCGEEEIIIcS1oMKB9+zZs3nttdeqsi7iKouOjkbTNOdhNBqpW7cuAwYMYPPmzdVdvavm1IFiDvxcAMAtLwfw6NI63PVhTQCSdlg4srrwgvd6+BvoMzGQMRvrMnxxCI8vD8W/nhGAA78UOPO5e2uMXhvK6NWh1GrmXm5ZZ49YUQ4IbWvGp5aRoAYmzhy2AlCYZWftO1l0GeNPQD0ZqCKEEEIIIcRfSYUDb3H9cnd3JyoqitatW3P69Gl+/vlnevTowbZt26q7alfF8Q1FzvOIXl4ANO7ugcms90wnbCoq9z6AWs3caT3EB5O7ntfD30DNJvrQ8JI0AINRwyfYeNF61GzqhmaA1D0W8k7byThhI7iZXtbaqVn4hZpoP8znMloohBBCCCGEqE4SeAvq1KlDbGwsu3fv5ocffgDAZrOxcOFCAH766Se6du2Kj48PHh4e3HjjjcyaNculjHnz5tG2bVt8fX3x9fUlMjKShx56yCXPggUL6NixI15eXvj6+tK3b1/27NnjvG632xk7diyNGjXCw8ODoKAgOnTowLvvvlul7c9NL9173itI/5XQDBqeAfp5TlrFt8jLSLCSuNUCQOu7vStVjxqN3OgzKZCkHRZm9k8juKkb/d4I4sSWIg7+WkCvcYHEvJ/F9OgUPr8tla
2zKjaMXgghhBBCCFG9ZMyquKgFCxY4A+jatWvj4eHBnj17eOyxx0hPT+fVV19l7969jBgxAqUUTZo0wcPDgxMnTnDo0CHmz58PwNSpU/nXv/4FQEREBHl5eSxfvpyNGzeyfft251Z1b7/9NkajkZYtW1JQUEBcXBw+Pj68+OKLF6yjxWLBYrG4pJltdsymi/cwX0qFNrg/T1qchSVjzmItVDS9zZMuT/pX+pk3DPLhhkGlvdrWQgeLR5+h4whf0uKK2TE3j25P+5N3xs76D7Kp1cyNhl09K/0cIYQQQgghxNUjPd6CtLQ0OnXqxI033sjAgQMBMJlM3H///bz66qsAREVFcfLkSRISEhg0aBAAkydPpqCggKNHj6KUIiIigsOHDxMXF0dWVhYxMTEAFBQUMGnSJAAmTZrE4cOHOXnyJB06dCA/P5+33noLgCNHjgDwyCOPsHfvXo4cOcK5c+cu2eM9ZcoU/P39XY4pG+Mr3H7fkNL3TwUZDgCUQ1GUpZ/71bl0AH9kTSHfjDxDwTkHrYd6c+e0GhhM5S+iVhkbP8nBYNK4ebQ/J2P1Ie/tHvShzVC9N/3ElgsPgxdCCCGEEEJcGyodeBuNRrp06VImfeTIkURFRV2RSomrq7i4mK1bt/L7778THBxM//79iYmJoWHDhiQmJgIwePBgzGYzmqZx3333AVBYWMj+/fvp0qULgYGBxMfHU6NGDaKionjiiSec5e/fv5+CAn2hsQkTJqBpGm5ubuzYsQOA2NhYAO644w40TWPmzJnUrVuXnj178uabbxIUFHTR+o8dO5bs7GyXY2zXiAq3v2FXD+d5/Eq9nsfWF2Gz6H3eDbvo19PiLMwakMasAWmkxZX2sO+cn8uPz57FWqTo/rw/fSYEYTD++aD71IFidn2ZS58Jgfp88/91wRvctCsS1AshhBBCCCGujkoPNVdKoVTZQbj79u1j586dV6RS4uoKDw/nxIkTZdJPnz5doftDQkLYv38/8+fPZ+fOncTFxTFjxgxmzpzJ5s2bMRhK3+9ERkbi5+fncn+NGjUA6NOnD7t27WLRokXs3buX3bt3s27dOubMmcPRo0fx8Sl/YTGz2YzZbHZNrMQw85CW7kTe7sXBXwtY83YWu7/KIytJn/ddr72ZprfqQ7mthYqMBJvzHCBlj4U172QB+srlR1YVcmRV6SroAz+q6VxU7Yt+aQDkndbnjJ8+VOxMu29OML61S38dHTbFsgkZtBroTf2OeuAf3tnMkdWFHF9fSHayXo/wqNKXBkIIIYQQQohrU4UD79dff915npyc7PJ9fn4+v//+Ox4eEgRcT2rVqkVYWBiJiYksWbKEZ555Bnd3d77++msAPD09admyJampqZw9e5aXXnrJeW9kZCSHDh1i48aNjB49Gk9PTwoLC+nbty/Tpk1D0/Qe2927d1NYqAeqJT3ukydPBiA9PZ06depw6tQpDh8+TPv27ausrf0mBxEQZuLA0nyykmx4BRmJ6OVJt6f90QwX7l22F5e+hCrOV6T9XnzB6yXBfOm10jSH6yW2z8sl/6yd6H8GONPaDPEhI8HGiomZGEzQ9Sl/GnWX+d1CCCGEEEJc6zRVXvd1OQwGA5qmoZRyBk3nU0rRuXNnNm3adMUrKapGdHQ0MTExF+zxhvIXVzt58iQAb775Jq+++iqrVq2iV69eBAcHExoaSk5ODgkJCQAsX76c3r17M2XKFF555RUAQkNDCQ4OJikpiYyMDCZMmMDEiRN57bXXeOutt6hXrx7BwcEkJiZy9uxZvLy8SElJISAgoOKNm6jPQ2fUexDa+LI+HyGEEEIIIYS4Eirc4x0WFoamaSQmJuLu7k5ISIjzmpeXF82bN+fNN9+skkqK6jNs2DD8/PyYOnUqu3fvJisri7Zt2zJmzBgeffRRABo1asR9993H9u3biY+Px2g00qZNG5588kl69+4N6POw69atyyeffMK+ffvIzs6mfv363HPPPQwePBiA7t
27s2vXLn7//Xf27duHr68vt9xyCxMmTKhc0C2EEEIIIYQQ15AK93iXMBgMdOrUic2bN1dVnYT486THWwghhBBCCHGNqPTiagkJCS4LWdlsNkwm2Q5cXGNM7oACL79LZhVCCCGEEEKIqlTp7cTCw8M5fPgwPXr0wMPDgx49erB69WpGjhwpveDi2jHmExjzKQQEV3dNhBBCCCGEEH9zle6qXrduHb1798Zm05dhVkoRFhbGnDlzALj55puvaAWFuCwScAshhBBCCCGuEZXu8R4/fjx2u51BgwY505o2bUrt2rVlRXNx7cg6ox9CCCGEEEIIUc0q3eO9Y8cOGjZsyHfffYfBUBq316lTh/j4+CtaOSEu2ydjQNPgyY8r1Ptttypiv8hh/0/55Kbb8aphpFlvT7o+5Y+714XfT+Wm29jynxxS9hSTe8qGwwr+dY20vMub9sN8MbrpW++l7rWw+q1Mzh6z4V/XSI/nA2jco3QP7q3/zWHH3Fwe/akOHv6Vfh8mhBBCCCGEuIZV+n/4JpOJPy6E7nA4SElJwWg0XrGKCfGn2IrBaoGCnAplXzYug83Tc8hJtRNQ30TBOTs75+ex5ImzKMeFF/7PTLSxd1E+2Sk2/ENNaEY4e9RGzLRs1rydCejTMX58/hzWQsU/VtfBu4aRpS+coyjH4Sxj8/Qcbh0bKEG3EEIIIYQQ16FK/y//xhtv5MSJE/zf//0fAGfOnOH+++/nzJkztG/f/opXUFwd0dHRaJqGpmkYjUZ8fX1p1qwZjzzyCLt27ap0eSNGjHCWV1JmzZo1uf3229m9e3e590ycONGZ39/fn4KCgj/brAo5daCYAz/rz7rl5QAeXVqHuz6sCUDSDgtHVhde8F4PfwN9JgYyZmNdhi8O4fHlofjX019AHfhFL7Mw00HeKTu1It3x9DdSp7U71kJFVqK+TsKK1zMI72SmeV+vqmymEEIIIYQQoppUOvB++eWXAfjvf/+LpmkcP36cxYsXo2kaL7744hWvoLi63N3d6dixI/7+/hw5coQ5c+YQFRXFzJkzL7vMqKgoIiMjOXfuHL/99ht9+vShsNA1mFVKMW/ePOf3OTk5LFmy5LKfWRnHNxQ5zyN66cFv4+4emMz6MPGETUXl3gdQq5k7rYf4YHLX83r4G6jZxA3AmeYZaMCntpHTB4spzLaT9nsxbp4aAWEm4r7PI31fMb1eC6yStgkhhBBCCCGqX6UD7379+rFw4ULCwsJQSjlXNV+wYAH9+vWrijqKq6hOnTrExsaSnJzMtm3bCA8Px2azMXr0aA4dOgTAvn37GDx4MDVq1MDd3Z1GjRoxduzYMsF0idjYWPbt28e4ceMAfZTEgQMHXPLExMSQkJAAQIcOHQCcK+VXtdx0m/PcK0j/ldAMGp4B+nlOmr3CZWUkWEncagGg9d3eelmaxl3v18DkofH5rWnkn7Uz4L0a2K2Kde9l0/1Zf05utfBFvzQ+6ZbCb6+do7jAcaWaJ4QQQgghhKhmlV5cDeDee+/l3nvv5ezZswDUrFnzilZKXBs6dOjARx99xMCBA7HZbMyaNYuRI0fSuXNn8vLy8PHxoUmTJhw6dIi3336bnTt3smLFikuWazKZCA0NdUkrCbI7duzIuHHjuPPOO1m7di1JSUnUr1//ouVZLBYsFotLmtlmx2z6c2sOXHhmd/nS4iwsGXMWa6Gi6W2edHnS33kttI2Zh78Nccm/9IWz1Ghkol47M3OHnKJJT08aR3uwbFwmXjWM9Hgu4E/VXwghhBBCCHFtqHSPd3Z2NomJiRQWFlKzZk1iYmJ45pln+O9//1sV9RPVrFu3bs7zAwcO8PbbbzuD7gMHDnDgwAHef/99AFauXMnatWvLlNGpUydatWrFG2+8gbe3Nx999BF16tRxXs/Ly2Px4sUAPPTQQ/Tt25eaNWvicDiYO3fuJes4ZcoU/P39XY4pGyu+wr5vSOn7p4IMvadZORRFWfq5X51LB/BH1hTyzc
gzFJxz0HqoN3dOq4HBpF0w/7GYQo6sKaTPxCCStltQDmg1yJsbBvng4W/g5JYLD28XQgghhBBC/LVUOvB+/PHHadiwIQcOHGDp0qUMHTqUTz75hP/7v/9j6tSpVVFHUY0cDtchz9u3bwf0gLykJ/qBBx5wXt+xY0eZMrZu3cr+/fsBaNCgAbfddpvL9cWLF5Ofn4+bmxv33Xcfbm5u3HvvvQAVCrzHjh1Ldna2yzG2a0SF29iwq4fzPH6lviDasfVF2Cx6n3fDLvr1tDgLswakMWtAGmlxpT3sO+fn8uOzZ7EWKbo/70+fCUEYjBcOuosLHKx8I5NO/+dHjcZulGwSYNSnhmO4rHEoQgghhBBCiGtVpQPvnTt3EhAQQPv27fnuu+/QNI3evXujlKpQkCT+WjZs2OA8b9GixWWV4XA42L59OzVq1GD//v3ce++9LlvSlQwzt9vtNG3alICAAOcIiqNHj7Jx48aLlm82m/Hz83M5KjPMPKSlO5G364uqrXk7i1kD0vjxWX0aRb32Zprequ+3bS1UZCTYyEiwYS3U65+yx8Kad7JQDnD30jiyqpAFD5xyHnlnys4PX/9hNmYfjajH/AAIizKjGSBhYxFpcRYKzjkIi/Ioc58QQgghhBDir6nSgXdqaiphYWEAxMXFceONN/Lbb7/RrFkzEhMTr3gFRfXZsWMHzz33HABGo5FHHnmEjh07AnpAnpycDMDChQud95QsjHY+TdPo0KEDEyZMAGDPnj3OoeUJCQmsX78e0AP0kh7r8xdquxqLrPWbHETnf/jhV8dIVpINryAj7R704e7pNdEMF+69theXvkAozlek/V7scpx/HSB1r4W93+bRZ1IQRje93OCm7vSZGMiR1YUsGnWGyP5e3PwPv6ppqBBCCCGEEOKq09T5XY8VEBgYSEBAAIcOHSI4OJiBAwcyb9482rRpw4kTJ8jOzq6quooqFB0dTUxMDO7u7tx4442kpKSQkpKCUgqTycRnn33GY489xsGDB7npppuc87zr16/PoUOHUErRq1cv5+JqI0aMcI6AKPknVlhYSHh4OGfOnOHGG29k165dTJw4kUmTJuHm5sbp06cJCAhw1um5557jww8/xM/Pj7S0NLy8KrHP9cRB+tdR70Fo4yvyGQkhhBBCCCHE5ah0j3dkZCSJiYnUrl2b/Px8oqKiAEhOTqZevXpXvILi6iouLmbbtm1kZWXRpEkThg8fztatW3nssccA/ee/ZcsWBg0ahLu7O0eOHKFBgwa8/PLL/Pjjjxct29PTk6effhqA3bt388svvzj37u7Zs6dL0A0wePBg4Oru6S2EEEIIIYQQV1qle7x//fVXBg8eTHFxMY0bN2bnzp0cOHCAm2++mUceeYRZs2ZVVV2FqDjp8RZCCCGEEEJcIyq9fvLtt99OcnIyiYmJtGzZErPZTMuWLTly5Ag1atSoijoKUXkmd0CBl8yVFkIIIYQQQlSvSvd4/1FycjLbt2+nRYsWNGvW7ErVS4g/J+uM/jUguHrrIYQQQgghhPjbq/Qc75deeolGjRoRGxvL3r17iYyMZMiQIdxwww389NNPVVFHISovIFiCbiGEEEIIIcQ1odKB94oVKzh9+jTt27dn9uzZ5Ofn4+vri81m45133qmKOgpReVlnSnu9hRBCCCGEEKIaVXqO94kTJwgPD8fNzY2dO3fSqFEjDh48SMOGDTl48GBV1FGIyvvkSdAM8OTHFer5tlsVsV/ksP+nfHLT7XjVMNKstyddn/LH3evi76e2/Cebo2uLOHO4GLtVT3tuZz1M5tL9v5N3Wdi9MJe0fcUUnHNgdNeo0chExxG+NL21dJu0uO/z2PKfHArOOQhp5U7vCYEENXBzXv9u9Bkcdhg6Q3rzhRBCCCGE+KuodI+31WrFaDQCcPjwYdq0aYObmxu1a9emqKjoildQ/H2MGDECTdOIjo7+84XZrGC1QEFOhbIvG5fB5uk55KTaCahvouCcnZ3z81
jyxFmU4+LLIBxeUUjmSSueQcYL5jm5pYhDywqxFigC6psozneQsruYH545x6FlBQCcO25l+YRM6ncw89gvdTgTb+W31zKcZRz4JZ+kHRZ6jQ+sUJuEEEIIIYQQ14ZKB95hYWHs37+fPn36cO7cOW688UYA0tPTCQkJueIVrCrR0dFomkaDBg1c0tetW4emaWiaxpw5c6qlblfLnDlznG39Ozt1oJgDP+vB7y0vB/Do0jrc9WFNAJJ2WDiyuvCi99/9aU2e2lyX1oO9L5inZlM3hs4I5sn1dRmxJIRhC2uj/e+37+Av+QCcPWJFOSC0rRmfWkaCGpg4c1jvQi/MsrP2nSy6jPEnoF6lB6oIIYQQQgghqlGlA+/HHnsMpRQrV67E3d2dBx54gOPHj5OWlka7du2qoo4CKC4uru4qVBm73Y7dbq+25x/fUDpSI6KXPuy7cXcP51DxhE0XH8nhG2K65MuLZr29aHCzh/P7WpFuuHvr9xjd9a81m7qhGSB1j4W803YyTtgIbqYPM187NQu/UBPth/lUsnVCCCGEEEKI6lbpwPuf//wnP/74I++99x47duygUaNGOBwOvvjiC1599dWqqGO1ycnJwcfHB03TmDlzpjM9Li7O2VMcGxvr0kv+008/0a1bNzw8PGjSpAmLFy92KfPQoUMMHTqU4OBg3N3diYyM5LPPPnPJ06BBAzRN48UXX2TkyJEEBATQp08fAOdzpk2bxrBhw/D19aVu3bq8+eabLmVkZ2fzzDPPEB4ejru7O/Xq1eP555+noEDv2R0xYgSPPPKIM39JuRMnTuS1115D0zRatmzpvN6iRQs0TWP69OkArFmzxnlPeno6ABkZGTz55JPUr1/fOf1g2LBhJCYmOsuZOHGic6TBvHnzaNy4Me7u7iQlJZX5/E+fPk1kZCSapnHTTTeRlZVV4Z9dZeSm25znXkH6r4Rm0PAM0M9z0q78S4EDPxdgyVWgwQ2D9WC6RiM3+kwKJGmHhZn90whu6ka/N4I4saWIg78W0GtcIDHvZzE9OoXPb0tl66yKDaMXQgghhBBCVK9KB94AAwYM4Pnnn3cGZk2aNKFv37789ttvV7Ry1c3Pz48HHngAgP/+97/O9O+++w6AiIgIOnXq5HLPPffcw+nTpzGbzRw7dox7772X3bt3A3DkyBE6derE4sWLcTgcNGvWjMOHD/PEE0/w+uuvl3n+v//9b77++mvCwsLw9PR0uTZ27FjWrFmDh4cHqampjBs3jpUrVwJ673h0dDT//ve/ncHruXPn+OCDDxgwYABKKRo3bkyjRo2c5UVFRREVFUW9evWcc6wPHjxIZmYmGRkZHDp0CICNGzcCsGHDBgCaNWtGSEgIRUVF9OjRg+nTp5Oenk5ERAQ5OTl8+eWXdO7cmTNnXFcYT01NZcSIEZhMJmrXrl2m7ZmZmfTu3ZtDhw4RFRXFypUrCQgIKPfnZLFYyMnJcTkstj8fLP+pDe4vIm5JHsvG6XO3o18IoGGX0p7wGwb5MGpZKM9ur8d9c2rhG2JkxaQMOo7wJS2umB1z82j3gC9Nenqy/oNsEjZefBi8EEIIIYQQovpdVuBdoqioiIULF9K7d28aNGjA+PHjr1S9rpqTJ086e241TaNnz54u10ePHg3Ali1bnMFnSeD98MMPlynvueee4/Dhwxw+fJiAgAAcDodzm7W33nqL7OxsWrVqRVJSEnFxcXzwwQcAvP322+Tm5rqU5efnx+HDh/n9999ZunSpy7UOHTpw4sQJDh48iJubPhx59erVAHz11Vfs2bMHd3d3fv/9d/bu3UtsbCyg91SvWbOGcePGMW7cOGd5sbGxxMbG8thjj3HzzTfj7u6OUopNmzaxadMmlFL4+fk5A++SryVB+ldffcW+ffsAWLRoEfv372fTpk0YDAZSU1P55JNPXOpvtVqZPn06hw8fJiUlhbCwMOe1vLw8+vXrx969e+nUqRMrVqzA39+/vB8fAFOmTMHf39
/lmLIx/oL5/8g3pHTOdEGGAwDlUBRl6ed+dS68aFplKKXY8O9slo3PBKDvG4F0HO570Xs2fpKDwaRx82h/TsbqQ97bPehDm6H6fPITW2RBQyGEEEIIIa51lxV4b968mVGjRlGnTh0eeughVq9ejd1uR6mq6iOsOu7u7s7e3qioKCIjI12u33jjjURFRQF6r3d8fDz79u1D0zQeeuihMuXdf//9AISEhDiD+Li4OAC2bdsGwL59+/D29kbTNJ599lkACgsL+f33313Kuvvuu6lfvz6AcyX5Evfccw/u7u7UrFmTWrVqAXDq1CmX5xQXFxMREYGmabRt29Z5b0kQfiFeXl507NgRwBl4GwwG/u///o+kpCQSEhKcZZQE3tu3b3feO3DgQADatWtHs2bNANixY4fLMzw9PRk1ahSgD3M3GEr/Ke7cuZOtW7cSHh7O8uXL8fPzu2h9x44dS3Z2tssxtmvERe85X8OupT3O8Sv1ofjH1hdhs+j/nkt6pNPiLMwakMasAWmkxVkqXD7o25X98nIGsTNyMPtq3P1ZMDcMuvh87VMHitn1ZS59JgTq883/9+tlcNMwmP7eC+IJIYQQQgjxV1Lh5ZFTUlKYO3cuc+fO5ejRowDOQFvTND788EMGDx5cNbWsQnXq1HEJRNetW1em1/uJJ55g69atzJ8/H19fvYeyZ8+eLr20lVGzZk0aN25cJv2PwXV5Q7BLnD/s2mTSf4x/fPHh7u7uXHX+fIGBl96OKjo6mk2bNrFx40aUUrRq1Yr+/fszbdo0PvnkE/Ly8pz5LkdwcLBLsH0+b29v8vPzOXnyJPPnz+fJJ5+8aFlmsxmz2eyaaKp4L3VIS3cib/fi4K8FrHk7i91f5ZGVpM/7rtfeTNNb9WH+1kJFRoLNeV7i53+dI+33YoqyHc60/96VhqZp9Hjen4heXmyfk8vBX/Sg3s3LwMaPs9n4cbbe3ppGBv27pkudHDbFsgkZtBroTf2OeuAf3tnMkdWFHF9fSHayXo/wKA+EEEIIIYQQ17YKB97h4eEopZzBXevWrXnooYeYOHEiBQUFPP3001VWyep2zz338Nxzz5Genu4cNl7eMHOAb775htatW3P69GnWrVsHwA033ABAx44dOXDgAP7+/vz6668EBQUBcPbsWVavXl1mvvjlbvNV0lttt9uZPn26c7X5oqIifvnlF2699VZA750ukZ+fj7d36XZY0dHRTJ482dmTPXLkSKKiojCZTMyYMQMond9d8szPPvuMgoICfvjhBwYOHMiuXbs4fPgwoA+Nr2jbOnToQLdu3XjzzTd56qmnCAoKco4kqCr9JgcREGbiwNJ8spJseAUZiejlSben/dEMF/855J2yOwP1EtnJ+hzz4nz998VerFzy550qnYPuF1r2JcH2ebnkn7UT/c8AZ1qbIT5kJNhYMTETgwm6PuVPo+6eZe4VQgghhBBCXGNUBWmapgwGg7rpppvU3r17nekBAQHKYDBUtJhrRo8ePRSgwsPDXdLXrl2r0Af1qtmzZzvT//nPfzrTvb29VW5ubrn3eHt7q2bNmil/f38FKIPBoHbu3KmUUurQoUPKz89PAcrLy0u1bdtWhYWFKaPR6FKP8PBwBagJEyaUqXd5dSvJP3z4cKWUUkVFRap169bO57ds2VJFREQos9msAJWQkKCUUmrv3r3O8sLCwlRUVJTauHGjUkqp/Px85e7u7ry+YMECpZRSHTt2dKY9/vjjzjoUFhaqVq1aKUCZTCbVokUL5eHhoQAVGhqqTp8+rZRSasKECeV+7kopNXz4cAWoHj16KKWUeuSRRxSg3Nzc1K+//nqhH2X5JgzUj5SjlbtPCCGEEEIIIa6wSs/x3rFjB/369eOll14qMyf5evaPf/zD2Us7ePBgfHzKn5+7ePFiateuTVFREY0aNeKrr75y9jg3a9aMLVu2MHToULy8vNi/fz8Oh4O+ffvyxhtvXLG6ms1mYmJiePrpp6lfvz7x8f
FkZmbSoUMHJk+e7BzC3rp1a8aNG0ft2rVJTExk69atZGbqC3+dP88b4Oabbwaga9euzrTzh5l7eHgQExPDE088QUhICPHx8fj6+vLggw+yZcsWgoODK92OGTNm0K9fP6xWK0OGDGHTpk2X83EIIYQQQgghRLXSlKrYimhz5sxh7ty5rF+/HqWUMwgtOd+/fz/Nmzev0spWJ4vFQu3atcnOzmb16tXccsstzmvnzwtPSEigQYMG1VRL4TRxkP511HsQWnY+vRBCCCGEEEJcLRXu8R4xYgRr167l2LFjjB8/ngYNGrgs5tWyZUtatGhRJZWsbsOGDaNTp05kZ2fTvn17l6BbXKPczGByA6+Lr4guhBBCCCGEEFWtwj3e5YmJiWH27Nl899135Ofno2kadrv90jf+xWiahpubGx07dmTu3Lk0adLE5br0eF+Dss7oXwMqP8RdCCGEEEIIIa6kPxV4l8jPz2fRokXMnTuXtWvXXol6CSGEEEIIIYQQ14UrEngLIYQQQgghhBCifBXex1uIv5S1X0OH3uAbdMmsdqsi9osc9v+UT266Ha8aRpr19qTrU/64e118GYTifAcbP8nm8IpCCs7Z8Q0x0vJObzqP8sNg0hcgPLGliM2fZZORYMOS68DDz0CNRm60f9iXprfo+3A77Ip172Zx8NcCHDZo1MODXuMCnc+35DqYdWca7e73pdMombcuhBBCCCHEX0mltxMT4i8h5hvIzaxQ1mXjMtg8PYecVDsB9U0UnLOzc34eS544i3JceECIciiWPHmWnfPzKDin35uTamfz9Bx+G5fhzHf2qJWzR6141zRSs4kbxfmKpB0Wfnz2LCm7LQDEfZ/PzgV5dH/WnwHv1eDA0gK2fpHjLGPdtCy8Ao3cNNL3Mj8QIYQQQgghRHWRwFtclujoaDRNo3Hjslt1HTt2DE3T0DSNN998E9BXxS9Ja9q0aZl71q1b57z+x6Nt27ZV1o5TB4o58HMBALe8HMCjS+tw14c1AUjaYeHI6sIL3ntkdSFJO/TAeeCHNXl0aR1u+VcAAAeWFnDqQDEAbe/14enN9Xjk+xCGLw5h8Kd6+coBqXv1+08fsgJQr72Z+h3Netph6//qUcS+H/LpMynQ2YsuhBBCCCGE+OuQwFtclhEjRgBw/PhxNm3a5HJtwYIFABgMBh5++GHy8vJYvHix8/rRo0fZuHHjBctu1KgRUVFRzqN169ZXvgH/c3xDkfM8opcXAI27e2Ay6wFuwqaicu8DSNioXzN5aDTq7uFShst1d43sVBsLHjjF3CHpLBlzFgDNAKFt9SC7VnM3AJJ3WkjargfjtZq5YStWrJiUyY0P+FDnBvOfb7AQQgghhBDiqpM53uKyDBkyhDFjxpCfn8/8+fPp0qWL81pJ4N2zZ0/CwsKYM2cO+fn5mM1mmjZtyr59+5gzZw5du3Ytt+xx48Y5A/uqlptuc557BenvoTSDhmeAgdxTdnLSLrw9Xk66fs3T34Bm0AN1rxql77Jy0krLthUp0n4vdn7v5qnR780g6v4v8L5hkDfnjlqJ+SAbh13RYoAXUf/nx5bPc7AXK9rd78v3T58leYcFn9pGov/pT8OunlfgExBCCCGEEEJUNenxFpfFx8eHIUOGAPDtt99isei9tFu2bOHo0aNAaa/4nDlzALjzzjsZNWoUAIsWLaKgoOCK1MVisZCTk+NyWGx/bj/5y17q/wI31mjkxov76jNmUyjdn/PHWqhYPinDORzdYNS45eVAxmyoy9Ob69F/Sg2yk21s+28OvcYHsf7DLI6vL6Tf5CDcvTV+fP4cBRl/ro1CCCGEEEKIq0MCb3HZSgLrzMxMfv75Z6C0t9vPz4/BgweTkJDA+vXrAXjooYe47777MJlM5OTksGTJknLLfeSRR1zmeE+cOPGi9ZgyZQr+/v4ux5SN8RVqg29I6aCPggwHoC+aVpSln/vVMV7wXr8Q/VphlsO5CFtJGfq9ZQeUePobiXrUDw
8/A5YcxfY5ueWWrRyK5RMyad7Pi4ZdPDgZayE4wo0mPT2J7OeFtUCRel4PuhBCCCGEEOLaJYG3uGw9evSgYcOGAMyfPx+r1co333wDwNChQ/Hy8mLu3LkopQgODqZv374EBwfTp08foLQn/I/+OMe7Xr16F63H2LFjyc7OdjnGdo2oUBsadvVwnsev1Hvgj60vwmbRA+mGXfTraXEWZg1IY9aANNLi9N79Bv+712ZRHF9f5FLG+WX/vjiPwuzS3umU3RaKcvUA3VpYGqifb9eXeWSn2JyLtaHA4KYPZ5cF1oQQQgghhPhrkTne4rJpmsbDDz/MpEmT+PXXX5k3bx7nzp0D9N5wpRTz5s0D9F7x4OBgAIqK9CB17dq1JCUlUb9+fZdyKzvH22w2Yzb/YeEx04V7qs8X0tKdyNu9OPhrAWvezmL3V3lkJelzs+u1N9P0Vn0etbVQkZFgc54DNL3Fk7rt3EnZVcwPz54loL6JzJN6nsj+XtRu4Q7Alhk5rHgjE/+6JowmOJdgcw5Jb3Gnd5k65aTZ2PBxNr0nBOIZoLcjvLOZ4+uLyE6xkbCpCDdPjTo3uFf4MxJCCCGEEEJUH+nxFn/K8OHD0TQNq9XKs88+C0CTJk3o2rUrMTExJCQkAGCz2Zy90SXzwR0OB3Pnzq2uqjv1mxxE53/44VfHSFaSDa8gI+0e9OHu6TWdi6aVx2DUuHt6MO0e9MErSL/Xr46Rzv/wo9+bQc58zft5UaORGwUZdjJO2vAMMNCgiwd3f1aTZuetgl5ixeuZ1G9vpkX/0qD8lrGB1O9oZvagdDISrAx4rwbeNSr2ckEIIYQQQghRvTSl1GWvIyUE6Ht6x8TEOL9/4403eO211xgxYgRz586ldu3apKamYjCUvucZNGgQP/zwA02aNOHIkSOsW7eOnj17AvpQ85LecQBfX19WrlxZuUpNHASj3oPQsvuMCyGEEEIIIcTVJEPNxZ82YsQIZ+Bd3t7dd911l0vQDTB48GB++OGHcvf0Pn78OMePH3d+7+/vX8UtEEIIIYQQQoiqIz3e4vokPd5CCCGEEEKIa4TM8RbXpx73gm9gdddCCCGEEEIIIaTHWwghhBBCCCGEqErS4y2EEEIIIYQQQlQhCbyFEEIIIYQQQogqJKuai+vT2q+hQ2/wDbpkVrtVEftFDvt/yic33Y5XDSPNenvS9Sl/3L0u/m6qON/Bxk+yObyikIJzdnxDjLS805vOo/wwmPQ9wE9sKWLzZ9lkJNiw5Drw8DNQo5Eb7R/2pektngA47Ip172Zx8NcCHDZo1MODXuMCnc+35DqYdWca7e73pdMovz/54QghhBBCCCGuJunxFpdF0zQ0TWPOnDnVXZXyxXwDuZkVyrpsXAabp+eQk2r///buPK6qam3g+G8fhsNhHgUUBRxQQFJRI3NMs66Zc7fULK281q1suvXezEq7N1/tLcubad3Kq41WDmneyiwTccIRRxQcQBFEAhGQ4cCB9f5xYusR0KMJODzfz+d83Oy99trPXm4O5zlr7bXxbu5ISV4l2z87w9LHc1FVdU+BoKoUS5/IZftnZyjJs+5bmFXJxrmF/PjKKb1c7qEKcg9V4ObvgH9rJ8qLFRnbzCx/JpfMJDMAe74tZvvnZ+j1jBeD3vIjeUUJmz8q1OuIn3kaVx8Hbn7Y4zIbRAghhBBCCNFYJPFuQGVlZbz99tvExcXh6emJq6srERERPProozbPrb5SwsLC0DSNqVOnXvG661ufPn3QNI1x48bV63FOJpeT/N8SAPq+6M0jK4IZMssfgIxtZg6uLq1z34OrS8nYZk2ch87y55EVwfT9uzcAyStKOJlcDkDH+9x5amMID30bxNjFQQyfY61fVUHWLuv+OQcqAAjpbKR5V6N1XUrF73GUsXdZMXe+5qP3ogshhBBCCCGuHTLUvIHk5+fTr18/kpKSAPDw8KBVq1YcO3aMDz/8kG7dutGyZctGi6+yshIABweHRouhMR
xZV6YvR/R3BaBVLxccjRoWsyJtQ5m+/nxp6637OrpotOzlotexevppfXtglDOOzhoFWRZWPJ9HZbniVLoFAM0ATTtak+wm7ZwAOL7djEem9f+iSVsnLOWKVa/l02m0O8Exxit89kIIIYQQQoiGID3eDeTJJ5/Uk+4XXniBU6dOsWfPHgoKCli7di1t27YF4LvvvqNHjx64u7vj4uJCp06dmDdvnk1d1cO8Z86cyZgxY/Dw8KBZs2a8/vrrAKSnp6NpGkePHgXgtdde0/cBmDp1KpqmERYWxqeffkqrVq1wdnYmIyPD7hjOt2DBAv0Ya9asITY2FpPJRGxsLImJiTZlN2/ezF133YW3tzcuLi7ExsayePFim/Nbu3YtAJ988oleb3p6+uU0/QUVZVv0ZVdf66+DZtAweVuXC09U1rlvYbZ1m8nLgGawtq2r39lfqcITZ+u2lClO7C4n50AFljKFk0lj0Jt+NPs98Y4Z5kbnMe6sfaeA757PJWqQK3F/8WTTB4VUlitiR3nw7VO5zL41k/nDsklbX3dPvBBCCCGEEOLqIj3eDaCgoIBvvvkGgA4dOvDGG2/oSTBAr169APj888954IEHAAgMDMTFxYWdO3cyfvx4srOzmTx5sk29kyZNwt/fHxcXF7KysnjllVeIi4ujffv2xMXFkZSURHl5Oc2aNSMkJKRGXFlZWYwbN442bdoQGBh4WTHUZsCAAYSFhWGxWEhKSmLkyJEcOnQIR0dHNmzYwG233UZFRQVBQUEEBQWRlJTEn//8Zz755BMefPBB4uLiSE5OpqioCH9/f1q1agWA0Vh7j6/ZbMZsNtusM1oq+SP9w5f9cPs6dvRr6cQLe5tTWlDJ7sXFJLxTwE+vncK7uSOBUc4YHDT6vuhD3xd99H1+Sy1ny38KGT4ngIRZpzmSUMqQd/zZPK+Q5c/lMWFlMK6+N9YIBSGEEEIIIa5F0uPdAFJTU7FYrL2fPXv2tEm6z1Wd1MbFxXH06FHS0tIYNmwYANOmTaOkpMSmfJcuXUhPT2f//v04OVmHKq9evZrg4GASExMJDg4GYPz48SQmJtboea6oqGDu3LmkpKSQmZlJixYtLjmG2rz55pscOHCAmTNnAnD06FEOHToEwMsvv0xFRQX9+/cnIyODAwcO8Mwzz9icf2JiIrGxsQAMHDhQj736fM43ffp0vLy8bF7T16deNE4Aj6Cz3z2VnKoCrJOmlZ22LnsG153YegZZt5WertInYauuw7pvze+1TF4OxD3iiYunAXOhYuuColrrVlWKn6bk026AK+HdXTiaaCYgwonWt5mIHOBKRYkia3e5XecohBBCCCGEaFySeDcApc52g9aVdOfk5HDs2DEAhg8fjtFoRNM0Ro4cCUBpaSn79u2z2efee+/F2dkZf39/mjRpAsDJkyftjstkMjFhwgQ9rtzc3EuOoTbVPeZRUVH6uuq4tmzZAsDPP/+Mk5MTmqYxa9YsAI4fP05mZqbd8VebNGkSBQUFNq9JPSLs2je8h4u+nPqz9UuFwwllWMzW/7Pw7tbtJ/aYmTfoBPMGneDEHmvvetjv+1rMiiMJZTZ1nFv37sVnKC04O2Q9M8lMWZE1Qa8oPZuon2vHF2coyLTok7WhwOBkvXZkgjUhhBBCCCGuLTLUvAG0bdsWR0dHLBYL69evRylVZwJ+Kby9vfVlR0frf+W5Sf7FBAQEYDBc+e9equOqjglqxlXX8PfqkQGXwmg01hyG7mjfEOygaGci73Jl/w8l/DrjNEkLz3A6wxpDSGcjbfpZn7NdUao4lWbRlwHa9DXRLNaZzB3lLHsmF+/mjuQftZaJHOhKYJQzAJs+LGTVP/PxauaIgyPkpVn0IelRg91qxFR4wsK62QXcMcUHk7f1PEK7GTmSUEZBpoW0DWU4mTSCY5wvoZWEEEIIIYQQjUV6vBuAl5cX9957LwBJSUm89NJLNgnmL7/8wqFDh2jRogUAS5cuxWw2o5Tiq6++Aq
y909HR0Zd0XFdX62zcxcXFtW4/P/lv0qTJFY/hfF27dgUgNDSUNWvW6MPIFy9ezKRJkwgNDbUr9itpwDRfuj3miWewA6czLLj6OhB7vzsj5vrrk6bVxuCgMWJuALH3u+Pqa93XM9iBbo95MuB1X71cuwGu+LV0ouRUJaeOWjB5Gwjr7sKI9/1pW8uM6av+kU/zzkaiBp5NyvtO8qF5VyPzh2VzKq2CQW/54eYn93cLIYQQQghxLZAe7wYye/ZskpOT2blzJzNmzGDu3LmEhYWRkZFBfn4+8+fPZ9q0aTzwwANs3ryZ0NBQXFxc9JnJJ0+erCej9mrXrh379+/n3XffJT4+nvbt2zN//vwL7nOlYzjfP/7xD/r168fGjRsJDg4mPDyc3377jaysLHr16sWQIUP02H/88UeWLl1KbGwsTZo0YeXKlX/o2HVxcNLo8aQXPZ70qrNMi5tdeGFv8xrrje4G+k3yod8kn1r2sur9rDe9n7U/nnveD6ixzt3fgRFza64XQgghhBBCXP2kx7uB+Pr6smnTJt566y26du1KVVUVKSkp+Pj4MH78eHr16sWYMWNYvnw53bt3p6ioiOzsbDp27MjHH39s12zi53v99de55ZZbMBgMbNu2jT179lx0nysdw/l69epFQkICAwYMQNM0kpOTcXJyYsSIETz//PN6ueeff57bb78dV1dXkpKS2LZt2x8+thBCCCGEEEI0Bk1dyk3BQlwrpg6DCW9B01aNHYkQQgghhBDiBic93uL61Ps+8Kh7+LcQQgghhBBCNBTp8RZCCCGEEEIIIeqR9HgLIYQQQgghhBD1SBJvIYQQQgghhBCiHsnjxMT1ac1X0OUO8PC9aNHKCkXiR4Xs+66YouxKXP0caHuHiR4TvXB2vfB3U+XFVax/r4CUVaWU5FXiEeRA9GA3uk3wxOBofQZ4+qYyNr5fwKk0C+aiKlw8Dfi1dKLzgx606WsCoKpSEf/mafb/UEKVBVr2dqH/Kz768c1FVcwbfILYUR7cMsHzDzaOEEIIIYQQoiFJjzfQp08fNE1j3Lhx18VxrjRN09A0jalTpwIQHx+vr0tPT2/U2Oq09msoyrer6MpXTrFxbiGFWZV4N3ekJK+S7Z+dYenjuaiquqdAUFWKpU/ksv2zM5TkWfctzKpk49xCfnzllF4u91AFuYcqcPN3wL+1E+XFioxtZpY/k0tmkhmAPd8Ws/3zM/R6xotBb/mRvKKEzR8V6nXEzzyNq48DNz/scZkNIoQQQgghhGgsV0XivWDBAj2Rc3BwICMj44ofIz09XT9GfHz8Fa/fnuNERUURFxdHq1aX9oirsrIy3nnnHW699Va8vb0xGo20aNGC22+/nbfffvsKRm8fT09P4uLiiIuLw2g01ssxqttwwYIF9VJ/tZPJ5ST/twSAvi9688iKYIbM8gcgY5uZg6tL69z34OpSMrZZE+ehs/x5ZEUwff/uDUDyihJOJpcD0PE+d57aGMJD3wYxdnEQw+dY61dVkLXLun/OgQoAQjobad7V2qY5KRW/x1HG3mXF3Pmaj96LLoQQQgghhLh2XBVDzc9Nrqqqqvjkk094+eWXGy+gejJ37txL3icvL49+/fqxa9cuAFxdXYmIiKCoqIi1a9eyevVqnnvuuTr3r6ysBMDBweHygq5FbGwsiYmJV6y+xnRkXZm+HNHfFYBWvVxwNGpYzIq0DWX6+vOlrbfu6+ii0bKXi17H6umn9e2BUc44OmsUZFlY8XweleWKU+kWADQDNO1oTbKbtHMC4Ph2Mx6Z1v+zJm2dsJQrVr2WT6fR7gTH1M+XHEIIIYQQQoj61eg93mlpaSQkJADQpUsXAD755BObMgUFBTz99NOEhobi7OxMSEgIzz33HCUlJXqZlJQUBg8eTJMmTTAajYSEhDBgwAC2bNnCggULCA8P18vedtttaJpGnz59bI6jlOJ///d/adq0KT4+PowZM4aioiJ9e1VVFf/6179o3749Li
4u+Pj48Oc//5m0tDSAix6ntqHmRUVFPP/887Rq1QpnZ2f8/Pz405/+RGmptaf1ySef1JPup59+mry8PPbs2UN6ejq5ubnMnz9fr2vq1KlomkZYWBiffvqpXmdGRgY///wzPXv2pEmTJjg7O+Pp6UnPnj358ccfbdpg9+7d3HLLLbi4uNChQwfWr19f4/+srqHmP/74I71798bDwwOTyUTPnj1Zs2aNvv3c0QALFizg7rvvxtXVlfDwcObNm2dTd7WHHnpIP6f6UJRt0Zddfa2/DppBw+RtXS48UVnnvoXZ1m0mLwOawRqzq9/ZX6nCE2frtpQpTuwuJ+dABZYyhZNJY9CbfjT7PfGOGeZG5zHurH2ngO+ezyVqkCtxf/Fk0weFVJYrYkd58O1Tucy+NZP5w7JJW193T7wQQgghhBDi6tLoifcnn3yCUoqgoCA++ugjAA4dOqQnfOXl5fTp04d3332XnJwcIiMjycvL45133mHQoEFUP4Z81KhRrFixAovFQnR0NFVVVaxcuZLk5GQCAgLo2LGjfszIyEji4uKIioqyiWXRokXMmDEDFxcXTp8+zRdffMGMGTP07U8++STPPPMM+/bto3Xr1jg4OLB48WJuvfVWcnJy7D5OtepzmzlzJkeOHKFp06b4+vqyatUqzGYzp0+fZtGiRQB06NCBt99+GxcXF31/Ly+vWu8Xz8rKYty4cTg6OhIYGAjAvn372Lx5Mx4eHrRv3x6lFOvXr2fw4MF6Yl9aWspdd93F5s2bqaqqoqKigoEDB9rz38jXX3/NwIEDSUhIwM/Pj+DgYNavX0///v1tku9qEyZMYN++fTg5OZGens6ECRM4cOCAPoy9WsuWLYmLi6NTp051HttsNlNYWGjzMlvqTpjtcdkPt69jR7+WTrywtzlPbmhKr2e9qChV/PTaKX04usFBo++LPjy5rhlPbQxh4HQ/Co5b2PKfQvq/6kvCrNMcSShlwDRfnN00lj+XR8mpP3aOQgghhBBCiIbRqIm3UopPP/0UgNGjR9OxY0duuukm4Ozw84ULF7Jz506cnZ3ZvXs3u3bt0oc5//rrr/z6668AHDx4EIAVK1awY8cOsrKyOHLkCH369GHgwIF8++23+nHnzp1LYmJijaHfjo6O7N+/n0OHDtG5c2cAVq9eDVh75j/44APA+mXB3r17SU9PJyQkhOzsbGbPnm33cap99dVX7NixA4D/+7//Iz09nYMHD7Jnzx5cXV1JTU3Vh4r37NkTg8H63zV06FC957i2+6ArKiqYO3cuKSkpZGZm0qJFC4YNG0ZOTg6HDx9mx44dHDt2DA8PDywWC4sXLwbgyy+/JDMzE4DvvvuO5ORku+8hf/HFF1FK8fDDD5OWlsbhw4cZNmwYlZWVvPrqqzXKDxkyhCNHjrBu3TrAOpogPj6+xjD2V155hcTERJt2Pd/06dPx8vKyeU1fn2pX3B5BZ++2KDlVBVgnTSs7bV32DK57iL5nkHVb6ekqfRK26jqs+9a8k8Pk5UDcI564eBowFyq2LiiqUaY6hp+m5NNugCvh3V04mmgmIMKJ1reZiBzgSkWJImt3uV3nKIQQQgghhGhcjZp4r127Vh+m/cADD9j8u2jRIkpKStiyZQtg7R2OiIhA0zSbXuXqJG3QoEGAdXh3ZGQkI0aMYOXKlQQHB9sdT9++fWnWrBkGg4F27doBcPLkSQC2bdum966PHTsWTdPw8PDg+PHjNnFcis2bNwNgNBpt7tOOjo7G2dnZpmx10g3Qtm1bOnToUGe9JpOJCRMmANZJygwGA2azmXHjxtGkSRMcHBzw9fXVh9FnZWUB1l5xsN5H/qc//QmAe++996Ln8dtvv+lDzv/zn/9gMBgwGAx6slx9nue6//770TTNZjRAdVtfqkmTJlFQUGDzmtQjwq59w3ucHUGQ+rP11oXDCWVYzNb/6/Du1u0n9piZN+gE8wad4MQe64RoYb/vazErjiSU2dRxbt27F5+htO
Bs73RmkpmyImuCXlF6NlE/144vzlCQadEna0OBwck6nF0mWBNCCCGEEOLa0qiTq53bU1t9H7TFYr0vtrCwkKVLl+rbnZ2dax1u7OPjA8Cnn37K4MGDiY+PJzk5mR9++IGlS5eyd+9e5syZY1c83t7e+rKjo7VpqpPtc3Xs2LHGbN6hoaF2HaMu597XXK1t27Y4ODhQWVnJxo0b9fVvvPEGDz30EJGRkbXWFRAQYJOoAwwcOJBDhw7h6OhITEwMLi4uJCUlUV5erveqXygWe7Vs2ZKAgIAa68vLbXtnq9u6up2h9ra2h9ForDm7uqN9k8kFRTsTeZcr+38o4dcZp0laeIbTGdZrMKSzkTb9rM/ZrihVnEqz6MsAbfqaaBbrTOaOcpY9k4t3c0fyj1rLRA50JTDK+uXJpg8LWfXPfLyaOeLgCHlpFn1IetRgtxoxFZ6wsG52AXdM8cHkbT2P0G5GjiSUUZBpIW1DGU4mjeAY5xr7CiGEEEIIIa4+jZZ4nzlzRh/iDNYJ1M63YMECxowZA1hn5547dy6xsbGA9RFb33//Pf369QNg3bp1DBs2jJEjRwIwY8YMJk2apE/c5up6dmbq4uLiS463c+fOaJqGUopx48bx9NNPA+j3Snt5eV3yceLi4pg7dy5ms5lZs2bpvd779++nVatWeHl5ce+997Jw4UK2bdvGlClTePXVVy86Q/n5iXNeXh6HDh0C4B//+AeTJk0iPT1d79WvFh0drce9atUq7rjjDpv/o7oEBAQQGhrK0aNHiY2NZeHChXpCnZqaytGjR2v04F+MyWSitLT0sv6vLtWAab54t3AkeUUxpzMsuPo6ENHfRM+nvPRJ02pjcNAYMTeA9bMLSP25lNMZFjyDHYga5Ea3Rz31cu0GuHIkoYzCExYqShUmbwOBUc50HuNOy56mGvWu+kc+zTsbiRp4NinvO8mHitJTzB+WjUegA4Pe8sPN78rNVC+EEEIIIYSoR6qRzJ8/X2Ht91N79+612TZr1iwFKIPBoNLT09VNN92k/xwdHa0iIiKU0WhUgEpLS1NKKdWsWTNlMplURESE6tixo3JyclKAGj16tFJKqaqqKuXn56cA5ePjo26++Wb17rvvKqWU6t27twLU2LFj9RjGjh2rABUaGqqvmzBhgh5zeHi4iomJUZ6engpQ8+fPv+TjmM1mFRsbq9cZFham2rRpowwGg8rPz1dKKZWbm6ufP6A8PT1Vx44dVWBgoL6u+thTpkypEXN1TCEhIQpQTk5Oqn379srHx0e5ubnZxFNSUqKaNm2qAOXs7KyioqL0MoCaMmWKUkqpNWvW6Ouq2/+LL77Q1wUEBNjEWF1/WlqaXmbNmjV6fOfXr5RSnTp1UoByd3dXXbt2VZMmTbrwBXW+KUOVyjx0afsIIYQQQgghRD1otHu8q4eZR0RE6D2t1YYPHw5YJ9z67LPPWLt2LU899RTNmzcnNTWV/Px8unTpwrRp0/RZux966CGio6PJzc0lOTmZoKAgJkyYwHvvvQdYe4E/+ugjWrduTWFhIVu2bOHo0aOXFPP777/PO++8Q0xMDFlZWRw9epSwsDCee+45faj8pRzH2dmZNWvW8Le//Y3w8HAyMzPJy8vj9ttv14dO+/n5kZiYyBtvvEHnzp2pqqriwIEDmEwm7rzzTj744AOGDh16wbg1TWPJkiV07dpVH7r+xRdf4O/vb1POZDLx/fff07VrV33dhSY1O9fo0aP573//S+/evSktLSUlJQUPDw8efPBBxo8fb1cd53r33XeJiYmhvLycrVu3kppq32RpQgghhBBCCHG10ZS6zBtrhbiaTR0GE96Cpq0aOxIhhBBCCCHEDa7Rn+MtRL3ofR94+DR2FEIIIYQQQgghPd5CCCGEEEIIIUR9kh5vIYQQQgghhBCiHkniLYQQQgghhBBC1KNGe463EPVqzVfQ5Q7w8L1o0coKReJHhez7rpii7Epc/Rxoe4eJHhO9cHa98HdT5c
VVrH+vgJRVpZTkVeIR5ED0YDe6TfDE4Gh9Bnj6pjI2vl/AqTQL5qIqXDwN+LV0ovODHrTpa32Od1WlIv7N0+z/oYQqC7Ts7UL/V3z045uLqpg3+ASxozy4ZYJnnfEIIYQQQgghrj7S492AwsLC0DSNqVOnNnYoV0R8fDyapqFpGvHx8Ze077hx49A0TX8M2xW39msoyrer6MpXTrFxbiGFWZV4N3ekJK+S7Z+dYenjuaiquqdAUFWKpU/ksv2zM5TkWfctzKpk49xCfnzllF4u91AFuYcqcPN3wL+1E+XFioxtZpY/k0tmkhmAPd8Ws/3zM/R6xotBb/mRvKKEzR8V6nXEzzyNq48DNz/scZkNIoQQQgghhGgsN2zi3adPHz1p7NChg822vLw8TCaTvv3FF1+0u95zk9H09HSbbZ06dSIuLo6QkJArcQo1VCezmqbRpEkTzGazvs1isdCsWTN9+8iRI+slhmvNyeRykv9bAkDfF715ZEUwQ2ZZn2+esc3MwdWlde57cHUpGdusbTx0lj+PrAim79+9AUheUcLJ5HIAOt7nzlMbQ3jo2yDGLg5i+Bxr/aoKsnZZ9885UAFASGcjzbtan+Gek1Lxexxl7F1WzJ2v+ei96EIIIYQQQohrxw2beJ9r9+7dJCQk6D9//PHHlJWVXfHjfPvttyQmJjJ+/PgrXvf5fvvtN77++mv95yVLlpCVlVXvx73WHFl39v85or8rAK16ueBotCa4aRvqvg7S1lu3ObpotOzlYlOHzXZnjYIsC5+PPskn92Sz9MlcADQDNO1oTbKbtHMC4Ph2Mxlbrcl4k7ZOWMoVq17Lp9Nod4JjjH/8hIUQQgghhBAN7oZPvJ2crAnP7NmzAaisrGTu3Ln6+nOdOnWKJ554gubNm+Pk5ERgYCBjxozh2LFjAEydOpXbbrtNLx8eHo6maYwbNw6ofaj5sWPHePDBBwkKCsLJyYmQkBAef/xxTp06O1T53GHZc+bMISwsDA8PD+6++26ys7NrxOno6GhzTucu13ZepaWlTJ48mdatW+Ps7Iyvry9Dhw5lz549NuW++eYbWrZsiclk4q677iIzM7NGXVOnTkXTNMLCwvR1FxoFcC6z2cyUKVNo06YNzs7ONGnShIcffpjc3Nw69/mjirIt+rKrr/XXQTNomLyty4UnKuvctzDbus3kZUAzWBN1V7+zv1KFJ87WbSlTnNhdTs6BCixlCieTxqA3/Wj2e+IdM8yNzmPcWftOAd89n0vUIFfi/uLJpg8KqSxXxI7y4Nuncpl9aybzh2WTtr7unnghhBBCCCHE1eWGn1ytY8eO5OXlsWzZMo4fP87WrVs5duwYo0aNYuHChXq5srIyevfuzd69e3F0dCQiIoIjR47wxRdfsGbNGnbu3ElISAiRkZHs379fr9toNNKqVataj52Tk0O3bt3IysrCaDQSERFBamoq77//PuvWrWPr1q24uLjo5Tdu3MjmzZtp3rw5Z86c4fvvv+dvf/sbX3zxhU29fn5+REZGEh8fT2JiIkajkQ0bNujHOnr0qE35wYMH88svv6BpGm3btuX48eMsX76c1atXs3XrVtq1a8fOnTsZNWoUVVVVeHl5kZqayqOPPnql/hsAGD58OD/88AMODg5ER0eTnp7O/Pnz2bx5M9u2bcNkMtW6n9lsthlWD2C0VPJH+ocv++H2dezo19KJF/Y2p7Sgkt2Li0l4p4CfXjuFd3NHAqOcMTho9H3Rh74v+uj7/JZazpb/FDJ8TgAJs05zJKGUIe/4s3leIcufy2PCymBcfR0uN1IhhBBCCCFEA7nhe7wNBgNPPPEEFouF999/X+8Znjhxok25hQsXsnfvXgAWLVrEvn372LBhAwaDgaysLN577z3Gjx/P3Llz9X2qh5a/8sortR57zpw5ZGVlYTAY2LhxI/v27WPRokUA7N271ybxB2tvfGJiIqmpqQwbNgyA1atX11p3dfyzZ8+u85wA1qxZwy+//A
LA22+/zf79+9m/fz/u7u6cOXOG6dOnAzBz5kw96U5JSeHQoUMMHz68rma9ZGvXruWHH34A4Ndff2XXrl0cOHAAk8lEcnIyX375ZZ37Tp8+HS8vL5vX9PWpdh3XI+jsd08lp6oA66RpZaety57BdSe2nkHWbaWnq/RJ2KrrsO5b83stk5cDcY944uJpwFyo2LqgqNa6VZXipyn5tBvgSnh3F44mmgmIcKL1bSYiB7hSUaLI2l1u1zkKIYQQQgghGtcNn3gDPPzww7i5uTF79mzWrFlD586d6datm02ZrVu3AuDq6srQoUMBiI2NpW3btgBs27btko9bXWfbtm2JjY0FYOjQobi6utZaZ0xMjD4RXFRUFAAnT56ste4hQ4bQokULFi1axMKFCwkODuaee+6pMwaA0aNHAxASEkLPnj1tYti3bx8A3bt3JzAwEIA///nPl3rKddqyZYu+3Lt3bzRNo2nTppSWWodUJyYm1rnvpEmTKCgosHlN6hFh13HDe5wdUZD6s3WStcMJZVjM1kQ6vLt1+4k9ZuYNOsG8QSc4scfaux72+74Ws+JIQplNHefWvXvxGUoLzg5Zz0wyU1ZkTdArSs8m6ufa8cUZCjIt+mRtKDA4WYezywRrQgghhBBCXFtu+KHmAN7e3owZM4Z///vfQO09w1cDb29vfbn6Pu66ODg48Ne//pVJkyZRUVHBo48+Wuv93VeaplmTwsrKs4lmQUHBJdURFxdXY11QUFCd5Y1GI0bjeQPLHe0bgh0U7UzkXa7s/6GEX2ecJmnhGU5nWO/NDulspE0/6/D2ilLFqTSLvgzQpq+JZrHOZO4oZ9kzuXg3dyT/qLVM5EBXAqOcAdj0YSGr/pmPVzNHHBwhL82iD0mPGuxWI6bCExbWzS7gjik+mLyt5xHazciRhDIKMi2kbSjDyaQRHONs1zkKIYQQQgghGpf0eP/uySefBCAgIKDWR2117doVgJKSEpYtWwbAjh07SElJAaBLly4Aem81QHFx8QWPWV1nSkoKO3bsAGDZsmWUlJTY1Hm5xo8fj4uLC05OTnXej10dA6AP5z5+/Djr1q2ziSE6OhqADRs2kJOTA8DixYtr1NekSRPAev96dcJdW7kLxTFp0iQSExNJTExk/fr1TJ06lUceeeSidVyuAdN86faYJ57BDpzOsODq60Ds/e6MmOuvT5pWG4ODxoi5AcTe746rr3Vfz2AHuj3myYDXffVy7Qa44tfSiZJTlZw6asHkbSCsuwsj3ven7TmzoFdb9Y98mnc2EjXwbFLed5IPzbsamT8sm1NpFQx6yw83P7m/WwghhBBCiGuCukH17t1bASouLk5fl5eXpwoKCvSfsfZLqr///e+qtLRUtW/fXgHK0dFRRUVFKRcXFwWopk2bqpycHKWUUrm5ucrJyUkBKigoSMXFxalFixYppZQKDQ1VgJoyZYpSSqmTJ0+q4OBgBSij0aiio6OVo6OjAlT79u1VaWmpUkqpsWPHKkD17t1bj23KlCl6fNWqywUGBurr8vPzVX5+vv5zdQz33Xefvu72229XgNI0TUVGRioPDw8FKHd3d7V//36llFI7duxQmqYpQHl5eanWrVsro9Gox7BmzRqllFL79+9XBoNBASo8PFx17txZ/xlQaWlpdZ7TnXfeqZdr27atioqKUm5ubjb1223KUKUyD13aPkIIIYQQQghRD6TH+xy+vr54enrWus3FxYW1a9fy+OOPExQURGpqKh4eHtx///1s2rSJgIAAwDqj+Lvvvkvz5s05efIkmzdvrvWRX2DtHU5MTOSBBx7A29ublJQUAgMDeeyxx1i7dq3NjOaXy9vb22aIem2+++47XnrpJcLDwzl48CCOjo4MGTKEjRs30q5dOwA6derEl19+SVhYGGVlZYSGhvL+++/XqKtdu3Z8+OGHhIWFceLECfz9/W0mnLuQZcuW8eqrr9KmTRuOHDlCdnY2kZGRvPzyy7Rv3/6Sz10IIYQQQg
ghrgaaUuqyn5wkxFVr6jCY8BY0rf1RbkIIIYQQQgjRUKTHW1yfet8HHj4XLyeEEEIIIYQQ9Ux6vIUQQgghhBBCiHokPd5CCCGEEEIIIUQ9ksRbCCGEEEIIIYSoR46NHYAQ9WLNV9DlDvDwvWjRygpF4keF7PuumKLsSlz9HGh7h4keE71wdr3wd1PlxVWsf6+AlFWllORV4hHkQPRgN7pN8MTgaH0G+JnfKvl1Rj7Ze8spyKwEoN2fTAx6y9+mrsSPCkn66gzlZ6pocbOR/lN8cfe3Pqu7yqL49N6TBN/kzJ1TL35OQgghhBBCiKuH9HiLetOnTx80TWPcuHENf/C1X0NRvl1FV75yio1zCynMqsS7uSMleZVs/+wMSx/PRVXVPQWCqlIsfSKX7Z+doSTPum9hViUb5xby4yun9HLFeZWk/FQKGjgatVrrSt9Yxrp/FRAzzI0xXwZyOKGM+DdP69u3/KeIkvxK+vzN265zEkIIIYQQQlw9JPG+BlQnsGFhYTbrx40bh6ZpaFrtyZy4uJPJ5ST/twSAvi9688iKYIbMsvZEZ2wzc3B1aZ37HlxdSsY2MwBDZ/nzyIpg+v7dG4DkFSWcTC4HwDfMkSfXN2XCyqa4+tX+K5dzwFo2JNaIXysnXH0N/JZiXZd/tIJN/y7k9pd8MHrIr6wQQgghhBDXGvkUL64J5eXl9VLvkXVl+nJEf1cAWvVy0Xum0zaU1bofQNp66zZHF42WvVxs6jh3u5OLAZO3wwXjaNLOGYDjO8zkHa6g5FQVAW2dUUrx09R8wnu42NQthBBCCCGEuHZI4n0dqe4B79OnD3PmzCE0NBQXFxcGDBhARkaGXi4xMZF+/frh5+eHi4sLYWFhDB06lMOHD+tltm3bxpAhQ/Dz88NoNNKyZUtmzpwJQGlpKUOHDiU8PBw3NzeMRiNt2rTh1VdfvWiCXFBQwNNPP01oaCjOzs6EhITw3HPPUVJSUut5/N///R8hISG4uLhc4dayKsq26MuuvtZfB82gYfK2LheeqKxz38Js6zaTlwHNYE3Uz+3RLjxhqXW/2oTd6kLPp73Ys7SYz0efpGVPF/q84M2eJcXkHCin17Ne/PhyHu/1zOSjASfYu7zY/pMUQgghhBBCNCqZXO06lJiYyJYtWwgLC6OiooKVK1cydOhQtm3bhlKKu+++m7y8PAIDA4mMjCQzM5Ply5fzzDPP0KpVKzZu3Mhtt91GeXk5zs7OtGnThuzsbNatW8ff/vY3zGYzy5cvJzAwkIiICHJzczl06BD//Oc/KS0t5c0336w1rvLycvr06cPOnTtxcXEhMjKS1NRU3nnnHXbt2sUvv/xiM2x+06ZNrFu3jrZt21JWVnfPs9lsxmw226wzWiox/oE2vOyH21/2jnDLXzy55S+e+s9nciuJf/s0vZ/zZs/SYvYuK+FP//Th8NoyVr5yiqBoZ/xbO13+AYUQQgghhBANQnq8r0OVlZVs3bqV5ORk5s6dC8COHTv46aefyM/PJy8vD4Dt27eTlJRETk4Oe/fuJSoqCoCXX36Z8vJyvL292bNnD3v37iUnJ4fXXnsNADc3N/bt20d2djZJSUlkZGQwZswYAL766qs641q4cCE7d+7E2dmZ3bt3s2vXLhITEwH49ddf+fXXX23Kl5eX89///pfk5GROnjxZZ73Tp0/Hy8vL5jV9fapdbeURdPa7p5JTVYB10rSy09Zlz+C6h4h7Blm3lZ6u0idhq67Duu8f+15r9bR8AiKcuekeN44mluHiZSBmmDvth7qhquDY5rq/jBBCCCGEEEJcPSTxvgZc6uRpMTExREdHAzBq1Ch9/Z49e/Dz86Nbt24AtG7dmpiYGEaNGkVSUhL+/tZJxTZv3gzAPffcQ0REBAAGg4EOHTroy59//jkREREYjUY0TePzzz8HICsrq864tmzZAlgT6oiICDRNo2PHjvr26iS8Wtu2bRkwYAAADg51J8CTJk2ioKDA5jWpR8RFWs
kqvMfZIeypP1uHux9OKMNitibS4d2t20/sMTNv0AnmDTrBiT3W3vWw3/e1mBVHEsps6ji/7kt18NdSjiSUcedUHzRNQylw+L1z20HGqQghhBBCCHFNkY/w1wA3NzcATp06ZbO+uufa3d39kupbvXo1X375JRs2bCA5OZnFixfz1VdfceLECV544YWL7j9jxgymT58OQGhoKEFBQRw/fpzMzEyqqqousjc4OzvTqVOnGut9fHxsfg4MDLTrfIxGI0bjeQPLHS88mVm1oGhnIu9yZf8PJfw64zRJC89wOsN6b3ZIZyNt+pkAqChVnEqz6MsAbfqaaBbrTOaOcpY9k4t3c0fyj1rLRA50JTDKOmFa0UkLX437DYAzOdb7wg8nlPHRgBMA/OXHYJuYzGeq+GVaPt0e9cQ3zJpth97iwtb5RWTvK+dIQhmaAZrf/EcG0wshhBBCCCEaivR4XwOqe4WLior4+OOPsVgsbN++nTVr1gDoPdHV9uzZw/79+wH4+uuv9fUxMTEopdi4cSPjxo3jP//5D4mJiTzyyCMAJCQkABAXFwfAkiVLOHToEABKKXbv3g2c7ZmOiIggPT2dDRs21IihNl27dgWsQ+Hnzp1LYmIiiYmJxMfH88ILLzB69Gib8g31mLQB03zp9pgnnsEOnM6w4OrrQOz97oyY669PmlYbg4PGiLkBxN7vjquvdV/PYAe6PebJgNd99XJVFjidYeF0hoWq3+dbqyhR+rrzJbxzGpOXgZsf9tDX3fqYJ5EDXflmfA6H4ku5c6oPAW2cr1wjCCGEEEIIIeqNppT6A9NBiYZw/PhxOnTooPd4W4ceK315xYoVDBw4kHHjxvHJJ5/g5uZGVVUV4eHhHDhwgKqqKjp27MiOHTuorKzEyckJDw8PmjdvjsFgIDk5maqqKl566SWmTZtWY3K1iIgIsrOz6d69O8uWLWPy5Mn87//+L4A+gVtpaakeX3Vsffr0Ye3atYwdO5YFCxZgNpu5+eab2b17NwaDgcjISCoqKjh69Chms5m0tDTCwsL08+jduzfx8fGX12hTh8GEt6Bpqz/W+EIIIYQQQgjxB0mP9zUgJCSEjRs3ct999xEYGIjBYMDLy4u+ffvy448/MnDgQJvyXbp04d1336W4uBgnJyfuuOMOli1bhqZpODg48NhjjxEeHk5mZiaHDh0iLCyM559/nldffRWAW2+9lQ0bNjBo0CDc3d1JSUnB3d2dHj16APDSSy8xduxYvL29KSwsZOTIkTz++OMXPQ+j0cjatWt56qmnaN68OampqeTn59OlSxemTZtm99ByIYQQQgghhLiWSI/3deSK9BRfL6THWwghhBBCCHGVkB5vcX3qfR94+Fy8nBBCCCGEEELUM5nVXFyfbhvZ2BEIIYQQQgghBCBDzYUQQgghhBBCiHolQ82FEEIIIYQQQoh6JIm3EEIIIYQQQghRjyTxFkIIIYQQQggh6pEk3kIIIYQQQgghRD2SxFsIIYQQQgghhKhHkngLIYQQQgghhBD1SBJvIYQQQgghhBCiHkniLYQQQgghhBBC1CNJvIUQQgghhBBCiHokibcQQgghhBBCCFGPJPEW1x2z2czUqVMxm82NHcp1T9q64UhbNxxp64Yjbd1wpK0bjrR1w5G2bjjS1n+cppRSjR2EEFdSYWEhXl5eFBQU4Onp2djhXNekrRuOtHXDkbZuONLWDUfauuFIWzccaeuGI239x0mPtxBCCCGEEEIIUY8k8RZCCCGEEEIIIeqRJN5CCCGEEEIIIUQ9ksRbXHeMRiNTpkzBaDQ2dijXPWnrhiNt3XCkrRuOtHXDkbZuONLWDUfauuFIW/9xMrmaEEIIIYQQQghRj6THWwghhBBCCCGEqEeSeAshhBBCCCGEEPVIEm8hRL2aOnUqmqYRFhbW2KFc98aNG4emafTp06exQ7nuSVs3Lk3T0DSNBQsWNHYo170+ffqgaRrjxo1r7FCue2FhYWiaxtSpUxs7lO
uetHXDkffrsyTxFteNr776itjYWEwmE76+vtxzzz0cPny4scO66s2cOZM+ffoQHByM0WgkNDSUsWPHcuTIEZty1R++zn/16NGjkSK/9lR/CVHby2Kx6OUqKip47bXXaNmyJc7OzoSEhPDss89y5syZRoz+2pGenl5nO5//Qav6w9f5rzFjxjTeCVzFEhISuOuuuwgICNDb6oMPPqhRzt5r+NChQ9xzzz34+vpiMpmIjY3l66+/bqjTuarZ09ZFRUU888wzdO7cGX9/f0wmExEREbzyyisUFRXZlK3r9+Hll19uyNO6Ktl7Xdv7d/DkyZM8/PDDNGnSBKPRSFRUFO+9915Dnc5VzZ62XrBgwQXfw+Pj44ELv9d//PHHjXB2Vxd7P9/J+3XDcWzsAIS4EubNm8f48eMBCA8PJy8vjyVLlrBu3Tp27dpFUFBQI0d49Zo9ezbHjh2jbdu2mEwm0tLS+PTTT1m1ahUpKSl4enralG/ZsiUBAQH6z9HR0Q0d8jXP39+fVq1a2azTNE1ffvjhh/n8888xGAy0adOGI0eOMGvWLJKSkvj1118xGOQ70wsxGo3ExcXZrDt9+jQpKSkABAcH19gnMjLS5lpv3bp1/QZ5jdqxYwc///wzLVu2JDc3t85y9lzDJ06coHv37uTk5ODp6UlwcDBJSUmMHDmS4uJiHn744QY8s6uPPW2dl5fHv/71L4xGI+3atSMzM5ODBw/y+uuvs337dn744Yca+3Ts2NFmVuLmzZvX2zlcK+y9rqtd6O9gcXExvXv3JiUlBZPJRGhoKPv372fixInk5OTwj3/8o17O4VphT1sHBATUeA8/duwYJ06cAKj1M9355Zs0aXKFIr522fv5Tt6vG5AS4hpnNpuVv7+/AtSIESOUUkplZmYqDw8PBaiJEyc2coRXt9dff10dPXpU//mZZ55RgALU0qVL9fW9e/dWgJo/f/4l1T9lyhQFqNDQUKWUUsXFxapHjx4KUK1atVLHjh27EqdxTahui7Fjx9ZZZvv27Xr7z549Wyml1HfffaevW7JkSZ37jh07VgGqd+/eSimlTp48qdq1a6cA1bVrV5Wfn38Fz+ba8sQTTyhA+fj4qKKiIn19aGioAtSaNWsuqb4bta1zc3NVSUmJSktL06/J999/36aMvdfwxIkTFaA8PDxUZmamUkqpESNGKED5+/srs9lcZxzVdVW/Hy1dulQ5OjoqQL3++uv1cOYNz562PnHihHrzzTdVYWGhUkqp0tJSdcstt+jlT506pZetXpeWlnZJcVS/91e/b6WmpqqgoCAFqLvvvluVlZX9ofO8GtjT1krZ93dw5syZClCapqldu3YppZR67rnnFKCcnJxUdnZ2nftWvx9NmTJFKaXUxo0blZubmwLUhAkTVFVV1R86z6uBvW19vpiYGAWo/v376+vOreNS3Qhtbc/nO3m/bljSbSKueVu3btW/NR0xYgQATZs25ZZbbgFg5cqVjRbbtWDy5Mm0aNFC/7lnz576cm3Panz22WcxGo20bNmSCRMmcPLkSbuPZTabGTp0KOvXr6d169bEx8ffkL0tS5YswWQyERwczN13301SUpK+7ccff9SXq6/ngQMH4uLiAth/Pefn53PHHXdw4MAB4uLi+Pnnn/H29r5yJ3ENycvLY/78+QD89a9/xd3dvUaZESNG4OLiQkREBP/zP/9DYWGh3fXfSG3t5+eHyWS6YBl7r+Hqct26daNp06YADB8+HIDc3Fy2bdtmV0wrV65k5MiRWCwWpk+fzuTJky/hjK5e9rR1UFAQzz//PB4eHgC4uLjQtWtXAAwGA46ONQc2dunSBVdXV6Kjo5kxYwZms9numI4ePUq/fv3Izs5m8ODBLFmy5Lp4pq89bX2uC/0drL6u27Rpw0033QSc/T2oqKhg9erVdh0jKSmJAQMGUFxczF//+lc++OADm5FR16pLbWuw/o7v2bMHgBdeeKHWMgEBAbi7u9OpUyc+/PBDqq
qq7K7/em1rez7fyft1w5LEW1zzMjIy9OVzhxYFBgYC1uFJwj6VlZV8+OGHgHUoXb9+/Wy2m0wmmjVrRkBAAGlpaXz00Ud069aN4uLii9ZtsVi47777+Pnnn2nTpg3x8fGEhITUy3lczRwcHAgKCiIsLIzs7Gy+//57unXrpifftV3PBoMBf39/wL7r+cyZMwwYMIBdu3Zxyy23sGrVKry8vOrhbK4Nc+fOpaSkBKPRyMSJE2ts9/DwoFmzZnh5eXHw4EHefPNN7rzzTrs+uElb12TvNVxdrrb37XPLXUhCQgLDhw+nvLycN954gxdffPGPn8A1LCcnhyVLlgAwcuRIPSGv5uPjQ0hICEajkeTkZCZNmsSDDz5oV93Z2dn069ePjIwMhgwZwuLFi3F2dr7i53C1u9jfwStxXe/fv5877riDgoICHn/8cebMmXNdJIKX68033wSgQ4cO9O/fv8b2Jk2a6Mngzp07efTRR5k0aZJddd8obV3X5zt5v25YkniL65ZSqrFDuKYUFxczbNgwfvrpJ4KCglixYoVNT8Y777xDfn4+e/fuJSMjQ/+jlpaWxrfffnvR+jMzM1m+fDnu7u6sWbOGZs2a1du5XK1Gjx5NTk4OBw8eZP/+/fo3yWazmTlz5lxw30u5nrdv387mzZsJDQ3lp59+qnGf/o3k3LYdM2ZMjXsDFy9eTH5+Prt37yYzM5MHHngAgMTERDZu3HjR+qWt7WfPNXyp79vz58+ntLSUZ599lv/5n/+53NCuC4cPH6ZHjx5kZWXRvXv3GhNWJSYmkpeXx86dO8nMzKRv374AfPPNNzYfvuvy008/cfjwYW6++WYWLVqEk5NTvZzH1exy/w5e6nX9zTffkJuby/Dhw6/bRNBe1fcZAzz//PM22wICAti9ezcnT55k165dHDt2jKioKMB6f3N5eflF678R2vpin+9qI+/X9UMSb3HNO3eock5OTo3lc4fZiNplZ2fTu3dvVqxYQUREBBs2bND/eFXr1KmT/kataRqjR4/Wt9nzTaeLiwsODg6cOXOGWbNmXdH4rxURERH4+vrqP9955534+fkBZ9uwtuu5qqqKvLw8wL7r2c3NDbAOC/3ss8+uTPDXqE8//ZSTJ0+iaRp/+9vfamzv0qULDg4OADg6OnLvvffq2+y5rqWta7L3Gq4uV9v79rnlLqT6toGFCxfe0E+x2LRpE7fccgsHDx5k0KBBrFq1qkZvd1xcnJ5UuLq6MmzYMH2bPYl3dVvXNWnbjcCev4NX8rpetWoVmzdv/uOBX8PeeustwNquI0eOtNnm5uZGTEyM/rOvry8DBgwAoLS01K6J8q73tr7Y5zt5v25YkniLa17Xrl315KV6iF1WVhaJiYkA/OlPf2q02K4F+/bt45ZbbmH79u307NmTTZs20bJlS5syOTk5vP322zaPpzn3ERL2PKM7MDBQ74F56623mDFjxpU5gWvIG2+8YZPM/fzzz/oft+o2PPd6rb6ev//+e8rKympsr0uXLl30RwRNnDiRhQsXXpH4rzVKKWbOnAlY71mLjIy02b5v3z7mzZun3+NaWVnJ4sWL9e32XNfS1jXZew1X/7tp0yaysrIAWLp0KWCd+b9Lly4XPdY///lP2rdvT3Z2Nv3799dnPb6RLF68mL59+5Kbm8vEiRNZtmwZrq6uNmUSEhJYvHgxlZWVAJSVlbF8+XJ9e2ho6EWPM3z4cB544AEqKysZOXIka9euvbIncpWz9+9g9XV98OBBdu/eDZz9PXBycqpxC1dtJk6cSN++fTlz5gx33XUXycnJV+o0rinHjh3jm2++AeDpp5+uMWfB8uXLWbVqlf7z6dOn9ZFkbm5uNjPP1+V6bmt7Pt/J+3UDa7x53YS4cv7973/rsyaGh4crT09PfabF6tkXRe0iIiL0tuvYsaOKi4vTXx999JFS6uzMoY6Ojqpdu3aqefPm+j6RkZGqtLS0zvrPn9V86tSp+r7V9d8oQkNDla
ZpqkWLFioyMlJpmqYA5ebmpvbt26eXGzVqlAKUwWBQ7dq1U05OTgpQPXv2VJWVlXXWf/5M2w899JA+k+4PP/xQ36d31Vm+fLl+ra1du7bG9jVr1ihAGY1GFR0drQIDA/Xyffv2veCstjdqWy9ZskS1atVKnxEYUAEBAapVq1Zq9OjRejl7ruHjx4/rT6Tw9PRU4eHhep0ffvjhBeOoLjd//nyVkZGhQkJCFKBiYmJsZvK+ltnT1pmZmfr7iLOzs837d1xcnNq+fbtSSqn58+fr7zUxMTHKx8dHr/Ohhx66YBznzmpeXl6u+vfvr/+fVdd/rbOnre39O1hUVKTatGmjAGUymWz+xr700ksXjOPcmbYLCgpUhw4dFKCaNWum0tPT670dGoK97yFKKfXss88qQHl5eekz95+r+vOFl5eXuummm5S7u7te52uvvXbBOG6Etrbn851S8n7dkCTxFteNzz//XHXs2FEZjUbl5eWlhg8frlJTUxs7rKveuX/8zn9VP2bjzJkzavLkyapr167K19dXmUwm1a5dO/Xiiy9e9E3z/MRbKaX+8pe/6G/yixYtqsezu7r8+9//VrfffrsKDg5WRqNRhYWFqfvvv18dOHDAplx5ebl69dVXVVhYmHJyclJNmzZVTz31VK0fPM51fjJYUVGhBgwYoADl6uqq1q9fX1+ndlXq2bOnAtTNN99c6/bs7Gz13HPPqZtuukl5eXkpd3d3FRMTo6ZPn65KSkouWPeN2tbVCVxtr+q2UMr+azglJUUNHz5ceXl5KaPRqDp27Ki++OKLi8Zx7gc5pZTau3ev8vb2VoDq1q2bKi4uvpKn3SjsaetzH6dU26v6MXkHDx5Ujz32mIqMjFTu7u7Ky8tLde7cWX3wwQeqoqLignGc/zixwsJCFRsbqydMKSkp9dgKDcOetr6Uv4NZWVlq7Nixyt/fXzk5Oal27dqpWbNmXTSO8x9xlZWVpa9r06aNOnny5JU+9QZn73vI6dOn9cfCvvDCC7XWtW3bNjV27FjVunVr5erqqvz8/NStt96qvv7664vGcSO0tT2f75SS9+uGpCklM1AJIYQQQgghhBD1Re7xFkIIIYQQQggh6pEk3kIIIYQQQgghRD2SxFsIIYQQQgghhKhHkngLIYQQQgghhBD1SBJvIYQQQgghhBCiHkniLYQQQgghhBBC1CNJvIUQQgghhBBCiHokibcQQgghhBBCCFGPJPEWQgghhBBCCCHqkSTeQgghhGgQZWVlvP3228TFxeHp6YmrqysRERE8+uijHDlypNHi0jQNTdNYsGBBo8UghBDi+ubY2AEIIYQQ4vqXn59Pv379SEpKAsDDw4NWrVpx7NgxPvzwQ7p160bLli0bOUohhBCifkiPtxBCCCHq3ZNPPqkn3S+88AKnTp1iz549FBQUsHbtWtq2bQvAd999R48ePXB3d8fFxYVOnToxb948m7pq66Hu06cPmqYxbtw4ANLT023K3X333bi6uhIeHq7XFx8fj6Zpeh0PPfQQmqYRFhYGQEpKCoMHD6ZJkyYYjUZCQkIYMGAAW7ZsqadWEkIIcb2SxFsIIYQQ9aqgoIBvvvkGgA4dOvDGG2/g6Hh20F2vXr3o1q0bn3/+OUOGDGHDhg24u7sTFBTEzp07GT9+PNOmTbvs40+YMIF9+/bh5OREeno6EyZM4MCBA3h6ehIXF6eXa9myJXFxcXTq1AmAUaNGsWLFCiwWC9HR0VRVVbFy5UqSk5MvOxYhhBA3Jkm8hRBCCFGvUlNTsVgsAPTs2dOml/lckydPBiAuLo6jR4+SlpbGsGHDAJg2bRolJSWXdfwhQ4Zw5MgR1q1bB0BVVRXx8fHExsaSmJiol3vllVdITEzk22+/BeDgwYMArFixgh07dpCVlcWRI0fo06fPZcUhhBDixiWJtxBCCCHqlVJKX64r6c7JyeHYsWMADB8+HKPRiKZpjBw5EoDS0lL27dt3Wce///770TSNqKgofd
3Jkycvut+gQYMAuO2224iMjGTEiBGsXLmS4ODgy4pDCCHEjUsmVxNCCCFEvWrbti2Ojo5YLBbWr1+PUqrOBPxSVFZW6ssFBQV1lvP29gawGd5+7pcBdfn0008ZPHgw8fHxJCcn88MPP7B06VL27t3LnDlzLj9wIYQQNxzp8RZCCCFEvfLy8uLee+8FICkpiZdeekkfeg7wyy+/cOjQIVq0aAHA0qVLMZvNKKX46quvADCZTERHRwPQpEkTwDqEHeDAgQPs2bPnsuMzmUwAFBcX26xft24dw4YN44MPPiAhIYEpU6YAkJCQcNnHEkIIcWOSxFsIIYQQ9W727Nl07NgRgBkzZuDn50eHDh3w9fWlf//+pKam6hOobd68mdDQUMLDw/X7rSdPnoyrqysA/fr1A2DmzJncdtttdOvWza4e7Lq0a9cOgBdffJGbb76Zl156CYAHHngAHx8f2rZtS6dOnXj11VcBuOmmmy77WEIIIW5MkngLIYQQot75+vqyadMm3nrrLbp27UpVVRUpKSn4+Pgwfvx4evXqxZgxY1i+fDndu3enqKiI7OxsOnbsyMcff6xPvAbw9ttvM3DgQEwmE4cPH+all16iR48elx3bu+++S0xMDOXl5WzdulXvSX/ooYeIjo4mNzeX5ORkgoKCmDBhAu+9994fbg8hhBA3Fk39ka+IhRBCCCGEEEIIcUHS4y2EEEIIIYQQQtQjSbyFEEIIIYQQQoh6JIm3EEIIIYQQQghRjyTxFkIIIYQQQggh6pEk3kIIIYQQQgghRD2SxFsIIYQQQgghhKhHkngLIYQQQgghhBD1SBJvIYQQQgghhBCiHkniLYQQQgghhBBC1CNJvIUQQgghhBBCiHokibcQQgghhBBCCFGPJPEWQgghhBBCCCHq0f8DlvnWh8YdVxYAAAAASUVORK5CYII=",
|
| 85 |
+
"text/plain": [
|
| 86 |
+
"<Figure size 1000x350 with 1 Axes>"
|
| 87 |
+
]
|
| 88 |
+
},
|
| 89 |
+
"metadata": {},
|
| 90 |
+
"output_type": "display_data"
|
| 91 |
+
}
|
| 92 |
+
],
|
| 93 |
+
"source": [
|
| 94 |
+
"csv_file = current_dir.parent / \"data/CSV/Models/Civi_models.csv\"\n",
|
| 95 |
+
"out_file = current_dir.parent / \"plots/Figure_12.svg\"\n",
|
| 96 |
+
"sortByFrequency_model_types_csv(csv_file, out_file)"
|
| 97 |
+
]
|
| 98 |
+
},
|
| 99 |
+
{
|
| 100 |
+
"cell_type": "code",
|
| 101 |
+
"execution_count": null,
|
| 102 |
+
"id": "e91e903f",
|
| 103 |
+
"metadata": {},
|
| 104 |
+
"outputs": [],
|
| 105 |
+
"source": []
|
| 106 |
+
}
|
| 107 |
+
],
|
| 108 |
+
"metadata": {
|
| 109 |
+
"kernelspec": {
|
| 110 |
+
"display_name": "latm",
|
| 111 |
+
"language": "python",
|
| 112 |
+
"name": "python3"
|
| 113 |
+
},
|
| 114 |
+
"language_info": {
|
| 115 |
+
"codemirror_mode": {
|
| 116 |
+
"name": "ipython",
|
| 117 |
+
"version": 3
|
| 118 |
+
},
|
| 119 |
+
"file_extension": ".py",
|
| 120 |
+
"mimetype": "text/x-python",
|
| 121 |
+
"name": "python",
|
| 122 |
+
"nbconvert_exporter": "python",
|
| 123 |
+
"pygments_lexer": "ipython3",
|
| 124 |
+
"version": "3.10.15"
|
| 125 |
+
}
|
| 126 |
+
},
|
| 127 |
+
"nbformat": 4,
|
| 128 |
+
"nbformat_minor": 5
|
| 129 |
+
}
|
jupyter_notebooks/SuppM_Figure_S13_Danbooru_Taxonomy.ipynb
ADDED
|
@@ -0,0 +1,1848 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"id": "68a6e559-6ef4-4c7f-8736-93bc8588e3bc",
|
| 6 |
+
"metadata": {},
|
| 7 |
+
"source": [
|
| 8 |
+
"# \"Danbooru\" Taxonomy"
|
| 9 |
+
]
|
| 10 |
+
},
|
| 11 |
+
{
|
| 12 |
+
"cell_type": "markdown",
|
| 13 |
+
"id": "36aff030",
|
| 14 |
+
"metadata": {},
|
| 15 |
+
"source": [
|
| 16 |
+
"### Getting Tags and Categories from Danbooru.donmai.us"
|
| 17 |
+
]
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"cell_type": "code",
|
| 21 |
+
"execution_count": 14,
|
| 22 |
+
"id": "a97e3a64-5990-44a9-809d-45030e85f161",
|
| 23 |
+
"metadata": {
|
| 24 |
+
"execution": {
|
| 25 |
+
"iopub.execute_input": "2025-03-17T12:48:07.782534Z",
|
| 26 |
+
"iopub.status.busy": "2025-03-17T12:48:07.781679Z",
|
| 27 |
+
"iopub.status.idle": "2025-03-17T12:48:07.786016Z",
|
| 28 |
+
"shell.execute_reply": "2025-03-17T12:48:07.785465Z",
|
| 29 |
+
"shell.execute_reply.started": "2025-03-17T12:48:07.782509Z"
|
| 30 |
+
}
|
| 31 |
+
},
|
| 32 |
+
"outputs": [],
|
| 33 |
+
"source": [
|
| 34 |
+
"from pathlib import Path\n",
|
| 35 |
+
"import subprocess\n",
|
| 36 |
+
"import json\n",
|
| 37 |
+
"import urllib.parse"
|
| 38 |
+
]
|
| 39 |
+
},
|
| 40 |
+
{
|
| 41 |
+
"cell_type": "markdown",
|
| 42 |
+
"id": "97917f01-8f5b-4aa7-8869-9a7d853843d0",
|
| 43 |
+
"metadata": {},
|
| 44 |
+
"source": [
|
| 45 |
+
"#### Paths and credentials"
|
| 46 |
+
]
|
| 47 |
+
},
|
| 48 |
+
{
|
| 49 |
+
"cell_type": "code",
|
| 50 |
+
"execution_count": 2,
|
| 51 |
+
"id": "1f50a910-66c2-472a-8027-dd589b6ae755",
|
| 52 |
+
"metadata": {
|
| 53 |
+
"execution": {
|
| 54 |
+
"iopub.execute_input": "2025-03-17T12:48:08.799790Z",
|
| 55 |
+
"iopub.status.busy": "2025-03-17T12:48:08.798840Z",
|
| 56 |
+
"iopub.status.idle": "2025-03-17T12:48:08.905544Z",
|
| 57 |
+
"shell.execute_reply": "2025-03-17T12:48:08.904995Z",
|
| 58 |
+
"shell.execute_reply.started": "2025-03-17T12:48:08.799770Z"
|
| 59 |
+
}
|
| 60 |
+
},
|
| 61 |
+
"outputs": [],
|
| 62 |
+
"source": [
|
| 63 |
+
"current_dir = Path.cwd()"
|
| 64 |
+
]
|
| 65 |
+
},
|
| 66 |
+
{
|
| 67 |
+
"cell_type": "code",
|
| 68 |
+
"execution_count": 6,
|
| 69 |
+
"id": "a0f846e3-1fcc-4839-a459-c974f9c46d06",
|
| 70 |
+
"metadata": {
|
| 71 |
+
"execution": {
|
| 72 |
+
"iopub.execute_input": "2025-03-17T12:48:09.490977Z",
|
| 73 |
+
"iopub.status.busy": "2025-03-17T12:48:09.490799Z",
|
| 74 |
+
"iopub.status.idle": "2025-03-17T12:48:09.701470Z",
|
| 75 |
+
"shell.execute_reply": "2025-03-17T12:48:09.700912Z",
|
| 76 |
+
"shell.execute_reply.started": "2025-03-17T12:48:09.490961Z"
|
| 77 |
+
}
|
| 78 |
+
},
|
| 79 |
+
"outputs": [],
|
| 80 |
+
"source": [
|
| 81 |
+
"api_key = (current_dir.parent / \"misc/credentials/api-key_danbooru\").read_text().strip()\n",
|
| 82 |
+
"username = (current_dir.parent / \"misc/credentials/username_danbooru\").read_text().strip()"
|
| 83 |
+
]
|
| 84 |
+
},
|
| 85 |
+
{
|
| 86 |
+
"cell_type": "markdown",
|
| 87 |
+
"id": "c4c2290f-7705-4061-9b29-e1b74da732b8",
|
| 88 |
+
"metadata": {
|
| 89 |
+
"execution": {
|
| 90 |
+
"iopub.execute_input": "2025-03-12T20:39:34.470218Z",
|
| 91 |
+
"iopub.status.busy": "2025-03-12T20:39:34.469746Z",
|
| 92 |
+
"iopub.status.idle": "2025-03-12T20:39:34.472498Z",
|
| 93 |
+
"shell.execute_reply": "2025-03-12T20:39:34.472084Z",
|
| 94 |
+
"shell.execute_reply.started": "2025-03-12T20:39:34.470199Z"
|
| 95 |
+
}
|
| 96 |
+
},
|
| 97 |
+
"source": [
|
| 98 |
+
"### Query danbooru.donmai.us for a single Tag"
|
| 99 |
+
]
|
| 100 |
+
},
|
| 101 |
+
{
|
| 102 |
+
"cell_type": "code",
|
| 103 |
+
"execution_count": 7,
|
| 104 |
+
"id": "00ffaf5f-a4ae-408a-8a7c-e6b174f091f8",
|
| 105 |
+
"metadata": {
|
| 106 |
+
"execution": {
|
| 107 |
+
"iopub.execute_input": "2025-03-17T12:48:11.712064Z",
|
| 108 |
+
"iopub.status.busy": "2025-03-17T12:48:11.711866Z",
|
| 109 |
+
"iopub.status.idle": "2025-03-17T12:48:20.143088Z",
|
| 110 |
+
"shell.execute_reply": "2025-03-17T12:48:20.142318Z",
|
| 111 |
+
"shell.execute_reply.started": "2025-03-17T12:48:11.712047Z"
|
| 112 |
+
}
|
| 113 |
+
},
|
| 114 |
+
"outputs": [
|
| 115 |
+
{
|
| 116 |
+
"name": "stdout",
|
| 117 |
+
"output_type": "stream",
|
| 118 |
+
"text": [
|
| 119 |
+
"🚀 Fetching tag details for 'wombat'...\n",
|
| 120 |
+
"🔎 Running cURL: curl -s -L --user parodyofsomething:*** -X GET https://danbooru.donmai.us/tags.json?search%5Bname%5D=wombat&only=id,name,category,post_count,is_deprecated,created_at,updated_at,wiki_page,artist,antecedent_alias,consequent_aliases,antecedent_implications,consequent_implications,dtext_links\n",
|
| 121 |
+
"✅ Test complete! Data saved to `/home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/misc/danbooru/tag_info_test/wombat_tag_info.json`.\n"
|
| 122 |
+
]
|
| 123 |
+
}
|
| 124 |
+
],
|
| 125 |
+
"source": [
|
| 126 |
+
"output_dir = current_dir.parent / \"misc\" / \"danbooru\" / \"tag_info_test\"\n",
|
| 127 |
+
"output_dir.mkdir(parents=True, exist_ok=True)\n",
|
| 128 |
+
"\n",
|
| 129 |
+
"# Get the tag to test\n",
|
| 130 |
+
"tag_name = input(\"Enter a single tag to test: \").strip()\n",
|
| 131 |
+
"if not tag_name:\n",
|
| 132 |
+
"    print(\"❌ ERROR: No tag entered.\")\n",
|
| 133 |
+
"    raise SystemExit(\"No tag entered.\")  # Stop this cell without shutting down the kernel\n",
|
| 134 |
+
"\n",
|
| 135 |
+
"# Encode the tag name properly\n",
|
| 136 |
+
"encoded_tag = urllib.parse.quote(tag_name, safe=\"\") # Encode everything\n",
|
| 137 |
+
"\n",
|
| 138 |
+
"# API endpoints with correct encoding\n",
|
| 139 |
+
"BASE_URL = \"https://danbooru.donmai.us\"\n",
|
| 140 |
+
"#TAGS_API = f\"{BASE_URL}/tags.json?search%5Bname%5D={encoded_tag}\"\n",
|
| 141 |
+
"TAGS_API = f\"{BASE_URL}/tags.json?search%5Bname%5D={encoded_tag}&only=id,name,category,post_count,is_deprecated,created_at,updated_at,wiki_page,artist,antecedent_alias,consequent_aliases,antecedent_implications,consequent_implications,dtext_links\"\n",
|
| 142 |
+
"IMPLICATIONS_API = f\"{BASE_URL}/tag_implications.json?search%5Bantecedent_name%5D={encoded_tag}\"\n",
|
| 143 |
+
"ALIASES_API = f\"{BASE_URL}/tag_aliases.json?search%5Bantecedent_name%5D={encoded_tag}\"\n",
|
| 144 |
+
"WIKI_API = f\"{BASE_URL}/wiki_pages.json?search%5Btitle%5D={encoded_tag}\"\n",
|
| 145 |
+
"\n",
|
| 146 |
+
"# Function to run cURL command properly\n",
|
| 147 |
+
"def fetch_data(api_url, description):\n",
|
| 148 |
+
"    print(f\"🚀 Fetching {description} for '{tag_name}'...\")\n",
|
| 149 |
+
"\n",
|
| 150 |
+
"    # Build the cURL command\n",
|
| 151 |
+
"    curl_command = [\n",
|
| 152 |
+
"        \"curl\", \"-s\", \"-L\",  # Silent & follow redirects\n",
|
| 153 |
+
"        \"--user\", f\"{username}:{api_key}\",  # Auth\n",
|
| 154 |
+
"        \"-X\", \"GET\",\n",
|
| 155 |
+
"        api_url\n",
|
| 156 |
+
"    ]\n",
|
| 157 |
+
"\n",
|
| 158 |
+
"    print(f\"🔎 Running cURL: curl -s -L --user {username}:*** -X GET {api_url}\")  # Show the command with the API key masked\n",
|
| 159 |
+
"\n",
|
| 160 |
+
"    try:\n",
|
| 161 |
+
"        result = subprocess.run(curl_command, capture_output=True, text=True, check=True)\n",
|
| 162 |
+
"        if result.returncode == 0:\n",
|
| 163 |
+
"            return json.loads(result.stdout) if result.stdout else None\n",
|
| 164 |
+
"        else:\n",
|
| 165 |
+
"            print(f\"⚠️ cURL failed (status {result.returncode}): {result.stderr}\")\n",
|
| 166 |
+
"            return None\n",
|
| 167 |
+
"    except subprocess.CalledProcessError as e:\n",
|
| 168 |
+
"        print(f\"⚠️ cURL error: {e}\")\n",
|
| 169 |
+
"        return None\n",
|
| 170 |
+
"    except json.JSONDecodeError:\n",
|
| 171 |
+
"        print(f\"⚠️ Failed to parse JSON: {result.stdout}\")\n",
|
| 172 |
+
"        return None\n",
|
| 173 |
+
"\n",
|
| 174 |
+
"# Fetch tag details\n",
|
| 175 |
+
"tag_info = {\n",
|
| 176 |
+
" \"tag_details\": fetch_data(TAGS_API, \"tag details\"),\n",
|
| 177 |
+
" #\"implications\": fetch_data(IMPLICATIONS_API, \"tag implications\"),\n",
|
| 178 |
+
" #\"aliases\": fetch_data(ALIASES_API, \"tag aliases\"),\n",
|
| 179 |
+
" #\"wiki\": fetch_data(WIKI_API, \"wiki information\"),\n",
|
| 180 |
+
"}\n",
|
| 181 |
+
"\n",
|
| 182 |
+
"# Save to JSON\n",
|
| 183 |
+
"output_file = output_dir / f\"{tag_name}_tag_info.json\"\n",
|
| 184 |
+
"with open(output_file, \"w\", encoding=\"utf-8\") as f:\n",
|
| 185 |
+
"    json.dump(tag_info, f, indent=4, ensure_ascii=False)\n",
|
| 186 |
+
"\n",
|
| 187 |
+
"print(f\"✅ Test complete! Data saved to `{output_file}`.\")\n"
|
| 188 |
+
]
|
| 189 |
+
},
|
| 190 |
+
{
|
| 191 |
+
"cell_type": "markdown",
|
| 192 |
+
"id": "75a28e4c-e06f-4578-a338-4f0014953302",
|
| 193 |
+
"metadata": {},
|
| 194 |
+
"source": [
|
| 195 |
+
"## Hierarchy"
|
| 196 |
+
]
|
| 197 |
+
},
|
| 198 |
+
{
|
| 199 |
+
"cell_type": "code",
|
| 200 |
+
"execution_count": 8,
|
| 201 |
+
"id": "1a58f2d1-909b-41c0-bcb8-0a55e6835f53",
|
| 202 |
+
"metadata": {
|
| 203 |
+
"execution": {
|
| 204 |
+
"iopub.execute_input": "2025-03-17T12:48:42.365809Z",
|
| 205 |
+
"iopub.status.busy": "2025-03-17T12:48:42.365447Z",
|
| 206 |
+
"iopub.status.idle": "2025-03-17T12:48:42.376305Z",
|
| 207 |
+
"shell.execute_reply": "2025-03-17T12:48:42.375752Z",
|
| 208 |
+
"shell.execute_reply.started": "2025-03-17T12:48:42.365788Z"
|
| 209 |
+
}
|
| 210 |
+
},
|
| 211 |
+
"outputs": [],
|
| 212 |
+
"source": [
|
| 213 |
+
"manual_hierarchy = {\n",
|
| 214 |
+
" \"subject\": {\n",
|
| 215 |
+
" \"female\": {\n",
|
| 216 |
+
" \"female_general\": [\"woman\", \"girl\", \"1girl\", \"2girls\", \"3girls\", \"4girls\", \"5girls\", \"6+girls\", \"multiple girls\"]\n",
|
| 217 |
+
" },\n",
|
| 218 |
+
" \"male\": {\n",
|
| 219 |
+
" \"male_general\": [\"man\", \"boy\", \"1boy\", \"2boys\", \"3boys\", \"4boys\", \"5boys\", \"6+boys\", \"multiple boys\"]\n",
|
| 220 |
+
" },\n",
|
| 221 |
+
" \"koma\": {\n",
|
| 222 |
+
" \"koma_general\": [\"1koma\", \"2koma\"]\n",
|
| 223 |
+
" },\n",
|
| 224 |
+
" \"anthro\": {\n",
|
| 225 |
+
" \"anthro_general\": [\"cat girl\", \"fox girl\", \"dog girl\", \"cat boy\", \"furry\"]\n",
|
| 226 |
+
" }\n",
|
| 227 |
+
" },\n",
|
| 228 |
+
"\n",
|
| 229 |
+
" \"visual_characteristics\": {\n",
|
| 230 |
+
" \"image_composition_and_style\": {\n",
|
| 231 |
+
" \"artistic_license\": [],\n",
|
| 232 |
+
" \"image_composition\": {\n",
|
| 233 |
+
" \"backgrounds\": [],\n",
|
| 234 |
+
" \"censorship\": [],\n",
|
| 235 |
+
" \"colors\": [],\n",
|
| 236 |
+
" \"focus_tags\": [],\n",
|
| 237 |
+
" \"lighting\": [],\n",
|
| 238 |
+
" \"prints\": []\n",
|
| 239 |
+
" },\n",
|
| 240 |
+
" \"patterns\": [],\n",
|
| 241 |
+
" \"symbols\": [],\n",
|
| 242 |
+
" \"text\": [],\n",
|
| 243 |
+
" \"year_tags\": []\n",
|
| 244 |
+
" }\n",
|
| 245 |
+
" },\n",
|
| 246 |
+
" \"body\": {\n",
|
| 247 |
+
" \"body_parts\": {\n",
|
| 248 |
+
" \"ass\": [],\n",
|
| 249 |
+
" \"breasts_tags\": [],\n",
|
| 250 |
+
" \"face_tags\": {\n",
|
| 251 |
+
" \"eyes_tags\": []\n",
|
| 252 |
+
" },\n",
|
| 253 |
+
" \"ears_tags\": [],\n",
|
| 254 |
+
" \"hair\": {\n",
|
| 255 |
+
" \"hair_color\": [],\n",
|
| 256 |
+
" \"hair_styles\": []\n",
|
| 257 |
+
" },\n",
|
| 258 |
+
" \"hands\": {\n",
|
| 259 |
+
" \"gestures\": []\n",
|
| 260 |
+
" },\n",
|
| 261 |
+
" \"neck_and_neckwear\": [],\n",
|
| 262 |
+
" \"posture\": [],\n",
|
| 263 |
+
" \"pussy\": [],\n",
|
| 264 |
+
" \"penis\": [],\n",
|
| 265 |
+
" \"shoulders\": [],\n",
|
| 266 |
+
" \"skin_color\": [],\n",
|
| 267 |
+
" \"tail\": [],\n",
|
| 268 |
+
" \"wings\": []\n",
|
| 269 |
+
" },\n",
|
| 270 |
+
" \"injury\": []\n",
|
| 271 |
+
" },\n",
|
| 272 |
+
" \"attire_and_body_accessories\": {\n",
|
| 273 |
+
" \"attire\": {\n",
|
| 274 |
+
" \"dress\": [],\n",
|
| 275 |
+
" \"handwear\": [],\n",
|
| 276 |
+
" \"headwear\": [],\n",
|
| 277 |
+
" \"legwear\": [],\n",
|
| 278 |
+
" \"neck_and_neckwear\": [],\n",
|
| 279 |
+
" \"sexual_attire\": {\n",
|
| 280 |
+
" \"bra\": [],\n",
|
| 281 |
+
" \"panties\": []\n",
|
| 282 |
+
" },\n",
|
| 283 |
+
" \"sleeves\": [],\n",
|
| 284 |
+
" \"swimsuit\": []\n",
|
| 285 |
+
" },\n",
|
| 286 |
+
" \"embellishment\": [],\n",
|
| 287 |
+
" \"eyewear\": [],\n",
|
| 288 |
+
" \"fashion_style\": [],\n",
|
| 289 |
+
" \"nudity\": []\n",
|
| 290 |
+
" },\n",
|
| 291 |
+
" \"sex\": {\n",
|
| 292 |
+
" \"sex_acts\": {\n",
|
| 293 |
+
" \"simulated_sex_acts\": []\n",
|
| 294 |
+
" },\n",
|
| 295 |
+
" \"sexual_positions\": []\n",
|
| 296 |
+
" },\n",
|
| 297 |
+
" \"objects\": {\n",
|
| 298 |
+
" \"computer\": [],\n",
|
| 299 |
+
" \"airplanes\": [],\n",
|
| 300 |
+
" \"armor\": [],\n",
|
| 301 |
+
" \"ground_vehicles\": [],\n",
|
| 302 |
+
" \"helicopters\": [],\n",
|
| 303 |
+
" \"pokemon_objects\": [],\n",
|
| 304 |
+
" \"ships\": [],\n",
|
| 305 |
+
" \"weapons\": [],\n",
|
| 306 |
+
" \"audio_tags\": [],\n",
|
| 307 |
+
" \"cards\": {\n",
|
| 308 |
+
" \"playing_card_faces\": []\n",
|
| 309 |
+
" },\n",
|
| 310 |
+
" \"eyewear\": [],\n",
|
| 311 |
+
" \"piercings\": [],\n",
|
| 312 |
+
" \"sex_objects\": []\n",
|
| 313 |
+
" },\n",
|
| 314 |
+
" \"creatures\": {\n",
|
| 315 |
+
" \"animals\": {\n",
|
| 316 |
+
" \"birds\": [],\n",
|
| 317 |
+
" \"cats\": [],\n",
|
| 318 |
+
" \"dogs\": []\n",
|
| 319 |
+
" },\n",
|
| 320 |
+
" \"legendary_creatures\": []\n",
|
| 321 |
+
" },\n",
|
| 322 |
+
" \"plants\": {\n",
|
| 323 |
+
" \"plant\": {\n",
|
| 324 |
+
" \"tree\": []\n",
|
| 325 |
+
" },\n",
|
| 326 |
+
" \"flowers\": []\n",
|
| 327 |
+
" },\n",
|
| 328 |
+
" \"games\": {\n",
|
| 329 |
+
" \"game_activities\": [],\n",
|
| 330 |
+
" \"board_games\": [],\n",
|
| 331 |
+
" \"sports\": [],\n",
|
| 332 |
+
" \"video_game\": []\n",
|
| 333 |
+
" },\n",
|
| 334 |
+
" \"real_world\": {\n",
|
| 335 |
+
" \"companies_and_brand_names\": [],\n",
|
| 336 |
+
" \"holidays_and_celebrations\": [],\n",
|
| 337 |
+
" \"jobs\": [],\n",
|
| 338 |
+
" \"locations\": [],\n",
|
| 339 |
+
" \"people\": [],\n",
|
| 340 |
+
" \"real_world_locations\": []\n",
|
| 341 |
+
" },\n",
|
| 342 |
+
" \"more\": {\n",
|
| 343 |
+
" \"dances\": [],\n",
|
| 344 |
+
" \"family_relationships\": [],\n",
|
| 345 |
+
" \"food_tags\": [],\n",
|
| 346 |
+
" \"fire\": [],\n",
|
| 347 |
+
" \"groups\": [],\n",
|
| 348 |
+
" \"phrases\": [],\n",
|
| 349 |
+
" \"scan\": [],\n",
|
| 350 |
+
" \"subjective\": [],\n",
|
| 351 |
+
" \"technology\": [],\n",
|
| 352 |
+
" \"verbs_and_gerunds\": [],\n",
|
| 353 |
+
" \"water\": [],\n",
|
| 354 |
+
" \"disambiguation_pages\": [],\n",
|
| 355 |
+
" \"magazine_publications\": [],\n",
|
| 356 |
+
" \"special_moves\": [],\n",
|
| 357 |
+
" \"uniforms\": [],\n",
|
| 358 |
+
" \"pokemon_media\": [],\n",
|
| 359 |
+
" \"tagged_songs\": [],\n",
|
| 360 |
+
" \"vocaloid_derivatives\": [],\n",
|
| 361 |
+
" \"vocaloid_songs\": [],\n",
|
| 362 |
+
" \"vocal_synthesizers\": [],\n",
|
| 363 |
+
" \"vocal_synth_derivatives\": [],\n",
|
| 364 |
+
" \"vocal_synth_songs\": [],\n",
|
| 365 |
+
" \"deemo_songs\": []\n",
|
| 366 |
+
" },\n",
|
| 367 |
+
" \"copyrights_artists_projects_and_media\": {\n",
|
| 368 |
+
" \"genres_of_video_games\": {\n",
|
| 369 |
+
" \"fighting_games\": [],\n",
|
| 370 |
+
" \"platform_games\": [],\n",
|
| 371 |
+
" \"role-playing_games\": [],\n",
|
| 372 |
+
" \"shooter_games\": [],\n",
|
| 373 |
+
" \"visual_novel_games\": []\n",
|
| 374 |
+
" },\n",
|
| 375 |
+
" \"artists\": {\n",
|
| 376 |
+
" \"named_drawfags\": [],\n",
|
| 377 |
+
" \"pixiv_projects\": []\n",
|
| 378 |
+
" }\n",
|
| 379 |
+
" },\n",
|
| 380 |
+
" \"characters\": {\n",
|
| 381 |
+
" \"ace_attorney\": [],\n",
|
| 382 |
+
" \"arknights\": [],\n",
|
| 383 |
+
" \"atelier\": [],\n",
|
| 384 |
+
" \"azur_lane\": [],\n",
|
| 385 |
+
" \"bleach\": [],\n",
|
| 386 |
+
" \"bokujou_monogatari\": [],\n",
|
| 387 |
+
" \"brave_girl_ravens\": [],\n",
|
| 388 |
+
" \"cardcaptor_sakura\": [],\n",
|
| 389 |
+
" \"danganronpa\": [],\n",
|
| 390 |
+
" \"digimon\": {\n",
|
| 391 |
+
" \"digimon_characters\": []\n",
|
| 392 |
+
" },\n",
|
| 393 |
+
" \"dragon_ball\": [],\n",
|
| 394 |
+
" \"dragon_quest\": [],\n",
|
| 395 |
+
" \"fate_series\": [],\n",
|
| 396 |
+
" \"final_fantasy\": [],\n",
|
| 397 |
+
" \"fire_emblem\": [],\n",
|
| 398 |
+
" \"flower_knight_girl\": [],\n",
|
| 399 |
+
" \"gensou_suikoden\": [],\n",
|
| 400 |
+
" \"girls_frontline\": [],\n",
|
| 401 |
+
" \"girls_und_panzer\": [],\n",
|
| 402 |
+
" \"gundam_mechas\": [],\n",
|
| 403 |
+
" \"hunter_x_hunter\": [],\n",
|
| 404 |
+
" \"jojo_no_kimyou_na_bouken\": [],\n",
|
| 405 |
+
" \"kamen_rider\": [],\n",
|
| 406 |
+
" \"kantai_collection\": [],\n",
|
| 407 |
+
" \"kingdom_hearts\": [],\n",
|
| 408 |
+
" \"mahou_sensei_negima\": [],\n",
|
| 409 |
+
" \"meitantei_conan\": [],\n",
|
| 410 |
+
" \"minecraft\": [],\n",
|
| 411 |
+
" \"naruto\": [],\n",
|
| 412 |
+
" \"nippon_ichi\": [],\n",
|
| 413 |
+
" \"one_piece\": [],\n",
|
| 414 |
+
" \"oshiro_project\": [],\n",
|
| 415 |
+
" \"pokemon\": {\n",
|
| 416 |
+
" \"elite_four_members\": [],\n",
|
| 417 |
+
" \"gym_leaders\": [],\n",
|
| 418 |
+
" \"families_of_pokemon_main_characters\": [],\n",
|
| 419 |
+
" \"pokemon_ranger_characters\": [],\n",
|
| 420 |
+
" \"pokemon_trainer_classes\": []\n",
|
| 421 |
+
" },\n",
|
| 422 |
+
" \"pretty_cure\": [],\n",
|
| 423 |
+
" \"ragnarok_online\": [],\n",
|
| 424 |
+
" \"rosenkreuzstilette\": [],\n",
|
| 425 |
+
" \"sailor_moon\": [],\n",
|
| 426 |
+
" \"street_fighter\": [],\n",
|
| 427 |
+
" \"super_smash_bros\": [],\n",
|
| 428 |
+
" \"world_witches_series\": [],\n",
|
| 429 |
+
" \"tales_of\": [],\n",
|
| 430 |
+
" \"toaru_majutsu_no_index\": [],\n",
|
| 431 |
+
" \"touhou\": [],\n",
|
| 432 |
+
" \"touken_ranbu\": [],\n",
|
| 433 |
+
" \"ultra_series\": [],\n",
|
| 434 |
+
" \"umamusume\": [],\n",
|
| 435 |
+
" \"vocaloid\": [],\n",
|
| 436 |
+
" \"yu_gi_oh\": [],\n",
|
| 437 |
+
" \"genderswap_characters\": [],\n",
|
| 438 |
+
" \"official_mascots\": [],\n",
|
| 439 |
+
" \"real_life_racehorses\": []\n",
|
| 440 |
+
" },\n",
|
| 441 |
+
" \"metatags\": {\n",
|
| 442 |
+
" \"tag_group_metatags\": [],\n",
|
| 443 |
+
" \"drawing_software\": []\n",
|
| 444 |
+
" }\n",
|
| 445 |
+
"}\n"
|
| 446 |
+
]
|
| 447 |
+
},
|
| 448 |
+
{
|
| 449 |
+
"cell_type": "markdown",
|
| 450 |
+
"id": "824760af-a568-47f8-b076-b30e8606b9ac",
|
| 451 |
+
"metadata": {},
|
| 452 |
+
"source": [
|
| 453 |
+
"## Tag Groups"
|
| 454 |
+
]
|
| 455 |
+
},
|
| 456 |
+
{
|
| 457 |
+
"cell_type": "code",
|
| 458 |
+
"execution_count": 9,
|
| 459 |
+
"id": "557cdd69-f6b1-4c33-9f67-a13b00c0146b",
|
| 460 |
+
"metadata": {
|
| 461 |
+
"execution": {
|
| 462 |
+
"iopub.execute_input": "2025-03-17T12:48:43.777297Z",
|
| 463 |
+
"iopub.status.busy": "2025-03-17T12:48:43.776934Z",
|
| 464 |
+
"iopub.status.idle": "2025-03-17T12:48:43.787862Z",
|
| 465 |
+
"shell.execute_reply": "2025-03-17T12:48:43.787288Z",
|
| 466 |
+
"shell.execute_reply.started": "2025-03-17T12:48:43.777279Z"
|
| 467 |
+
}
|
| 468 |
+
},
|
| 469 |
+
"outputs": [],
|
| 470 |
+
"source": [
|
| 471 |
+
"tag_groups = {\n",
|
| 472 |
+
" # **Subjects (Humans, Anthro, etc.)**\n",
|
| 473 |
+
" \"subject\": \"subject\",\n",
|
| 474 |
+
" \"subject.female\": \"female\",\n",
|
| 475 |
+
" \"subject.female_general\": \"female_general\", # Added\n",
|
| 476 |
+
" \"subject.female.1girl\": \"1girl\",\n",
|
| 477 |
+
" \"subject.female.2girls\": \"2girls\",\n",
|
| 478 |
+
"\n",
|
| 479 |
+
"\n",
|
| 480 |
+
" # **Male Subjects**\n",
|
| 481 |
+
" \"subject.male\": \"male\",\n",
|
| 482 |
+
" \"subject.male.1boy\": \"1boy\",\n",
|
| 483 |
+
" \"subject.male.2boys\": \"2boys\",\n",
|
| 484 |
+
" \"subject.male.3boys\": \"3boys\",\n",
|
| 485 |
+
" \"subject.male.4boys\": \"4boys\",\n",
|
| 486 |
+
" \"subject.male.5boys\": \"5boys\",\n",
|
| 487 |
+
" \"subject.male.6+boys\": \"6+boys\",\n",
|
| 488 |
+
" \"subject.male.multiple_boys\": \"multiple_boys\",\n",
|
| 489 |
+
" \"subject.male.man\": \"man\",\n",
|
| 490 |
+
" \"subject.male.boy\": \"boy\",\n",
|
| 491 |
+
"\n",
|
| 492 |
+
" # **Koma (Manga Panel Counts)**\n",
|
| 493 |
+
" \"subject.koma\": \"koma\",\n",
|
| 494 |
+
" \"subject.koma.1koma\": \"1koma\",\n",
|
| 495 |
+
" \"subject.koma.2koma\": \"2koma\",\n",
|
| 496 |
+
"\n",
|
| 497 |
+
" # **Anthropomorphic Characters**\n",
|
| 498 |
+
" \"subject.anthro\": \"anthro\",\n",
|
| 499 |
+
" \"subject.anthro.cat_girl\": \"cat_girl\",\n",
|
| 500 |
+
" \"subject.anthro.fox_girl\": \"fox_girl\",\n",
|
| 501 |
+
" \"subject.anthro.dog_girl\": \"dog_girl\",\n",
|
| 502 |
+
" \"subject.anthro.cat_boy\": \"cat_boy\",\n",
|
| 503 |
+
" \"subject.anthro.furry\": \"furry\",\n",
|
| 504 |
+
"\n",
|
| 505 |
+
" # **Visual Characteristics**\n",
|
| 506 |
+
" \"visual_characteristics\": \"visual_characteristics\",\n",
|
| 507 |
+
" \"visual_characteristics.image_composition_and_style\": \"image_composition_and_style\",\n",
|
| 508 |
+
" \"visual_characteristics.image_composition_and_style.artistic_license\": \"artistic_license\",\n",
|
| 509 |
+
" \"visual_characteristics.image_composition_and_style.image_composition\": \"image_composition\",\n",
|
| 510 |
+
" \"visual_characteristics.image_composition_and_style.image_composition.backgrounds\": \"backgrounds\",\n",
|
| 511 |
+
" \"visual_characteristics.image_composition_and_style.image_composition.censorship\": \"censorship\",\n",
|
| 512 |
+
" \"visual_characteristics.image_composition_and_style.image_composition.colors\": \"colors\",\n",
|
| 513 |
+
" \"visual_characteristics.image_composition_and_style.image_composition.focus_tags\": \"focus_tags\",\n",
|
| 514 |
+
" \"visual_characteristics.image_composition_and_style.image_composition.lighting\": \"lighting\",\n",
|
| 515 |
+
" \"visual_characteristics.image_composition_and_style.image_composition.prints\": \"prints\",\n",
|
| 516 |
+
" \"visual_characteristics.image_composition_and_style.image_composition.style_parodies\": \"style_parodies\",\n",
|
| 517 |
+
" \"visual_characteristics.image_composition_and_style.patterns\": \"patterns\",\n",
|
| 518 |
+
" \"visual_characteristics.image_composition_and_style.symbols\": \"symbols\",\n",
|
| 519 |
+
" \"visual_characteristics.image_composition_and_style.text\": \"text\",\n",
|
| 520 |
+
" \"visual_characteristics.image_composition_and_style.year_tags\": \"year_tags\",\n",
|
| 521 |
+
"\n",
|
| 522 |
+
"\n",
|
| 523 |
+
" # Body\n",
|
| 524 |
+
" \"body\": \"body\",\n",
|
| 525 |
+
" \"body.body_parts\": \"body_parts\",\n",
|
| 526 |
+
" \"body.body_parts.ass\": \"ass\",\n",
|
| 527 |
+
" \"body.body_parts.breasts_tags\": \"breasts_tags\",\n",
|
| 528 |
+
" \"body.body_parts.face_tags\": \"face_tags\",\n",
|
| 529 |
+
" \"body.body_parts.face_tags.eyes_tags\": \"eyes_tags\",\n",
|
| 530 |
+
" \"body.body_parts.ears_tags\": \"ears_tags\",\n",
|
| 531 |
+
" \"body.body_parts.hair\": \"hair\",\n",
|
| 532 |
+
" \"body.body_parts.hair.hair_color\": \"hair_color\",\n",
|
| 533 |
+
" \"body.body_parts.hair.hair_styles\": \"hair_styles\",\n",
|
| 534 |
+
" \"body.body_parts.hands\": \"hands\",\n",
|
| 535 |
+
" \"body.body_parts.hands.gestures\": \"gestures\",\n",
|
| 536 |
+
" \"body.body_parts.neck_and_neckwear\": \"neck_and_neckwear\",\n",
|
| 537 |
+
" \"body.body_parts.posture\": \"posture\",\n",
|
| 538 |
+
" \"body.body_parts.pussy\": \"pussy\",\n",
|
| 539 |
+
" \"body.body_parts.penis\": \"penis\",\n",
|
| 540 |
+
" \"body.body_parts.shoulders\": \"shoulders\",\n",
|
| 541 |
+
" \"body.body_parts.skin_color\": \"skin_color\",\n",
|
| 542 |
+
" \"body.body_parts.tail\": \"tail\",\n",
|
| 543 |
+
" \"body.body_parts.wings\": \"wings\",\n",
|
| 544 |
+
" \"body.injury\": \"injury\",\n",
|
| 545 |
+
"\n",
|
| 546 |
+
" # Attire & Accessories\n",
|
| 547 |
+
" \"attire_and_body_accessories\": \"attire_and_body_accessories\",\n",
|
| 548 |
+
" \"attire_and_body_accessories.attire\": \"attire\",\n",
|
| 549 |
+
" \"attire_and_body_accessories.attire.dress\": \"dress\",\n",
|
| 550 |
+
" \"attire_and_body_accessories.attire.handwear\": \"handwear\",\n",
|
| 551 |
+
" \"attire_and_body_accessories.attire.headwear\": \"headwear\",\n",
|
| 552 |
+
" \"attire_and_body_accessories.attire.legwear\": \"legwear\",\n",
|
| 553 |
+
" \"attire_and_body_accessories.attire.mask\": \"mask\",\n",
|
| 554 |
+
" \"attire_and_body_accessories.attire.neck_and_neckwear\": \"neck_and_neckwear\",\n",
|
| 555 |
+
" \"attire_and_body_accessories.attire.sexual_attire\": \"sexual_attire\",\n",
|
| 556 |
+
" \"attire_and_body_accessories.attire.sexual_attire.bra\": \"bra\",\n",
|
| 557 |
+
" \"attire_and_body_accessories.attire.sexual_attire.panties\": \"panties\",\n",
|
| 558 |
+
" \"attire_and_body_accessories.attire.sleeves\": \"sleeves\",\n",
|
| 559 |
+
" \"attire_and_body_accessories.attire.swimsuit\": \"swimsuit\",\n",
|
| 560 |
+
" \"attire_and_body_accessories.embellishment\": \"embellishment\",\n",
|
| 561 |
+
" \"attire_and_body_accessories.eyewear\": \"eyewear\",\n",
|
| 562 |
+
" \"attire_and_body_accessories.fashion_style\": \"fashion_style\",\n",
|
| 563 |
+
" \"attire_and_body_accessories.nudity\": \"nudity\",\n",
|
| 564 |
+
"\n",
|
| 565 |
+
" # Sex\n",
|
| 566 |
+
" \"sex\": \"sex\",\n",
|
| 567 |
+
" \"sex.sex_acts\": \"sex_acts\",\n",
|
| 568 |
+
" \"sex.sex_acts.simulated_sex_acts\": \"simulated_sex_acts\",\n",
|
| 569 |
+
" \"sex.sexual_positions\": \"sexual_positions\",\n",
|
| 570 |
+
"\n",
|
| 571 |
+
" # Objects\n",
|
| 572 |
+
" \"objects\": \"objects\",\n",
|
| 573 |
+
" \"objects.computer\": \"computer\",\n",
|
| 574 |
+
" \"objects.airplanes\": \"airplanes\",\n",
|
| 575 |
+
" \"objects.armor\": \"armor\",\n",
|
| 576 |
+
" \"objects.ground_vehicles\": \"ground_vehicles\",\n",
|
| 577 |
+
" \"objects.helicopters\": \"helicopters\",\n",
|
| 578 |
+
" \"objects.pokemon_objects\": \"pokemon_objects\",\n",
|
| 579 |
+
" \"objects.ships\": \"ships\",\n",
|
| 580 |
+
" \"objects.weapons\": \"weapons\",\n",
|
| 581 |
+
" \"objects.audio_tags\": \"audio_tags\",\n",
|
| 582 |
+
" \"objects.cards\": \"cards\",\n",
|
| 583 |
+
" \"objects.cards.playing_card_faces\": \"playing_card_faces\",\n",
|
| 584 |
+
" \"objects.eyewear\": \"eyewear\",\n",
|
| 585 |
+
" \"objects.piercings\": \"piercings\",\n",
|
| 586 |
+
" \"objects.sex_objects\": \"sex_objects\",\n",
|
| 587 |
+
"\n",
|
| 588 |
+
" # Creatures\n",
|
| 589 |
+
" \"creatures\": \"creatures\",\n",
|
| 590 |
+
" \"creatures.animals\": \"animals\",\n",
|
| 591 |
+
" \"creatures.animals.birds\": \"birds\",\n",
|
| 592 |
+
" \"creatures.animals.cats\": \"cats\",\n",
|
| 593 |
+
" \"creatures.animals.dogs\": \"dogs\",\n",
|
| 594 |
+
" \"creatures.legendary_creatures\": \"legendary_creatures\",\n",
|
| 595 |
+
"\n",
|
| 596 |
+
" # Plants\n",
|
| 597 |
+
" \"plants\": \"plant\",\n",
|
| 598 |
+
" \"plant.plant\": \"plant\",\n",
|
| 599 |
+
" \"plant.tree\": \"tree\",\n",
|
| 600 |
+
" \"plant.flowers\": \"flowers\",\n",
|
| 601 |
+
"\n",
|
| 602 |
+
" # Games\n",
|
| 603 |
+
" \"games\": \"games\",\n",
|
| 604 |
+
" \"games.game_activities\": \"game_activities\",\n",
|
| 605 |
+
" \"games.board_games\": \"board_games\",\n",
|
| 606 |
+
" \"games.sports\": \"sports\",\n",
|
| 607 |
+
" \"games.video_game\": \"video_game\",\n",
|
| 608 |
+
" \"games.fighting_games\": \"fighting_games\",\n",
|
| 609 |
+
"\n",
|
| 610 |
+
" # Real World\n",
|
| 611 |
+
" \"real_world\": \"real_world\",\n",
|
| 612 |
+
" \"real_world.companies_and_brand_names\": \"companies_and_brand_names\",\n",
|
| 613 |
+
" \"real_world.holidays_and_celebrations\": \"holidays_and_celebrations\",\n",
|
| 614 |
+
" \"real_world.jobs\": \"jobs\",\n",
|
| 615 |
+
" \"real_world.locations\": \"locations\",\n",
|
| 616 |
+
" \"real_world.people\": \"people\",\n",
|
| 617 |
+
" \"real_world.real_world_locations\": \"real_world_locations\",\n",
|
| 618 |
+
"\n",
|
| 619 |
+
" # More Categories\n",
|
| 620 |
+
" \"more\": \"more\",\n",
|
| 621 |
+
" \"more.dances\": \"dances\",\n",
|
| 622 |
+
" \"more.family_relationships\": \"family_relationships\",\n",
|
| 623 |
+
" \"more.food_tags\": \"food_tags\",\n",
|
| 624 |
+
" \"more.fire\": \"fire\",\n",
|
| 625 |
+
" \"more.groups\": \"groups\",\n",
|
| 626 |
+
" \"more.phrases\": \"phrases\",\n",
|
| 627 |
+
" \"more.scan\": \"scan\",\n",
|
| 628 |
+
" \"more.subjective\": \"subjective\",\n",
|
| 629 |
+
" \"more.technology\": \"technology\",\n",
|
| 630 |
+
" \"more.verbs_and_gerunds\": \"verbs_and_gerunds\",\n",
|
| 631 |
+
" \"more.water\": \"water\",\n",
|
| 632 |
+
"\n",
|
| 633 |
+
" # Genres of Video Games\n",
|
| 634 |
+
" \"copyrights_artists_projects_and_media\": \"copyrights_artists_projects_and_media\",\n",
|
| 635 |
+
" \"copyrights_artists_projects_and_media.genres_of_video_games\": \"genres_of_video_games\",\n",
|
| 636 |
+
" \"copyrights_artists_projects_and_media.genres_of_video_games.fighting_games\": \"fighting_games\",\n",
|
| 637 |
+
" \"copyrights_artists_projects_and_media.genres_of_video_games.platform_games\": \"platform_games\",\n",
|
| 638 |
+
" \"copyrights_artists_projects_and_media.genres_of_video_games.role-playing_games\": \"role-playing_games\",\n",
|
| 639 |
+
" \"copyrights_artists_projects_and_media.genres_of_video_games.shooter_games\": \"shooter_games\",\n",
|
| 640 |
+
" \"copyrights_artists_projects_and_media.genres_of_video_games.visual_novel_games\": \"visual_novel_games\",\n",
|
| 641 |
+
"\n",
|
| 642 |
+
" \"characters\": \"characters\",\n",
|
| 643 |
+
" \"characters.ace_attorney\": \"ace_attorney_characters\",\n",
|
| 644 |
+
" \"characters.arknights\": \"arknights_characters\",\n",
|
| 645 |
+
" \"characters.atelier\": \"atelier_characters\",\n",
|
| 646 |
+
" \"characters.azur_lane\": \"azur_lane_characters\",\n",
|
| 647 |
+
" \"characters.bleach\": \"bleach_characters\",\n",
|
| 648 |
+
" \"characters.bokujou_monogatari\": \"bokujou_monogatari_characters\",\n",
|
| 649 |
+
" \"characters.brave_girl_ravens\": \"brave_girl_ravens_characters\",\n",
|
| 650 |
+
" \"characters.cardcaptor_sakura\": \"cardcaptor_sakura_characters\",\n",
|
| 651 |
+
" \"characters.danganronpa\": \"danganronpa_characters\",\n",
|
| 652 |
+
" \"characters.digimon\": \"digimon\",\n",
|
| 653 |
+
" \"characters.digimon.digimon_characters\": \"digimon_characters\",\n",
|
| 654 |
+
" \"characters.dragon_ball\": \"dragon_ball_characters\",\n",
|
| 655 |
+
" \"characters.dragon_quest\": \"dragon_quest_characters\",\n",
|
| 656 |
+
" \"characters.fate_series\": \"fate_series_characters\",\n",
|
| 657 |
+
" \"characters.final_fantasy\": \"final_fantasy_characters\",\n",
|
| 658 |
+
" \"characters.fire_emblem\": \"fire_emblem_characters\",\n",
|
| 659 |
+
" \"characters.flower_knight_girl\": \"flower_knight_girl_characters\",\n",
|
| 660 |
+
" \"characters.gensou_suikoden\": \"gensou_suikoden_characters\",\n",
|
| 661 |
+
" \"characters.girls_frontline\": \"girls_frontline_characters\",\n",
|
| 662 |
+
" \"characters.girls_und_panzer\": \"girls_und_panzer_characters\",\n",
|
| 663 |
+
" \"characters.gundam_mechas\": \"gundam_mechas\",\n",
|
| 664 |
+
" \"characters.hunter_x_hunter\": \"hunter_x_hunter_characters\",\n",
|
| 665 |
+
" \"characters.jojo_no_kimyou_na_bouken\": \"jojo_no_kimyou_na_bouken_characters\",\n",
|
| 666 |
+
" \"characters.kamen_rider\": \"kamen_rider_characters\",\n",
|
| 667 |
+
" \"characters.kantai_collection\": \"kantai_collection_characters\",\n",
|
| 668 |
+
" \"characters.kingdom_hearts\": \"kingdom_hearts_characters\",\n",
|
| 669 |
+
" \"characters.mahou_sensei_negima\": \"mahou_sensei_negima_characters\",\n",
|
| 670 |
+
" \"characters.meitantei_conan\": \"meitantei_conan_characters\",\n",
|
| 671 |
+
" \"characters.minecraft\": \"minecraft_characters\",\n",
|
| 672 |
+
" \"characters.naruto\": \"naruto_characters\",\n",
|
| 673 |
+
" \"characters.nippon_ichi\": \"nippon_ichi_characters\",\n",
|
| 674 |
+
" \"characters.one_piece\": \"one_piece_characters\",\n",
|
| 675 |
+
" \"characters.oshiro_project\": \"oshiro_project_characters\",\n",
|
| 676 |
+
" \"characters.pokemon\": \"pokemon\",\n",
|
| 677 |
+
" \"characters.pokemon.elite_four_members\": \"elite_four_members\",\n",
|
| 678 |
+
" \"characters.pokemon.gym_leaders\": \"gym_leaders\",\n",
|
| 679 |
+
" \"characters.pokemon.families_of_pokemon_main_characters\": \"families_of_pokemon_main_characters\",\n",
|
| 680 |
+
" \"characters.pokemon.pokemon_ranger_characters\": \"pokemon_ranger_characters\",\n",
|
| 681 |
+
" \"characters.pokemon.pokemon_trainer_classes\": \"pokemon_trainer_classes\",\n",
|
| 682 |
+
" \"characters.pretty_cure\": \"pretty_cure_characters\",\n",
|
| 683 |
+
" \"characters.ragnarok_online\": \"ragnarok_online_characters\",\n",
|
| 684 |
+
" \"characters.rosenkreuzstilette\": \"rosenkreuzstilette_characters\",\n",
|
| 685 |
+
" \"characters.sailor_moon\": \"sailor_moon_characters\",\n",
|
| 686 |
+
" \"characters.street_fighter\": \"street_fighter_characters\",\n",
|
| 687 |
+
" \"characters.super_smash_bros\": \"super_smash_bros_characters\",\n",
|
| 688 |
+
" \"characters.world_witches_series\": \"world_witches_series_characters\",\n",
|
| 689 |
+
" \"characters.tales_of\": \"tales_of_characters\",\n",
|
| 690 |
+
" \"characters.toaru_majutsu_no_index\": \"toaru_majutsu_no_index_characters\",\n",
|
| 691 |
+
" \"characters.touhou\": \"touhou_characters\",\n",
|
| 692 |
+
" \"characters.touken_ranbu\": \"touken_ranbu_characters\",\n",
|
| 693 |
+
" \"characters.ultra_series\": \"ultra_series_characters\",\n",
|
| 694 |
+
" \"characters.umamusume\": \"umamusume_characters\",\n",
|
| 695 |
+
" \"characters.vocaloid\": \"vocaloid_characters\",\n",
|
| 696 |
+
" \"characters.yu_gi_oh\": \"yu_gi_oh_characters\",\n",
|
| 697 |
+
" \"characters.genderswap\": \"genderswap_characters\",\n",
|
| 698 |
+
" \"characters.official_mascots\": \"official_mascots\",\n",
|
| 699 |
+
" \"characters.real_life_racehorses\": \"real_life_racehorses\",\n",
|
| 700 |
+
"\n",
|
| 701 |
+
"\n",
|
| 702 |
+
" # metatags\n",
|
| 703 |
+
" \"metatags\": \"metatags\",\n",
|
| 704 |
+
" \"drawing_software\": \"drawing_software\",\n",
|
| 705 |
+
" \n",
|
| 706 |
+
"}\n"
|
| 707 |
+
]
|
| 708 |
+
},
|
| 709 |
+
{
|
| 710 |
+
"cell_type": "code",
|
| 711 |
+
"execution_count": 10,
|
| 712 |
+
"id": "c4759995-396a-4ec1-9f35-e75b3875e13c",
|
| 713 |
+
"metadata": {
|
| 714 |
+
"execution": {
|
| 715 |
+
"iopub.execute_input": "2025-03-17T12:48:44.138013Z",
|
| 716 |
+
"iopub.status.busy": "2025-03-17T12:48:44.137852Z",
|
| 717 |
+
"iopub.status.idle": "2025-03-17T12:48:44.491853Z",
|
| 718 |
+
"shell.execute_reply": "2025-03-17T12:48:44.491271Z",
|
| 719 |
+
"shell.execute_reply.started": "2025-03-17T12:48:44.137999Z"
|
| 720 |
+
}
|
| 721 |
+
},
|
| 722 |
+
"outputs": [],
|
| 723 |
+
"source": [
|
| 724 |
+
"list_based_categories = [\n",
|
| 725 |
+
" # Image composition style\n",
|
| 726 |
+
" \"style_parodies\",\n",
|
| 727 |
+
" # Objects\n",
|
| 728 |
+
" \"computer\",\n",
|
| 729 |
+
" \"airplanes\",\n",
|
| 730 |
+
" \"armor\",\n",
|
| 731 |
+
" \"ground_vehicles\",\n",
|
| 732 |
+
" \"helicopters\",\n",
|
| 733 |
+
" \"pokemon_objects\",\n",
|
| 734 |
+
" \"ships\",\n",
|
| 735 |
+
" \"weapons\",\n",
|
| 736 |
+
" \"playing_card_faces\",\n",
|
| 737 |
+
" \n",
|
| 738 |
+
" # Creatures\n",
|
| 739 |
+
" \"animals\",\n",
|
| 740 |
+
" \"birds\",\n",
|
| 741 |
+
" \"cats\",\n",
|
| 742 |
+
" \"dogs\",\n",
|
| 743 |
+
" \"legendary_creatures\",\n",
|
| 744 |
+
"\n",
|
| 745 |
+
" # Plants\n",
|
| 746 |
+
" \"plant\",\n",
|
| 747 |
+
" \"tree\",\n",
|
| 748 |
+
" \"flowers\",\n",
|
| 749 |
+
"\n",
|
| 750 |
+
" # Games\n",
|
| 751 |
+
" \"game_activities\",\n",
|
| 752 |
+
" \"board_games\",\n",
|
| 753 |
+
" \"sports\",\n",
|
| 754 |
+
" \"video_game\",\n",
|
| 755 |
+
" \"fighting_games\",\n",
|
| 756 |
+
" \"platform_games\",\n",
|
| 757 |
+
" #\"role-playing_games\",\n",
|
| 758 |
+
" \"shooter_games\",\n",
|
| 759 |
+
" \"visual_novel_games\",\n",
|
| 760 |
+
"\n",
|
| 761 |
+
" # Real World\n",
|
| 762 |
+
" \"companies_and_brand_names\",\n",
|
| 763 |
+
" \"holidays_and_celebrations\",\n",
|
| 764 |
+
" \"jobs\",\n",
|
| 765 |
+
" \"locations\",\n",
|
| 766 |
+
" \"people\",\n",
|
| 767 |
+
" \"real_world_locations\",\n",
|
| 768 |
+
"\n",
|
| 769 |
+
" # More Categories\n",
|
| 770 |
+
" \"dances\",\n",
|
| 771 |
+
" \"family_relationships\",\n",
|
| 772 |
+
" \"food_tags\",\n",
|
| 773 |
+
" \"fire\",\n",
|
| 774 |
+
" \"groups\",\n",
|
| 775 |
+
" \"phrases\",\n",
|
| 776 |
+
" \"scan\",\n",
|
| 777 |
+
" \"subjective\",\n",
|
| 778 |
+
" \"technology\",\n",
|
| 779 |
+
" \"verbs_and_gerunds\",\n",
|
| 780 |
+
" \"water\",\n",
|
| 781 |
+
" \"airplanes\",\n",
|
| 782 |
+
"\n",
|
| 783 |
+
" # Artists\n",
|
| 784 |
+
" \"named_drawfags\",\n",
|
| 785 |
+
" \"pixiv_projects\",\n",
|
| 786 |
+
"\n",
|
| 787 |
+
" # Characters\n",
|
| 788 |
+
" \"ace_attorney_characters\",\n",
|
| 789 |
+
" \"arknights_characters\",\n",
|
| 790 |
+
" \"atelier_characters\",\n",
|
| 791 |
+
" \"azur_lane_characters\",\n",
|
| 792 |
+
" \"bleach_characters\",\n",
|
| 793 |
+
" \"bokujou_monogatari_characters\",\n",
|
| 794 |
+
" \"brave_girl_ravens_characters\",\n",
|
| 795 |
+
" \"cardcaptor_sakura_characters\",\n",
|
| 796 |
+
" \"danganronpa_characters\",\n",
|
| 797 |
+
" \"digimon\",\n",
|
| 798 |
+
" \"digimon_characters\",\n",
|
| 799 |
+
" \"dragon_ball_characters\",\n",
|
| 800 |
+
" \"dragon_quest_characters\",\n",
|
| 801 |
+
" \"fate_series_characters\",\n",
|
| 802 |
+
" \"final_fantasy_characters\",\n",
|
| 803 |
+
" \"fire_emblem_characters\",\n",
|
| 804 |
+
" \"flower_knight_girl_characters\",\n",
|
| 805 |
+
" \"gensou_suikoden_characters\",\n",
|
| 806 |
+
" \"girls_frontline_characters\",\n",
|
| 807 |
+
" \"girls_und_panzer_characters\",\n",
|
| 808 |
+
" \"gundam_mechas\",\n",
|
| 809 |
+
" \"hunter_x_hunter_characters\",\n",
|
| 810 |
+
" \"jojo_no_kimyou_na_bouken_characters\",\n",
|
| 811 |
+
" \"kamen_rider_characters\",\n",
|
| 812 |
+
" \"kantai_collection_characters\",\n",
|
| 813 |
+
" \"kingdom_hearts_characters\",\n",
|
| 814 |
+
" \"mahou_sensei_negima_characters\",\n",
|
| 815 |
+
" \"meitantei_conan_characters\",\n",
|
| 816 |
+
" \"minecraft_characters\",\n",
|
| 817 |
+
" \"naruto_characters\",\n",
|
| 818 |
+
" \"nippon_ichi_characters\",\n",
|
| 819 |
+
" \"one_piece_characters\",\n",
|
| 820 |
+
" \"oshiro_project_characters\",\n",
|
| 821 |
+
" \"pokemon_characters\",\n",
|
| 822 |
+
" \"elite_four_members\",\n",
|
| 823 |
+
" \"gym_leaders\",\n",
|
| 824 |
+
" \"families_of_pokemon_main_characters\",\n",
|
| 825 |
+
" \"pokemon_ranger_characters\",\n",
|
| 826 |
+
" \"pokemon_trainer_classes\",\n",
|
| 827 |
+
" \"pokemon\",\n",
|
| 828 |
+
" \"pretty_cure_characters\",\n",
|
| 829 |
+
" \"ragnarok_online_characters\",\n",
|
| 830 |
+
" \"rosenkreuzstilette_characters\",\n",
|
| 831 |
+
" \"sailor_moon_characters\",\n",
|
| 832 |
+
" \"street_fighter_characters\",\n",
|
| 833 |
+
" \"super_smash_bros_characters\",\n",
|
| 834 |
+
" \"world_witches_series_characters\",\n",
|
| 835 |
+
" \"tales_of_characters\",\n",
|
| 836 |
+
" \"toaru_majutsu_no_index_characters\",\n",
|
| 837 |
+
" \"touhou_characters\",\n",
|
| 838 |
+
" \"touken_ranbu_characters\",\n",
|
| 839 |
+
" \"ultra_series_characters\",\n",
|
| 840 |
+
" \"umamusume_characters\",\n",
|
| 841 |
+
" \"vocaloid_characters\",\n",
|
| 842 |
+
" \"yu_gi_oh_characters\",\n",
|
| 843 |
+
" \"genderswap_characters\",\n",
|
| 844 |
+
" \"official_mascots\",\n",
|
| 845 |
+
" \"real_life_racehorses\",\n",
|
| 846 |
+
"\n",
|
| 847 |
+
"\n",
|
| 848 |
+
" \n",
|
| 849 |
+
" # Other Lists\n",
|
| 850 |
+
" \"disambiguation_pages\",\n",
|
| 851 |
+
" \"magazine_publications\",\n",
|
| 852 |
+
" \"special_moves\",\n",
|
| 853 |
+
" \"uniforms\",\n",
|
| 854 |
+
" \"pokemon_media\",\n",
|
| 855 |
+
" \"tagged_songs\",\n",
|
| 856 |
+
" \"vocaloid_derivatives\",\n",
|
| 857 |
+
" \"vocaloid_songs\",\n",
|
| 858 |
+
" \"vocal_synthesizers\",\n",
|
| 859 |
+
" \"vocal_synth_derivatives\",\n",
|
| 860 |
+
" \"vocal_synth_songs\",\n",
|
| 861 |
+
" \"deemo_songs\",\n",
|
| 862 |
+
" #\"role_playing_games\",\n",
|
| 863 |
+
"\n",
|
| 864 |
+
" # Metatags\n",
|
| 865 |
+
" #\"metatags\",\n",
|
| 866 |
+
" #\"drawing_software\",\n",
|
| 867 |
+
"\n",
|
| 868 |
+
" # Pool Groups & Meta-Wikis\n",
|
| 869 |
+
" \"meta_wikis\"\n",
|
| 870 |
+
"]\n"
|
| 871 |
+
]
|
| 872 |
+
},
|
| 873 |
+
{
|
| 874 |
+
"cell_type": "code",
|
| 875 |
+
"execution_count": 11,
|
| 876 |
+
"id": "8f8b6b29-e7bd-4814-8705-2b7a68f2d660",
|
| 877 |
+
"metadata": {
|
| 878 |
+
"execution": {
|
| 879 |
+
"iopub.execute_input": "2025-03-17T12:48:52.253272Z",
|
| 880 |
+
"iopub.status.busy": "2025-03-17T12:48:52.252077Z",
|
| 881 |
+
"iopub.status.idle": "2025-03-17T12:48:52.256578Z",
|
| 882 |
+
"shell.execute_reply": "2025-03-17T12:48:52.256016Z",
|
| 883 |
+
"shell.execute_reply.started": "2025-03-17T12:48:52.253248Z"
|
| 884 |
+
}
|
| 885 |
+
},
|
| 886 |
+
"outputs": [],
|
| 887 |
+
"source": [
|
| 888 |
+
"special_wiki_pages = [\n",
|
| 889 |
+
" \"plant\", \"tree\", \"computer\", \"on_object\", \"injury\", \"swimsuit\", \"on\" , \"mask\" # Add more here if needed\n",
|
| 890 |
+
"]"
|
| 891 |
+
]
|
| 892 |
+
},
|
| 893 |
+
{
|
| 894 |
+
"cell_type": "code",
|
| 895 |
+
"execution_count": 12,
|
| 896 |
+
"id": "79234bc5-b287-4306-85cd-a27ddba769ea",
|
| 897 |
+
"metadata": {
|
| 898 |
+
"execution": {
|
| 899 |
+
"iopub.execute_input": "2025-03-17T12:48:55.161786Z",
|
| 900 |
+
"iopub.status.busy": "2025-03-17T12:48:55.160800Z",
|
| 901 |
+
"iopub.status.idle": "2025-03-17T12:48:55.165137Z",
|
| 902 |
+
"shell.execute_reply": "2025-03-17T12:48:55.164568Z",
|
| 903 |
+
"shell.execute_reply.started": "2025-03-17T12:48:55.161760Z"
|
| 904 |
+
}
|
| 905 |
+
},
|
| 906 |
+
"outputs": [],
|
| 907 |
+
"source": [
|
| 908 |
+
"import base64\n",
|
| 909 |
+
"HEADERS = {\n",
|
| 910 |
+
" \"Authorization\": f\"Basic {base64.b64encode(f'{username}:{api_key}'.encode()).decode()}\"\n",
|
| 911 |
+
"}"
|
| 912 |
+
]
|
| 913 |
+
},
|
| 914 |
+
{
|
| 915 |
+
"cell_type": "code",
|
| 916 |
+
"execution_count": 13,
|
| 917 |
+
"id": "64d47678-78a4-4d55-a3a0-1c1b251b03b6",
|
| 918 |
+
"metadata": {
|
| 919 |
+
"execution": {
|
| 920 |
+
"iopub.execute_input": "2025-03-17T12:40:12.358876Z",
|
| 921 |
+
"iopub.status.busy": "2025-03-17T12:40:12.358573Z",
|
| 922 |
+
"iopub.status.idle": "2025-03-17T12:40:43.547891Z",
|
| 923 |
+
"shell.execute_reply": "2025-03-17T12:40:43.547451Z",
|
| 924 |
+
"shell.execute_reply.started": "2025-03-17T12:40:12.358851Z"
|
| 925 |
+
}
|
| 926 |
+
},
|
| 927 |
+
"outputs": [
|
| 928 |
+
{
|
| 929 |
+
"name": "stdout",
|
| 930 |
+
"output_type": "stream",
|
| 931 |
+
"text": [
|
| 932 |
+
"🚀 Starting Danbooru Tag Hierarchy API Fetcher...\n",
|
| 933 |
+
"❌ No data found for attire_and_body_accessories in any format\n",
|
| 934 |
+
"✅ Added 91 tags directly under attire_and_body_accessories.attire.dress\n",
|
| 935 |
+
"✅ Added 100 tags directly under attire_and_body_accessories.attire.handwear\n",
|
| 936 |
+
"✅ Added 251 tags directly under attire_and_body_accessories.attire.headwear\n",
|
| 937 |
+
"✅ Added 69 tags directly under attire_and_body_accessories.attire.legwear\n",
|
| 938 |
+
"✅ Added 64 tags directly under attire_and_body_accessories.attire.mask\n",
|
| 939 |
+
"✅ Added 264 tags directly under attire_and_body_accessories.attire.neck_and_neckwear\n",
|
| 940 |
+
"✅ Added 54 tags directly under attire_and_body_accessories.attire.sexual_attire.bra\n",
|
| 941 |
+
"✅ Added 83 tags directly under attire_and_body_accessories.attire.sexual_attire.panties\n",
|
| 942 |
+
"✅ Added 87 tags directly under attire_and_body_accessories.attire.sleeves\n",
|
| 943 |
+
"✅ Added 83 tags directly under attire_and_body_accessories.attire.swimsuit\n",
|
| 944 |
+
"✅ Added 4 tags directly under attire_and_body_accessories.embellishment\n",
|
| 945 |
+
"✅ Added 94 tags directly under attire_and_body_accessories.eyewear\n",
|
| 946 |
+
"✅ Added 62 tags directly under attire_and_body_accessories.fashion_style\n",
|
| 947 |
+
"✅ Added 189 tags directly under attire_and_body_accessories.nudity\n",
|
| 948 |
+
"❌ No data found for body in any format\n",
|
| 949 |
+
"✅ Added 40 tags directly under body.body_parts.ass\n",
|
| 950 |
+
"✅ Added 213 tags directly under body.body_parts.breasts_tags\n",
|
| 951 |
+
"✅ Added 54 tags directly under body.body_parts.ears_tags\n",
|
| 952 |
+
"✅ Added 200 tags directly under body.body_parts.face_tags.eyes_tags\n",
|
| 953 |
+
"✅ Added 31 tags directly under body.body_parts.hair.hair_color\n",
|
| 954 |
+
"✅ Added 157 tags directly under body.body_parts.hair.hair_styles\n",
|
| 955 |
+
"✅ Added 117 tags directly under body.body_parts.hands.gestures\n",
|
| 956 |
+
"✅ Added 264 tags directly under body.body_parts.neck_and_neckwear\n",
|
| 957 |
+
"✅ Added 74 tags directly under body.body_parts.penis\n",
|
| 958 |
+
"✅ Added 230 tags directly under body.body_parts.posture\n",
|
| 959 |
+
"✅ Added 50 tags directly under body.body_parts.pussy\n",
|
| 960 |
+
"✅ Added 64 tags directly under body.body_parts.shoulders\n",
|
| 961 |
+
"✅ Added 26 tags directly under body.body_parts.skin_color\n",
|
| 962 |
+
"✅ Added 79 tags directly under body.body_parts.tail\n",
|
| 963 |
+
"✅ Added 90 tags directly under body.body_parts.wings\n",
|
| 964 |
+
"✅ Added 55 tags directly under body.injury\n",
|
| 965 |
+
"❌ No data found for characters in any format\n",
|
| 966 |
+
"✅ Added 300 tags directly under characters.ace_attorney\n",
|
| 967 |
+
"✅ Added 579 tags directly under characters.arknights\n",
|
| 968 |
+
"✅ Added 296 tags directly under characters.atelier\n",
|
| 969 |
+
"✅ Added 706 tags directly under characters.azur_lane\n",
|
| 970 |
+
"✅ Added 223 tags directly under characters.bleach\n",
|
| 971 |
+
"✅ Added 98 tags directly under characters.bokujou_monogatari\n",
|
| 972 |
+
"✅ Added 46 tags directly under characters.brave_girl_ravens\n",
|
| 973 |
+
"✅ Added 144 tags directly under characters.cardcaptor_sakura\n",
|
| 974 |
+
"✅ Added 128 tags directly under characters.danganronpa\n",
|
| 975 |
+
"✅ Added 240 tags directly under characters.digimon.digimon_characters\n",
|
| 976 |
+
"✅ Added 613 tags directly under characters.dragon_ball\n",
|
| 977 |
+
"✅ Added 331 tags directly under characters.dragon_quest\n",
|
| 978 |
+
"✅ Added 814 tags directly under characters.fate_series\n",
|
| 979 |
+
"✅ Added 566 tags directly under characters.final_fantasy\n",
|
| 980 |
+
"✅ Added 727 tags directly under characters.fire_emblem\n",
|
| 981 |
+
"✅ Added 441 tags directly under characters.flower_knight_girl\n",
|
| 982 |
+
"✅ Added 221 tags directly under characters.genderswap\n",
|
| 983 |
+
"✅ Added 431 tags directly under characters.gensou_suikoden\n",
|
| 984 |
+
"❌ No data found for girls_frontline_characters in any format\n",
|
| 985 |
+
"✅ Added 0 tags directly under characters.girls_frontline\n",
|
| 986 |
+
"✅ Added 305 tags directly under characters.girls_und_panzer\n",
|
| 987 |
+
"✅ Added 267 tags directly under characters.gundam_mechas\n",
|
| 988 |
+
"✅ Added 313 tags directly under characters.hunter_x_hunter\n",
|
| 989 |
+
"✅ Added 348 tags directly under characters.jojo_no_kimyou_na_bouken\n",
|
| 990 |
+
"✅ Added 308 tags directly under characters.kamen_rider\n",
|
| 991 |
+
"✅ Added 465 tags directly under characters.kantai_collection\n",
|
| 992 |
+
"✅ Added 64 tags directly under characters.kingdom_hearts\n",
|
| 993 |
+
"✅ Added 72 tags directly under characters.mahou_sensei_negima\n",
|
| 994 |
+
"✅ Added 158 tags directly under characters.meitantei_conan\n",
|
| 995 |
+
"✅ Added 153 tags directly under characters.minecraft\n",
|
| 996 |
+
"✅ Added 198 tags directly under characters.naruto\n",
|
| 997 |
+
"✅ Added 336 tags directly under characters.nippon_ichi\n",
|
| 998 |
+
"✅ Added 375 tags directly under characters.official_mascots\n",
|
| 999 |
+
"✅ Added 464 tags directly under characters.one_piece\n",
|
| 1000 |
+
"✅ Added 305 tags directly under characters.oshiro_project\n",
|
| 1001 |
+
"✅ Added 72 tags directly under characters.pokemon.elite_four_members\n",
|
| 1002 |
+
"✅ Added 165 tags directly under characters.pokemon.families_of_pokemon_main_characters\n",
|
| 1003 |
+
"✅ Added 100 tags directly under characters.pokemon.gym_leaders\n",
|
| 1004 |
+
"✅ Added 37 tags directly under characters.pokemon.pokemon_ranger_characters\n",
|
| 1005 |
+
"✅ Added 149 tags directly under characters.pokemon.pokemon_trainer_classes\n",
|
| 1006 |
+
"✅ Added 925 tags directly under characters.pretty_cure\n",
|
| 1007 |
+
"✅ Added 1214 tags directly under characters.ragnarok_online\n",
|
| 1008 |
+
"✅ Added 434 tags directly under characters.real_life_racehorses\n",
|
| 1009 |
+
"✅ Added 23 tags directly under characters.rosenkreuzstilette\n",
|
| 1010 |
+
"✅ Added 320 tags directly under characters.sailor_moon\n",
|
| 1011 |
+
"✅ Added 170 tags directly under characters.street_fighter\n",
|
| 1012 |
+
"❌ No data found for super_smash_bros_characters in any format\n",
|
| 1013 |
+
"✅ Added 0 tags directly under characters.super_smash_bros\n",
|
| 1014 |
+
"❌ No data found for tales_of_characters in any format\n",
|
| 1015 |
+
"✅ Added 0 tags directly under characters.tales_of\n",
|
| 1016 |
+
"✅ Added 189 tags directly under characters.toaru_majutsu_no_index\n",
|
| 1017 |
+
"✅ Added 381 tags directly under characters.touhou\n",
|
| 1018 |
+
"✅ Added 128 tags directly under characters.touken_ranbu\n",
|
| 1019 |
+
"✅ Added 540 tags directly under characters.ultra_series\n",
|
| 1020 |
+
"✅ Added 250 tags directly under characters.umamusume\n",
|
| 1021 |
+
"✅ Added 140 tags directly under characters.vocaloid\n",
|
| 1022 |
+
"✅ Added 184 tags directly under characters.world_witches_series\n",
|
| 1023 |
+
"❌ No data found for yu_gi_oh_characters in any format\n",
|
| 1024 |
+
"✅ Added 0 tags directly under characters.yu_gi_oh\n",
|
| 1025 |
+
"❌ No data found for copyrights_artists_projects_and_media in any format\n",
|
| 1026 |
+
"❌ No data found for genres_of_video_games in any format\n",
|
| 1027 |
+
"✅ Added 175 tags directly under copyrights_artists_projects_and_media.genres_of_video_games.fighting_games\n",
|
| 1028 |
+
"✅ Added 42 tags directly under copyrights_artists_projects_and_media.genres_of_video_games.platform_games\n",
|
| 1029 |
+
"✅ Added 515 tags directly under copyrights_artists_projects_and_media.genres_of_video_games.role-playing_games\n",
|
| 1030 |
+
"✅ Added 114 tags directly under copyrights_artists_projects_and_media.genres_of_video_games.shooter_games\n",
|
| 1031 |
+
"✅ Added 185 tags directly under copyrights_artists_projects_and_media.genres_of_video_games.visual_novel_games\n",
|
| 1032 |
+
"❌ No data found for creatures in any format\n",
|
| 1033 |
+
"✅ Added 232 tags directly under creatures.animals.birds\n",
|
| 1034 |
+
"✅ Added 224 tags directly under creatures.animals.cats\n",
|
| 1035 |
+
"✅ Added 237 tags directly under creatures.animals.dogs\n",
|
| 1036 |
+
"✅ Added 285 tags directly under creatures.legendary_creatures\n",
|
| 1037 |
+
"✅ Added 55 tags directly under drawing_software\n",
|
| 1038 |
+
"❌ No data found for games in any format\n",
|
| 1039 |
+
"✅ Added 21 tags directly under games.board_games\n",
|
| 1040 |
+
"✅ Added 175 tags directly under games.fighting_games\n",
|
| 1041 |
+
"✅ Added 61 tags directly under games.game_activities\n",
|
| 1042 |
+
"✅ Added 420 tags directly under games.sports\n",
|
| 1043 |
+
"✅ Added 144 tags directly under games.video_game\n",
|
| 1044 |
+
"✅ Added 213 tags directly under metatags\n",
|
| 1045 |
+
"❌ No data found for more in any format\n",
|
| 1046 |
+
"✅ Added 110 tags directly under more.dances\n",
|
| 1047 |
+
"✅ Added 29 tags directly under more.family_relationships\n",
|
| 1048 |
+
"✅ Added 68 tags directly under more.fire\n",
|
| 1049 |
+
"✅ Added 1051 tags directly under more.food_tags\n",
|
| 1050 |
+
"✅ Added 31 tags directly under more.groups\n",
|
| 1051 |
+
"✅ Added 76 tags directly under more.phrases\n",
|
| 1052 |
+
"✅ Added 35 tags directly under more.scan\n",
|
| 1053 |
+
"✅ Added 45 tags directly under more.subjective\n",
|
| 1054 |
+
"✅ Added 237 tags directly under more.technology\n",
|
| 1055 |
+
"✅ Added 446 tags directly under more.verbs_and_gerunds\n",
|
| 1056 |
+
"✅ Added 54 tags directly under more.water\n",
|
| 1057 |
+
"❌ No data found for objects in any format\n",
|
| 1058 |
+
"✅ Added 410 tags directly under objects.airplanes\n",
|
| 1059 |
+
"✅ Added 145 tags directly under objects.armor\n",
|
| 1060 |
+
"✅ Added 377 tags directly under objects.audio_tags\n",
|
| 1061 |
+
"✅ Added 66 tags directly under objects.cards.playing_card_faces\n",
|
| 1062 |
+
"✅ Added 80 tags directly under objects.computer\n",
|
| 1063 |
+
"✅ Added 94 tags directly under objects.eyewear\n",
|
| 1064 |
+
"✅ Added 407 tags directly under objects.ground_vehicles\n",
|
| 1065 |
+
"✅ Added 25 tags directly under objects.helicopters\n",
|
| 1066 |
+
"✅ Added 48 tags directly under objects.piercings\n",
|
| 1067 |
+
"✅ Added 94 tags directly under objects.pokemon_objects\n",
|
| 1068 |
+
"✅ Added 105 tags directly under objects.sex_objects\n",
|
| 1069 |
+
"✅ Added 256 tags directly under objects.ships\n",
|
| 1070 |
+
"✅ Added 917 tags directly under objects.weapons\n",
|
| 1071 |
+
"✅ Added 21 tags directly under plant.flowers\n",
|
| 1072 |
+
"✅ Added 59 tags directly under plant.plant\n",
|
| 1073 |
+
"✅ Added 44 tags directly under plant.tree\n",
|
| 1074 |
+
"✅ Added 59 tags directly under plants\n",
|
| 1075 |
+
"❌ No data found for real_world in any format\n",
|
| 1076 |
+
"✅ Added 363 tags directly under real_world.companies_and_brand_names\n",
|
| 1077 |
+
"✅ Added 138 tags directly under real_world.holidays_and_celebrations\n",
|
| 1078 |
+
"✅ Added 75 tags directly under real_world.jobs\n",
|
| 1079 |
+
"✅ Added 263 tags directly under real_world.locations\n",
|
| 1080 |
+
"✅ Added 1526 tags directly under real_world.people\n",
|
| 1081 |
+
"✅ Added 463 tags directly under real_world.real_world_locations\n",
|
| 1082 |
+
"❌ No data found for sex in any format\n",
|
| 1083 |
+
"✅ Added 16 tags directly under sex.sex_acts.simulated_sex_acts\n",
|
| 1084 |
+
"✅ Added 56 tags directly under sex.sexual_positions\n",
|
| 1085 |
+
"❌ No data found for subject in any format\n",
|
| 1086 |
+
"❌ No data found for anthro in any format\n",
|
| 1087 |
+
"❌ No data found for cat_boy in any format\n",
|
| 1088 |
+
"✅ Added 0 tags directly under subject.anthro.cat_boy\n",
|
| 1089 |
+
"❌ No data found for cat_girl in any format\n",
|
| 1090 |
+
"✅ Added 0 tags directly under subject.anthro.cat_girl\n",
|
| 1091 |
+
"❌ No data found for dog_girl in any format\n",
|
| 1092 |
+
"✅ Added 0 tags directly under subject.anthro.dog_girl\n",
|
| 1093 |
+
"❌ No data found for fox_girl in any format\n",
|
| 1094 |
+
"✅ Added 0 tags directly under subject.anthro.fox_girl\n",
|
| 1095 |
+
"❌ No data found for furry in any format\n",
|
| 1096 |
+
"✅ Added 0 tags directly under subject.anthro.furry\n",
|
| 1097 |
+
"❌ No data found for female in any format\n",
|
| 1098 |
+
"❌ No data found for 1girl in any format\n",
|
| 1099 |
+
"✅ Added 0 tags directly under subject.female.1girl\n",
|
| 1100 |
+
"❌ No data found for 2girls in any format\n",
|
| 1101 |
+
"✅ Added 0 tags directly under subject.female.2girls\n",
|
| 1102 |
+
"❌ No data found for female_general in any format\n",
|
| 1103 |
+
"✅ Added 0 tags directly under subject.female_general\n",
|
| 1104 |
+
"❌ No data found for koma in any format\n",
|
| 1105 |
+
"❌ No data found for 1koma in any format\n",
|
| 1106 |
+
"✅ Added 0 tags directly under subject.koma.1koma\n",
|
| 1107 |
+
"❌ No data found for 2koma in any format\n",
|
| 1108 |
+
"✅ Added 0 tags directly under subject.koma.2koma\n",
|
| 1109 |
+
"❌ No data found for male in any format\n",
|
| 1110 |
+
"❌ No data found for 1boy in any format\n",
|
| 1111 |
+
"✅ Added 0 tags directly under subject.male.1boy\n",
|
| 1112 |
+
"❌ No data found for 2boys in any format\n",
|
| 1113 |
+
"✅ Added 0 tags directly under subject.male.2boys\n",
|
| 1114 |
+
"❌ No data found for 3boys in any format\n",
|
| 1115 |
+
"✅ Added 0 tags directly under subject.male.3boys\n",
|
| 1116 |
+
"❌ No data found for 4boys in any format\n",
|
| 1117 |
+
"✅ Added 0 tags directly under subject.male.4boys\n",
|
| 1118 |
+
"❌ No data found for 5boys in any format\n",
|
| 1119 |
+
"✅ Added 0 tags directly under subject.male.5boys\n",
|
| 1120 |
+
"❌ No data found for 6+boys in any format\n",
|
| 1121 |
+
"✅ Added 0 tags directly under subject.male.6+boys\n",
|
| 1122 |
+
"❌ No data found for boy in any format\n",
|
| 1123 |
+
"✅ Added 0 tags directly under subject.male.boy\n",
|
| 1124 |
+
"❌ No data found for man in any format\n",
|
| 1125 |
+
"✅ Added 0 tags directly under subject.male.man\n",
|
| 1126 |
+
"❌ No data found for multiple_boys in any format\n",
|
| 1127 |
+
"✅ Added 0 tags directly under subject.male.multiple_boys\n",
|
| 1128 |
+
"❌ No data found for visual_characteristics in any format\n",
|
| 1129 |
+
"❌ No data found for image_composition_and_style in any format\n",
|
| 1130 |
+
"✅ Added 73 tags directly under visual_characteristics.image_composition_and_style.artistic_license\n",
|
| 1131 |
+
"✅ Added 115 tags directly under visual_characteristics.image_composition_and_style.image_composition.backgrounds\n",
|
| 1132 |
+
"✅ Added 90 tags directly under visual_characteristics.image_composition_and_style.image_composition.censorship\n",
|
| 1133 |
+
"✅ Added 54 tags directly under visual_characteristics.image_composition_and_style.image_composition.colors\n",
|
| 1134 |
+
"✅ Added 28 tags directly under visual_characteristics.image_composition_and_style.image_composition.focus_tags\n",
|
| 1135 |
+
"✅ Added 55 tags directly under visual_characteristics.image_composition_and_style.image_composition.lighting\n",
|
| 1136 |
+
"✅ Added 76 tags directly under visual_characteristics.image_composition_and_style.image_composition.prints\n",
|
| 1137 |
+
"✅ Added 495 tags directly under visual_characteristics.image_composition_and_style.image_composition.style_parodies\n",
|
| 1138 |
+
"✅ Added 41 tags directly under visual_characteristics.image_composition_and_style.patterns\n",
|
| 1139 |
+
"✅ Added 310 tags directly under visual_characteristics.image_composition_and_style.symbols\n",
|
| 1140 |
+
"✅ Added 242 tags directly under visual_characteristics.image_composition_and_style.text\n",
|
| 1141 |
+
"✅ Added 62 tags directly under visual_characteristics.image_composition_and_style.year_tags\n",
|
| 1142 |
+
"✅ Finished building hierarchy.\n"
|
| 1143 |
+
]
|
| 1144 |
+
},
|
| 1145 |
+
{
|
| 1146 |
+
"ename": "FileNotFoundError",
|
| 1147 |
+
"evalue": "[Errno 2] No such file or directory: '/home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/misc/danbooru_donmai/danbooru_tags_step_01.json'",
|
| 1148 |
+
"output_type": "error",
|
| 1149 |
+
"traceback": [
|
| 1150 |
+
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
| 1151 |
+
"\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)",
|
| 1152 |
+
"Cell \u001b[0;32mIn[13], line 198\u001b[0m\n\u001b[1;32m 196\u001b[0m \u001b[38;5;66;03m# Save cleaned JSON\u001b[39;00m\n\u001b[1;32m 197\u001b[0m output_file \u001b[38;5;241m=\u001b[39m current_dir\u001b[38;5;241m.\u001b[39mparent \u001b[38;5;241m/\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mmisc/danbooru_donmai/danbooru_tags_step_01.json\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m--> 198\u001b[0m \u001b[38;5;28;01mwith\u001b[39;00m \u001b[38;5;28;43mopen\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43moutput_file\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mw\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mencoding\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mutf-8\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m)\u001b[49m \u001b[38;5;28;01mas\u001b[39;00m f:\n\u001b[1;32m 199\u001b[0m json\u001b[38;5;241m.\u001b[39mdump(manual_hierarchy, f, indent\u001b[38;5;241m=\u001b[39m\u001b[38;5;241m4\u001b[39m, ensure_ascii\u001b[38;5;241m=\u001b[39m\u001b[38;5;28;01mFalse\u001b[39;00m)\n\u001b[1;32m 201\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124m✅ Hierarchy saved to `\u001b[39m\u001b[38;5;132;01m{\u001b[39;00moutput_file\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m`\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n",
|
| 1153 |
+
"File \u001b[0;32m~/.local/lib/python3.10/site-packages/IPython/core/interactiveshell.py:324\u001b[0m, in \u001b[0;36m_modified_open\u001b[0;34m(file, *args, **kwargs)\u001b[0m\n\u001b[1;32m 317\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m file \u001b[38;5;129;01min\u001b[39;00m {\u001b[38;5;241m0\u001b[39m, \u001b[38;5;241m1\u001b[39m, \u001b[38;5;241m2\u001b[39m}:\n\u001b[1;32m 318\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mValueError\u001b[39;00m(\n\u001b[1;32m 319\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mIPython won\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mt let you open fd=\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mfile\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m by default \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 320\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mas it is likely to crash IPython. If you know what you are doing, \u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 321\u001b[0m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124myou can use builtins\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m open.\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 322\u001b[0m )\n\u001b[0;32m--> 324\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mio_open\u001b[49m\u001b[43m(\u001b[49m\u001b[43mfile\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n",
|
| 1154 |
+
"\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] No such file or directory: '/home/lauhp/000_PHD/000_010_PUBLICATION/2025_SAGE/CODE/pm-paper_uzh_gitlab/pm-paper/misc/danbooru_donmai/danbooru_tags_step_01.json'"
|
| 1155 |
+
]
|
| 1156 |
+
}
|
| 1157 |
+
],
|
| 1158 |
+
"source": [
|
| 1159 |
+
"import requests\n",
"import json\n",
"import time\n",
"import re\n",
"from urllib.parse import quote\n",
"\n",
"# Base API URL\n",
"BASE_URL = \"https://danbooru.donmai.us\"\n",
"\n",
"# Authentication (Replace with your credentials)\n",
"AUTH = (username, api_key)\n",
"\n",
"# Storage for missing tags\n",
"missing_tags = []\n",
"\n",
"# ✅ Initialize manual_hierarchy\n",
"manual_hierarchy = {}\n",
"\n",
"def clean_hierarchy_and_count_tags(hierarchy):\n",
"    \"\"\"\n",
"    Recursively removes empty categories from the JSON hierarchy and counts total tags.\n",
"    \"\"\"\n",
"    total_tag_count = 0\n",
"\n",
"    def clean_dict(d):\n",
"        nonlocal total_tag_count\n",
"        keys_to_delete = []\n",
"\n",
"        for key, value in d.items():\n",
"            if isinstance(value, dict):\n",
"                clean_dict(value)  # Recursively clean subcategories\n",
"\n",
"                # ✅ If the subcategory is empty after cleaning, mark it for removal\n",
"                if not value:\n",
"                    keys_to_delete.append(key)\n",
"            elif isinstance(value, list):\n",
"                # ✅ Count the total number of tags\n",
"                total_tag_count += len(value)\n",
"\n",
"                # ✅ If the list is empty, mark it for deletion\n",
"                if not value:\n",
"                    keys_to_delete.append(key)\n",
"\n",
"        # ✅ Remove empty keys\n",
"        for key in keys_to_delete:\n",
"            del d[key]\n",
"\n",
"    clean_dict(hierarchy)\n",
"\n",
"    return hierarchy, total_tag_count\n",
"\n",
"\n",
"def fetch_wiki_page(tag, is_list=False):\n",
"    \"\"\"\n",
"    Fetches the wiki page of a tag group, list category, or special wiki page using the API.\n",
"    Uses list_of_ for lists, tag_group: for regular groups,\n",
"    and direct queries for special cases.\n",
"    \"\"\"\n",
"    if tag in special_wiki_pages:\n",
"        prefixes = [\"\"]  # Query directly with no prefix\n",
"        #print(f\"🔍 {tag} is a special wiki page, querying directly...\")\n",
"    elif is_list:\n",
"        prefixes = [\"list_of_\", \"tag_group:\"]\n",
"    else:\n",
"        prefixes = [\"tag_group:\"]\n",
"\n",
"    # print(f\"🚀 Fetching {prefixes} {tag}\") # Debugging print\n",
"\n",
"    for prefix in prefixes:\n",
"        query_tag = f\"{prefix}{tag}\".strip()  # Avoid stray whitespace\n",
"        encoded_query = quote(query_tag, safe=\"\")  # Proper URL encoding\n",
"        url = f\"{BASE_URL}/wiki_pages.json?search[title]={encoded_query}\"\n",
"\n",
"        # print(f\"🚀 Fetching: {query_tag}\")\n",
"\n",
"        try:\n",
"            # Use authentication; the 30 s timeout is an assumed value so a stalled request cannot hang the cell\n",
"            response = requests.get(url, auth=AUTH, timeout=30)\n",
"            response.raise_for_status()  # Raise error for bad responses (4xx, 5xx)\n",
"            wiki_data = response.json()\n",
"\n",
"            if isinstance(wiki_data, list) and wiki_data:\n",
"                #print(f\"✅ Data found using {query_tag}\")\n",
"                return wiki_data[0].get(\"body\", \"\")  # Extract the \"body\" text\n",
"\n",
"        except requests.exceptions.HTTPError as e:\n",
"            if response.status_code == 401:\n",
"                print(f\"❌ Authentication Error (401 Unauthorized) for {query_tag}. Check your API key!\")\n",
"                raise  # Stop execution if authentication fails (re-raises instead of calling exit() inside a notebook)\n",
"            else:\n",
"                print(f\"❌ Error fetching {query_tag}: {e}\")\n",
"\n",
"    # If all attempts fail\n",
"    print(f\"❌ No data found for {tag} in any format\")\n",
"    missing_tags.append(tag)\n",
"    return None\n",
"\n",
"\n",
"def build_tag_hierarchy(tag_groups):\n",
"    \"\"\"\n",
"    Builds and structures the hierarchy properly, ensuring:\n",
"    - Categories with subcategories store their direct tags in `_general`\n",
"    - Parent categories exist before adding children\n",
"    - Ensures subject and other key groups are correctly retained\n",
"    \"\"\"\n",
"    processed_groups = set()\n",
"\n",
"    for hierarchy_path, tag_group in sorted(tag_groups.items(), key=lambda x: x[0]):\n",
"        is_list = tag_group in list_based_categories  # Check if it's a list-based category\n",
"\n",
"        levels = hierarchy_path.split(\".\")\n",
"        current_level = manual_hierarchy\n",
"\n",
"        for key in levels[:-1]:  # Ensure each parent level exists\n",
"            if key not in current_level or not isinstance(current_level[key], dict):\n",
"                current_level[key] = {}  # Create dictionary if missing\n",
"            current_level = current_level[key]\n",
"\n",
"        last_level = levels[-1]\n",
"        has_subcategories = any(k.startswith(hierarchy_path + \".\") for k in tag_groups.keys())\n",
"\n",
"        # ✅ Ensure the category itself exists\n",
"        if last_level not in current_level:\n",
"            current_level[last_level] = {} if has_subcategories else []\n",
"\n",
"        # ✅ Fetch tags from API\n",
"        wiki_text = fetch_wiki_page(tag_group, is_list)\n",
"        extracted_tags = extract_tags_from_wiki(wiki_text, is_list)\n",
"\n",
"        # ✅ Store tags under `<category>_general` if there are subcategories\n",
"        if has_subcategories:\n",
"            general_key = f\"{last_level}_general\"\n",
"\n",
"            if isinstance(current_level[last_level], list):\n",
"                # print(f\"⚠️ Warning: {last_level} was a list but has subcategories. Converting to dictionary.\")\n",
"                current_level[last_level] = {}\n",
"\n",
"            if general_key not in current_level[last_level]:\n",
"                current_level[last_level][general_key] = []  # Initialize `_general` list\n",
"\n",
"            current_level[last_level][general_key].extend(extracted_tags)\n",
"            #print(f\"✅ Added {len(extracted_tags)} tags under {hierarchy_path} → {general_key}\")\n",
"\n",
"        else:\n",
"            # ✅ If no subcategories exist, store tags directly\n",
"            if isinstance(current_level[last_level], list):\n",
"                current_level[last_level].extend(extracted_tags)\n",
"            else:\n",
"                # print(f\"⚠️ Warning: {last_level} was a dictionary but has no subcategories. Converting to list.\")\n",
"                current_level[last_level] = extracted_tags\n",
"\n",
"            print(f\"✅ Added {len(extracted_tags)} tags directly under {hierarchy_path}\")\n",
"\n",
"    print(\"✅ Finished building hierarchy.\")\n",
"\n",
"\n",
"def extract_tags_from_wiki(wiki_text, is_list=False):\n",
"    \"\"\"\n",
"    Extracts valid tags from the wiki text.\n",
"    Uses different extraction logic for tag groups and list-based pages.\n",
"    \"\"\"\n",
"    if not wiki_text:\n",
"        return []\n",
"\n",
"    # Extract `[[tag_name]]` links; list pages and tag groups share the same link syntax\n",
"    tag_pattern = re.compile(r\"\\[\\[(.*?)\\]\\]\")\n",
"    tags = tag_pattern.findall(wiki_text)\n",
"\n",
"    if not is_list:\n",
"        # On tag-group pages the first link is always skipped (assumed to be the \"Tag Groups\" back-link)\n",
"        tags = tags[1:]\n",
"\n",
"    # Clean tags by removing alternative names `[[Tag|Alternative]]`\n",
"    cleaned_tags = [tag.split(\"|\")[0].strip() for tag in tags]\n",
"\n",
"    return cleaned_tags\n",
"\n",
"if __name__ == \"__main__\":\n",
"    print(\"🚀 Starting Danbooru Tag Hierarchy API Fetcher...\")\n",
"\n",
"    # ✅ Ensure manual_hierarchy exists before cleaning\n",
"    manual_hierarchy = {}\n",
"\n",
"    # Build hierarchy dynamically\n",
"    build_tag_hierarchy(tag_groups)\n",
"\n",
"    # ✅ Clean hierarchy and count total tags\n",
"    manual_hierarchy, total_tags = clean_hierarchy_and_count_tags(manual_hierarchy)\n",
"\n",
"    # Save cleaned JSON\n",
"    output_file = current_dir.parent / \"misc/danbooru_donmai/danbooru_tags_step_01.json\"\n",
"    with open(output_file, \"w\", encoding=\"utf-8\") as f:\n",
"        json.dump(manual_hierarchy, f, indent=4, ensure_ascii=False)\n",
"\n",
"    print(f\"\\n✅ Hierarchy saved to `{output_file}`\")\n",
"    print(f\"📊 Total tags in the final hierarchy: {total_tags}\")\n",
"\n",
"    if missing_tags:\n",
"        print(f\"⚠️ The following tag groups were not found: {missing_tags}\")\n"
|
| 1364 |
+
]
|
| 1365 |
+
},
|
| 1366 |
+
{
|
| 1367 |
+
"cell_type": "markdown",
|
| 1368 |
+
"id": "5bf6fa19-04dd-4799-8220-e37a4434a1f8",
|
| 1369 |
+
"metadata": {},
|
| 1370 |
+
"source": [
|
| 1371 |
+
"### Add subject keys and move misplaced tags"
|
| 1372 |
+
]
|
| 1373 |
+
},
|
| 1374 |
+
{
|
| 1375 |
+
"cell_type": "code",
|
| 1376 |
+
"execution_count": 15,
|
| 1377 |
+
"id": "9bf7c0ed-0cfd-41c8-89a4-c59f1783c823",
|
| 1378 |
+
"metadata": {
|
| 1379 |
+
"execution": {
|
| 1380 |
+
"iopub.execute_input": "2025-03-17T12:49:06.895515Z",
|
| 1381 |
+
"iopub.status.busy": "2025-03-17T12:49:06.895119Z",
|
| 1382 |
+
"iopub.status.idle": "2025-03-17T12:49:07.135844Z",
|
| 1383 |
+
"shell.execute_reply": "2025-03-17T12:49:07.135358Z",
|
| 1384 |
+
"shell.execute_reply.started": "2025-03-17T12:49:06.895492Z"
|
| 1385 |
+
}
|
| 1386 |
+
},
|
| 1387 |
+
"outputs": [
|
| 1388 |
+
{
|
| 1389 |
+
"name": "stdout",
|
| 1390 |
+
"output_type": "stream",
|
| 1391 |
+
"text": [
|
| 1392 |
+
"✅ Adding woman to subject.female.female_general\n",
|
| 1393 |
+
"✅ Adding girl to subject.female.female_general\n",
|
| 1394 |
+
"✅ Adding 1girl to subject.female.female_general\n",
|
| 1395 |
+
"🔄 Moving 2girls from more.groups to subject.female.female_general\n",
|
| 1396 |
+
"✅ Adding 2girls to subject.female.female_general\n",
|
| 1397 |
+
"🔄 Moving 3girls from more.groups to subject.female.female_general\n",
|
| 1398 |
+
"✅ Adding 3girls to subject.female.female_general\n",
|
| 1399 |
+
"🔄 Moving 4girls from more.groups to subject.female.female_general\n",
|
| 1400 |
+
"✅ Adding 4girls to subject.female.female_general\n",
|
| 1401 |
+
"🔄 Moving 5girls from more.groups to subject.female.female_general\n",
|
| 1402 |
+
"✅ Adding 5girls to subject.female.female_general\n",
|
| 1403 |
+
"🔄 Moving 6+girls from more.groups to subject.female.female_general\n",
|
| 1404 |
+
"✅ Adding 6+girls to subject.female.female_general\n",
|
| 1405 |
+
"🔄 Moving multiple girls from more.groups to subject.female.female_general\n",
|
| 1406 |
+
"✅ Adding multiple girls to subject.female.female_general\n",
|
| 1407 |
+
"🔄 Moving guitar girl from objects.audio_tags to subject.female.female_general\n",
|
| 1408 |
+
"✅ Adding guitar girl to subject.female.female_general\n",
|
| 1409 |
+
"✅ Adding man to subject.male.male_general\n",
|
| 1410 |
+
"✅ Adding boy to subject.male.male_general\n",
|
| 1411 |
+
"✅ Adding 1boy to subject.male.male_general\n",
|
| 1412 |
+
"🔄 Moving 2boys from more.groups to subject.male.male_general\n",
|
| 1413 |
+
"✅ Adding 2boys to subject.male.male_general\n",
|
| 1414 |
+
"🔄 Moving 3boys from more.groups to subject.male.male_general\n",
|
| 1415 |
+
"✅ Adding 3boys to subject.male.male_general\n",
|
| 1416 |
+
"🔄 Moving 4boys from more.groups to subject.male.male_general\n",
|
| 1417 |
+
"✅ Adding 4boys to subject.male.male_general\n",
|
| 1418 |
+
"🔄 Moving 5boys from more.groups to subject.male.male_general\n",
|
| 1419 |
+
"✅ Adding 5boys to subject.male.male_general\n",
|
| 1420 |
+
"🔄 Moving 6+boys from more.groups to subject.male.male_general\n",
|
| 1421 |
+
"✅ Adding 6+boys to subject.male.male_general\n",
|
| 1422 |
+
"🔄 Moving multiple boys from more.groups to subject.male.male_general\n",
|
| 1423 |
+
"✅ Adding multiple boys to subject.male.male_general\n",
|
| 1424 |
+
"✅ Adding guitar boy to subject.male.male_general\n",
|
| 1425 |
+
"✅ Adding 1koma to subject.koma.koma_general\n",
|
| 1426 |
+
"✅ Adding 2koma to subject.koma.koma_general\n",
|
| 1427 |
+
"🔄 Moving cat girl from creatures.animals.cats to subject.anthro.anthro_general\n",
|
| 1428 |
+
"✅ Adding cat girl to subject.anthro.anthro_general\n",
|
| 1429 |
+
"✅ Adding fox girl to subject.anthro.anthro_general\n",
|
| 1430 |
+
"✅ Adding dog girl to subject.anthro.anthro_general\n",
|
| 1431 |
+
"🔄 Moving plant girl from plant.plant to subject.anthro.anthro_general\n",
|
| 1432 |
+
"🔄 Moving plant girl from plants to subject.anthro.anthro_general\n",
|
| 1433 |
+
"✅ Adding plant girl to subject.anthro.anthro_general\n",
|
| 1434 |
+
"🔄 Moving plant boy from plant.plant to subject.anthro.anthro_general\n",
|
| 1435 |
+
"🔄 Moving plant boy from plants to subject.anthro.anthro_general\n",
|
| 1436 |
+
"✅ Adding plant boy to subject.anthro.anthro_general\n",
|
| 1437 |
+
"🔄 Moving cat boy from creatures.animals.cats to subject.anthro.anthro_general\n",
|
| 1438 |
+
"✅ Adding cat boy to subject.anthro.anthro_general\n",
|
| 1439 |
+
"✅ Adding furry to subject.anthro.anthro_general\n",
|
| 1440 |
+
"✅ Adding monster boy to subject.anthro.anthro_general\n",
|
| 1441 |
+
"✅ Adding monster girl to subject.anthro.anthro_general\n",
|
| 1442 |
+
"✅ Adding demon girl to subject.anthro.anthro_general\n",
|
| 1443 |
+
"✅ Adding demon boy to subject.anthro.anthro_general\n",
|
| 1444 |
+
"✅ Adding magical boy to subject.anthro.anthro_general\n",
|
| 1445 |
+
"🔄 Moving magical girl from characters.sailor_moon to subject.anthro.anthro_general\n",
|
| 1446 |
+
"✅ Adding magical girl to subject.anthro.anthro_general\n",
|
| 1447 |
+
"✅ Updated JSON saved as `/shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/danbooru_tags_step_02.json`\n"
|
| 1448 |
+
]
|
| 1449 |
+
}
|
| 1450 |
+
],
|
| 1451 |
+
"source": [
|
| 1452 |
+
"import json\n",
"\n",
"# Load the existing JSON file\n",
"input_file = current_dir.parent / \"misc/danbooru_donmai/danbooru_tags_step_01.json\"\n",
"output_file = current_dir.parent / \"misc/danbooru_donmai/danbooru_tags_step_02.json\"\n",
"\n",
"with open(input_file, \"r\", encoding=\"utf-8\") as f:\n",
"    manual_hierarchy = json.load(f)\n",
"\n",
"# ✅ Define the correct subject structure\n",
"subject_structure = {\n",
"    \"subject\": {\n",
"        \"female\": {\n",
"            \"female_general\": [\"woman\", \"girl\", \"1girl\", \"2girls\", \"3girls\", \"4girls\", \"5girls\", \"6+girls\", \"multiple girls\", \"guitar girl\"]\n",
"        },\n",
"        \"male\": {\n",
"            \"male_general\": [\"man\", \"boy\", \"1boy\", \"2boys\", \"3boys\", \"4boys\", \"5boys\", \"6+boys\", \"multiple boys\", \"guitar boy\"]\n",
"        },\n",
"        \"koma\": {\n",
"            \"koma_general\": [\"1koma\", \"2koma\"]\n",
"        },\n",
"        \"anthro\": {\n",
"            \"anthro_general\": [\"cat girl\", \"fox girl\", \"dog girl\", \"plant girl\", \"plant boy\", \"cat boy\", \"furry\", \"monster boy\", \"monster girl\", \"demon girl\", \"demon boy\", \"magical boy\", \"magical girl\"]\n",
"        }\n",
"    }\n",
"}\n",
"\n",
"# ✅ Ensure \"subject\" exists in the hierarchy\n",
"if \"subject\" not in manual_hierarchy:\n",
"    manual_hierarchy[\"subject\"] = {}\n",
"\n",
"# ✅ Ensure all subcategories exist\n",
"for category, subcategories in subject_structure[\"subject\"].items():\n",
"    if category not in manual_hierarchy[\"subject\"]:\n",
"        manual_hierarchy[\"subject\"][category] = {}\n",
"\n",
"    for subcategory, tags in subcategories.items():\n",
"        if subcategory not in manual_hierarchy[\"subject\"][category]:\n",
"            manual_hierarchy[\"subject\"][category][subcategory] = []\n",
"\n",
"# ✅ Move misplaced tags and also ensure missing tags are added\n",
"for category, subcategories in subject_structure[\"subject\"].items():\n",
"    for subcategory, tags in subcategories.items():\n",
"        target_list = manual_hierarchy[\"subject\"][category][subcategory]\n",
"\n",
"        for tag in tags:\n",
"            found_in_wrong_place = False  # Track if tag was found elsewhere\n",
"\n",
"            # ✅ Search for the tag in other categories and remove if found\n",
"            for key, value in list(manual_hierarchy.items()):  # Use list() to avoid runtime changes\n",
"                if isinstance(value, list) and tag in value:\n",
"                    print(f\"🔄 Moving {tag} from {key} to subject.{category}.{subcategory}\")\n",
"                    value.remove(tag)\n",
"                    found_in_wrong_place = True\n",
"\n",
"                elif isinstance(value, dict):  # Search deeper\n",
"                    for subkey, subvalue in list(value.items()):\n",
"                        if isinstance(subvalue, list) and tag in subvalue:\n",
"                            print(f\"🔄 Moving {tag} from {key}.{subkey} to subject.{category}.{subcategory}\")\n",
"                            subvalue.remove(tag)\n",
"                            found_in_wrong_place = True\n",
"\n",
"                        elif isinstance(subvalue, dict):\n",
"                            for deepkey, deepvalue in list(subvalue.items()):\n",
"                                if isinstance(deepvalue, list) and tag in deepvalue:\n",
"                                    print(f\"🔄 Moving {tag} from {key}.{subkey}.{deepkey} to subject.{category}.{subcategory}\")\n",
"                                    deepvalue.remove(tag)\n",
"                                    found_in_wrong_place = True\n",
"\n",
"            # ✅ Add the tag to the correct subject category if missing\n",
"            if tag not in target_list:\n",
"                print(f\"✅ Adding {tag} to subject.{category}.{subcategory}\")\n",
"                target_list.append(tag)\n",
"\n",
"# ✅ Save the updated JSON\n",
"with open(output_file, \"w\", encoding=\"utf-8\") as f:\n",
"    json.dump(manual_hierarchy, f, indent=4, ensure_ascii=False)\n",
"\n",
"print(f\"✅ Updated JSON saved as `{output_file}`\")\n",
"\n"
|
| 1532 |
+
]
|
| 1533 |
+
},
|
| 1534 |
+
{
|
| 1535 |
+
"cell_type": "markdown",
|
| 1536 |
+
"id": "aff1a30d-17e7-4e55-b881-b5ce622a533a",
|
| 1537 |
+
"metadata": {},
|
| 1538 |
+
"source": [
|
| 1539 |
+
"### Compare with wd-14 vocabulary"
|
| 1540 |
+
]
|
| 1541 |
+
},
|
| 1542 |
+
{
|
| 1543 |
+
"cell_type": "code",
|
| 1544 |
+
"execution_count": 16,
|
| 1545 |
+
"id": "2a8f986c-0ccd-4235-ac50-be3d6e495f9f",
|
| 1546 |
+
"metadata": {
|
| 1547 |
+
"execution": {
|
| 1548 |
+
"iopub.execute_input": "2025-03-17T12:49:10.749271Z",
|
| 1549 |
+
"iopub.status.busy": "2025-03-17T12:49:10.748481Z",
|
| 1550 |
+
"iopub.status.idle": "2025-03-17T12:49:11.628872Z",
|
| 1551 |
+
"shell.execute_reply": "2025-03-17T12:49:11.628200Z",
|
| 1552 |
+
"shell.execute_reply.started": "2025-03-17T12:49:10.749248Z"
|
| 1553 |
+
}
|
| 1554 |
+
},
|
| 1555 |
+
"outputs": [
|
| 1556 |
+
{
|
| 1557 |
+
"name": "stdout",
|
| 1558 |
+
"output_type": "stream",
|
| 1559 |
+
"text": [
|
| 1560 |
+
"\n",
|
| 1561 |
+
"📌 Tags in CSV but NOT in JSON:\n",
|
| 1562 |
+
"\n",
|
| 1563 |
+
"✅ Missing tags saved to `missing_tags.txt`\n"
|
| 1564 |
+
]
|
| 1565 |
+
}
|
| 1566 |
+
],
|
| 1567 |
+
"source": [
|
| 1568 |
+
"import json\n",
"import pandas as pd\n",
"\n",
"# Load CSV file\n",
"csv_file = current_dir.parent / \"misc/autotagging-vocabularies/danbooru.csv\"  # Update with the actual CSV file name\n",
"df = pd.read_csv(csv_file)\n",
"\n",
"# Load JSON file\n",
"json_file = input_file = current_dir.parent / \"misc/danbooru_donmai/danbooru_tags_step_03.json\"\n",
"with open(json_file, \"r\", encoding=\"utf-8\") as f:\n",
"    manual_hierarchy = json.load(f)\n",
"\n",
"# ✅ Extract all tags from the CSV\n",
"csv_tags = set(df[\"name\"].astype(str).str.replace(\"_\", \" \").str.lower())  # Normalize tags\n",
"\n",
"# ✅ Extract all tags from the JSON recursively\n",
"def extract_tags_from_json(data):\n",
"    tags = set()\n",
"    if isinstance(data, dict):\n",
"        for value in data.values():\n",
"            tags.update(extract_tags_from_json(value))\n",
"    elif isinstance(data, list):\n",
"        tags.update(str(tag).replace(\"_\", \" \").lower() for tag in data)\n",
"    return tags\n",
"\n",
"json_tags = extract_tags_from_json(manual_hierarchy)\n",
"\n",
"# ✅ Find tags in CSV but NOT in JSON\n",
"missing_tags = csv_tags - json_tags\n",
"\n",
"# ✅ Print the missing tags\n",
"print(\"\\n📌 Tags in CSV but NOT in JSON:\")\n",
"for tag in sorted(missing_tags):\n",
"    print(tag)\n",
"\n",
"# ✅ Save missing tags to a file for review (optional)\n",
"missing_tags_file = \"missing_tags.txt\"\n",
"with open(missing_tags_file, \"w\", encoding=\"utf-8\") as f:\n",
"    f.write(\"\\n\".join(sorted(missing_tags)))\n",
"\n",
"print(f\"\\n✅ Missing tags saved to `{missing_tags_file}`\")\n"
|
| 1609 |
+
]
|
| 1610 |
+
},
|
| 1611 |
+
{
|
| 1612 |
+
"cell_type": "markdown",
|
| 1613 |
+
"id": "706317c1-498d-40ab-a29a-ee658e1735e2",
|
| 1614 |
+
"metadata": {
|
| 1615 |
+
"execution": {
|
| 1616 |
+
"iopub.execute_input": "2025-03-17T10:53:52.106543Z",
|
| 1617 |
+
"iopub.status.busy": "2025-03-17T10:53:52.105944Z",
|
| 1618 |
+
"iopub.status.idle": "2025-03-17T10:53:52.108812Z",
|
| 1619 |
+
"shell.execute_reply": "2025-03-17T10:53:52.108451Z",
|
| 1620 |
+
"shell.execute_reply.started": "2025-03-17T10:53:52.106526Z"
|
| 1621 |
+
}
|
| 1622 |
+
},
|
| 1623 |
+
"source": [
|
| 1624 |
+
"### Fetch wiki data for collected tags"
|
| 1625 |
+
]
|
| 1626 |
+
},
|
| 1627 |
+
{
|
| 1628 |
+
"cell_type": "code",
|
| 1629 |
+
"execution_count": null,
|
| 1630 |
+
"id": "ec696e5b-be39-4407-a04f-6ce6b0b08855",
|
| 1631 |
+
"metadata": {
|
| 1632 |
+
"execution": {
|
| 1633 |
+
"execution_failed": "2025-03-17T12:46:52.921Z",
|
| 1634 |
+
"iopub.execute_input": "2025-03-17T12:45:32.542088Z",
|
| 1635 |
+
"iopub.status.busy": "2025-03-17T12:45:32.541554Z"
|
| 1636 |
+
}
|
| 1637 |
+
},
|
| 1638 |
+
"outputs": [
|
| 1639 |
+
{
|
| 1640 |
+
"name": "stdout",
|
| 1641 |
+
"output_type": "stream",
|
| 1642 |
+
"text": [
|
| 1643 |
+
"✅ Extracted 35559 unique tags.\n",
|
| 1644 |
+
"🚀 Fetching data for amamiya_elena (1/35559)...\n",
|
| 1645 |
+
"✅ Saved amamiya_elena to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/amamiya_elena.json\n",
|
| 1646 |
+
"🚀 Fetching data for gilles_de_rais (2/35559)...\n",
|
| 1647 |
+
"✅ Saved gilles_de_rais to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/gilles_de_rais.json\n",
|
| 1648 |
+
"🚀 Fetching data for blue_scrunchie (3/35559)...\n",
|
| 1649 |
+
"✅ Saved blue_scrunchie to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/blue_scrunchie.json\n",
|
| 1650 |
+
"🚀 Fetching data for playing (4/35559)...\n",
|
| 1651 |
+
"✅ Saved playing to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/playing.json\n",
|
| 1652 |
+
"🚀 Fetching data for album_cover_redraw (5/35559)...\n",
|
| 1653 |
+
"✅ Saved album_cover_redraw to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/album_cover_redraw.json\n",
|
| 1654 |
+
"🚀 Fetching data for >3< (6/35559)...\n",
|
| 1655 |
+
"✅ Saved >3< to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/>3<.json\n",
|
| 1656 |
+
"🚀 Fetching data for damom (7/35559)...\n",
|
| 1657 |
+
"🚀 Fetching data for shoulder_cannon (8/35559)...\n",
|
| 1658 |
+
"✅ Saved shoulder_cannon to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/shoulder_cannon.json\n",
|
| 1659 |
+
"🚀 Fetching data for cover_image (9/35559)...\n",
|
| 1660 |
+
"✅ Saved cover_image to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/cover_image.json\n",
|
| 1661 |
+
"🚀 Fetching data for ultraman_legend (10/35559)...\n",
|
| 1662 |
+
"✅ Saved ultraman_legend to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/ultraman_legend.json\n",
|
| 1663 |
+
"🚀 Fetching data for mastiff (11/35559)...\n",
|
| 1664 |
+
"✅ Saved mastiff to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/mastiff.json\n",
|
| 1665 |
+
"🚀 Fetching data for oingo (12/35559)...\n",
|
| 1666 |
+
"✅ Saved oingo to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/oingo.json\n",
|
| 1667 |
+
"🚀 Fetching data for mario_strikers_(series) (13/35559)...\n",
|
| 1668 |
+
"✅ Saved mario_strikers_(series) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/mario_strikers_(series).json\n",
|
| 1669 |
+
"🚀 Fetching data for phallic_symbol (14/35559)...\n",
|
| 1670 |
+
"✅ Saved phallic_symbol to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/phallic_symbol.json\n",
|
| 1671 |
+
"🚀 Fetching data for 4girls (15/35559)...\n",
|
| 1672 |
+
"✅ Saved 4girls to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/4girls.json\n",
|
| 1673 |
+
"🚀 Fetching data for nice_nature_(racehorse) (16/35559)...\n",
|
| 1674 |
+
"✅ Saved nice_nature_(racehorse) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/nice_nature_(racehorse).json\n",
|
| 1675 |
+
"🚀 Fetching data for hair_beads (17/35559)...\n",
|
| 1676 |
+
"✅ Saved hair_beads to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/hair_beads.json\n",
|
| 1677 |
+
"🚀 Fetching data for fu_po_(azur_lane) (18/35559)...\n",
|
| 1678 |
+
"✅ Saved fu_po_(azur_lane) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/fu_po_(azur_lane).json\n",
|
| 1679 |
+
"🚀 Fetching data for togepi (19/35559)...\n",
|
| 1680 |
+
"✅ Saved togepi to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/togepi.json\n",
|
| 1681 |
+
"🚀 Fetching data for yasopp (20/35559)...\n",
|
| 1682 |
+
"✅ Saved yasopp to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/yasopp.json\n",
|
| 1683 |
+
"🚀 Fetching data for oyafune_suama (21/35559)...\n",
|
| 1684 |
+
"✅ Saved oyafune_suama to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/oyafune_suama.json\n",
|
| 1685 |
+
"🚀 Fetching data for phantasy_star_iii (22/35559)...\n",
|
| 1686 |
+
"✅ Saved phantasy_star_iii to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/phantasy_star_iii.json\n",
|
| 1687 |
+
"🚀 Fetching data for qingdai_guanmao (23/35559)...\n",
|
| 1688 |
+
"✅ Saved qingdai_guanmao to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/qingdai_guanmao.json\n",
|
| 1689 |
+
"🚀 Fetching data for kufei (24/35559)...\n",
|
| 1690 |
+
"✅ Saved kufei to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/kufei.json\n",
|
| 1691 |
+
"🚀 Fetching data for stefan_(atelier) (25/35559)...\n",
|
| 1692 |
+
"✅ Saved stefan_(atelier) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/stefan_(atelier).json\n",
|
| 1693 |
+
"🚀 Fetching data for dille_blood (26/35559)...\n",
|
| 1694 |
+
"✅ Saved dille_blood to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/dille_blood.json\n",
|
| 1695 |
+
"🚀 Fetching data for vivillon_(modern) (27/35559)...\n",
|
| 1696 |
+
"✅ Saved vivillon_(modern) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/vivillon_(modern).json\n",
|
| 1697 |
+
"🚀 Fetching data for sweetie_(ragnarok_online) (28/35559)...\n",
|
| 1698 |
+
"✅ Saved sweetie_(ragnarok_online) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/sweetie_(ragnarok_online).json\n",
|
| 1699 |
+
"🚀 Fetching data for whisking (29/35559)...\n",
|
| 1700 |
+
"✅ Saved whisking to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/whisking.json\n",
|
| 1701 |
+
"🚀 Fetching data for h&k_hk33 (30/35559)...\n",
|
| 1702 |
+
"✅ Saved h&k_hk33 to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/h&k_hk33.json\n",
|
| 1703 |
+
"🚀 Fetching data for winx_club (31/35559)...\n",
|
| 1704 |
+
"✅ Saved winx_club to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/winx_club.json\n",
|
| 1705 |
+
"🚀 Fetching data for anti-tank_grenade (32/35559)...\n",
|
| 1706 |
+
"✅ Saved anti-tank_grenade to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/anti-tank_grenade.json\n",
|
| 1707 |
+
"🚀 Fetching data for devin_booker (33/35559)...\n",
|
| 1708 |
+
"✅ Saved devin_booker to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/devin_booker.json\n",
|
| 1709 |
+
"🚀 Fetching data for scylla_(azur_lane) (34/35559)...\n",
|
| 1710 |
+
"✅ Saved scylla_(azur_lane) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/scylla_(azur_lane).json\n",
|
| 1711 |
+
"🚀 Fetching data for penance_(arknights) (35/35559)...\n",
|
| 1712 |
+
"✅ Saved penance_(arknights) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/penance_(arknights).json\n",
|
| 1713 |
+
"🚀 Fetching data for toba_(oshiro_project) (36/35559)...\n",
|
| 1714 |
+
"✅ Saved toba_(oshiro_project) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/toba_(oshiro_project).json\n",
|
| 1715 |
+
"🚀 Fetching data for scott_adams_(style) (37/35559)...\n",
|
| 1716 |
+
"✅ Saved scott_adams_(style) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/scott_adams_(style).json\n",
|
| 1717 |
+
"🚀 Fetching data for 502nd_joint_fighter_wing (38/35559)...\n",
|
| 1718 |
+
"✅ Saved 502nd_joint_fighter_wing to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/502nd_joint_fighter_wing.json\n",
|
| 1719 |
+
"🚀 Fetching data for komica_wiki (39/35559)...\n",
|
| 1720 |
+
"✅ Saved komica_wiki to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/komica_wiki.json\n",
|
| 1721 |
+
"🚀 Fetching data for final_fantasy_vi (40/35559)...\n",
|
| 1722 |
+
"✅ Saved final_fantasy_vi to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/final_fantasy_vi.json\n",
|
| 1723 |
+
"🚀 Fetching data for h&k_hk45 (41/35559)...\n",
|
| 1724 |
+
"✅ Saved h&k_hk45 to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/h&k_hk45.json\n",
|
| 1725 |
+
"🚀 Fetching data for saint_seiya (42/35559)...\n",
|
| 1726 |
+
"✅ Saved saint_seiya to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/saint_seiya.json\n",
|
| 1727 |
+
"🚀 Fetching data for ike_(fire_emblem) (43/35559)...\n",
|
| 1728 |
+
"✅ Saved ike_(fire_emblem) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/ike_(fire_emblem).json\n",
|
| 1729 |
+
"🚀 Fetching data for cooperative_breast_smother (44/35559)...\n",
|
| 1730 |
+
"✅ Saved cooperative_breast_smother to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/cooperative_breast_smother.json\n",
|
| 1731 |
+
"🚀 Fetching data for pamiat_merkuria_(azur_lane) (45/35559)...\n",
|
| 1732 |
+
"✅ Saved pamiat_merkuria_(azur_lane) to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/pamiat_merkuria_(azur_lane).json\n",
|
| 1733 |
+
"🚀 Fetching data for ribbon-trimmed_skirt (46/35559)...\n",
|
| 1734 |
+
"✅ Saved ribbon-trimmed_skirt to /shares/weddigen.ki.uzh/laura_wagner/Civitai_page_analysis/Civitai_visualizations/misc/danbooru_donmai/tags_wiki/ribbon-trimmed_skirt.json\n"
|
| 1735 |
+
]
|
| 1736 |
+
}
|
| 1737 |
+
],
|
| 1738 |
+
"source": [
|
| 1739 |
+
"import json\n",
|
| 1740 |
+
"import os\n",
|
| 1741 |
+
"import requests\n",
|
| 1742 |
+
"import time\n",
|
| 1743 |
+
"import urllib.parse\n",
|
| 1744 |
+
"\n",
|
| 1745 |
+
"# Load the cleaned hierarchy JSON\n",
|
| 1746 |
+
"json_file_path = current_dir.parent / \"misc/danbooru_donmai/danbooru_tags_step_03.json\"\n",
|
| 1747 |
+
"with open(json_file_path, \"r\", encoding=\"utf-8\") as file:\n",
|
| 1748 |
+
" tag_hierarchy = json.load(file)\n",
|
| 1749 |
+
"\n",
|
| 1750 |
+
"# Authentication (Replace with your credentials)\n",
|
| 1751 |
+
"USERNAME = username\n",
|
| 1752 |
+
"API_KEY = api_key\n",
|
| 1753 |
+
"\n",
|
| 1754 |
+
"# Base API URL\n",
|
| 1755 |
+
"BASE_URL = \"https://danbooru.donmai.us\"\n",
|
| 1756 |
+
"\n",
|
| 1757 |
+
"# Output folder for JSON results\n",
|
| 1758 |
+
"output_folder = current_dir.parent / \"misc/danbooru_donmai/tags_wiki\"\n",
|
| 1759 |
+
"os.makedirs(output_folder, exist_ok=True) # Ensure the directory exists\n",
|
| 1760 |
+
"\n",
|
| 1761 |
+
"# Function to extract all unique tags from nested dictionaries/lists\n",
|
| 1762 |
+
"def extract_tags(data):\n",
|
| 1763 |
+
" tags = set()\n",
|
| 1764 |
+
" if isinstance(data, dict):\n",
|
| 1765 |
+
" for key, value in data.items():\n",
|
| 1766 |
+
" tags.update(extract_tags(value)) # Recursively extract tags\n",
|
| 1767 |
+
" elif isinstance(data, list):\n",
|
| 1768 |
+
" for item in data:\n",
|
| 1769 |
+
" if isinstance(item, str):\n",
|
| 1770 |
+
" formatted_tag = item.lower().replace(\" \", \"_\") # ✅ Convert to lowercase and replace spaces\n",
|
| 1771 |
+
" tags.add(formatted_tag)\n",
|
| 1772 |
+
" else:\n",
|
| 1773 |
+
" tags.update(extract_tags(item)) # Handle nested lists\n",
|
| 1774 |
+
" return tags\n",
|
| 1775 |
+
"\n",
|
| 1776 |
+
"# Extract all unique tags\n",
|
| 1777 |
+
"all_tags = extract_tags(tag_hierarchy)\n",
|
| 1778 |
+
"print(f\"✅ Extracted {len(all_tags)} unique tags.\")\n",
|
| 1779 |
+
"\n",
|
| 1780 |
+
"# API query function\n",
|
| 1781 |
+
"def fetch_tag_data(tag):\n",
|
| 1782 |
+
" \"\"\"Fetch tag details from Danbooru API.\"\"\"\n",
|
| 1783 |
+
" encoded_tag = urllib.parse.quote(tag, safe=\"\")\n",
|
| 1784 |
+
" api_url = f\"{BASE_URL}/tags.json?search[name]={encoded_tag}&only=id,name,category,post_count,is_deprecated,created_at,updated_at,wiki_page,artist,antecedent_alias,consequent_aliases,antecedent_implications,consequent_implications,dtext_links\"\n",
|
| 1785 |
+
"\n",
|
| 1786 |
+
" try:\n",
|
| 1787 |
+
" response = requests.get(api_url, auth=(USERNAME, API_KEY))\n",
|
| 1788 |
+
" response.raise_for_status()\n",
|
| 1789 |
+
" return response.json()\n",
|
| 1790 |
+
" except requests.exceptions.RequestException as e:\n",
|
| 1791 |
+
" print(f\"⚠️ Error fetching data for '{tag}': {e}\")\n",
|
| 1792 |
+
" return None\n",
|
| 1793 |
+
"\n",
|
| 1794 |
+
"# Process each tag\n",
|
| 1795 |
+
"for idx, tag in enumerate(all_tags):\n",
|
| 1796 |
+
" tag_filename = os.path.join(output_folder, f\"{tag}.json\")\n",
|
| 1797 |
+
"\n",
|
| 1798 |
+
" # Skip if the file already exists to avoid redundant API calls\n",
|
| 1799 |
+
" if os.path.exists(tag_filename):\n",
|
| 1800 |
+
" print(f\"🔄 Skipping {tag}, already saved.\")\n",
|
| 1801 |
+
" continue\n",
|
| 1802 |
+
"\n",
|
| 1803 |
+
" print(f\"🚀 Fetching data for {tag} ({idx+1}/{len(all_tags)})...\")\n",
|
| 1804 |
+
"\n",
|
| 1805 |
+
" tag_data = fetch_tag_data(tag)\n",
|
| 1806 |
+
"\n",
|
| 1807 |
+
" if tag_data:\n",
|
| 1808 |
+
" with open(tag_filename, \"w\", encoding=\"utf-8\") as f:\n",
|
| 1809 |
+
" json.dump(tag_data, f, indent=4, ensure_ascii=False)\n",
|
| 1810 |
+
" print(f\"✅ Saved {tag} to {tag_filename}\")\n",
|
| 1811 |
+
"\n",
|
| 1812 |
+
" # Respect API rate limits\n",
|
| 1813 |
+
" time.sleep(1.5) # Adjust delay if necessary\n",
|
| 1814 |
+
"\n",
|
| 1815 |
+
"print(\"\\n✅ All tags processed and saved in the 'tags_wiki/' folder.\")\n"
|
| 1816 |
+
]
|
| 1817 |
+
},
|
| 1818 |
+
{
|
| 1819 |
+
"cell_type": "code",
|
| 1820 |
+
"execution_count": null,
|
| 1821 |
+
"id": "8e364060-36e9-4418-8bdd-6018c9edcc33",
|
| 1822 |
+
"metadata": {},
|
| 1823 |
+
"outputs": [],
|
| 1824 |
+
"source": []
|
| 1825 |
+
}
|
| 1826 |
+
],
|
| 1827 |
+
"metadata": {
|
| 1828 |
+
"kernelspec": {
|
| 1829 |
+
"display_name": "latm",
|
| 1830 |
+
"language": "python",
|
| 1831 |
+
"name": "python3"
|
| 1832 |
+
},
|
| 1833 |
+
"language_info": {
|
| 1834 |
+
"codemirror_mode": {
|
| 1835 |
+
"name": "ipython",
|
| 1836 |
+
"version": 3
|
| 1837 |
+
},
|
| 1838 |
+
"file_extension": ".py",
|
| 1839 |
+
"mimetype": "text/x-python",
|
| 1840 |
+
"name": "python",
|
| 1841 |
+
"nbconvert_exporter": "python",
|
| 1842 |
+
"pygments_lexer": "ipython3",
|
| 1843 |
+
"version": "3.10.15"
|
| 1844 |
+
}
|
| 1845 |
+
},
|
| 1846 |
+
"nbformat": 4,
|
| 1847 |
+
"nbformat_minor": 5
|
| 1848 |
+
}
|
jupyter_notebooks/SuppM_Figure_S14_co-occurence_training_data.ipynb
ADDED
|
@@ -0,0 +1,152 @@
|
| 1 |
+
{
|
| 2 |
+
"cells": [
|
| 3 |
+
{
|
| 4 |
+
"cell_type": "markdown",
|
| 5 |
+
"metadata": {},
|
| 6 |
+
"source": [
|
| 7 |
+
"# Prepare *.json for Figure 13"
|
| 8 |
+
]
|
| 9 |
+
},
|
| 10 |
+
{
|
| 11 |
+
"cell_type": "code",
|
| 12 |
+
"execution_count": null,
|
| 13 |
+
"metadata": {},
|
| 14 |
+
"outputs": [],
|
| 15 |
+
"source": [
|
| 16 |
+
"import pandas as pd\n",
|
| 17 |
+
"import json\n",
|
| 18 |
+
"from itertools import combinations\n",
|
| 19 |
+
"from collections import Counter\n",
|
| 20 |
+
"\n",
|
| 21 |
+
"# Load the data with only needed columns\n",
|
| 22 |
+
"csv_file_path = \"all_models_with_tags.csv\" # Update with your actual file path\n",
|
| 23 |
+
"\n",
|
| 24 |
+
"df = pd.read_csv(csv_file_path, low_memory=False)\n",
|
| 25 |
+
"\n",
|
| 26 |
+
"# Identify the starting point of tag columns (after 'civitai_id')\n",
|
| 27 |
+
"civitai_index = df.columns.get_loc(\"civitai_id\") + 1\n",
|
| 28 |
+
"\n",
|
| 29 |
+
"df_tags = df.iloc[:, civitai_index:]\n",
|
| 30 |
+
"\n",
|
| 31 |
+
"# Extract tag columns (tag01 to tag199) and corresponding counts\n",
|
| 32 |
+
"tag_columns = [col for col in df_tags.columns if col.startswith(\"tag\") and col[3:].isdigit() and int(col[3:]) <= 199]\n",
|
| 33 |
+
"tag_no_columns = [col for col in df_tags.columns if col.startswith(\"tag\") and col.endswith(\"_no\")]\n",
|
| 34 |
+
"\n",
|
| 35 |
+
"if not tag_columns:\n",
|
| 36 |
+
" print(\"No tag columns found in the dataset.\")\n",
|
| 37 |
+
" exit()\n",
|
| 38 |
+
"\n",
|
| 39 |
+
"df_tags = df[tag_columns]\n",
|
| 40 |
+
"df_tag_counts = df[tag_no_columns] if tag_no_columns else None\n",
|
| 41 |
+
"\n",
|
| 42 |
+
"# Load tag categories from JSON\n",
|
| 43 |
+
"json_file_path = \"danbooru_tags_step_03.json\" # Update with your actual file path\n",
|
| 44 |
+
"with open(json_file_path, \"r\", encoding=\"utf-8\") as f:\n",
|
| 45 |
+
" tag_categories = json.load(f)\n",
|
| 46 |
+
"\n",
|
| 47 |
+
"# Function to normalize tags (lowercase and replace underscores with spaces)\n",
|
| 48 |
+
"def normalize_tag(tag):\n",
|
| 49 |
+
" return tag.lower().replace(\"_\", \" \") if isinstance(tag, str) else tag\n",
|
| 50 |
+
"\n",
|
| 51 |
+
"# Flatten JSON structure to map tags to top-level categories\n",
|
| 52 |
+
"def extract_tag_categories(json_data, current_category=None):\n",
|
| 53 |
+
" tag_mapping = {}\n",
|
| 54 |
+
" for key, value in json_data.items():\n",
|
| 55 |
+
" if isinstance(value, dict):\n",
|
| 56 |
+
" tag_mapping.update(extract_tag_categories(value, key))\n",
|
| 57 |
+
" elif isinstance(value, list):\n",
|
| 58 |
+
" for tag in value:\n",
|
| 59 |
+
" normalized_tag = normalize_tag(tag)\n",
|
| 60 |
+
" if normalized_tag:\n",
|
| 61 |
+
" tag_mapping[normalized_tag] = current_category\n",
|
| 62 |
+
" return tag_mapping\n",
|
| 63 |
+
"\n",
|
| 64 |
+
"# Create mapping of tags to categories\n",
|
| 65 |
+
"tag_category_mapping = extract_tag_categories(tag_categories)\n",
|
| 66 |
+
"\n",
|
| 67 |
+
"# Flatten and count occurrences of individual tags\n",
|
| 68 |
+
"all_tags = []\n",
|
| 69 |
+
"tag_counts = Counter()\n",
|
| 70 |
+
"\n",
|
| 71 |
+
"for i, row in df_tags.iterrows():\n",
|
| 72 |
+
" tags = [normalize_tag(tag) for tag in row if pd.notna(tag)]\n",
|
| 73 |
+
" if df_tag_counts is not None:\n",
|
| 74 |
+
" counts = df_tag_counts.iloc[i].fillna(1).tolist()\n",
|
| 75 |
+
" else:\n",
|
| 76 |
+
" counts = [1] * len(tags)\n",
|
| 77 |
+
" for tag, count in zip(tags, counts):\n",
|
| 78 |
+
" all_tags.append(tag)\n",
|
| 79 |
+
" tag_counts[tag] += count\n",
|
| 80 |
+
"\n",
|
| 81 |
+
"if not all_tags:\n",
|
| 82 |
+
" print(\"No valid tags found in the dataset.\")\n",
|
| 83 |
+
" exit()\n",
|
| 84 |
+
"print(f\"Extracted {len(all_tags)} total tags.\")\n",
|
| 85 |
+
"\n",
|
| 86 |
+
"# Compute co-occurrence frequencies efficiently\n",
|
| 87 |
+
"co_occurrences = Counter()\n",
|
| 88 |
+
"for i, row in df_tags.iterrows():\n",
|
| 89 |
+
" tags = [normalize_tag(tag) for tag in row if pd.notna(tag)]\n",
|
| 90 |
+
" if len(tags) < 2:\n",
|
| 91 |
+
" continue # Skip rows with fewer than two tags\n",
|
| 92 |
+
" for tag1, tag2 in combinations(tags, 2):\n",
|
| 93 |
+
" co_occurrences[frozenset([tag1, tag2])] += 1\n",
|
| 94 |
+
"\n",
|
| 95 |
+
"if not co_occurrences:\n",
|
| 96 |
+
" print(\"No co-occurrence data found.\")\n",
|
| 97 |
+
" exit()\n",
|
| 98 |
+
"print(f\"Computed {len(co_occurrences)} co-occurrence pairs.\")\n",
|
| 99 |
+
"\n",
|
| 100 |
+
"# Convert co-occurrence counts to a weighted edge list\n",
|
| 101 |
+
"edges = [(tuple(pair)[0], tuple(pair)[1], weight) for pair, weight in co_occurrences.items() if len(pair) == 2]\n",
|
| 102 |
+
"\n",
|
| 103 |
+
"# Count, for each tag, how many distinct tags it co-occurs with (its degree)\n",
|
| 104 |
+
"min_connections = 5 # Increased threshold to reduce memory usage\n",
|
| 105 |
+
"connected_tags = Counter()\n",
|
| 106 |
+
"for tag1, tag2, _ in edges:\n",
|
| 107 |
+
" connected_tags[tag1] += 1\n",
|
| 108 |
+
" connected_tags[tag2] += 1\n",
|
| 109 |
+
"\n",
|
| 110 |
+
"# Filter nodes to keep only those that meet the minimum connection threshold\n",
|
| 111 |
+
"nodes = [\n",
|
| 112 |
+
" {\"id\": tag, \"size\": count, \"category\": tag_category_mapping.get(tag, \"unknown\")}\n",
|
| 113 |
+
" for tag, count in tag_counts.items()\n",
|
| 114 |
+
" if connected_tags[tag] >= min_connections\n",
|
| 115 |
+
"]\n",
|
| 116 |
+
"\n",
|
| 117 |
+
"if not nodes:\n",
|
| 118 |
+
" print(\"No nodes meet the connection threshold.\")\n",
|
| 119 |
+
" exit()\n",
|
| 120 |
+
"print(f\"Final node count: {len(nodes)}\")\n",
|
| 121 |
+
"\n",
|
| 122 |
+
"links = [\n",
|
| 123 |
+
" {\"source\": tag1, \"target\": tag2, \"value\": weight}\n",
|
| 124 |
+
" for tag1, tag2, weight in edges\n",
|
| 125 |
+
" if connected_tags[tag1] >= min_connections and connected_tags[tag2] >= min_connections\n",
|
| 126 |
+
"]\n",
|
| 127 |
+
"\n",
|
| 128 |
+
"if not links:\n",
|
| 129 |
+
" print(\"No links meet the connection threshold.\")\n",
|
| 130 |
+
" exit()\n",
|
| 131 |
+
"print(f\"Final link count: {len(links)}\")\n",
|
| 132 |
+
"\n",
|
| 133 |
+
"# Prepare JSON output\n",
|
| 134 |
+
"d3_data = {\"nodes\": nodes, \"links\": links}\n",
|
| 135 |
+
"\n",
|
| 136 |
+
"# Save as JSON for visualization\n",
|
| 137 |
+
"output_file = \"co_occurrence_network.json\"\n",
|
| 138 |
+
"with open(output_file, \"w\", encoding=\"utf-8\") as f:\n",
|
| 139 |
+
" json.dump(d3_data, f, indent=4)\n",
|
| 140 |
+
"\n",
|
| 141 |
+
"print(f\"D3.js data saved to {output_file}\")"
|
| 142 |
+
]
|
| 143 |
+
}
|
| 144 |
+
],
|
| 145 |
+
"metadata": {
|
| 146 |
+
"language_info": {
|
| 147 |
+
"name": "python"
|
| 148 |
+
}
|
| 149 |
+
},
|
| 150 |
+
"nbformat": 4,
|
| 151 |
+
"nbformat_minor": 2
|
| 152 |
+
}
|
md/DEEPFAKE_PIPELINE_GUIDE.md
ADDED
|
@@ -0,0 +1,210 @@
|
| 1 |
+
# Deepfake Adapter Dataset Processing - Quick Start Guide
|
| 2 |
+
|
| 3 |
+
## Overview
|
| 4 |
+
|
| 5 |
+
This pipeline processes the `real_person_adapters.csv` dataset to identify and annotate real people used in deepfake LoRA models using three LLM options: **Qwen**, **Llama**, and **Mistral**.
|
| 6 |
+
|
| 7 |
+
## Quick Start
|
| 8 |
+
|
| 9 |
+
### 1. Prerequisites
|
| 10 |
+
|
| 11 |
+
```bash
|
| 12 |
+
# Install required packages
|
| 13 |
+
pip install pandas numpy emoji requests tqdm spacy
|
| 14 |
+
|
| 15 |
+
# Download spaCy English model (for NER)
|
| 16 |
+
python -m spacy download en_core_web_sm
|
| 17 |
+
```
|
| 18 |
+
|
| 19 |
+
**Note**: The spaCy model will be automatically downloaded when you run the notebook if not already installed.
|
| 20 |
+
|
| 21 |
+
### 2. Set Up API Keys
|
| 22 |
+
|
| 23 |
+
Choose at least ONE LLM provider and get an API key:
|
| 24 |
+
|
| 25 |
+
| Provider | Model | Sign Up Link | Est. Cost (10k entries) |
|
| 26 |
+
|----------|-------|--------------|-------------------------|
|
| 27 |
+
| **Qwen** | Qwen-Max | https://dashscope.aliyun.com/ | Varies |
|
| 28 |
+
| **Llama** | Llama-3.1-70B | https://www.together.ai/ | ~$5-10 |
|
| 29 |
+
| **Mistral** | Mistral Large | https://mistral.ai/ | ~$40-80 |
|
| 30 |
+
|
| 31 |
+
Create your API key file in `misc/credentials/`:
|
| 32 |
+
|
| 33 |
+
```bash
|
| 34 |
+
# For Qwen
|
| 35 |
+
echo "your-api-key-here" > misc/credentials/qwen_api_key.txt
|
| 36 |
+
|
| 37 |
+
# For Llama (via Together AI)
|
| 38 |
+
echo "your-api-key-here" > misc/credentials/together_api_key.txt
|
| 39 |
+
|
| 40 |
+
# For Mistral
|
| 41 |
+
echo "your-api-key-here" > misc/credentials/mistral_api_key.txt
|
| 42 |
+
```
|
| 43 |
+
|
| 44 |
+
### 3. Run the Notebook
|
| 45 |
+
|
| 46 |
+
Open `Section_2-3-4_Figure_8_deepfake_adapters.ipynb` and:
|
| 47 |
+
|
| 48 |
+
1. **Run all cells sequentially** from top to bottom
|
| 49 |
+
2. The default configuration uses Qwen in test mode (10 samples)
|
| 50 |
+
3. Review the test results
|
| 51 |
+
4. To process the full dataset, change in the LLM annotation cell:
|
| 52 |
+
```python
|
| 53 |
+
TEST_MODE = False
|
| 54 |
+
```
|
| 55 |
+
|
| 56 |
+
## Pipeline Stages
|
| 57 |
+
|
| 58 |
+
### Stage 1: NER & Name Cleaning
|
| 59 |
+
- **Input**: `data/CSV/real_person_adapters.csv`
|
| 60 |
+
- **Output**: `data/CSV/NER_POI_step01_pre_annotation.csv`
|
| 61 |
+
- **Function**: Cleans adapter names to extract real person names
|
| 62 |
+
- Removes: emoji, "lora", "v1", special characters
|
| 63 |
+
- Example: "IU LoRA v2 🎤" → "IU"
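
The cleaning step above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: the function name `clean_adapter_name` and the noise-token pattern are assumptions, and the real code may rely on the `emoji` package and spaCy NER instead of regular expressions.

```python
import re

# Hypothetical noise tokens ("lora", version markers such as "v1" or "v1.5");
# the notebook's actual list may be longer.
NOISE_TOKENS = re.compile(r"\b(?:lora|v\d+(?:\.\d+)*)\b", re.IGNORECASE)

def clean_adapter_name(raw_name: str) -> str:
    """Strip adapter jargon, version markers, emoji, and special characters."""
    # Remove noise tokens first, so "v1.5" disappears before its dot is stripped.
    text = NOISE_TOKENS.sub(" ", raw_name)
    # Keep only word characters, whitespace, apostrophes, and hyphens;
    # emoji and punctuation are replaced by spaces.
    text = re.sub(r"[^\w\s'\-]", " ", text)
    # Collapse repeated whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(clean_adapter_name("IU LoRA v2 🎤"))
```

Removing the noise tokens before the special characters is a deliberate ordering choice in this sketch; the reverse order would split a version string such as `v1.5` and leave a stray digit behind.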
|
| 64 |
+
|
| 65 |
+
### Stage 2: Country/Nationality Mapping
|
| 66 |
+
- **Input**: Step 1 output + `misc/lists/countries.csv`
|
| 67 |
+
- **Output**: `data/CSV/NER_POI_step02_annotated.csv`
|
| 68 |
+
- **Function**: Maps tags to standardized countries
|
| 69 |
+
- Example: "korean" → "South Korea"
|
| 70 |
+
- Excludes uninhabited territories
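
As a rough sketch of the tag-to-country mapping, assuming a simple lookup keyed by lowercase demonym: the dictionary below is a hypothetical excerpt, whereas the pipeline reads its mapping from `misc/lists/countries.csv`, whose column layout is not shown in this guide.

```python
# Hypothetical excerpt of the lookup; the real table comes from misc/lists/countries.csv.
DEMONYM_TO_COUNTRY = {
    "korean": ("South Korea", "South Korean"),
    "japanese": ("Japan", "Japanese"),
    "american": ("United States", "American"),
}

def map_tags_to_country(tags):
    """Return (likely_country, likely_nationality) for the first matching tag."""
    for tag in tags:
        match = DEMONYM_TO_COUNTRY.get(tag.strip().lower())
        if match:
            return match
    # No tag matched a known demonym.
    return (None, None)

print(map_tags_to_country(["korean", "celebrity", "singer"]))
```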
|
| 71 |
+
|
| 72 |
+
### Stage 3: LLM Profession Annotation
|
| 73 |
+
- **Input**: Step 2 output + `misc/lists/professions.csv`
|
| 74 |
+
- **Output**: `data/CSV/{llm}_annotated_POI_test.csv` (test) or `{llm}_annotated_POI.csv` (full)
|
| 75 |
+
- **Function**: Uses LLM to identify:
|
| 76 |
+
- Full name
|
| 77 |
+
- Gender
|
| 78 |
+
- Up to 3 professions (from profession list)
|
| 79 |
+
- Country
|
| 80 |
+
- **Progress**: Automatically saves every 10 rows
|
| 81 |
+
- **Resumable**: Can continue from last saved progress if interrupted
|
| 82 |
+
|
| 83 |
+
## Configuration Options
|
| 84 |
+
|
| 85 |
+
In the LLM annotation cell, you can configure:
|
| 86 |
+
|
| 87 |
+
```python
|
| 88 |
+
# Choose LLM provider
|
| 89 |
+
SELECTED_LLM = 'qwen' # Options: 'qwen', 'llama', 'mistral'
|
| 90 |
+
|
| 91 |
+
# Test mode (recommended for first run)
|
| 92 |
+
TEST_MODE = True # True = test on small sample
|
| 93 |
+
TEST_SIZE = 10 # Number of rows for testing
|
| 94 |
+
|
| 95 |
+
# Processing limits
|
| 96 |
+
MAX_ROWS = 20000 # Maximum rows to process (None = all)
|
| 97 |
+
SAVE_INTERVAL = 10 # Save progress every N rows
|
| 98 |
+
```
|
| 99 |
+
|
| 100 |
+
## Expected Output Format
|
| 101 |
+
|
| 102 |
+
The final dataset will include all original columns plus:
|
| 103 |
+
|
| 104 |
+
| Column | Description | Example |
|
| 105 |
+
|--------|-------------|---------|
|
| 106 |
+
| `real_name` | Cleaned name | "IU" |
|
| 107 |
+
| `full_name` | Full name from LLM | "Lee Ji-eun (IU)" |
|
| 108 |
+
| `gender` | Gender from LLM | "Female" |
|
| 109 |
+
| `profession_llm` | Up to 3 professions | "singer, actor, celebrity" |
|
| 110 |
+
| `country` | Country from LLM | "South Korea" |
|
| 111 |
+
| `likely_country` | Country from tags | "South Korea" |
|
| 112 |
+
| `likely_nationality` | Nationality from tags | "South Korean" |
|
| 113 |
+
| `tags` | Combined tags | "['korean', 'celebrity', 'singer']" |
|
| 114 |
+
|
| 115 |
+
## Troubleshooting
|
| 116 |
+
|
| 117 |
+
### API Key Errors
|
| 118 |
+
```
|
| 119 |
+
Warning: No API key for qwen
|
| 120 |
+
```
|
| 121 |
+
**Solution**: Ensure your API key file exists and contains only the key (no extra whitespace)
|
| 122 |
+
|
| 123 |
+
### Rate Limiting
|
| 124 |
+
```
|
| 125 |
+
Qwen API error (attempt 1/3): 429 Too Many Requests
|
| 126 |
+
```
|
| 127 |
+
**Solution**: The code automatically retries with exponential backoff. You can also:
|
| 128 |
+
- Increase the delay in `time.sleep(0.5)` to a higher value
|
| 129 |
+
- Process in smaller batches
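
A retry loop with exponential backoff might look like the sketch below. The helper name `call_with_backoff` and its default delays are assumptions; the notebook's own retry code may differ in detail, for example by catching only the API client's specific error type.

```python
import time

def call_with_backoff(request_fn, max_attempts=3, base_delay=1.0):
    """Call request_fn, waiting base_delay * 2**attempt seconds between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception as error:  # in practice, catch the client's specific exception
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            print(f"API error (attempt {attempt + 1}/{max_attempts}): {error}; "
                  f"retrying in {delay:.1f}s")
            time.sleep(delay)
```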
|
| 130 |
+
|
| 131 |
+
### Progress Lost
|
| 132 |
+
**Solution**: The pipeline saves progress automatically. Check:
|
| 133 |
+
- `data/CSV/{llm}_annotated_POI_test.csv` (test mode) or `data/CSV/{llm}_annotated_POI.csv` (full run) - your partial results
|
| 134 |
+
- `misc/{llm}_query_index.txt` - last processed index
|
| 135 |
+
- Just re-run the cell and it will resume from the last saved progress
|
| 136 |
+
|
| 137 |
+
### JSON Parse Errors from LLM
|
| 138 |
+
```
|
| 139 |
+
Qwen API error: JSONDecodeError
|
| 140 |
+
```
|
| 141 |
+
**Solution**: This is usually temporary. The code:
|
| 142 |
+
- Returns "Unknown" for failed queries
|
| 143 |
+
- Continues processing
|
| 144 |
+
- You can manually review/reprocess failed entries later
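
The fallback behaviour could be sketched as follows, assuming the LLM is asked to reply with a JSON object whose keys match the output columns listed earlier. The function name `parse_llm_response` is hypothetical.

```python
import json

# Output columns filled by the LLM, as listed in "Expected Output Format".
FIELDS = ("full_name", "gender", "profession_llm", "country")

def parse_llm_response(raw_text):
    """Parse a JSON object from an LLM reply; use "Unknown" for anything missing or unparsable."""
    fallback = {field: "Unknown" for field in FIELDS}
    try:
        parsed = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        return fallback
    if not isinstance(parsed, dict):
        return fallback
    # Missing or empty values also become "Unknown".
    return {field: parsed.get(field) or "Unknown" for field in FIELDS}
```

Rows whose fields all read "Unknown" can then be filtered out afterwards for manual review or reprocessing.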
|
| 145 |
+
|
| 146 |
+
## Cost Management
|
| 147 |
+
|
| 148 |
+
### Estimate Costs Before Processing
|
| 149 |
+
|
| 150 |
+
For a dataset with N entries:
|
| 151 |
+
- **Qwen**: Contact Alibaba Cloud for pricing
|
| 152 |
+
- **Llama**: ~N × $0.0005 = ~$5 per 10k entries
|
| 153 |
+
- **Mistral**: ~N × $0.004 = ~$40 per 10k entries
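
The per-entry figures above translate into a quick estimate such as the one below. The rates are the approximate values quoted in this guide, not current provider pricing, and the function name is illustrative.

```python
# Approximate per-entry costs in USD, taken from the estimates above.
PER_ENTRY_USD = {"llama": 0.0005, "mistral": 0.004}

def estimate_cost(n_entries: int, cost_per_entry: float) -> float:
    """Estimated total cost in USD, rounded to cents."""
    return round(n_entries * cost_per_entry, 2)

for provider, rate in PER_ENTRY_USD.items():
    print(f"{provider}: ~${estimate_cost(10_000, rate):.2f} per 10k entries")
```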
|
| 154 |
+
|
| 155 |
+
### Best Practices
|
| 156 |
+
|
| 157 |
+
1. **Always test first**: Run with `TEST_MODE = True` on 10 samples
|
| 158 |
+
2. **Monitor API usage**: Check your API provider's dashboard
|
| 159 |
+
3. **Use cheaper models first**: Try Llama before Mistral
|
| 160 |
+
4. **Process in batches**: Set `MAX_ROWS` to process incrementally
|
| 161 |
+
5. **Save intermediate results**: The automatic saving feature helps prevent data loss
|
| 162 |
+
|
| 163 |
+
## Comparing Multiple LLMs
|
| 164 |
+
|
| 165 |
+
To compare results from different LLMs:
|
| 166 |
+
|
| 167 |
+
1. Run the pipeline with `SELECTED_LLM = 'qwen'`
|
| 168 |
+
2. Change to `SELECTED_LLM = 'llama'` and run again
|
| 169 |
+
3. Change to `SELECTED_LLM = 'mistral'` and run again
|
| 170 |
+
4. Compare the three output files:
|
| 171 |
+
- `qwen_annotated_POI.csv`
|
| 172 |
+
- `llama_annotated_POI.csv`
|
| 173 |
+
- `mistral_annotated_POI.csv`
|
| 174 |
+
|
| 175 |
+
## Files Created
|
| 176 |
+
|
| 177 |
+
The pipeline creates these files:
|
| 178 |
+
|
| 179 |
+
```
|
| 180 |
+
data/CSV/
|
| 181 |
+
├── NER_POI_step01_pre_annotation.csv # After name cleaning
|
| 182 |
+
├── NER_POI_step02_annotated.csv # After country mapping
|
| 183 |
+
├── qwen_annotated_POI_test.csv # Test results (Qwen)
|
| 184 |
+
├── qwen_annotated_POI.csv # Full results (Qwen)
|
| 185 |
+
├── llama_annotated_POI.csv # Full results (Llama)
|
| 186 |
+
└── mistral_annotated_POI.csv # Full results (Mistral)
|
| 187 |
+
|
| 188 |
+
misc/
|
| 189 |
+
├── qwen_query_index.txt # Progress tracking
|
| 190 |
+
├── llama_query_index.txt # Progress tracking
|
| 191 |
+
└── mistral_query_index.txt # Progress tracking
|
| 192 |
+
```
|
| 193 |
+
|
| 194 |
+
## Support
|
| 195 |
+
|
| 196 |
+
For issues or questions:
|
| 197 |
+
1. Check this guide for common problems
|
| 198 |
+
2. Review `misc/credentials/README.md` for API setup
|
| 199 |
+
3. Read the notebook documentation (first cell)
|
| 200 |
+
4. Check API provider documentation for service-specific issues
|
| 201 |
+
|
| 202 |
+
## Ethical Considerations
|
| 203 |
+
|
| 204 |
+
This research documents ethical problems with AI deepfake models. The dataset and analysis help:
|
| 205 |
+
- Understand the scope of unauthorized person likeness usage
|
| 206 |
+
- Document professions/demographics most affected
|
| 207 |
+
- Inform policy and technical solutions
|
| 208 |
+
- Raise awareness about deepfake technology misuse
|
| 209 |
+
|
| 210 |
+
Use this data responsibly and respect individual privacy and consent.
|
md/LLM_MODELS_COMPARISON.md
ADDED
|
@@ -0,0 +1,326 @@
|
| 1 |
+
# LLM Models for Deepfake Annotation
|
| 2 |
+
|
| 3 |
+
## Overview
|
| 4 |
+
|
| 5 |
+
The pipeline now includes **6 LLM options** in individual cells for easy comparison:
|
| 6 |
+
|
| 7 |
+
1. **Deepseek** - Testing (use first!)
|
| 8 |
+
2. **Qwen (API)** - Chinese (Alibaba Cloud)
|
| 9 |
+
3. **Llama** - American (Meta)
|
| 10 |
+
4. **Mixtral** - French (Mistral AI)
|
| 11 |
+
5. **Gemma** - American Open Source (Google)
|
| 12 |
+
6. **Qwen-2.5-32B Local** - FREE local inference (NEW!)
|
| 13 |
+
|
| 14 |
+
## The 6 LLMs
|
| 15 |
+
|
| 16 |
+
### 1. Deepseek (Testing)
|
| 17 |
+
**Cell 10**
|
| 18 |
+
|
| 19 |
+
- **Model**: deepseek-chat
|
| 20 |
+
- **Provider**: DeepSeek
|
| 21 |
+
- **API**: https://platform.deepseek.com/
|
| 22 |
+
- **Cost**: ~$0.14-0.28 per 1M tokens (~$1-2 for 10k entries)
|
| 23 |
+
- **Use case**: **Test this first!** Cheapest option to verify pipeline works
|
| 24 |
+
- **API Key**: `misc/credentials/deepseek_api_key.txt`
|
| 25 |
+
|
| 26 |
+
---
|
| 27 |
+
|
| 28 |
+
### 2. Qwen API (Chinese)
|
| 29 |
+
**Cells 11-12**
|
| 30 |
+
|
| 31 |
+
- **Model**: qwen-max (automatically uses Qwen3-Max)
|
| 32 |
+
- **Provider**: Alibaba Cloud DashScope
|
| 33 |
+
- **API**: https://dashscope.aliyun.com/
|
| 34 |
+
- **Cost**: Variable (check Alibaba pricing)
|
| 35 |
+
- **Use case**: Chinese company, strong multilingual support
|
| 36 |
+
- **API Key**: `misc/credentials/qwen_api_key.txt`
|
| 37 |
+
- **Note**: Uses latest Qwen3-Max when you specify `qwen-max`
|
| 38 |
+
|
| 39 |
+
---
|
| 40 |
+
|
| 41 |
+
### 6. Qwen-2.5-32B Local (FREE!)
|
| 42 |
+
**Cells 19-20** (NEW!)
|
| 43 |
+
|
| 44 |
+
- **Model**: qwen2.5:32b-instruct
|
| 45 |
+
- **Provider**: Ollama (local inference)
|
| 46 |
+
- **Setup**: https://ollama.com/
|
| 47 |
+
- **Cost**: **$0** (FREE - no API costs!)
|
| 48 |
+
- **Requirements**:
|
| 49 |
+
- A100 80GB GPU (or similar)
|
| 50 |
+
- ~25GB VRAM during inference
|
| 51 |
+
- ~20GB storage for model download
|
| 52 |
+
- Ollama installed
|
| 53 |
+
- **Speed**: 5-10 tokens/sec on A100 (~100-200 samples/hour)
|
| 54 |
+
- **Use case**:
|
| 55 |
+
- ✅ Large datasets (>1000 samples) where cost matters
|
| 56 |
+
- ✅ Privacy-sensitive research data
|
| 57 |
+
- ✅ Offline processing
|
| 58 |
+
- ✅ Strong multilingual support
|
| 59 |
+
- **Setup guide**: See `QWEN_LOCAL_SETUP.md`
|
| 60 |
+
|
| 61 |
+
---
|
| 62 |
+
|
| 63 |
+
### 3. Llama (American)
|
| 64 |
+
**Cells 13-14**
|
| 65 |
+
|
| 66 |
+
- **Model**: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
|
| 67 |
+
- **Provider**: Together AI (hosting Meta's model)
|
| 68 |
+
- **Developer**: Meta (American)
|
| 69 |
+
- **API**: https://www.together.ai/
|
| 70 |
+
- **Cost**: ~$0.90 per 1M tokens (~$5-10 for 10k entries)
|
| 71 |
+
- **Use case**: Open-source American model, good quality
|
| 72 |
+
- **API Key**: `misc/credentials/together_api_key.txt`
|
| 73 |
+
|
| 74 |
+
---
|
| 75 |
+
|
| 76 |
+
### 4. Mixtral (French)
|
| 77 |
+
**Cells 15-16**
|
| 78 |
+
|
| 79 |
+
- **Model**: open-mixtral-8x22b
|
| 80 |
+
- **Provider**: Mistral AI
|
| 81 |
+
- **Developer**: Mistral AI (French)
|
| 82 |
+
- **API**: https://mistral.ai/
|
| 83 |
+
- **Cost**: ~$2 per 1M tokens (~$10-20 for 10k entries)
|
| 84 |
+
- **Use case**: European alternative, Mixture-of-Experts architecture
|
| 85 |
+
- **API Key**: `misc/credentials/mistral_api_key.txt`
|
| 86 |
+
- **Note**: Using open-mixtral-8x22b (cheaper than mistral-large)
|
| 87 |
+
|
| 88 |
+
---
|
| 89 |
+
|
| 90 |
+
### 5. Gemma (American Open Source)
|
| 91 |
+
**Cells 17-18**
|
| 92 |
+
|
| 93 |
+
- **Model**: google/gemma-2-27b-it
|
| 94 |
+
- **Provider**: Together AI (hosting Google's model)
|
| 95 |
+
- **Developer**: Google (American)
|
| 96 |
+
- **API**: https://www.together.ai/ (same as Llama)
|
| 97 |
+
- **Cost**: ~$0.80 per 1M tokens (~$4-8 for 10k entries)
|
| 98 |
+
- **Use case**: American open-source alternative, competitive quality
|
| 99 |
+
- **API Key**: `misc/credentials/together_api_key.txt` (same as Llama)
|
| 100 |
+
- **Note**: Fully open-source, can be self-hosted
|
| 101 |
+
|
| 102 |
+
---
|
| 103 |
+
|
| 104 |
+
## Cost Comparison (10,000 entries)
|
| 105 |
+
|
| 106 |
+
| Model | Provider | Cost | Time | Origin |
|
| 107 |
+
|-------|----------|------|------|--------|
|
| 108 |
+
| **Qwen-2.5-32B Local** | Ollama (local) | **$0** | ~50-100 hrs | 🇨🇳 Chinese |
|
| 109 |
+
| **Deepseek** | DeepSeek | ~$1-2 | ~5-10 hrs | 🇨🇳 Chinese |
|
| 110 |
+
| **Gemma 2** | Together AI | ~$4-8 | ~5-10 hrs | 🇺🇸 American (open) |
|
| 111 |
+
| **Llama 3.1** | Together AI | ~$5-10 | ~5-10 hrs | 🇺🇸 American (open) |
|
| 112 |
+
| **Mixtral** | Mistral AI | ~$10-20 | ~5-10 hrs | 🇫🇷 French (open) |
|
| 113 |
+
| **Qwen API** | Alibaba | Variable | ~5-10 hrs | 🇨🇳 Chinese |
|
| 114 |
+
|
| 115 |
+
**Note**: Local inference is FREE but slower. Good for large datasets where cost matters more than time.
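
The API cost ranges in the table can be reproduced with simple token arithmetic. The sketch below is only an illustration: the average of 800 tokens per entry (prompt plus response) is a hypothetical value that this guide does not state, so measure it on a `TEST_MODE` run before relying on the result.

```python
def estimate_cost_usd(n_entries: int, tokens_per_entry: float,
                      usd_per_million_tokens: float) -> float:
    """Rough API cost: total tokens times the per-million-token price."""
    total_tokens = n_entries * tokens_per_entry
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical average (assumption, not taken from the notebook).
TOKENS_PER_ENTRY = 800

# Per-million-token prices as quoted in the sections above.
for model, price in [("Llama 3.1", 0.90), ("Gemma 2", 0.80), ("Mixtral", 2.00)]:
    cost = estimate_cost_usd(10_000, TOKENS_PER_ENTRY, price)
    print(f"{model}: ~${cost:.2f} for 10,000 entries")
```

With that assumed token count, all three estimates land inside the ranges listed in the table.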
|
| 116 |
+
|
| 117 |
+
## Recommended Testing Order
|
| 118 |
+
|
| 119 |
+
### 1. Start with Deepseek
|
| 120 |
+
```python
|
| 121 |
+
# Cell 10
|
| 122 |
+
TEST_MODE = True
|
| 123 |
+
TEST_SIZE = 10
|
| 124 |
+
```
|
| 125 |
+
- **Why**: Cheapest, verify pipeline works
|
| 126 |
+
- **Cost**: Pennies for 10 samples
|
| 127 |
+
|
| 128 |
+
### 2. Compare on Small Sample
|
| 129 |
+
Pick 2-3 models and run them on the same 100 samples:
|
| 130 |
+
```python
|
| 131 |
+
# In each cell:
|
| 132 |
+
TEST_MODE = True
|
| 133 |
+
TEST_SIZE = 100
|
| 134 |
+
```
|
| 135 |
+
|
| 136 |
+
**Good combinations:**
|
| 137 |
+
- Budget: Deepseek + Gemma
|
| 138 |
+
- Quality: Llama + Mixtral
|
| 139 |
+
- Geographic: Qwen + Llama + Mixtral
|
| 140 |
+
|
| 141 |
+
### 3. Production Run
|
| 142 |
+
Choose the best model from testing and run it on the full dataset:
|
| 143 |
+
```python
|
| 144 |
+
TEST_MODE = False
|
| 145 |
+
MAX_ROWS = None # or 20000
|
| 146 |
+
```
|
| 147 |
+
|
| 148 |
+
## API Key Setup
|
| 149 |
+
|
| 150 |
+
### For Deepseek & Qwen (separate keys):
|
| 151 |
+
```bash
|
| 152 |
+
echo "your-deepseek-key" > misc/credentials/deepseek_api_key.txt
|
| 153 |
+
echo "your-qwen-key" > misc/credentials/qwen_api_key.txt
|
| 154 |
+
```
|
| 155 |
+
|
| 156 |
+
### For Llama & Gemma (same Together AI key):
|
| 157 |
+
```bash
|
| 158 |
+
echo "your-together-key" > misc/credentials/together_api_key.txt
|
| 159 |
+
```
|
| 160 |
+
Both Llama and Gemma use the same Together AI key!
|
| 161 |
+
|
| 162 |
+
### For Mixtral:
|
| 163 |
+
```bash
|
| 164 |
+
echo "your-mistral-key" > misc/credentials/mistral_api_key.txt
|
| 165 |
+
```
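
A key written with `echo` ends in a newline, so it is worth stripping whitespace when reading it back. The helper below is a minimal sketch of how a notebook cell might load a key; the name `load_api_key` is hypothetical and the notebook's actual loading code may differ.

```python
from pathlib import Path


def load_api_key(path: str) -> str:
    """Read an API key from a text file and strip surrounding whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        # An empty file usually means the echo command was run without a key.
        raise ValueError(f"API key file is empty: {path}")
    return key
```

For example, `load_api_key("misc/credentials/together_api_key.txt")` would return the key shared by the Llama and Gemma cells.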
|
| 166 |
+
|
| 167 |
+
## Output Files
|
| 168 |
+
|
| 169 |
+
Each LLM saves to a separate file:
|
| 170 |
+
|
| 171 |
+
```
|
| 172 |
+
data/CSV/
|
| 173 |
+
├── deepseek_annotated_POI_test.csv # Deepseek test
|
| 174 |
+
├── deepseek_annotated_POI.csv # Deepseek full
|
| 175 |
+
├── qwen_annotated_POI_test.csv # Qwen API test
|
| 176 |
+
├── qwen_annotated_POI.csv # Qwen API full
|
| 177 |
+
├── qwen_local_annotated_POI_test.csv # Qwen Local test (NEW!)
|
| 178 |
+
├── qwen_local_annotated_POI.csv # Qwen Local full (NEW!)
|
| 179 |
+
├── llama_annotated_POI_test.csv # Llama test
|
| 180 |
+
├── llama_annotated_POI.csv # Llama full
|
| 181 |
+
├── mixtral_annotated_POI_test.csv # Mixtral test
|
| 182 |
+
├── mixtral_annotated_POI.csv # Mixtral full
|
| 183 |
+
├── gemma_annotated_POI_test.csv # Gemma test
|
| 184 |
+
└── gemma_annotated_POI.csv # Gemma full
|
| 185 |
+
```
|
| 186 |
+
|
| 187 |
+
## Comparing Results
|
| 188 |
+
|
| 189 |
+
After running multiple LLMs, compare results:
|
| 190 |
+
|
| 191 |
+
```python
|
| 192 |
+
import pandas as pd
|
| 193 |
+
|
| 194 |
+
# Load results from different models
|
| 195 |
+
deepseek_df = pd.read_csv('data/CSV/deepseek_annotated_POI_test.csv')
|
| 196 |
+
qwen_df = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
|
| 197 |
+
qwen_local_df = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv') # NEW!
|
| 198 |
+
llama_df = pd.read_csv('data/CSV/llama_annotated_POI_test.csv')
|
| 199 |
+
mixtral_df = pd.read_csv('data/CSV/mixtral_annotated_POI_test.csv')
|
| 200 |
+
gemma_df = pd.read_csv('data/CSV/gemma_annotated_POI_test.csv')
|
| 201 |
+
|
| 202 |
+
# Compare profession distributions
|
| 203 |
+
print("Deepseek professions:", deepseek_df['profession_llm'].value_counts().head())
|
| 204 |
+
print("Qwen API professions:", qwen_df['profession_llm'].value_counts().head())
|
| 205 |
+
print("Qwen Local professions:", qwen_local_df['profession_llm'].value_counts().head()) # NEW!
|
| 206 |
+
print("Llama professions:", llama_df['profession_llm'].value_counts().head())
|
| 207 |
+
print("Mixtral professions:", mixtral_df['profession_llm'].value_counts().head())
|
| 208 |
+
print("Gemma professions:", gemma_df['profession_llm'].value_counts().head())
|
| 209 |
+
|
| 210 |
+
# Compare specific cases
|
| 211 |
+
print("\nIrene identification:")
|
| 212 |
+
print("Deepseek:", deepseek_df[deepseek_df['real_name'] == 'Irene']['full_name'].values)
|
| 213 |
+
print("Qwen API:", qwen_df[qwen_df['real_name'] == 'Irene']['full_name'].values)
|
| 214 |
+
print("Qwen Local:", qwen_local_df[qwen_local_df['real_name'] == 'Irene']['full_name'].values)
|
| 215 |
+
print("Llama:", llama_df[llama_df['real_name'] == 'Irene']['full_name'].values)
|
| 216 |
+
print("Mixtral:", mixtral_df[mixtral_df['real_name'] == 'Irene']['full_name'].values)
|
| 217 |
+
print("Gemma:", gemma_df[gemma_df['real_name'] == 'Irene']['full_name'].values)
|
| 218 |
+
```
|
| 219 |
+
|
| 220 |
+
## Model Characteristics
|
| 221 |
+
|
| 222 |
+
### Deepseek
|
| 223 |
+
- ✅ Very cheap
|
| 224 |
+
- ✅ Good for testing
|
| 225 |
+
- ⚠️ Less documentation
|
| 226 |
+
- 🇨🇳 Chinese company
|
| 227 |
+
|
| 228 |
+
### Qwen (Qwen3-Max)
|
| 229 |
+
- ✅ Latest version automatically used
|
| 230 |
+
- ✅ Strong multilingual
|
| 231 |
+
- ✅ Good Asian name recognition
|
| 232 |
+
- 💰 Variable cost
|
| 233 |
+
- 🇨🇳 Chinese company (Alibaba)
|
| 234 |
+
|
| 235 |
+
### Llama 3.1 70B
|
| 236 |
+
- ✅ Open-source
|
| 237 |
+
- ✅ Strong overall performance
|
| 238 |
+
- ✅ Well-documented
|
| 239 |
+
- ✅ American (Meta)
|
| 240 |
+
- 💰 Mid-range cost
|
| 241 |
+
|
| 242 |
+
### Mixtral 8x22B
|
| 243 |
+
- ✅ Open-source
|
| 244 |
+
- ✅ MoE architecture (efficient)
|
| 245 |
+
- ✅ European alternative
|
| 246 |
+
- 💰 Mid-range cost
|
| 247 |
+
- 🇫🇷 French company
|
| 248 |
+
|
| 249 |
+
### Gemma 2 27B
|
| 250 |
+
- ✅ Fully open-source
|
| 251 |
+
- ✅ Can self-host
|
| 252 |
+
- ✅ American (Google)
|
| 253 |
+
- ✅ Cheap via API
|
| 254 |
+
- ✅ Good quality for size
|
| 255 |
+
|
| 256 |
+
### Qwen-2.5-32B Local (NEW!)
|
| 257 |
+
- ✅ **FREE** - $0 cost (no API fees)
|
| 258 |
+
- ✅ **FAST** - Local inference on A100 (5-10 tokens/sec)
|
| 259 |
+
- ✅ **PRIVATE** - Data never leaves your machine
|
| 260 |
+
- ✅ **OFFLINE** - Works without internet
|
| 261 |
+
- ✅ **HIGH QUALITY** - 32B parameter model
|
| 262 |
+
- ✅ Strong multilingual support
|
| 263 |
+
- ⚠️ Requires: A100 80GB GPU, ~25GB VRAM, Ollama installed
|
| 264 |
+
- 🇨🇳 Chinese company (Alibaba)
|
| 265 |
+
- 📦 Model size: ~20GB download
|
| 266 |
+
|
| 267 |
+
## Decision Matrix
|
| 268 |
+
|
| 269 |
+
### If you prioritize...
|
| 270 |
+
|
| 271 |
+
**FREE / Zero Cost**: Use **Qwen-2.5-32B Local** (no API fees!)
|
| 272 |
+
|
| 273 |
+
**Cost** (with API): Use **Deepseek** or **Gemma**
|
| 274 |
+
|
| 275 |
+
**Quality**: Use **Qwen-2.5-32B Local**, **Llama**, or **Mixtral**
|
| 276 |
+
|
| 277 |
+
**Privacy**: Use **Qwen-2.5-32B Local** (data stays on your machine)
|
| 278 |
+
|
| 279 |
+
**American/Open Source**: Use **Gemma** or **Llama**
|
| 280 |
+
|
| 281 |
+
**Asian Names**: Use **Qwen** (API or Local - strong multilingual)
|
| 282 |
+
|
| 283 |
+
**European Provider**: Use **Mixtral**
|
| 284 |
+
|
| 285 |
+
**Testing**: Use **Deepseek** first, always!
|
| 286 |
+
|
| 287 |
+
## Running Multiple Models
|
| 288 |
+
|
| 289 |
+
You can run all 6 models in sequence:
|
| 290 |
+
|
| 291 |
+
```python
|
| 292 |
+
# 1. Run Cell 10 (Deepseek) - verify works (~$1-2 for 10k)
|
| 293 |
+
# 2. Run Cell 12 (Qwen API) - Chinese perspective (~variable cost)
|
| 294 |
+
# 3. Run Cell 14 (Llama) - American perspective (~$5-10 for 10k)
|
| 295 |
+
# 4. Run Cell 16 (Mixtral) - European perspective (~$10-20 for 10k)
|
| 296 |
+
# 5. Run Cell 18 (Gemma) - Open source perspective (~$4-8 for 10k)
|
| 297 |
+
# 6. Run Cell 20 (Qwen-2.5-32B Local) - FREE local inference ($0!)
|
| 298 |
+
```
|
| 299 |
+
|
| 300 |
+
Each saves to its own file, so you can compare results!
|
| 301 |
+
|
| 302 |
+
## Notes
|
| 303 |
+
|
| 304 |
+
- **Llama and Gemma use the same API key** (Together AI)
|
| 305 |
+
- All models use the **same 9 profession categories**
|
| 306 |
+
- All models have **automatic retries** with exponential backoff
|
| 307 |
+
- All models **save progress** every 10 rows
|
| 308 |
+
- All models are **resumable** if interrupted
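
For readers unfamiliar with the retry behaviour mentioned above, here is a minimal sketch of exponential backoff. The function name, retry count, and delays are assumptions chosen for illustration and are not taken from the notebook.

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (plus jitter), then retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt; small jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting.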
|
| 309 |
+
|
| 310 |
+
## Summary
|
| 311 |
+
|
| 312 |
+
You now have **6 LLM options** to choose from:
|
| 313 |
+
|
| 314 |
+
1. 🧪 **Deepseek** - Test first (cheapest API)
|
| 315 |
+
2. 🇨🇳 **Qwen3-Max API** - Chinese, strong multilingual
|
| 316 |
+
3. 🇺🇸 **Llama 3.1 70B** - American, open-source
|
| 317 |
+
4. 🇫🇷 **Mixtral 8x22B** - French, open-source MoE
|
| 318 |
+
5. 🇺🇸 **Gemma 2 27B** - American open-source (Google)
|
| 319 |
+
6. 💰 **Qwen-2.5-32B Local** - FREE local inference (NEW!)
|
| 320 |
+
|
| 321 |
+
Each in its own cell, easy to run and compare! 🎉
|
| 322 |
+
|
| 323 |
+
**Recommended workflow**:
|
| 324 |
+
1. Test with Deepseek (Cell 10) - verify pipeline works
|
| 325 |
+
2. For small datasets (<1000): Use API (Deepseek/Gemma/Llama)
|
| 326 |
+
3. For large datasets (>1000): Use Qwen-2.5-32B Local (Cell 20) - FREE!
|
md/QUICK_START_LOCAL.md
ADDED
|
@@ -0,0 +1,171 @@
|
| 1 |
+
# Quick Start: Running Qwen-2.5-32B Locally
|
| 2 |
+
|
| 3 |
+
This is a quick guide to get you started with FREE local LLM inference using your A100 GPU.
|
| 4 |
+
|
| 5 |
+
## Why Local?
|
| 6 |
+
|
| 7 |
+
✅ **$0 cost** - No API fees
|
| 8 |
+
✅ **Privacy** - Data stays on your machine
|
| 9 |
+
✅ **Quality** - 32B parameter model with strong performance
|
| 10 |
+
|
| 11 |
+
## Setup (One-time)
|
| 12 |
+
|
| 13 |
+
### 1. Pull the Model (~10-30 minutes)
|
| 14 |
+
|
| 15 |
+
```bash
|
| 16 |
+
# Pull Qwen-2.5-32B-Instruct
|
| 17 |
+
ollama pull qwen2.5:32b-instruct
|
| 18 |
+
|
| 19 |
+
# Wait for download to complete (~20GB)
|
| 20 |
+
# Model will be cached at: ~/.ollama/models/
|
| 21 |
+
```
|
| 22 |
+
|
| 23 |
+
### 2. Verify Model is Ready
|
| 24 |
+
|
| 25 |
+
```bash
|
| 26 |
+
# List installed models
|
| 27 |
+
ollama list
|
| 28 |
+
|
| 29 |
+
# Should show: qwen2.5:32b-instruct
|
| 30 |
+
|
| 31 |
+
# Test it
|
| 32 |
+
ollama run qwen2.5:32b-instruct "Hello, who are you?"
|
| 33 |
+
```
|
| 34 |
+
|
| 35 |
+
If you see a response, you're ready! ✅
|
| 36 |
+
|
| 37 |
+
## Running the Notebook
|
| 38 |
+
|
| 39 |
+
### Open the Notebook
|
| 40 |
+
|
| 41 |
+
```bash
|
| 42 |
+
cd jupyter_notebooks
|
| 43 |
+
jupyter notebook Section_2-3-4_Figure_8_deepfake_adapters.ipynb
|
| 44 |
+
```
|
| 45 |
+
|
| 46 |
+
### Run the Cells
|
| 47 |
+
|
| 48 |
+
1. **Cell 5**: NER & Name Cleaning (processes names)
|
| 49 |
+
2. **Cell 7**: Country/Nationality Mapping
|
| 50 |
+
3. **Cell 20**: Qwen-2.5-32B Local Annotation 👈 **This is the new one!**
|
| 51 |
+
|
| 52 |
+
### Configure Cell 20
|
| 53 |
+
|
| 54 |
+
```python
|
| 55 |
+
# Start with test mode
|
| 56 |
+
TEST_MODE = True
|
| 57 |
+
TEST_SIZE = 10
|
| 58 |
+
|
| 59 |
+
# Then run full dataset
|
| 60 |
+
TEST_MODE = False
|
| 61 |
+
MAX_ROWS = 20000 # or None for all
|
| 62 |
+
```
|
| 63 |
+
|
| 64 |
+
### Run Cell 20
|
| 65 |
+
|
| 66 |
+
Just click "Run" or press Shift+Enter. The cell will:
|
| 67 |
+
1. Check if Ollama is installed ✅
|
| 68 |
+
2. Check if model is available ✅
|
| 69 |
+
3. Start annotating
|
| 70 |
+
4. Save progress every 10 rows
|
| 71 |
+
5. Show completion stats
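
The save-and-resume behaviour in steps 3 and 4 can be sketched as follows. This is an illustrative outline using only the standard library, not the notebook's actual code; the function names and the two-column output layout are assumptions.

```python
import csv
import os


def annotate_with_checkpoints(rows, annotate, out_path, save_interval=10):
    """Annotate rows in order, appending to out_path every save_interval rows.

    Rows already written to out_path are skipped, so a rerun resumes
    where the previous run stopped.
    """
    done = 0
    if os.path.exists(out_path):
        with open(out_path, newline="", encoding="utf-8") as f:
            done = sum(1 for _ in csv.reader(f))
    buffer = []
    for row in rows[done:]:
        buffer.append([row, annotate(row)])
        if len(buffer) >= save_interval:
            _append(out_path, buffer)
            buffer = []
    if buffer:  # flush the final partial batch
        _append(out_path, buffer)


def _append(out_path, buffer):
    """Append a batch of [input, annotation] pairs to the CSV file."""
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(buffer)
```

In this sketch an interruption loses at most `save_interval - 1` annotations, which is the trade-off behind saving every 10 rows rather than after every row.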
|
| 72 |
+
|
| 73 |
+
### Monitor Progress
|
| 74 |
+
|
| 75 |
+
```
|
| 76 |
+
Qwen Local: 100%|██████████| 10/10 [02:30<00:00, 15.0s/it]
|
| 77 |
+
✅ Saved after 10 rows (~240.0 samples/hour)
|
| 78 |
+
|
| 79 |
+
✅ Done! Results: data/CSV/qwen_local_annotated_POI_test.csv
|
| 80 |
+
Total time: 2.5 minutes
|
| 81 |
+
Average speed: 240.0 samples/hour
|
| 82 |
+
```
|
| 83 |
+
|
| 84 |
+
## Performance
|
| 85 |
+
|
| 86 |
+
On your A100 80GB:
|
| 87 |
+
- **Speed**: ~5-10 tokens/second
|
| 88 |
+
- **Throughput**: ~100-200 samples/hour
|
| 89 |
+
- **Memory**: ~22-25GB VRAM
|
| 90 |
+
- **Cost**: $0
|
| 91 |
+
|
| 92 |
+
### Time Estimates
|
| 93 |
+
|
| 94 |
+
| Dataset Size | Time |
|
| 95 |
+
|-------------|------|
|
| 96 |
+
| 10 samples (test) | ~2-3 minutes |
|
| 97 |
+
| 100 samples | ~20-30 minutes |
|
| 98 |
+
| 1,000 samples | ~5-10 hours |
|
| 99 |
+
| 10,000 samples | ~50-100 hours |
|
| 100 |
+
|
| 101 |
+
**Tip**: Run overnight or over the weekend for large datasets!
|
| 102 |
+
|
| 103 |
+
## Troubleshooting
|
| 104 |
+
|
| 105 |
+
### "Model not found"
|
| 106 |
+
|
| 107 |
+
```bash
|
| 108 |
+
ollama pull qwen2.5:32b-instruct
|
| 109 |
+
```
|
| 110 |
+
|
| 111 |
+
### "Ollama not running"
|
| 112 |
+
|
| 113 |
+
```bash
|
| 114 |
+
ollama serve
|
| 115 |
+
```
|
| 116 |
+
|
| 117 |
+
### Out of Memory
|
| 118 |
+
|
| 119 |
+
Your A100 has 80GB VRAM - this should NOT happen with the 32B model (~25GB VRAM).
|
| 120 |
+
|
| 121 |
+
If it does, try the quantized version:
|
| 122 |
+
```bash
|
| 123 |
+
ollama pull qwen2.5:32b-instruct-q4_0  # Roughly 20GB VRAM (4-bit weights for 32B parameters)
|
| 124 |
+
```
|
| 125 |
+
|
| 126 |
+
## Output
|
| 127 |
+
|
| 128 |
+
Results saved to:
|
| 129 |
+
- Test: `data/CSV/qwen_local_annotated_POI_test.csv`
|
| 130 |
+
- Full: `data/CSV/qwen_local_annotated_POI.csv`
|
| 131 |
+
|
| 132 |
+
Same format as API results - easy to compare!
|
| 133 |
+
|
| 134 |
+
## Custom Model Cache Location
|
| 135 |
+
|
| 136 |
+
To store models in `data/models/`:
|
| 137 |
+
|
| 138 |
+
```bash
|
| 139 |
+
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
|
| 140 |
+
ollama pull qwen2.5:32b-instruct
|
| 141 |
+
```
|
| 142 |
+
|
| 143 |
+
## Comparing API vs Local
|
| 144 |
+
|
| 145 |
+
After running both:
|
| 146 |
+
|
| 147 |
+
```python
|
| 148 |
+
import pandas as pd
|
| 149 |
+
|
| 150 |
+
qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
|
| 151 |
+
qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')
|
| 152 |
+
|
| 153 |
+
# Check agreement
|
| 154 |
+
agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
|
| 155 |
+
print(f"Agreement: {agreement*100:.1f}%")
|
| 156 |
+
```
|
| 157 |
+
|
| 158 |
+
## Full Documentation
|
| 159 |
+
|
| 160 |
+
For more details, see:
|
| 161 |
+
- `QWEN_LOCAL_SETUP.md` - Complete setup guide
|
| 162 |
+
- `LLM_MODELS_COMPARISON.md` - All 6 LLM options compared
|
| 163 |
+
|
| 164 |
+
## Summary
|
| 165 |
+
|
| 166 |
+
✅ Ollama already installed
|
| 167 |
+
✅ A100 80GB GPU - perfect for Qwen-2.5-32B
|
| 168 |
+
✅ FREE inference - no API costs
|
| 169 |
+
✅ Privacy - data stays local
|
| 170 |
+
|
| 171 |
+
**Next step**: Run Cell 20 in the notebook! 🚀
|
md/QWEN_LOCAL_SETUP.md
ADDED
|
@@ -0,0 +1,321 @@
|
| 1 |
+
# Running Qwen-2.5-32B Locally with Ollama
|
| 2 |
+
|
| 3 |
+
This guide explains how to run Qwen-2.5-32B-Instruct locally on your A100 GPU using Ollama.
|
| 4 |
+
|
| 5 |
+
## Why Run Locally?
|
| 6 |
+
|
| 7 |
+
✅ **FREE** - No API costs ($0 per query)
|
| 8 |
+
✅ **FAST** - Local inference on A100 (5-10 tokens/sec)
|
| 9 |
+
✅ **PRIVATE** - Data never leaves your machine
|
| 10 |
+
✅ **OFFLINE** - Works without internet (after model download)
|
| 11 |
+
✅ **HIGH QUALITY** - 32B parameter model with strong multilingual support
|
| 12 |
+
|
| 13 |
+
## System Requirements
|
| 14 |
+
|
| 15 |
+
### Minimum Specs
|
| 16 |
+
- **GPU**: NVIDIA A100 80GB (or similar high-end GPU)
|
| 17 |
+
- **VRAM**: 22-25GB during inference
|
| 18 |
+
- **RAM**: 32GB system RAM (you have 265GB - more than enough!)
|
| 19 |
+
- **Storage**: ~20GB for model download
|
| 20 |
+
- **OS**: Linux (you're on Ubuntu)
|
| 21 |
+
|
| 22 |
+
### Your Setup
|
| 23 |
+
✅ NVIDIA A100 80GB - Perfect for Qwen-2.5-32B
|
| 24 |
+
✅ 265GB RAM - Excellent
|
| 25 |
+
✅ Linux (Ubuntu) - Supported
|
| 26 |
+
✅ Ollama already installed at `/usr/local/bin/ollama`
|
| 27 |
+
|
| 28 |
+
## Installation Steps
|
| 29 |
+
|
| 30 |
+
### 1. Verify Ollama Installation
|
| 31 |
+
|
| 32 |
+
```bash
|
| 33 |
+
# Check if Ollama is installed
|
| 34 |
+
which ollama
|
| 35 |
+
# Should output: /usr/local/bin/ollama
|
| 36 |
+
|
| 37 |
+
# Check Ollama version
|
| 38 |
+
ollama --version
|
| 39 |
+
```
|
| 40 |
+
|
| 41 |
+
If not installed, install with:
|
| 42 |
+
```bash
|
| 43 |
+
curl -fsSL https://ollama.com/install.sh | sh
|
| 44 |
+
```
|
| 45 |
+
|
| 46 |
+
### 2. Pull Qwen-2.5-32B-Instruct Model
|
| 47 |
+
|
| 48 |
+
```bash
|
| 49 |
+
# This will download ~20GB
|
| 50 |
+
ollama pull qwen2.5:32b-instruct
|
| 51 |
+
|
| 52 |
+
# Alternative: Use the base model (not instruct-tuned)
|
| 53 |
+
# ollama pull qwen2.5:32b
|
| 54 |
+
```
|
| 55 |
+
|
| 56 |
+
**Download time**: ~10-30 minutes depending on your internet speed.
|
| 57 |
+
|
| 58 |
+
**Model cache location**: By default, models are cached at:
|
| 59 |
+
- Linux: `~/.ollama/models/`
|
| 60 |
+
|
| 61 |
+
To use custom cache location (e.g., `data/models/`):
|
| 62 |
+
```bash
|
| 63 |
+
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
|
| 64 |
+
ollama pull qwen2.5:32b-instruct
|
| 65 |
+
```
|
| 66 |
+
|
| 67 |
+
### 3. Verify Model is Ready
|
| 68 |
+
|
| 69 |
+
```bash
|
| 70 |
+
# List all installed models
|
| 71 |
+
ollama list
|
| 72 |
+
|
| 73 |
+
# Test the model
|
| 74 |
+
ollama run qwen2.5:32b-instruct "Hello, who are you?"
|
| 75 |
+
```
|
| 76 |
+
|
| 77 |
+
You should see a response from Qwen!
|
| 78 |
+
|
| 79 |
+
### 4. Start Ollama Server (if needed)
|
| 80 |
+
|
| 81 |
+
Ollama runs as a background service by default. If you need to start it manually:
|
| 82 |
+
|
| 83 |
+
```bash
|
| 84 |
+
# Start Ollama server
|
| 85 |
+
ollama serve
|
| 86 |
+
|
| 87 |
+
# Or run in background
|
| 88 |
+
nohup ollama serve > /dev/null 2>&1 &
|
| 89 |
+
```
|
| 90 |
+
|
| 91 |
+
## Using Qwen-2.5-32B in the Notebook
|
| 92 |
+
|
| 93 |
+
### Cell 20: Qwen-2.5-32B Local Annotation
|
| 94 |
+
|
| 95 |
+
The notebook cell handles everything automatically:
|
| 96 |
+
|
| 97 |
+
1. **Checks Ollama installation**
|
| 98 |
+
2. **Verifies model availability**
|
| 99 |
+
3. **Runs inference locally**
|
| 100 |
+
4. **Saves progress every 10 rows**
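
Steps 1 and 2 can be approximated with a small check against Ollama's HTTP API. The sketch below assumes the default address `http://localhost:11434` and uses the `GET /api/tags` endpoint, which lists locally installed models; the helper names are hypothetical and Cell 20 may perform its checks differently.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # default Ollama address (assumed unchanged)


def installed_models(tags_payload: dict) -> list:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def model_is_available(model_name: str, host: str = OLLAMA_HOST) -> bool:
    """Return True if the Ollama server answers and lists model_name."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=5) as resp:
            payload = json.load(resp)
    except OSError:
        # urllib.error.URLError subclasses OSError: server not reachable.
        return False
    return model_name in installed_models(payload)
```

Calling `model_is_available("qwen2.5:32b-instruct")` before a long run gives an early failure instead of an error on the first row.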
|
| 101 |
+
|
| 102 |
+
### Configuration
|
| 103 |
+
|
| 104 |
+
```python
|
| 105 |
+
# In Cell 20
|
| 106 |
+
TEST_MODE = True # Start with small test
|
| 107 |
+
TEST_SIZE = 10 # Test on 10 samples first
|
| 108 |
+
MAX_ROWS = 20000 # Full dataset size
|
| 109 |
+
SAVE_INTERVAL = 10 # Save every 10 rows
|
| 110 |
+
|
| 111 |
+
MODEL_NAME = "qwen2.5:32b-instruct" # Model to use
|
| 112 |
+
OLLAMA_HOST = "http://localhost:11434" # Default Ollama port
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
### Running the Pipeline
|
| 116 |
+
|
| 117 |
+
1. **Test run first** (recommended):
|
| 118 |
+
```python
|
| 119 |
+
TEST_MODE = True
|
| 120 |
+
TEST_SIZE = 10
|
| 121 |
+
```
|
| 122 |
+
Run Cell 20 to test on 10 samples (~2-3 minutes)
|
| 123 |
+
|
| 124 |
+
2. **Check results**:
|
| 125 |
+
```python
|
| 126 |
+
# Output saved to:
|
| 127 |
+
# data/CSV/qwen_local_annotated_POI_test.csv
|
| 128 |
+
```
|
| 129 |
+
|
| 130 |
+
3. **Full run**:
|
| 131 |
+
```python
|
| 132 |
+
TEST_MODE = False
|
| 133 |
+
MAX_ROWS = 20000 # or None for all rows
|
| 134 |
+
```
|
| 135 |
+
Run Cell 20 for the full dataset (~50-100 hours for 10k samples on A100)
|
| 136 |
+
|
| 137 |
+
### Performance Expectations
|
| 138 |
+
|
| 139 |
+
On NVIDIA A100 80GB:
|
| 140 |
+
- **Speed**: 5-10 tokens/second
|
| 141 |
+
- **Throughput**: 100-200 samples/hour (depends on prompt length)
|
| 142 |
+
- **Memory**: ~22-25GB VRAM during inference
|
| 143 |
+
- **Time for 10k samples**: ~50-100 hours (can run overnight/over weekend)
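
The time estimate follows directly from the throughput figure. A small sketch of the arithmetic, with function names chosen here for illustration:

```python
def samples_per_hour(seconds_per_sample: float) -> float:
    """Convert a per-sample latency (as shown by the progress bar) to throughput."""
    return 3600 / seconds_per_sample


def estimate_hours(n_samples: int, rate_per_hour: float) -> float:
    """Wall-clock hours needed to process n_samples at a given throughput."""
    return n_samples / rate_per_hour


for rate in (100, 200):
    print(f"10,000 samples at {rate}/hour: {estimate_hours(10_000, rate):.0f} hours")

# A progress bar reading 15.0 s/it corresponds to 240 samples/hour.
print(f"15 s/sample = {samples_per_hour(15):.0f} samples/hour")
```

Measuring seconds per sample on the 10-sample test run and plugging it in gives a better estimate than the generic range.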
|
| 144 |
+
|
| 145 |
+
### Monitoring
|
| 146 |
+
|
| 147 |
+
The cell shows progress updates:
|
| 148 |
+
```
|
| 149 |
+
Qwen Local: 100%|██████████| 10/10 [02:30<00:00, 15.0s/it]
|
| 150 |
+
✅ Saved after 10 rows (~240.0 samples/hour)
|
| 151 |
+
|
| 152 |
+
✅ Done! Results: data/CSV/qwen_local_annotated_POI_test.csv
|
| 153 |
+
Total time: 2.5 minutes
|
| 154 |
+
Average speed: 240.0 samples/hour
|
| 155 |
+
```
|
| 156 |
+
|
| 157 |
+
## Troubleshooting
|
| 158 |
+
|
| 159 |
+
### Model Not Found
|
| 160 |
+
|
| 161 |
+
```bash
|
| 162 |
+
# Check if model is installed
|
| 163 |
+
ollama list
|
| 164 |
+
|
| 165 |
+
# If not listed, pull it
|
| 166 |
+
ollama pull qwen2.5:32b-instruct
|
| 167 |
+
```
|
| 168 |
+
|
| 169 |
+
### Ollama Server Not Running
|
| 170 |
+
|
| 171 |
+
```bash
|
| 172 |
+
# Check if Ollama is running
|
| 173 |
+
ps aux | grep ollama
|
| 174 |
+
|
| 175 |
+
# If not running, start it
|
| 176 |
+
ollama serve
|
| 177 |
+
```
|
| 178 |
+
|
| 179 |
+
### GPU Not Detected
|
| 180 |
+
|
| 181 |
+
```bash
|
| 182 |
+
# Check NVIDIA GPU
|
| 183 |
+
nvidia-smi
|
| 184 |
+
|
| 185 |
+
# Check CUDA
|
| 186 |
+
nvcc --version
|
| 187 |
+
|
| 188 |
+
# Ollama should automatically detect GPU
|
| 189 |
+
# If not, check Ollama logs
|
| 190 |
+
journalctl -u ollama
|
| 191 |
+
```
|
| 192 |
+
|
| 193 |
+
### Out of Memory (OOM)
|
| 194 |
+
|
| 195 |
+
If you get OOM errors:
|
| 196 |
+
|
| 197 |
+
1. **Check VRAM usage**:
|
| 198 |
+
```bash
|
| 199 |
+
watch -n 1 nvidia-smi
|
| 200 |
+
```
|
| 201 |
+
|
| 202 |
+
2. **Try smaller batch size** (not applicable here - we process 1 at a time)
|
| 203 |
+
|
| 204 |
+
3. **Try quantized version** (smaller model):
|
| 205 |
+
```bash
|
| 206 |
+
# 4-bit quantized version (roughly 20GB VRAM)
|
| 207 |
+
ollama pull qwen2.5:32b-instruct-q4_0
|
| 208 |
+
|
| 209 |
+
# Update MODEL_NAME in notebook
|
| 210 |
+
MODEL_NAME = "qwen2.5:32b-instruct-q4_0"
|
| 211 |
+
```
|
| 212 |
+
|
| 213 |
+
### Slow Inference
|
| 214 |
+
|
| 215 |
+
If inference is very slow (<1 token/sec):
|
| 216 |
+
|
| 217 |
+
1. **Check GPU utilization**:
|
| 218 |
+
```bash
|
| 219 |
+
nvidia-smi
|
| 220 |
+
```
|
| 221 |
+
GPU should show ~90%+ utilization during inference
|
| 222 |
+
|
| 223 |
+
2. **Check CPU vs GPU**:
|
| 224 |
+
Ollama might be using CPU instead of GPU
|
| 225 |
+
```bash
|
| 226 |
+
# Make a specific NVIDIA GPU visible to Ollama (GPU 0 here)
|
| 227 |
+
CUDA_VISIBLE_DEVICES=0 ollama serve
|
| 228 |
+
```
|
| 229 |
+
|
| 230 |
+
## Model Variants
|
| 231 |
+
|
| 232 |
+
Ollama provides several Qwen-2.5 variants:
|
| 233 |
+
|
| 234 |
+
| Model | Size | VRAM | Speed | Quality |
|
| 235 |
+
|-------|------|------|-------|---------|
|
| 236 |
+
| `qwen2.5:32b-instruct` | 32B | ~25GB | Medium | Best |
|
| 237 |
+
| `qwen2.5:32b-instruct-q4_0` | 32B (4-bit) | ~20GB | Fast | Good |
|
| 238 |
+
| `qwen2.5:14b-instruct` | 14B | ~10GB | Fast | Good |
|
| 239 |
+
| `qwen2.5:7b-instruct` | 7B | ~5GB | Very Fast | OK |
|
| 240 |
+
|
| 241 |
+
For your A100 80GB, **`qwen2.5:32b-instruct`** is recommended (best quality, no VRAM issues).
|
| 242 |
+
|
| 243 |
+
## Custom Model Cache Location
|
| 244 |
+
|
| 245 |
+
To store models in `data/models/` directory:
|
| 246 |
+
|
| 247 |
+
```bash
|
| 248 |
+
# Set environment variable
|
| 249 |
+
export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"
|
| 250 |
+
|
| 251 |
+
# Add to ~/.bashrc for persistence
|
| 252 |
+
echo 'export OLLAMA_MODELS="/home/lauhp/000_PHD/000_010_PUBLICATION/CODE/pm-paper/data/models"' >> ~/.bashrc
|
| 253 |
+
|
| 254 |
+
# Pull model (will download to data/models/)
|
| 255 |
+
ollama pull qwen2.5:32b-instruct
|
| 256 |
+
|
| 257 |
+
# Verify
|
| 258 |
+
ls -lh $OLLAMA_MODELS/
|
| 259 |
+
```
|
| 260 |
+
|
| 261 |
+
## Comparing Results
|
| 262 |
+
|
| 263 |
+
After running both API and local versions, compare results:
|
| 264 |
+
|
| 265 |
+
```python
|
| 266 |
+
import pandas as pd
|
| 267 |
+
|
| 268 |
+
# Load results
|
| 269 |
+
qwen_api = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
|
| 270 |
+
qwen_local = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')
|
| 271 |
+
|
| 272 |
+
# Compare professions
|
| 273 |
+
print("API professions:", qwen_api['profession_llm'].value_counts().head())
|
| 274 |
+
print("Local professions:", qwen_local['profession_llm'].value_counts().head())
|
| 275 |
+
|
| 276 |
+
# Check agreement
|
| 277 |
+
agreement = (qwen_api['profession_llm'] == qwen_local['profession_llm']).mean()
|
| 278 |
+
print(f"Agreement rate: {agreement*100:.1f}%")
|
| 279 |
+
```
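
The element-wise comparison above assumes both files contain the same rows in the same order. If the two runs covered different subsets, it may be safer to match rows by name first. The sketch below works on plain dictionaries that map a name to a label; such dictionaries could be built with `dict(zip(df['real_name'], df['profession_llm']))`. The function name and the example labels are hypothetical.

```python
def agreement_rate(labels_a: dict, labels_b: dict) -> float:
    """Share of names present in both mappings that received the same label."""
    shared = labels_a.keys() & labels_b.keys()
    if not shared:
        raise ValueError("no names in common")
    matches = sum(labels_a[name] == labels_b[name] for name in shared)
    return matches / len(shared)


# Illustrative labels only; not real annotation results.
api = {"Irene": "singer", "Emma Watson": "actor", "Akira": "model"}
local = {"Irene": "singer", "Emma Watson": "model"}
print(f"Agreement rate: {agreement_rate(api, local) * 100:.1f}%")
```

Names that appear in only one file are ignored here, so it is worth reporting `len(shared)` alongside the rate.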
|
| 280 |
+
|
| 281 |
+
## Cost Comparison (10,000 samples)
|
| 282 |
+
|
| 283 |
+
| Method | Cost | Time | Privacy |
|
| 284 |
+
|--------|------|------|---------|
|
| 285 |
+
| **Qwen Local (A100)** | **$0** | ~50-100 hours | ✅ Full |
|
| 286 |
+
| Qwen API (Alibaba) | ~$10-20 | ~5-10 hours | ⚠️ Data sent to Alibaba |
|
| 287 |
+
| Llama API (Together) | ~$5-10 | ~5-10 hours | ⚠️ Data sent to Together AI |
|
| 288 |
+
| Deepseek API | ~$1-2 | ~5-10 hours | ⚠️ Data sent to Deepseek |
|
| 289 |
+
|
| 290 |
+
**Recommendation**:
|
| 291 |
+
- For **small tests** (<100 samples): Use API (faster)
|
| 292 |
+
- For **large datasets** (>1000 samples): Use local (free, private)
|
| 293 |
+
- For **research papers**: Use local to avoid data privacy concerns
|
| 294 |
+
|
| 295 |
+
## Advanced: Parallel Processing
|
| 296 |
+
|
| 297 |
+
For faster processing on a multi-GPU setup:
|
| 298 |
+
|
| 299 |
+
```python
|
| 300 |
+
# Not implemented yet, but possible with:
|
| 301 |
+
# - Multiple Ollama instances on different GPUs
|
| 302 |
+
# - Ray or Dask for parallel processing
|
| 303 |
+
# - ~4x speedup with 4 GPUs
|
| 304 |
+
```
|
| 305 |
+
|
| 306 |
+
## Summary
|
| 307 |
+
|
| 308 |
+
✅ **Ollama** already installed
|
| 309 |
+
✅ **A100 80GB** GPU - perfect for Qwen-2.5-32B
|
| 310 |
+
✅ **Free inference** - no API costs
|
| 311 |
+
✅ **Privacy** - data stays local
|
| 312 |
+
|
| 313 |
+
**Next steps:**
|
| 314 |
+
1. Pull model: `ollama pull qwen2.5:32b-instruct`
|
| 315 |
+
2. Test with Cell 20: `TEST_MODE = True`, `TEST_SIZE = 10`
|
| 316 |
+
3. Run full dataset: `TEST_MODE = False`
|
| 317 |
+
|
| 318 |
+
**Estimated time for 10,000 samples**: ~50-100 hours
|
| 319 |
+
**Cost**: $0
|
| 320 |
+
|
| 321 |
+
Good luck! 🚀
|
md/SPACY_NER_EXPLANATION.md
ADDED
|
@@ -0,0 +1,316 @@
|
| 1 |
+
# spaCy NER Implementation
|
| 2 |
+
|
| 3 |
+
## Why spaCy for NER?
|
| 4 |
+
|
| 5 |
+
Using **spaCy's Named Entity Recognition (NER)** is significantly better than regex-based cleaning because:
|
| 6 |
+
|
| 7 |
+
1. **Intelligent entity extraction**: Recognizes PERSON entities using machine learning
|
| 8 |
+
2. **Context-aware**: Understands sentence structure and context
|
| 9 |
+
3. **Robust**: Handles various name formats (first, last, full, stage names)
|
| 10 |
+
4. **Language support**: spaCy offers models for many languages; this pipeline uses the English `en_core_web_sm` model
|
| 11 |
+
5. **Industry standard**: Used in production NLP applications
|
| 12 |
+
|
| 13 |
+
## How It Works
|
| 14 |
+
|
| 15 |
+
### Pipeline Overview
|
| 16 |
+
|
| 17 |
+
```
|
| 18 |
+
Original Name
|
| 19 |
+
↓
|
| 20 |
+
1. Translate Leetspeak (4→a, 3→e, 1→i)
|
| 21 |
+
↓
|
| 22 |
+
2. Remove Noise (emoji, LoRA terms, versions)
|
| 23 |
+
↓
|
| 24 |
+
3. spaCy NER - Extract PERSON entities
|
| 25 |
+
↓
|
| 26 |
+
4. Fallback to capitalized words if needed
|
| 27 |
+
↓
|
| 28 |
+
Cleaned Name
|
| 29 |
+
```
|
| 30 |
+
|
| 31 |
+
### Detailed Steps
|
| 32 |
+
|
| 33 |
+
#### Step 1: Leetspeak Translation
|
| 34 |
+
```python
|
| 35 |
+
"4kira LoRA v2" → "akira LoRA v2"
|
| 36 |
+
"1rene Model" → "irene Model"
|
| 37 |
+
"3mma Watson" → "emma Watson"
|
| 38 |
+
```
|
| 39 |
+
|
| 40 |
+
#### Step 2: Noise Removal
|
| 41 |
+
```python
|
| 42 |
+
"akira LoRA v2" → "akira"
|
| 43 |
+
"irene Model" → "irene"
|
| 44 |
+
"emma Watson" → "emma Watson"
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
#### Step 3: spaCy NER
|
| 48 |
+
```python
|
| 49 |
+
nlp("akira")
|
| 50 |
+
# Entities: [("akira", PERSON)]
|
| 51 |
+
# Result: "akira"
|
| 52 |
+
|
| 53 |
+
nlp("emma Watson")
|
| 54 |
+
# Entities: [("emma Watson", PERSON)]
|
| 55 |
+
# Result: "emma Watson"
|
| 56 |
+
```
|
| 57 |
+
|
| 58 |
+
#### Step 4: Fallback
|
| 59 |
+
If spaCy doesn't find a PERSON entity:
|
| 60 |
+
- Extract capitalized words (likely names)
|
| 61 |
+
- Or return cleaned text as-is
|
| 62 |
+
|
| 63 |
+
## Examples
|
| 64 |
+
|
| 65 |
+
### Case 1: Simple Name
|
| 66 |
+
```
|
| 67 |
+
Input: "IU"
|
| 68 |
+
Output: "IU"
|
| 69 |
+
|
| 70 |
+
Process:
|
| 71 |
+
- Preprocess: "IU" (no noise)
|
| 72 |
+
- spaCy NER: Recognizes "IU" as PERSON
|
| 73 |
+
- Result: "IU"
|
| 74 |
+
```
|
| 75 |
+
|
| 76 |
+
### Case 2: Name with LoRA Terms
|
| 77 |
+
```
|
| 78 |
+
Input: "Scarlett Johansson「LoRa」"
|
| 79 |
+
Output: "Scarlett Johansson"
|
| 80 |
+
|
| 81 |
+
Process:
|
| 82 |
+
- Preprocess: "Scarlett Johansson" (removed 「LoRa」)
|
| 83 |
+
- spaCy NER: Recognizes "Scarlett Johansson" as PERSON
|
| 84 |
+
- Result: "Scarlett Johansson"
|
| 85 |
+
```
|
| 86 |
+
|
| 87 |
+
### Case 3: Leetspeak Name
|
| 88 |
+
```
|
| 89 |
+
Input: "4kira Anime Character v1"
|
| 90 |
+
Output: "akira"
|
| 91 |
+
|
| 92 |
+
Process:
|
| 93 |
+
- Leetspeak: "akira Anime Character v1"
|
| 94 |
+
- Preprocess: "akira Anime Character"
|
| 95 |
+
- spaCy NER: Recognizes "akira" as PERSON
|
| 96 |
+
- Result: "akira"
|
| 97 |
+
```
|
| 98 |
+
|
| 99 |
+
### Case 4: Complex Format
|
| 100 |
+
```
|
| 101 |
+
Input: "Gakki | Aragaki Yui | 新垣結衣"
|
| 102 |
+
Output: "Gakki"
|
| 103 |
+
|
| 104 |
+
Process:
|
| 105 |
+
- Preprocess: "Gakki" (kept first part before |)
|
| 106 |
+
- spaCy NER: Recognizes "Gakki" as PERSON
|
| 107 |
+
- Result: "Gakki"
|
| 108 |
+
```
|
| 109 |
+
|
| 110 |
+
### Case 5: With Metadata
|
| 111 |
+
```
|
| 112 |
+
Input: "Emma Watson (JG) v3.5"
|
| 113 |
+
Output: "Emma Watson"
|
| 114 |
+
|
| 115 |
+
Process:
|
| 116 |
+
- Preprocess: "Emma Watson" (removed (JG) and v3.5)
|
| 117 |
+
- spaCy NER: Recognizes "Emma Watson" as PERSON
|
| 118 |
+
- Result: "Emma Watson"
|
| 119 |
+
```
|
| 120 |
+
|
| 121 |
+
## Advantages Over Regex-Only
|
| 122 |
+
|
| 123 |
+
### Old Approach (Regex Only)
|
| 124 |
+
```python
|
| 125 |
+
# Just remove noise and hope for the best
|
| 126 |
+
name = remove_noise(name)
|
| 127 |
+
name = name.strip()
|
| 128 |
+
# Result: May include non-name words
|
| 129 |
+
```
|
| 130 |
+
|
| 131 |
+
Problems:
|
| 132 |
+
- Can't distinguish names from other capitalized words
|
| 133 |
+
- May include words like "Model", "Anime", "Character"
|
| 134 |
+
- No context awareness
|
| 135 |
+
- Language-dependent regex patterns needed
|
| 136 |
+
|
| 137 |
+
### New Approach (spaCy NER)
|
| 138 |
+
```python
|
| 139 |
+
# Intelligent entity extraction
|
| 140 |
+
preprocessed = remove_noise(name)
|
| 141 |
+
doc = nlp(preprocessed)
|
| 142 |
+
person_names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
|
| 143 |
+
# Result: Only actual person names
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
Benefits:
|
| 147 |
+
- ✅ Identifies actual person entities
|
| 148 |
+
- ✅ Ignores non-person words
|
| 149 |
+
- ✅ Context-aware (understands "Emma Watson" is one entity)
|
| 150 |
+
- ✅ Multi-language support (with language-specific or multilingual models)
|
| 151 |
+
- ✅ Handles various name formats
|
| 152 |
+
|
| 153 |
+
## Comparison Examples
|
| 154 |
+
|
| 155 |
+
| Input | Regex Only | spaCy NER |
|
| 156 |
+
|-------|------------|-----------|
|
| 157 |
+
| `"Emma Watson Model"` | `"Emma Watson Model"` ❌ | `"Emma Watson"` ✅ |
|
| 158 |
+
| `"Anime Character Levi"` | `"Anime Character Levi"` ❌ | `"Levi"` ✅ |
|
| 159 |
+
| `"Taylor Swift v2"` | `"Taylor Swift"` ✅ | `"Taylor Swift"` ✅ |
|
| 160 |
+
| `"K4te Middleton"` | `"K4te Middleton"` ❌ | `"Kate Middleton"` ✅ |
|
| 161 |
+
| `"Celebrity IU"` | `"Celebrity IU"` ❌ | `"IU"` ✅ |
|
| 162 |
+
|
| 163 |
+
## spaCy Model Information
|
| 164 |
+
|
| 165 |
+
### Model Used
|
| 166 |
+
- **Name**: `en_core_web_sm`
|
| 167 |
+
- **Language**: English (but works reasonably with romanized names)
|
| 168 |
+
- **Size**: ~13 MB
|
| 169 |
+
- **Entities**: Recognizes PERSON, ORG, GPE, etc.
|
| 170 |
+
|
| 171 |
+
### Installation
|
| 172 |
+
```bash
|
| 173 |
+
# Install spaCy
|
| 174 |
+
pip install spacy
|
| 175 |
+
|
| 176 |
+
# Download model
|
| 177 |
+
python -m spacy download en_core_web_sm
|
| 178 |
+
```
|
| 179 |
+
|
| 180 |
+
The notebook automatically downloads the model if not found.
|
| 181 |
+
|
| 182 |
+
### Performance
|
| 183 |
+
- **Speed**: ~1000-5000 docs/second
|
| 184 |
+
- **Accuracy**: High for common names
|
| 185 |
+
- **Memory**: Low (~100MB loaded)
|
| 186 |
+
|
| 187 |
+
## Fallback Strategy
|
| 188 |
+
|
| 189 |
+
If spaCy doesn't recognize a PERSON entity:
|
| 190 |
+
|
| 191 |
+
1. **Extract capitalized words**:
|
| 192 |
+
```python
|
| 193 |
+
"unknown name here" → ["unknown"]
|
| 194 |
+
```
|
| 195 |
+
|
| 196 |
+
2. **Return first few capitalized words**:
|
| 197 |
+
```python
|
| 198 |
+
"Celebrity Model Actor" → "Celebrity Model Actor"
|
| 199 |
+
```
|
| 200 |
+
|
| 201 |
+
3. **Last resort**: Return cleaned text as-is
|
| 202 |
+
|
| 203 |
+
This ensures we always get something, even for:
|
| 204 |
+
- Uncommon/rare names
|
| 205 |
+
- Nicknames
|
| 206 |
+
- Non-English names
|
| 207 |
+
- Stage names
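
The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `extract_person_name` is a hypothetical helper, taking the first PERSON entity is an assumption, and `nlp` is expected to be a loaded spaCy pipeline (for example the result of `spacy.load("en_core_web_sm")`), although any callable returning an object with `.ents` works.

```python
import re

def extract_person_name(cleaned: str, nlp) -> str:
    """Return a PERSON entity if one is found, otherwise fall back.

    `nlp` is assumed to be a loaded spaCy pipeline; this helper name and
    the choice of the first PERSON entity are illustrative assumptions.
    """
    doc = nlp(cleaned)
    persons = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
    if persons:
        return persons[0]

    # Fallback 1: keep the first few capitalized words (likely names).
    capitalized = re.findall(r"\b[A-Z][\w'-]*", cleaned)
    if capitalized:
        return " ".join(capitalized[:3])

    # Fallback 2 (last resort): return the cleaned text as-is.
    return cleaned.strip()

# Usage with spaCy (requires the model to be installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   extract_person_name("Emma Watson", nlp)
```

Passing `nlp` in as a parameter keeps the fallback logic testable without loading a model.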
|
| 208 |
+
|
| 209 |
+
## Testing
|
| 210 |
+
|
| 211 |
+
### How to Verify spaCy is Working
|
| 212 |
+
|
| 213 |
+
Run Cell 5 and check the output:
|
| 214 |
+
|
| 215 |
+
```
|
| 216 |
+
✅ spaCy model loaded: en_core_web_sm
|
| 217 |
+
|
| 218 |
+
📊 Name cleaning examples (with spaCy NER):
|
| 219 |
+
===================================================================================================
|
| 220 |
+
Original Name | Cleaned Name
|
| 221 |
+
===================================================================================================
|
| 222 |
+
Scarlett Johansson「LoRa」 | Scarlett Johansson
|
| 223 |
+
Emma Watson (JG) | Emma Watson
|
| 224 |
+
IU | IU
|
| 225 |
+
Belle Delphine | Belle Delphine
|
| 226 |
+
...
|
| 227 |
+
```
|
| 228 |
+
|
| 229 |
+
### Key Indicators
|
| 230 |
+
|
| 231 |
+
✅ **Good signs**:
|
| 232 |
+
- Person names cleanly extracted
|
| 233 |
+
- No extra words like "Model", "LoRA", "Celebrity"
|
| 234 |
+
- Multi-word names kept together (e.g., "Emma Watson" not just "Emma")
|
| 235 |
+
|
| 236 |
+
❌ **Issues to watch**:
|
| 237 |
+
- Empty results (extend the fallback logic)
|
| 238 |
+
- Partial names (e.g., only first name)
|
| 239 |
+
- Non-names included (tune preprocessing)
|
| 240 |
+
|
| 241 |
+
## Customization
|
| 242 |
+
|
| 243 |
+
### Add More Languages
|
| 244 |
+
|
| 245 |
+
For better support of non-English names:
|
| 246 |
+
|
| 247 |
+
```python
|
| 248 |
+
# Download multilingual model
|
| 249 |
+
# Run in a shell, not in Python: python -m spacy download xx_ent_wiki_sm
|
| 250 |
+
|
| 251 |
+
# Use in code
|
| 252 |
+
nlp = spacy.load("xx_ent_wiki_sm")
|
| 253 |
+
```
|
| 254 |
+
|
| 255 |
+
### Adjust Entity Extraction
|
| 256 |
+
|
| 257 |
+
To extract other entities:
|
| 258 |
+
|
| 259 |
+
```python
|
| 260 |
+
# Extract organizations too
|
| 261 |
+
entities = [ent.text for ent in doc.ents
|
| 262 |
+
if ent.label_ in ["PERSON", "ORG"]]
|
| 263 |
+
```
|
| 264 |
+
|
| 265 |
+
### Custom Entity Rules
|
| 266 |
+
|
| 267 |
+
Add custom patterns for names spaCy might miss:
|
| 268 |
+
|
| 269 |
+
```python
|
| 270 |
+
from spacy.matcher import Matcher
|
| 271 |
+
|
| 272 |
+
matcher = Matcher(nlp.vocab)
|
| 273 |
+
# Add patterns for specific name formats
|
| 274 |
+
```
|
| 275 |
+
|
| 276 |
+
## Benefits for This Project
|
| 277 |
+
|
| 278 |
+
### Better Person Identification
|
| 279 |
+
|
| 280 |
+
With cleaner names:
|
| 281 |
+
- LLMs receive recognizable names
|
| 282 |
+
- "Emma Watson" instead of "Emma Watson Model LoRA v3"
|
| 283 |
+
- Better identification accuracy
|
| 284 |
+
|
| 285 |
+
### Reduced Ambiguity
|
| 286 |
+
|
| 287 |
+
spaCy helps distinguish:
|
| 288 |
+
- Person names vs. descriptive words
|
| 289 |
+
- "Celebrity IU" → "IU" (person)
|
| 290 |
+
- "Model Bella" → "Bella" (person)
|
| 291 |
+
|
| 292 |
+
### Improved Context for LLMs
|
| 293 |
+
|
| 294 |
+
Cleaner input = better prompts:
|
| 295 |
+
```
|
| 296 |
+
Before: "Given 'Celebrity Model Emma Watson LoRA v2' (actress)..."
|
| 297 |
+
After: "Given 'Emma Watson' (actress)..."
|
| 298 |
+
```
|
| 299 |
+
|
| 300 |
+
The LLM can now focus on identifying the person, not parsing the noise.
|
| 301 |
+
|
| 302 |
+
## Summary
|
| 303 |
+
|
| 304 |
+
✅ **spaCy NER** provides intelligent, context-aware name extraction
|
| 305 |
+
✅ **Better than regex** for handling complex name formats
|
| 306 |
+
✅ **Fallback strategy** ensures we always get a result
|
| 307 |
+
✅ **Industry standard** tool used in production NLP
|
| 308 |
+
✅ **Easy to use** with minimal code
|
| 309 |
+
|
| 310 |
+
The combination of:
|
| 311 |
+
1. Leetspeak translation
|
| 312 |
+
2. Noise removal
|
| 313 |
+
3. spaCy NER
|
| 314 |
+
4. Smart fallbacks
|
| 315 |
+
|
| 316 |
+
...results in clean, accurate person names ready for LLM annotation!
|
md/TESTING_INSTRUCTIONS.md
ADDED
|
@@ -0,0 +1,148 @@
|
| 1 |
+
# Quick Testing Instructions
|
| 2 |
+
|
| 3 |
+
## Start Here! 🚀
|
| 4 |
+
|
| 5 |
+
You mentioned you have Deepseek credits, so **test with Deepseek first** before trying the other LLMs.
|
| 6 |
+
|
| 7 |
+
## Step-by-Step Testing
|
| 8 |
+
|
| 9 |
+
### 1. Make sure your Deepseek API key is in place
|
| 10 |
+
|
| 11 |
+
Check if this file exists:
|
| 12 |
+
```bash
|
| 13 |
+
cat misc/credentials/deepseek_api_key.txt
|
| 14 |
+
```
|
| 15 |
+
|
| 16 |
+
If not, create it:
|
| 17 |
+
```bash
|
| 18 |
+
echo "your-deepseek-api-key" > misc/credentials/deepseek_api_key.txt
|
| 19 |
+
```
|
| 20 |
+
|
| 21 |
+
### 2. Open the notebook
|
| 22 |
+
|
| 23 |
+
```bash
|
| 24 |
+
jupyter notebook jupyter_notebooks/Section_2-3-4_Figure_8_deepfake_adapters.ipynb
|
| 25 |
+
```
|
| 26 |
+
|
| 27 |
+
### 3. Run the cells in order
|
| 28 |
+
|
| 29 |
+
1. **Cells 0-4**: Introduction and setup (just markdown, no execution needed)
|
| 30 |
+
2. **Cell 5**: NER & Name Cleaning (processes `real_person_adapters.csv`)
|
| 31 |
+
3. **Cell 7**: Country/Nationality Mapping
|
| 32 |
+
4. **Cell 10**: 🌟 **DEEPSEEK ANNOTATION** (TEST THIS FIRST!)
|
| 33 |
+
- Default: `TEST_MODE = True` (10 samples)
|
| 34 |
+
- Will create: `data/CSV/deepseek_annotated_POI_test.csv`
|
| 35 |
+
5. **Cell 12**: Qwen/Llama/Mistral (run later after Deepseek works)
|
| 36 |
+
|
| 37 |
+
### 4. Review Deepseek Results
|
| 38 |
+
|
| 39 |
+
After Cell 10 completes, check:
|
| 40 |
+
- Console output shows summary statistics
|
| 41 |
+
- Output file: `data/CSV/deepseek_annotated_POI_test.csv`
|
| 42 |
+
|
| 43 |
+
Example output should look like:
|
| 44 |
+
```
|
| 45 |
+
✅ Progress saved after 10 rows
|
| 46 |
+
✅ Done! Final results saved to data/CSV/deepseek_annotated_POI_test.csv
|
| 47 |
+
|
| 48 |
+
=== Summary Statistics ===
|
| 49 |
+
Total processed: 10
|
| 50 |
+
|
| 51 |
+
Gender distribution:
|
| 52 |
+
Female 8
|
| 53 |
+
Male 2
|
| 54 |
+
...
|
| 55 |
+
```
|
| 56 |
+
|
| 57 |
+
### 5. If Deepseek Works Well
|
| 58 |
+
|
| 59 |
+
Once you're satisfied with the Deepseek results:
|
| 60 |
+
|
| 61 |
+
**Option A: Process full dataset with Deepseek**
|
| 62 |
+
```python
|
| 63 |
+
# In Cell 10, change:
|
| 64 |
+
TEST_MODE = False
|
| 65 |
+
```
|
| 66 |
+
|
| 67 |
+
**Option B: Try other LLMs for comparison**
|
| 68 |
+
1. Set up API keys for Qwen/Llama/Mistral (see `misc/credentials/README.md`)
|
| 69 |
+
2. Run Cell 12 with your chosen LLM:
|
| 70 |
+
```python
|
| 71 |
+
SELECTED_LLM = 'qwen' # or 'llama' or 'mistral'
|
| 72 |
+
TEST_MODE = True # Test first!
|
| 73 |
+
```
|
| 74 |
+
|
| 75 |
+
## Expected Cost (Deepseek)
|
| 76 |
+
|
| 77 |
+
- **10 samples** (test): ~$0.01 or less
|
| 78 |
+
- **1,000 entries**: ~$0.10-0.20
|
| 79 |
+
- **10,000 entries**: ~$1-2
|
| 80 |
+
|
| 81 |
+
Much cheaper than the other options, making it perfect for testing!
|
| 82 |
+
|
| 83 |
+
## Troubleshooting
|
| 84 |
+
|
| 85 |
+
### "deepseek_api_key.txt not found"
|
| 86 |
+
```bash
|
| 87 |
+
# Create the file with your key
|
| 88 |
+
echo "your-api-key" > misc/credentials/deepseek_api_key.txt
|
| 89 |
+
```
|
| 90 |
+
|
| 91 |
+
### "File does not exist: real_person_adapters.csv"
|
| 92 |
+
Make sure the input file exists:
|
| 93 |
+
```bash
|
| 94 |
+
ls -lh data/CSV/real_person_adapters.csv
|
| 95 |
+
```
|
| 96 |
+
|
| 97 |
+
### API Rate Limiting
|
| 98 |
+
The code includes automatic rate limiting (`time.sleep(1)` between requests). If you still get rate-limited:
|
| 99 |
+
- Increase the sleep time in Cell 10: change `time.sleep(1)` to `time.sleep(2)`
|
| 100 |
+
|
| 101 |
+
### Pipeline Interrupted
|
| 102 |
+
No problem! The code saves progress every 10 rows. Just re-run the cell and it will resume from where it left off.
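
The resume behaviour can be pictured with a small sketch. It assumes the progress file simply stores the index of the next row to process; the function names (`load_start_index`, `save_index`, `run_with_checkpoints`) are hypothetical and the notebook's own implementation may differ, for example by also writing partial results to CSV.

```python
import os
import tempfile

def load_start_index(index_path: str) -> int:
    """Read the next row index to process; start at 0 if no valid file exists."""
    try:
        with open(index_path, encoding="utf-8") as fh:
            return int(fh.read().strip() or 0)
    except (FileNotFoundError, ValueError):
        return 0

def save_index(index_path: str, next_index: int) -> None:
    with open(index_path, "w", encoding="utf-8") as fh:
        fh.write(str(next_index))

def run_with_checkpoints(rows, annotate, index_path, save_every=10):
    """Process rows from the saved index onward, checkpointing every `save_every` rows."""
    start = load_start_index(index_path)
    results = []
    for i in range(start, len(rows)):
        results.append(annotate(rows[i]))
        if (i + 1) % save_every == 0:
            save_index(index_path, i + 1)
    save_index(index_path, len(rows))
    return results

# Demo: simulate an interruption at row 17, then resume.
rows = list(range(25))
index_path = os.path.join(tempfile.mkdtemp(), "query_index.txt")

def flaky(row):
    if row == 17 and not flaky.recovered:
        raise RuntimeError("simulated interruption")
    return row * 2

flaky.recovered = False

try:
    run_with_checkpoints(rows, flaky, index_path)
except RuntimeError:
    flaky.recovered = True  # pretend the API is reachable again

resumed = run_with_checkpoints(rows, flaky, index_path)
print(load_start_index(index_path), len(resumed))  # 25 15
```

With a checkpoint every 10 rows, an interruption repeats at most 9 already-processed rows on the next run; here rows 10-16 are redone.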
|
| 103 |
+
|
| 104 |
+
## What's Next?
|
| 105 |
+
|
| 106 |
+
After testing with Deepseek:
|
| 107 |
+
|
| 108 |
+
1. **If results look good**: Scale up to full dataset with Deepseek
|
| 109 |
+
2. **Compare LLMs**: Test Qwen/Llama/Mistral on the same sample to see which gives best results
|
| 110 |
+
3. **Production run**: Choose your preferred LLM and process the full dataset
|
| 111 |
+
|
| 112 |
+
## File Outputs
|
| 113 |
+
|
| 114 |
+
The pipeline creates these files:
|
| 115 |
+
|
| 116 |
+
```
|
| 117 |
+
data/CSV/
|
| 118 |
+
├── NER_POI_step01_pre_annotation.csv # After Cell 5 (name cleaning)
|
| 119 |
+
├── NER_POI_step02_annotated.csv # After Cell 7 (country mapping)
|
| 120 |
+
├── deepseek_annotated_POI_test.csv # After Cell 10 (test mode)
|
| 121 |
+
├── deepseek_annotated_POI.csv # After Cell 10 (full mode)
|
| 122 |
+
├── qwen_annotated_POI_test.csv # After Cell 12 (if using Qwen)
|
| 123 |
+
└── ...
|
| 124 |
+
|
| 125 |
+
misc/
|
| 126 |
+
├── deepseek_query_index.txt # Progress tracking
|
| 127 |
+
└── ...
|
| 128 |
+
```
|
| 129 |
+
|
| 130 |
+
## Quick Commands
|
| 131 |
+
|
| 132 |
+
```bash
|
| 133 |
+
# View first few results
|
| 134 |
+
head -20 data/CSV/deepseek_annotated_POI_test.csv
|
| 135 |
+
|
| 136 |
+
# Count lines (processed rows plus the header line, if the CSV has one)
|
| 137 |
+
wc -l data/CSV/deepseek_annotated_POI_test.csv
|
| 138 |
+
|
| 139 |
+
# Check progress
|
| 140 |
+
cat misc/deepseek_query_index.txt
|
| 141 |
+
|
| 142 |
+
# Reset progress (start from scratch)
|
| 143 |
+
rm misc/deepseek_query_index.txt
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
---
|
| 147 |
+
|
| 148 |
+
**Ready to start?** Open the notebook and run Cell 5 → Cell 7 → Cell 10! 🎉
|
md/UPDATES_AND_FIXES.md
ADDED
|
@@ -0,0 +1,235 @@
|
| 1 |
+
# Recent Updates and Fixes
|
| 2 |
+
|
| 3 |
+
## Overview
|
| 4 |
+
|
| 5 |
+
Two important fixes have been implemented based on testing feedback:
|
| 6 |
+
|
| 7 |
+
1. **Leetspeak Translation** (before NER)
|
| 8 |
+
2. **Improved Country Mapping** (check ALL tags)
|
| 9 |
+
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
## Fix 1: Leetspeak Translation
|
| 13 |
+
|
| 14 |
+
### Problem
|
| 15 |
+
Names with leetspeak (numbers replacing letters) weren't being properly cleaned:
|
| 16 |
+
- `4kira` should be `Akira`
|
| 17 |
+
- `1rene` should be `Irene`
|
| 18 |
+
- `3mma` should be `Emma`
|
| 19 |
+
|
| 20 |
+
### Solution
|
| 21 |
+
Leetspeak translation now runs **before** the other name-cleaning and NER steps in Cell 5.
|
| 22 |
+
|
| 23 |
+
### Mapping Table
|
| 24 |
+
| Leetspeak | Letter |
|
| 25 |
+
|-----------|--------|
|
| 26 |
+
| 4 | A |
|
| 27 |
+
| 3 | E |
|
| 28 |
+
| 1 | I |
|
| 29 |
+
| 0 | O |
|
| 30 |
+
| 7 | T |
|
| 31 |
+
| 5 | S |
|
| 32 |
+
| 8 | B |
|
| 33 |
+
| 9 | G |
|
| 34 |
+
| @ | A |
|
| 35 |
+
| $ | S |
|
| 36 |
+
| ! | I |
|
| 37 |
+
|
| 38 |
+
### Examples
|
| 39 |
+
```
|
| 40 |
+
4kira -> akira
|
| 41 |
+
3mma -> emma
|
| 42 |
+
1rene -> irene
|
| 43 |
+
L3vi -> Levi
|
| 44 |
+
S4sha -> Sasha
|
| 45 |
+
K4te -> Kate
|
| 46 |
+
J3ssica -> Jessica
|
| 47 |
+
```
|
| 48 |
+
|
| 49 |
+
### Implementation
|
| 50 |
+
The `translate_leetspeak()` function runs FIRST in `clean_name()`, before emoji removal and other cleaning steps. This ensures leetspeak is converted to proper letters before any other processing.
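
A minimal sketch of such a translation step, using the mapping table above, might look like this. The function body is an assumption rather than the notebook's code: in particular, skipping version-like tokens (`v2`, `v3.5`) and purely numeric tokens is a guess made so that `"4kira LoRA v2"` keeps its version suffix, as in the documented examples.

```python
import re

# Mapping from the table above, lowercased to match the documented examples.
LEET_MAP = {
    "4": "a", "3": "e", "1": "i", "0": "o", "7": "t", "5": "s",
    "8": "b", "9": "g", "@": "a", "$": "s", "!": "i",
}

# Assumption: tokens such as "v2" or "3.5" are version markers, not names.
VERSION_RE = re.compile(r"^v?\d+(\.\d+)*$", re.IGNORECASE)

def translate_leetspeak(name: str) -> str:
    """Replace leetspeak characters inside tokens that contain letters."""
    translated = []
    for token in name.split(" "):
        has_letter = any(ch.isalpha() for ch in token)
        if has_letter and not VERSION_RE.match(token):
            token = "".join(LEET_MAP.get(ch, ch) for ch in token)
        translated.append(token)
    return " ".join(translated)

print(translate_leetspeak("4kira LoRA v2"))   # akira LoRA v2
print(translate_leetspeak("K4te Middleton"))  # Kate Middleton
```

One caveat of this character-level approach: punctuation that is also in the map (such as a trailing `!`) is translated too, so noise removal may still need to handle those cases.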
|
| 51 |
+
|
| 52 |
+
---
|
| 53 |
+
|
| 54 |
+
## Fix 2: Improved Country Mapping
|
| 55 |
+
|
| 56 |
+
### Problem
|
| 57 |
+
The country mapping was stopping at the first match, which meant:
|
| 58 |
+
- **Irene** with tags `['girl', 'photorealistic', 'asian', 'woman', 'beautiful', 'celebrity', 'korean']`
|
| 59 |
+
- The `'korean'` tag wasn't being properly mapped to `'South Korea'`
|
| 60 |
+
- This resulted in incomplete hints being sent to the LLM
|
| 61 |
+
- **Expected**: Deepseek should identify **Bae Joo-hyun (Irene)** from Red Velvet
|
| 62 |
+
|
| 63 |
+
### Solution
|
| 64 |
+
Updated Cell 7 to:
|
| 65 |
+
1. **Check ALL tags** (instead of stopping at the first match)
|
| 66 |
+
2. **Use a priority system** to select the best match:
|
| 67 |
+
- Priority 3: Exact country name match (highest)
|
| 68 |
+
- Priority 2: Nationality match (medium)
|
| 69 |
+
- Priority 1: Word parts (lowest)
|
| 70 |
+
|
| 71 |
+
### How It Works
|
| 72 |
+
|
| 73 |
+
#### Before (Broken)
|
| 74 |
+
```python
|
| 75 |
+
def infer_country_and_nationality(tags):
|
| 76 |
+
for tag in tags:
|
| 77 |
+
if tag in mapping:
|
| 78 |
+
return mapping[tag] # ❌ Stops at first match!
|
| 79 |
+
return ("", "")
|
| 80 |
+
```
|
| 81 |
+
|
| 82 |
+
#### After (Fixed)
|
| 83 |
+
```python
|
| 84 |
+
def infer_country_and_nationality(tags):
|
| 85 |
+
best_match = None
|
| 86 |
+
best_priority = 0
|
| 87 |
+
|
| 88 |
+
for tag in tags: # ✅ Check ALL tags
|
| 89 |
+
if tag in mapping:
|
| 90 |
+
country, nationality, priority = mapping[tag]
|
| 91 |
+
if priority > best_priority:
|
| 92 |
+
best_match = (country, nationality)
|
| 93 |
+
best_priority = priority
|
| 94 |
+
|
| 95 |
+
return best_match or ("", "")
|
| 96 |
+
```
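
To make the priority logic concrete, here is a self-contained variant with a tiny lookup table. The `MAPPING` contents and the lowercasing of tags are illustrative assumptions; the real table in Cell 7 is not shown in this document.

```python
# Hypothetical lookup table: tag -> (country, nationality, priority).
MAPPING = {
    "south korea": ("South Korea", "South Korean", 3),  # exact country name
    "korean": ("South Korea", "South Korean", 2),       # nationality
    "japanese": ("Japan", "Japanese", 2),               # nationality
}

def infer_country_and_nationality(tags, mapping=MAPPING):
    """Scan every tag and keep the highest-priority match."""
    best_match, best_priority = None, 0
    for tag in tags:
        entry = mapping.get(tag.strip().lower())
        if entry is None:
            continue
        country, nationality, priority = entry
        if priority > best_priority:
            best_match, best_priority = (country, nationality), priority
    return best_match or ("", "")

irene_tags = ["girl", "photorealistic", "asian", "woman",
              "beautiful", "celebrity", "korean"]
print(infer_country_and_nationality(irene_tags))
# ('South Korea', 'South Korean')
```

Because only a strictly higher priority replaces the current best match, ties are resolved in favour of the tag that appears first.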
|
| 97 |
+
|
| 98 |
+
### Example: Irene Case
|
| 99 |
+
|
| 100 |
+
**Input Tags**: `['girl', 'photorealistic', 'asian', 'woman', 'beautiful', 'celebrity', 'korean']`
|
| 101 |
+
|
| 102 |
+
**Processing**:
|
| 103 |
+
1. Check `'girl'` → no match
|
| 104 |
+
2. Check `'photorealistic'` → no match
|
| 105 |
+
3. Check `'asian'` → no match (too generic)
|
| 106 |
+
4. Check `'woman'` → no match
|
| 107 |
+
5. Check `'beautiful'` → no match
|
| 108 |
+
6. Check `'celebrity'` → no match
|
| 109 |
+
7. Check `'korean'` → ✅ **MATCH!**
|
| 110 |
+
- Maps to nationality: `'South Korean'`
|
| 111 |
+
- Which maps to country: `'South Korea'`
|
| 112 |
+
- Priority: 2 (nationality match)
|
| 113 |
+
|
| 114 |
+
**Output**:
|
| 115 |
+
- `likely_country`: `'South Korea'`
|
| 116 |
+
- `likely_nationality`: `'South Korean'`
|
| 117 |
+
|
| 118 |
+
**Sent to Deepseek**:
|
| 119 |
+
```
|
| 120 |
+
Given 'Irene' (celebrity, South Korea), provide:
|
| 121 |
+
1. Full legal name
|
| 122 |
+
2. Aliases
|
| 123 |
+
3. Gender
|
| 124 |
+
4. Top 3 professions
|
| 125 |
+
5. Country
|
| 126 |
+
```
|
| 127 |
+
|
| 128 |
+
**Expected Result**: Deepseek can now identify this as **Bae Joo-hyun (Irene)**, a South Korean singer/actress from the K-pop group Red Velvet.
|
| 129 |
+
|
| 130 |
+
---
|
| 131 |
+
|
| 132 |
+
## Impact on Results
|
| 133 |
+
|
| 134 |
+
### Better Name Recognition
|
| 135 |
+
- Leetspeak names are now properly translated
|
| 136 |
+
- LLMs receive cleaner, more recognizable names
|
| 137 |
+
|
| 138 |
+
### Better Country Context
|
| 139 |
+
- All tags are now considered for country mapping
|
| 140 |
+
- More accurate country/nationality hints sent to LLMs
|
| 141 |
+
- Better identification of international celebrities
|
| 142 |
+
|
| 143 |
+
### Example Improvements
|
| 144 |
+
|
| 145 |
+
| Name | Tags | Before | After |
|
| 146 |
+
|------|------|--------|-------|
|
| 147 |
+
| `4kira LoRA` | `['japanese', 'actress']` | `'4kira'` + no country | `'Akira'` + `'Japan'` |
|
| 148 |
+
| `Irene` | `['korean', 'celebrity']` | `'Irene'` + no country | `'Irene'` + `'South Korea'` |
|
| 149 |
+
| `1U` | `['korean', 'singer']` | `'1U'` + no country | `'IU'` + `'South Korea'` |
|
| 150 |
+
| `3lsa` | `['model']` | `'3lsa'` + no country | `'Elsa'` + country if tagged |
|
| 151 |
+
|
| 152 |
+
---
|
| 153 |
+
|
| 154 |
+
## Testing Recommendations
|
| 155 |
+
|
| 156 |
+
### Before Running Full Pipeline
|
| 157 |
+
|
| 158 |
+
1. **Test Leetspeak Translation** (Cell 5):
|
| 159 |
+
```python
|
| 160 |
+
# Look for names with numbers in the output
|
| 161 |
+
# Verify they're properly translated
|
| 162 |
+
```
|
| 163 |
+
|
| 164 |
+
2. **Test Country Mapping** (Cell 7):
|
| 165 |
+
```python
|
| 166 |
+
# Check the debug output at the end:
|
| 167 |
+
# "🔍 Checking 'Irene' entries:"
|
| 168 |
+
# Verify country is properly mapped
|
| 169 |
+
```
|
| 170 |
+
|
| 171 |
+
3. **Test Deepseek Results** (Cell 10):
|
| 172 |
+
```python
|
| 173 |
+
# Look for Irene in the results
|
| 174 |
+
# Should now identify as Bae Joo-hyun
|
| 175 |
+
```
|
| 176 |
+
|
| 177 |
+
### Validation Checklist
|
| 178 |
+
|
| 179 |
+
- [ ] Leetspeak names are translated (check console output in Cell 5)
|
| 180 |
+
- [ ] Country mapping shows high success rate (check stats in Cell 7)
|
| 181 |
+
- [ ] Irene is correctly identified as Bae Joo-hyun (check results in Cell 10)
|
| 182 |
+
- [ ] Other K-pop/Korean celebrities are properly identified
|
| 183 |
+
- [ ] Japanese/Chinese celebrities also benefit from better country mapping
|
| 184 |
+
|
| 185 |
+
---
|
| 186 |
+
|
| 187 |
+
## Notes
|
| 188 |
+
|
| 189 |
+
### Why Check ALL Tags?
|
| 190 |
+
|
| 191 |
+
Some entries have many tags, and the most informative tag might not be first:
|
| 192 |
+
```
|
| 193 |
+
tags = ['girl', 'sexy', 'beautiful', 'asian', 'korean', 'celebrity', 'kpop']
|
| 194 |
+
^^^^ Most informative!
|
| 195 |
+
```
|
| 196 |
+
|
| 197 |
+
The old code might stop at `'girl'` or `'asian'` (no country info), missing the `'korean'` tag.
|
| 198 |
+
|
| 199 |
+
### Why Use Priority?
|
| 200 |
+
|
| 201 |
+
Some tags might match multiple countries. Priority ensures we get the best match:
|
| 202 |
+
- `'american'` → exact nationality match (priority 2) → USA
|
| 203 |
+
- `'america'` → could be North/South/Central America (priority 1)
|
| 204 |
+
|
| 205 |
+
The system picks the higher priority match.
|
| 206 |
+
|
| 207 |
+
### Word Length Filter
|
| 208 |
+
|
| 209 |
+
Word parts match only if they are longer than 4 characters, which avoids false positives:
|
| 210 |
+
- ✅ `'china'` → matches China (5 chars)
|
| 211 |
+
- ❌ `'us'` → too short, might be part of other words
|
| 212 |
+
|
| 213 |
+
---
|
| 214 |
+
|
| 215 |
+
## Future Improvements
|
| 216 |
+
|
| 217 |
+
Potential enhancements:
|
| 218 |
+
1. **More leetspeak patterns**: `|\/|` for M, `(_)` for U, etc.
|
| 219 |
+
2. **Fuzzy country matching**: Handle typos like `'corean'` → `'korean'`
|
| 220 |
+
3. **Multi-country support**: Some celebrities work in multiple countries
|
| 221 |
+
4. **Language detection**: Use name structure to infer origin
|
| 222 |
+
|
| 223 |
+
---
|
| 224 |
+
|
| 225 |
+
## Summary
|
| 226 |
+
|
| 227 |
+
✅ **Leetspeak translation** ensures names are readable before NER
|
| 228 |
+
✅ **ALL tags checked** ensures no country hints are missed
|
| 229 |
+
✅ **Priority system** ensures best match is selected
|
| 230 |
+
✅ **Better LLM results** from improved name quality and country context
|
| 231 |
+
|
| 232 |
+
These fixes should significantly improve the accuracy of person identification, especially for:
|
| 233 |
+
- International celebrities (K-pop, J-pop, C-pop)
|
| 234 |
+
- Names with leetspeak
|
| 235 |
+
- Entries where country info appears later in tag list
|
misc/assets/fonts/DejaVuSans.ttf
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:7da195a74c55bef988d0d48f9508bd5d849425c1770dba5d7bfc6ce9ed848954
|
| 3 |
+
size 757076
|
misc/assets/fonts/Noto_Sans.zip
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:c7801615f2c5a7a107313cd0a88a15c3b1f15a2da9d4a3648cf49711d8be44da
|
| 3 |
+
size 47636998
|
misc/assets/fonts/Noto_Sans/NotoSans-Italic-VariableFont_wdth,wght.ttf
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:fe5a1fafb96618733aa4ea4c14a2e76ee65fee0d042b040e374f17575467e433
|
| 3 |
+
size 2637272
|
misc/assets/fonts/Noto_Sans/NotoSans-VariableFont_wdth,wght.ttf
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:74df1f61ab9d4bfaa961c65f8dc991deaae2885b0a6a6d6a60ed23980b3c8554
|
| 3 |
+
size 2490816
|
misc/assets/fonts/Noto_Sans/OFL.txt
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:be2f3f8727ac2e18b714ad1c4336d4ddb3f3adbeb9a7f70bfab74d21f4d2b3fb
|
| 3 |
+
size 4489
|