Spaces: gyubin02/maple-data

+# Python
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+*.egg-info/
+.eggs/
+.venv/
+venv/
+ENV/
+# Env files
+.env
+# Test/coverage
+.pytest_cache/
+.coverage
+htmlcov/
+# Editor/OS
+.DS_Store
+.idea/
+.vscode/
+# Data outputs
+/data/
+# Logs
+*.log

README.md ADDED Viewed

	@@ -0,0 +1,90 @@

+# MapleStory Ranking Icon Pipeline
+Data based on NEXON Open API.
+## Overview
+Collects the characters ranked 1-100 in the MapleStory overall ranking and stores, for each one:
+- Equipment shape icons (`item_shape_icon`) + metadata
+- Cash item icons (`cash_item_icon`) + metadata
+Output includes:
+- Raw JSON responses for audit/replay
+- SQLite database with idempotent upserts
+- Optional icon downloads with SHA256 integrity tracking
+## Requirements
+- Python 3.11+
+- NEXON Open API key (set `NEXON_API_KEY`)
+## Install
+```bash
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+```
+## Configuration
+Create `.env` (see `.env.example`) or export env vars:
+- `NEXON_API_KEY` (required)
+- `OUTPUT_DIR` (optional, default `data/`)
+- `DB_PATH` (optional, default `data/YYYY-MM-DD/db.sqlite`)
+## Usage
+```bash
+python -m pipeline run --date 2026-01-10 --top 100 --download-icons --concurrency 8 --rps 5
+python -m pipeline run --top 100 --no-download
+python -m pipeline --top 100
+python -m pipeline --start-rank 101 --end-rank 200 --date 2026-01-10
+```
+The `run` subcommand is optional.
+Optional filters:
+- `--world-name`
+- `--world-type`
+- `--class-name`
+Rank range:
+- `--start-rank` (default 1)
+- `--end-rank` (default = `--top`)
+- `--top` remains as an alias for `--end-rank`
+Merge additional ranges into the same run:
+- Use `--run-id` with a previous `Run ID` from `README_run.md`
+Preset handling:
+- Default: only current preset (or preset 1 if missing)
+- `--all-presets` to store all presets
+## Output Layout
+```
+project/
+  src/
+  data/
+    YYYY-MM-DD/
+      raw/
+        ranking_overall.json
+        ocid/{rank}_{character_name}.json
+        item_equipment/{ocid}.json
+        cashitem_equipment/{ocid}.json
+      db.sqlite
+      icons/
+        equipment_shape/
+        cash/
+```
+## Idempotency
+Runs are keyed by a deterministic `run_id` derived from `target_date` and ranking parameters, so re-running with the same inputs updates existing rows instead of creating duplicates.
+## Compliance
+- Data based on NEXON Open API.
+- Refresh data within 30 days to stay compliant; the CLI is scheduler-friendly (cron, etc.).
+## Tests
+```bash
+pytest
+```
+## Notes
+- Ranking pagination continues until the requested rank range is collected or pages are exhausted.
+- Raw JSON is stored unmodified (for recovery if field names change).
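
The Idempotency section of README.md describes a deterministic `run_id` plus upserts, but the diff shown here does not include that code. The following is a minimal sketch of how such a scheme could work, not the repository's implementation: the hashing recipe, the `characters` table, and the helper names `make_run_id` and `upsert_character` are all assumptions made for illustration.

```python
import hashlib
import json
import sqlite3


def make_run_id(target_date: str, start_rank: int, end_rank: int) -> str:
    """Hash the run parameters into a stable identifier (hypothetical scheme)."""
    payload = json.dumps(
        {"target_date": target_date, "start_rank": start_rank, "end_rank": end_rank},
        sort_keys=True,  # key order must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def upsert_character(conn: sqlite3.Connection, run_id: str, ocid: str, ranking: int) -> None:
    """Insert a row, or update it in place when (run_id, ocid) already exists."""
    conn.execute(
        "INSERT INTO characters (run_id, ocid, ranking) VALUES (?, ?, ?) "
        "ON CONFLICT(run_id, ocid) DO UPDATE SET ranking = excluded.ranking",
        (run_id, ocid, ranking),
    )


# Hypothetical table; the real schema is not part of this diff.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE characters ("
    "run_id TEXT, ocid TEXT, ranking INTEGER, PRIMARY KEY (run_id, ocid))"
)

run_id = make_run_id("2026-01-10", 1, 100)
upsert_character(conn, run_id, "ocid-a", 1)
upsert_character(conn, run_id, "ocid-a", 2)  # same key: updates, does not duplicate

row_count = conn.execute("SELECT COUNT(*) FROM characters").fetchone()[0]
stored_ranking = conn.execute("SELECT ranking FROM characters").fetchone()[0]
```

Because the identifier depends only on the inputs, a scheduled re-run with the same date and rank range lands on the same primary key and overwrites rather than appends.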

README_labeling.md ADDED Viewed

	@@ -0,0 +1,111 @@

+# Labeling Pipeline (CLIP Text Labels)
+Data based on NEXON Open API.
+## Overview
+This pipeline generates CLIP-ready text labels for MapleStory item icons using Qwen2-VL.
+It consumes either a manifest file or the SQLite DB and writes:
+- `labels.jsonl` (one JSON record per image)
+- `labels.parquet` (optional)
+## Requirements
+- Python 3.11+
+- GPU recommended for Qwen2-VL inference
+## Install
+```bash
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+```
+Optional (for 4-bit quantization):
+```bash
+pip install bitsandbytes
+```
+## Input Adapters
+You can use one of the following:
+A) Manifest (recommended)
+- `data/<DATE>/manifest.parquet` or `manifest.jsonl`
+- Required columns: `image_path`, `item_name`, `source_type`
+B) SQLite DB
+- `data/<DATE>/db.sqlite`
+- Joins `equipment_shape_items` / `cash_items` with `icon_assets`
+## Run
+```bash
+python -m labeler run \
+  --input data/2026-01-10/manifest.parquet \
+  --outdir data/2026-01-10/labels \
+  --model Qwen/Qwen2-VL-2B-Instruct \
+  --device auto \
+  --batch-size 8 \
+  --upscale 2 \
+  --resume
+```
+Using DB input:
+```bash
+python -m labeler run \
+  --db data/2026-01-10/db.sqlite \
+  --outdir data/2026-01-10/labels
+```
+Filter DB inputs by `run_id` (optional):
+```bash
+python -m labeler run --db data/2026-01-10/db.sqlite --run-id <RUN_ID>
+```
+## Output Schema
+Each line in `labels.jsonl` is a JSON object:
+```json
+{
+  "image_path": "...",
+  "image_sha256": "...",
+  "source_type": "equipment_shape" | "cash",
+  "item_name": "...",
+  "item_description": "...",
+  "label_ko": "...",
+  "label_en": "...",
+  "tags_ko": ["..."],
+  "attributes": {
+    "colors": ["..."],
+    "theme": ["..."],
+    "material": ["..."],
+    "vibe": ["..."],
+    "item_type_guess": "..."
+  },
+  "query_variants_ko": ["..."],
+  "quality_flags": {
+    "is_uncertain": true,
+    "reasons": ["too_small", "ambiguous_icon"]
+  },
+  "model": "Qwen/Qwen2-VL-2B-Instruct",
+  "prompt_version": "v1",
+  "generated_at": "ISO-8601"
+}
+```
+## Prompt Versioning
+- Prompt version is stored as `prompt_version` in each record.
+- Current version: `v1` (see `src/labeler/prompts.py`).
+## Resume / Idempotency
+- If `labels.jsonl` already exists, use `--resume`.
+- The pipeline skips any image whose `image_path` or `image_sha256` already appears in `labels.jsonl`.
+## Comparisons
+To compare input modes, disable one input at a time:
+- `--no-image` (metadata only)
+- `--no-metadata` (image only)
+## Example Output (3 lines)
+```json
+{"image_path":"icons/equipment_shape/abc.png","image_sha256":"sha...","source_type":"equipment_shape","item_name":"Sample Hat","item_description":null,"label_ko":"샘플 모자 아이콘, 붉은 색감","label_en":null,"tags_ko":["모자","붉은","아이콘","장비","캐릭터"],"attributes":{"colors":["red"],"theme":["fantasy"],"material":["cloth"],"vibe":["cute"],"item_type_guess":"hat"},"query_variants_ko":["샘플 모자","붉은 모자 아이콘","메이플 모자"],"quality_flags":{"is_uncertain":false,"reasons":[]},"model":"Qwen/Qwen2-VL-2B-Instruct","prompt_version":"v1","generated_at":"2026-01-10T00:00:00Z"}
+{"image_path":"icons/cash/def.png","image_sha256":"sha...","source_type":"cash","item_name":"Sample Cape","item_description":"Example","label_ko":"샘플 망토 아이콘, 푸른 계열","label_en":null,"tags_ko":["망토","푸른","코디","캐시","아이콘"],"attributes":{"colors":["blue"],"theme":["classic"],"material":["silk"],"vibe":["elegant"],"item_type_guess":"cape"},"query_variants_ko":["푸른 망토","샘플 망토 아이콘","메이플 캐시 망토"],"quality_flags":{"is_uncertain":false,"reasons":[]},"model":"Qwen/Qwen2-VL-2B-Instruct","prompt_version":"v1","generated_at":"2026-01-10T00:00:00Z"}
+{"image_path":"icons/equipment_shape/ghi.png","image_sha256":"sha...","source_type":"equipment_shape","item_name":"Sample Sword","item_description":null,"label_ko":"샘플 검 아이콘, 금속 느낌","label_en":null,"tags_ko":["검","무기","금속","아이콘","장비"],"attributes":{"colors":["silver"],"theme":["fantasy"],"material":["metal"],"vibe":["sharp"],"item_type_guess":"sword"},"query_variants_ko":["샘플 검","메이플 검 아이콘","금속 검"],"quality_flags":{"is_uncertain":false,"reasons":[]},"model":"Qwen/Qwen2-VL-2B-Instruct","prompt_version":"v1","generated_at":"2026-01-10T00:00:00Z"}
+```
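
The Resume / Idempotency section of README_labeling.md says already-labeled images are skipped by `image_path` or `image_sha256`. The helper that builds those lookup sets (`_load_existing` in `src/labeler/pipeline.py`) is not visible in this part of the diff, so the sketch below is an assumption about how it might behave; the name `load_existing` and the tolerance for a truncated last line are illustrative choices.

```python
import json
import tempfile
from pathlib import Path


def load_existing(path: Path) -> tuple[set[str], set[str]]:
    """Collect the image paths and hashes already present in a labels.jsonl file."""
    paths: set[str] = set()
    shas: set[str] = set()
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a half-written last line from an interrupted run
            if not isinstance(record, dict):
                continue
            if record.get("image_path"):
                paths.add(record["image_path"])
            if record.get("image_sha256"):
                shas.add(record["image_sha256"])
    return paths, shas


# Demo: two complete records followed by a truncated one.
with tempfile.TemporaryDirectory() as tmp:
    labels = Path(tmp) / "labels.jsonl"
    labels.write_text(
        '{"image_path": "icons/cash/a.png", "image_sha256": "aaa"}\n'
        '{"image_path": "icons/cash/b.png", "image_sha256": null}\n'
        '{"image_path": "icons/cash/c.p',
        encoding="utf-8",
    )
    done_paths, done_shas = load_existing(labels)
```

Checking both sets lets a resumed run skip an icon even when the same bytes were stored under a different relative path.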

labeler/__init__.py ADDED Viewed

	@@ -0,0 +1,14 @@

+from __future__ import annotations
+import sys
+from pathlib import Path
+from pkgutil import extend_path
+_root = Path(__file__).resolve().parent.parent
+_src = _root / "src"
+if _src.exists():
+    src_str = str(_src)
+    if src_str not in sys.path:
+        sys.path.insert(0, src_str)
+__path__ = extend_path(__path__, __name__)

labeler/__main__.py ADDED Viewed

	@@ -0,0 +1,12 @@

+from __future__ import annotations
+import sys
+import typer
+from labeler.cli import run
+if len(sys.argv) > 1 and sys.argv[1] == "run":
+    sys.argv.pop(1)
+typer.run(run)
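
Both `__main__.py` files drop a leading `run` argument before handing control to Typer, which is what makes the `run` subcommand optional. As a small illustration, the same rule can be written as a pure function that leaves the input list untouched; `strip_optional_subcommand` is a hypothetical name, not part of the repository.

```python
def strip_optional_subcommand(argv: list[str], name: str = "run") -> list[str]:
    """Return argv without the subcommand when it is the first argument."""
    if len(argv) > 1 and argv[1] == name:
        return [argv[0], *argv[2:]]
    return list(argv)


with_run = strip_optional_subcommand(["labeler", "run", "--db", "db.sqlite"])
without_run = strip_optional_subcommand(["labeler", "--db", "db.sqlite"])
# Only position 1 is inspected, so a later "run" value is left alone.
later_run = strip_optional_subcommand(["labeler", "--model", "run"])
```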

pipeline/__init__.py ADDED Viewed

	@@ -0,0 +1,14 @@

+from __future__ import annotations
+import sys
+from pathlib import Path
+from pkgutil import extend_path
+_root = Path(__file__).resolve().parent.parent
+_src = _root / "src"
+if _src.exists():
+    src_str = str(_src)
+    if src_str not in sys.path:
+        sys.path.insert(0, src_str)
+__path__ = extend_path(__path__, __name__)

pipeline/__main__.py ADDED Viewed

	@@ -0,0 +1,12 @@

+from __future__ import annotations
+import sys
+import typer
+from pipeline.cli import run
+if len(sys.argv) > 1 and sys.argv[1] == "run":
+    sys.argv.pop(1)
+typer.run(run)

requirements.txt ADDED Viewed

	@@ -0,0 +1,9 @@

+httpx>=0.25
+python-dotenv>=1.0
+typer>=0.9
+pytest>=7.0
+accelerate>=0.27
+pillow>=10.0
+pyarrow>=14.0
+torch>=2.1
+transformers>=4.41

src/labeler/__init__.py ADDED Viewed

	@@ -0,0 +1,2 @@


+__all__ = ["__version__"]
+__version__ = "0.1.0"

src/labeler/adapters.py ADDED Viewed

	@@ -0,0 +1,246 @@

+from __future__ import annotations
+import json
+import logging
+import sqlite3
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Iterable, Optional
+logger = logging.getLogger("labeler")
+@dataclass
+class LabelInput:
+    image_path: str
+    image_abspath: Optional[Path]
+    image_url: Optional[str]
+    image_sha256: Optional[str]
+    item_name: str
+    item_description: Optional[str]
+    item_part: Optional[str]
+    source_type: str
+    ocid: Optional[str]
+    ranking: Optional[int]
+def iter_inputs(
+    *,
+    input_path: Optional[Path],
+    db_path: Optional[Path],
+    only_source: str,
+    max_samples: Optional[int],
+    run_id: Optional[str],
+) -> Iterable[LabelInput]:
+    if input_path:
+        if input_path.suffix.lower() in {".jsonl", ".json"}:
+            yield from _iter_manifest_jsonl(input_path, only_source, max_samples)
+        elif input_path.suffix.lower() == ".parquet":
+            yield from _iter_manifest_parquet(input_path, only_source, max_samples)
+        else:
+            raise ValueError(f"Unsupported input format: {input_path}")
+        return
+    if not db_path:
+        raise ValueError("Provide --input or --db")
+    yield from _iter_db(db_path, only_source, max_samples, run_id)
+def _iter_manifest_jsonl(
+    path: Path,
+    only_source: str,
+    max_samples: Optional[int],
+) -> Iterable[LabelInput]:
+    base_dir = path.parent
+    count = 0
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                record = json.loads(line)
+            except json.JSONDecodeError:
+                logger.warning("Skipping invalid JSON line in %s", path)
+                continue
+            sample = _build_from_record(record, base_dir, only_source)
+            if not sample:
+                continue
+            yield sample
+            count += 1
+            if max_samples and count >= max_samples:
+                break
+def _iter_manifest_parquet(
+    path: Path,
+    only_source: str,
+    max_samples: Optional[int],
+) -> Iterable[LabelInput]:
+    try:
+        import pyarrow.parquet as pq
+    except ImportError as exc:
+        raise RuntimeError("pyarrow is required for parquet input") from exc
+    base_dir = path.parent
+    table = pq.read_table(path)
+    rows = table.to_pylist()
+    count = 0
+    for record in rows:
+        sample = _build_from_record(record, base_dir, only_source)
+        if not sample:
+            continue
+        yield sample
+        count += 1
+        if max_samples and count >= max_samples:
+            break
+def _build_from_record(
+    record: dict[str, object],
+    base_dir: Path,
+    only_source: str,
+) -> Optional[LabelInput]:
+    source_type = str(record.get("source_type") or "")
+    if not _source_allowed(source_type, only_source):
+        return None
+    image_path = str(record.get("image_path") or "").strip()
+    if not image_path:
+        logger.warning("Missing image_path in manifest record")
+        return None
+    image_abspath = Path(image_path)
+    if not image_abspath.is_absolute():
+        image_abspath = (base_dir / image_abspath).resolve()
+    item_name = str(record.get("item_name") or "").strip()
+    if not item_name:
+        logger.warning("Missing item_name in manifest record")
+        return None
+    return LabelInput(
+        image_path=image_path,
+        image_abspath=image_abspath,
+        image_url=_optional_str(record.get("image_url")),
+        image_sha256=_optional_str(record.get("image_sha256")),
+        item_name=item_name,
+        item_description=_optional_str(record.get("item_description")),
+        item_part=_optional_str(record.get("item_part")),
+        source_type=source_type,
+        ocid=_optional_str(record.get("ocid")),
+        ranking=_optional_int(record.get("ranking")),
+    )
+def _iter_db(
+    db_path: Path,
+    only_source: str,
+    max_samples: Optional[int],
+    run_id: Optional[str],
+) -> Iterable[LabelInput]:
+    conn = sqlite3.connect(db_path)
+    conn.row_factory = sqlite3.Row
+    base_dir = db_path.parent
+    def stream(query: str, params: tuple[object, ...], source_type: str) -> Iterable[LabelInput]:
+        cursor = conn.execute(query, params)
+        for row in cursor:
+            local_path = row["local_path"]
+            image_path = local_path or ""
+            image_abspath = (base_dir / local_path).resolve() if local_path else None
+            item_name = row["item_name"] or ""
+            if not item_name:
+                continue
+            yield LabelInput(
+                image_path=image_path,
+                image_abspath=image_abspath,
+                image_url=row["image_url"],
+                image_sha256=row["sha256"],
+                item_name=item_name,
+                item_description=row["item_description"],
+                item_part=_build_item_part(row["item_part"], row["item_slot"]),
+                source_type=source_type,
+                ocid=row["ocid"],
+                ranking=None,
+            )
+    count = 0
+    # try/finally closes the connection even if the consumer stops iterating early.
+    try:
+        if only_source in ("equipment_shape", "all"):
+            query, params = _equipment_query(run_id)
+            for sample in stream(query, params, "equipment_shape"):
+                yield sample
+                count += 1
+                if max_samples and count >= max_samples:
+                    return
+        if only_source in ("cash", "all"):
+            query, params = _cash_query(run_id)
+            for sample in stream(query, params, "cash"):
+                yield sample
+                count += 1
+                if max_samples and count >= max_samples:
+                    return
+    finally:
+        conn.close()
+def _equipment_query(run_id: Optional[str]) -> tuple[str, tuple[object, ...]]:
+    query = (
+        "SELECT e.item_shape_icon_url AS image_url, a.sha256 AS sha256, a.local_path AS local_path, "
+        "e.item_name AS item_name, e.item_description AS item_description, "
+        "e.item_equipment_part AS item_part, e.equipment_slot AS item_slot, e.ocid AS ocid "
+        "FROM equipment_shape_items e "
+        "LEFT JOIN icon_assets a ON a.url = e.item_shape_icon_url "
+        "WHERE e.item_shape_icon_url IS NOT NULL AND e.item_shape_icon_url != ''"
+    )
+    if run_id:
+        query += " AND e.run_id = ?"
+        return query, (run_id,)
+    return query, ()
+def _cash_query(run_id: Optional[str]) -> tuple[str, tuple[object, ...]]:
+    query = (
+        "SELECT c.cash_item_icon_url AS image_url, a.sha256 AS sha256, a.local_path AS local_path, "
+        "c.cash_item_name AS item_name, c.cash_item_description AS item_description, "
+        "c.cash_item_equipment_part AS item_part, c.cash_item_equipment_slot AS item_slot, c.ocid AS ocid "
+        "FROM cash_items c "
+        "LEFT JOIN icon_assets a ON a.url = c.cash_item_icon_url "
+        "WHERE c.cash_item_icon_url IS NOT NULL AND c.cash_item_icon_url != ''"
+    )
+    if run_id:
+        query += " AND c.run_id = ?"
+        return query, (run_id,)
+    return query, ()
+def _source_allowed(source_type: str, only_source: str) -> bool:
+    if only_source == "all":
+        return source_type in {"equipment_shape", "cash"}
+    return source_type == only_source
+def _build_item_part(part: Optional[str], slot: Optional[str]) -> Optional[str]:
+    part = (part or "").strip()
+    slot = (slot or "").strip()
+    if part and slot and part != slot:
+        return f"{part}/{slot}"
+    return part or slot or None
+def _optional_str(value: object) -> Optional[str]:
+    if value is None:
+        return None
+    text = str(value).strip()
+    return text or None
+def _optional_int(value: object) -> Optional[int]:
+    try:
+        return int(value) if value is not None else None
+    except (TypeError, ValueError):
+        return None
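
The JSONL adapter above accepts one JSON object per line, requires `image_path`, `item_name`, and `source_type`, and resolves a relative `image_path` against the manifest's own directory. The sketch below writes a manifest in that shape and resolves the paths the same way; `write_manifest` and the sample rows are made up for illustration and are not part of the repository.

```python
import json
import tempfile
from pathlib import Path

REQUIRED = ("image_path", "item_name", "source_type")


def write_manifest(path: Path, rows: list[dict[str, object]]) -> int:
    """Write rows as JSONL, skipping any row that lacks a required field."""
    written = 0
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            if not all(str(row.get(key) or "").strip() for key in REQUIRED):
                continue
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
            written += 1
    return written


rows = [
    {"image_path": "icons/cash/def.png", "item_name": "Sample Cape", "source_type": "cash"},
    {"image_path": "icons/cash/xyz.png", "item_name": "", "source_type": "cash"},  # no name
]

with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "manifest.jsonl"
    written = write_manifest(manifest, rows)
    records = [json.loads(line) for line in manifest.read_text(encoding="utf-8").splitlines()]
    # Relative paths are interpreted relative to the manifest's directory.
    resolved = [(manifest.parent / rec["image_path"]).resolve() for rec in records]
    resolved_is_absolute = all(p.is_absolute() for p in resolved)
```

Filtering incomplete rows at write time mirrors what the adapter would do anyway: it logs a warning and skips records with a missing `image_path` or `item_name`.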

src/labeler/cli.py ADDED Viewed

	@@ -0,0 +1,131 @@

+from __future__ import annotations
+import logging
+from pathlib import Path
+from typing import Optional
+import typer
+from .pipeline import LabelingConfig, run_labeling
+def run(
+    input_path: Optional[Path] = typer.Option(
+        None,
+        "--input",
+        help="Path to manifest.jsonl or manifest.parquet",
+    ),
+    db_path: Optional[Path] = typer.Option(
+        None,
+        "--db",
+        help="Path to SQLite db.sqlite (used when --input is not provided)",
+    ),
+    outdir: Optional[Path] = typer.Option(
+        None,
+        "--outdir",
+        help="Output directory for labels (default: input/db parent + /labels)",
+    ),
+    model: str = typer.Option(
+        "Qwen/Qwen2-VL-2B-Instruct",
+        "--model",
+        help="Model ID",
+    ),
+    device: str = typer.Option(
+        "auto",
+        "--device",
+        help="Device string (auto, cpu, cuda, cuda:0)",
+    ),
+    precision: str = typer.Option(
+        "auto",
+        "--precision",
+        help="auto|fp16|bf16|fp32",
+    ),
+    batch_size: int = typer.Option(8, "--batch-size", help="Batch size"),
+    upscale: int = typer.Option(1, "--upscale", help="Upscale factor (e.g. 2 or 4)"),
+    alpha_bg: str = typer.Option(
+        "white",
+        "--alpha-bg",
+        help="Background for transparent PNGs: white|black|none",
+    ),
+    resume: bool = typer.Option(False, "--resume", help="Resume from existing labels.jsonl"),
+    lang: str = typer.Option("ko", "--lang", help="ko|en|both"),
+    only_source: str = typer.Option(
+        "all",
+        "--only-source",
+        help="equipment_shape|cash|all",
+    ),
+    max_samples: Optional[int] = typer.Option(
+        None,
+        "--max-samples",
+        help="Limit number of samples (for testing)",
+    ),
+    no_image: bool = typer.Option(False, "--no-image", help="Use metadata only"),
+    no_metadata: bool = typer.Option(False, "--no-metadata", help="Use image only"),
+    log_level: str = typer.Option("info", "--log-level", help="info|debug"),
+    parquet: bool = typer.Option(False, "--parquet", help="Write labels.parquet"),
+    load_4bit: bool = typer.Option(False, "--load-4bit", help="Enable 4-bit quantization"),
+    max_new_tokens: int = typer.Option(384, "--max-new-tokens", help="Max new tokens"),
+    run_id: Optional[str] = typer.Option(
+        None,
+        "--run-id",
+        help="Filter DB inputs by run_id",
+    ),
+) -> None:
+    """Generate CLIP-ready labels for MapleStory item icons."""
+    logging.basicConfig(level=_parse_log_level(log_level), format="%(levelname)s: %(message)s")
+    if not input_path and not db_path:
+        typer.echo("Provide --input or --db")
+        raise typer.Exit(code=1)
+    if alpha_bg not in {"white", "black", "none"}:
+        typer.echo("--alpha-bg must be white, black, or none")
+        raise typer.Exit(code=1)
+    if lang not in {"ko", "en", "both"}:
+        typer.echo("--lang must be ko, en, or both")
+        raise typer.Exit(code=1)
+    if only_source not in {"equipment_shape", "cash", "all"}:
+        typer.echo("--only-source must be equipment_shape, cash, or all")
+        raise typer.Exit(code=1)
+    if precision not in {"auto", "fp16", "bf16", "fp32"}:
+        typer.echo("--precision must be auto, fp16, bf16, or fp32")
+        raise typer.Exit(code=1)
+    resolved_outdir = outdir
+    if not resolved_outdir:
+        if input_path:
+            resolved_outdir = input_path.parent / "labels"
+        else:
+            resolved_outdir = db_path.parent / "labels"
+    config = LabelingConfig(
+        input_path=input_path,
+        db_path=db_path,
+        outdir=resolved_outdir,
+        model_id=model,
+        device=device,
+        precision=precision,
+        batch_size=batch_size,
+        upscale=upscale,
+        alpha_bg=alpha_bg,
+        resume=resume,
+        lang=lang,
+        only_source=only_source,
+        max_samples=max_samples,
+        no_image=no_image,
+        no_metadata=no_metadata,
+        log_level=log_level,
+        parquet=parquet,
+        load_4bit=load_4bit,
+        max_new_tokens=max_new_tokens,
+        run_id=run_id,
+    )
+    run_labeling(config)
+def _parse_log_level(value: str) -> int:
+    value = value.lower()
+    if value == "debug":
+        return logging.DEBUG
+    return logging.INFO

src/labeler/image_utils.py ADDED Viewed

	@@ -0,0 +1,32 @@

+from __future__ import annotations
+from pathlib import Path
+from PIL import Image
+def load_image(
+    path: Path,
+    upscale: int,
+    alpha_background: str,
+) -> Image.Image:
+    # Open in a context manager so the file handle is released promptly;
+    # _apply_alpha returns a converted copy that no longer depends on the file.
+    with Image.open(path) as source:
+        image = _apply_alpha(source, alpha_background)
+    if upscale and upscale > 1:
+        image = image.resize(
+            (image.width * upscale, image.height * upscale),
+            resample=Image.BICUBIC,
+        )
+    return image
+def _apply_alpha(image: Image.Image, alpha_background: str) -> Image.Image:
+    if image.mode in ("RGBA", "LA") or (image.mode == "P" and "transparency" in image.info):
+        if alpha_background == "none":
+            return image.convert("RGBA")
+        color = (255, 255, 255, 255) if alpha_background == "white" else (0, 0, 0, 255)
+        background = Image.new("RGBA", image.size, color)
+        foreground = image.convert("RGBA")
+        return Image.alpha_composite(background, foreground).convert("RGB")
+    return image.convert("RGB")
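
`_apply_alpha` flattens transparent icons onto a solid white or black background before the image reaches the model. For an opaque background, the per-channel arithmetic is the usual "over" blend, `out = fg * a + bg * (1 - a)`. The pure-Python sketch below shows that arithmetic on a single pixel; it illustrates the formula only, and its rounding is not guaranteed to match Pillow's `alpha_composite` bit for bit.

```python
def composite_over(
    fg: tuple[int, int, int, int],
    bg: tuple[int, int, int],
) -> tuple[int, int, int]:
    """Blend one RGBA pixel over an opaque RGB background (8-bit channels)."""
    r, g, b, a = fg
    return tuple(
        # Integer form of fg * a/255 + bg * (1 - a/255), rounded to nearest.
        (channel * a + back * (255 - a) + 127) // 255
        for channel, back in zip((r, g, b), bg)
    )


WHITE = (255, 255, 255)
BLACK = (0, 0, 0)

transparent_on_white = composite_over((255, 0, 0, 0), WHITE)
opaque_on_white = composite_over((255, 0, 0, 255), WHITE)
half_black_on_white = composite_over((0, 0, 0, 128), WHITE)
transparent_on_black = composite_over((255, 0, 0, 0), BLACK)
```

Fully transparent pixels take the background colour, which is why the choice between `white` and `black` for `--alpha-bg` changes what the model sees around an icon's outline.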

src/labeler/model.py ADDED Viewed

	@@ -0,0 +1,122 @@

+from __future__ import annotations
+import logging
+from dataclasses import dataclass
+from typing import Iterable, Optional
+import torch
+from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
+logger = logging.getLogger("labeler")
+@dataclass
+class ModelConfig:
+    model_id: str
+    device: str
+    precision: str
+    max_new_tokens: int
+    load_4bit: bool
+class LabelerModel:
+    def __init__(self, config: ModelConfig) -> None:
+        self.config = config
+        self.device = _resolve_device(config.device)
+        self.dtype = _resolve_dtype(config.precision, self.device)
+        quantization_config = None
+        load_kwargs: dict[str, object] = {}
+        if config.load_4bit:
+            try:
+                from transformers import BitsAndBytesConfig
+            except ImportError as exc:
+                raise RuntimeError("bitsandbytes is required for 4-bit loading") from exc
+            quantization_config = BitsAndBytesConfig(
+                load_in_4bit=True,
+                bnb_4bit_use_double_quant=True,
+                bnb_4bit_compute_dtype=self.dtype,
+            )
+            load_kwargs["quantization_config"] = quantization_config
+            load_kwargs["device_map"] = "auto"
+        elif self.device.startswith("cuda"):
+            load_kwargs["device_map"] = "auto"
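+        self.processor = AutoProcessor.from_pretrained(config.model_id)
+        # Batched generation with a decoder-only model should be left-padded so
+        # every prompt ends right where generation starts.
+        self.processor.tokenizer.padding_side = "left"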
+        self.model = Qwen2VLForConditionalGeneration.from_pretrained(
+            config.model_id,
+            torch_dtype=self.dtype,
+            low_cpu_mem_usage=True,
+            **load_kwargs,
+        )
+        if not load_kwargs.get("device_map"):
+            self.model.to(self.device)
+        self.model.eval()
+    def generate_texts(
+        self,
+        messages_list: list[list[dict[str, object]]],
+        images: Optional[list[object]],
+    ) -> list[str]:
+        prompts = [
+            self.processor.apply_chat_template(
+                messages,
+                tokenize=False,
+                add_generation_prompt=True,
+            )
+            for messages in messages_list
+        ]
+        if images is None:
+            inputs = self.processor(
+                text=prompts,
+                padding=True,
+                return_tensors="pt",
+            )
+        else:
+            inputs = self.processor(
+                text=prompts,
+                images=images,
+                padding=True,
+                return_tensors="pt",
+            )
+        inputs = _move_to_device(inputs, self.model.device)
+        with torch.inference_mode():
+            output_ids = self.model.generate(
+                **inputs,
+                max_new_tokens=self.config.max_new_tokens,
+                do_sample=False,
+            )
+        prompt_length = inputs["input_ids"].shape[1]
+        generated_ids = output_ids[:, prompt_length:]
+        return self.processor.batch_decode(generated_ids, skip_special_tokens=True)
+def _resolve_device(device: str) -> str:
+    if device == "auto":
+        return "cuda" if torch.cuda.is_available() else "cpu"
+    return device
+def _resolve_dtype(precision: str, device: str) -> torch.dtype:
+    if precision == "fp32":
+        return torch.float32
+    if precision == "bf16":
+        if device.startswith("cuda") and torch.cuda.is_bf16_supported():
+            return torch.bfloat16
+        return torch.float16
+    if precision == "fp16":
+        return torch.float16
+    if device.startswith("cuda"):
+        return torch.float16
+    return torch.float32
+def _move_to_device(inputs: dict[str, object], device: torch.device | str) -> dict[str, object]:
+    moved = {}
+    for key, value in inputs.items():
+        if hasattr(value, "to"):
+            moved[key] = value.to(device)
+        else:
+            moved[key] = value
+    return moved
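
`generate_texts` pads the prompts to a common width, generates, and then keeps only the columns past `prompt_length`. The toy sketch below reproduces that bookkeeping with plain lists instead of tensors, assuming left padding; the token ids and the "model" that appends two tokens are invented for illustration.

```python
PAD = 0


def left_pad(batch: list[list[int]], pad_id: int = PAD) -> list[list[int]]:
    """Pad every row on the left to the width of the longest row."""
    width = max(len(row) for row in batch)
    return [[pad_id] * (width - len(row)) + row for row in batch]


def strip_prompt(output_ids: list[list[int]], prompt_length: int) -> list[list[int]]:
    """Keep only the tokens generated after the (padded) prompt."""
    return [row[prompt_length:] for row in output_ids]


prompts = left_pad([[11, 12, 13], [21]])
# Stand-in for model.generate: each output row is the prompt plus new tokens.
outputs = [row + [91, 92] for row in prompts]
new_tokens = strip_prompt(outputs, len(prompts[0]))
```

With left padding, the last prompt token of every row sits directly before the first generated token, so a single column offset separates prompt from completion for the whole batch.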

src/labeler/pipeline.py ADDED Viewed

	@@ -0,0 +1,379 @@

+from __future__ import annotations
+import json
+import logging
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Iterable, Optional
+from .adapters import LabelInput, iter_inputs
+from .image_utils import load_image
+from .model import LabelerModel, ModelConfig
+from .prompts import PROMPT_VERSION, PromptInputs, build_messages, build_user_prompt
+from pipeline.utils import ensure_dir, utc_now_iso
+logger = logging.getLogger("labeler")
+@dataclass
+class LabelingConfig:
+    input_path: Optional[Path]
+    db_path: Optional[Path]
+    outdir: Path
+    model_id: str
+    device: str
+    precision: str
+    batch_size: int
+    upscale: int
+    alpha_bg: str
+    resume: bool
+    lang: str
+    only_source: str
+    max_samples: Optional[int]
+    no_image: bool
+    no_metadata: bool
+    log_level: str
+    parquet: bool
+    load_4bit: bool
+    max_new_tokens: int
+    run_id: Optional[str]
+def run_labeling(config: LabelingConfig) -> None:
+    ensure_dir(config.outdir)
+    output_jsonl = config.outdir / "labels.jsonl"
+    output_parquet = config.outdir / "labels.parquet"
+    error_log = config.outdir / "labels_errors.log"
+    if output_jsonl.exists() and not config.resume:
+        raise RuntimeError("labels.jsonl already exists; use --resume or remove the file")
+    existing_paths: set[str] = set()
+    existing_sha: set[str] = set()
+    if config.resume and output_jsonl.exists():
+        existing_paths, existing_sha = _load_existing(output_jsonl)
+    model = LabelerModel(
+        ModelConfig(
+            model_id=config.model_id,
+            device=config.device,
+            precision=config.precision,
+            max_new_tokens=config.max_new_tokens,
+            load_4bit=config.load_4bit,
+        )
+    )
+    records_for_parquet: list[dict[str, object]] = []
+    seen_paths: set[str] = set()
+    seen_sha: set[str] = set()
+    with output_jsonl.open("a", encoding="utf-8") as out_handle, error_log.open(
+        "a", encoding="utf-8"
+    ) as err_handle:
+        batch: list[LabelInput] = []
+        for sample in iter_inputs(
+            input_path=config.input_path,
+            db_path=config.db_path,
+            only_source=config.only_source,
+            max_samples=config.max_samples,
+            run_id=config.run_id,
+        ):
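+            # DB rows without a downloaded icon have an empty image_path; only
+            # de-duplicate on non-empty paths so those rows are not collapsed into one.
+            if sample.image_path and (
+                sample.image_path in existing_paths or sample.image_path in seen_paths
+            ):
+                continue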
+            if sample.image_sha256 and (
+                sample.image_sha256 in existing_sha or sample.image_sha256 in seen_sha
+            ):
+                continue
+            if not config.no_image:
+                if not sample.image_abspath or not sample.image_abspath.exists():
+                    _log_error(err_handle, sample.image_path, "missing_image")
+                    continue
+            batch.append(sample)
+            if len(batch) >= config.batch_size:
+                _process_batch(
+                    batch,
+                    model,
+                    config,
+                    out_handle,
+                    err_handle,
+                    records_for_parquet,
+                    seen_paths,
+                    seen_sha,
+                )
+                batch = []
+        if batch:
+            _process_batch(
+                batch,
+                model,
+                config,
+                out_handle,
+                err_handle,
+                records_for_parquet,
+                seen_paths,
+                seen_sha,
+            )
+    if config.parquet:
+        _write_parquet(output_parquet, records_for_parquet)
+def _process_batch(
+    batch: list[LabelInput],
+    model: LabelerModel,
+    config: LabelingConfig,
+    out_handle,
+    err_handle,
+    parquet_buffer: list[dict[str, object]],
+    seen_paths: set[str],
+    seen_sha: set[str],
+) -> None:
+    include_image = not config.no_image
+    include_metadata = not config.no_metadata
+    messages_list = []
+    images = []
+    active_samples: list[LabelInput] = []
+    for sample in batch:
+        prompt_inputs = PromptInputs(
+            item_name=sample.item_name,
+            item_description=sample.item_description,
+            item_part=sample.item_part,
+            source_type=sample.source_type,
+            include_image=include_image,
+            include_metadata=include_metadata,
+            lang=config.lang,
+        )
+        user_prompt = build_user_prompt(prompt_inputs)
+        messages_list.append(build_messages(user_prompt, include_image, strict=False))
+        if include_image:
+            try:
+                images.append(load_image(sample.image_abspath, config.upscale, config.alpha_bg))
+            except Exception:
+                _log_error(err_handle, sample.image_path, "image_load_failed")
+                messages_list.pop()
+                continue
+        active_samples.append(sample)
+    if not active_samples:
+        return
+    outputs = model.generate_texts(messages_list, images if include_image else None)
+    for sample, raw_text in zip(active_samples, outputs):
+        record = _parse_and_build(
+            sample,
+            raw_text,
+            model,
+            config,
+            err_handle,
+            include_image,
+            include_metadata,
+        )
+        if not record:
+            continue
+        out_handle.write(json.dumps(record, ensure_ascii=False) + "\n")
+        out_handle.flush()
+        parquet_buffer.append(record)
+        seen_paths.add(sample.image_path)
+        if sample.image_sha256:
+            seen_sha.add(sample.image_sha256)
+def _parse_and_build(
+    sample: LabelInput,
+    raw_text: str,
+    model: LabelerModel,
+    config: LabelingConfig,
+    err_handle,
+    include_image: bool,
+    include_metadata: bool,
+) -> Optional[dict[str, object]]:
+    parsed = _try_parse(raw_text)
+    if parsed is None:
+        parsed = _retry_strict(sample, model, config, include_image, include_metadata)
+    if parsed is None:
+        _log_error(err_handle, sample.image_path, "invalid_json")
+        return None
+    try:
+        record = _normalize_record(parsed, sample, config)
+    except ValueError as exc:
+        _log_error(err_handle, sample.image_path, str(exc))
+        return None
+    return record
+def _retry_strict(
+    sample: LabelInput,
+    model: LabelerModel,
+    config: LabelingConfig,
+    include_image: bool,
+    include_metadata: bool,
+) -> Optional[dict[str, object]]:
+    prompt_inputs = PromptInputs(
+        item_name=sample.item_name,
+        item_description=sample.item_description,
+        item_part=sample.item_part,
+        source_type=sample.source_type,
+        include_image=include_image,
+        include_metadata=include_metadata,
+        lang=config.lang,
+    )
+    user_prompt = build_user_prompt(prompt_inputs)
+    messages = [build_messages(user_prompt, include_image, strict=True)]
+    images = None
+    if include_image:
+        try:
+            images = [load_image(sample.image_abspath, config.upscale, config.alpha_bg)]
+        except Exception:
+            return None
+    output = model.generate_texts(messages, images)
+    if not output:
+        return None
+    return _try_parse(output[0])
+def _try_parse(raw_text: str) -> Optional[dict[str, object]]:
+    text = raw_text.strip()
+    if text.startswith("```"):
+        text = text.split("\n", 1)[-1]
+        if text.endswith("```"):
+            text = text.rsplit("```", 1)[0]
+        text = text.strip()
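+    try:
+        parsed = json.loads(text)
+    except json.JSONDecodeError:
+        # Fall back to the outermost {...} span when the model wraps the JSON in extra text.
+        start = text.find("{")
+        end = text.rfind("}")
+        if start == -1 or end == -1 or end <= start:
+            return None
+        try:
+            parsed = json.loads(text[start : end + 1])
+        except json.JSONDecodeError:
+            return None
+    # Only a JSON object is usable downstream; _normalize_record calls .get() on the result.
+    return parsed if isinstance(parsed, dict) else None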
+def _normalize_record(
+    parsed: dict[str, object],
+    sample: LabelInput,
+    config: LabelingConfig,
+) -> dict[str, object]:
+    label_ko = _clean_text(parsed.get("label_ko"))
+    if not label_ko:
+        raise ValueError("label_ko_missing")
+    label_en = _clean_text(parsed.get("label_en"))
+    if config.lang == "ko":
+        label_en = None
+    tags = _normalize_list(parsed.get("tags_ko"), max_items=15)
+    queries = _normalize_list(parsed.get("query_variants_ko"), max_items=8)
+    attributes = parsed.get("attributes") if isinstance(parsed.get("attributes"), dict) else {}
+    normalized_attributes = {
+        "colors": _normalize_list(attributes.get("colors"), max_items=10),
+        "theme": _normalize_list(attributes.get("theme"), max_items=10),
+        "material": _normalize_list(attributes.get("material"), max_items=10),
+        "vibe": _normalize_list(attributes.get("vibe"), max_items=10),
+        "item_type_guess": _clean_text(attributes.get("item_type_guess")),
+    }
+    quality = parsed.get("quality_flags") if isinstance(parsed.get("quality_flags"), dict) else {}
+    is_uncertain = bool(quality.get("is_uncertain", False))
+    reasons = _normalize_list(quality.get("reasons"))
+    if len(tags) < 5:
+        is_uncertain = True
+        reasons.append("few_tags")
+    if len(queries) < 3:
+        is_uncertain = True
+        reasons.append("few_queries")
+    reasons = _unique_list(reasons)
+    return {
+        "image_path": sample.image_path,
+        "image_sha256": sample.image_sha256,
+        "source_type": sample.source_type,
+        "item_name": sample.item_name,
+        "item_description": sample.item_description,
+        "label_ko": label_ko,
+        "label_en": label_en,
+        "tags_ko": tags,
+        "attributes": normalized_attributes,
+        "query_variants_ko": queries,
+        "quality_flags": {"is_uncertain": is_uncertain, "reasons": reasons},
+        "model": config.model_id,
+        "prompt_version": PROMPT_VERSION,
+        "generated_at": utc_now_iso(),
+    }
+def _normalize_list(value: object, max_items: Optional[int] = None) -> list[str]:
+    """Coerce a list or comma-separated string into stripped, non-empty strings."""
+    if value is None:
+        items: list[str] = []
+    elif isinstance(value, list):
+        items = [str(item).strip() for item in value if str(item).strip()]
+    else:
+        items = [item.strip() for item in str(value).split(",") if item.strip()]
+    # Compare against None so that max_items=0 truncates instead of being ignored.
+    if max_items is not None:
+        items = items[:max_items]
+    return items
+def _clean_text(value: object) -> Optional[str]:
+    if value is None:
+        return None
+    text = str(value).strip()
+    return text or None
+def _unique_list(values: Iterable[str]) -> list[str]:
+    seen = set()
+    result = []
+    for value in values:
+        if value in seen:
+            continue
+        seen.add(value)
+        result.append(value)
+    return result
+def _log_error(handle, image_path: str, message: str) -> None:
+    handle.write(f"{image_path}\t{message}\n")
+    handle.flush()
+def _load_existing(path: Path) -> tuple[set[str], set[str]]:
+    paths: set[str] = set()
+    shas: set[str] = set()
+    with path.open("r", encoding="utf-8") as handle:
+        for line in handle:
+            line = line.strip()
+            if not line:
+                continue
+            try:
+                record = json.loads(line)
+            except json.JSONDecodeError:
+                continue
+            image_path = record.get("image_path")
+            image_sha = record.get("image_sha256")
+            if isinstance(image_path, str):
+                paths.add(image_path)
+            if isinstance(image_sha, str):
+                shas.add(image_sha)
+    return paths, shas
+def _write_parquet(path: Path, records: list[dict[str, object]]) -> None:
+    if not records:
+        return
+    try:
+        import pyarrow as pa
+        import pyarrow.parquet as pq
+    except ImportError:
+        logger.warning("pyarrow not installed; skipping parquet output")
+        return
+    table = pa.Table.from_pylist(records)
+    pq.write_table(table, path)
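
The tolerant parsing in `_try_parse` above can be tried in isolation. The following is a minimal sketch rather than the module's code: `extract_json_object` is a hypothetical stand-in name, the sample strings are invented, and the final `isinstance` check is an assumption about the desired behavior (only JSON objects count as labels).

```python
import json
from typing import Optional

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this example fence-safe


def extract_json_object(raw_text: str) -> Optional[dict]:
    """Hypothetical stand-in for _try_parse: strip a code fence, then fall back to the outermost braces."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, then a trailing fence if one is present.
        text = text.split("\n", 1)[-1]
        if text.endswith(FENCE):
            text = text.rsplit(FENCE, 1)[0]
        text = text.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            parsed = json.loads(text[start : end + 1])
        except json.JSONDecodeError:
            return None
    # Arrays, strings, and numbers are valid JSON but not usable label payloads.
    return parsed if isinstance(parsed, dict) else None


fenced = FENCE + 'json\n{"label": "red hat"}\n' + FENCE
chatty = 'Sure! {"label": "red hat"} Hope that helps.'
print(extract_json_object(fenced))       # {'label': 'red hat'}
print(extract_json_object(chatty))       # {'label': 'red hat'}
print(extract_json_object("[1, 2, 3]"))  # None
```

The brace fallback is deliberately coarse: it takes everything between the first `{` and the last `}`, so output containing two separate objects still fails to parse and is handed to the strict retry.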

src/labeler/prompts.py ADDED

	@@ -0,0 +1,80 @@

+from __future__ import annotations
+from dataclasses import dataclass
+from typing import Optional
+PROMPT_VERSION = "v1"
+SYSTEM_PROMPT_BASE = (
+    "You are generating labels for MapleStory item icons for CLIP training. "
+    "Return a single JSON object only. Do not output markdown or extra text. "
+    "Use item_name as the primary identifier; keep it intact. "
+    "Use metadata if provided and avoid guessing. "
+    "If uncertain, set quality_flags.is_uncertain=true and add reasons. "
+    "label_ko must be one short Korean sentence with key visual keywords. "
+    "tags_ko must be 5-15 short Korean keywords. "
+    "query_variants_ko must be 3-8 natural Korean search queries. "
+    "attributes must include colors/theme/material/vibe lists and item_type_guess string or null. "
+    "If label_en is not requested, set it to null."
+)
+SYSTEM_PROMPT_STRICT = (
+    SYSTEM_PROMPT_BASE
+    + " Output must be valid JSON with double quotes and no trailing commas."
+)
+@dataclass
+class PromptInputs:
+    item_name: str
+    item_description: Optional[str]
+    item_part: Optional[str]
+    source_type: str
+    include_image: bool
+    include_metadata: bool
+    lang: str
+def build_user_prompt(inputs: PromptInputs) -> str:
+    lines = []
+    if inputs.include_metadata:
+        lines.append(f"item_name: {inputs.item_name}")
+        if inputs.item_description:
+            lines.append(f"item_description: {inputs.item_description}")
+        else:
+            lines.append("item_description: (none)")
+        if inputs.item_part:
+            lines.append(f"item_part: {inputs.item_part}")
+        else:
+            lines.append("item_part: (none)")
+        lines.append(f"source_type: {inputs.source_type}")
+    else:
+        lines.append("metadata: (not provided)")
+        lines.append(f"source_type: {inputs.source_type}")
+    if inputs.include_image:
+        lines.append("image: provided")
+    else:
+        lines.append("image: not provided (metadata-only)")
+    lines.append(f"language: {inputs.lang}")
+    lines.append(
+        "Return JSON with keys: label_ko, label_en, tags_ko, attributes, "
+        "query_variants_ko, quality_flags."
+    )
+    return "\n".join(lines)
+def build_messages(user_prompt: str, include_image: bool, strict: bool) -> list[dict[str, object]]:
+    system_prompt = SYSTEM_PROMPT_STRICT if strict else SYSTEM_PROMPT_BASE
+    if include_image:
+        content: list[dict[str, object]] = [
+            {"type": "image"},
+            {"type": "text", "text": user_prompt},
+        ]
+    else:
+        content = [{"type": "text", "text": user_prompt}]
+    return [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": content},
+    ]

src/pipeline/__init__.py ADDED

	@@ -0,0 +1,2 @@


+__all__ = ["__version__"]
+__version__ = "0.1.0"

src/pipeline/api.py ADDED

	@@ -0,0 +1,166 @@

+from __future__ import annotations
+import asyncio
+import json
+import random
+from dataclasses import dataclass
+from typing import Any, Optional
+import httpx
+from .utils import ApiMetrics, RateLimiter
+BASE_URL = "https://open.api.nexon.com"
+class ApiError(RuntimeError):
+    def __init__(self, status_code: int, message: str, payload: Any = None) -> None:
+        super().__init__(message)
+        self.status_code = status_code
+        self.payload = payload
+class RateLimitError(ApiError):
+    def __init__(self, status_code: int, message: str, payload: Any = None, retry_after: Optional[float] = None) -> None:
+        super().__init__(status_code, message, payload)
+        self.retry_after = retry_after
+class ServerError(ApiError):
+    pass
+class DataPreparingError(ApiError):
+    pass
+class TransportError(RuntimeError):
+    pass
+@dataclass
+class ApiClient:
+    api_key: str
+    concurrency: int = 8
+    rps: float = 5.0
+    timeout_seconds: float = 30.0
+    max_attempts: int = 5
+    def __post_init__(self) -> None:
+        self._client: Optional[httpx.AsyncClient] = None
+        self._semaphore = asyncio.Semaphore(self.concurrency)
+        self._rate_limiter = RateLimiter(self.rps)
+        self.metrics = ApiMetrics()
+    async def __aenter__(self) -> "ApiClient":
+        headers = {"x-nxopen-api-key": self.api_key}
+        self._client = httpx.AsyncClient(base_url=BASE_URL, headers=headers)
+        return self
+    async def __aexit__(self, exc_type, exc, tb) -> None:
+        if self._client:
+            await self._client.aclose()
+    async def get(self, path: str, params: dict[str, Any]) -> dict[str, Any]:
+        return await self._request_json("GET", path, params=params)
+    async def _request_json(self, method: str, path: str, params: dict[str, Any]) -> dict[str, Any]:
+        attempt = 0
+        while True:
+            attempt += 1
+            try:
+                await self._rate_limiter.acquire()
+                async with self._semaphore:
+                    assert self._client is not None
+                    response = await self._client.request(
+                        method,
+                        path,
+                        params=params,
+                        timeout=self.timeout_seconds,
+                    )
+                self.metrics.total_requests += 1
+                if 200 <= response.status_code < 300:
+                    return response.json()
+                payload = _safe_json(response)
+                message = _extract_message(payload)
+                if response.status_code == 400 and _is_data_preparing(payload):
+                    self.metrics.data_preparing_hits += 1
+                    raise DataPreparingError(response.status_code, message, payload)
+                if response.status_code == 429:
+                    self.metrics.rate_limit_hits += 1
+                    retry_after = _retry_after_seconds(response)
+                    raise RateLimitError(response.status_code, message, payload, retry_after=retry_after)
+                if response.status_code >= 500:
+                    self.metrics.server_errors += 1
+                    raise ServerError(response.status_code, message, payload)
+                self.metrics.other_errors += 1
+                raise ApiError(response.status_code, message, payload)
+            except (httpx.TimeoutException, httpx.TransportError) as exc:
+                self.metrics.other_errors += 1
+                error = TransportError(str(exc))
+            except (RateLimitError, ServerError, DataPreparingError) as exc:
+                error = exc
+            except ApiError:
+                raise
+            if attempt >= self.max_attempts:
+                raise error
+            await asyncio.sleep(_compute_wait_seconds(error, attempt))
+def _safe_json(response: httpx.Response) -> Any:
+    try:
+        return response.json()
+    except ValueError:
+        # ValueError covers both json.JSONDecodeError and UnicodeDecodeError.
+        return {"message": response.text}
+def _extract_message(payload: Any) -> str:
+    if isinstance(payload, dict):
+        if isinstance(payload.get("error"), dict):
+            return payload["error"].get("message") or "API error"
+        return payload.get("message") or "API error"
+    return "API error"
+def _extract_code(payload: Any) -> Optional[str]:
+    if isinstance(payload, dict):
+        if isinstance(payload.get("error"), dict):
+            return payload["error"].get("code") or payload["error"].get("error_code")
+        return payload.get("code") or payload.get("error_code")
+    return None
+def _is_data_preparing(payload: Any) -> bool:
+    code = _extract_code(payload)
+    message = _extract_message(payload)
+    # The code comes from decoded JSON and may not be a string, so coerce before comparing.
+    if code and str(code).upper() == "OPENAPI00009":
+        return True
+    if message and "Data being prepared" in message:
+        return True
+    return False
+def _retry_after_seconds(response: httpx.Response) -> Optional[float]:
+    value = response.headers.get("Retry-After")
+    if not value:
+        return None
+    try:
+        return float(value)
+    except ValueError:
+        return None
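+def _compute_wait_seconds(error: Exception, attempt: int) -> float:
+    """Return the delay before the next retry; attempt is 1-based."""
+    # "Data being prepared" responses get a long randomized wait of 30-120 s instead of backoff.
+    if isinstance(error, DataPreparingError):
+        return random.uniform(30, 120)
+    base = 1.0
+    max_wait = 30.0
+    # Exponential backoff (1, 2, 4, ... seconds) capped at max_wait, plus up to 0.5 s of jitter.
+    wait = min(max_wait, base * (2 ** (attempt - 1)))
+    jitter = random.uniform(0, 0.5)
+    # Honor Retry-After when the server asks for a longer pause than the backoff.
+    if isinstance(error, RateLimitError) and error.retry_after:
+        return max(wait + jitter, error.retry_after)
+    return wait + jitter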
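
To see the retry schedule that `_compute_wait_seconds` produces, the sketch below repeats the same arithmetic with the jitter left out so the numbers are deterministic. `backoff_seconds` is a hypothetical helper written for this illustration. It is not part of `api.py`, and it ignores the separate `DataPreparingError` branch.

```python
from typing import Optional


def backoff_seconds(
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    max_wait: float = 30.0,
) -> float:
    """Capped exponential backoff for a 1-based attempt number, with jitter omitted."""
    wait = min(max_wait, base * (2 ** (attempt - 1)))
    # A Retry-After value only matters when it exceeds the computed backoff.
    if retry_after:
        return max(wait, retry_after)
    return wait


schedule = [backoff_seconds(n) for n in range(1, 8)]
print(schedule)                              # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
print(sum(schedule[:4]))                     # 15.0
print(backoff_seconds(1, retry_after=10.0))  # 10.0
```

With the default `max_attempts = 5`, the client sleeps after attempts 1 through 4 only, so a request that keeps failing with server or transport errors waits roughly 15 to 17 seconds in total (the real code adds up to 0.5 s of jitter per sleep) before the fifth failure is raised.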

src/pipeline/cli.py ADDED

	@@ -0,0 +1,104 @@

+from __future__ import annotations
+import asyncio
+import logging
+from pathlib import Path
+from typing import Optional
+import typer
+from .pipeline import run_pipeline
+from .utils import get_env_or_none, kst_yesterday_date, load_dotenv_if_available
+def run(
+    date: Optional[str] = typer.Option(
+        None,
+        "--date",
+        help="Target date in YYYY-MM-DD (KST). Defaults to yesterday in KST.",
+    ),
+    top: int = typer.Option(100, "--top", help="Alias for --end-rank (default 100)."),
+    start_rank: int = typer.Option(1, "--start-rank", help="Starting ranking (inclusive)."),
+    end_rank: Optional[int] = typer.Option(
+        None,
+        "--end-rank",
+        help="Ending ranking (inclusive). Defaults to --top.",
+    ),
+    download_icons: bool = typer.Option(
+        True,
+        "--download-icons/--no-download",
+        help="Download icon assets locally.",
+    ),
+    concurrency: int = typer.Option(8, "--concurrency", help="Max concurrent requests."),
+    rps: float = typer.Option(5.0, "--rps", help="Requests per second throttle."),
+    output_dir: Optional[Path] = typer.Option(
+        None,
+        "--output-dir",
+        help="Base output directory for data storage.",
+    ),
+    db_path: Optional[Path] = typer.Option(
+        None,
+        "--db-path",
+        help="Override SQLite DB path (default is output_dir/date/db.sqlite).",
+    ),
+    world_name: Optional[str] = typer.Option(None, "--world-name", help="Ranking world name filter."),
+    world_type: Optional[int] = typer.Option(None, "--world-type", help="Ranking world type filter."),
+    class_name: Optional[str] = typer.Option(None, "--class-name", help="Ranking class filter."),
+    all_presets: bool = typer.Option(
+        False,
+        "--all-presets",
+        help="Store all cash equipment presets instead of current/default.",
+    ),
+    run_id: Optional[str] = typer.Option(
+        None,
+        "--run-id",
+        help="Reuse an existing run_id to merge additional ranking ranges.",
+    ),
+    api_key: Optional[str] = typer.Option(
+        None,
+        "--api-key",
+        help="Override NEXON_API_KEY environment variable.",
+    ),
+) -> None:
+    """Collect MapleStory ranking equipment and cash icons via Nexon Open API."""
+    load_dotenv_if_available()
+    logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+    resolved_key = api_key or get_env_or_none("NEXON_API_KEY")
+    if not resolved_key:
+        typer.echo("Missing NEXON_API_KEY. Set it in the environment or pass --api-key.")
+        raise typer.Exit(code=1)
+    resolved_date = date or kst_yesterday_date()
+    resolved_output_dir = output_dir or Path(get_env_or_none("OUTPUT_DIR") or "data")
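+    # Use an explicit None check so a literal --end-rank 0 is rejected below instead of silently becoming --top.
+    resolved_end_rank = end_rank if end_rank is not None else top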
+    if start_rank < 1:
+        typer.echo("--start-rank must be >= 1")
+        raise typer.Exit(code=1)
+    if resolved_end_rank < start_rank:
+        typer.echo("--end-rank must be >= --start-rank")
+        raise typer.Exit(code=1)
+    report = asyncio.run(
+        run_pipeline(
+            api_key=resolved_key,
+            target_date=resolved_date,
+            start_rank=start_rank,
+            end_rank=resolved_end_rank,
+            download_icon_assets=download_icons,
+            output_dir=resolved_output_dir,
+            db_path=db_path,
+            concurrency=concurrency,
+            rps=rps,
+            world_name=world_name,
+            world_type=world_type,
+            class_name=class_name,
+            all_presets=all_presets,
+            run_id_override=run_id,
+        )
+    )
+    typer.echo("Run complete")
+    typer.echo(report.to_markdown())

src/pipeline/db.py ADDED

	@@ -0,0 +1,344 @@

+from __future__ import annotations
+import sqlite3
+from pathlib import Path
+from typing import Any, Iterable
+def connect(db_path: Path) -> sqlite3.Connection:
+    db_path.parent.mkdir(parents=True, exist_ok=True)
+    conn = sqlite3.connect(db_path)
+    conn.execute("PRAGMA foreign_keys = ON")
+    return conn
+def init_db(conn: sqlite3.Connection) -> None:
+    conn.executescript(
+        """
+        CREATE TABLE IF NOT EXISTS runs (
+            run_id TEXT PRIMARY KEY,
+            target_date TEXT NOT NULL,
+            created_at TEXT NOT NULL,
+            params_json TEXT NOT NULL
+        );
+        CREATE TABLE IF NOT EXISTS ranking_entries (
+            run_id TEXT NOT NULL,
+            ranking INTEGER NOT NULL,
+            character_name TEXT,
+            world_name TEXT,
+            class_name TEXT,
+            sub_class_name TEXT,
+            character_level INTEGER,
+            character_exp INTEGER,
+            character_popularity INTEGER,
+            character_guildname TEXT,
+            UNIQUE(run_id, ranking)
+        );
+        CREATE TABLE IF NOT EXISTS characters (
+            ocid TEXT PRIMARY KEY,
+            character_name TEXT,
+            first_seen_at TEXT,
+            last_seen_at TEXT
+        );
+        CREATE TABLE IF NOT EXISTS equipment_shape_items (
+            run_id TEXT NOT NULL,
+            ocid TEXT NOT NULL,
+            item_equipment_part TEXT,
+            equipment_slot TEXT,
+            item_name TEXT,
+            item_icon_url TEXT,
+            item_description TEXT,
+            item_shape_name TEXT,
+            item_shape_icon_url TEXT,
+            raw_json TEXT,
+            UNIQUE(run_id, ocid, item_equipment_part, equipment_slot)
+        );
+        CREATE TABLE IF NOT EXISTS cash_items (
+            run_id TEXT NOT NULL,
+            ocid TEXT NOT NULL,
+            preset_no INTEGER,
+            cash_item_equipment_part TEXT,
+            cash_item_equipment_slot TEXT,
+            cash_item_name TEXT,
+            cash_item_icon_url TEXT,
+            cash_item_description TEXT,
+            cash_item_label TEXT,
+            date_expire TEXT,
+            date_option_expire TEXT,
+            raw_json TEXT,
+            UNIQUE(run_id, ocid, preset_no, cash_item_equipment_part, cash_item_equipment_slot)
+        );
+        CREATE TABLE IF NOT EXISTS icon_assets (
+            url TEXT PRIMARY KEY,
+            sha256 TEXT,
+            local_path TEXT,
+            content_type TEXT,
+            byte_size INTEGER,
+            fetched_at TEXT,
+            error TEXT
+        );
+        """
+    )
+def fetch_run(conn: sqlite3.Connection, run_id: str) -> dict[str, Any] | None:
+    cursor = conn.execute(
+        "SELECT run_id, target_date, created_at, params_json FROM runs WHERE run_id = ?",
+        (run_id,),
+    )
+    row = cursor.fetchone()
+    if not row:
+        return None
+    return {
+        "run_id": row[0],
+        "target_date": row[1],
+        "created_at": row[2],
+        "params_json": row[3],
+    }
+def insert_run(
+    conn: sqlite3.Connection,
+    run_id: str,
+    target_date: str,
+    created_at: str,
+    params_json: str,
+) -> None:
+    conn.execute(
+        """
+        INSERT INTO runs (run_id, target_date, created_at, params_json)
+        VALUES (?, ?, ?, ?)
+        ON CONFLICT(run_id) DO UPDATE SET
+            target_date = excluded.target_date,
+            created_at = excluded.created_at,
+            params_json = excluded.params_json
+        """,
+        (run_id, target_date, created_at, params_json),
+    )
+def upsert_ranking_entries(
+    conn: sqlite3.Connection,
+    run_id: str,
+    entries: Iterable[dict[str, Any]],
+) -> None:
+    rows = [
+        (
+            run_id,
+            entry.get("ranking"),
+            entry.get("character_name"),
+            entry.get("world_name"),
+            entry.get("class_name"),
+            entry.get("sub_class_name"),
+            entry.get("character_level"),
+            entry.get("character_exp"),
+            entry.get("character_popularity"),
+            entry.get("character_guildname"),
+        )
+        for entry in entries
+    ]
+    conn.executemany(
+        """
+        INSERT INTO ranking_entries (
+            run_id,
+            ranking,
+            character_name,
+            world_name,
+            class_name,
+            sub_class_name,
+            character_level,
+            character_exp,
+            character_popularity,
+            character_guildname
+        )
+        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+        ON CONFLICT(run_id, ranking) DO UPDATE SET
+            character_name = excluded.character_name,
+            world_name = excluded.world_name,
+            class_name = excluded.class_name,
+            sub_class_name = excluded.sub_class_name,
+            character_level = excluded.character_level,
+            character_exp = excluded.character_exp,
+            character_popularity = excluded.character_popularity,
+            character_guildname = excluded.character_guildname
+        """,
+        rows,
+    )
+def upsert_characters(
+    conn: sqlite3.Connection,
+    rows: Iterable[dict[str, Any]],
+) -> None:
+    prepared = [
+        (
+            row.get("ocid"),
+            row.get("character_name"),
+            row.get("first_seen_at"),
+            row.get("last_seen_at"),
+        )
+        for row in rows
+    ]
+    conn.executemany(
+        """
+        INSERT INTO characters (ocid, character_name, first_seen_at, last_seen_at)
+        VALUES (?, ?, ?, ?)
+        ON CONFLICT(ocid) DO UPDATE SET
+            character_name = excluded.character_name,
+            last_seen_at = excluded.last_seen_at
+        """,
+        prepared,
+    )
+def upsert_equipment_items(
+    conn: sqlite3.Connection,
+    run_id: str,
+    rows: Iterable[dict[str, Any]],
+) -> None:
+    prepared = [
+        (
+            run_id,
+            row.get("ocid"),
+            row.get("item_equipment_part"),
+            row.get("equipment_slot"),
+            row.get("item_name"),
+            row.get("item_icon_url"),
+            row.get("item_description"),
+            row.get("item_shape_name"),
+            row.get("item_shape_icon_url"),
+            row.get("raw_json"),
+        )
+        for row in rows
+    ]
+    conn.executemany(
+        """
+        INSERT INTO equipment_shape_items (
+            run_id,
+            ocid,
+            item_equipment_part,
+            equipment_slot,
+            item_name,
+            item_icon_url,
+            item_description,
+            item_shape_name,
+            item_shape_icon_url,
+            raw_json
+        )
+        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+        ON CONFLICT(run_id, ocid, item_equipment_part, equipment_slot) DO UPDATE SET
+            item_name = excluded.item_name,
+            item_icon_url = excluded.item_icon_url,
+            item_description = excluded.item_description,
+            item_shape_name = excluded.item_shape_name,
+            item_shape_icon_url = excluded.item_shape_icon_url,
+            raw_json = excluded.raw_json
+        """,
+        prepared,
+    )
+def upsert_cash_items(
+    conn: sqlite3.Connection,
+    run_id: str,
+    rows: Iterable[dict[str, Any]],
+) -> None:
+    prepared = [
+        (
+            run_id,
+            row.get("ocid"),
+            row.get("preset_no"),
+            row.get("cash_item_equipment_part"),
+            row.get("cash_item_equipment_slot"),
+            row.get("cash_item_name"),
+            row.get("cash_item_icon_url"),
+            row.get("cash_item_description"),
+            row.get("cash_item_label"),
+            row.get("date_expire"),
+            row.get("date_option_expire"),
+            row.get("raw_json"),
+        )
+        for row in rows
+    ]
+    conn.executemany(
+        """
+        INSERT INTO cash_items (
+            run_id,
+            ocid,
+            preset_no,
+            cash_item_equipment_part,
+            cash_item_equipment_slot,
+            cash_item_name,
+            cash_item_icon_url,
+            cash_item_description,
+            cash_item_label,
+            date_expire,
+            date_option_expire,
+            raw_json
+        )
+        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+        ON CONFLICT(run_id, ocid, preset_no, cash_item_equipment_part, cash_item_equipment_slot)
+        DO UPDATE SET
+            cash_item_name = excluded.cash_item_name,
+            cash_item_icon_url = excluded.cash_item_icon_url,
+            cash_item_description = excluded.cash_item_description,
+            cash_item_label = excluded.cash_item_label,
+            date_expire = excluded.date_expire,
+            date_option_expire = excluded.date_option_expire,
+            raw_json = excluded.raw_json
+        """,
+        prepared,
+    )
+def fetch_icon_assets(
+    conn: sqlite3.Connection,
+    urls: list[str],
+) -> dict[str, dict[str, Any]]:
+    if not urls:
+        return {}
+    placeholders = ",".join(["?"] * len(urls))
+    query = f"SELECT url, sha256, local_path, error FROM icon_assets WHERE url IN ({placeholders})"
+    cursor = conn.execute(query, urls)
+    result = {}
+    for row in cursor.fetchall():
+        result[row[0]] = {"sha256": row[1], "local_path": row[2], "error": row[3]}
+    return result
+def upsert_icon_asset(conn: sqlite3.Connection, record: dict[str, Any]) -> None:
+    conn.execute(
+        """
+        INSERT INTO icon_assets (
+            url,
+            sha256,
+            local_path,
+            content_type,
+            byte_size,
+            fetched_at,
+            error
+        )
+        VALUES (?, ?, ?, ?, ?, ?, ?)
+        ON CONFLICT(url) DO UPDATE SET
+            sha256 = excluded.sha256,
+            local_path = excluded.local_path,
+            content_type = excluded.content_type,
+            byte_size = excluded.byte_size,
+            fetched_at = excluded.fetched_at,
+            error = excluded.error
+        """,
+        (
+            record.get("url"),
+            record.get("sha256"),
+            record.get("local_path"),
+            record.get("content_type"),
+            record.get("byte_size"),
+            record.get("fetched_at"),
+            record.get("error"),
+        ),
+    )
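
The upserts above rely on SQLite's `ON CONFLICT ... DO UPDATE` clause to make reruns idempotent. The sketch below exercises that pattern on an in-memory database with cut-down versions of two tables. The trimmed column lists and the sample rows are invented for illustration. The second half shows a SQLite behavior worth keeping in mind: `NULL` values are treated as distinct inside a `UNIQUE` constraint, so a row whose key column is `NULL` never conflicts. Whether the pipeline ever stores `NULL` in `preset_no` or the slot columns is not visible in this diff, so treat that part as a caveat to check rather than a confirmed bug.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ranking_entries ("
    "run_id TEXT NOT NULL, ranking INTEGER NOT NULL, character_name TEXT, "
    "UNIQUE(run_id, ranking))"
)
RANKING_UPSERT = """
INSERT INTO ranking_entries (run_id, ranking, character_name)
VALUES (?, ?, ?)
ON CONFLICT(run_id, ranking) DO UPDATE SET
    character_name = excluded.character_name
"""
# Running the same upsert twice leaves one row per (run_id, ranking) with the latest values.
conn.executemany(RANKING_UPSERT, [("run-a", 1, "Alpha"), ("run-a", 2, "Beta")])
conn.executemany(RANKING_UPSERT, [("run-a", 1, "Alpha"), ("run-a", 2, "Gamma")])
rows = conn.execute(
    "SELECT ranking, character_name FROM ranking_entries ORDER BY ranking"
).fetchall()
print(rows)  # [(1, 'Alpha'), (2, 'Gamma')]

conn.execute(
    "CREATE TABLE cash_items ("
    "run_id TEXT NOT NULL, ocid TEXT NOT NULL, preset_no INTEGER, cash_item_name TEXT, "
    "UNIQUE(run_id, ocid, preset_no))"
)
CASH_UPSERT = """
INSERT INTO cash_items (run_id, ocid, preset_no, cash_item_name)
VALUES (?, ?, ?, ?)
ON CONFLICT(run_id, ocid, preset_no) DO UPDATE SET
    cash_item_name = excluded.cash_item_name
"""
# A NULL key column defeats the UNIQUE constraint, so the second insert adds a duplicate.
for _ in range(2):
    conn.execute(CASH_UPSERT, ("run-a", "ocid-1", None, "Hat"))
null_key_count = conn.execute("SELECT COUNT(*) FROM cash_items").fetchone()[0]
print(null_key_count)  # 2
```

If `NULL` keys can occur, one option is to store a sentinel such as an empty string or `0` in the key columns so the conflict target always matches.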

src/pipeline/downloader.py ADDED

	@@ -0,0 +1,163 @@

+from __future__ import annotations
+import asyncio
+import hashlib
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Optional
+import httpx
+from .utils import DownloadResult, RateLimiter, guess_extension, random_wait, utc_now_iso
+@dataclass
+class DownloadRecord:
+    url: str
+    sha256: Optional[str]
+    local_path: Optional[str]
+    content_type: Optional[str]
+    byte_size: Optional[int]
+    fetched_at: Optional[str]
+    error: Optional[str]
+async def download_icons(
+    urls_by_category: dict[str, str],  # despite the name, maps icon URL -> category
+    output_root: Path,
+    icons_root: Path,
+    existing_assets: dict[str, dict[str, Any]],
+    rps: float,
+    concurrency: int,
+    max_attempts: int = 5,
+) -> tuple[list[DownloadRecord], DownloadResult]:
+    limiter = RateLimiter(rps)
+    semaphore = asyncio.Semaphore(concurrency)
+    results: list[DownloadRecord] = []
+    counters = DownloadResult()
+    async with httpx.AsyncClient() as client:
+        tasks = []
+        for url, category in urls_by_category.items():
+            if not url:
+                continue
+            existing = existing_assets.get(url)
+            if existing and existing.get("sha256"):
+                counters.skipped += 1
+                continue
+            tasks.append(
+                asyncio.create_task(
+                    _download_one(
+                        client,
+                        limiter,
+                        semaphore,
+                        url,
+                        category,
+                        output_root,
+                        icons_root,
+                        max_attempts,
+                    )
+                )
+            )
+        if tasks:
+            completed = await asyncio.gather(*tasks, return_exceptions=True)
+            for item in completed:
+                # gather(return_exceptions=True) can also return CancelledError, a BaseException.
+                if isinstance(item, BaseException):
+                    counters.failed += 1
+                    results.append(
+                        DownloadRecord(
+                            url="unknown",
+                            sha256=None,
+                            local_path=None,
+                            content_type=None,
+                            byte_size=None,
+                            fetched_at=utc_now_iso(),
+                            error=str(item),
+                        )
+                    )
+                    continue
+                record, status = item
+                results.append(record)
+                if status == "downloaded":
+                    counters.downloaded += 1
+                elif status == "failed":
+                    counters.failed += 1
+    return results, counters
+async def _download_one(
+    client: httpx.AsyncClient,
+    limiter: RateLimiter,
+    semaphore: asyncio.Semaphore,
+    url: str,
+    category: str,
+    output_root: Path,
+    icons_root: Path,
+    max_attempts: int,
+) -> tuple[DownloadRecord, str]:
+    attempt = 0
+    while True:
+        attempt += 1
+        try:
+            await limiter.acquire()
+            async with semaphore:
+                response = await client.get(url, timeout=30)
+            if 200 <= response.status_code < 300:
+                content = response.content
+                sha256 = hashlib.sha256(content).hexdigest()
+                extension = guess_extension(response.headers.get("content-type"), url)
+                target_dir = icons_root / category
+                target_dir.mkdir(parents=True, exist_ok=True)
+                filename = f"{sha256}{extension}"
+                path = target_dir / filename
+                if not path.exists():
+                    path.write_bytes(content)
+                local_path = str(path.relative_to(output_root))
+                return (
+                    DownloadRecord(
+                        url=url,
+                        sha256=sha256,
+                        local_path=local_path,
+                        content_type=response.headers.get("content-type"),
+                        byte_size=len(content),
+                        fetched_at=utc_now_iso(),
+                        error=None,
+                    ),
+                    "downloaded",
+                )
+            if response.status_code == 429 or response.status_code >= 500:
+                raise RuntimeError(f"HTTP {response.status_code}")
+            return (
+                DownloadRecord(
+                    url=url,
+                    sha256=None,
+                    local_path=None,
+                    content_type=response.headers.get("content-type"),
+                    byte_size=None,
+                    fetched_at=utc_now_iso(),
+                    error=f"HTTP {response.status_code}",
+                ),
+                "failed",
+            )
+        except (httpx.TimeoutException, httpx.TransportError, RuntimeError) as exc:
+            if attempt >= max_attempts:
+                return (
+                    DownloadRecord(
+                        url=url,
+                        sha256=None,
+                        local_path=None,
+                        content_type=None,
+                        byte_size=None,
+                        fetched_at=utc_now_iso(),
+                        error=str(exc),
+                    ),
+                    "failed",
+                )
+            await asyncio.sleep(_download_backoff(attempt))
+def _download_backoff(attempt: int) -> float:
+    """Exponential backoff (1s, 2s, 4s, ... capped at 20s) plus up to 0.5s of random jitter."""
+    base = 1.0
+    max_wait = 20.0
+    wait = min(max_wait, base * (2 ** (attempt - 1)))
+    return wait + random_wait(0, 0.5)

src/pipeline/parsers.py ADDED

	@@ -0,0 +1,102 @@

+from __future__ import annotations
+from typing import Any
+from .utils import json_dumps, to_int
+def extract_equipment_items(data: dict[str, Any], ocid: str) -> list[dict[str, Any]]:
+    items = []
+    for item in data.get("item_equipment", []) or []:
+        items.append(
+            {
+                "ocid": ocid,
+                "item_equipment_part": _coalesce(item.get("item_equipment_part")),
+                "equipment_slot": _coalesce(item.get("equipment_slot")),
+                "item_name": item.get("item_name"),
+                "item_icon_url": item.get("item_icon"),
+                "item_description": item.get("item_description"),
+                "item_shape_name": item.get("item_shape_name"),
+                "item_shape_icon_url": item.get("item_shape_icon"),
+                "raw_json": json_dumps(item),
+            }
+        )
+    return items
+def extract_cash_items(
+    data: dict[str, Any],
+    ocid: str,
+    all_presets: bool,
+) -> list[dict[str, Any]]:
+    presets = _extract_presets(data)
+    if not presets:
+        return []
+    if all_presets:
+        selected = presets
+    else:
+        selected = _select_current_or_default_presets(data, presets)
+    items: list[dict[str, Any]] = []
+    for preset_no, preset_items in selected:
+        for item in preset_items:
+            items.append(
+                {
+                    "ocid": ocid,
+                    "preset_no": preset_no,
+                    "cash_item_equipment_part": _coalesce(item.get("cash_item_equipment_part")),
+                    "cash_item_equipment_slot": _coalesce(item.get("cash_item_equipment_slot")),
+                    "cash_item_name": item.get("cash_item_name"),
+                    "cash_item_icon_url": item.get("cash_item_icon"),
+                    "cash_item_description": item.get("cash_item_description"),
+                    "cash_item_label": item.get("cash_item_label"),
+                    "date_expire": item.get("date_expire"),
+                    "date_option_expire": item.get("date_option_expire"),
+                    "raw_json": json_dumps(item),
+                }
+            )
+    return items
+def _extract_presets(data: dict[str, Any]) -> list[tuple[int, list[dict[str, Any]]]]:
+    presets: list[tuple[int, list[dict[str, Any]]]] = []
+    for key, value in data.items():
+        if not key.startswith("cash_item_equipment_preset_"):
+            continue
+        if not isinstance(value, list):
+            continue
+        preset_no = _parse_preset_no(key)
+        if preset_no is None:
+            continue
+        presets.append((preset_no, value))
+    presets.sort(key=lambda item: item[0])
+    return presets
+def _parse_preset_no(key: str) -> int | None:
+    try:
+        return int(key.rsplit("_", 1)[-1])
+    except (IndexError, ValueError):
+        return None
+def _coalesce(value: Any) -> str:
+    return value if value is not None else ""
+def _select_current_or_default_presets(
+    data: dict[str, Any],
+    presets: list[tuple[int, list[dict[str, Any]]]],
+) -> list[tuple[int, list[dict[str, Any]]]]:
+    """Return the preset named by `preset_no`; else preset 1; else every preset found."""
+    current = to_int(data.get("preset_no"))
+    if current is not None:
+        for preset_no, items in presets:
+            if preset_no == current:
+                return [(preset_no, items)]
+    for preset_no, items in presets:
+        if preset_no == 1:
+            return [(preset_no, items)]
+    return presets

src/pipeline/pipeline.py ADDED

	@@ -0,0 +1,378 @@

+from __future__ import annotations
+import asyncio
+import json
+import logging
+import time
+from pathlib import Path
+from typing import Any, Iterable
+from . import db
+from .api import ApiClient
+from .downloader import download_icons
+from .parsers import extract_cash_items, extract_equipment_items
+from .utils import (
+    PipelineReport,
+    compute_run_id,
+    ensure_dir,
+    get_env_or_none,
+    kst_yesterday_date,
+    safe_filename,
+    to_int,
+    utc_now_iso,
+    write_json,
+)
+logger = logging.getLogger("pipeline")
+async def run_pipeline(
+    *,
+    api_key: str,
+    target_date: str | None,
+    start_rank: int,
+    end_rank: int,
+    download_icon_assets: bool,
+    output_dir: Path,
+    db_path: Path | None,
+    concurrency: int,
+    rps: float,
+    world_name: str | None,
+    world_type: int | None,
+    class_name: str | None,
+    all_presets: bool,
+    run_id_override: str | None,
+) -> PipelineReport:
+    start = time.monotonic()
+    resolved_date = target_date or kst_yesterday_date()
+    output_root = output_dir / resolved_date
+    raw_root = output_root / "raw"
+    raw_ocid = raw_root / "ocid"
+    raw_item = raw_root / "item_equipment"
+    raw_cash = raw_root / "cashitem_equipment"
+    icons_root = output_root / "icons"
+    ensure_dir(raw_ocid)
+    ensure_dir(raw_item)
+    ensure_dir(raw_cash)
+    ensure_dir(icons_root / "equipment_shape")
+    ensure_dir(icons_root / "cash")
+    resolved_db_path = db_path or Path(get_env_or_none("DB_PATH") or output_root / "db.sqlite")
+    conn = db.connect(resolved_db_path)
+    db.init_db(conn)
+    run_params = {
+        "start_rank": start_rank,
+        "end_rank": end_rank,
+        "world_name": world_name,
+        "world_type": world_type,
+        "class_name": class_name,
+        "all_presets": all_presets,
+    }
+    run_id = run_id_override or compute_run_id(resolved_date, run_params)
+    existing_run = db.fetch_run(conn, run_id)
+    if existing_run:
+        if existing_run["target_date"] != resolved_date:
+            raise ValueError(
+                f"run_id {run_id} target_date mismatch ({existing_run['target_date']} != {resolved_date})"
+            )
+    else:
+        db.insert_run(conn, run_id, resolved_date, utc_now_iso(), json.dumps(run_params, ensure_ascii=False))
+        conn.commit()
+    equipment_items: list[dict[str, Any]] = []
+    cash_items: list[dict[str, Any]] = []
+    ocid_results: list[dict[str, Any]] = []
+    async with ApiClient(api_key=api_key, concurrency=concurrency, rps=rps) as api:
+        ranking_entries, ranking_raw = await _fetch_ranking_entries(
+            api=api,
+            target_date=resolved_date,
+            start_rank=start_rank,
+            end_rank=end_rank,
+            world_name=world_name,
+            world_type=world_type,
+            class_name=class_name,
+        )
+        write_json(raw_root / "ranking_overall.json", ranking_raw)
+        db.upsert_ranking_entries(conn, run_id, ranking_entries)
+        conn.commit()
+        ocid_results = await _fetch_ocids(api, ranking_entries, raw_ocid)
+        now_iso = utc_now_iso()
+        character_rows = [
+            {
+                "ocid": row["ocid"],
+                "character_name": row["character_name"],
+                "first_seen_at": now_iso,
+                "last_seen_at": now_iso,
+            }
+            for row in ocid_results
+            if row.get("ocid")
+        ]
+        if character_rows:
+            db.upsert_characters(conn, character_rows)
+            conn.commit()
+        ocids = [row["ocid"] for row in ocid_results if row.get("ocid")]
+        equipment_items = await _fetch_equipment(api, ocids, resolved_date, raw_item)
+        if equipment_items:
+            db.upsert_equipment_items(conn, run_id, equipment_items)
+            conn.commit()
+        cash_items = await _fetch_cash_items(api, ocids, resolved_date, raw_cash, all_presets)
+        if cash_items:
+            db.upsert_cash_items(conn, run_id, cash_items)
+            conn.commit()
+        metrics = api.metrics
+    icon_results = None
+    download_counts = None
+    if download_icon_assets:
+        urls_by_category = _collect_icon_urls(equipment_items, cash_items)
+        existing = db.fetch_icon_assets(conn, list(urls_by_category.keys()))
+        icon_results, download_counts = await download_icons(
+            urls_by_category=urls_by_category,
+            output_root=output_root,
+            icons_root=icons_root,
+            existing_assets=existing,
+            rps=rps,
+            concurrency=concurrency,
+        )
+        for record in icon_results:
+            db.upsert_icon_asset(conn, record.__dict__)
+        conn.commit()
+    elapsed = time.monotonic() - start
+    report = PipelineReport(
+        run_id=run_id,
+        target_date=resolved_date,
+        start_rank=start_rank,
+        end_rank=end_rank,
+        ranking_count=len(ranking_entries),
+        ocid_count=len(ocid_results),
+        equipment_items_count=len(equipment_items),
+        cash_items_count=len(cash_items),
+        icons_downloaded=download_counts.downloaded if download_counts else 0,
+        icons_skipped=download_counts.skipped if download_counts else 0,
+        icons_failed=download_counts.failed if download_counts else 0,
+        rate_limit_hits=metrics.rate_limit_hits,
+        server_errors=metrics.server_errors,
+        data_preparing_hits=metrics.data_preparing_hits,
+        elapsed_seconds=elapsed,
+    )
+    report_path = output_root / "README_run.md"
+    report_path.write_text(report.to_markdown(), encoding="utf-8")
+    conn.close()
+    return report
+async def _fetch_ranking_entries(
+    *,
+    api: ApiClient,
+    target_date: str,
+    start_rank: int,
+    end_rank: int,
+    world_name: str | None,
+    world_type: int | None,
+    class_name: str | None,
+) -> tuple[list[dict[str, Any]], dict[str, Any]]:
+    page = 1
+    collected: dict[int, dict[str, Any]] = {}
+    pages: list[dict[str, Any]] = []
+    # Safety cap so an unexpected response shape cannot cause unbounded paging.
+    max_pages = 50
+    target_count = end_rank - start_rank + 1
+    while len(collected) < target_count and page <= max_pages:
+        params = {"date": target_date, "page": page}
+        if world_name:
+            params["world_name"] = world_name
+        if world_type is not None:
+            params["world_type"] = world_type
+        if class_name:
+            params["character_class"] = class_name
+        data = await api.get("/maplestory/v1/ranking/overall", params=params)
+        pages.append({"page": page, "data": data})
+        ranking_list = data.get("ranking") or []
+        if not ranking_list:
+            break
+        new_count = 0
+        max_rank_in_page = None
+        for entry in ranking_list:
+            ranking = to_int(entry.get("ranking"))
+            if ranking is None:
+                continue
+            max_rank_in_page = ranking if max_rank_in_page is None else max(max_rank_in_page, ranking)
+            if ranking < start_rank or ranking > end_rank:
+                continue
+            if ranking not in collected:
+                collected[ranking] = {
+                    "ranking": ranking,
+                    "character_name": entry.get("character_name"),
+                    "world_name": entry.get("world_name"),
+                    "class_name": entry.get("class_name"),
+                    "sub_class_name": entry.get("sub_class_name"),
+                    "character_level": to_int(entry.get("character_level")),
+                    "character_exp": to_int(entry.get("character_exp")),
+                    "character_popularity": to_int(entry.get("character_popularity")),
+                    "character_guildname": entry.get("character_guildname"),
+                }
+                new_count += 1
+        if len(collected) >= target_count:
+            break
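+        if max_rank_in_page is not None and max_rank_in_page < start_rank:
+            # Every rank on this page is numerically smaller than start_rank; keep paging.
+            page += 1
+            continue
+        if new_count == 0:
+            # Nothing on this page fell inside the requested range (or all were duplicates); stop paging.
+            logger.warning("No new ranking entries found on page %s", page)
+            break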
+        page += 1
+    entries = [collected[key] for key in sorted(collected.keys())]
+    raw = {
+        "target_date": target_date,
+        "start_rank": start_rank,
+        "end_rank": end_rank,
+        "pages": pages,
+        "collected_count": len(entries),
+    }
+    if len(entries) < target_count:
+        logger.warning("Ranking entries collected %s < requested %s", len(entries), target_count)
+    return entries, raw
+async def _fetch_ocids(
+    api: ApiClient,
+    ranking_entries: Iterable[dict[str, Any]],
+    raw_dir: Path,
+) -> list[dict[str, Any]]:
+    tasks = []
+    for entry in ranking_entries:
+        character_name = entry.get("character_name")
+        rank = entry.get("ranking")
+        if not character_name:
+            continue
+        tasks.append(
+            asyncio.create_task(_fetch_single_ocid(api, character_name, rank, raw_dir))
+        )
+    results: list[dict[str, Any]] = []
+    if tasks:
+        completed = await asyncio.gather(*tasks, return_exceptions=True)
+        for item in completed:
+            # gather(return_exceptions=True) can also return CancelledError, which is a BaseException.
+            if isinstance(item, BaseException):
+                logger.warning("OCID fetch failed: %s", item)
+                continue
+            if item:
+                results.append(item)
+    return results
+async def _fetch_single_ocid(
+    api: ApiClient,
+    character_name: str,
+    rank: int | None,
+    raw_dir: Path,
+) -> dict[str, Any] | None:
+    data = await api.get("/maplestory/v1/id", params={"character_name": character_name})
+    ocid = data.get("ocid")
+    filename = f"{rank:03d}" if rank is not None else "unknown"
+    filename = f"{filename}_{safe_filename(character_name)}.json"
+    write_json(raw_dir / filename, data)
+    if not ocid:
+        logger.warning("No OCID for %s", character_name)
+        return None
+    return {"ocid": ocid, "character_name": character_name}
+async def _fetch_equipment(
+    api: ApiClient,
+    ocids: list[str],
+    target_date: str,
+    raw_dir: Path,
+) -> list[dict[str, Any]]:
+    tasks = [
+        asyncio.create_task(_fetch_single_equipment(api, ocid, target_date, raw_dir))
+        for ocid in ocids
+    ]
+    results: list[dict[str, Any]] = []
+    if tasks:
+        completed = await asyncio.gather(*tasks, return_exceptions=True)
+        for item in completed:
+            if isinstance(item, BaseException):
+                logger.warning("Equipment fetch failed: %s", item)
+                continue
+            results.extend(item)
+    return results
+async def _fetch_single_equipment(
+    api: ApiClient,
+    ocid: str,
+    target_date: str,
+    raw_dir: Path,
+) -> list[dict[str, Any]]:
+    data = await api.get(
+        "/maplestory/v1/character/item-equipment",
+        params={"ocid": ocid, "date": target_date},
+    )
+    write_json(raw_dir / f"{ocid}.json", data)
+    return extract_equipment_items(data, ocid)
+async def _fetch_cash_items(
+    api: ApiClient,
+    ocids: list[str],
+    target_date: str,
+    raw_dir: Path,
+    all_presets: bool,
+) -> list[dict[str, Any]]:
+    tasks = [
+        asyncio.create_task(_fetch_single_cash_item(api, ocid, target_date, raw_dir, all_presets))
+        for ocid in ocids
+    ]
+    results: list[dict[str, Any]] = []
+    if tasks:
+        completed = await asyncio.gather(*tasks, return_exceptions=True)
+        for item in completed:
+            if isinstance(item, BaseException):
+                logger.warning("Cash item fetch failed: %s", item)
+                continue
+            results.extend(item)
+    return results
+async def _fetch_single_cash_item(
+    api: ApiClient,
+    ocid: str,
+    target_date: str,
+    raw_dir: Path,
+    all_presets: bool,
+) -> list[dict[str, Any]]:
+    data = await api.get(
+        "/maplestory/v1/character/cashitem-equipment",
+        params={"ocid": ocid, "date": target_date},
+    )
+    write_json(raw_dir / f"{ocid}.json", data)
+    return extract_cash_items(data, ocid, all_presets=all_presets)
+def _collect_icon_urls(
+    equipment_items: Iterable[dict[str, Any]],
+    cash_items: Iterable[dict[str, Any]],
+) -> dict[str, str]:
+    urls: dict[str, str] = {}
+    for item in equipment_items:
+        url = item.get("item_shape_icon_url")
+        if url and url not in urls:
+            urls[url] = "equipment_shape"
+    for item in cash_items:
+        url = item.get("cash_item_icon_url")
+        if url and url not in urls:
+            urls[url] = "cash"
+    return urls

src/pipeline/utils.py ADDED

	@@ -0,0 +1,177 @@

+from __future__ import annotations
+import hashlib
+import json
+import mimetypes
+import os
+import random
+import time
+from dataclasses import dataclass
+from datetime import datetime, timedelta, timezone
+from pathlib import Path
+from typing import Any, Optional
+from urllib.parse import quote
+from zoneinfo import ZoneInfo
+def load_dotenv_if_available() -> bool:
+    try:
+        from dotenv import load_dotenv
+    except ImportError:
+        return False
+    env_path = Path(".env")
+    if env_path.exists():
+        load_dotenv(dotenv_path=env_path)
+        return True
+    return False
+def kst_yesterday_date() -> str:
+    tz = ZoneInfo("Asia/Seoul")
+    now = datetime.now(tz)
+    yesterday = (now - timedelta(days=1)).date()
+    return yesterday.isoformat()
+def utc_now_iso() -> str:
+    return datetime.now(timezone.utc).isoformat()
+def safe_filename(value: str) -> str:
+    if not value:
+        return "unknown"
+    return quote(value, safe="-_.")
+def json_dumps(value: Any) -> str:
+    return json.dumps(value, ensure_ascii=False, separators=(",", ":"))
+def ensure_dir(path: Path) -> None:
+    path.mkdir(parents=True, exist_ok=True)
+def write_json(path: Path, data: Any) -> None:
+    ensure_dir(path.parent)
+    path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
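+def compute_run_id(target_date: str, params: dict[str, Any]) -> str:
+    """Deterministic run ID: the target date plus the first 12 hex chars of the SHA-256 of the sorted params."""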
+    payload = json.dumps(params, sort_keys=True, ensure_ascii=True)
+    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
+    return f"{target_date}-{digest}"
+def to_int(value: Any) -> Optional[int]:
+    if value is None:
+        return None
+    try:
+        return int(value)
+    except (TypeError, ValueError):
+        return None
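+def guess_extension(content_type: Optional[str], url: str) -> str:
+    if content_type:
+        ext = mimetypes.guess_extension(content_type.split(";")[0].strip())
+        if ext:
+            return ext
+    # Take the suffix from the URL path only, so a query string or fragment
+    # (e.g. "icon.png?v=2") does not leak into the file extension.
+    from urllib.parse import urlsplit
+    suffix = Path(urlsplit(url).path).suffix
+    if suffix:
+        return suffix
+    return ".bin"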
+def random_wait(min_seconds: float, max_seconds: float) -> float:
+    return random.uniform(min_seconds, max_seconds)
+@dataclass
+class RateLimiter:
+    rps: float
+    def __post_init__(self) -> None:
+        self._lock = None
+        self._next_time: Optional[float] = None
+    async def acquire(self) -> None:
+        if self.rps <= 0:
+            return  # rate limiting disabled
+        import asyncio  # local import kept from the original; one import covers Lock and sleep
+        if self._lock is None:
+            self._lock = asyncio.Lock()
+        async with self._lock:
+            now = time.monotonic()
+            min_interval = 1 / self.rps
+            if self._next_time is None:
+                self._next_time = now
+            # Wait until the next free slot, then reserve the slot after it.
+            sleep_for = self._next_time - now
+            if sleep_for > 0:
+                await asyncio.sleep(sleep_for)
+                now = time.monotonic()
+            self._next_time = max(now, self._next_time) + min_interval
+@dataclass
+class DownloadResult:
+    downloaded: int = 0
+    skipped: int = 0
+    failed: int = 0
+@dataclass
+class ApiMetrics:
+    total_requests: int = 0
+    rate_limit_hits: int = 0
+    server_errors: int = 0
+    data_preparing_hits: int = 0
+    other_errors: int = 0
+@dataclass
+class PipelineReport:
+    run_id: str
+    target_date: str
+    start_rank: int
+    end_rank: int
+    ranking_count: int
+    ocid_count: int
+    equipment_items_count: int
+    cash_items_count: int
+    icons_downloaded: int
+    icons_skipped: int
+    icons_failed: int
+    rate_limit_hits: int
+    server_errors: int
+    data_preparing_hits: int
+    elapsed_seconds: float
+    def to_markdown(self) -> str:
+        return "\n".join(
+            [
+                f"Run ID: {self.run_id}",
+                f"Target date (KST): {self.target_date}",
+                f"Rank range: {self.start_rank}-{self.end_rank}",
+                f"Ranking entries: {self.ranking_count}",
+                f"OCIDs resolved: {self.ocid_count}",
+                f"Equipment shape items: {self.equipment_items_count}",
+                f"Cash items: {self.cash_items_count}",
+                f"Icons downloaded: {self.icons_downloaded}",
+                f"Icons skipped: {self.icons_skipped}",
+                f"Icons failed: {self.icons_failed}",
+                f"429 retries: {self.rate_limit_hits}",
+                f"5xx retries: {self.server_errors}",
+                f"Data preparing retries: {self.data_preparing_hits}",
+                f"Elapsed seconds: {self.elapsed_seconds:.2f}",
+            ]
+        )
+def get_env_or_none(key: str) -> Optional[str]:
+    value = os.getenv(key)
+    return value if value else None

tests/test_db_idempotent.py ADDED

	@@ -0,0 +1,54 @@

+from pathlib import Path
+from pipeline import db
+def test_db_idempotent_inserts(tmp_path: Path) -> None:
+    db_path = tmp_path / "test.sqlite"
+    conn = db.connect(db_path)
+    db.init_db(conn)
+    run_id = "2025-01-01-acde"
+    db.insert_run(conn, run_id, "2025-01-01", "2025-01-02T00:00:00Z", "{}")
+    equipment = [
+        {
+            "ocid": "ocid-1",
+            "item_equipment_part": "head",
+            "equipment_slot": "slot",
+            "item_name": "Hat",
+            "item_icon_url": "http://example.com/hat.png",
+            "item_description": "desc",
+            "item_shape_name": "Shape Hat",
+            "item_shape_icon_url": "http://example.com/shape.png",
+            "raw_json": "{}",
+        }
+    ]
+    cash = [
+        {
+            "ocid": "ocid-1",
+            "preset_no": 1,
+            "cash_item_equipment_part": "hat",
+            "cash_item_equipment_slot": "slot",
+            "cash_item_name": "Cash Hat",
+            "cash_item_icon_url": "http://example.com/cash.png",
+            "cash_item_description": "desc",
+            "cash_item_label": "label",
+            "date_expire": None,
+            "date_option_expire": None,
+            "raw_json": "{}",
+        }
+    ]
+    db.upsert_equipment_items(conn, run_id, equipment)
+    db.upsert_equipment_items(conn, run_id, equipment)
+    db.upsert_cash_items(conn, run_id, cash)
+    db.upsert_cash_items(conn, run_id, cash)
+    conn.commit()
+    eq_count = conn.execute("SELECT COUNT(*) FROM equipment_shape_items").fetchone()[0]
+    cash_count = conn.execute("SELECT COUNT(*) FROM cash_items").fetchone()[0]
+    assert eq_count == 1
+    assert cash_count == 1

tests/test_parsers.py ADDED

	@@ -0,0 +1,96 @@

+from pipeline.parsers import extract_cash_items, extract_equipment_items
+def test_extract_equipment_items_maps_shape_icon():
+    data = {
+        "item_equipment": [
+            {
+                "item_equipment_part": "head",
+                "equipment_slot": "slot1",
+                "item_name": "Test Hat",
+                "item_icon": "http://example.com/icon.png",
+                "item_description": "desc",
+                "item_shape_name": "Shape Hat",
+                "item_shape_icon": "http://example.com/shape.png",
+            }
+        ]
+    }
+    items = extract_equipment_items(data, "ocid-1")
+    assert len(items) == 1
+    assert items[0]["item_shape_icon_url"] == "http://example.com/shape.png"
+    assert items[0]["item_icon_url"] == "http://example.com/icon.png"
+    assert items[0]["ocid"] == "ocid-1"
+def test_extract_cash_items_selects_current_preset():
+    data = {
+        "preset_no": 2,
+        "cash_item_equipment_preset_1": [
+            {
+                "cash_item_equipment_part": "hat",
+                "cash_item_equipment_slot": "slot1",
+                "cash_item_name": "Hat 1",
+                "cash_item_icon": "http://example.com/hat1.png",
+            }
+        ],
+        "cash_item_equipment_preset_2": [
+            {
+                "cash_item_equipment_part": "hat",
+                "cash_item_equipment_slot": "slot2",
+                "cash_item_name": "Hat 2",
+                "cash_item_icon": "http://example.com/hat2.png",
+            }
+        ],
+    }
+    items = extract_cash_items(data, "ocid-1", all_presets=False)
+    assert len(items) == 1
+    assert items[0]["preset_no"] == 2
+    assert items[0]["cash_item_icon_url"] == "http://example.com/hat2.png"
+def test_extract_cash_items_defaults_to_preset1():
+    data = {
+        "cash_item_equipment_preset_1": [
+            {
+                "cash_item_equipment_part": "hat",
+                "cash_item_equipment_slot": "slot1",
+                "cash_item_name": "Hat 1",
+                "cash_item_icon": "http://example.com/hat1.png",
+            }
+        ],
+        "cash_item_equipment_preset_2": [
+            {
+                "cash_item_equipment_part": "hat",
+                "cash_item_equipment_slot": "slot2",
+                "cash_item_name": "Hat 2",
+                "cash_item_icon": "http://example.com/hat2.png",
+            }
+        ],
+    }
+    items = extract_cash_items(data, "ocid-1", all_presets=False)
+    assert len(items) == 1
+    assert items[0]["preset_no"] == 1
+def test_extract_cash_items_all_presets():
+    data = {
+        "cash_item_equipment_preset_1": [
+            {
+                "cash_item_equipment_part": "hat",
+                "cash_item_equipment_slot": "slot1",
+                "cash_item_name": "Hat 1",
+                "cash_item_icon": "http://example.com/hat1.png",
+            }
+        ],
+        "cash_item_equipment_preset_2": [
+            {
+                "cash_item_equipment_part": "hat",
+                "cash_item_equipment_slot": "slot2",
+                "cash_item_name": "Hat 2",
+                "cash_item_icon": "http://example.com/hat2.png",
+            }
+        ],
+    }
+    items = extract_cash_items(data, "ocid-1", all_presets=True)
+    assert len(items) == 2
+    assert {item["preset_no"] for item in items} == {1, 2}