Hugging Face Processor Practice

This project collects practice code for the Hugging Face Tokenizer, ImageProcessor, and Processor APIs, organized so that it is easy to run in VS Code.

Reference page: https://ds31x.github.io/wiki/hf/hf_processor/

Project layout

hf_processor_vscode/
├── README.md
├── requirements.txt
├── pyproject.toml
├── scripts/
│   ├── 00_check_environment.py
│   ├── 01_tokenizer.py
│   ├── 02_image_processor.py
│   ├── 03_processor_clip.py
│   ├── 04_custom_image_processor_roundtrip.py
│   └── run_all.py
├── src/
│   └── hf_processor_practice/
│       ├── image_processing_simple_vision.py
│       └── utils.py
├── notebooks/
│   └── hf_processor_practice.ipynb
├── reports/
│   ├── result_analysis.md
│   └── agent_log.md
├── data/
└── outputs/

Setting up the environment

Option 1: venv

python -m venv .venv

# Windows PowerShell
.venv\Scripts\Activate.ps1

# macOS/Linux
source .venv/bin/activate

pip install -U pip
pip install -r requirements.txt
pip install -e .

Option 2: conda

conda create -n hfprocessor python=3.11 -y
conda activate hfprocessor
pip install -r requirements.txt
pip install -e .

How to run

Run everything:

python scripts/run_all.py

Run scripts individually:

python scripts/00_check_environment.py
python scripts/01_tokenizer.py
python scripts/02_image_processor.py
python scripts/03_processor_clip.py
python scripts/04_custom_image_processor_roundtrip.py

Running without internet access

By default, the scripts try to load bert-base-uncased, google/vit-base-patch16-224, and openai/clip-vit-base-patch32 from the Hugging Face Hub.

If there is no internet connection or a model download fails, the following fallbacks take over automatically.

  • Tokenizer: builds a local tiny BERT tokenizer
  • ImageProcessor: builds a local ViTImageProcessor
  • Processor: assembles a local CLIPProcessor
  • Image file: generates a placeholder image if the download fails
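The fallback behaviour listed above can be sketched as a small helper. This is an illustration only: `load_with_fallback` and the two stand-in loaders are hypothetical names, not code taken from this project, and the actual scripts may structure the logic differently.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> Tuple[T, str]:
    """Return (object, source): try `primary` first, use `fallback` if it raises."""
    try:
        return primary(), "hub"
    except Exception as exc:  # e.g. no network, or the model is not in the local cache
        print(f"Hub load failed ({type(exc).__name__}); using local fallback")
        return fallback(), "fallback"


# Hypothetical stand-in loaders so the sketch runs without network access or extra packages.
def simulated_hub_loader() -> dict:
    raise OSError("simulated: cannot reach the Hub")


def simulated_local_loader() -> dict:
    return {"kind": "local tiny tokenizer"}


tokenizer, source = load_with_fallback(simulated_hub_loader, simulated_local_loader)
print(source)
```

In a real script, `primary` could be something like `lambda: AutoTokenizer.from_pretrained("bert-base-uncased")`, with `fallback` building the local object.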

To run explicitly in offline mode:

# macOS/Linux
TRANSFORMERS_OFFLINE=1 HF_HUB_OFFLINE=1 python scripts/run_all.py

# Windows PowerShell
$env:TRANSFORMERS_OFFLINE="1"
$env:HF_HUB_OFFLINE="1"
python scripts/run_all.py
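The same two switches can also be set from Python. A minimal sketch, assuming they are set before `transformers` or `huggingface_hub` is imported, since these libraries generally read the variables at import time; `enable_hf_offline` is a hypothetical helper name.

```python
import os
from typing import MutableMapping, Optional


def enable_hf_offline(env: Optional[MutableMapping[str, str]] = None) -> MutableMapping[str, str]:
    """Set the two offline switches on `env` (defaults to os.environ)."""
    target = os.environ if env is None else env
    target["HF_HUB_OFFLINE"] = "1"
    target["TRANSFORMERS_OFFLINE"] = "1"
    return target


# Demonstrated on a plain dict so that running the sketch leaves the real environment untouched.
demo_env = enable_hf_offline({})
print(demo_env)
```

Calling `enable_hf_offline()` with no argument at the very top of a script would apply the settings to the current process.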

What the exercises cover

01. Tokenizer

  • Uses AutoTokenizer.from_pretrained()
  • Converts sentences into input_ids, token_type_ids, and attention_mask
  • Checks the tokenization result with batch_decode()
  • Saves and restores with save_pretrained() / from_pretrained()
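The tokenizer steps above might look roughly like the following. This is a sketch rather than the contents of `01_tokenizer.py`: the function name `tokenizer_demo` and the sample sentences are made up for illustration, and running it assumes `transformers` is installed and the model is reachable or cached.

```python
import tempfile


def tokenizer_demo(model_name: str = "bert-base-uncased") -> dict:
    """Encode two sentences, decode them back, and round-trip the tokenizer via disk."""
    # Imported inside the function so that defining it does not require transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    sentences = [
        "Hello, Hugging Face!",
        "A processor bundles a tokenizer and an image processor.",
    ]

    # padding=True pads the shorter sentence; attention_mask is 0 at padded positions.
    batch = tokenizer(sentences, padding=True, truncation=True)
    decoded = tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True)

    # Save to a temporary directory, reload, and check both copies produce the same ids.
    with tempfile.TemporaryDirectory() as tmp_dir:
        tokenizer.save_pretrained(tmp_dir)
        restored = AutoTokenizer.from_pretrained(tmp_dir)
        restored_ids = restored(sentences, padding=True, truncation=True)["input_ids"]
        round_trip_ok = restored_ids == batch["input_ids"]

    return {"keys": sorted(batch.keys()), "decoded": decoded, "round_trip_ok": round_trip_ok}
```

For a BERT-style tokenizer, the returned keys would normally be `attention_mask`, `input_ids`, and `token_type_ids`.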

02. ImageProcessor

  • Uses AutoImageProcessor.from_pretrained()
  • Converts a PIL image into a pixel_values tensor
  • Checks the output shape: typically (batch, channel, height, width)
  • Saves and restores preprocessor_config.json
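A possible shape for the image processor exercise is sketched below. It is not the project's `02_image_processor.py`: `image_processor_demo` is a hypothetical name, the solid-colour test image is an assumption, and running it assumes `transformers`, `torch`, and Pillow are installed.

```python
import tempfile


def image_processor_demo(model_name: str = "google/vit-base-patch16-224") -> dict:
    """Preprocess a synthetic image and round-trip the image processor via disk."""
    # Imported lazily so that defining the function needs no third-party packages.
    from PIL import Image
    from transformers import AutoImageProcessor

    processor = AutoImageProcessor.from_pretrained(model_name)

    # A synthetic 640x480 RGB image stands in for a downloaded photo.
    image = Image.new("RGB", (640, 480), color=(120, 60, 200))
    inputs = processor(images=image, return_tensors="pt")
    shape = tuple(inputs["pixel_values"].shape)  # (batch, channel, height, width)

    # save_pretrained() writes the configuration; from_pretrained() reads it back.
    with tempfile.TemporaryDirectory() as tmp_dir:
        processor.save_pretrained(tmp_dir)
        restored = AutoImageProcessor.from_pretrained(tmp_dir)
        restored_shape = tuple(restored(images=image, return_tensors="pt")["pixel_values"].shape)

    return {"shape": shape, "restored_shape": restored_shape}
```

With the default checkpoint, both shapes should come out as `(1, 3, 224, 224)`, since that model expects 224x224 inputs.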

03. Processor / CLIP

  • Uses AutoProcessor or CLIPProcessor
  • Preprocesses text and images together
  • Output keys: pixel_values, input_ids, attention_mask, and so on
  • Saves and restores the Processor
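Joint text and image preprocessing with CLIP could be sketched as follows. The function name `clip_processor_demo`, the captions, and the synthetic image are illustrative assumptions, not taken from `03_processor_clip.py`; running it assumes `transformers`, `torch`, and Pillow are installed.

```python
import tempfile


def clip_processor_demo(model_name: str = "openai/clip-vit-base-patch32") -> dict:
    """Preprocess captions and an image in one call, then round-trip the processor."""
    # Imported lazily so that defining the function needs no third-party packages.
    from PIL import Image
    from transformers import CLIPProcessor

    processor = CLIPProcessor.from_pretrained(model_name)
    image = Image.new("RGB", (320, 240), color=(30, 144, 255))
    texts = ["a photo of a cat", "a photo of a dog"]

    # One call tokenizes the text and preprocesses the image.
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    keys = sorted(inputs.keys())

    # Saving stores both the tokenizer files and the image processor configuration.
    with tempfile.TemporaryDirectory() as tmp_dir:
        processor.save_pretrained(tmp_dir)
        restored = CLIPProcessor.from_pretrained(tmp_dir)
        restored_keys = sorted(
            restored(text=texts, images=image, return_tensors="pt", padding=True).keys()
        )

    return {"keys": keys, "restored_keys": restored_keys}
```

For CLIP, the keys are expected to be `attention_mask`, `input_ids`, and `pixel_values`.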

04. Custom ImageProcessor

  • Implements a custom image processor based on ImageProcessingMixin
  • Performs resize, normalize, and tensor conversion
  • Checks the save_pretrained() / from_pretrained() round trip
  • Depending on the environment, also attempts loading via AutoImageProcessor(..., trust_remote_code=True)
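To make the normalize step and the round trip concrete without depending on any library, here is a plain-Python stand-in. It deliberately does not subclass `ImageProcessingMixin` and skips resizing; `ToyImageProcessor` and its defaults (mean and std of 0.5, rescale by 1/255) are assumptions chosen for illustration, not the project's `image_processing_simple_vision.py`.

```python
import json
import tempfile
from pathlib import Path


class ToyImageProcessor:
    """Library-free stand-in: normalizes pixels and stores its settings as JSON."""

    def __init__(self, image_mean=(0.5, 0.5, 0.5), image_std=(0.5, 0.5, 0.5), rescale_factor=1 / 255):
        self.image_mean = list(image_mean)
        self.image_std = list(image_std)
        self.rescale_factor = rescale_factor

    def __call__(self, image):
        """`image` is a height x width x 3 nested list of 0..255 values; returns (1, 3, H, W)."""
        channels = []
        for c in range(3):
            # Rescale to 0..1, then normalize with the per-channel mean and std.
            plane = [
                [(pixel[c] * self.rescale_factor - self.image_mean[c]) / self.image_std[c] for pixel in row]
                for row in image
            ]
            channels.append(plane)
        return [channels]  # leading batch dimension of size 1

    def to_dict(self):
        return {
            "image_mean": self.image_mean,
            "image_std": self.image_std,
            "rescale_factor": self.rescale_factor,
        }

    def save_pretrained(self, directory):
        Path(directory, "preprocessor_config.json").write_text(json.dumps(self.to_dict(), indent=2))

    @classmethod
    def from_pretrained(cls, directory):
        return cls(**json.loads(Path(directory, "preprocessor_config.json").read_text()))


image = [[(0, 0, 0), (255, 255, 255)], [(255, 0, 0), (0, 255, 0)]]  # 2 x 2 RGB image
processor = ToyImageProcessor()
pixel_values = processor(image)

# Round trip: write the settings, rebuild the processor, and compare.
with tempfile.TemporaryDirectory() as tmp_dir:
    processor.save_pretrained(tmp_dir)
    restored = ToyImageProcessor.from_pretrained(tmp_dir)

print(restored.to_dict() == processor.to_dict())
```

The round trip succeeds when the restored object carries the same settings and produces the same output for the same input, which is the property the real exercise checks for the mixin-based class.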

Test notes

The following has been verified for this project.

  • All .py files pass a syntax check
  • python scripts/run_all.py runs successfully in fallback mode in an environment without internet access
  • Test log: outputs/logs/run_all_test.log

In some environments, particularly with Python 3.13 combined with torchvision, errors related to torchvision::nms can occur. In that case, a Python 3.10 or 3.11 environment is recommended.
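The version guidance above can be turned into a quick check. This is a hypothetical sketch (`python_version_advice` is not part of the project, and `00_check_environment.py` may do something different); it only encodes the versions named in these notes.

```python
import sys


def python_version_advice(version_info=None) -> str:
    """Map a (major, minor, ...) version tuple to the guidance given in the notes."""
    major, minor = tuple(version_info or sys.version_info)[:2]
    if (major, minor) in {(3, 10), (3, 11)}:
        return "recommended"
    if (major, minor) == (3, 13):
        return "may hit torchvision::nms errors; consider Python 3.10 or 3.11"
    return "no specific guidance"


print(python_version_advice())
```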