# Hugging Face Processor Practice
이 ν”„λ‘œμ νŠΈλŠ” Hugging Face `Tokenizer`, `ImageProcessor`, `Processor` μ‹€μŠ΅ μ½”λ“œλ₯Ό VS Codeμ—μ„œ μ‹€ν–‰ν•˜κΈ° μ’‹κ²Œ μ •λ¦¬ν•œ μ˜ˆμ œμž…λ‹ˆλ‹€.
μ°Έκ³  νŽ˜μ΄μ§€: `https://ds31x.github.io/wiki/hf/hf_processor/`
## ꡬ성
```text
hf_processor_vscode/
├── README.md
├── requirements.txt
├── pyproject.toml
├── scripts/
│   ├── 00_check_environment.py
│   ├── 01_tokenizer.py
│   ├── 02_image_processor.py
│   ├── 03_processor_clip.py
│   ├── 04_custom_image_processor_roundtrip.py
│   └── run_all.py
├── src/
│   └── hf_processor_practice/
│       ├── image_processing_simple_vision.py
│       └── utils.py
├── notebooks/
│   └── hf_processor_practice.ipynb
├── reports/
│   ├── result_analysis.md
│   └── agent_log.md
├── data/
└── outputs/
```
## Setting up the environment
### Option 1: venv
```bash
python -m venv .venv
# Activate the environment (run only the line that matches your shell)
# Windows PowerShell
.venv\Scripts\Activate.ps1
# macOS/Linux
source .venv/bin/activate
python -m pip install -U pip
pip install -r requirements.txt
pip install -e .
```
### Option 2: conda
```bash
conda create -n hfprocessor python=3.11 -y
conda activate hfprocessor
pip install -r requirements.txt
pip install -e .
```
## How to run
Run everything:
```bash
python scripts/run_all.py
```
Run the scripts individually:
```bash
python scripts/00_check_environment.py
python scripts/01_tokenizer.py
python scripts/02_image_processor.py
python scripts/03_processor_clip.py
python scripts/04_custom_image_processor_roundtrip.py
```
## Without internet access
By default, the scripts try to load `bert-base-uncased`, `google/vit-base-patch16-224`, and `openai/clip-vit-base-patch32` from the Hugging Face Hub.
If there is no internet connection or a model download fails, the following fallbacks take over automatically.
- Tokenizer: creates a local tiny BERT tokenizer
- ImageProcessor: creates a local ViTImageProcessor
- Processor: builds a local CLIPProcessor
- Image file: generates a placeholder image if the download fails
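The fallback behavior described above can be pictured as a "try the Hub, otherwise build locally" wrapper. The following is a minimal, library-free sketch of that pattern, not the code in `utils.py`; `load_with_fallback`, `failing_hub_loader`, and `local_loader` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def load_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> Tuple[T, str]:
    """Return (object, source), where source is "hub" or "local"."""
    try:
        return primary(), "hub"
    except Exception as exc:  # e.g. an OSError when the Hub cannot be reached
        print(f"primary loader failed ({type(exc).__name__}: {exc}); using local fallback")
        return fallback(), "local"


def failing_hub_loader() -> dict:
    # Stand-in for a Hub download such as AutoTokenizer.from_pretrained("bert-base-uncased")
    raise OSError("simulated: Hub unreachable")


def local_loader() -> dict:
    # Stand-in for building a tiny tokenizer from local files
    return {"kind": "tiny-local-tokenizer"}


obj, source = load_with_fallback(failing_hub_loader, local_loader)
print(source, obj)
```

Passing the loaders as callables keeps the wrapper independent of any particular model class, so the same helper could serve the tokenizer, image processor, and processor cases.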
To run explicitly in offline mode:
```bash
# macOS/Linux
TRANSFORMERS_OFFLINE=1 HF_HUB_OFFLINE=1 python scripts/run_all.py
# Windows PowerShell
$env:TRANSFORMERS_OFFLINE="1"
$env:HF_HUB_OFFLINE="1"
python scripts/run_all.py
```
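The same flags can also be set from inside Python. This is a small sketch under the assumption that the variables are set before `transformers` or `huggingface_hub` is imported, since those libraries generally read them at import time; `offline_requested` is a hypothetical helper, not part of either library.

```python
import os

# Set the flags first; importing transformers / huggingface_hub should come after this.
os.environ.setdefault("HF_HUB_OFFLINE", "1")
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")


def offline_requested() -> bool:
    """Hypothetical helper: True if either offline flag holds a truthy value."""
    truthy = {"1", "true", "yes", "on"}
    names = ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")
    return any(os.environ.get(name, "").strip().lower() in truthy for name in names)


print("offline mode requested:", offline_requested())
```

`setdefault` leaves any value already exported in the shell untouched, so the command-line form shown above still takes precedence.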
## Practice contents
### 01. Tokenizer
- Use `AutoTokenizer.from_pretrained()`
- Convert sentences into `input_ids`, `token_type_ids`, and `attention_mask`
- Check the tokenization result with `batch_decode()`
- Save and restore with `save_pretrained()` / `from_pretrained()`
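To make the three output fields concrete, here is a toy, library-free sketch of a BERT-style sentence-pair encoding. It is not the `transformers` implementation: the vocabulary and its ids are hypothetical, and whitespace splitting stands in for WordPiece. Only the layout (`[CLS] A [SEP] B [SEP]`, segment ids, padding mask) follows the BERT convention.

```python
from typing import Dict, List, Optional

# Hypothetical toy vocabulary; the real bert-base-uncased ids are different.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "hello": 4, "world": 5, "how": 6, "are": 7, "you": 8}


def to_ids(text: str) -> List[int]:
    """Whitespace tokenization with an [UNK] fallback (real BERT uses WordPiece)."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]


def encode(text_a: str, text_b: Optional[str] = None, max_length: int = 10) -> Dict[str, List[int]]:
    # First segment: [CLS] A [SEP], all marked with segment id 0.
    input_ids = [VOCAB["[CLS]"]] + to_ids(text_a) + [VOCAB["[SEP]"]]
    token_type_ids = [0] * len(input_ids)
    if text_b is not None:
        # Second segment: B [SEP], marked with segment id 1.
        second = to_ids(text_b) + [VOCAB["[SEP]"]]
        input_ids += second
        token_type_ids += [1] * len(second)
    attention_mask = [1] * len(input_ids)  # 1 = real token
    pad = max_length - len(input_ids)
    if pad < 0:
        raise ValueError("sequence is longer than max_length")
    return {
        "input_ids": input_ids + [VOCAB["[PAD]"]] * pad,
        "token_type_ids": token_type_ids + [0] * pad,
        "attention_mask": attention_mask + [0] * pad,  # 0 = padding
    }


enc = encode("hello world", "how are you")
for key, value in enc.items():
    print(f"{key:15s} {value}")
```

The padding positions get `attention_mask` 0 so that a model can ignore them, while `token_type_ids` only distinguishes the first sentence from the second.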
### 02. ImageProcessor
- Use `AutoImageProcessor.from_pretrained()`
- Convert a PIL image into a `pixel_values` tensor
- Check the output shape: typically `(batch, channel, height, width)`
- Save and restore `preprocessor_config.json`
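The `(batch, channel, height, width)` layout can be illustrated without any image library. The sketch below rescales 0-255 pixels to 0-1, normalizes them, and moves the channel axis in front of height and width. It skips resizing, uses plain nested lists instead of tensors, and the mean/std of 0.5 are assumed values chosen for illustration; `to_pixel_values` and `shape` are hypothetical helpers.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each in 0..255
Image = List[List[Pixel]]      # height x width x channel


def to_pixel_values(images: Sequence[Image],
                    mean: Sequence[float] = (0.5, 0.5, 0.5),
                    std: Sequence[float] = (0.5, 0.5, 0.5)) -> list:
    """Return nested lists laid out as (batch, channel, height, width)."""
    batch = []
    for img in images:
        channels = []
        for c in range(3):
            # Rescale to [0, 1], then normalize with the per-channel mean and std.
            plane = [[(px[c] / 255.0 - mean[c]) / std[c] for px in row] for row in img]
            channels.append(plane)
        batch.append(channels)
    return batch


def shape(nested) -> Tuple[int, ...]:
    """Shape of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# A 2-row by 3-column RGB image.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(255, 255, 255), (0, 0, 0), (255, 0, 0)],
]
pixel_values = to_pixel_values([image])
print(shape(pixel_values))  # (1, 3, 2, 3)
```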
### 03. Processor / CLIP
- Use `AutoProcessor` or `CLIPProcessor`
- Preprocess text and images together
- Output keys: `pixel_values`, `input_ids`, `attention_mask`, and so on
- Save and restore the Processor
### 04. Custom ImageProcessor
- Implement a custom image processor based on `ImageProcessingMixin`
- Perform resize, normalize, and tensor conversion
- Verify the `save_pretrained()` / `from_pretrained()` round trip
- Depending on the environment, also try loading through `AutoImageProcessor.from_pretrained(..., trust_remote_code=True)`
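The round-trip check amounts to "save the settings as JSON, load them back, compare". The sketch below imitates that idea with the standard library only; it is a stand-in, not the `ImageProcessingMixin` implementation or the project's `image_processing_simple_vision.py`. `SimpleVisionConfig` and its fields are hypothetical, while the file name `preprocessor_config.json` is the one mentioned earlier.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List

CONFIG_NAME = "preprocessor_config.json"


@dataclass
class SimpleVisionConfig:
    """Hypothetical stand-in holding the settings of a custom image processor."""
    size: int = 224
    image_mean: List[float] = field(default_factory=lambda: [0.5, 0.5, 0.5])
    image_std: List[float] = field(default_factory=lambda: [0.5, 0.5, 0.5])

    def save_pretrained(self, directory: str) -> None:
        # Write all settings to <directory>/preprocessor_config.json.
        path = Path(directory)
        path.mkdir(parents=True, exist_ok=True)
        (path / CONFIG_NAME).write_text(json.dumps(asdict(self), indent=2))

    @classmethod
    def from_pretrained(cls, directory: str) -> "SimpleVisionConfig":
        # Rebuild the object from the saved JSON file.
        data = json.loads((Path(directory) / CONFIG_NAME).read_text())
        return cls(**data)


with tempfile.TemporaryDirectory() as tmp:
    original = SimpleVisionConfig(size=32, image_mean=[0.1, 0.2, 0.3])
    original.save_pretrained(tmp)
    restored = SimpleVisionConfig.from_pretrained(tmp)

round_trip_ok = restored == original
print("round trip ok:", round_trip_ok)
```

Comparing the restored object with the original is the essential assertion; a real processor would additionally compare the `pixel_values` produced before and after the reload.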
## Test notes
The following has been verified for this project.
- All `.py` files pass a syntax check
- `python scripts/run_all.py` runs successfully in fallback mode in an environment without internet access
- Test log: `outputs/logs/run_all_test.log`
In some environments, especially with the Python 3.13 + torchvision combination, errors related to `torchvision::nms` can occur. In that case, a Python 3.10 or 3.11 environment is recommended.