# Political Party Press Release Crawler
A crawler that automatically collects press releases, commentaries/briefings, and opening remarks from the websites of six South Korean political parties and uploads them to Hugging Face.
**Supported parties**: Democratic Party of Korea (더불어민주당), People Power Party (국민의힘), Rebuilding Korea Party (조국혁신당), Reform Party (개혁신당), Basic Income Party (기본소득당), Progressive Party (진보당)
## Key Features
- **Asynchronous processing (asyncio + aiohttp)**: 10-20x faster than the earlier synchronous version
- **Parallel crawling of all six parties**: the crawlers run concurrently to shorten total run time
- **Incremental updates**: collects only data published since the last crawl
- **Automatic Hugging Face upload**: new data is merged automatically into a separate repository for each party
## Installation
```bash
pip install -r requirements.txt
```
Or, on Windows:
```bash
setup.bat
```
## ํ™˜๊ฒฝ ๋ณ€์ˆ˜ ์„ค์ •
`.env` ํŒŒ์ผ ์ƒ์„ฑ ํ›„ ์•„๋ž˜ ๋‚ด์šฉ ์ž…๋ ฅ:
```
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
# Hugging Face dataset repository for each party
HF_REPO_ID=your_username/minjoo-press-releases
HF_REPO_ID_PPP=your_username/ppp-press-releases
HF_REPO_ID_REBUILDING=your_username/rebuilding-press-releases
HF_REPO_ID_REFORM=your_username/reform-press-releases
HF_REPO_ID_BASIC_INCOME=your_username/basic-income-press-releases
HF_REPO_ID_JINBO=your_username/jinbo-press-releases
```
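The variables above could be resolved per party roughly as follows. This is a minimal sketch, not the project's actual code: the helper `resolve_repo_id` is hypothetical, and the mapping of `HF_REPO_ID` (no suffix) to the `minjoo` party code is an assumption inferred from the example value.

```python
import os

# Variable names as listed in the .env example. Mapping the un-suffixed
# HF_REPO_ID to "minjoo" is an assumption based on the example value.
REPO_ENV_VARS = {
    "minjoo": "HF_REPO_ID",
    "ppp": "HF_REPO_ID_PPP",
    "rebuilding": "HF_REPO_ID_REBUILDING",
    "reform": "HF_REPO_ID_REFORM",
    "basic_income": "HF_REPO_ID_BASIC_INCOME",
    "jinbo": "HF_REPO_ID_JINBO",
}


def resolve_repo_id(party, env=None):
    """Return the dataset repo ID configured for a party code (hypothetical helper)."""
    env = os.environ if env is None else env
    if party not in REPO_ENV_VARS:
        raise KeyError(f"unknown party code: {party}")
    var_name = REPO_ENV_VARS[party]
    repo_id = env.get(var_name, "").strip()
    if not repo_id:
        raise RuntimeError(f"{var_name} is not set; add it to .env")
    return repo_id
```

Passing a plain dictionary as `env` makes the lookup easy to check without touching the real environment.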
## Usage
### main.py - Unified entry point (recommended)
```bash
# Incremental update for all parties (default)
python main.py
# A single party only
python main.py --party minjoo # Democratic Party of Korea
python main.py --party ppp # People Power Party
python main.py --party rebuilding # Rebuilding Korea Party
python main.py --party reform # Reform Party
python main.py --party basic_income # Basic Income Party
python main.py --party jinbo # Progressive Party
# Specify a date range
python main.py --start-date 2024-01-01
python main.py --party reform --start-date 2024-01-01 --end-date 2024-06-30
# Help
python main.py --help
```
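For readers who want to see how such flags are typically parsed, here is a minimal `argparse` sketch. It assumes the three flags shown above and the six party codes; the function names `build_parser` and `parse_cli` and the date-order check are illustrative, and the real `main.py` may differ.

```python
import argparse
from datetime import date

PARTY_CODES = ["minjoo", "ppp", "rebuilding", "reform", "basic_income", "jinbo"]


def build_parser():
    """Build a parser for the flags shown in the usage examples (sketch)."""
    parser = argparse.ArgumentParser(description="Party press release crawler (sketch)")
    parser.add_argument("--party", choices=PARTY_CODES, default=None,
                        help="crawl one party only; omit to crawl all parties")
    # date.fromisoformat raises ValueError on bad input, which argparse
    # reports as a normal usage error.
    parser.add_argument("--start-date", type=date.fromisoformat, default=None,
                        help="first date to collect, YYYY-MM-DD")
    parser.add_argument("--end-date", type=date.fromisoformat, default=None,
                        help="last date to collect, YYYY-MM-DD")
    return parser


def parse_cli(argv=None):
    """Parse arguments and reject an inverted date range (assumed behavior)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.start_date and args.end_date and args.start_date > args.end_date:
        parser.error("--start-date must not be after --end-date")
    return args
```

With no arguments, `party`, `start_date`, and `end_date` are all `None`, which would correspond to the default incremental update for all parties.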
### Running an individual crawler directly
```bash
python minjoo_crawler_async.py
python ppp_crawler_async.py
python rebuilding_crawler_async.py
python reform_crawler_async.py
python basic_income_crawler_async.py
python jinbo_crawler_async.py
```
### Daily automatic runs (scheduler)
```bash
python unified_scheduler.py # Runs all parties automatically every day at 9:00 AM
```
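One simple way to implement a daily 9:00 AM trigger is to compute how long to sleep until the next run. The sketch below illustrates that idea only; `seconds_until_next_run` is a hypothetical helper, and `unified_scheduler.py` may well use a scheduling library instead.

```python
from datetime import datetime, time, timedelta


def seconds_until_next_run(now, run_at=time(9, 0)):
    """Seconds from `now` until the next occurrence of `run_at` (naive local time)."""
    target = datetime.combine(now.date(), run_at)
    if target <= now:
        # Today's slot has already passed (or is exactly now), so use tomorrow's.
        target += timedelta(days=1)
    return (target - now).total_seconds()
```

A scheduler loop could sleep for this many seconds, run the crawl, and then repeat.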
### Windows batch files
| File | Description |
|------|------|
| `run_unified.bat` | Crawl all parties concurrently (one-off) |
| `run_unified_scheduler.bat` | Crawl all parties automatically every day |
| `run_once.bat` | Democratic Party only |
| `run_ppp.bat` | People Power Party only |
## Collected Data
| Party | Boards | Collection start date |
|------|--------|------------|
| Democratic Party of Korea | Press releases (보도자료), commentaries/briefings (논평/브리핑), opening remarks (모두발언) | 2003-11-11 |
| People Power Party | Spokesperson commentaries and press releases (대변인 논평보도자료), floor press releases (원내 보도자료), Media Special Committee (미디어특위) | 2000-03-10 |
| Rebuilding Korea Party | Press conference statements (기자회견문), commentaries/briefings (논평브리핑), press releases (보도자료) | 2024-03-04 |
| Reform Party | Press releases (보도자료), commentaries/briefings (논평브리핑) | 2024-02-13 |
| Basic Income Party | Commentaries and press releases (논평·보도자료), covering commentary, remarks, and press releases (논평/발언/보도자료) | 2020-01-08 |
| Progressive Party | Press releases (보도자료), commentaries (논평), opening remarks (모두발언) | 2017-10-14 |
## Configuration (crawler_config.json)
Each party can be configured independently:
```json
{
"minjoo": { ... },
"ppp": { ... },
"rebuilding": { ... },
"reform": { ... },
"basic_income": { ... },
"jinbo": { ... }
}
```
| Setting | Description |
|------|------|
| `boards` | List of boards to collect |
| `start_date` | Start date for the initial crawl |
| `max_pages` | Maximum number of pages |
| `concurrent_requests` | Number of concurrent requests (keep server load in mind) |
| `request_delay` | Wait time between requests (seconds) |
| `output_path` | Local output path |
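To illustrate how `concurrent_requests` and `request_delay` can work together, here is a minimal asyncio sketch using a semaphore. It is an assumption about the general technique, not the crawler's actual code: `fetch_all` and the `fetch` callback are hypothetical, and how the project applies `request_delay` is not specified here. In a real crawler, `fetch` would wrap an aiohttp request.

```python
import asyncio


async def fetch_all(urls, fetch, concurrent_requests=10, request_delay=0.5):
    """Fetch every URL with at most `concurrent_requests` requests in flight.

    `fetch` is any coroutine function that takes a URL and returns a result.
    Results come back in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(concurrent_requests)

    async def worker(url):
        async with semaphore:
            result = await fetch(url)
            # Hold the slot for `request_delay` seconds so that each slot
            # issues at most one request per delay window (assumed policy).
            await asyncio.sleep(request_delay)
            return result

    return await asyncio.gather(*(worker(url) for url in urls))
```

Raising `concurrent_requests` increases throughput at the cost of more load on the target server, while raising `request_delay` does the opposite.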
## File Structure
```
정당크롤러/
├── main.py # Unified entry point (supports CLI arguments)
├── unified_crawler.py # Unified crawler for all six parties
├── unified_scheduler.py # Unified scheduler
├── minjoo_crawler_async.py # Democratic Party of Korea
├── ppp_crawler_async.py # People Power Party
├── rebuilding_crawler_async.py # Rebuilding Korea Party
├── reform_crawler_async.py # Reform Party
├── basic_income_crawler_async.py # Basic Income Party
├── jinbo_crawler_async.py # Progressive Party
├── scheduler.py # Democratic Party-only scheduler (legacy)
├── crawler_config.json # Crawl settings (six parties)
├── crawler_state.json # Crawl state (generated automatically)
├── requirements.txt # Python dependencies
└── .env # Environment variables (create manually)
```
## Data Columns (common)
| Column | Description |
|------|------|
| `board_name` | Board name |
| `title` | Title |
| `category` | Category/classification |
| `date` | Publication date |
| `writer` | Author |
| `text` | Body text |
| `url` | Original URL |
> **Note**: The People Power Party dataset uses `section` in place of `category` and includes an additional `no` column.
## Performance
| Item | Async version | Previous sync version |
|------|------------|--------------|
| 1 party (1,000 items) | ~5 min | ~80 min |
| 6 parties concurrently | ~5-10 min | ~480 min |
## How Incremental Updates Work
1. **First run**: collects everything from `start_date` through today
2. **Subsequent runs**: collects only from the day after the last crawl date
3. **Hugging Face merge**: new data is merged automatically with the existing dataset and deduplicated by URL
4. **State management**: progress is recorded independently for each party in `crawler_state.json`
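The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the state key `last_crawled_date`, both function names, and the choice that a newly crawled row replaces an existing row with the same URL are all hypothetical.

```python
from datetime import date, timedelta


def next_start_date(state, party, config_start_date):
    """Use the configured start on the first run, else the day after the last crawl."""
    # "last_crawled_date" is a hypothetical key; the real layout of
    # crawler_state.json is not documented here.
    last = state.get(party, {}).get("last_crawled_date")
    if last is None:
        return date.fromisoformat(config_start_date)
    return date.fromisoformat(last) + timedelta(days=1)


def merge_by_url(existing, new_rows):
    """Merge two lists of row dicts, keeping exactly one row per URL.

    A new row replaces an existing row with the same URL (assumed policy);
    rows keep the position of their first appearance.
    """
    merged = {row["url"]: row for row in existing}
    for row in new_rows:
        merged[row["url"]] = row
    return list(merged.values())
```

Because state is looked up by party code, one party's progress does not affect the start date computed for another.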
## Troubleshooting
| Problem | Solution |
|------|----------|
| `HF_TOKEN이 설정되지 않았습니다` ("HF_TOKEN is not set") | Check `HF_TOKEN` in the `.env` file |
| Crawling is slow | Increase `concurrent_requests` in `crawler_config.json` |
| Server connection errors | Increase `request_delay` in `crawler_config.json` |
| Only one party fails | Run it on its own with `python main.py --party [party_code]` to investigate |
## Viewing Logs
```bash
type main.log # main.py run log
type unified_crawler.log # Unified crawler log
type unified_scheduler.log # Scheduler log
```
## Running in the Background on Windows
```bash
# From a batch file
start /B python main.py > main.log 2>&1
# Or use Windows Task Scheduler
# Trigger: daily at 9:00 AM → Action: python unified_scheduler.py
```
## Precautions
1. Keep `concurrent_requests` at 10-20 or lower to minimize server load
2. Check each website's robots.txt before crawling
3. Before publishing the data, check whether it contains personal information, and cite the source
## License
MIT License