alex4cip

docs: Update README with multi-environment support and remove redundant footer

f1ac66c about 1 month ago


title: Multi-Model Korean LLM Chatbot
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit

🤖 Multi-Model Korean LLM Chatbot

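A multi-model chatbot that lets you pick from 13 LLMs with strong Korean-language support. It automatically detects whether it is running locally (CPU/GPU) or on Hugging Face Spaces (CPU Basic/Upgrade, ZeroGPU) and applies the matching settings.

✨ Key Features

🎯 13 models to choose from: LLMs of various sizes and characteristics
🇰🇷 Korean-optimized: built around models with strong Korean-language performance
🖥️ Multi-environment support: automatic detection of local (CPU/GPU) and HF Spaces (CPU Basic/Upgrade, ZeroGPU)
💾 Cache system: avoids re-downloading models and speeds up loading
🔄 Lazy loading: only the selected model is loaded, saving resources
🛡️ Stability: supports recent GPUs such as the RTX 5080, with an automatic CUDA compatibility test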

🎯 Supported Models (13)

🌟 Recommended Korean Models

Model	Size	Notes	Access
EXAONE 3.5 7.8B	7.3GB	⭐ Best efficiency for its parameter count	Public
EXAONE 3.5 2.4B	2.2GB	⚡ Ultra-light, fast responses	Public
Llama-3 Open-Ko 8B	7.5GB	🔥 Llama 3 ecosystem	Public

📚 Full Model List

Public models (10)

LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct
LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct
beomi/Llama-3-Open-Ko-8B
Qwen/Qwen2.5-7B-Instruct
Qwen/Qwen2.5-14B-Instruct
01-ai/Yi-1.5-9B-Chat
01-ai/Yi-1.5-34B-Chat
mistralai/Mistral-7B-Instruct-v0.3
upstage/SOLAR-10.7B-Instruct-v1.0
EleutherAI/polyglot-ko-5.8b

Gated models (3) 🔒

meta-llama/Llama-3.1-8B-Instruct
meta-llama/Llama-3.1-70B-Instruct
CohereForAI/aya-23-8B

Note: gated models require separate access approval on Hugging Face

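🚀 Supported Environments

Local environment (development / personal use)

1. Local GPU (recommended)

Pros:
- ⚡ Fast responses (5-10 seconds with GPU acceleration)
- 🔓 Unlimited usage
- 💰 No cost
Supported GPUs:
- NVIDIA GPUs with CUDA support (RTX series, A100, etc.)
- Apple Silicon GPUs (M1/M2/M3, accelerated via MPS)
- Recent Blackwell GPUs such as the RTX 5080 (requires PyTorch nightly)
Requirements: CUDA 12.0+ or Apple Silicon

2. Local CPU

Pros:
- 🖥️ Runs without a GPU
- 🔧 Simple setup
Limitations:
- ⏳ Slow responses (1-3 minutes)
- 🔒 Lightweight models recommended (EXAONE 2.4B, Mistral 7B)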

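Hugging Face Spaces (cloud deployment)

1. ZeroGPU (recommended)

Pros:
- ⚡ Fast responses (3-10 seconds with NVIDIA H200 GPU acceleration)
- 💰 Low cost ($9/month)
- 🔋 Automatic GPU allocation and release
Limitations:
- 25 minutes of included usage per day (requires a PRO subscription)
- Possible queueing (when many users are active)
Cost: $9/month (PRO subscription)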

2. CPU Upgrade

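Pros:
- ⏰ Unlimited usage
- 📊 Predictable performance
- 🔧 Simple setup
Limitations:
- 🐢 Slow responses (30 seconds to 1 minute)
- 💵 Relatively expensive
Cost: $0.03/hour (about $22/month when running continuously)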
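
The monthly figure follows from the hourly rate if the Space runs around the clock. A minimal sketch of that arithmetic, assuming a 30-day month (the function name is illustrative, not part of the app):

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: a 30-day month with the Space always running


def monthly_cost(hourly_rate_usd: float) -> float:
    """Cost in USD of keeping a Space running for a full month."""
    return hourly_rate_usd * HOURS_PER_DAY * DAYS_PER_MONTH


print(f"${monthly_cost(0.03):.2f}")  # prints $21.60
```

Rounded up, that matches the roughly $22 quoted for CPU Upgrade. A Space that sleeps part of the day would cost proportionally less.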

3. CPU Basic (free)

Pros:
- 💡 Free tier
- 🧪 Suited to testing and learning
Limitations:
- ⏳ Very slow responses (1-2 minutes)
- 🔒 Only lightweight models recommended
- ⚠️ Limited usage
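⚙️ Setup by Environment

Running locally (auto-detected)

The app detects the local environment automatically and applies the best settings:

python app.py

Detection logic:

GPU detection: checks whether CUDA or MPS is available
CUDA compatibility test: runs a tensor operation to verify that the GPU actually works
CPU fallback: switches to CPU mode automatically on a GPU error
Environment report: prints the detected environment at startup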
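
The compatibility test and CPU fallback described above could look roughly like the sketch below. This is not the code from app.py: `cuda_smoke_test` and `pick_device` are hypothetical names, and the optional `torch` import exists only so the snippet also runs on machines without PyTorch.

```python
try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def cuda_smoke_test() -> bool:
    """Return True only if a small tensor operation really succeeds on the GPU."""
    if torch is None or not torch.cuda.is_available():
        return False
    try:
        x = torch.ones(8, 8, device="cuda")
        # .item() copies the result to the host, which forces the kernel to run
        total = (x @ x).sum().item()
        return total == 512.0  # each of the 64 entries of x @ x equals 8
    except Exception as exc:  # e.g. a PyTorch build without kernels for this GPU
        print(f"CUDA test failed, falling back to CPU: {exc}")
        return False


def pick_device() -> str:
    """Choose 'cuda', 'mps', or 'cpu' based on what actually works."""
    if cuda_smoke_test():
        return "cuda"
    if torch is not None and torch.backends.mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

Running a real operation matters because `torch.cuda.is_available()` can report a GPU that the installed PyTorch build cannot actually use.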

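Deploying to HF Spaces (auto-detected)

When you change the hardware in Space Settings, the app detects it automatically:

Switching to ZeroGPU:

Space Settings → Hardware
Select ZeroGPU
Confirm → wait for the build to finish (1-2 minutes)
Check that the UI shows "🚀 HF Spaces - ZeroGPU"

Switching to CPU Upgrade:

Space Settings → Hardware
Select CPU Upgrade (8 vCPU / 32 GB)
Confirm → wait for the build to finish (1-2 minutes)
Check that the UI shows "⚙️ HF Spaces - CPU Upgrade"

CPU Basic (free):

Default setting, no change needed
The UI shows "💻 HF Spaces - CPU Basic"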
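
One way to produce the UI labels listed above is a lookup keyed by the detected hardware type. The sketch below is illustrative only: the three HF Spaces labels come from this README, while the function name and the local CPU label are assumptions.

```python
# Labels for HF Spaces hardware, as quoted in the steps above
HARDWARE_LABELS = {
    "zerogpu": "🚀 HF Spaces - ZeroGPU",
    "cpu_upgrade": "⚙️ HF Spaces - CPU Upgrade",
    "cpu_basic": "💻 HF Spaces - CPU Basic",
}


def describe_environment(hardware, gpu_name=None):
    """Map a detected hardware key to the label shown in the UI."""
    if hardware == "local_gpu":
        return f"🖥️ Local - GPU ({gpu_name})" if gpu_name else "🖥️ Local - GPU"
    if hardware == "local_cpu":
        return "🖥️ Local - CPU"  # assumption: this label is not given in the README
    return HARDWARE_LABELS.get(hardware, f"Unknown hardware: {hardware}")


print(describe_environment("local_gpu", "NVIDIA GeForce RTX 5080"))
```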

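📊 Performance Comparison

Metric	Local GPU	Local CPU	ZeroGPU	CPU Upgrade	CPU Basic
First response	10-20 s	2-5 min	10-20 s	1-2 min	2-3 min
Later responses	5-10 s	1-3 min	3-10 s	30 s-1 min	1-2 min
Daily limit	Unlimited	Unlimited	25 min	Unlimited	Limited
Monthly cost	$0	$0	$9	$22	$0
GPU	User's GPU	None	H200 (70GB)	None	None
Recommended models	All	Lightweight	All	All	Lightweight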

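🔧 Technical Architecture

Multi-environment auto-detection system

# 1. Import spaces first to avoid CUDA initialization errors on ZeroGPU
try:
    import spaces
    ZEROGPU_AVAILABLE = True
except ImportError:
    ZEROGPU_AVAILABLE = False

# 2. Only then import the CUDA-related packages
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 3. Detect the hardware environment
def detect_hardware_environment():
    """
    Returns: {
        'platform': 'hf_spaces' | 'local',
        'hardware': 'zerogpu' | 'cpu_upgrade' | 'cpu_basic' | 'local_gpu' | 'local_cpu',
        'gpu_available': bool,
        'gpu_name': str or None,
        'cuda_compatible': bool
    }
    """
    env = {
        'platform': 'local',
        'hardware': 'local_cpu',
        'gpu_available': False,
        'gpu_name': None,
        'cuda_compatible': False,
    }

    # Detect HF Spaces (SPACE_ID is set inside a Space)
    if os.environ.get('SPACE_ID'):
        env['platform'] = 'hf_spaces'
        if ZEROGPU_AVAILABLE:
            # Assumption: the GPU is attached on demand inside @spaces.GPU calls
            env.update(hardware='zerogpu', gpu_available=True, cuda_compatible=True)
        elif (os.cpu_count() or 1) >= 8:
            env['hardware'] = 'cpu_upgrade'
        else:
            env['hardware'] = 'cpu_basic'
        return env

    # Detect the local environment
    if torch.cuda.is_available():
        env.update(gpu_available=True, gpu_name=torch.cuda.get_device_name(0))
        # CUDA compatibility test (covers recent GPUs such as the RTX 5080);
        # test_cuda_compatibility() is a helper defined elsewhere in app.py
        if test_cuda_compatibility():
            env.update(hardware='local_gpu', cuda_compatible=True)
        # otherwise: CUDA error → stay on the CPU fallback
    elif torch.backends.mps.is_available():
        env.update(hardware='local_gpu', gpu_available=True)  # Apple Silicon
    return env

# 4. Apply the GPU decorator conditionally
# (generate_response_impl holds the actual generation logic, defined elsewhere)
if ZEROGPU_AVAILABLE:
    @spaces.GPU(duration=120)
    def generate_response(message, history):
        return generate_response_impl(message, history)
else:
    def generate_response(message, history):
        return generate_response_impl(message, history)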

Lazy Loading & Cache System

Smart model loading:

import gc

# Module-level state shared across calls
model = tokenizer = loaded_model_name = None

def load_model_once(model_index=None):
    """Load a model only when the selection changes (lazy loading)."""
    global model, tokenizer, loaded_model_name

    if model_index is None:
        model_index = 0  # assumption: fall back to the first configured model
    model_name = MODEL_CONFIGS[model_index]["MODEL_NAME"]

    # 1. Reuse the model if it is already loaded
    if loaded_model_name == model_name:
        print(f"ℹ️ Model {model_name} already loaded, reusing...")
        return model, tokenizer

    # 2. Check the cache → the UI shows a "downloading" or "loading" message
    is_cached = check_model_cached(model_name)
    if is_cached:
        print("✅ Model found in cache, loading from disk...")
    else:
        print("📥 Model not in cache, downloading (~4-14GB)...")

    # 3. Release the previous model's memory
    if model is not None:
        model = tokenizer = loaded_model_name = None
        gc.collect()
        if HW_ENV['cuda_compatible']:
            torch.cuda.empty_cache()

    # 4. Load the new model and tokenizer (settings depend on the environment)
    device = "cuda" if HW_ENV['gpu_available'] and HW_ENV['cuda_compatible'] else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    if device == "cuda":
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            dtype=torch.float16,  # GPU: float16
            device_map="auto",
        )
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            dtype=torch.float32,  # CPU: float32
        )

    loaded_model_name = model_name
    return model, tokenizer

Cache status check:

Shows the user "💾 Loading cached model" or "📥 Downloading model" in real time
Provides an estimated download time (5-20 minutes on first use)
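
A cache check of this kind can be approximated with the standard library alone, by looking for the model's folder in the Hugging Face hub cache. The sketch below is not the app's `check_model_cached`; `is_model_cached` is a hypothetical stand-in that assumes the usual `models--<org>--<name>/snapshots/` layout, and it only confirms that some snapshot exists, not that every file finished downloading.

```python
import os
from pathlib import Path


def default_hub_cache() -> Path:
    """Locate the hub cache, honouring HF_HUB_CACHE and HF_HOME if set."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"])
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "hub"


def is_model_cached(model_name: str, cache_dir=None) -> bool:
    """Return True if at least one snapshot of the model exists on disk."""
    root = Path(cache_dir) if cache_dir else default_hub_cache()
    snapshots = root / ("models--" + model_name.replace("/", "--")) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


name = "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct"
print("💾 cached" if is_model_cached(name) else "📥 download needed")
```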

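📝 How to Use

1. Open the Space

https://huggingface.co/spaces/catchitplay/simple-chat

2. Select a model

Pick a model from the dropdown
Check the cache status (💾 cached / 📥 download needed)
On first use the model is downloaded (2-14GB, 5-20 minutes)

3. Start chatting

안녕하세요 (Hello)
인공지능에 대해 설명해주세요 (Please explain artificial intelligence)
한국의 수도는 어디인가요? (What is the capital of Korea?)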

💡 Model Selection Guide

When you need fast responses

EXAONE 3.5 2.4B ⚡ (2.2GB) - fastest
Mistral 7B (7GB) - lightweight model

Quality first

EXAONE 3.5 7.8B ⭐ (7.3GB) - best efficiency
Qwen2.5 14B (14GB) - strong multilingual support
SOLAR 10.7B (10GB) - Korean-specialized

Maximum performance (slow)

Llama 3.1 70B 🔒 (70GB) - highest quality
Yi 1.5 34B (34GB) - long context

Llama ecosystem

Llama-3 Open-Ko 8B 🔥 (7.5GB)
Llama 3.1 8B 🔒 (8GB)
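
The sizes in this guide can also be used to shortlist models that fit a given size budget. A small illustrative helper, using only the figures listed above (the function and its name are hypothetical, and the listed sizes are not the same as the memory needed at runtime):

```python
# Approximate sizes in GB, taken from the guide above
MODEL_SIZES_GB = {
    "EXAONE 3.5 2.4B": 2.2,
    "Mistral 7B": 7.0,
    "EXAONE 3.5 7.8B": 7.3,
    "Llama-3 Open-Ko 8B": 7.5,
    "Llama 3.1 8B": 8.0,
    "SOLAR 10.7B": 10.0,
    "Qwen2.5 14B": 14.0,
    "Yi 1.5 34B": 34.0,
    "Llama 3.1 70B": 70.0,
}


def models_within(budget_gb: float) -> list:
    """Return the models whose listed size fits the budget, smallest first."""
    fitting = [(size, name) for name, size in MODEL_SIZES_GB.items() if size <= budget_gb]
    return [name for size, name in sorted(fitting)]


print(models_within(8))
```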

📦 Running Locally

Installation

# Clone the repository
git clone https://github.com/catchitplay/simple-chatbot-gradio.git
cd simple-chatbot-gradio

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

For recent GPUs such as the RTX 5080:

# Install PyTorch nightly (supports CUDA 12.8+)
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

.env file setup

# Create the .env file
echo "HF_TOKEN=your_hugging_face_token" > .env

How to get an HF_TOKEN:

Go to https://huggingface.co/settings/tokens
Click "New token"
Select the "Read" permission
Copy the generated token

Run

python app.py

Open http://localhost:7860 in your browser

Environment detection output at startup:

Hardware Environment Detection

Platform: local
Hardware: local_gpu
GPU Available: True
GPU Name: NVIDIA GeForce RTX 5080
CPU Cores: 16
OS: Linux
Description: 🖥️ Local - GPU (NVIDIA GeForce RTX 5080)

Notes:
- Local environment auto-detection: CPU/GPU/Apple Silicon MPS
- Automatic CUDA compatibility test (falls back to CPU on GPU errors)
- The model is downloaded on first run (4-14GB, 5-20 minutes)
- A GPU is recommended (RTX series, A100, Apple Silicon, etc.)

Installing as a Linux system service (auto-start)

To start the chatbot automatically when the server boots, install it as a systemd service.

1. Run the install script

# Run from the project directory
sudo ./install-service.sh

The install script automatically:

Detects the current user and directory path
Installs the systemd service file to /etc/systemd/system/chatbot.service
Creates the log files (/var/log/chatbot.log, /var/log/chatbot-error.log)
Enables auto-start at boot
Asks whether to start the service immediately

2. Service management commands

# Start the service
sudo systemctl start chatbot

# Stop the service
sudo systemctl stop chatbot

# Restart the service
sudo systemctl restart chatbot

# Check the service status
sudo systemctl status chatbot

# Follow the journal in real time
sudo journalctl -u chatbot -f

# View the application log
tail -f /var/log/chatbot.log

# View the error log
tail -f /var/log/chatbot-error.log

# Enable auto-start at boot
sudo systemctl enable chatbot

# Disable auto-start at boot
sudo systemctl disable chatbot

3. Removing the service

To remove the service completely:

# Stop and disable the service
sudo systemctl stop chatbot
sudo systemctl disable chatbot

# Delete the service file
sudo rm /etc/systemd/system/chatbot.service

# Reload the systemd daemon
sudo systemctl daemon-reload

# Delete the log files (optional)
sudo rm /var/log/chatbot.log /var/log/chatbot-error.log

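4. Cautions

Virtual environment required: the venv directory must exist before you install the service
Port conflict: the service will not start if another process is already using port 7860
Permissions: the install script must be run with sudo
Restart: after changing the app code, run sudo systemctl restart chatbot
Check the logs: when something goes wrong, look at the log files first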
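
The port conflict mentioned above can be checked before starting the service. A minimal sketch using only the standard library; `port_in_use` is an illustrative helper, not part of the install script:

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if port_in_use(7860):
    print("Port 7860 is busy: stop the other process before starting the chatbot")
else:
    print("Port 7860 is free")
```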

5. Manual service setup (advanced)

To configure the service by hand instead of using the install script:

# 1. Edit the chatbot.service file
sudo nano /etc/systemd/system/chatbot.service

# 2. Enter the following (adjust the paths and user name)
[Unit]
Description=Multi-Model Chatbot Gradio Service
After=network.target

[Service]
Type=simple
User=YOUR_USERNAME
WorkingDirectory=/path/to/simple-chatbot-gradio
Environment="PATH=/path/to/simple-chatbot-gradio/venv/bin:/usr/bin:/bin"
ExecStart=/path/to/simple-chatbot-gradio/venv/bin/python app.py
Restart=on-failure
RestartSec=10
StandardOutput=append:/var/log/chatbot.log
StandardError=append:/var/log/chatbot-error.log

[Install]
WantedBy=multi-user.target

# 3. Create the log files
sudo touch /var/log/chatbot.log /var/log/chatbot-error.log
sudo chown YOUR_USERNAME:YOUR_USERNAME /var/log/chatbot.log /var/log/chatbot-error.log

# 4. Reload the systemd daemon and enable the service
sudo systemctl daemon-reload
sudo systemctl enable chatbot
sudo systemctl start chatbot

6. Troubleshooting

If the service does not start:

# Check the service status
sudo systemctl status chatbot

# Check the error log
sudo journalctl -u chatbot -n 50

# Run manually to see the error
cd /path/to/simple-chatbot-gradio
source venv/bin/activate
python app.py

If the port is already in use:

# Find the process using port 7860
sudo lsof -i :7860

# Kill the process (after checking the PID)
sudo kill -9 <PID>

Virtual environment path problems:

# Recreate the virtual environment
python -m venv venv
source venv/bin/activate
pip install -r requirements-local.txt

🛠️ Tech Stack

Framework: Gradio 5.49.1
ML libraries: Transformers 4.57.1, PyTorch 2.2.0+
GPU support:
- HF Spaces: ZeroGPU (NVIDIA H200)
- Local: CUDA 12.0+, Apple Silicon MPS
- Recent GPUs: PyTorch nightly (CUDA 12.8+)
Language: Python 3.10+

📚 Dependencies

# Core
gradio==5.49.1
transformers==4.57.1
torch>=2.2.0  # HF Spaces: 2.2.0 (ZeroGPU), Local: 2.2.0+ or nightly
safetensors==0.6.2
accelerate==0.26.1
sentencepiece==0.2.0
protobuf==4.25.1
huggingface-hub>=0.19.0
python-dotenv==1.0.0
spaces  # ZeroGPU support (HF Spaces only)

PyTorch version by environment:

HF Spaces: PyTorch 2.2.0 (ZeroGPU-compatible)
Local, standard GPU: PyTorch 2.2.0+ (CUDA 12.0+)
Local, recent GPU (RTX 5080, etc.): PyTorch nightly (CUDA 12.8+)
Local CPU: PyTorch 2.2.0+ (CPU-only build)

🔒 Using Gated Models

1. Request access

Click "Request Access" on each gated model's page.

2. Set HF_TOKEN

Once access is approved, set HF_TOKEN in the .env file (see above)

3. Set Space secrets (HF Spaces)

Space Settings → Repository secrets:

Name: HF_TOKEN
Value: your_token_here
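
Whether the token comes from a local .env file or from a Space secret, it reaches the process as the HF_TOKEN environment variable. The sketch below shows one way to fail early when a gated model is selected without a token. `resolve_token` is a hypothetical helper, not the app's code; the gated model IDs are the three listed in this README.

```python
import os

# The three gated models listed in this README
GATED_MODELS = {
    "meta-llama/Llama-3.1-8B-Instruct",
    "meta-llama/Llama-3.1-70B-Instruct",
    "CohereForAI/aya-23-8B",
}


def resolve_token(model_name, env=None):
    """Return HF_TOKEN (or None), raising early if a gated model has no token."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN") or None  # treat an empty string as missing
    if model_name in GATED_MODELS and token is None:
        raise RuntimeError(
            f"{model_name} is gated: request access on its model page and set HF_TOKEN"
        )
    return token


print(resolve_token("Qwen/Qwen2.5-7B-Instruct", env={}))  # prints None
```

Passing `env` explicitly keeps the helper easy to test. In normal use it falls back to `os.environ`.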

⚠️ Limitations and Known Issues

Common

Model size: 2-70GB (loading takes time)
Context: conversation history is kept for the last 3 turns
Memory: large models need a GPU or a large amount of RAM
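
Keeping only the last 3 turns can be done with a simple slice. A minimal sketch, assuming the history is a list of (user, assistant) pairs; the app's actual history format may differ, and `trim_history` is an illustrative name:

```python
MAX_TURNS = 3  # the "last 3 turns" limit described above


def trim_history(history, max_turns=MAX_TURNS):
    """Keep only the most recent `max_turns` (user, assistant) pairs."""
    if max_turns <= 0:
        return []  # history[-0:] would return everything, so handle this case
    return list(history[-max_turns:])


history = [(f"question {i}", f"answer {i}") for i in range(1, 6)]
print(trim_history(history))  # keeps turns 3, 4 and 5
```

Bounding the history keeps the prompt short, which matters most on the slower CPU tiers.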

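Per-environment constraints

HF Spaces - ZeroGPU:

Daily limit: 25 minutes (requires a PRO subscription)
Queueing: you may have to wait when many users are active
Cost: $9/month

HF Spaces - CPU Upgrade:

Slow: 10-30x slower than a GPU
Cost: $0.03 per hour (about $22/month)
Memory: 32GB RAM (limits large models)

HF Spaces - CPU Basic:

Very slow: 1-2 minute responses
Limited usage
Lightweight models recommended

Local environment:

GPU memory: large models may run out of VRAM
Recent GPUs: require PyTorch nightly (RTX 5080, etc.)
CPU mode: very slow (1-3 minute responses)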

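Known issues and fixes

"CUDA has been initialized" error (ZeroGPU):

Cause: spaces must be imported before torch
Fix: import spaces first in app.py (already applied)

CUDA errors on Blackwell GPUs such as the RTX 5080:

Cause: CUDA 12.8+ is required (PyTorch 2.2.0 does not support it)
Fix: install PyTorch nightly (see the installation section above)

GPU detected but the app runs in CPU mode:

Cause: the CUDA compatibility test failed
Fix: check the PyTorch version and update the CUDA driver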
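
When checking the PyTorch version, it helps to see the CUDA version the build was compiled against next to what the GPU reports. A small diagnostic sketch (`cuda_report` is a hypothetical helper; it returns a partial report when PyTorch is not installed):

```python
def cuda_report() -> dict:
    """Collect the PyTorch and CUDA details relevant to the issues above."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["gpu_name"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = torch.cuda.get_device_capability(0)
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If `built_with_cuda` is lower than what the GPU needs (12.8+ for the RTX 5080 case above), installing the nightly build is the fix this README recommends.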

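🔗 Related Resources

Model cards

Documentation

📄 License

MIT License

🙋‍♂️ Contact

If you run into problems or have questions, please open a GitHub issue.

💡 TIP:

Quick testing: EXAONE 2.4B ⚡
Balanced performance: EXAONE 7.8B ⭐
Highest quality: Llama 3.1 70B 🔒 (slow)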