Space: alex4cip/simple-chat
alex4cip Claude committed on Oct 29

Commit 9171967 (1 parent: a93accb)

feat: Add ZeroGPU support for Llama-2-Ko 7B

✨ Features:
- ZeroGPU integration with @spaces.GPU decorator
- Llama-2-Ko 7B model with dynamic GPU allocation
- NVIDIA H200 hardware (70GB VRAM)
- Float16 optimization for GPU efficiency

🔧 Technical Changes:
- Add 'spaces' package to requirements.txt
- Implement GPU request with 120s duration
- Global model caching for efficiency
- Korean-optimized conversation formatting

📚 Documentation:
- Updated README with ZeroGPU setup guide
- Added cost comparison and optimization tips
- Included usage examples and limitations

⚠️ Requirements:
- Hugging Face PRO subscription ($9/month)
- ZeroGPU hardware selection in Space settings
- Daily limit: 25 minutes free usage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
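
To put the 120 s duration and the 25-minute daily limit from the commit message in perspective, here is a rough arithmetic sketch. How ZeroGPU actually meters usage is not stated in this commit, so both charging models below are assumptions: a worst case where each call costs the full requested duration, and a best case where each call costs only the 3-5 s response time quoted in the README.

```python
# Figures come from the commit message and README; the metering model is an assumption.
DAILY_QUOTA_SECONDS = 25 * 60      # "Daily limit: 25 minutes"
REQUESTED_DURATION_SECONDS = 120   # @spaces.GPU(duration=120)


def calls_per_day(seconds_per_call: float, quota_seconds: int = DAILY_QUOTA_SECONDS) -> int:
    """Return how many whole calls fit in the daily quota at a given cost per call."""
    if seconds_per_call <= 0:
        raise ValueError("seconds_per_call must be positive")
    return int(quota_seconds // seconds_per_call)


# Assumed worst case: every call is charged the full requested duration.
worst_case = calls_per_day(REQUESTED_DURATION_SECONDS)
# Assumed best case: calls are charged only the 3-5 s of actual generation.
best_case_range = (calls_per_day(5), calls_per_day(3))

print(worst_case)        # 12
print(best_case_range)   # (300, 500)
```

Under these assumptions the quota covers somewhere between about a dozen and a few hundred responses per day, which is why the README suggests requesting only as much duration as is needed.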

Files changed (3)

README.md +109 -111
app.py +98 -88
requirements.txt +1 -0

README.md CHANGED

@@ -1,5 +1,5 @@
 ---
-title: LLM Chatbot
 emoji: 🤖
 colorFrom: blue
 colorTo: purple
@@ -10,162 +10,160 @@ pinned: false
 license: mit
 ---
-# 🤖 Hugging Face LLM Chatbot
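-A web-based chatbot application for chatting with a variety of open-source LLMs.
-## ✨ Key Features
-- **Multi-model support**: 7 models (3 English, 4 Korean)
-- **Local execution**: Models run locally via the Transformers library
-- **No API limits**: Works without an internet connection (after the first download)
-- **Automatic session management**: The conversation resets automatically when the model changes
-- **Completely free**: No API costs, open source
-## 🎯 Supported Models
-### English Models
-1. **DialoGPT Small** - Fast conversational model (~350MB)
-2. **DialoGPT Medium** - High-quality conversational model (~800MB)
-3. **GPT-2** - General-purpose text generation model (~500MB)
-### Korean Models
-4. **Llama-2-Ko 7B** - Llama 2-based Korean conversational model (~14GB, high-spec)
-5. **KoT-Llama2-7B-Chat** - Korean-optimized Llama 2 chat model (~14GB, high-spec)
-6. **KoAlpaca 5.8B** - Korean conversational model (~12GB, high-spec)
-7. **KULLM-Polyglot 5.8B** - Korean chat model from the Korea University NLP lab (~12GB, high-spec)
-## 🚀 Running Locally
-### 1. Clone the repository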
-```bash
-git clone <repository-url>
-cd simple-chatbot-gradio
-```
-### 2. Install dependencies
-```bash
-pip install -r requirements.txt
 ```
-### 3. Set environment variables (optional)
-If you only use public models, you can skip this step.
-If you need access to private models, set HF_TOKEN as an environment variable:
-```bash
-export HF_TOKEN=your_hugging_face_token_here
-```
-**How to get a Hugging Face token:**
-1. Log in to [Hugging Face](https://huggingface.co)
-2. Go to Settings → Access Tokens
-3. Click "New token" to create a token
-### 4. Run the application
-```bash
-python app.py
 ```
-Open `http://localhost:7860` in your browser.
-## 🌐 Deploying to Hugging Face Spaces
-### Method 1: Web UI
-1. Go to [Hugging Face Spaces](https://huggingface.co/spaces)
-2. Click "Create new Space"
-3. Select "Gradio" as the SDK
-4. Upload the files:
-   - `app.py`
-   - `requirements.txt`
-   - `README.md`
-5. (Optional) For private models: add `HF_TOKEN` under Settings → Repository secrets
-6. Wait for the automatic build and deployment (the first build takes 5-10 minutes)
-### Method 2: Git
 ```bash
-# Add the Hugging Face Space repository as a remote
-git remote add space https://huggingface.co/spaces/<username>/<space-name>
-# Push the files
-git add .
-git commit -m "Initial commit"
-git push space main
 ```
-## ⚙️ Tech Stack
-- **Framework**: Gradio 5.x
-- **ML libraries**: Transformers, PyTorch
-- **Language**: Python 3.10+
-- **Key libraries**:
-  - `gradio` - Web interface
-  - `transformers` - Model loading and inference
-  - `torch` - Deep learning framework
-  - `python-dotenv` - Environment variable management
-## 📝 Project Structure
-```
-simple-chatbot-gradio/
-├── app.py              # Main application
-├── requirements.txt    # Python dependencies
-├── README.md          # Project documentation
-├── .env               # Environment variables (git ignored)
-└── CLAUDE.md          # Development guide
-```
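-## ⚠️ Limitations and Notes
-### Performance
-- **CPU execution**: Runs on CPU without a GPU, so responses can be slow (5-30 seconds)
-- **Memory**: 1-16GB of RAM required, depending on model size
-- **First run**: The model download takes time (350MB-14GB)
-### Model Characteristics
-- **English models**: Unnatural responses to Korean input
-- **Korean models (Llama 2-based)**: Excellent conversation quality but memory-hungry (14GB+)
-- **Korean models (Polyglot-based)**: Mid-sized, good conversation quality (12GB+)
-- **All Korean models**: Very slow on CPU; GPU recommended
-### Hugging Face Spaces Deployment
-- **Free tier**: CPU instances only (16GB RAM)
-- **Space sleep**: Sleeps automatically after 48 hours of inactivity; the first load afterwards is slow
-- **Memory limit**: Korean models cannot run on the free tier (12-14GB required)
-- **First run**: The model download takes 1-5 minutes
-- **Recommended models**: Only DialoGPT Small/Medium and GPT-2 are stable on the free tier
-- **Korean chat**: Korean models are unavailable on the free tier; a paid GPU tier is required
-## 🔧 Development and Customization
-### Adding a Model
-Add the new model to the `MODELS` dictionary in `app.py`: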
-```python
-MODELS = {
-    "your-model-id": {
-        "name": "Model display name",
-        "max_length": 512,
-        "temperature": 0.7,
-    },
-}
-```
-### UI Customization
-You can change the UI by modifying Gradio Blocks and ChatInterface. See the [Gradio documentation](https://www.gradio.app/docs) for details.
 ## 📄 License
 MIT License
-## 🙋‍♂️ Support
 If you have issues or questions, please reach out via GitHub Issues.

 ---
+title: Llama-2-Ko Chatbot
 emoji: 🤖
 colorFrom: blue
 colorTo: purple
 license: mit
 ---
+# 🤖 Llama-2-Ko 7B Chatbot (ZeroGPU)
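+A conversational chatbot built on the Korean-optimized Llama-2-Ko 7B model. It uses ZeroGPU to provide GPU-accelerated inference for free.
+## ✨ Key Features
+- **🇰🇷 Optimized for Korean conversation**: Uses the Llama-2-Ko 7B model
+- **⚡ GPU acceleration**: Fast responses (3-5 seconds) on NVIDIA H200 via ZeroGPU
+- **💰 Economical**: 25 minutes of free usage per day with a PRO subscription
+- **🔄 Automatic GPU allocation**: A GPU is allocated on request and released automatically
+## 🎯 Model Information
+- **Model**: `beomi/llama-2-ko-7b`
+- **Size**: ~14GB
+- **Highlights**: Llama 2-based model specialized for Korean conversation
+- **Hardware**: NVIDIA H200 (70GB VRAM)
+## 🚀 Usage
+### Using on Hugging Face Spaces
+1. Open this Space
+2. Enter a message in Korean
+3. The first response takes 10-15 seconds because of model loading
+4. Subsequent responses are generated within 3-5 seconds
+### Test Examples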
 ```
+안녕하세요 (Hello)
+인공지능에 대해 설명해주세요 (Please explain artificial intelligence)
+오늘 날씨가 어때요? (How is the weather today?)
+```
+## ⚙️ Tech Stack
+- **Framework**: Gradio 5.x
+- **ML libraries**: Transformers, PyTorch
+- **GPU infrastructure**: Hugging Face ZeroGPU
+- **Language**: Python 3.10+
+## 📝 ZeroGPU Setup
+### 1. Prerequisites
+- Hugging Face PRO subscription ($9/month)
+- ZeroGPU hardware selected (in Space Settings)
+### 2. Code Structure
+```python
+import spaces  # ZeroGPU decorator
+@spaces.GPU(duration=120)  # Request a GPU (up to 120 seconds)
+def generate_response(message, history):
+    model.to('cuda')  # Move the model to the GPU
+    # ... inference logic ...
+    return response
 ```
+### 3. requirements.txt
+```
+gradio==5.9.1
+transformers==4.46.0
+torch==2.1.0
+spaces  # required for ZeroGPU
+```
+### 4. Space Settings
+1. Go to Space Settings → Hardware
+2. Select **ZeroGPU** (PRO subscribers only)
+3. Deploy
+## 🔧 Running Locally (GPU required)
 ```bash
+# Clone the repository
+git clone <repository-url>
+cd simple-chatbot-gradio
+# Install dependencies
+pip install -r requirements.txt
+# Set the HF token (required)
+export HF_TOKEN=your_hugging_face_token
+# Run (CUDA GPU required)
+python app.py
 ```
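+**Note**: Running locally requires a CUDA GPU (at least 16GB VRAM recommended)
+## ⚠️ Limitations
+### ZeroGPU Usage Limits
+- **PRO subscription**: 25 minutes of free usage per day
+- **First load**: The initial response is slow because of the model download (~10-15 seconds)
+- **Queue**: Waiting may occur when there are many users
+### Model Characteristics
+- **Korean-specialized**: English input yields lower quality than Korean
+- **Conversation length**: Context is limited in long conversations (only the last 3 turns are kept)
+- **Response length**: Up to 150 tokens
+## 💡 Optimization Tips
+### Using ZeroGPU Efficiently
+1. **Duration setting**: Request only as much time as you actually need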
+   ```python
+   @spaces.GPU(duration=60)  # for short responses
+   ```
+2. **Model caching**: Reuse the model through a global variable
+   ```python
+   model = None  # global variable
+   def load_model_once():
+       global model
+       if model is None:
+           model = AutoModelForCausalLM.from_pretrained(...)
+   ```
+3. **Use float16**: Saves GPU memory
+   ```python
+   model = AutoModelForCausalLM.from_pretrained(
+       ...,
+       torch_dtype=torch.float16,
+   )
+   ```
+## 📊 Cost Comparison
+| Option | Monthly cost | Hardware | Constraints |
+|------|---------|---------|----------|
+| **ZeroGPU (PRO)** | $9 | H200 (70GB VRAM) | 25 minutes per day |
+| CPU Upgrade (32GB RAM) | $22 | 8 vCPU | Slow (30 seconds to 1 minute) |
+| T4 Medium GPU | $438 | T4 (16GB VRAM, 30GB RAM) | None |
+## 🔗 Related Resources
+- [Llama-2-Ko Model Card](https://huggingface.co/beomi/llama-2-ko-7b)
+- [ZeroGPU Documentation](https://huggingface.co/docs/hub/spaces-zerogpu)
+- [Gradio Documentation](https://www.gradio.app/docs)
 ## 📄 License
 MIT License
+## 🙋‍♂️ Contact
 If you have issues or questions, please reach out via GitHub Issues.
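
The README's "~14GB" model size and its float16 tip are linked by simple arithmetic, sketched below. The round 7-billion parameter count is an assumption for illustration (the exact count for `beomi/llama-2-ko-7b` is not given in this commit), and the estimate covers weights only, not activations or the KV cache.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

# Assumed round figure for a "7B" model; not taken from the model card.
ASSUMED_PARAMS = 7e9


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate weight storage in GB (10^9 bytes) for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


fp16_gb = weight_memory_gb(ASSUMED_PARAMS, "float16")
fp32_gb = weight_memory_gb(ASSUMED_PARAMS, "float32")

print(f"float16: {fp16_gb:.0f} GB, float32: {fp32_gb:.0f} GB")
```

Under this assumption float16 weights take about 14 GB, matching the README's figure and fitting within the 16GB VRAM recommended for local runs, while float32 would need roughly twice that.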

app.py CHANGED

@@ -1,122 +1,117 @@
 """
-Incremental version: Single model (DialoGPT-small only)
-Testing model loading on HF Spaces
 """
 import os
 import gradio as gr
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
-import warnings
-# Suppress torch_dtype deprecation warning
-warnings.filterwarnings('ignore', message='.*torch_dtype.*deprecated.*')
 # Get HF token from environment
 HF_TOKEN = os.getenv("HF_TOKEN", None)
-# Check device
-device = "cuda" if torch.cuda.is_available() else "cpu"
-print(f"Using device: {device}")
-# Single model only for testing
-MODELS = {
-    "microsoft/DialoGPT-small": {
-        "name": "DialoGPT Small (English, fast)",
-        "max_length": 80,
-    },
 }
-# Model cache
-loaded_models = {}
-loaded_tokenizers = {}
-def load_model(model_name):
-    """Load model and tokenizer"""
-    if model_name not in loaded_models:
-        try:
-            print(f"Loading model: {model_name}")
-            # Load tokenizer
-            tokenizer = AutoTokenizer.from_pretrained(
-                model_name,
-                token=HF_TOKEN,
-                padding_side='left',
-            )
-            if tokenizer.pad_token is None:
-                tokenizer.pad_token = tokenizer.eos_token
-            # Load model
-            model = AutoModelForCausalLM.from_pretrained(
-                model_name,
-                token=HF_TOKEN,
-                torch_dtype=torch.float32,
-                low_cpu_mem_usage=True,
-            )
-            model.to(device)
-            model.eval()
-            loaded_models[model_name] = model
-            loaded_tokenizers[model_name] = tokenizer
-            print(f"✅ Model {model_name} loaded successfully")
-        except Exception as e:
-            print(f"❌ Failed to load model {model_name}: {e}")
-            import traceback
-            print(traceback.format_exc())
-            return None, None
-    return loaded_models.get(model_name), loaded_tokenizers.get(model_name)
-def chat_response(message, history):
-    """Generate chatbot response"""
     if not message or not message.strip():
         return history
     try:
-        model_name = "microsoft/DialoGPT-small"
-        model, tokenizer = load_model(model_name)
-        if model is None or tokenizer is None:
             return history + [[message, "❌ Failed to load the model."]]
-        model_config = MODELS[model_name]
         # Build conversation context
         conversation = ""
-        for user_msg, bot_msg in history:
             if user_msg:
-                conversation += f"{user_msg}\n"
             if bot_msg:
-                conversation += f"{bot_msg}\n"
-        conversation += f"{message}\n"
         # Tokenize
-        inputs = tokenizer.encode(conversation, return_tensors="pt").to(device)
         # Generate response
         with torch.no_grad():
-            outputs = model.generate(
                 inputs,
-                max_new_tokens=model_config["max_length"],
-                temperature=0.9,
                 do_sample=True,
-                pad_token_id=tokenizer.pad_token_id,
-                eos_token_id=tokenizer.eos_token_id,
             )
         # Decode response
-        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
-        response = response[len(conversation):].strip()
         if not response:
-            response = "I understand. Could you tell me more?"
         return history + [[message, response]]
@@ -130,43 +125,58 @@ def chat_response(message, history):
         return history + [[message, f"❌ Error: {error_msg[:200]}"]]
-print("✅ App initialized - model will load on first use")
 # Create Gradio interface
-with gr.Blocks(title="🤖 Simple Chatbot") as demo:
     gr.Markdown("""
-    # 🤖 Simple Chatbot (Single Model Test)
-    **Model**: DialoGPT Small (English conversation)
-    - First message will be slow (model loading)
-    - Subsequent messages will be faster
     """)
     chatbot = gr.Chatbot(height=400, type="tuples", show_label=False)
     with gr.Row():
         msg = gr.Textbox(
-            placeholder="Type a message in English...",
             show_label=False,
             scale=9,
         )
-        btn = gr.Button("Send", scale=1, variant="primary")
-    clear = gr.Button("🗑️ Clear Chat", size="sm")
     def submit(message, history):
-        return chat_response(message, history), ""
-    btn.click(submit, [msg, chatbot], [chatbot, msg], queue=False)
-    msg.submit(submit, [msg, chatbot], [chatbot, msg], queue=False)
-    clear.click(lambda: [], outputs=chatbot, queue=False)
     gr.Markdown("""
     ---
-    **Note**:
-    - This is a test version with only one model
-    - First response will take 5-10 seconds (model loading)
-    - Uses DialoGPT-small (~350MB)
     """)
 if __name__ == "__main__":

 """
+ZeroGPU version: Llama-2-Ko 7B with dynamic GPU allocation
+Requires: PRO subscription + ZeroGPU hardware selection in Space settings
 """
 import os
 import gradio as gr
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
+import spaces  # ZeroGPU decorator
 # Get HF token from environment
 HF_TOKEN = os.getenv("HF_TOKEN", None)
+# Model configuration
+MODEL_NAME = "beomi/llama-2-ko-7b"
+MODEL_CONFIG = {
+    "name": "Llama-2-Ko 7B (Korean conversational)",
+    "max_length": 150,
 }
+# Global model cache (loaded once)
+model = None
+tokenizer = None
+def load_model_once():
+    """Load model and tokenizer once, on first use, and cache them globally"""
+    global model, tokenizer
+    if model is None:
+        print(f"🔄 Loading model: {MODEL_NAME}")
+        # Load tokenizer
+        tokenizer = AutoTokenizer.from_pretrained(
+            MODEL_NAME,
+            token=HF_TOKEN,
+            trust_remote_code=True,
+        )
+        if tokenizer.pad_token is None:
+            tokenizer.pad_token = tokenizer.eos_token
+        # Load model in float16; generate_response() moves it to the GPU explicitly
+        model = AutoModelForCausalLM.from_pretrained(
+            MODEL_NAME,
+            token=HF_TOKEN,
+            torch_dtype=torch.float16,  # Use float16 for GPU
+            low_cpu_mem_usage=True,
+            trust_remote_code=True,
+        )
+        print(f"✅ Model {MODEL_NAME} loaded successfully")
+    return model, tokenizer
+@spaces.GPU(duration=120)  # Request GPU for 120 seconds max
+def generate_response(message, history):
+    """Generate chatbot response with GPU acceleration"""
     if not message or not message.strip():
         return history
     try:
+        # Ensure model is loaded
+        current_model, current_tokenizer = load_model_once()
+        if current_model is None or current_tokenizer is None:
             return history + [[message, "❌ Failed to load the model."]]
+        # Move the model to the GPU explicitly (the decorator allocates the GPU but does not move the model)
+        current_model.to('cuda')
         # Build conversation context
         conversation = ""
+        for user_msg, bot_msg in history[-3:]:  # Last 3 turns for context
             if user_msg:
+                conversation += f"사용자: {user_msg}\n"
             if bot_msg:
+                conversation += f"어시스턴트: {bot_msg}\n"
+        conversation += f"사용자: {message}\n어시스턴트:"
         # Tokenize
+        inputs = current_tokenizer.encode(
+            conversation,
+            return_tensors="pt",
+            truncation=True,  # tokenizer keyword is `truncation`, not `truncate`
+            max_length=512,
+        ).to('cuda')
         # Generate response
         with torch.no_grad():
+            outputs = current_model.generate(
                 inputs,
+                max_new_tokens=MODEL_CONFIG["max_length"],
+                temperature=0.7,
+                top_p=0.9,
                 do_sample=True,
+                pad_token_id=current_tokenizer.pad_token_id,
+                eos_token_id=current_tokenizer.eos_token_id,
             )
         # Decode response
+        full_response = current_tokenizer.decode(outputs[0], skip_special_tokens=True)
+        # Extract only the assistant's response
+        if "어시스턴트:" in full_response:
+            response = full_response.split("어시스턴트:")[-1].strip()
+        else:
+            response = full_response[len(conversation):].strip()
         if not response:
+            response = "Sorry, I could not generate a response."
         return history + [[message, response]]
         return history + [[message, f"❌ Error: {error_msg[:200]}"]]
+def chat_wrapper(message, history):
+    """Thin wrapper called by the Gradio Blocks event handlers"""
+    return generate_response(message, history)
+print("✅ App initialized - ZeroGPU will allocate GPU on demand")
 # Create Gradio interface
+with gr.Blocks(title="🤖 Llama-2-Ko Chatbot") as demo:
     gr.Markdown("""
+    # 🤖 Llama-2-Ko 7B Chatbot (ZeroGPU)
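+    **Model**: Llama-2-Ko 7B (Korean conversational model)
+    **Hardware**: NVIDIA H200 (ZeroGPU, allocated automatically)
+    **Features**:
+    - ⚡ Fast responses with GPU acceleration (3-5 seconds)
+    - 🇰🇷 Optimized for Korean conversation
+    - 🔄 The first response may take a little longer because of model loading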
     """)
     chatbot = gr.Chatbot(height=400, type="tuples", show_label=False)
     with gr.Row():
         msg = gr.Textbox(
+            placeholder="Enter a message in Korean...",
             show_label=False,
             scale=9,
         )
+        btn = gr.Button("Send", scale=1, variant="primary")
+    clear = gr.Button("🗑️ Clear Chat", size="sm")
     def submit(message, history):
+        return chat_wrapper(message, history), ""
+    btn.click(submit, [msg, chatbot], [chatbot, msg])
+    msg.submit(submit, [msg, chatbot], [chatbot, msg])
+    clear.click(lambda: [], outputs=chatbot)
     gr.Markdown("""
     ---
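+    **Notes**:
+    - ZeroGPU allocates a GPU automatically on request
+    - PRO subscribers get 25 minutes of free usage per day
+    - The first response includes model loading time (~10-15 seconds)
+    - Subsequent responses are generated quickly (~3-5 seconds)
+    **Test examples**:
+    - "안녕하세요" (Hello)
+    - "인공지능에 대해 설명해주세요" (Please explain artificial intelligence)
+    - "오늘 날씨가 어때요?" (How is the weather today?)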
     """)
 if __name__ == "__main__":
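
The prompt formatting and reply extraction in the new `generate_response` can be exercised without loading the model. The sketch below pulls that logic into two pure functions. The names `build_prompt`, `extract_reply`, and the constants are hypothetical and do not appear in the commit; the Korean role labels ("사용자:" for the user, "어시스턴트:" for the assistant) and the 3-turn window are taken from the diff.

```python
from typing import List, Optional, Sequence

USER_LABEL = "사용자:"           # "User:" label used in the commit's prompt
ASSISTANT_LABEL = "어시스턴트:"   # "Assistant:" label used in the commit's prompt
MAX_TURNS = 3                    # the commit keeps only the last 3 turns

History = Sequence[Sequence[Optional[str]]]  # Gradio "tuples" format: [[user, bot], ...]


def build_prompt(message: str, history: History, max_turns: int = MAX_TURNS) -> str:
    """Format the most recent turns plus the new message, ending with the assistant cue."""
    lines: List[str] = []
    recent = history[-max_turns:] if max_turns > 0 else []
    for user_msg, bot_msg in recent:
        if user_msg:
            lines.append(f"{USER_LABEL} {user_msg}")
        if bot_msg:
            lines.append(f"{ASSISTANT_LABEL} {bot_msg}")
    lines.append(f"{USER_LABEL} {message}")
    return "\n".join(lines) + f"\n{ASSISTANT_LABEL}"


def extract_reply(full_text: str, prompt: str) -> str:
    """Return the text after the last assistant label, falling back to slicing off the prompt."""
    if ASSISTANT_LABEL in full_text:
        return full_text.split(ASSISTANT_LABEL)[-1].strip()
    return full_text[len(prompt):].strip()


history = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
prompt = build_prompt("hi", history)
reply = extract_reply(prompt + " 안녕하세요", prompt)
```

One consequence of splitting on the last assistant label: if the model goes on to generate a further "사용자:" turn, that text stays in the reply, so a stricter version might also cut at the next user label.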

requirements.txt CHANGED

@@ -3,3 +3,4 @@ transformers==4.46.0
 torch==2.1.0
 safetensors==0.4.5
 accelerate==0.26.1

 torch==2.1.0
 safetensors==0.4.5
 accelerate==0.26.1
+spaces
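
Because `app.py` now imports `spaces` unconditionally, a local environment without that package would fail at import time. One possible workaround, which is not part of this commit, is a guarded import with a no-op stand-in. The `gpu` name and the fallback decorator below are hypothetical; only `spaces.GPU` and its `duration` argument come from the diff.

```python
try:
    import spaces  # provided on Hugging Face Spaces via requirements.txt
    gpu = spaces.GPU
except ImportError:
    def gpu(func=None, *, duration=60):
        """Hypothetical no-op stand-in supporting both @gpu and @gpu(duration=...)."""
        def decorate(f):
            return f  # leave the function unchanged when ZeroGPU is unavailable
        return decorate(func) if callable(func) else decorate


@gpu(duration=120)
def generate(text: str) -> str:
    # Placeholder body; the real app would run model inference here.
    return text.upper()


print(generate("hello"))
```

With this pattern the decorated function behaves the same whether or not the `spaces` package is installed, which keeps the local run path in the README workable on a plain CUDA machine.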