alex4cip Claude committed on Oct 29

Commit b44b68c

1 Parent(s): aa3c1df

feat: Add flexible hardware support (ZeroGPU + CPU Upgrade)

✨ Features:
- Automatic hardware detection (ZeroGPU vs CPU Upgrade)
- Conditional @spaces.GPU decorator application
- Dynamic UI based on hardware type
- No code changes needed when switching hardware

🔧 Technical Implementation:
- Try/except for 'spaces' import detection
- Shared generation logic (generate_response_impl)
- Conditional decorator wrapper
- Device-aware model loading (float16 for GPU, float32 for CPU)

📚 Documentation:
- Comprehensive README with hardware comparison
- Setup guides for both hardware options
- Performance benchmarks and cost analysis
- Usage scenarios and optimization tips

🎯 Benefits:
- ZeroGPU: Fast (3-5s), cheap ($9/mo), 25min/day limit
- CPU Upgrade: Slower (30s-1m), pricey ($22/mo), unlimited
- Easy switching via Space Settings (no code changes)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
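
The cost and quota figures quoted in this commit follow from simple arithmetic. A minimal sketch, assuming a 730-hour average month and assuming each response draws only its own generation time from the daily ZeroGPU quota (both are assumptions for illustration; queueing and model loading are ignored):

```python
HOURS_PER_MONTH = 730  # assumption: 365 * 24 / 12, an average month


def monthly_cost(hourly_rate_usd: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of keeping always-on hardware running for one month."""
    return hourly_rate_usd * hours


def responses_per_day(quota_minutes: int, seconds_per_response: int) -> int:
    """Number of responses that fit into a daily GPU quota."""
    return (quota_minutes * 60) // seconds_per_response


cpu_upgrade_monthly = monthly_cost(0.03)  # CPU Upgrade at $0.03/hour
fastest = responses_per_day(25, 3)        # 25 min/day at 3 s per response
slowest = responses_per_day(25, 5)        # 25 min/day at 5 s per response

print(f"CPU Upgrade: ${cpu_upgrade_monthly:.2f}/month")
print(f"ZeroGPU: {slowest}-{fastest} responses/day")
```

Under these assumptions the CPU Upgrade rate works out to roughly the $22/month quoted above, and the 25-minute quota covers a few hundred short responses per day.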

Files changed (2)

README.md +167 -92
app.py +119 -44

README.md CHANGED

@@ -10,155 +10,226 @@ pinned: false
 license: mit
 ---
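-# 🤖 Llama-2-Ko 7B Chatbot (ZeroGPU)
-A conversational chatbot built on the Korean-optimized Llama-2-Ko 7B model. It uses ZeroGPU to provide GPU-accelerated inference free of charge.
 ## ✨ Key Features
 - **🇰🇷 Optimized for Korean conversation**: uses the Llama-2-Ko 7B model
-- **⚡ GPU acceleration**: fast responses (3-5 s) on an NVIDIA H200 via ZeroGPU
-- **💰 Economical**: a PRO subscription includes 25 minutes of free use per day
-- **🔄 Automatic GPU allocation**: a GPU is allocated on request and released afterwards
 ## 🎯 Model Information
 - **Model**: `beomi/llama-2-ko-7b`
 - **Size**: ~14GB
 - **Characteristics**: a Llama 2-based model specialized for Korean conversation
-- **Hardware**: NVIDIA H200 (70GB VRAM)
-## 🚀 How to Use
-### Using it on Hugging Face Spaces
-1. Open this Space
-2. Enter a message in Korean
-3. The first response takes 10-15 seconds because the model has to load
-4. Later responses are generated within 3-5 seconds
-### Test Examples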
-```
-안녕하세요
-인공지능에 대해 설명해주세요
-오늘 날씨가 어때요?
-```
-## ⚙️ Tech Stack
-- **Framework**: Gradio 5.x
-- **ML libraries**: Transformers, PyTorch
-- **GPU infrastructure**: Hugging Face ZeroGPU
-- **Language**: Python 3.10+
-## 📝 ZeroGPU Setup
-### 1. Requirements
-- Hugging Face PRO subscription ($9/month)
-- ZeroGPU hardware selected (in Space Settings)
-### 2. Code Structure
-```python
-import spaces  # ZeroGPU decorator
-@spaces.GPU(duration=120)  # request a GPU (up to 120 seconds)
-def generate_response(message, history):
-    model.to('cuda')  # move the model to the GPU
-    # ... inference logic ...
-    return response
 ```
-### 3. requirements.txt
 ```
-gradio==5.9.1
-transformers==4.46.0
-torch==2.1.0
-spaces  # required for ZeroGPU
 ```
-### 4. Space Settings
-1. Go to Space Settings → Hardware
-2. Select **ZeroGPU** (PRO subscribers only)
-3. Deploy
-## 🔧 Running Locally (GPU required)
-```bash
-# Clone the repository
-git clone <repository-url>
-cd simple-chatbot-gradio
-# Install dependencies
-pip install -r requirements.txt
-# Set the HF token (required)
-export HF_TOKEN=your_hugging_face_token
-# Run (requires a CUDA GPU)
-python app.py
-```
-**Note**: running locally requires a CUDA GPU (at least 16GB VRAM recommended)
 ## ⚠️ Limitations
-### ZeroGPU Usage Limits
-- **PRO subscription**: 25 minutes of free use per day
-- **First load**: the initial response is slow (~10-15 s) because the model is downloaded
-- **Queue**: you may have to wait when many users are active
-### Model Characteristics
-- **Korean-specialized**: English input yields lower quality than Korean
-- **Conversation length**: context is limited in long conversations (only the last 3 turns are kept)
-- **Response length**: up to 150 tokens
-## 💡 Optimization Tips
-### Using ZeroGPU Efficiently
-1. **Set the duration**: request only as much time as you actually need
-   ```python
-   @spaces.GPU(duration=60)  # for short responses
-   ```
-2. **Cache the model**: reuse the model through a global variable
-   ```python
-   model = None  # global variable
-   def load_model_once():
-       global model
-       if model is None:
-           model = AutoModelForCausalLM.from_pretrained(...)
-   ```
-3. **Use float16**: saves GPU memory
-   ```python
-   model = AutoModelForCausalLM.from_pretrained(
-       ...,
-       torch_dtype=torch.float16,
-   )
-   ```
-## 📊 Cost Comparison
-| Option | Monthly cost | Hardware | Constraints |
-|------|---------|---------|----------|
-| **ZeroGPU (PRO)** | $9 | H200 (70GB) | 25 min/day |
-| CPU Upgrade (32GB) | $22 | 8 vCPU | Slow (30 s to 1 min) |
-| T4 Medium GPU | $438 | T4 (16GB VRAM, 30GB RAM) | None |
 ## 🔗 Related Resources
 - [Llama-2-Ko Model Card](https://huggingface.co/beomi/llama-2-ko-7b)
 - [ZeroGPU Documentation](https://huggingface.co/docs/hub/spaces-zerogpu)
 - [Gradio Documentation](https://www.gradio.app/docs)
 ## 📄 License
@@ -167,3 +238,7 @@ MIT License
 ## 🙋‍♂️ Contact
 If you have issues or questions, please reach out via GitHub Issues.

 license: mit
 ---
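+# 🤖 Llama-2-Ko 7B Chatbot (Flexible Hardware)
+A conversational chatbot built on the Korean-optimized Llama-2-Ko 7B model. It supports both **ZeroGPU** and **CPU Upgrade** hardware.
 ## ✨ Key Features
 - **🇰🇷 Optimized for Korean conversation**: uses the Llama-2-Ko 7B model
+- **⚡ Flexible hardware support**: detects ZeroGPU or CPU Upgrade automatically
+- **🔄 Automatic switching**: no code changes needed when the hardware changes
+- **💰 Cost-efficient**: pick the hardware that fits the situation
 ## 🎯 Model Information
 - **Model**: `beomi/llama-2-ko-7b`
 - **Size**: ~14GB
 - **Characteristics**: a Llama 2-based model specialized for Korean conversation
+## 🚀 Hardware Options
+### Option 1: ZeroGPU (recommended)
+**Pros**:
+- ⚡ Fast responses (3-5 s)
+- 💰 Low cost ($9/month)
+- 🔋 Automatic GPU allocation and release
+**Constraints**:
+- 25 minutes of free use per day (PRO subscription required)
+- Possible queueing (when many users are active)
+**Cost**: $9/month (PRO subscription)
+### Option 2: CPU Upgrade
+**Pros**:
+- ⏰ Unlimited use
+- 📊 Predictable performance
+- 🔧 Simple setup
+**Constraints**:
+- 🐢 Slow responses (30 s to 1 min)
+- 💵 Relatively expensive
+**Cost**: $0.03/hour (about $22/month)
+## ⚙️ Hardware Setup
+### Switching to ZeroGPU
+1. Space Settings → Hardware
+2. Select **ZeroGPU**
+3. Confirm
+4. Wait for the build to finish (1-2 min)
+→ Check that the UI shows "ZeroGPU"
+### Switching to CPU Upgrade
+1. Space Settings → Hardware
+2. Select **CPU Upgrade (8 vCPU / 32 GB)**
+3. Confirm
+4. Wait for the build to finish (1-2 min)
+→ Check that the UI shows "CPU Upgrade"
+## 📊 Performance Comparison
+| Item | ZeroGPU | CPU Upgrade |
+|------|---------|-------------|
+| **First response** | 10-15 s | 1-2 min |
+| **Later responses** | 3-5 s | 30 s to 1 min |
+| **Daily limit** | 25 min | Unlimited |
+| **Monthly cost** | $9 | $22 |
+| **GPU** | H200 (70GB) | None |
+| **RAM** | - | 32GB |
+## 🔧 Technical Structure
+### Automatic Hardware Detection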
+```python
+# Auto-detect whether ZeroGPU support is available
+try:
+    import spaces
+    ZEROGPU_AVAILABLE = True
+except ImportError:
+    ZEROGPU_AVAILABLE = False
+# Apply the decorator conditionally
+if ZEROGPU_AVAILABLE:
+    @spaces.GPU(duration=120)
+    def generate_response(message, history):
+        return generate_response_impl(message, history)
+else:
+    def generate_response(message, history):
+        return generate_response_impl(message, history)
 ```
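
The two branches in the snippet above differ only in the decorator. One possible way to avoid defining `generate_response` twice is a small decorator factory. This is a sketch, not the commit's code: `maybe_gpu` is a hypothetical helper name, and the `generate_response_impl` body is a stand-in for the real generation logic.

```python
from typing import Callable, TypeVar

F = TypeVar("F", bound=Callable)

try:
    import spaces  # available only when the `spaces` package is installed
    ZEROGPU_AVAILABLE = True
except ImportError:
    spaces = None
    ZEROGPU_AVAILABLE = False


def maybe_gpu(duration: int = 120) -> Callable[[F], F]:
    """Return spaces.GPU(duration=...) when `spaces` imports, else a no-op decorator."""
    if ZEROGPU_AVAILABLE:
        return spaces.GPU(duration=duration)

    def identity(func: F) -> F:
        return func

    return identity


def generate_response_impl(message, history):
    # Stand-in for the shared generation logic (hypothetical echo behaviour).
    return history + [[message, f"echo: {message}"]]


@maybe_gpu(duration=120)
def generate_response(message, history):
    return generate_response_impl(message, history)


print(generate_response("안녕하세요", []))
```

With this shape the function is written once, and the choice between the ZeroGPU decorator and a pass-through lives in a single place.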
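+### Dynamic UI Generation
+- ZeroGPU mode: shows a GPU acceleration notice
+- CPU Upgrade mode: shows a CPU limitations notice
+- Hardware information is displayed automatically
+## 📝 How to Use
+### 1. Open the Space
+https://huggingface.co/spaces/alex4cip/simple-chat
+### 2. Check the hardware
+- The current hardware is shown at the top of the UI
+- Either "ZeroGPU" or "CPU Upgrade"
+### 3. Start chatting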
 ```
+안녕하세요
+인공지능에 대해 설명해주세요
+한국의 수도는 어디인가요?
 ```
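+## 💡 Optimization Tips
+### ZeroGPU mode
+1. **Keep conversations short**: long conversations consume GPU time
+2. **Write efficient prompts**: ask clear, concise questions
+3. **Manage the daily limit**: stay within 25 minutes
+### CPU Upgrade mode
+1. **Be patient**: responses take longer
+2. **Batch questions**: ask several questions at once
+3. **Long sessions**: unlimited use, 24 hours a day
+## 🔗 Hardware Switching Scenarios
+### Scenario 1: Quick demo (ZeroGPU)
+- Demo to many people in a short time
+- Fast responses make a good impression
+- The daily limit is sufficient
+### Scenario 2: Long development sessions (CPU Upgrade)
+- Continuous testing is needed
+- No daily limit to worry about
+- Slower speed is accepted
+### Scenario 3: Mixed use
+- Normally: CPU Upgrade
+- For demos: switch to ZeroGPU
+- No code changes needed (auto-detection)
 ## ⚠️ Limitations
+### Common
+- **Model size**: 14GB (loading takes time)
+- **Context**: only the last 3 turns are kept
+- **Korean-specialized**: quality is lower for English input
+### ZeroGPU only
+- **Daily limit**: 25 minutes (PRO subscription)
+- **Queue**: waits when many users are active
+- **PRO required**: $9/month subscription
+### CPU Upgrade only
+- **Slow**: 30 s to 1 min per response
+- **Cost**: $0.03/hour ($22/month)
+- **Performance**: at least 10x slower than GPU
+## 📦 Running Locally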
+```bash
+# Clone the repository
+git clone <repository-url>
+cd simple-chatbot-gradio
+# Install dependencies
+pip install -r requirements.txt
+# Set the HF token
+export HF_TOKEN=your_hugging_face_token
+# Run (GPU recommended)
+python app.py
+```
+**Note**: without a CUDA GPU, the app runs locally in CPU mode (very slow)
+## 🛠️ Tech Stack
+- **Framework**: Gradio 5.x
+- **ML libraries**: Transformers, PyTorch
+- **GPU infrastructure**: Hugging Face ZeroGPU (optional)
+- **Language**: Python 3.10+
+## 📚 Dependencies
+```txt
+gradio==5.9.1
+transformers==4.46.0
+torch==2.1.0
+safetensors==0.4.5
+accelerate==0.26.1
+spaces  # ZeroGPU support (optional)
+```
 ## 🔗 Related Resources
 - [Llama-2-Ko Model Card](https://huggingface.co/beomi/llama-2-ko-7b)
 - [ZeroGPU Documentation](https://huggingface.co/docs/hub/spaces-zerogpu)
 - [Gradio Documentation](https://www.gradio.app/docs)
+- [HF Spaces Pricing](https://huggingface.co/pricing)
 ## 📄 License
 ## 🙋‍♂️ Contact
 If you have issues or questions, please reach out via GitHub Issues.
+---
+**💡 TIP**: choose ZeroGPU for quick demos and CPU Upgrade for long-running use!

app.py CHANGED

@@ -1,9 +1,17 @@
 """
-ZeroGPU version: Llama-2-Ko 7B with dynamic GPU allocation
-Requires: PRO subscription + ZeroGPU hardware selection in Space settings
 """
-import spaces  # ZeroGPU decorator - MUST be imported first!
 import os
 import gradio as gr
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -41,23 +49,40 @@ def load_model_once():
         if tokenizer.pad_token is None:
             tokenizer.pad_token = tokenizer.eos_token
-        # Load model - will be moved to GPU by @spaces.GPU decorator
-        model = AutoModelForCausalLM.from_pretrained(
-            MODEL_NAME,
-            token=HF_TOKEN,
-            torch_dtype=torch.float16,  # Use float16 for GPU
-            low_cpu_mem_usage=True,
-            trust_remote_code=True,
-        )
         print(f"✅ Model {MODEL_NAME} loaded successfully")
     return model, tokenizer
-@spaces.GPU(duration=120)  # Request GPU for 120 seconds max
-def generate_response(message, history):
-    """Generate chatbot response with GPU acceleration"""
     if not message or not message.strip():
         return history
@@ -68,12 +93,12 @@ def generate_response(message, history):
         if current_model is None or current_tokenizer is None:
             return history + [[message, "❌ Could not load the model."]]
-        # Move model to GPU (ZeroGPU handles this automatically)
-        current_model.to('cuda')
-        # Build conversation context
         conversation = ""
-        for user_msg, bot_msg in history[-3:]:  # Last 3 turns for context
             if user_msg:
                 conversation += f"사용자: {user_msg}\n"
             if bot_msg:
@@ -87,7 +112,7 @@ def generate_response(message, history):
             return_tensors="pt",
             truncation=True,
             max_length=512,
-        ).to('cuda')
         # Generate response
         with torch.no_grad():
@@ -125,26 +150,58 @@ def generate_response(message, history):
         return history + [[message, f"❌ Error: {error_msg[:200]}"]]
 def chat_wrapper(message, history):
     """Wrapper for Gradio ChatInterface"""
     return generate_response(message, history)
-print("✅ App initialized - ZeroGPU will allocate GPU on demand")
 # Create Gradio interface
 with gr.Blocks(title="🤖 Llama-2-Ko Chatbot") as demo:
-    gr.Markdown("""
-    # 🤖 Llama-2-Ko 7B Chatbot (ZeroGPU)
-    **모델**: Llama-2-Ko 7B (한글 대화형 모델)
-    **하드웨어**: NVIDIA H200 (ZeroGPU - 자동 할당)
-    **특징**:
-    - ⚡ GPU 가속으로 빠른 응답 (3-5초)
-    - 🇰🇷 한글 대화에 최적화
-    - 🔄 첫 응답은 모델 로딩으로 조금 더 소요될 수 있습니다
-    """)
     chatbot = gr.Chatbot(height=400, type="tuples", show_label=False)
@@ -165,19 +222,37 @@ with gr.Blocks(title="🤖 Llama-2-Ko Chatbot") as demo:
     msg.submit(submit, [msg, chatbot], [chatbot, msg])
     clear.click(lambda: [], outputs=chatbot)
-    gr.Markdown("""
-    ---
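-    **Notes**:
-    - ZeroGPU allocates a GPU automatically on request
-    - PRO subscribers get 25 minutes of free use per day
-    - The first response includes model loading time (~10-15 s)
-    - Later responses are generated quickly (~3-5 s)
-    **Test examples**: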
-    - "안녕하세요"
-    - "인공지능에 대해 설명해주세요"
-    - "오늘 날씨가 어때요?"
-    """)
 if __name__ == "__main__":
     demo.launch()

 """
+Flexible version: Works on both ZeroGPU and CPU Upgrade hardware
+Automatically detects hardware and adjusts accordingly
 """
+# Try to import spaces for ZeroGPU support
+try:
+    import spaces
+    ZEROGPU_AVAILABLE = True
+    print("✅ ZeroGPU support enabled")
+except ImportError:
+    ZEROGPU_AVAILABLE = False
+    print("ℹ️ ZeroGPU not available, using standard mode")
 import os
 import gradio as gr
 from transformers import AutoModelForCausalLM, AutoTokenizer
         if tokenizer.pad_token is None:
             tokenizer.pad_token = tokenizer.eos_token
+        # Detect device
+        device = "cuda" if torch.cuda.is_available() else "cpu"
+        print(f"📍 Using device: {device}")
+        # Load model with appropriate settings
+        if device == "cuda":
+            # GPU available (ZeroGPU or a dedicated GPU)
+            model = AutoModelForCausalLM.from_pretrained(
+                MODEL_NAME,
+                token=HF_TOKEN,
+                torch_dtype=torch.float16,  # Use float16 for GPU
+                low_cpu_mem_usage=True,
+                trust_remote_code=True,
+                device_map="auto",
+            )
+        else:
+            # CPU only
+            model = AutoModelForCausalLM.from_pretrained(
+                MODEL_NAME,
+                token=HF_TOKEN,
+                torch_dtype=torch.float32,  # Use float32 for CPU
+                low_cpu_mem_usage=True,
+                trust_remote_code=True,
+            )
+            model.to(device)
+        model.eval()
         print(f"✅ Model {MODEL_NAME} loaded successfully")
     return model, tokenizer
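
The float16/float32 split above has a direct memory cost. A rough weights-only estimate, assuming a round 7 billion parameters (an assumption; the exact parameter count of `beomi/llama-2-ko-7b` is not stated in this diff) and ignoring activations and the KV cache:

```python
BYTES_PER_PARAM = {"float16": 2, "float32": 4}


def weights_size_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 7e9  # assumption: nominal "7B" parameter count

gpu_gb = weights_size_gb(NUM_PARAMS, "float16")  # GPU branch
cpu_gb = weights_size_gb(NUM_PARAMS, "float32")  # CPU branch

print(f"float16: ~{gpu_gb:.0f} GB, float32: ~{cpu_gb:.0f} GB")
```

Under this assumption the float16 figure matches the README's "~14GB", while the float32 CPU branch would need about twice that, which leaves little headroom on a 32 GB CPU Upgrade instance.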
+def generate_response_impl(message, history):
+    """Core generation logic (same for both ZeroGPU and CPU)"""
     if not message or not message.strip():
         return history
         if current_model is None or current_tokenizer is None:
             return history + [[message, "❌ Could not load the model."]]
+        # Get device
+        device = next(current_model.parameters()).device
+        # Build conversation context (last 3 turns)
         conversation = ""
+        for user_msg, bot_msg in history[-3:]:
             if user_msg:
                 conversation += f"사용자: {user_msg}\n"
             if bot_msg:
             return_tensors="pt",
             truncation=True,
             max_length=512,
+        ).to(device)
         # Generate response
         with torch.no_grad():
         return history + [[message, f"❌ Error: {error_msg[:200]}"]]
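
The context handling in `generate_response_impl` (only the last 3 turns are kept) can be isolated into a small pure function, which makes it easy to test without loading the model. This is a sketch only: `build_prompt` is a hypothetical helper, and the bot label and the trailing prompt lines are assumptions, since the diff does not show how the original prompt ends.

```python
def build_prompt(message, history, max_turns=3, user_label="사용자", bot_label="챗봇"):
    """Build a plain-text prompt from the most recent turns plus the new message.

    `history` is a list of [user_msg, bot_msg] pairs (Gradio "tuples" format).
    `bot_label` and the final two lines are assumptions for illustration.
    """
    lines = []
    for user_msg, bot_msg in history[-max_turns:]:  # keep only the last turns
        if user_msg:
            lines.append(f"{user_label}: {user_msg}")
        if bot_msg:
            lines.append(f"{bot_label}: {bot_msg}")
    lines.append(f"{user_label}: {message}")
    lines.append(f"{bot_label}:")  # cue the model to answer
    return "\n".join(lines)


history = [[f"q{i}", f"a{i}"] for i in range(5)]
prompt = build_prompt("new question", history)
print(prompt)
```

With five turns of history, only `q2` through `q4` and their answers appear in the prompt, followed by the new message.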
+# Conditionally apply ZeroGPU decorator
+if ZEROGPU_AVAILABLE:
+    @spaces.GPU(duration=120)
+    def generate_response(message, history):
+        """GPU-accelerated response generation (ZeroGPU mode)"""
+        return generate_response_impl(message, history)
+else:
+    def generate_response(message, history):
+        """Standard response generation (CPU Upgrade mode)"""
+        return generate_response_impl(message, history)
 def chat_wrapper(message, history):
     """Wrapper for Gradio ChatInterface"""
     return generate_response(message, history)
+# Determine hardware info for UI
+hardware_info = "NVIDIA H200 (ZeroGPU)" if ZEROGPU_AVAILABLE else "CPU Upgrade (32GB RAM)"
+print(f"✅ App initialized - Hardware: {hardware_info}")
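
One caveat with deriving `hardware_info` from the import flag: `spaces` is listed in the dependencies, so the import may well succeed on CPU Upgrade too, and the UI would then report ZeroGPU on CPU hardware. A possible alternative is to check the runtime environment instead. The sketch below assumes ZeroGPU Spaces set a `SPACES_ZERO_GPU` environment variable to a truthy value; treat that variable name and its accepted values as assumptions to verify against the `spaces` package, and `running_on_zerogpu` and `hardware_label` as hypothetical helper names.

```python
import os
from typing import Mapping, Optional


def running_on_zerogpu(env: Optional[Mapping[str, str]] = None) -> bool:
    """Guess whether the process runs on ZeroGPU hardware.

    Assumption: ZeroGPU Spaces export SPACES_ZERO_GPU with a truthy value.
    """
    env = os.environ if env is None else env
    return env.get("SPACES_ZERO_GPU", "").strip().lower() in {"1", "true", "t"}


def hardware_label(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the UI label from the detected hardware rather than the import result."""
    if running_on_zerogpu(env):
        return "NVIDIA H200 (ZeroGPU)"
    return "CPU Upgrade (32GB RAM)"


print(hardware_label())
```

Passing an explicit mapping, for example `hardware_label({"SPACES_ZERO_GPU": "true"})`, makes the check testable locally without touching the real environment.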
 # Create Gradio interface
 with gr.Blocks(title="🤖 Llama-2-Ko Chatbot") as demo:
+    # Dynamic header based on hardware
+    if ZEROGPU_AVAILABLE:
+        header = """
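+        # 🤖 Llama-2-Ko 7B Chatbot (ZeroGPU)
+        **Model**: Llama-2-Ko 7B (Korean conversational model)
+        **Hardware**: NVIDIA H200 (ZeroGPU, allocated automatically)
+        **Features**:
+        - ⚡ Fast responses with GPU acceleration (3-5 s)
+        - 🇰🇷 Optimized for Korean conversation
+        - 🔄 The first response may take a little longer because the model has to load
+        - 💰 A PRO subscription includes 25 minutes of free use per day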
+        """
+    else:
+        header = """
+        # 🤖 Llama-2-Ko 7B Chatbot (CPU Upgrade)
+        **Model**: Llama-2-Ko 7B (Korean conversational model)
+        **Hardware**: CPU Upgrade (8 vCPU / 32 GB RAM)
+        **Features**:
+        - 🇰🇷 Optimized for Korean conversation
+        - 🔄 The first response may take longer because the model has to load (1-2 min)
+        - ⏳ Responses are fairly slow on CPU (30 s to 1 min)
+        - 💰 $0.03/hour (about $22/month)
+        """
+    gr.Markdown(header)
     chatbot = gr.Chatbot(height=400, type="tuples", show_label=False)
     msg.submit(submit, [msg, chatbot], [chatbot, msg])
     clear.click(lambda: [], outputs=chatbot)
+    # Dynamic footer based on hardware
+    if ZEROGPU_AVAILABLE:
+        footer = """
+        ---
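+        **Notes (ZeroGPU mode)**:
+        - ZeroGPU allocates a GPU automatically on request
+        - PRO subscribers get 25 minutes of free use per day
+        - The first response includes model loading time (~10-15 s)
+        - Later responses are generated quickly (~3-5 s)
+        **Test examples**: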
+        - "안녕하세요"
+        - "인공지능에 대해 설명해주세요"
+        - "오늘 날씨가 어때요?"
+        """
+    else:
+        footer = """
+        ---
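+        **Notes (CPU Upgrade mode)**:
+        - Responses are slow because the app runs on CPU (30 s to 1 min)
+        - The first response includes model loading time (~1-2 min)
+        - Unlimited use, 24 hours a day ($0.03/hour)
+        - Switching to a GPU environment (ZeroGPU) gives faster responses
+        **Test examples**: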
+        - "안녕하세요"
+        - "인공지능에 대해 설명해주세요"
+        - "오늘 날씨가 어때요?"
+        """
+    gr.Markdown(footer)
 if __name__ == "__main__":
     demo.launch()