Spaces:
Sleeping
Sleeping
Upload 4 files
Browse files- CHANGELOG.md +45 -34
- README.md +36 -42
- llm_backend.py +81 -25
- requirements.txt +5 -1
CHANGELOG.md
CHANGED
|
@@ -2,46 +2,57 @@
|
|
| 2 |
|
| 3 |
All notable changes to ConversAI will be documented in this file.
|
| 4 |
|
| 5 |
-
## [1.
|
| 6 |
-
|
| 7 |
-
### Changed
|
| 8 |
-
- **✨
|
| 9 |
-
-
|
| 10 |
-
-
|
| 11 |
-
-
|
| 12 |
-
- **100%
|
| 13 |
-
-
|
| 14 |
-
|
| 15 |
-
- **🆓 FOCUS ON FREE MODELS**: Completely revised to use only free, ungated models
|
| 16 |
-
- Removed paid API recommendations (OpenAI, Anthropic)
|
| 17 |
-
- All features work with free HuggingFace Inference API
|
| 18 |
-
- Added comprehensive free models guide
|
| 19 |
-
- Tested and optimized for free tier performance
|
| 20 |
|
| 21 |
### Added
|
| 22 |
-
- **
|
| 23 |
-
-
|
| 24 |
-
-
|
| 25 |
-
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
-
|
| 29 |
-
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
-
|
|
|
|
|
|
|
|
|
|
| 33 |
|
| 34 |
### Fixed
|
| 35 |
-
-
|
| 36 |
-
-
|
| 37 |
-
-
|
|
|
|
|
|
|
| 38 |
|
| 39 |
### Technical Details
|
| 40 |
-
-
|
| 41 |
-
-
|
| 42 |
-
-
|
| 43 |
-
-
|
| 44 |
-
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 45 |
|
| 46 |
---
|
| 47 |
|
|
|
|
| 2 |
|
| 3 |
All notable changes to ConversAI will be documented in this file.
|
| 4 |
|
| 5 |
+
## [1.2.0] - 2025-11-XX
|
| 6 |
+
|
| 7 |
+
### Changed - MAJOR UPDATE
|
| 8 |
+
- **✨ SWITCHED TO LOCAL TRANSFORMERS**: No more API dependencies!
|
| 9 |
+
- Now uses local model loading with transformers library
|
| 10 |
+
- **No API endpoint issues** - everything runs on your Space
|
| 11 |
+
- **Faster after first load** - models cached in memory
|
| 12 |
+
- **100% private** - all processing happens locally
|
| 13 |
+
- Default model: **google/flan-t5-base** (250MB, very fast)
|
| 14 |
+
- Supports all Flan-T5 variants (base, large, xl, xxl)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 15 |
|
| 16 |
### Added
|
| 17 |
+
- **New dependencies**: transformers, torch, accelerate, sentencepiece
|
| 18 |
+
- Enables local model loading and inference
|
| 19 |
+
- No external API calls required
|
| 20 |
+
- Models download and cache automatically
|
| 21 |
+
|
| 22 |
+
- **Local model caching**: Models stay in memory after first load
|
| 23 |
+
- First request: ~1-2 minutes (download + load)
|
| 24 |
+
- Subsequent requests: ~2-5 seconds
|
| 25 |
+
|
| 26 |
+
- **Support for multiple Flan-T5 sizes**: Choose based on your needs
|
| 27 |
+
- flan-t5-base: 250MB (fast, good quality)
|
| 28 |
+
- flan-t5-large: 1.2GB (better quality)
|
| 29 |
+
- flan-t5-xl: 3GB (excellent quality)
|
| 30 |
+
- flan-t5-xxl: 11GB (best quality)
|
| 31 |
|
| 32 |
### Fixed
|
| 33 |
+
- **No more 404 API errors** - eliminated all API endpoint issues
|
| 34 |
+
- **No API token required** - works without any credentials on HF Spaces
|
| 35 |
+
- Faster generation after initial model load
|
| 36 |
+
- More reliable - no network dependencies
|
| 37 |
+
- Better privacy - all processing local
|
| 38 |
|
| 39 |
### Technical Details
|
| 40 |
+
- **Complete rewrite of HuggingFace backend** in `llm_backend.py`
|
| 41 |
+
- Added `_load_local_model()` method for transformers loading
|
| 42 |
+
- Replaced API calls with local inference
|
| 43 |
+
- Added model caching to keep models in memory
|
| 44 |
+
- Auto-detects CUDA/CPU and optimizes accordingly
|
| 45 |
+
|
| 46 |
+
- **Default model**: `google/flan-t5-base` (line 83)
|
| 47 |
+
- Changed from API-based to local transformers
|
| 48 |
+
- Smaller model for faster loading
|
| 49 |
+
- User can upgrade to larger models via LLM_MODEL env var
|
| 50 |
+
|
| 51 |
+
- **New dependencies added** to requirements.txt:
|
| 52 |
+
- transformers>=4.36.0
|
| 53 |
+
- torch>=2.0.0
|
| 54 |
+
- accelerate>=0.25.0
|
| 55 |
+
- sentencepiece>=0.1.99
|
| 56 |
|
| 57 |
---
|
| 58 |
|
README.md
CHANGED
|
@@ -16,7 +16,7 @@ Battle the blank page, reach global audiences, and uncover insights with AI assi
|
|
| 16 |
|
| 17 |
---
|
| 18 |
|
| 19 |
-
> **✨ UPDATED (Nov 2025):** Now uses **Google Flan-T5
|
| 20 |
|
| 21 |
---
|
| 22 |
|
|
@@ -53,67 +53,61 @@ Battle the blank page, reach global audiences, and uncover insights with AI assi
|
|
| 53 |
|
| 54 |
## 🔧 Configuration
|
| 55 |
|
| 56 |
-
### Default:
|
| 57 |
|
| 58 |
-
**✨ Zero configuration needed!** ConversAI works out-of-the-box on HuggingFace Spaces.
|
| 59 |
|
| 60 |
-
**Default Model:** google/flan-t5-
|
| 61 |
- ✅ **100% Free** - No API keys, no costs, ever
|
| 62 |
-
- ✅ **Fast** -
|
| 63 |
-
- ✅ **
|
| 64 |
-
- ✅ **
|
| 65 |
-
- ✅ **Reliable** - Google's
|
| 66 |
-
|
| 67 |
-
**Setup for
|
| 68 |
-
- Just deploy -
|
| 69 |
-
- **No
|
| 70 |
-
|
| 71 |
-
**Setup for PRIVATE Spaces:**
|
| 72 |
-
1. Go to https://huggingface.co/settings/tokens
|
| 73 |
-
2. Copy your token (read permission is enough)
|
| 74 |
-
3. Add in Space Settings → Variables:
|
| 75 |
-
- Name: `HUGGINGFACE_API_KEY`
|
| 76 |
-
- Value: your_token_here
|
| 77 |
-
4. Restart Space
|
| 78 |
|
| 79 |
### Alternative Free Models
|
| 80 |
|
| 81 |
You can try different free models by setting the `LLM_MODEL` environment variable:
|
| 82 |
|
| 83 |
-
**Recommended Free Models (
|
| 84 |
|
| 85 |
-
| Model | Best For | Speed | Quality |
|
| 86 |
-
|
| 87 |
-
| **google/flan-t5-
|
| 88 |
-
| **google/flan-t5-
|
| 89 |
-
| **google/flan-t5-
|
|
|
|
| 90 |
|
| 91 |
-
**Note:** Flan-T5 models are Google's instruction-tuned models, specifically designed for following instructions. They
|
| 92 |
|
| 93 |
**To change model:**
|
| 94 |
```bash
|
| 95 |
# In Space Settings → Variables
|
| 96 |
-
LLM_MODEL=google/flan-t5-
|
| 97 |
|
| 98 |
-
# Or for
|
| 99 |
-
LLM_MODEL=google/flan-t5-
|
| 100 |
```
|
| 101 |
|
| 102 |
-
**Why
|
| 103 |
-
- ✅ **
|
| 104 |
-
- ✅ **No 404 errors** -
|
| 105 |
-
- ✅ **Fast
|
| 106 |
- ✅ **Instruction-tuned** - designed for following prompts
|
| 107 |
-
- ✅ **
|
| 108 |
|
| 109 |
-
### Tips for Best Performance with
|
| 110 |
|
| 111 |
-
1. **
|
| 112 |
-
2. **
|
| 113 |
-
3. **
|
| 114 |
-
4. **
|
| 115 |
-
5. **
|
| 116 |
-
6. **
|
| 117 |
|
| 118 |
## 📦 Installation
|
| 119 |
|
|
|
|
| 16 |
|
| 17 |
---
|
| 18 |
|
| 19 |
+
> **✨ UPDATED (Nov 2025):** Now uses **local transformers** with **Google Flan-T5** models - Fast, reliable, and **completely FREE**! No API dependencies, runs directly on HuggingFace Spaces.
|
| 20 |
|
| 21 |
---
|
| 22 |
|
|
|
|
| 53 |
|
| 54 |
## 🔧 Configuration
|
| 55 |
|
| 56 |
+
### Default: Local Transformers (Completely FREE!)
|
| 57 |
|
| 58 |
+
**✨ Zero configuration needed!** ConversAI works out-of-the-box on HuggingFace Spaces using local model loading.
|
| 59 |
|
| 60 |
+
**Default Model:** google/flan-t5-base
|
| 61 |
- ✅ **100% Free** - No API keys, no costs, ever
|
| 62 |
+
- ✅ **Fast** - Models load locally, typically 2-5 seconds per request after loading
|
| 63 |
+
- ✅ **No API dependencies** - Runs entirely on your Space's compute
|
| 64 |
+
- ✅ **Private** - All processing happens locally, nothing sent to external APIs
|
| 65 |
+
- ✅ **Reliable** - Google's instruction-tuned model, battle-tested
|
| 66 |
+
|
| 67 |
+
**Setup for HuggingFace Spaces:**
|
| 68 |
+
- Just deploy - models download automatically on first run
|
| 69 |
+
- **No API keys or tokens required!**
|
| 70 |
+
- Models are cached after first download for faster subsequent loads
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 71 |
|
| 72 |
### Alternative Free Models
|
| 73 |
|
| 74 |
You can try different free models by setting the `LLM_MODEL` environment variable:
|
| 75 |
|
| 76 |
+
**Recommended Free Models (Local Transformers):**
|
| 77 |
|
| 78 |
+
| Model | Best For | Speed | Quality | Model Size |
|
| 79 |
+
|-------|----------|-------|---------|------------|
|
| 80 |
+
| **google/flan-t5-base** (default) | Balanced - fast & small | ⚡⚡⚡ Very Fast | ⭐⭐ Good | 250MB |
|
| 81 |
+
| **google/flan-t5-large** | Better quality | ⚡⚡ Fast | ⭐⭐⭐ Better | 1.2GB |
|
| 82 |
+
| **google/flan-t5-xl** | Best quality | ⚡ Medium | ⭐⭐⭐⭐ Excellent | 3GB |
|
| 83 |
+
| **google/flan-t5-xxl** | Maximum quality | ⚡ Slower | ⭐⭐⭐⭐⭐ Best | 11GB |
|
| 84 |
|
| 85 |
+
**Note:** Flan-T5 models are Google's instruction-tuned models, specifically designed for following instructions. They run locally with transformers library.
|
| 86 |
|
| 87 |
**To change model:**
|
| 88 |
```bash
|
| 89 |
# In Space Settings → Variables
|
| 90 |
+
LLM_MODEL=google/flan-t5-large # Better quality
|
| 91 |
|
| 92 |
+
# Or for maximum quality (requires more memory)
|
| 93 |
+
LLM_MODEL=google/flan-t5-xl
|
| 94 |
```
|
| 95 |
|
| 96 |
+
**Why Local Transformers?**
|
| 97 |
+
- ✅ **No API dependencies** - runs entirely on your Space
|
| 98 |
+
- ✅ **No 404 errors** - no network issues
|
| 99 |
+
- ✅ **Fast after loading** - models cached in memory
|
| 100 |
- ✅ **Instruction-tuned** - designed for following prompts
|
| 101 |
+
- ✅ **Privacy** - all processing happens locally
|
| 102 |
|
| 103 |
+
### Tips for Best Performance with Local Models
|
| 104 |
|
| 105 |
+
1. **Start with flan-t5-base** - Fast loading and good results
|
| 106 |
+
2. **First load takes time** - Model downloads and loads (~1-2 minutes for base)
|
| 107 |
+
3. **Subsequent requests are fast** - Model stays in memory (2-5 seconds)
|
| 108 |
+
4. **Upgrade model size for quality** - flan-t5-large or xl for better results
|
| 109 |
+
5. **Keep prompts concise** - Shorter outlines = faster generation
|
| 110 |
+
6. **Monitor memory** - Larger models (XL, XXL) need more RAM
|
| 111 |
|
| 112 |
## 📦 Installation
|
| 113 |
|
llm_backend.py
CHANGED
|
@@ -7,6 +7,14 @@ import json
|
|
| 7 |
from typing import List, Dict, Optional
|
| 8 |
from enum import Enum
|
| 9 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
|
| 11 |
class LLMProvider(Enum):
|
| 12 |
"""Supported LLM providers"""
|
|
@@ -27,13 +35,13 @@ class LLMBackend:
|
|
| 27 |
Initialize LLM backend with specified provider.
|
| 28 |
|
| 29 |
Args:
|
| 30 |
-
provider: LLM provider to use (defaults to env var or
|
| 31 |
api_key: API key for the provider (reads from env if not provided)
|
| 32 |
model: Model name to use (provider-specific defaults if not provided)
|
| 33 |
"""
|
| 34 |
# Determine provider
|
| 35 |
if provider is None:
|
| 36 |
-
provider_str = os.getenv("LLM_PROVIDER", "
|
| 37 |
self.provider = LLMProvider(provider_str)
|
| 38 |
else:
|
| 39 |
self.provider = provider
|
|
@@ -60,13 +68,19 @@ class LLMBackend:
|
|
| 60 |
# Set API endpoint
|
| 61 |
self.api_url = self._get_api_url()
|
| 62 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 63 |
def _get_default_model(self) -> str:
|
| 64 |
"""Get default model for each provider"""
|
| 65 |
defaults = {
|
| 66 |
LLMProvider.OPENAI: "gpt-4o-mini",
|
| 67 |
LLMProvider.ANTHROPIC: "claude-3-5-sonnet-20241022",
|
| 68 |
-
# Using Flan-T5-
|
| 69 |
-
|
|
|
|
| 70 |
LLMProvider.LM_STUDIO: "google/gemma-3-27b"
|
| 71 |
}
|
| 72 |
return os.getenv("LLM_MODEL", defaults[self.provider])
|
|
@@ -171,32 +185,74 @@ class LLMBackend:
|
|
| 171 |
data = response.json()
|
| 172 |
return data["content"][0]["text"]
|
| 173 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 174 |
def _generate_huggingface(self, messages, max_tokens, temperature) -> str:
|
| 175 |
-
"""Generate using
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
"Content-Type": "application/json"
|
| 179 |
-
}
|
| 180 |
|
| 181 |
# Convert messages to prompt
|
| 182 |
prompt = self._messages_to_prompt(messages)
|
| 183 |
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 200 |
|
| 201 |
def _generate_lm_studio(self, messages, max_tokens, temperature) -> str:
|
| 202 |
"""Generate using LM Studio local API"""
|
|
|
|
| 7 |
from typing import List, Dict, Optional
|
| 8 |
from enum import Enum
|
| 9 |
|
| 10 |
+
# Try to import transformers for local model loading
|
| 11 |
+
try:
|
| 12 |
+
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM
|
| 13 |
+
import torch
|
| 14 |
+
TRANSFORMERS_AVAILABLE = True
|
| 15 |
+
except ImportError:
|
| 16 |
+
TRANSFORMERS_AVAILABLE = False
|
| 17 |
+
|
| 18 |
|
| 19 |
class LLMProvider(Enum):
|
| 20 |
"""Supported LLM providers"""
|
|
|
|
| 35 |
Initialize LLM backend with specified provider.
|
| 36 |
|
| 37 |
Args:
|
| 38 |
+
provider: LLM provider to use (defaults to env var or HUGGINGFACE)
|
| 39 |
api_key: API key for the provider (reads from env if not provided)
|
| 40 |
model: Model name to use (provider-specific defaults if not provided)
|
| 41 |
"""
|
| 42 |
# Determine provider
|
| 43 |
if provider is None:
|
| 44 |
+
provider_str = os.getenv("LLM_PROVIDER", "huggingface").lower()
|
| 45 |
self.provider = LLMProvider(provider_str)
|
| 46 |
else:
|
| 47 |
self.provider = provider
|
|
|
|
| 68 |
# Set API endpoint
|
| 69 |
self.api_url = self._get_api_url()
|
| 70 |
|
| 71 |
+
# Cache for local models (transformers)
|
| 72 |
+
self.tokenizer = None
|
| 73 |
+
self.local_model = None
|
| 74 |
+
self.device = None
|
| 75 |
+
|
| 76 |
def _get_default_model(self) -> str:
|
| 77 |
"""Get default model for each provider"""
|
| 78 |
defaults = {
|
| 79 |
LLMProvider.OPENAI: "gpt-4o-mini",
|
| 80 |
LLMProvider.ANTHROPIC: "claude-3-5-sonnet-20241022",
|
| 81 |
+
# Using Flan-T5-Base - small, fast, works locally with transformers
|
| 82 |
+
# For larger models, try: google/flan-t5-large or google/flan-t5-xl
|
| 83 |
+
LLMProvider.HUGGINGFACE: "google/flan-t5-base",
|
| 84 |
LLMProvider.LM_STUDIO: "google/gemma-3-27b"
|
| 85 |
}
|
| 86 |
return os.getenv("LLM_MODEL", defaults[self.provider])
|
|
|
|
| 185 |
data = response.json()
|
| 186 |
return data["content"][0]["text"]
|
| 187 |
|
| 188 |
+
def _load_local_model(self):
|
| 189 |
+
"""Load model locally using transformers"""
|
| 190 |
+
if not TRANSFORMERS_AVAILABLE:
|
| 191 |
+
raise Exception("transformers library not available. Install with: pip install transformers torch")
|
| 192 |
+
|
| 193 |
+
if self.local_model is not None:
|
| 194 |
+
return # Already loaded
|
| 195 |
+
|
| 196 |
+
print(f"Loading model {self.model} locally...")
|
| 197 |
+
|
| 198 |
+
# Determine device
|
| 199 |
+
self.device = "cuda" if torch.cuda.is_available() else "cpu"
|
| 200 |
+
print(f"Using device: {self.device}")
|
| 201 |
+
|
| 202 |
+
# Load tokenizer
|
| 203 |
+
self.tokenizer = AutoTokenizer.from_pretrained(self.model)
|
| 204 |
+
|
| 205 |
+
# Load model (T5 models use Seq2SeqLM, others use CausalLM)
|
| 206 |
+
if "t5" in self.model.lower() or "flan" in self.model.lower():
|
| 207 |
+
self.local_model = AutoModelForSeq2SeqLM.from_pretrained(
|
| 208 |
+
self.model,
|
| 209 |
+
torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
|
| 210 |
+
low_cpu_mem_usage=True
|
| 211 |
+
)
|
| 212 |
+
else:
|
| 213 |
+
self.local_model = AutoModelForCausalLM.from_pretrained(
|
| 214 |
+
self.model,
|
| 215 |
+
torch_dtype=torch.float16 if self.device == "cuda" else torch.float32,
|
| 216 |
+
low_cpu_mem_usage=True
|
| 217 |
+
)
|
| 218 |
+
|
| 219 |
+
self.local_model = self.local_model.to(self.device)
|
| 220 |
+
print(f"Model loaded successfully!")
|
| 221 |
+
|
| 222 |
def _generate_huggingface(self, messages, max_tokens, temperature) -> str:
|
| 223 |
+
"""Generate using local transformers model"""
|
| 224 |
+
# Load model if not already loaded
|
| 225 |
+
self._load_local_model()
|
|
|
|
|
|
|
| 226 |
|
| 227 |
# Convert messages to prompt
|
| 228 |
prompt = self._messages_to_prompt(messages)
|
| 229 |
|
| 230 |
+
# Tokenize input
|
| 231 |
+
inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
|
| 232 |
+
inputs = inputs.to(self.device)
|
| 233 |
+
|
| 234 |
+
# Generate
|
| 235 |
+
with torch.no_grad():
|
| 236 |
+
outputs = self.local_model.generate(
|
| 237 |
+
**inputs,
|
| 238 |
+
max_new_tokens=max_tokens,
|
| 239 |
+
temperature=temperature,
|
| 240 |
+
do_sample=temperature > 0,
|
| 241 |
+
top_p=0.9,
|
| 242 |
+
pad_token_id=self.tokenizer.eos_token_id
|
| 243 |
+
)
|
| 244 |
+
|
| 245 |
+
# Decode output
|
| 246 |
+
generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
|
| 247 |
+
|
| 248 |
+
# For T5 models, the output is just the generated text
|
| 249 |
+
# For causal models, we need to remove the input prompt
|
| 250 |
+
if "t5" not in self.model.lower() and "flan" not in self.model.lower():
|
| 251 |
+
# Remove the input prompt from output
|
| 252 |
+
if generated_text.startswith(prompt):
|
| 253 |
+
generated_text = generated_text[len(prompt):].strip()
|
| 254 |
+
|
| 255 |
+
return generated_text
|
| 256 |
|
| 257 |
def _generate_lm_studio(self, messages, max_tokens, temperature) -> str:
|
| 258 |
"""Generate using LM Studio local API"""
|
requirements.txt
CHANGED
|
@@ -1,3 +1,7 @@
|
|
| 1 |
gradio==5.45.0
|
| 2 |
requests==2.32.3
|
| 3 |
-
pandas==2.2.2
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
gradio==5.45.0
|
| 2 |
requests==2.32.3
|
| 3 |
+
pandas==2.2.2
|
| 4 |
+
transformers>=4.36.0
|
| 5 |
+
torch>=2.0.0
|
| 6 |
+
accelerate>=0.25.0
|
| 7 |
+
sentencepiece>=0.1.99
|