---
title: Status Law Gbot
emoji: 💬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.23.0
app_file: app.py
pinned: false
---
# Status Law Assistant
An intelligent chatbot built with Hugging Face and LangChain that provides legal consultations based on information from the Status Law website.
## 📝 Description
Status Law Assistant is a chatbot that answers user questions about the legal services of Status Law. It uses Retrieval-Augmented Generation (RAG) to find relevant information in a knowledge base built from the official website content.
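The retrieve-then-generate idea can be illustrated with a toy sketch. This is not the project's implementation (which relies on FAISS and LangChain): it ranks chunks by bag-of-words cosine similarity, and the function names and sample chunks are placeholders invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return up to k chunks that share vocabulary with the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:k] if score > 0]


def build_prompt(query, chunks, k=2):
    """Put the retrieved chunks in front of the question for the language model."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Placeholder chunks, not real website content
chunks = [
    "The firm advises on international law matters.",
    "Office hours are Monday to Friday.",
    "To book a consultation, use the website form.",
]
print(build_prompt("How do I book a consultation?", chunks, k=1))
```

In the real application the word-count vectors would be replaced by embeddings stored in the FAISS index, but the flow (retrieve, build a prompt, generate) stays the same.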
## ✨ Key Features
- Knowledge Base Management:
  - Dynamic URL management for knowledge base sources
  - Ability to add custom URLs for information extraction
  - Selective source inclusion/exclusion
  - Two modes of knowledge base updates:
    - Update Mode: Adds new information while preserving existing knowledge
    - Rebuild Mode: Complete recreation of knowledge base from selected sources
  - Real-time status tracking for knowledge base operations
  - Automatic metadata management and versioning
- Automatic creation and updating of knowledge base from status.law website content
- Intelligent information retrieval for query responses
- Context-aware response generation
- Advanced multilingual support:
  - Automatic language detection
  - Native language response generation
  - Built-in translation system with fallback mechanism
  - Support for 30+ languages
- Customizable text generation parameters
- Model switching system with automatic fallback
- Fine-tuning capabilities based on chat history
- Multiple model support:
  - Zephyr 7B: Enhanced performance and response quality
  - TinyLlama 1.1B Chat: Lightweight model for resource-constrained environments
  - Neural Mistral 7B: Superior reasoning and instruction following capabilities
  - Mixtral 8x7B: Advanced mixture-of-experts architecture
## 🚀 Technologies
- **LangChain**: Query processing chains and knowledge base management
- **Hugging Face**: Language model access and hosting
- **FAISS**: Efficient vector search
- **Gradio**: User interface creation
- **BeautifulSoup**: Web page information extraction
- **PEFT**: Efficient fine-tuning using LoRA
- **SentencePiece**: Tokenization
## 🏗️ Project Structure
```
status-law-gbot/
├── app.py                     # Main application file
├── requirements.txt           # Project dependencies
├── config/                    # Configuration files
│   ├── settings.py            # Application and model settings
│   └── constants.py           # Constants and default values
├── src/                       # Source code
│   ├── analytics/             # Analytics module
│   │   └── chat_analyzer.py
│   ├── knowledge_base/        # Knowledge base management
│   │   ├── loader.py
│   │   └── vector_store.py
│   └── training/              # Training module
│       ├── fine_tuner.py
│       └── model_manager.py
└── dataset/                   # HuggingFace dataset structure
    ├── annotations/           # Conversation annotations
    ├── chat_history/          # Chat logs and conversations
    ├── fine_tuned_models/     # Fine-tuned model storage
    ├── preferences/           # User preferences
    ├── training_data/         # Processed training data
    ├── training_logs/         # Training process logs
    └── vector_store/          # FAISS vector storage
```
## 💾 Data Storage
### Dataset Organization
- `annotations/`: Conversation quality metrics and annotations
- `chat_history/`: JSON files containing chat conversations
- `fine_tuned_models/`: Storage for LoRA adapters and model checkpoints
- `preferences/`: User preferences and settings
- `training_data/`: Processed data ready for model training
- `training_logs/`: Detailed training process logs
- `vector_store/`: FAISS indexes for semantic search
## 🛠️ Setup
1. Clone the repository:
```bash
git clone https://github.com/PtOlga/status-law-gbot.git
cd status-law-gbot
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Set up environment variables:
```bash
cp .env.example .env
# Edit .env with your configuration, including HUGGINGFACE_TOKEN
```
4. Run the application:
```bash
python app.py
```
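If the application fails at startup, a missing token is a likely cause. The sketch below shows one way to check for it, assuming the values from `.env` end up in the process environment (for example through an exported shell variable or a loader such as python-dotenv). The helper name `get_hf_token` is hypothetical and not part of this repository.

```python
import os


def get_hf_token(env=None):
    """Return the Hugging Face token or raise a clear error if it is missing.

    `env` defaults to the process environment; passing a dict makes the
    function easy to try without a real token.
    """
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACE_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HUGGINGFACE_TOKEN is not set; add it to .env or export it"
        )
    return token


# Demonstration with an explicit mapping (placeholder value, not a real token)
print(get_hf_token({"HUGGINGFACE_TOKEN": "hf_example"}))
```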
## 🔧 Model Fine-tuning
To fine-tune the model on your chat history:
```python
from src.training.fine_tuner import finetune_from_chat_history
success, message = finetune_from_chat_history(epochs=3)
print(message)
```
The fine-tuning process uses LoRA (Low-Rank Adaptation) for efficient training with minimal resource requirements.
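To see why LoRA is cheap, compare parameter counts. Instead of updating a full `d_out x d_in` weight matrix, LoRA trains two low-rank factors of shapes `d_out x r` and `r x d_in`. The sizes and rank below are illustrative assumptions, not the settings used by `fine_tuner.py`.

```python
def full_params(d_out, d_in):
    """Number of weights in a dense d_out x d_in matrix."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Trainable weights in the two LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative values: one square 4096 x 4096 projection, rank 8 (assumed)
d, r = 4096, 8
full = full_params(d, d)
lora = lora_trainable_params(d, d, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For this single matrix the adapter trains 65,536 weights instead of 16,777,216, which is about 0.39% of the original count.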
## 🔄 Model Switching
The application supports multiple models with automatic fallback:
- Zephyr 7B: Enhanced performance and response quality
- TinyLlama 1.1B Chat: Lightweight model for resource-constrained environments
- Neural Mistral 7B: Superior reasoning and instruction following capabilities
- Mixtral 8x7B: Advanced mixture-of-experts architecture
Models can be switched dynamically through the interface or programmatically:
```python
from src.training.model_manager import switch_to_model
switch_to_model("zephyr-7b") # or "tinyllama-1.1b", "neural-mistral-7b", "mixtral-8x7b"
```
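The automatic fallback can be pictured as trying models in priority order until one responds. The following is a minimal sketch of that pattern, not the logic in `model_manager.py`; `generate_with_fallback` and `fake_generate` are hypothetical names, and only the model keys come from the example above.

```python
def generate_with_fallback(prompt, model_keys, generate):
    """Try each model in order and return (model_key, reply) from the first success.

    `generate` is any callable taking (model_key, prompt); it stands in for
    the real inference call.
    """
    errors = {}
    for key in model_keys:
        try:
            return key, generate(key, prompt)
        except Exception as exc:  # remember the failure and try the next model
            errors[key] = exc
    raise RuntimeError(f"All models failed: {errors}")


def fake_generate(key, prompt):
    """Stand-in backend that simulates one unavailable model."""
    if key == "mixtral-8x7b":
        raise TimeoutError("model unavailable")
    return f"[{key}] reply to: {prompt}"


used, reply = generate_with_fallback(
    "Hello", ["mixtral-8x7b", "zephyr-7b"], fake_generate
)
print(used)   # zephyr-7b
print(reply)  # [zephyr-7b] reply to: Hello
```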
## 🔄 Knowledge Base Management
The application provides a flexible interface for managing knowledge sources:
1. **Source Management**:
   - View and edit the list of source URLs
   - Enable/disable specific sources
   - Add custom URLs for information extraction
   - Monitor source status and availability
2. **Update Operations**:
   - **Update Knowledge Base**: Incrementally add new information while preserving existing knowledge
   - **Rebuild Knowledge Base**: Completely recreate the knowledge base using only selected sources
   - Real-time operation status tracking
   - Automatic backup of previous versions
3. **Usage**:
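```python
import pandas as pd

# Add a new URL to the source table. This assumes sources_df is a pandas DataFrame;
# DataFrame.append was removed in pandas 2.0 and never modified the frame in place.
new_row = pd.DataFrame([{"URL": "https://example.com", "Include": True, "Status": "Ready"}])
sources_df = pd.concat([sources_df, new_row], ignore_index=True)
# Update knowledge base with selected sources
update_kb_with_selected(sources_df)
# Rebuild knowledge base from scratch
rebuild_kb_with_selected(sources_df)
```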
All changes to the knowledge base are automatically synchronized with the Hugging Face dataset, ensuring data persistence and version control.
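The difference between the two modes can be shown with a simplified model in which the knowledge base is a plain dictionary from URL to extracted text. This is a sketch under that assumption, not the FAISS-backed implementation; `selected_urls`, `update_kb`, and `rebuild_kb` are hypothetical helpers, and only the `URL` and `Include` keys come from the usage example above.

```python
def selected_urls(sources):
    """Return the URLs of all sources marked for inclusion."""
    return [s["URL"] for s in sources if s["Include"]]


def update_kb(kb, fetched):
    """Update mode: keep existing entries and add or refresh the fetched ones."""
    merged = dict(kb)
    merged.update(fetched)
    return merged


def rebuild_kb(fetched):
    """Rebuild mode: discard the old entries and keep only the fetched ones."""
    return dict(fetched)


sources = [
    {"URL": "https://example.com/a", "Include": True},
    {"URL": "https://example.com/b", "Include": False},
]
kb = {"https://example.com/old": "old text"}

# Stand-in for downloading and parsing each selected page
fetched = {url: f"text from {url}" for url in selected_urls(sources)}

print(sorted(update_kb(kb, fetched)))  # old entry kept, /a added
print(sorted(rebuild_kb(fetched)))     # only /a remains
```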
## 🔗 Related Links
- [Status Law Website](https://status.law)
- [Status Law Assistant on Hugging Face](https://huggingface.co/spaces/Rulga/status-law-gbot)
## 📝 License
Public repository for Status Law Assistant.