---
title: Status Law Gbot
emoji: πŸ’¬
colorFrom: yellow
colorTo: purple
sdk: gradio
sdk_version: 5.23.0
app_file: app.py
pinned: false
---

# Status Law Assistant

An intelligent chatbot built on Hugging Face and LangChain that provides legal consultations based on information from the Status Law company website.

πŸ“ Description

Status Law Assistant is a smart chatbot that answers user questions about Status Law company's legal services. The bot uses RAG (Retrieval-Augmented Generation) technology to find relevant information in a knowledge base created from the official website content.
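Conceptually, the retrieval step of RAG ranks stored text chunks by vector similarity to the user's query and feeds the best matches to the language model. The app does this with FAISS indexes built through LangChain; the sketch below illustrates only the idea in plain Python, with toy hand-written vectors standing in for real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, chunks, k=2):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

# Toy knowledge base: (text, embedding) pairs; real embeddings come from a model.
chunks = [
    {"text": "Status Law handles extradition cases.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Office hours are 9 to 5.",              "vec": [0.0, 0.2, 0.9]},
    {"text": "We assist with Interpol notices.",      "vec": [0.8, 0.3, 0.1]},
]

print(retrieve([1.0, 0.2, 0.0], chunks, k=2))
```

In the real app, FAISS replaces the linear scan above with an approximate-nearest-neighbor index, which keeps retrieval fast as the knowledge base grows.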

## ✨ Key Features

- **Knowledge Base Management**:
  - Dynamic URL management for knowledge base sources
  - Ability to add custom URLs for information extraction
  - Selective source inclusion/exclusion
  - Two modes of knowledge base updates:
    - **Update Mode**: adds new information while preserving existing knowledge
    - **Rebuild Mode**: complete recreation of the knowledge base from selected sources
  - Real-time status tracking for knowledge base operations
  - Automatic metadata management and versioning
- Automatic creation and updating of the knowledge base from status.law website content
- Intelligent information retrieval for query responses
- Context-aware response generation
- **Advanced multilingual support**:
  - Automatic language detection
  - Native-language response generation
  - Built-in translation system with fallback mechanism
  - Support for 30+ languages
- Customizable text generation parameters
- Model switching system with automatic fallback
- Fine-tuning capabilities based on chat history
- **Multiple model support**:
  - **Zephyr 7B**: enhanced performance and response quality
  - **TinyLlama 1.1B Chat**: lightweight model for resource-constrained environments
  - **Neural Mistral 7B**: superior reasoning and instruction-following capabilities
  - **Mixtral 8x7B**: advanced mixture-of-experts architecture
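The translation fallback mentioned above can be sketched as a simple chain: try each translation backend in order and fall back to the next on failure, returning the original text as a last resort. The translator callables here are hypothetical stand-ins, not the app's actual API:

```python
def translate_with_fallback(text, lang, translators):
    """Try each translator in order; return the first successful result.

    `translators` is a list of callables (hypothetical stand-ins for the
    app's primary and backup translation backends). If every backend
    fails, the original text is returned unchanged.
    """
    for translate in translators:
        try:
            return translate(text, lang)
        except Exception:
            continue  # this backend failed; fall through to the next one
    return text

# Toy backends: the first always fails, the second succeeds.
def primary(text, lang):
    raise RuntimeError("primary translation service unavailable")

def backup(text, lang):
    return f"[{lang}] {text}"

print(translate_with_fallback("Hello", "fr", [primary, backup]))
# → [fr] Hello
```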

## πŸš€ Technologies

- **LangChain**: query processing chains and knowledge base management
- **Hugging Face**: language model access and hosting
- **FAISS**: efficient vector search
- **Gradio**: user interface creation
- **BeautifulSoup**: web page information extraction
- **PEFT**: efficient fine-tuning using LoRA
- **SentencePiece**: tokenization

πŸ—οΈ Project Structure

status-law-gbot/
β”œβ”€β”€ app.py                 # Main application file
β”œβ”€β”€ requirements.txt       # Project dependencies
β”œβ”€β”€ config/               # Configuration files
β”‚   β”œβ”€β”€ settings.py       # Application and model settings
β”‚   └── constants.py      # Constants and default values
β”œβ”€β”€ src/                  # Source code
β”‚   β”œβ”€β”€ analytics/        # Analytics module
β”‚   β”‚   └── chat_analyzer.py
β”‚   β”œβ”€β”€ knowledge_base/   # Knowledge base management
β”‚   β”‚   β”œβ”€β”€ loader.py
β”‚   β”‚   └── vector_store.py
β”‚   └── training/         # Training module
β”‚       β”œβ”€β”€ fine_tuner.py
β”‚       └── model_manager.py
└── dataset/             # HuggingFace dataset structure
    β”œβ”€β”€ annotations/     # Conversation annotations
    β”œβ”€β”€ chat_history/    # Chat logs and conversations
    β”œβ”€β”€ fine_tuned_models/ # Fine-tuned model storage
    β”œβ”€β”€ preferences/     # User preferences
    β”œβ”€β”€ training_data/   # Processed training data
    β”œβ”€β”€ training_logs/   # Training process logs
    └── vector_store/    # FAISS vector storage

## πŸ’Ύ Data Storage

### Dataset Organization

- `annotations/`: conversation quality metrics and annotations
- `chat_history/`: JSON files containing chat conversations
- `fine_tuned_models/`: storage for LoRA adapters and model checkpoints
- `preferences/`: user preferences and settings
- `training_data/`: processed data ready for model training
- `training_logs/`: detailed training process logs
- `vector_store/`: FAISS indexes for semantic search
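A `chat_history/` record might look like the round-trip below. Note that the exact schema is an assumption for illustration only; the field names `timestamp`, `model`, and `messages` are hypothetical and may differ from what the app actually writes:

```python
import json

# Hypothetical shape of a single chat_history record; the real schema
# used by the app may differ.
record = {
    "timestamp": "2024-05-01T12:00:00Z",
    "model": "zephyr-7b",
    "messages": [
        {"role": "user", "content": "Can you help with an Interpol notice?"},
        {"role": "assistant", "content": "Yes, Status Law handles such cases."},
    ],
}

# Round-trip through JSON, as files in chat_history/ would be stored on disk.
serialized = json.dumps(record, ensure_ascii=False, indent=2)
restored = json.loads(serialized)
print(restored["messages"][0]["role"])
# → user
```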

πŸ› οΈ Setup

  1. Clone the repository:
git clone https://github.com/PtOlga/status-law-gbot.git
cd status-law-gbot
  1. Install dependencies:
pip install -r requirements.txt
  1. Set up environment variables:
cp .env.example .env
# Edit .env with your configuration, including HUGGINGFACE_TOKEN
  1. Run the application:
python app.py

## πŸ”§ Model Fine-tuning

To fine-tune the model on your chat history:

```python
from src.training.fine_tuner import finetune_from_chat_history

success, message = finetune_from_chat_history(epochs=3)
print(message)
```

The fine-tuning process uses LoRA (Low-Rank Adaptation) for efficient training with minimal resource requirements.
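To see why LoRA is cheap, compare parameter counts: instead of updating a full `d_in × d_out` weight matrix, LoRA trains two low-rank matrices `A` (`d_in × r`) and `B` (`r × d_out`). A quick back-of-the-envelope check (the layer dimensions below are illustrative, not the model's actual shapes):

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters for a rank-r LoRA adapter on a d_in x d_out layer."""
    return rank * (d_in + d_out)

def full_params(d_in, d_out):
    """Trainable parameters when fine-tuning the full weight matrix."""
    return d_in * d_out

# Illustrative 4096x4096 projection layer with rank-8 adapters
d = 4096
full = full_params(d, d)          # 16,777,216 weights
lora = lora_params(d, d, rank=8)  # 65,536 weights
print(f"LoRA trains {lora / full:.2%} of the full layer's parameters")
# → LoRA trains 0.39% of the full layer's parameters
```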

## πŸ”„ Model Switching

The application supports multiple models with automatic fallback:

- **Zephyr 7B**: enhanced performance and response quality
- **TinyLlama 1.1B Chat**: lightweight model for resource-constrained environments
- **Neural Mistral 7B**: superior reasoning and instruction-following capabilities
- **Mixtral 8x7B**: advanced mixture-of-experts architecture

Models can be switched dynamically through the interface or programmatically:

```python
from src.training.model_manager import switch_to_model

switch_to_model("zephyr-7b")  # or "tinyllama-1.1b", "neural-mistral-7b", "mixtral-8x7b"
```

## πŸ”„ Knowledge Base Management

The application provides a flexible interface for managing knowledge sources:

1. **Source Management**:
   - View and edit the list of source URLs
   - Enable/disable specific sources
   - Add custom URLs for information extraction
   - Monitor source status and availability

2. **Update Operations**:
   - **Update Knowledge Base**: incrementally add new information while preserving existing knowledge
   - **Rebuild Knowledge Base**: completely recreate the knowledge base using only selected sources
   - Real-time operation status tracking
   - Automatic backup of previous versions

3. **Usage**:

   ```python
   import pandas as pd

   # Add a new URL to the source table (assuming sources_df is a pandas
   # DataFrame; DataFrame.append was removed in pandas 2.0, so use concat)
   new_row = pd.DataFrame([{"URL": "https://example.com", "Include": True, "Status": "Ready"}])
   sources_df = pd.concat([sources_df, new_row], ignore_index=True)

   # Update knowledge base with selected sources
   update_kb_with_selected(sources_df)

   # Rebuild knowledge base from scratch
   rebuild_kb_with_selected(sources_df)
   ```

All changes to the knowledge base are automatically synchronized with the Hugging Face dataset, ensuring data persistence and version control.

## πŸ”— Related Links

πŸ“ License

Public repository for Status Law Assistant.