---
title: Lets Talk
emoji: 🎨
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
---

# Welcome to TheDataGuy Chat!

This is a Q&A chatbot powered by posts from [TheDataGuy blog](https://thedataguy.pro/blog/). Ask questions about topics covered on the blog, such as:

- Ragas and RAG evaluation
- Building research agents
- Metric-driven development
- Data science best practices

## How it works

Under the hood, this application uses:

1. **Snowflake Arctic Embeddings**: To convert text into vector representations
   - Base model: `Snowflake/snowflake-arctic-embed-l`
   - Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs)

2. **Qdrant Vector Database**: To store and search for similar content
   - Efficiently indexes blog post content for fast semantic search
   - Supports real-time updates when new blog posts are published

3. **GPT-4o-mini**: To generate helpful responses based on retrieved content
   - Primary model: OpenAI `gpt-4o-mini` for production inference
   - Evaluation model: OpenAI `gpt-4.1` for complex tasks, including synthetic data generation and evaluation

4. **LangChain**: For building the RAG workflow
   - Orchestrates the retrieval and generation components
   - Provides flexible components for LLM application development
   - Structured for easy maintenance and future enhancements

5. **Chainlit**: For the chat interface
   - Offers an interactive UI with message threading
   - Supports file uploads and custom components

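The retrieve-then-generate loop described above can be sketched in plain Python. This is an illustrative toy only: the function names and the word-overlap retriever are invented for the example, while the real app performs a Qdrant similarity search over Arctic embeddings and sends the prompt to `gpt-4o-mini`.

```python
# Toy sketch of the RAG flow. All names here are hypothetical, not from rag.py.

RAG_PROMPT = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the question.

    In the real app this step is a Qdrant similarity search over
    Snowflake Arctic embeddings.
    """
    q_words = set(question.lower().split())
    scored = sorted(documents, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]


def build_prompt(question: str, documents: list[str]) -> str:
    """Assemble the prompt that would be sent to the LLM."""
    context = "\n\n".join(retrieve(question, documents))
    return RAG_PROMPT.format(context=context, question=question)


docs = [
    "Ragas provides metrics such as faithfulness for RAG evaluation.",
    "Chainlit renders the chat interface for the application.",
]
prompt = build_prompt("How is RAG evaluation done with Ragas?", docs)
print(prompt)
```

The same shape holds in the real pipeline; only the retriever and the generation call are swapped for Qdrant and OpenAI clients.
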
## Technology Stack

### Core Components

- **Vector Database**: Qdrant (stores embeddings via `pipeline.py`)
- **Embedding Model**: Snowflake Arctic Embeddings
- **LLM**: OpenAI GPT-4o-mini
- **Framework**: LangChain + Chainlit
- **Development Language**: Python 3.13

### Advanced Features

- **Evaluation**: Ragas metrics for evaluating RAG performance:
  - Faithfulness
  - Context Relevancy
  - Answer Relevancy
  - Topic Adherence
- **Synthetic Data Generation**: For training and testing
- **Vector Store Updates**: Automated pipeline to update when new blog content is published
- **Fine-tuned Embeddings**: Custom embeddings tuned for technical content

## Project Structure

```
lets-talk/
├── data/                # Raw blog post content
├── py-src/              # Python source code
│   ├── lets_talk/       # Core application modules
│   │   ├── agent.py     # Agent implementation
│   │   ├── config.py    # Configuration settings
│   │   ├── models.py    # Data models
│   │   ├── prompts.py   # LLM prompt templates
│   │   ├── rag.py       # RAG implementation
│   │   ├── rss_tool.py  # RSS feed integration
│   │   └── tools.py     # Tool implementations
│   ├── app.py           # Main application entry point
│   └── pipeline.py      # Data processing pipeline
├── db/                  # Vector database storage
├── evals/               # Evaluation datasets and results
└── notebooks/           # Jupyter notebooks for analysis
```

## Environment Setup

The application requires the following environment variables:

```
OPENAI_API_KEY=your_openai_api_key
VECTOR_STORAGE_PATH=./db/vector_store_tdg
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```
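
For intuition, `CHUNK_SIZE` and `CHUNK_OVERLAP` control a sliding-window split over each post, roughly like the following simplified sketch (the actual pipeline uses LangChain's text splitters, which also respect sentence and paragraph boundaries):

```python
# Simplified character-window chunker; illustrative only.

def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]


text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # consecutive chunks share chunk_overlap characters
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.
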

Additional configuration options for vector database creation:

```
# Vector Database Creation Configuration
FORCE_RECREATE=False    # Whether to force recreation of the vector store
OUTPUT_DIR=./stats      # Directory to save stats and artifacts
USE_CHUNKING=True       # Whether to split documents into chunks
SHOULD_SAVE_STATS=True  # Whether to save statistics about the documents
```
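
One subtlety worth noting: environment values arrive as strings, so boolean flags like `FORCE_RECREATE=False` need explicit parsing (`bool("False")` is truthy). A hypothetical sketch of how such flags might be read (the real `config.py` may differ):

```python
import os

# Env values are strings, so parse boolean flags explicitly.
def env_bool(name: str, default: bool) -> bool:
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")


FORCE_RECREATE = env_bool("FORCE_RECREATE", False)
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "./stats")
USE_CHUNKING = env_bool("USE_CHUNKING", True)
SHOULD_SAVE_STATS = env_bool("SHOULD_SAVE_STATS", True)

print(FORCE_RECREATE, OUTPUT_DIR, USE_CHUNKING, SHOULD_SAVE_STATS)
```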

## Running Locally

### Using Docker

```bash
docker build -t lets-talk .
docker run -p 7860:7860 \
  --env-file ./.env \
  lets-talk
```

### Using Python

```bash
# Install dependencies
uv sync

# Run the application
uv run chainlit run py-src/app.py --host 0.0.0.0 --port 7860
```

## Deployment

The application is designed to be deployed on:

- **Development**: Hugging Face Spaces ([Live Demo](https://huggingface.co/spaces/mafzaal/lets_talk))
- **Production**: Azure Container Apps (planned)

## Evaluation

This project includes extensive evaluation capabilities using the Ragas framework:

- **Synthetic Data Generation**: For creating test datasets
- **Metric Evaluation**: Measuring faithfulness, relevance, and more
- **Fine-tuning Analysis**: Comparing different embedding models

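At its core, the fine-tuning analysis compares how embedding models rank candidate contexts for a query. A minimal pure-Python sketch (the vectors below are made up; in the real evaluation they come from `Snowflake/snowflake-arctic-embed-l` and the fine-tuned `mafzaal/thedataguy_arctic_ft`):

```python
import math

# Toy ranking comparison with cosine similarity; vectors are invented examples.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_contexts(query_vec: list[float], context_vecs: list[list[float]]) -> list[int]:
    """Return context indices sorted by similarity to the query, best first."""
    sims = [cosine(query_vec, v) for v in context_vecs]
    return sorted(range(len(sims)), key=lambda i: -sims[i])


query = [1.0, 0.0, 0.5]
contexts = [[0.9, 0.1, 0.4], [0.0, 1.0, 0.0]]  # context 0 is the relevant one
print(rank_contexts(query, contexts))
```

A better-tuned model places the truly relevant context at rank 0 more often, which is what retrieval-oriented Ragas metrics reward.
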
## Future Enhancements

- **Agentic Reasoning**: Adding more sophisticated agent capabilities
- **Web UI Integration**: Custom Svelte component for the blog
- **CI/CD**: GitHub Actions workflow for automated deployment
- **Monitoring**: LangSmith integration for observability

## License

This project is available under the MIT License.

## Acknowledgements

- [TheDataGuy blog](https://thedataguy.pro/blog/) for the content
- [Ragas](https://docs.ragas.io/) for the evaluation framework
- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) for RAG components
- [Chainlit](https://docs.chainlit.io/) for the chat interface

Check out the [Hugging Face Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference).