---
title: Lets Talk
emoji: 🎨
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
---

# Welcome to TheDataGuy Chat!

This is a Q&A chatbot powered by posts from [TheDataGuy blog](https://thedataguy.pro/blog/). Ask questions about topics covered on the blog, such as:

- Ragas and RAG evaluation
- Building research agents
- Metric-driven development
- Data science best practices

## How it works

Under the hood, this application uses:

1. **Snowflake Arctic Embeddings**: To convert text into vector representations
   - Base model: `Snowflake/snowflake-arctic-embed-l`
   - Fine-tuned model: `mafzaal/thedataguy_arctic_ft` (custom-tuned using blog-specific query-context pairs)

2. **Qdrant Vector Database**: To store and search for similar content
   - Efficiently indexes blog post content for fast semantic search
   - Supports real-time updates when new blog posts are published

3. **GPT-4o-mini**: To generate helpful responses based on retrieved content
   - Primary model: OpenAI `gpt-4o-mini` for production inference
   - Evaluation model: OpenAI `gpt-4.1` for complex tasks, including synthetic data generation and evaluation

4. **LangChain**: For building the RAG workflow
   - Orchestrates the retrieval and generation components
   - Provides flexible components for LLM application development
   - Structured for easy maintenance and future enhancements

5. **Chainlit**: For the chat interface
   - Offers an interactive UI with message threading
   - Supports file uploads and custom components

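The retrieve-then-generate loop described above can be sketched in plain Python. This is an illustrative toy only: the function names and the word-overlap retriever are invented for the example, while the real app performs a Qdrant similarity search over Arctic embeddings and sends the prompt to `gpt-4o-mini`.

```python
# Toy sketch of the RAG flow. All names here are hypothetical, not from rag.py.

RAG_PROMPT = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the question.

    In the real app this step is a Qdrant similarity search over
    Snowflake Arctic embeddings.
    """
    q_words = set(question.lower().split())
    scored = sorted(documents, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]


def build_prompt(question: str, documents: list[str]) -> str:
    """Assemble the prompt that would be sent to the LLM."""
    context = "\n\n".join(retrieve(question, documents))
    return RAG_PROMPT.format(context=context, question=question)


docs = [
    "Ragas provides metrics such as faithfulness for RAG evaluation.",
    "Chainlit renders the chat interface for the application.",
]
prompt = build_prompt("How is RAG evaluation done with Ragas?", docs)
print(prompt)
```

The same shape holds in the real pipeline; only the retriever and the generation call are swapped for Qdrant and OpenAI clients.
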
## Technology Stack

### Core Components

- **Vector Database**: Qdrant (stores embeddings via `pipeline.py`)
- **Embedding Model**: Snowflake Arctic Embeddings
- **LLM**: OpenAI GPT-4o-mini
- **Framework**: LangChain + Chainlit
- **Development Language**: Python 3.13

### Advanced Features

- **Evaluation**: Ragas metrics for evaluating RAG performance:
  - Faithfulness
  - Context Relevancy
  - Answer Relevancy
  - Topic Adherence
- **Synthetic Data Generation**: For training and testing
- **Vector Store Updates**: Automated pipeline to update when new blog content is published
- **Fine-tuned Embeddings**: Custom embeddings tuned for technical content

## Project Structure

```
lets-talk/
├── data/                # Raw blog post content
├── py-src/              # Python source code
│   ├── lets_talk/       # Core application modules
│   │   ├── agent.py     # Agent implementation
│   │   ├── config.py    # Configuration settings
│   │   ├── models.py    # Data models
│   │   ├── prompts.py   # LLM prompt templates
│   │   ├── rag.py       # RAG implementation
│   │   ├── rss_tool.py  # RSS feed integration
│   │   └── tools.py     # Tool implementations
│   ├── app.py           # Main application entry point
│   └── pipeline.py      # Data processing pipeline
├── db/                  # Vector database storage
├── evals/               # Evaluation datasets and results
└── notebooks/           # Jupyter notebooks for analysis
```

## Environment Setup

The application requires the following environment variables:

```
OPENAI_API_KEY=your_openai_api_key
VECTOR_STORAGE_PATH=./db/vector_store_tdg
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=Snowflake/snowflake-arctic-embed-l
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
```
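
For intuition, `CHUNK_SIZE` and `CHUNK_OVERLAP` control a sliding-window split over each post, roughly like the following simplified sketch (the actual pipeline uses LangChain's text splitters, which also respect sentence and paragraph boundaries):

```python
# Simplified character-window chunker; illustrative only.

def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]


text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(text, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # consecutive chunks share chunk_overlap characters
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.
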

Additional configuration options for vector database creation:

```
# Vector Database Creation Configuration
FORCE_RECREATE=False    # Whether to force recreation of the vector store
OUTPUT_DIR=./stats      # Directory to save stats and artifacts
USE_CHUNKING=True       # Whether to split documents into chunks
SHOULD_SAVE_STATS=True  # Whether to save statistics about the documents
```
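
One subtlety worth noting: environment values arrive as strings, so boolean flags like `FORCE_RECREATE=False` need explicit parsing (`bool("False")` is truthy). A hypothetical sketch of how such flags might be read (the real `config.py` may differ):

```python
import os

# Env values are strings, so parse boolean flags explicitly.
def env_bool(name: str, default: bool) -> bool:
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")


FORCE_RECREATE = env_bool("FORCE_RECREATE", False)
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "./stats")
USE_CHUNKING = env_bool("USE_CHUNKING", True)
SHOULD_SAVE_STATS = env_bool("SHOULD_SAVE_STATS", True)

print(FORCE_RECREATE, OUTPUT_DIR, USE_CHUNKING, SHOULD_SAVE_STATS)
```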

## Running Locally

### Using Docker

```bash
docker build -t lets-talk .
docker run -p 7860:7860 \
  --env-file ./.env \
  lets-talk
```

### Using Python

```bash
# Install dependencies
uv sync

# Run the application
uv run chainlit run py-src/app.py --host 0.0.0.0 --port 7860
```

## Deployment

The application is designed to be deployed on:

- **Development**: Hugging Face Spaces ([Live Demo](https://huggingface.co/spaces/mafzaal/lets_talk))
- **Production**: Azure Container Apps (planned)

## Evaluation

This project includes extensive evaluation capabilities using the Ragas framework:

- **Synthetic Data Generation**: For creating test datasets
- **Metric Evaluation**: Measuring faithfulness, relevance, and more
- **Fine-tuning Analysis**: Comparing different embedding models

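At its core, the fine-tuning analysis compares how embedding models rank candidate contexts for a query. A minimal pure-Python sketch (the vectors below are made up; in the real evaluation they come from `Snowflake/snowflake-arctic-embed-l` and the fine-tuned `mafzaal/thedataguy_arctic_ft`):

```python
import math

# Toy ranking comparison with cosine similarity; vectors are invented examples.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_contexts(query_vec: list[float], context_vecs: list[list[float]]) -> list[int]:
    """Return context indices sorted by similarity to the query, best first."""
    sims = [cosine(query_vec, v) for v in context_vecs]
    return sorted(range(len(sims)), key=lambda i: -sims[i])


query = [1.0, 0.0, 0.5]
contexts = [[0.9, 0.1, 0.4], [0.0, 1.0, 0.0]]  # context 0 is the relevant one
print(rank_contexts(query, contexts))
```

A better-tuned model places the truly relevant context at rank 0 more often, which is what retrieval-oriented Ragas metrics reward.
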
## Future Enhancements

- **Agentic Reasoning**: Adding more sophisticated agent capabilities
- **Web UI Integration**: Custom Svelte component for the blog
- **CI/CD**: GitHub Actions workflow for automated deployment
- **Monitoring**: LangSmith integration for observability

## License

This project is available under the MIT License.

## Acknowledgements

- [TheDataGuy blog](https://thedataguy.pro/blog/) for the content
- [Ragas](https://docs.ragas.io/) for the evaluation framework
- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) for RAG components
- [Chainlit](https://docs.chainlit.io/) for the chat interface

Check out the [Hugging Face Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference).