
Nikhil Pravin Pise

docs: update all documentation to reflect current codebase state

aefac4f 19 days ago


	# RagBot Development Guide

	## For Developers & Maintainers

	This guide covers extending, customizing, and contributing to RagBot.

	## Project Structure

	```
	RagBot/
	├── src/  # Core application code
	│   ├── __init__.py  # Package marker
	│   ├── workflow.py  # Multi-agent workflow orchestration
	│   ├── state.py  # Pydantic data models & state
	│   ├── biomarker_validator.py  # Biomarker validation logic
	│   ├── biomarker_normalization.py  # Alias-to-canonical name mapping (80+ aliases)
	│   ├── llm_config.py  # LLM & embedding configuration
	│   ├── pdf_processor.py  # PDF loading & vector store
	│   ├── config.py  # Global configuration
	│   │
	│   ├── agents/  # Specialist agents
	│   │   ├── __init__.py  # Package marker
	│   │   ├── biomarker_analyzer.py  # Validates biomarkers
	│   │   ├── disease_explainer.py  # Explains disease (RAG)
	│   │   ├── biomarker_linker.py  # Links biomarkers to disease (RAG)
	│   │   ├── clinical_guidelines.py  # Provides guidelines (RAG)
	│   │   ├── confidence_assessor.py  # Assesses prediction confidence
	│   │   └── response_synthesizer.py  # Synthesizes findings
	│   │
	│   ├── evaluation/  # Evaluation framework
	│   │   ├── __init__.py
	│   │   └── evaluators.py  # Quality evaluators
	│   │
	│   └── evolution/  # Experimental components
	│       ├── __init__.py
	│       ├── director.py  # Evolution orchestration
	│       └── pareto.py  # Pareto optimization
	│
	├── api/  # REST API application
	│   ├── app/
	│   │   ├── main.py  # FastAPI application
	│   │   ├── routes/  # API endpoints
	│   │   │   ├── analyze.py  # Main analysis endpoint
	│   │   │   ├── biomarkers.py  # Biomarker endpoints
	│   │   │   └── health.py  # Health check
	│   │   ├── models/  # Pydantic schemas
	│   │   └── services/  # Business logic
	│   ├── requirements.txt
	│   ├── Dockerfile
	│   └── docker-compose.yml
	│
	├── scripts/  # Utility & demo scripts
	│   ├── chat.py  # Interactive CLI
	│   ├── setup_embeddings.py  # Vector store builder
	│   ├── run_api.ps1  # API startup script
	│   └── ...
	│
	├── config/  # Configuration files
	│   └── biomarker_references.json  # Biomarker reference ranges
	│
	├── data/  # Data storage
	│   ├── medical_pdfs/  # Source medical documents
	│   └── vector_stores/  # FAISS vector databases
	│
	├── tests/  # Test suite
	│   └── test_*.py
	│
	├── docs/  # Documentation
	│   ├── ARCHITECTURE.md  # System design
	│   ├── API.md  # API reference
	│   ├── DEVELOPMENT.md  # This file
	│   └── ...
	│
	├── examples/  # Example integrations
	│   ├── test_website.html  # Web integration example
	│   └── website_integration.js  # JavaScript client
	│
	├── requirements.txt  # Python dependencies
	├── README.md  # Main documentation
	├── QUICKSTART.md  # Setup guide
	├── CONTRIBUTING.md  # Contribution guidelines
	└── LICENSE
	```

	## Development Setup

	### 1. Clone & Install

	```bash
	git clone https://github.com/yourusername/ragbot.git
	cd ragbot
	python -m venv .venv
	.venv\Scripts\activate       # Windows
	# source .venv/bin/activate  # macOS/Linux
	pip install -r requirements.txt
	```

	### 2. Configure

	```bash
	cp .env.template .env
	# Edit .env with your API keys (Groq, Google, etc.)
	```

	### 3. Rebuild Vector Store

	```bash
	python scripts/setup_embeddings.py
	```

	### 4. Run Tests

	```bash
	pytest tests/
	```

	## Key Development Tasks

	### Adding a New Biomarker

	Step 1: Update reference ranges in `config/biomarker_references.json`:

	```json
	{
	  "biomarkers": {
	    "New Biomarker": {
	      "min": 0,
	      "max": 100,
	      "unit": "mg/dL",
	      "normal_range": "0-100",
	      "critical_low": -1,
	      "critical_high": 150,
	      "related_conditions": ["Disease1", "Disease2"]
	    }
	  }
	}
	```
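As a rough illustration of how a validator could consume an entry like this, the sketch below classifies a value against the reference fields. The function name `classify_value` and the status labels are assumptions made for this example; they are not the actual `BiomarkerValidator` API.

```python
import json

# Same shape as the entry shown above, inlined so the sketch is self-contained.
REFERENCES_JSON = """
{
  "biomarkers": {
    "New Biomarker": {
      "min": 0,
      "max": 100,
      "unit": "mg/dL",
      "critical_low": -1,
      "critical_high": 150
    }
  }
}
"""


def classify_value(name: str, value: float, references: dict) -> str:
    """Return a coarse status label for a value (hypothetical helper)."""
    ref = references["biomarkers"].get(name)
    if ref is None:
        raise ValueError(f"Unknown biomarker: {name}")
    # Check the critical thresholds first so they take precedence.
    if value <= ref["critical_low"] or value >= ref["critical_high"]:
        return "critical"
    if value < ref["min"]:
        return "low"
    if value > ref["max"]:
        return "high"
    return "normal"


references = json.loads(REFERENCES_JSON)
print(classify_value("New Biomarker", 50, references))   # normal
print(classify_value("New Biomarker", 120, references))  # high
print(classify_value("New Biomarker", 160, references))  # critical
```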

	Step 2: Add aliases in `src/biomarker_normalization.py`:

	```python
	NORMALIZATION_MAP = {
	    # ... existing entries ...
	    "your alias": "New Biomarker",
	    "other name": "New Biomarker",
	}
	```

	All consumers (CLI, API, workflow) use this shared map automatically.
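A lookup against such a map might look like the sketch below. The helper name `normalize_name` is hypothetical, and the case-insensitive, whitespace-collapsing behavior is an assumption rather than a description of what `src/biomarker_normalization.py` does.

```python
# Hypothetical excerpt; the real map lives in src/biomarker_normalization.py.
NORMALIZATION_MAP = {
    "your alias": "New Biomarker",
    "other name": "New Biomarker",
}


def normalize_name(raw_name: str) -> str:
    """Map an alias to its canonical name; unknown names pass through trimmed."""
    key = " ".join(raw_name.lower().split())  # lowercase and collapse whitespace
    return NORMALIZATION_MAP.get(key, raw_name.strip())


print(normalize_name("  Your   Alias "))  # New Biomarker
print(normalize_name("Glucose"))          # Glucose
```

Keeping the keys lowercase in one shared map means every caller gets the same answer for the same alias, whatever casing the user typed.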

	Step 3: Add validation test in `tests/test_basic.py`:

	```python
	from src.biomarker_validator import BiomarkerValidator

	def test_new_biomarker():
	    validator = BiomarkerValidator()
	    result = validator.validate("New Biomarker", 50)
	    assert result.is_valid
	```

	Step 4: No knowledge-base change is needed. The RAG agents retrieve medical context for the new biomarker from the existing vector store.

	### Adding a New Medical Domain

	Step 1: Collect relevant PDFs:
	```
	data/medical_pdfs/
	your_domain.pdf
	your_guideline.pdf
	```

	Step 2: Rebuild vector store:
	```bash
	python scripts/setup_embeddings.py
	```

	The system automatically:
	- Loads all PDFs from `data/medical_pdfs/`
	- Splits them into chunks (currently 2,609+) and indexes them for similarity search
	- Makes knowledge available to all RAG agents
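To make the chunk-then-search idea concrete, here is a toy sketch using only the standard library. It is not the project's pipeline: the real system uses embeddings and FAISS, whereas this stand-in uses word-count vectors, and the chunk size, overlap, and function names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into overlapping windows of `chunk_size` words."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last window is not fully contained in the previous one.
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + chunk_size]) for i in starts]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = Counter(query.lower().split())
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(c.lower().split())), reverse=True)
    return ranked[:k]


text = (
    "Elevated fasting glucose suggests impaired insulin response. "
    "Low hemoglobin may indicate anemia and warrants iron studies."
)
chunks = chunk_words(text)
top = search("hemoglobin anemia", chunks)
print(len(chunks), "chunks; best match:", top[0])
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.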

	Step 3: Test with new biomarkers from that domain:
	```bash
	python scripts/chat.py
	# Input: biomarkers related to your domain
	```

	### Creating a Custom Analysis Agent

	Example: Add a "Medication Interactions" Agent

	Step 1: Create `src/agents/medication_checker.py`:

	```python
	from src.llm_config import LLMConfig
	from src.state import PatientInput

	class MedicationChecker:
	    def __init__(self):
	        config = LLMConfig()
	        self.llm = config.analyzer  # Uses centralized LLM config

	    def check_interactions(self, state: PatientInput) -> dict:
	        """Check medication interactions based on biomarkers."""
	        # Get relevant medical knowledge
	        # Use LLM to identify drug-drug interactions
	        # Return structured response
	        return {
	            "interactions": [],
	            "warnings": [],
	            "recommendations": []
	        }
	```

	Step 2: Register in workflow (`src/workflow.py`):

	```python
	from src.agents.medication_checker import MedicationChecker

	medication_agent = MedicationChecker()

	def check_medications(state):
	    # Return the result under the state key the synthesizer reads,
	    # so it is merged into the shared state as a partial update.
	    return {"medication_interactions": medication_agent.check_interactions(state)}

	# Add to graph
	graph.add_node("MedicationChecker", check_medications)
	graph.add_edge("ClinicalGuidelines", "MedicationChecker")
	graph.add_edge("MedicationChecker", "ResponseSynthesizer")
	```

	Step 3: Update synthesizer to include medication info:

	```python
	# In response_synthesizer.py
	medication_info = state.get("medication_interactions", {})
	```
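For the synthesizer lookup above to find anything, the new node's result has to end up under the `medication_interactions` key of the shared state. The sketch below shows that data flow without LangGraph, assuming a dict-style state (as the `state.get(...)` call suggests). `run_pipeline`, the node bodies, and the `final_response` key are hypothetical.

```python
from typing import Callable

State = dict
Node = Callable[[State], dict]


def check_medications(state: State) -> dict:
    """Hypothetical node: returns only the key it wants to add to the state."""
    return {
        "medication_interactions": {
            "interactions": [],
            "warnings": [],
            "recommendations": [],
        }
    }


def synthesize(state: State) -> dict:
    """Hypothetical synthesizer: tolerates a missing medication section."""
    medication_info = state.get("medication_interactions", {})
    return {"final_response": {"medication_warnings": medication_info.get("warnings", [])}}


def run_pipeline(state: State, nodes: list[Node]) -> State:
    """Run nodes in order, merging each partial update into a copy of the state."""
    current = dict(state)
    for node in nodes:
        current.update(node(current))
    return current


result = run_pipeline({"biomarkers": {"Glucose": 185}}, [check_medications, synthesize])
print(result["final_response"])
```

Reading with `.get(..., {})` lets the synthesizer keep working when the medication node is not part of the graph.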

	### Switching LLM Providers

	RagBot supports three LLM providers out of the box. Set via `LLM_PROVIDER` in `.env`:

	| Provider | Model | Cost | Speed |
	|----------|-------|------|-------|
	| `groq` (default) | llama-3.3-70b-versatile | Free | Fast |
	| `gemini` | gemini-2.0-flash | Free | Medium |
	| `ollama` | configurable | Free (local) | Varies |

	```bash
	# .env
	LLM_PROVIDER="groq"
	GROQ_API_KEY="gsk_..."

	# Or
	LLM_PROVIDER="gemini"
	GOOGLE_API_KEY="..."
	```

	No code changes needed — `src/llm_config.py` handles provider selection automatically.
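Environment-driven provider selection can be pictured roughly as follows. This is a sketch only: the model names and key variables come from the table and `.env` example above, but `resolve_provider` and the `PROVIDERS` structure are assumptions and do not mirror the real `src/llm_config.py`.

```python
# Models and key names are taken from the table and .env example above;
# the lookup structure itself is an assumption made for illustration.
PROVIDERS = {
    "groq": {"model": "llama-3.3-70b-versatile", "api_key_env": "GROQ_API_KEY"},
    "gemini": {"model": "gemini-2.0-flash", "api_key_env": "GOOGLE_API_KEY"},
    "ollama": {"model": None, "api_key_env": None},  # local; model is configurable
}


def resolve_provider(env: dict[str, str]) -> dict:
    """Pick provider settings from environment-style variables (hypothetical helper)."""
    name = env.get("LLM_PROVIDER", "groq").strip().lower()
    if name not in PROVIDERS:
        raise ValueError(f"Unsupported LLM_PROVIDER: {name!r}")
    settings = {"provider": name, **PROVIDERS[name]}
    key_env = settings["api_key_env"]
    # Fail early with a clear message instead of at the first LLM call.
    if key_env and not env.get(key_env):
        raise ValueError(f"{key_env} must be set when LLM_PROVIDER={name}")
    return settings


print(resolve_provider({"LLM_PROVIDER": "gemini", "GOOGLE_API_KEY": "example"}))
```

In an application the argument would typically be `dict(os.environ)`. Taking the mapping as a parameter keeps the helper easy to test.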

	### Modifying Embedding Provider

	- Current default: Google Gemini (`models/embedding-001`, free)
	- Fallback: HuggingFace sentence-transformers (local, no API key needed)
	- Optional: Ollama (local)

	Set via `EMBEDDING_PROVIDER` in `.env`:
	```bash
	EMBEDDING_PROVIDER="google" # Default - Google Gemini
	EMBEDDING_PROVIDER="huggingface" # Fallback - local
	EMBEDDING_PROVIDER="ollama" # Local Ollama
	```

	After changing, rebuild the vector store:
	```bash
	python scripts/setup_embeddings.py
	```

	⚠️ Note: Changing embeddings requires rebuilding the vector store (dimensions must match).
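The dimension constraint can be made explicit with a guard like the one sketched here. `check_dimensions` is a hypothetical helper, and the vector sizes are illustrative only; they are not the real output sizes of any provider's model.

```python
def check_dimensions(index_dim: int, query_vector: list[float]) -> None:
    """Raise if a query embedding cannot be searched against the stored index."""
    if len(query_vector) != index_dim:
        raise ValueError(
            f"Embedding dimension mismatch: index has {index_dim}, "
            f"query has {len(query_vector)}. Rebuild the vector store."
        )


# Illustrative sizes only, not those of any actual embedding model.
stored_dim = 4
check_dimensions(stored_dim, [0.1, 0.2, 0.3, 0.4])  # passes silently

try:
    check_dimensions(stored_dim, [0.1, 0.2, 0.3])
except ValueError as exc:
    print(exc)
```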

	## Testing

	### Run All Tests

	```bash
	.venv\Scripts\python.exe -m pytest tests/ -q --ignore=tests/test_basic.py --ignore=tests/test_diabetes_patient.py --ignore=tests/test_evolution_loop.py --ignore=tests/test_evolution_quick.py --ignore=tests/test_evaluation_system.py
	```

	### Run Specific Test

	```bash
	.venv\Scripts\python.exe -m pytest tests/test_normalization.py -v
	```

	### Test Coverage

	```bash
	.venv\Scripts\python.exe -m pytest --cov=src tests/
	```

	### Add New Tests

	Create `tests/test_myfeature.py`:

	```python
	from src.biomarker_validator import BiomarkerValidator

	class TestMyFeature:
	    def setup_method(self):
	        self.validator = BiomarkerValidator()

	    def test_validation(self):
	        result = self.validator.validate("Glucose", 140)
	        assert not result.is_valid
	        assert result.status == "out-of-range"
	```

	## Debugging

	### Enable Debug Logging

	Set in `.env`:
	```
	LOG_LEVEL=DEBUG
	```
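One plausible way for an application to honor `LOG_LEVEL` is sketched below with the standard `logging` module. The `configure_logging` helper, the `INFO` default, and the log format are assumptions, and the project's actual logging setup may differ.

```python
import logging
import os


def configure_logging() -> int:
    """Configure root logging from LOG_LEVEL, defaulting to INFO (assumed default)."""
    level_name = os.getenv("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        raise ValueError(f"Invalid LOG_LEVEL: {level_name}")
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        force=True,  # replace any handlers configured earlier
    )
    return level


os.environ["LOG_LEVEL"] = "DEBUG"  # normally supplied by the environment or .env
configure_logging()
logging.getLogger("ragbot.demo").debug("debug logging is enabled")
```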

	### Interactive Debugging

	```bash
	python -c "
	from src.workflow import create_guild

	# Create the guild
	guild = create_guild()

	# Run workflow
	result = guild.run({
	    'biomarkers': {'Glucose': 185, 'HbA1c': 8.2},
	    'model_prediction': {'disease': 'Diabetes', 'confidence': 0.87}
	})

	# Inspect result
	print(result)
	"
	```

	### Profile Performance

	```bash
	python -m cProfile -s cumtime scripts/chat.py
	```

	## Code Quality

	### Format Code

	```bash
	black src/ api/ scripts/
	```

	### Check Types

	```bash
	mypy src/ --ignore-missing-imports
	```

	### Lint

	```bash
	pylint src/ api/ scripts/
	```

	### Pre-commit Hook

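	Create `.git/hooks/pre-commit` and make it executable (`chmod +x .git/hooks/pre-commit`):

	```bash
	#!/bin/bash
	set -e  # abort the commit if any check fails
	black --check src/ api/ scripts/
	pytest tests/
	```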

	## Documentation

	- Update `docs/` when adding features
	- Keep README.md in sync with changes
	- Document all new functions with docstrings:

	```python
	def analyze_biomarker(name: str, value: float) -> dict:
	    """
	    Analyze a single biomarker value.

	    Args:
	        name: Biomarker name (e.g., "Glucose")
	        value: Measured value

	    Returns:
	        dict: Analysis result with status, alerts, recommendations

	    Raises:
	        ValueError: If biomarker name is invalid
	    """
	```

	## Performance Optimization

	### Profile Agent Execution

	```python
	import time

	start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
	result = agent.run(state)
	elapsed = time.perf_counter() - start
	print(f"Agent took {elapsed:.2f}s")
	```
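When several agents need timing, a small context manager avoids repeating the start/stop bookkeeping. The sketch below is self-contained: `timed`, `timings`, and `fake_agent` are hypothetical names, and the sleeping function merely stands in for a real agent call.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def timed(label: str):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failures are timed too.
        timings[label] = time.perf_counter() - start


def fake_agent(state: dict) -> dict:
    """Stand-in for a real agent call; sleeps briefly to simulate work."""
    time.sleep(0.05)
    return {"status": "done"}


with timed("fake_agent"):
    result = fake_agent({})

for label, elapsed in timings.items():
    print(f"{label} took {elapsed:.2f}s")
```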

	### Parallel Agent Execution

	The RAG agents (2-4) already run in parallel via LangGraph; the other stages run in sequence:
	- Agent 1: Biomarker Analyzer
	- Agents 2-4: RAG agents (parallel)
	- Agent 5: Confidence Assessor
	- Agent 6: Synthesizer

	Modify in `src/workflow.py` if needed.
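The fan-out/fan-in shape described above can be illustrated with the standard library. This is not how LangGraph schedules nodes internally; it is a thread-pool stand-in, and the agent factory, state keys, and delays are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_rag_agent(name: str, delay: float = 0.1):
    """Build a hypothetical RAG agent that simulates I/O-bound retrieval."""
    def agent(state: dict) -> dict:
        time.sleep(delay)  # stands in for retrieval and LLM latency
        return {name: f"findings for {sorted(state['biomarkers'])}"}
    return agent


rag_agents = [
    make_rag_agent("disease_explainer"),
    make_rag_agent("biomarker_linker"),
    make_rag_agent("clinical_guidelines"),
]

state = {"biomarkers": {"Glucose": 185, "HbA1c": 8.2}}

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(rag_agents)) as pool:
    # Fan out: every agent receives the same input state.
    updates = list(pool.map(lambda agent: agent(state), rag_agents))
elapsed = time.perf_counter() - start

# Fan in: merge the partial results before the next sequential stage.
for update in updates:
    state.update(update)

print(sorted(key for key in state if key != "biomarkers"))
print(f"{len(rag_agents)} agents finished in {elapsed:.2f}s")
```

Because the simulated work is I/O-bound, the three calls overlap and the total time stays close to that of a single call rather than the sum of all three.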

	### Cache Embeddings

	The FAISS vector store is loaded once at startup, so no additional caching is needed.

	### Reduce Processing Time

	- Fewer RAG docs: Lower `k` (currently `k=5`) where the RAG agents retrieve documents
	- Simpler LLM: Use a smaller or quantized model
	- Batch requests: Process multiple patients at once

	## Troubleshooting

	### Issue: Vector store not found

	```bash
	.venv\Scripts\python.exe scripts/setup_embeddings.py
	```

	### Issue: LLM provider not responding

	- Check your `.env` has valid API keys (`GROQ_API_KEY` or `GOOGLE_API_KEY`)
	- Verify internet connection
	- Check provider status pages (Groq Console, Google AI Studio)

	### Issue: Slow inference

	- Check Groq API status
	- Verify internet connection
	- Try smaller model or batch requests

	## Contributing

	See [CONTRIBUTING.md](../CONTRIBUTING.md) for:
	- Code style guidelines
	- Pull request process
	- Issue reporting
	- Testing requirements

	## Support

	- Issues: GitHub Issues
	- Discussions: GitHub Discussions
	- Documentation: See `/docs`

	## Resources

	- [LangGraph Docs](https://langchain-ai.github.io/langgraph/)
	- [Groq API Docs](https://console.groq.com)
	- [FAISS Documentation](https://github.com/facebookresearch/faiss/wiki)
	- [FastAPI Guide](https://fastapi.tiangolo.com/)
	- [Pydantic V2](https://docs.pydantic.dev/latest/)