# Copilot Instructions for Jacek AI

This file provides guidance for GitHub Copilot when working with the Jacek AI codebase, a bilingual (Polish/English) accessibility chatbot using RAG with LanceDB and OpenAI GPT-4.

## Build, Test, and Run Commands

### Running the Application

```bash
# Local development - starts Gradio UI at http://127.0.0.1:7860
python app.py

# Run all startup tests before deployment
python test_startup.py
```

### Environment Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Configure environment (required before first run)
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
```

### Database Management

```bash
# Compact LanceDB (removes version history, reduces file count)
python compact_database.py

# Check document count
python -c "import lancedb; db = lancedb.connect('./lancedb'); print(len(db.open_table('a11y_expert')))"
```

### Testing

```bash
# Run full test suite (imports, config, vector store, embeddings, agent)
python test_startup.py

# All tests must pass before deploying to Hugging Face Spaces
```

## Architecture Overview

### Core Components

**Agent System** (`agent/`)
- `a11y_agent.py`: Main `A11yExpertAgent` class with streaming responses via OpenAI
- `prompts.py`: Language-specific system prompts (Polish/English) with **strict language enforcement**
- `tools.py`: RAG tools for knowledge base search (top-5 semantic results)

**Vector Store** (`database/`)
- `vector_store_client.py`: LanceDB client with lazy loading and automatic reconnection
- Database path: `./lancedb/a11y_expert.lance` (tracked with Git LFS)
- **READ-ONLY in production** (Hugging Face Spaces environment)

**Embeddings** (`models/`)
- `embeddings.py`: OpenAI embeddings client with disk caching (`./cache/embeddings`) and retry logic
- Model: `text-embedding-3-large` (3072 dimensions)
- Singleton pattern: use `get_embeddings_client()` for a shared instance

**UI** (`app.py`)
- Gradio ChatInterface with a two-column layout (chat + notes from
`notes.md`)
- **Lazy agent initialization**: the agent loads on the first user query, not at startup
- Streaming responses for better UX

**Configuration** (`config.py`)
- Pydantic settings with environment variable support
- All config loaded from the `.env` file (never hardcode secrets)
- Required: `OPENAI_API_KEY` (OpenAI API key for LLM and embeddings)

### Data Flow (RAG Pipeline)

1. User asks a question in the Gradio UI
2. Language detected from the query using `langdetect` (Polish or English)
3. Query embedded using the OpenAI embeddings API (with cache lookup)
4. Vector search in LanceDB, filtered by language: `where="language = 'pl'"` or `'en'`
5. Top 5 results formatted as context
6. Context + query + language-specific system prompt sent to GPT-4
7. Response streamed back to the UI token by token

### Key Design Patterns

- **Lazy Initialization**: Agent and database connections initialize on first use, not at startup (faster deployment)
- **Singleton Pattern**: `get_embeddings_client()` returns a shared instance across the app
- **Language Detection**: Auto-detects the query language and adjusts both the prompt and the vector search filter
- **Stateless Agent**: No internal conversation history (Gradio handles history in the UI)
- **Conversation Context**: Last 4 messages kept in context for follow-up questions

## Key Conventions

### Language Handling - CRITICAL

The agent has **strict language enforcement** in system prompts:

- Polish queries get `SYSTEM_PROMPT_PL` with "CRITICAL: Answer ONLY in Polish"
- English queries get `SYSTEM_PROMPT_EN` with "CRITICAL: Answer ONLY in English"
- System prompts explicitly instruct the LLM to translate sources if needed
- Vector search is language-filtered: `where="language = 'pl'"` or `where="language = 'en'"`

**When modifying prompts**: Never remove or weaken the language enforcement instructions - they prevent language mixing, which confuses users.
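The language-routing convention above (steps 2 and 4 of the pipeline) can be sketched as follows. This is an illustrative sketch, not the actual code in `agent/`: the helper name `route_language` and the truncated prompt strings are assumptions, and the real agent obtains the language code from `langdetect.detect()`.

```python
# Hypothetical sketch of language routing; the real prompts live in
# agent/prompts.py and the real detector is langdetect.detect().

SYSTEM_PROMPT_PL = "CRITICAL: Answer ONLY in Polish. ..."   # placeholder text
SYSTEM_PROMPT_EN = "CRITICAL: Answer ONLY in English. ..."  # placeholder text

def route_language(lang_code: str) -> tuple[str, str]:
    """Map a detected language code to (system prompt, LanceDB filter).

    Anything that is not Polish falls back to English, keeping the
    prompt and the vector search filter consistent with each other.
    """
    if lang_code == "pl":
        return SYSTEM_PROMPT_PL, "language = 'pl'"
    return SYSTEM_PROMPT_EN, "language = 'en'"

prompt, where_clause = route_language("pl")
print(where_clause)  # language = 'pl'
```

The key property worth preserving in any refactor: the prompt and the `where` filter are always selected together, so a query can never get a Polish prompt with English search results.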
### LanceDB Database - READ-ONLY in Production

- Database at `./lancedb/` is tracked with Git LFS (not generated at runtime)
- In Hugging Face Spaces, the database is read-only (the filesystem is immutable)
- For local development: use `VectorStoreClient.add_documents()` to add data
- After local changes: run `compact_database.py` to reduce the file count before committing
- Schema: `text`, `vector`, `source`, `language`, `doc_type`, `created_at`, `updated_at`

### Configuration Loading

All settings in `config.py` are loaded from environment variables:

```python
from config import get_settings

settings = get_settings()  # Singleton, cached
print(settings.llm_model)  # default: gpt-4o-mini
```

Never access environment variables directly - always use `get_settings()`.

### Hugging Face Spaces Deployment

**Critical deployment requirements**:

1. `demo.queue()` must be called explicitly (see `app.py:238-243`)
2. Do **NOT** use `atexit.register()` for cleanup (causes premature shutdown)
3. LanceDB must be committed with Git LFS (the database is read-only on HF)
4. API key stored as an HF Spaces Secret: `OPENAI_API_KEY`
5. The `if __name__ == "__main__"` block handles both local and HF deployments

**Testing before deployment**:

```bash
python test_startup.py  # All tests must pass
```

### Logging

Use loguru for all logging (already configured):

```python
from loguru import logger

logger.info("Starting process...")
logger.success("✅ Completed successfully")
logger.error(f"❌ Failed: {error}")
```

Set `LOG_LEVEL=DEBUG` in `.env` for verbose output during development.
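When adding documents locally, every row must match the schema listed in the LanceDB section above. A minimal sketch of building such a row is shown below; the helper name `make_row` and the example `doc_type` value are illustrative assumptions, and the real insert path is `VectorStoreClient.add_documents()`.

```python
from datetime import datetime, timezone

def make_row(text: str, vector: list[float], source: str,
             language: str, doc_type: str = "article") -> dict:
    """Build one record matching the LanceDB schema described above.

    `make_row` is a hypothetical helper; `vector` should be a
    text-embedding-3-large embedding (3072 dimensions).
    """
    now = datetime.now(timezone.utc).isoformat()
    return {
        "text": text,
        "vector": vector,
        "source": source,
        "language": language,
        "doc_type": doc_type,
        "created_at": now,
        "updated_at": now,
    }

# Example row for a Polish document (zero vector used as a stand-in)
row = make_row("Tekst o dostępności", [0.0] * 3072, "wcag_notes.md", "pl")
```

Keeping the `language` field accurate matters more than it looks: the runtime `where="language = '...'"` filter silently drops any row whose language tag is wrong.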
### Error Handling

- Always close resources in agent/client classes (implement a `close()` method)
- Use try/except with specific exception types
- Log the full traceback for debugging: `logger.error(traceback.format_exc())`
- For user-facing errors, provide clear Polish/English messages depending on the detected language

## Project Structure

```
JacekAI/
├── agent/                      # Core agent logic
│   ├── a11y_agent.py           # Main agent with RAG
│   ├── prompts.py              # Language-specific prompts (PL/EN)
│   └── tools.py                # Knowledge base search tools
├── database/
│   └── vector_store_client.py  # LanceDB client
├── models/
│   └── embeddings.py           # OpenAI embeddings with caching
├── lancedb/                    # Vector database (Git LFS)
│   └── a11y_expert.lance/
├── cache/                      # Embeddings cache (gitignored)
├── app.py                      # Gradio UI with lazy initialization
├── config.py                   # Pydantic settings (environment variables)
├── test_startup.py             # Deployment readiness tests
├── compact_database.py         # Database compaction utility
├── requirements.txt            # Python dependencies
├── .env.example                # Environment template
└── notes.md                    # Optional notes displayed in UI sidebar
```

## Important Implementation Notes

### When Adding New Features to Agent

1. Modifying prompts → edit `agent/prompts.py`
2. Adding new tools → add a function to `agent/tools.py`
3. Changing RAG logic → modify `agent/a11y_agent.py`
4. Test locally with `python app.py` and interact through the UI

### When Updating Dependencies

1. Edit `requirements.txt`
2. Run `pip install -r requirements.txt`
3. Test with `python test_startup.py`
4. Commit changes and test in HF Spaces

### When Debugging

- Set `LOG_LEVEL=DEBUG` in `.env` for verbose logging
- Agent initialization happens on the first query (check logs for "A11yExpertAgent initialized")
- The embeddings cache is at `./cache/embeddings` (create the directory if missing)
- Vector search logs show the context retrieved from the database

## Common Pitfalls

1. **DO NOT** modify the database in production (LanceDB is read-only on HF Spaces)
2.
**DO NOT** use `atexit.register()` in `app.py` (breaks HF Spaces deployment)
3. **DO NOT** weaken language enforcement in prompts (causes confusing mixed-language responses)
4. **DO NOT** access `os.environ` directly - always use `get_settings()`
5. **DO NOT** initialize the agent at module level - use the lazy initialization pattern
6. **DO NOT** forget to call `demo.queue()` before `demo.launch()` in Gradio

## Environment Variables

Required in the `.env` file:

- `OPENAI_API_KEY` - OpenAI API key for LLM and embeddings - **REQUIRED**

Optional (with defaults):

- `LLM_MODEL` - Language model (default: `gpt-4o-mini`)
- `LLM_BASE_URL` - API endpoint (default: GitHub Models endpoint)
- `EMBEDDING_MODEL` - Embedding model (default: `text-embedding-3-large`)
- `LANCEDB_URI` - Database path (default: `./lancedb`)
- `LANCEDB_TABLE` - Table name (default: `a11y_expert`)
- `LOG_LEVEL` - Logging verbosity (default: `INFO`)
- `SERVER_HOST` - Gradio host (default: `127.0.0.1`; use `0.0.0.0` for HF)
- `SERVER_PORT` - Gradio port (default: `7860`)

## Related Documentation

- `CLAUDE.md` - Detailed guidance for Claude Code (includes architectural details)
- `README.md` - User-facing documentation with setup instructions
- `HF_SPACES_GUIDE.md` - Hugging Face Spaces deployment guide
- `QUICK_REFERENCE.md` - Quick reference for common tasks
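As a reference, the environment variables listed above could appear in a local `.env` roughly as below. This is an illustrative sketch using the documented defaults, not the canonical template (see `.env.example`); the API key value is a placeholder, and `LLM_BASE_URL` is left out because its default endpoint is defined in `config.py`.

```bash
# .env - illustrative sketch; copy .env.example for the real template
OPENAI_API_KEY=sk-...                    # REQUIRED - never commit this value
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=text-embedding-3-large
LANCEDB_URI=./lancedb
LANCEDB_TABLE=a11y_expert
LOG_LEVEL=INFO                           # set to DEBUG for verbose local runs
SERVER_HOST=127.0.0.1                    # use 0.0.0.0 on Hugging Face Spaces
SERVER_PORT=7860
```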