# Copilot Instructions for Jacek AI

This file provides guidance for GitHub Copilot when working with the Jacek AI codebase - a bilingual (Polish/English) accessibility chatbot using RAG with LanceDB and OpenAI GPT-4.

## Build, Test, and Run Commands

### Running the Application

```bash
# Local development - starts Gradio UI at http://127.0.0.1:7860
python app.py

# Run all startup tests before deployment
python test_startup.py
```

### Environment Setup

```bash
# Install dependencies
pip install -r requirements.txt

# Configure environment (required before first run)
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
```

### Database Management

```bash
# Compact LanceDB (removes version history, reduces file count)
python compact_database.py

# Check document count
python -c "import lancedb; db = lancedb.connect('./lancedb'); print(len(db.open_table('a11y_expert')))"
```
### Testing

```bash
# Run full test suite (imports, config, vector store, embeddings, agent)
python test_startup.py
# All tests must pass before deploying to Hugging Face Spaces
```

## Architecture Overview

### Core Components

**Agent System** (`agent/`)
- `a11y_agent.py`: Main `A11yExpertAgent` class with streaming responses via OpenAI
- `prompts.py`: Language-specific system prompts (Polish/English) with **strict language enforcement**
- `tools.py`: RAG tools for knowledge base search (top-5 semantic results)

**Vector Store** (`database/`)
- `vector_store_client.py`: LanceDB client with lazy loading and automatic reconnection
- Database path: `./lancedb/a11y_expert.lance` (tracked with Git LFS)
- **READ-ONLY in production** (Hugging Face Spaces environment)

**Embeddings** (`models/`)
- `embeddings.py`: OpenAI embeddings client with disk caching (`./cache/embeddings`) and retry logic
- Model: `text-embedding-3-large` (3072 dimensions)
- Singleton pattern: use `get_embeddings_client()` for a shared instance
**UI** (`app.py`)
- Gradio ChatInterface with two-column layout (chat + notes from `notes.md`)
- **Lazy agent initialization** - the agent loads on the first user query, not at startup
- Streaming responses for better UX

**Configuration** (`config.py`)
- Pydantic settings with environment variable support
- All config loaded from the `.env` file (never hardcode secrets)
- Required: `OPENAI_API_KEY` (OpenAI API key for LLM and embeddings)

### Data Flow (RAG Pipeline)

1. User asks a question in the Gradio UI
2. Language detected from the query using `langdetect` (Polish or English)
3. Query embedded using the OpenAI embeddings API (with cache lookup)
4. Vector search in LanceDB, filtered by language: `where="language = 'pl'"` or `'en'`
5. Top 5 results formatted as context
6. Context + query + language-specific system prompt sent to GPT-4
7. Response streamed back to the UI token by token
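The steps above can be sketched end to end. This is a minimal, illustrative sketch, not the repository's actual code: the helper names (`detect_language`, `build_messages`, `answer`) are assumptions, the real app uses `langdetect` rather than the toy heuristic shown here, and the injected `embed`/`search`/`llm` callables stand in for the OpenAI and LanceDB clients:

```python
# Hedged sketch of the RAG pipeline; all names here are illustrative.

def detect_language(query: str) -> str:
    # Placeholder heuristic - the real app uses the `langdetect` package.
    polish_chars = set("ąćęłńóśźż")
    return "pl" if any(ch in polish_chars for ch in query.lower()) else "en"

def build_messages(query: str, context: str, language: str) -> list[dict]:
    # Language-specific system prompt with strict enforcement (step 6).
    system = ("CRITICAL: Answer ONLY in Polish" if language == "pl"
              else "CRITICAL: Answer ONLY in English")
    return [
        {"role": "system", "content": f"{system}\n\nContext:\n{context}"},
        {"role": "user", "content": query},
    ]

def answer(query, embed, search, llm, top_k=5):
    """embed/search/llm are injected callables (OpenAI + LanceDB in the real app)."""
    language = detect_language(query)                       # step 2
    vector = embed(query)                                   # step 3 (cached)
    hits = search(vector, where=f"language = '{language}'", limit=top_k)  # step 4
    context = "\n\n".join(hit["text"] for hit in hits)      # step 5
    return llm(build_messages(query, context, language))    # steps 6-7 (streamed)
```

Injecting the clients as callables keeps the pipeline shape testable without network access; the production code wires in the real OpenAI and LanceDB clients instead.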
### Key Design Patterns

- **Lazy Initialization**: Agent and database connections initialize on first use, not at startup (faster deployment)
- **Singleton Pattern**: `get_embeddings_client()` returns a shared instance across the app
- **Language Detection**: Auto-detects the query language and adjusts both the prompt and the vector search filter
- **Stateless Agent**: No internal conversation history (Gradio handles history in the UI)
- **Conversation Context**: Last 4 messages kept in context for follow-up questions
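The first two patterns can be sketched together. This is a shape sketch only: the class bodies are stand-ins, and only the `get_embeddings_client()` name comes from the codebase:

```python
# Illustrative sketch of the singleton and lazy-initialization patterns.
from functools import lru_cache

class EmbeddingsClient:
    """Stand-in for the real embeddings client in models/embeddings.py."""
    pass

@lru_cache(maxsize=1)
def get_embeddings_client() -> EmbeddingsClient:
    # First call constructs the client; every later call returns the same instance.
    return EmbeddingsClient()

class LazyAgent:
    """Stand-in for the lazy wrapper around A11yExpertAgent in app.py."""

    def __init__(self):
        self._agent = None  # nothing heavy happens at import/startup time

    def respond(self, query: str) -> str:
        if self._agent is None:       # initialize on the first user query
            self._agent = object()    # stands in for A11yExpertAgent()
        return f"answered: {query}"
```

`lru_cache(maxsize=1)` on a zero-argument factory is a common idiom for a module-level singleton; it avoids global statements and is trivially resettable in tests via `cache_clear()`.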
## Key Conventions

### Language Handling - CRITICAL

The agent has **strict language enforcement** in system prompts:
- Polish queries get `SYSTEM_PROMPT_PL` with "CRITICAL: Answer ONLY in Polish"
- English queries get `SYSTEM_PROMPT_EN` with "CRITICAL: Answer ONLY in English"
- System prompts explicitly instruct the LLM to translate sources if needed
- Vector search is language-filtered: `where="language = 'pl'"` or `where="language = 'en'"`

**When modifying prompts**: Never remove or weaken the language enforcement instructions - they prevent language mixing, which confuses users.
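The convention couples the prompt choice and the search filter to a single detected language, roughly like this. The prompt constant names match `agent/prompts.py`, but their wording here and the `select_prompt_and_filter` helper are illustrative:

```python
# Hedged sketch: one detected language drives both the prompt and the filter.
SYSTEM_PROMPT_PL = "CRITICAL: Answer ONLY in Polish. Translate sources if needed."
SYSTEM_PROMPT_EN = "CRITICAL: Answer ONLY in English. Translate sources if needed."

def select_prompt_and_filter(language: str) -> tuple[str, str]:
    """Return (system prompt, LanceDB where-clause) for the detected language."""
    if language == "pl":
        return SYSTEM_PROMPT_PL, "language = 'pl'"
    return SYSTEM_PROMPT_EN, "language = 'en'"
```

Keeping both decisions in one place is what makes the enforcement robust: a query can never retrieve Polish context while being answered under the English prompt, or vice versa.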
### LanceDB Database - READ-ONLY in Production

- Database at `./lancedb/` is tracked with Git LFS (not generated at runtime)
- In Hugging Face Spaces, the database is read-only (the filesystem is immutable)
- For local development, use `VectorStoreClient.add_documents()` to add data
- After local changes, run `compact_database.py` to reduce the file count before committing
- Schema: `text`, `vector`, `source`, `language`, `doc_type`, `created_at`, `updated_at`
### Configuration Loading

All settings in `config.py` are loaded from environment variables:

```python
from config import get_settings

settings = get_settings()  # Singleton, cached
print(settings.llm_model)  # gpt-4o-mini (default)
```

Never access environment variables directly - always use `get_settings()`.
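The general shape of such a module looks like the sketch below. The real `config.py` uses Pydantic settings; this stdlib version only illustrates the cached-singleton access pattern, and the field set is a guess based on the environment variables documented later in this file. Reading `os.environ` is confined to the settings module itself, which is exactly the point of the convention:

```python
# Illustrative stdlib sketch of config.py's shape (the real module uses Pydantic).
import os
from dataclasses import dataclass
from functools import lru_cache

@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    llm_model: str = "gpt-4o-mini"
    lancedb_uri: str = "./lancedb"
    log_level: str = "INFO"

@lru_cache(maxsize=1)
def get_settings() -> Settings:
    # The only place in the app that touches os.environ directly.
    return Settings(
        openai_api_key=os.environ.get("OPENAI_API_KEY", ""),
        llm_model=os.environ.get("LLM_MODEL", "gpt-4o-mini"),
        lancedb_uri=os.environ.get("LANCEDB_URI", "./lancedb"),
        log_level=os.environ.get("LOG_LEVEL", "INFO"),
    )
```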
### Hugging Face Spaces Deployment

**Critical deployment requirements**:
1. `demo.queue()` must be called explicitly (see `app.py:238-243`)
2. Do **NOT** use `atexit.register()` for cleanup (it causes premature shutdown)
3. LanceDB must be committed with Git LFS (the database is read-only on HF)
4. API key stored as an HF Spaces Secret: `OPENAI_API_KEY`
5. The `if __name__ == "__main__"` block handles both local and HF deployments
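Requirements 1 and 5 imply a `__main__` block shaped roughly like the sketch below. The `launch_kwargs` helper is hypothetical, and using the `SPACE_ID` environment variable (which Hugging Face sets automatically inside a Space) to detect the environment is an assumption about how `app.py` distinguishes the two deployments; the Gradio calls are commented out so the sketch stands alone:

```python
# Hedged sketch of the dual local/HF launch path; helper name is illustrative.
import os

def launch_kwargs() -> dict:
    """HF Spaces needs 0.0.0.0; local dev defaults to 127.0.0.1."""
    on_spaces = "SPACE_ID" in os.environ  # HF sets this inside a Space
    return {
        "server_name": "0.0.0.0" if on_spaces
                       else os.environ.get("SERVER_HOST", "127.0.0.1"),
        "server_port": int(os.environ.get("SERVER_PORT", "7860")),
    }

# import gradio as gr
# demo = gr.ChatInterface(respond)
# if __name__ == "__main__":
#     demo.queue()                  # requirement 1: call before launch
#     demo.launch(**launch_kwargs())
```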
**Testing before deployment**:

```bash
python test_startup.py  # All tests must pass
```

### Logging

Use loguru for all logging (already configured):

```python
from loguru import logger

logger.info("Starting process...")
logger.success("✅ Completed successfully")
logger.error(f"❌ Failed: {error}")
```

Set `LOG_LEVEL=DEBUG` in `.env` for verbose output during development.

### Error Handling

- Always close resources in agent/client classes (implement a `close()` method)
- Use try/except with specific exception types
- Log the full traceback for debugging: `logger.error(traceback.format_exc())`
- For user-facing errors, provide clear Polish or English messages depending on the detected language
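Those conventions combine into a wrapper like the one below. This is a sketch under assumptions: the `safe_answer` helper and the bilingual messages are illustrative, and it uses the stdlib logger so the snippet is self-contained, whereas the project itself uses loguru:

```python
# Hedged sketch of the error-handling conventions; names are illustrative.
import logging
import traceback

logger = logging.getLogger("jacek")  # the project uses loguru instead

ERROR_MESSAGES = {
    "pl": "Przepraszam, wystąpił błąd. Spróbuj ponownie.",
    "en": "Sorry, something went wrong. Please try again.",
}

def safe_answer(query: str, language: str, answer_fn) -> str:
    """Run answer_fn, logging failures and returning a user-facing message."""
    try:
        return answer_fn(query)
    except (ConnectionError, TimeoutError) as exc:  # catch specific types only
        # Full traceback goes to the logs; the user sees a clean message.
        logger.error("Query failed: %s\n%s", exc, traceback.format_exc())
        return ERROR_MESSAGES.get(language, ERROR_MESSAGES["en"])
```

Falling back to the English message for an unrecognized language code keeps the function total without inventing a third language.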
## Project Structure

```
JacekAI/
├── agent/                     # Core agent logic
│   ├── a11y_agent.py          # Main agent with RAG
│   ├── prompts.py             # Language-specific prompts (PL/EN)
│   └── tools.py               # Knowledge base search tools
├── database/
│   └── vector_store_client.py # LanceDB client
├── models/
│   └── embeddings.py          # OpenAI embeddings with caching
├── lancedb/                   # Vector database (Git LFS)
│   └── a11y_expert.lance/
├── cache/                     # Embeddings cache (gitignored)
├── app.py                     # Gradio UI with lazy initialization
├── config.py                  # Pydantic settings (environment variables)
├── test_startup.py            # Deployment readiness tests
├── compact_database.py        # Database compaction utility
├── requirements.txt           # Python dependencies
├── .env.example               # Environment template
└── notes.md                   # Optional notes displayed in UI sidebar
```
## Important Implementation Notes

### When Adding New Features to the Agent

1. Modifying prompts → edit `agent/prompts.py`
2. Adding new tools → add a function to `agent/tools.py`
3. Changing RAG logic → modify `agent/a11y_agent.py`
4. Test locally with `python app.py` and interact through the UI

### When Updating Dependencies

1. Edit `requirements.txt`
2. Run `pip install -r requirements.txt`
3. Test with `python test_startup.py`
4. Commit the changes and test in HF Spaces

### When Debugging

- Set `LOG_LEVEL=DEBUG` in `.env` for verbose logging
- Agent initialization happens on the first query (check the logs for "A11yExpertAgent initialized")
- The embeddings cache is at `./cache/embeddings` (create the directory if missing)
- Vector search logs show the context retrieved from the database
## Common Pitfalls

1. **DO NOT** modify the database in production (LanceDB is read-only on HF Spaces)
2. **DO NOT** use `atexit.register()` in `app.py` (it breaks HF Spaces deployment)
3. **DO NOT** weaken language enforcement in prompts (it causes confusing mixed-language responses)
4. **DO NOT** access `os.environ` directly - always use `get_settings()`
5. **DO NOT** initialize the agent at module level - use the lazy initialization pattern
6. **DO NOT** forget to call `demo.queue()` before `demo.launch()` in Gradio
## Environment Variables

Required in the `.env` file:
- `OPENAI_API_KEY` - OpenAI API key for LLM and embeddings - **REQUIRED**

Optional (with defaults):
- `LLM_MODEL` - Language model (default: `gpt-4o-mini`)
- `LLM_BASE_URL` - API endpoint (default: GitHub Models endpoint)
- `EMBEDDING_MODEL` - Embedding model (default: `text-embedding-3-large`)
- `LANCEDB_URI` - Database path (default: `./lancedb`)
- `LANCEDB_TABLE` - Table name (default: `a11y_expert`)
- `LOG_LEVEL` - Logging verbosity (default: `INFO`)
- `SERVER_HOST` - Gradio host (default: `127.0.0.1`; use `0.0.0.0` for HF)
- `SERVER_PORT` - Gradio port (default: `7860`)

## Related Documentation

- `CLAUDE.md` - Detailed guidance for Claude Code (includes architectural details)
- `README.md` - User-facing documentation with setup instructions
- `HF_SPACES_GUIDE.md` - Hugging Face Spaces deployment guide
- `QUICK_REFERENCE.md` - Quick reference for common tasks