# AliveAI Bucket README

This document explains the current code snapshot in `hf://buckets/meet4150/ALIV_AI`, what was implemented today, and what each file is used for.
## What Was Implemented Today

1. File ingestion endpoints were added:
   - `POST /ingest/file`
   - `POST /ingest/text`
   - `GET /ingest/task/{task_id}`
   - `GET /ingest/schema`
2. An upload processing pipeline was added:
   - extract text from `.txt`/`.pdf`/`.doc`/`.docx`
   - chunk text with overlap
   - generate embeddings (`BAAI/bge-base-en-v1.5`)
   - store records in the canonical vector DB schema
3. Background ingestion was added via Celery.
4. Ingestion endpoints are configured to store into the local Chroma collection `medical_kb`.
5. RAG tuning env params were added (`chunk_size`, `chunk_overlap`, `top_k`, `top_p`, `llm_top_k`).
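The "chunk text with overlap" step above can be sketched as a simple sliding window. This is an illustrative sketch, not the pipeline's actual API: the function name is hypothetical, and the defaults mirror the `ALIVEAI_CHUNK_SIZE`/`ALIVEAI_CHUNK_OVERLAP` env params documented below.

```python
def chunk_text(text: str, chunk_size: int = 700, chunk_overlap: int = 150) -> list[str]:
    """Split text into overlapping character windows (illustrative sketch)."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    step = chunk_size - chunk_overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the final window already reached the end of the text
    return chunks
```

Each chunk shares its last `chunk_overlap` characters with the start of the next chunk, which keeps sentence fragments retrievable even when they straddle a chunk boundary.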
## Canonical Ingestion Schema

All ingested chunks are stored in this structure:

```json
{
  "id": "string",
  "content": "string",
  "metadata": {
    "disease_id": "string",
    "topic": "string",
    "source": "string",
    "document_id": "string",
    "chunk_index": 0,
    "scraped_at": "YYYY-MM-DD"
  }
}
```
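A minimal sketch of assembling one record in this schema. The helper name and the `"<document_id>_<chunk_index>"` ID convention are assumptions for illustration, not the pipeline's actual code:

```python
from datetime import date

def make_record(content: str, disease_id: str, topic: str,
                source: str, document_id: str, chunk_index: int) -> dict:
    """Build one chunk record in the canonical ingestion schema (sketch)."""
    return {
        # Hypothetical ID convention: "<document_id>_<chunk_index>"
        "id": f"{document_id}_{chunk_index}",
        "content": content,
        "metadata": {
            "disease_id": disease_id,
            "topic": topic,
            "source": source,
            "document_id": document_id,
            "chunk_index": chunk_index,
            "scraped_at": date.today().isoformat(),  # "YYYY-MM-DD"
        },
    }
```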
## Runtime Flow

### A) File upload flow

1. Client uploads a file to `POST /ingest/file`.
2. The file is saved to `data/uploads/`.
3. Text is extracted by the file-type-specific parser.
4. Text is chunked with overlap.
5. Embeddings are computed for each chunk.
6. Records are inserted into the Chroma collection `medical_kb`.
7. If async mode is on, task status can be polled via `GET /ingest/task/{task_id}`.

### B) Raw text flow

1. Client posts text to `POST /ingest/text`.
2. Text is chunked, embedded, and inserted with the same schema.
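A hedged sketch of preparing a request for the raw-text flow using only the standard library. The payload field names (`text`, `disease_id`, `topic`) are assumptions; check `GET /ingest/schema` for the authoritative request shape:

```python
import json
import urllib.request

def build_ingest_text_request(text: str, disease_id: str, topic: str,
                              base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Prepare a POST /ingest/text request (payload field names are assumed)."""
    payload = {"text": text, "disease_id": disease_id, "topic": topic}
    return urllib.request.Request(
        f"{base_url}/ingest/text",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it against a running API:
#   urllib.request.urlopen(build_ingest_text_request("...", "d1", "overview"))
```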
## How To Run

Install deps:

```bash
python3.12 -m pip install -r requirements.txt
```

Start the API:

```bash
python3.12 -m uvicorn app.main:app --reload --port 8000
```

Start a Celery worker (for async ingestion):

```bash
export ALIVEAI_CELERY_BROKER_URL=redis://localhost:6379/0
export ALIVEAI_CELERY_RESULT_BACKEND=redis://localhost:6379/0
celery -A app.celery_app.celery_app worker --loglevel=info
```
## Important Env Vars

```bash
# Ingestion + chunking
export ALIVEAI_CHUNK_SIZE=700
export ALIVEAI_CHUNK_OVERLAP=150
export ALIVEAI_INGEST_BATCH_SIZE=256

# RAG retrieval/generation
export ALIVEAI_RAG_TOP_K=5
export ALIVEAI_LLM_TOP_P=0.9
export ALIVEAI_LLM_TOP_K=40

# Backend selector (global app)
export ALIVEAI_VECTOR_BACKEND=auto  # auto | pinecone | chroma
```

Note: ingestion endpoints explicitly target Chroma for upload processing.
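The tuning knobs above can be read with typed defaults roughly as follows. This is an illustrative sketch, not the app's actual settings module; the defaults mirror the values shown in the block above:

```python
import os

def rag_settings(env=os.environ) -> dict:
    """Read RAG/ingestion tuning env vars with typed defaults (sketch)."""
    return {
        "chunk_size": int(env.get("ALIVEAI_CHUNK_SIZE", 700)),
        "chunk_overlap": int(env.get("ALIVEAI_CHUNK_OVERLAP", 150)),
        "ingest_batch_size": int(env.get("ALIVEAI_INGEST_BATCH_SIZE", 256)),
        "rag_top_k": int(env.get("ALIVEAI_RAG_TOP_K", 5)),
        "llm_top_p": float(env.get("ALIVEAI_LLM_TOP_P", 0.9)),
        "llm_top_k": int(env.get("ALIVEAI_LLM_TOP_K", 40)),
        "vector_backend": env.get("ALIVEAI_VECTOR_BACKEND", "auto"),
    }
```

Passing a plain dict as `env` makes the function easy to unit-test without mutating the real environment.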
## File Purpose Map

### Root files

- `README.md`: main project documentation and setup instructions.
- `BUCKET_README.md`: this bucket-focused developer guide.
- `requirements.txt`: Python dependencies.
- `test_rag.py`: basic checks for embedding similarity, retrieval, and NLP routing.
- `.gitignore`: ignore rules.

### `app/`

- `app/main.py`: FastAPI app entrypoint and routes (`/chat`, `/health`, `/ingest/*`).
- `app/celery_app.py`: Celery app configuration (broker/backend + task settings).

### `app/agent/`

- `app/agent/health_agent.py`: chat orchestration and response generation with Ollama/HF fallback.
- `app/agent/kb_embedding.py`: KB embedding service (`BAAI/bge-base-en-v1.5`).
- `app/agent/kb_retrieval.py`: vector retrieval functions used by the chat flow.

### `app/db/`

- `app/db/chroma_client.py`: vector DB adapter (Chroma + optional Pinecone integration) and collection access.

### `app/nlp/`

- `app/nlp/nlp_service.py`: intent + disease routing using MiniLM embeddings and heuristics.
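Embedding-based routing of this kind typically reduces to cosine similarity between a query vector and per-label candidate vectors. A dependency-free sketch (the real service uses MiniLM embeddings plus heuristics; the function names here are illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(query_vec: list[float], candidates: dict[str, list[float]]) -> str:
    """Pick the candidate label whose vector is most similar to the query."""
    return max(candidates, key=lambda k: cosine_similarity(query_vec, candidates[k]))
```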
### `app/ingestion/`

- `app/ingestion/pipeline.py`: ingestion core logic:
  - file text extraction
  - chunking
  - schema record creation
  - embedding + vector insertion

### `app/tasks/`

- `app/tasks/ingestion_tasks.py`: Celery tasks for async `ingest_file` and `ingest_text`.

### `scripts/`

- `scripts/download_dataset.py`: download the raw MedQuAD dataset and normalize fields.
- `scripts/prepare_dataset.py`: chunk + transform the raw dataset into ingestion-schema JSONL.
- `scripts/ingest.py`: batch ingestion of prepared JSONL into the vector DB.
- `scripts/download_models.py`: download local embedding models.
- `scripts/download_hf_chat_model.py`: download the local HF fallback chat model.
## Current Scope

This bucket is now clean and code-focused for developers:

- no duplicate project folders
- no local model binaries
- no local vector DB files
- no raw data dumps

If needed, data/model artifacts should be uploaded separately under dedicated paths.