# data/github/
The curated knowledge base for every GitHub project. This is the highest-signal section of the ArunCore dataset: it contains structured, reviewed documentation for each project that demonstrates real engineering capability.
---
## How It Was Built
1. All repos were cloned into `temp/repos/` locally
2. Source code, READMEs, and requirements files were analysed file by file
3. Five dataset files were generated per Tier 1 project (written from code analysis, not just READMEs)
4. Arun filled in the `decisions.md` with the *why* behind each architectural choice
5. Final files were reviewed and formatted before being committed here
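The steps above amount to a per-repo generation loop. A minimal sketch of that loop (the constants mirror the tiering in this README; the helper itself is illustrative, not the actual build script):

```python
# Tier 1 projects get the full 5-file treatment; everything else is Tier 2.
TIER_1 = {"legal_RAG_system", "real_state_listing_scraper",
          "personal_ai_agent", "result_anomaly"}

# decisions.md is the one file written by hand rather than generated.
TIER_1_FILES = ["metadata.json", "readme.md", "architecture.md",
                "code_summaries.json", "decisions.md"]
TIER_2_FILES = ["metadata.json", "readme.md"]

def files_for(project: str) -> list[str]:
    """Return the dataset files expected for a repo cloned into temp/repos/."""
    return TIER_1_FILES if project in TIER_1 else TIER_2_FILES
```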
---
## Folder Structure
One subfolder per project:
```
github/
├── legal_RAG_system/            ← Tier 1: full 5-file treatment
├── real_state_listing_scraper/  ← Tier 1: full 5-file treatment
├── personal_ai_agent/           ← Tier 1: full 5-file treatment
├── result_anomaly/              ← Tier 1: full 5-file treatment
├── Agentic_AI_Projects/         ← Tier 2: metadata + readme only
├── web_wizard/                  ← Tier 2: metadata + readme only
└── neural_arun_labs/            ← Tier 2: metadata + readme only
```
---
## File Types Per Project
| File | Tier 1 | Tier 2 | Purpose |
|---|:---:|:---:|---|
| `metadata.json` | ✅ | ✅ | Machine-readable project facts: URL, stack, status, visibility |
| `readme.md` | ✅ | ✅ | Clean problem/solution/features (no install noise) |
| `architecture.md` | ✅ | ❌ | System design: components, data flow, design patterns used |
| `code_summaries.json` | ✅ | ❌ | Per-module summaries with GitHub file URLs |
| `decisions.md` | ✅ | ❌ | Key architectural decisions + reasoning (written by Arun) |
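For reference, a hypothetical `metadata.json` shape. The table above only guarantees URL, stack, status, and visibility; the exact keys, the placeholder URL, and the values below are assumptions:

```python
import json

# Illustrative metadata.json for a Tier 1 project. Only the four facts
# named in the table (URL, stack, status, visibility) are guaranteed;
# key names and values here are assumed, not taken from the dataset.
metadata = {
    "url": "https://github.com/<owner>/legal_RAG_system",  # placeholder owner
    "tech_stack": ["Python", "ChromaDB"],
    "status": "active",
    "visibility": "public",
}
print(json.dumps(metadata, indent=2))
```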
---
## Tiering Criteria
**Tier 1 (full treatment):** Projects that demonstrate real, original engineering problem-solving. Flagship work. These are the projects that define Arun professionally.
**Tier 2 (lightweight):** Projects showing breadth (multiple domains) or active learning, but not deep enough in complexity to warrant full architecture documentation.
---
## How This Dataset Is Used
During ingestion, each file is:
1. Chunked with metadata tags (`source`, `project_name`, `type`, `tech_stack`, `status`, `visibility`)
2. Embedded using OpenAI `text-embedding-3-small`
3. Stored in the vector database
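Step 1 can be sketched as a small tagging chunker. The tag names come from the list above; the fixed-size splitting strategy and function name are assumptions (embedding with `text-embedding-3-small` and upserting into the vector store would follow, and are omitted here):

```python
TAG_KEYS = ("source", "project_name", "type", "tech_stack",
            "status", "visibility")

def chunk_with_tags(text: str, meta: dict, size: int = 800) -> list[dict]:
    """Split a dataset file into chunks, each carrying the metadata tags
    that later allow filtered retrieval. Chunk size is illustrative."""
    tags = {k: meta.get(k) for k in TAG_KEYS}
    return [{"text": text[i:i + size], **tags}
            for i in range(0, len(text), size)]
```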
When a user asks *"What RAG projects have you built?"*, the retrieval engine pulls from `legal_RAG_system/architecture.md` and `legal_RAG_system/readme.md`. When asked *"Why did you use ChromaDB?"*, it retrieves from `legal_RAG_system/decisions.md`.
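Because every chunk carries those tags, the retrieval engine can pre-filter on metadata before semantic ranking. A naive sketch of that filter step (the vector-similarity ranking itself is omitted, and the function name is an assumption):

```python
def filter_chunks(chunks: list[dict], **where) -> list[dict]:
    """Keep only chunks whose metadata tags match every given key/value,
    e.g. project_name="legal_RAG_system", type="decisions"."""
    return [c for c in chunks
            if all(c.get(k) == v for k, v in where.items())]
```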