# Gyan Model — 1.24M Knowledge Pairs
The database is the model. Training is INSERT. Cost is $0.
This is the pre-trained knowledge base for Gyan — an AI engine that uses retrieval instead of generation. No LLM. No hallucination.
## Architecture

```
Query → MiniLM encoder (22M params) → 384-dim embedding
      → Cosine similarity search over 1.24M stored embeddings
      → Bi-embedding re-rank → Convergence loop → Answer
```
Based on the *INSERT INTO Is All You Need* paper.
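The exact re-rank and convergence procedures are defined in the paper; the sketch below is one plausible reading, where the second stage scores the query against each candidate's stored answer text and the loop re-queries until the top hit stabilizes. The `top_k` value and the `question`/`answer` metadata keys are assumptions, not the released schema.

```python
import json

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = np.load("embeddings.npy").astype(np.float32)  # assumed unit-normalized
with open("metadata.json") as f:
    metadata = json.load(f)  # question/answer/source per pair (assumed keys)

def retrieve(query: str, top_k: int = 10) -> int:
    """Coarse cosine search, then a second scoring pass over candidate answers."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q                          # cosine similarity on unit vectors
    cand = np.argpartition(scores, -top_k)[-top_k:]  # top-k candidates, unordered
    # "Bi-embedding re-rank" read here as: score the query against the stored
    # *answer* text of each candidate, not only the stored question embedding.
    ans = model.encode([metadata[i]["answer"] for i in cand], normalize_embeddings=True)
    return int(cand[np.argmax(ans @ q)])

def converge(query: str, max_iters: int = 3) -> int:
    """Convergence loop read as: re-query with the hit's question until stable."""
    prev, idx = None, retrieve(query)
    for _ in range(max_iters):
        if idx == prev:
            break
        prev, idx = idx, retrieve(metadata[idx]["question"])
    return idx
```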
## Files

| File | Size | Description |
|---|---|---|
| `embeddings.npy` | 953 MB | 1,241,486 × 384 float16 embeddings |
| `metadata.json` | 569 MB | Question, answer, and source for each pair |
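The embedding file size follows directly from the shape: 1,241,486 × 384 float16 values at 2 bytes each is ≈953 MB. A quick integrity check after downloading (a sketch; only the shape and dtype above come from this card):

```python
import numpy as np

# Memory-map instead of loading the full 953 MB into RAM
emb = np.load("embeddings.npy", mmap_mode="r")
assert emb.shape == (1_241_486, 384) and emb.dtype == np.float16
print(f"{emb.nbytes / 1e6:.0f} MB")  # 1,241,486 * 384 * 2 bytes ≈ 953 MB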
## Dataset Composition
| Dataset | Pairs | Type |
|---|---|---|
| MS MARCO | 502,689 | Factual passages |
| UltraChat | 199,564 | Conversational |
| WizardLM | 98,713 | Complex instructions |
| HotPotQA | 90,436 | Multi-hop reasoning |
| NaturalQuestions | 87,925 | Factual QA |
| TriviaQA | 87,607 | General knowledge |
| SQuAD 2.0 | 86,710 | Reading comprehension |
| CodeAlpaca | 18,875 | Code |
| Dolly | 13,767 | Hand-crafted |
| OASST2 | 12,979 | Assistant paragraphs |
| SciQ | 11,599 | Science |
| CommonsenseQA | 9,740 | Reasoning |
| GSM8K | 7,473 | Math |
| + 5 more | ~10K | Various |
| Total | 1,241,486 | |
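To reproduce this breakdown from the download, a tally over the metadata works, assuming each record carries a `source` field naming its originating dataset (the field name is a guess based on the Files table, not a documented schema):

```python
import json
from collections import Counter

with open("metadata.json") as f:
    pairs = json.load(f)

# "source" is an assumed field name; inspect pairs[0] for the actual schema
for source, n in Counter(p["source"] for p in pairs).most_common():
    print(f"{source}: {n:,}")
```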
## RLHF Results

RLHF self-evolution, trained on a single RTX 4090 in 175 seconds:
| Dataset | Before (EM) | After (EM) | Time |
|---|---|---|---|
| NaturalQuestions | 29% | 92% | 0.3s |
| TriviaQA | 4% | 94.4% | 0.5s |
| HotPotQA | 6% | 100% | 0.1s |
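EM is exact match: a retrieved answer scores 1 only if it equals the reference after normalization. The snippet below uses the standard SQuAD-style normalization (lowercase, drop articles, punctuation, extra whitespace), which may differ from the exact scoring used for the numbers above:

```python
import re
import string

def normalize(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop articles, punctuation, extra spaces."""
    s = re.sub(r"\b(a|an|the)\b", " ", s.lower())
    s = s.translate(str.maketrans("", "", string.punctuation))
    return " ".join(s.split())

def exact_match(prediction: str, reference: str) -> int:
    return int(normalize(prediction) == normalize(reference))

# Toy example
print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1
```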
## Usage

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the encoder and the pre-computed knowledge-base embeddings
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = np.load("embeddings.npy").astype(np.float32)  # (1241486, 384)

# Search: on unit-normalized vectors, the dot product is the cosine similarity
query_emb = model.encode(["What is photosynthesis?"], normalize_embeddings=True)
scores = (embeddings @ query_emb.T).flatten()
best_idx = int(scores.argmax())
```
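To turn `best_idx` into an answer, look it up in the metadata; the `answer` key below is assumed from the Files table description, not a documented schema:

```python
import json

with open("metadata.json") as f:
    metadata = json.load(f)

# Field name is an assumption; inspect metadata[0] for the actual schema
print(metadata[best_idx]["answer"])
```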
## Links
- App: github.com/tejasphatak/gyan-app
- Paper: INSERT INTO Is All You Need
- Research: github.com/tejasphatak/webmind-research
- Demo: webmind.sh
## License
Code: MIT · Data: CC-BY 4.0