# Gyan Model — 1.24M Knowledge Pairs

The database is the model. Training is INSERT. Cost is $0.

This is the pre-trained knowledge base for Gyan — an AI engine that uses retrieval instead of generation. No LLM. No hallucination.
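Because the database is the model, adding knowledge really is just an INSERT. A minimal sketch of what that could look like against the two files described under Files below; the metadata field names here are assumptions, not confirmed by this card:

```python
import json
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def insert_pair(question: str, answer: str, source: str = "manual") -> None:
    """'Train' on one new fact: append one embedding row and one metadata entry."""
    emb = encoder.encode([question], normalize_embeddings=True).astype(np.float16)

    embeddings = np.load("embeddings.npy")          # (N, 384) float16
    np.save("embeddings.npy", np.concatenate([embeddings, emb], axis=0))

    with open("metadata.json") as f:
        metadata = json.load(f)
    # Keys assumed from the file description ("Question, answer, source").
    metadata.append({"question": question, "answer": answer, "source": source})
    with open("metadata.json", "w") as f:
        json.dump(metadata, f)
```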

## Architecture

```
Query → MiniLM encoder (22M params) → 384-dim embedding
      → Cosine similarity search over 1.24M stored embeddings
      → Bi-embedding re-rank → Convergence loop → Answer
```

Based on the paper *INSERT INTO Is All You Need*.
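The card does not spell out the bi-embedding re-rank or the convergence loop, but the first stage is plain top-k cosine search over unit-normalized vectors. A rough sketch, with the convergence criterion stubbed as a similarity threshold (an assumption for illustration only):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = np.load("embeddings.npy").astype(np.float32)  # (1_241_486, 384)

def retrieve(query: str, k: int = 5, threshold: float = 0.6):
    """Encode the query and return the top-k nearest stored pairs."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q                  # cosine similarity on unit vectors
    top_k = np.argsort(scores)[::-1][:k]
    # Illustrative stand-in for the convergence loop: accept only if the
    # best match clears a threshold (the real criterion is not documented).
    if scores[top_k[0]] < threshold:
        return None
    return [(int(i), float(scores[i])) for i in top_k]
```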

## Files

| File | Size | Description |
|------|------|-------------|
| `embeddings.npy` | 953 MB | 1,241,486 × 384 float16 embeddings |
| `metadata.json` | 569 MB | Question, answer, and source for each pair |
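Since `embeddings.npy` is 953 MB, it does not have to be read into RAM eagerly; `np.load` can memory-map it, and the OS pages data in on demand:

```python
import numpy as np

# mmap_mode="r" maps the file read-only instead of loading 953 MB up front.
embeddings = np.load("embeddings.npy", mmap_mode="r")
print(embeddings.shape, embeddings.dtype)  # (1241486, 384) float16
```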

## Dataset Composition

| Dataset | Pairs | Type |
|---------|-------|------|
| MS MARCO | 502,689 | Factual passages |
| UltraChat | 199,564 | Conversational |
| WizardLM | 98,713 | Complex instructions |
| HotPotQA | 90,436 | Multi-hop reasoning |
| NaturalQuestions | 87,925 | Factual QA |
| TriviaQA | 87,607 | General knowledge |
| SQuAD 2.0 | 86,710 | Reading comprehension |
| CodeAlpaca | 18,875 | Code |
| Dolly | 13,767 | Hand-crafted |
| OASST2 | 12,979 | Assistant paragraphs |
| SciQ | 11,599 | Science |
| CommonsenseQA | 9,740 | Reasoning |
| GSM8K | 7,473 | Math |
| + 5 more | ~10K | Various |
| **Total** | **1,241,486** | |

## RLHF Results

RLHF self-evolution ran on a single RTX 4090 in 175 seconds:

| Dataset | Before (EM) | After (EM) | Time |
|---------|-------------|------------|------|
| NaturalQuestions | 29% | 92% | 0.3 s |
| TriviaQA | 4% | 94.4% | 0.5 s |
| HotPotQA | 6% | 100% | 0.1 s |
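EM is exact match. For reference, a sketch of the standard SQuAD-style normalized exact-match score; the harness behind the numbers above may differ in detail:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the prediction matches any reference answer after normalization."""
    return any(normalize(prediction) == normalize(ref) for ref in references)
```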

## Usage

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Same encoder used to build the knowledge base (22M params, 384-dim output)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# (1,241,486 × 384) float16 matrix, upcast to float32 for the dot product
embeddings = np.load("embeddings.npy").astype(np.float32)

# Search: with unit-normalized vectors, cosine similarity is a dot product
query_emb = model.encode(["What is photosynthesis?"], normalize_embeddings=True)
scores = (embeddings @ query_emb.T).flatten()
best_idx = int(scores.argmax())
```
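To turn `best_idx` into an answer, look the row up in `metadata.json`. The keys below are guesses based on the file description ("Question, answer, source for each pair"); check the actual file if they differ:

```python
import json

with open("metadata.json") as f:
    metadata = json.load(f)  # one entry per embedding row

pair = metadata[best_idx]
# Assumed keys, not confirmed by this card.
print(pair["question"], "->", pair["answer"], f"({pair['source']})")
```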

## License

Code: MIT · Data: CC-BY 4.0
