# Gyan Model — 1.24M Knowledge Pairs

The database is the model. Training is INSERT. Cost is $0.

This is the pre-trained knowledge base for Gyan — an AI engine that uses retrieval instead of generation. No LLM. No hallucination.
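Because the database is the model, adding knowledge really is just an INSERT. A minimal sketch of what that could look like against the two files described under Files below; the metadata field names here are assumptions, not confirmed by this card:

```python
import json
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def insert_pair(question: str, answer: str, source: str = "manual") -> None:
    """'Train' on one new fact: append one embedding row and one metadata entry."""
    emb = encoder.encode([question], normalize_embeddings=True).astype(np.float16)

    embeddings = np.load("embeddings.npy")          # (N, 384) float16
    np.save("embeddings.npy", np.concatenate([embeddings, emb], axis=0))

    with open("metadata.json") as f:
        metadata = json.load(f)
    # Keys assumed from the file description ("Question, answer, source").
    metadata.append({"question": question, "answer": answer, "source": source})
    with open("metadata.json", "w") as f:
        json.dump(metadata, f)
```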

## Architecture

```
Query → MiniLM encoder (22M params) → 384-dim embedding
      → Cosine similarity search over 1.24M stored embeddings
      → Bi-embedding re-rank → Convergence loop → Answer
```

Based on the paper *INSERT INTO Is All You Need*.
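The card does not spell out the bi-embedding re-rank or the convergence loop, but the first stage is plain top-k cosine search over unit-normalized vectors. A rough sketch, with the convergence criterion stubbed as a similarity threshold (an assumption for illustration only):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = np.load("embeddings.npy").astype(np.float32)  # (1_241_486, 384)

def retrieve(query: str, k: int = 5, threshold: float = 0.6):
    """Encode the query and return the top-k nearest stored pairs."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = embeddings @ q                  # cosine similarity on unit vectors
    top_k = np.argsort(scores)[::-1][:k]
    # Illustrative stand-in for the convergence loop: accept only if the
    # best match clears a threshold (the real criterion is not documented).
    if scores[top_k[0]] < threshold:
        return None
    return [(int(i), float(scores[i])) for i in top_k]
```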

## Files

| File | Size | Description |
|------|------|-------------|
| `embeddings.npy` | 953 MB | 1,241,486 × 384 float16 embeddings |
| `metadata.json` | 569 MB | Question, answer, and source for each pair |
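Since `embeddings.npy` is 953 MB, it does not have to be read into RAM eagerly; `np.load` can memory-map it, and the OS pages data in on demand:

```python
import numpy as np

# mmap_mode="r" maps the file read-only instead of loading 953 MB up front.
embeddings = np.load("embeddings.npy", mmap_mode="r")
print(embeddings.shape, embeddings.dtype)  # (1241486, 384) float16
```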

## Dataset Composition

| Dataset | Pairs | Type |
|---------|-------|------|
| MS MARCO | 502,689 | Factual passages |
| UltraChat | 199,564 | Conversational |
| WizardLM | 98,713 | Complex instructions |
| HotPotQA | 90,436 | Multi-hop reasoning |
| NaturalQuestions | 87,925 | Factual QA |
| TriviaQA | 87,607 | General knowledge |
| SQuAD 2.0 | 86,710 | Reading comprehension |
| CodeAlpaca | 18,875 | Code |
| Dolly | 13,767 | Hand-crafted |
| OASST2 | 12,979 | Assistant paragraphs |
| SciQ | 11,599 | Science |
| CommonsenseQA | 9,740 | Reasoning |
| GSM8K | 7,473 | Math |
| + 5 more | ~10K | Various |
| **Total** | **1,241,486** | |

## RLHF Results

RLHF self-evolution ran on a single RTX 4090 in 175 seconds:

| Dataset | Before (EM) | After (EM) | Time |
|---------|-------------|------------|------|
| NaturalQuestions | 29% | 92% | 0.3 s |
| TriviaQA | 4% | 94.4% | 0.5 s |
| HotPotQA | 6% | 100% | 0.1 s |
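EM is exact match. For reference, a sketch of the standard SQuAD-style normalized exact-match score; the harness behind the numbers above may differ in detail:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the prediction matches any reference answer after normalization."""
    return any(normalize(prediction) == normalize(ref) for ref in references)
```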

## Usage

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Same encoder used to build the knowledge base (22M params, 384-dim output)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# (1,241,486 × 384) float16 matrix, upcast to float32 for the dot product
embeddings = np.load("embeddings.npy").astype(np.float32)

# Search: with unit-normalized vectors, cosine similarity is a dot product
query_emb = model.encode(["What is photosynthesis?"], normalize_embeddings=True)
scores = (embeddings @ query_emb.T).flatten()
best_idx = int(scores.argmax())
```
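To turn `best_idx` into an answer, look the row up in `metadata.json`. The keys below are guesses based on the file description ("Question, answer, source for each pair"); check the actual file if they differ:

```python
import json

with open("metadata.json") as f:
    metadata = json.load(f)  # one entry per embedding row

pair = metadata[best_idx]
# Assumed keys, not confirmed by this card.
print(pair["question"], "->", pair["answer"], f"({pair['source']})")
```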

## License

Code: MIT · Data: CC-BY 4.0
