GK-Model-150M by AuraWorx
General Knowledge Q&A Model (150 Million Parameters)
This model was trained from scratch by AuraWorx as a demonstration of building a small yet capable general-knowledge question-answering model using Google Colab and modern training techniques.
Model Details
- Parameters: 150 Million
- Architecture: Decoder-only Transformer (GPT-like)
- Training Data:
  - Pretraining: English Wikipedia (20231101 dump) and a subset of OpenWebText (~5 billion tokens total).
  - Finetuning (SFT): Curated Q&A datasets including SQuAD v2, TriviaQA, Natural Questions (simplified), and Dolly-15k (~800K Q&A pairs).
- Tokenizer: Custom-trained BPE tokenizer (32,000 vocabulary size).
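As a toy illustration of how a BPE tokenizer builds its vocabulary (this is not the model's actual training code, only a sketch of the merge step BPE repeats until it reaches the target vocabulary size):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus.

    `words` maps a tuple of symbols (initially characters) to its corpus frequency.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: character-level words with counts
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o", "t"): 1}
pair = most_frequent_pair(words)  # ("l", "o") is the most frequent pair (8 occurrences)
words = merge_pair(words, pair)
print(pair, words)
```

A real 32,000-entry vocabulary is produced by running this merge loop ~32,000 times over a large text sample, typically with standard tokenizer-training tooling rather than hand-rolled code.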
Usage
This model is designed for general knowledge question-answering. You can load it using the Hugging Face transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "your-username/GK-Model-150M"  # Replace with your actual repo name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def ask_question(question, model, tokenizer, max_new_tokens=256):
    # Format the prompt the same way the SFT data was formatted
    input_text = f"Question: {question}\nAnswer: "
    inputs = tokenizer(input_text, return_tensors="pt", return_attention_mask=True)

    output_sequences = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        temperature=0.7,
        top_k=50,
        top_p=0.9,
        repetition_penalty=1.15,
        do_sample=True,
    )

    generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
    # Keep only the text after the "Answer:" marker
    answer_start = generated_text.find("Answer:") + len("Answer:")
    return generated_text[answer_start:].strip()

question = "What is the capital of France?"
print(f"Q: {question}")
print(f"A: {ask_question(question, model, tokenizer)}")

question = "Who developed the theory of relativity?"
print(f"Q: {question}")
print(f"A: {ask_question(question, model, tokenizer)}")
```
Training Process
This model was trained in a multi-stage process:
- Setup & Dependencies: Environment setup and installation of necessary libraries.
- Data Download & Preprocessing: Acquisition and preparation of raw text data for pretraining and Q&A pairs for SFT.
- BPE Tokenizer Training: A custom Byte-Pair Encoding tokenizer was trained on a representative sample of the data.
- Pretraining: The base model was pretrained using next-token prediction on ~5 billion tokens from Wikipedia and OpenWebText.
- Supervised Finetuning (SFT): The pretrained model was finetuned on ~800,000 Q&A pairs to specialize in question-answering.
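Both the pretraining and SFT stages above train on next-token prediction: each sequence is shifted by one position so the model learns to predict token i+1 from tokens up to i, and SFT examples are rendered in the same prompt template the Usage section queries with. A minimal sketch (the helper names here are illustrative, not the actual pipeline code):

```python
def shift_for_next_token(token_ids):
    """Build (inputs, targets) for next-token prediction:
    the model sees token i and is trained to predict token i+1."""
    return token_ids[:-1], token_ids[1:]

def format_sft_example(question, answer):
    """Render a Q&A pair in the template used at inference time."""
    return f"Question: {question}\nAnswer: {answer}"

ids = [12, 57, 9, 301, 4]
inputs, targets = shift_for_next_token(ids)
print(inputs)   # [12, 57, 9, 301]
print(targets)  # [57, 9, 301, 4]
print(format_sft_example("What is the capital of France?", "Paris"))
```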
Acknowledgements
Developed by AuraWorx using resources from Google Colab and datasets from Hugging Face.
AuraWorx - Innovating AI Solutions