GK-Model-150M by AuraWorx
General Knowledge Q&A Model (150 Million Parameters)
This model was trained from scratch by AuraWorx as a demonstration of building a small yet capable general-knowledge question-answering model using Google Colab and modern training techniques.
Model Details
- Parameters: 150 Million
- Architecture: Decoder-only Transformer (GPT-like)
- Training Data:
  - Pretraining: English Wikipedia (20231101 dump) and a subset of OpenWebText (~5 billion tokens total).
  - Finetuning (SFT): Curated Q&A datasets including SQuAD v2, TriviaQA, Natural Questions (simplified), and Dolly-15k (~800K Q&A pairs).
- Tokenizer: Custom-trained BPE tokenizer (32,000 vocabulary size).
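As a toy illustration of how a BPE tokenizer builds its vocabulary (this is not the model's actual training code, only a sketch of the merge step BPE repeats until it reaches the target vocabulary size):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus.

    `words` maps a tuple of symbols (initially characters) to its corpus frequency.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: character-level words with counts
words = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o", "t"): 1}
pair = most_frequent_pair(words)  # ("l", "o") is the most frequent pair (8 occurrences)
words = merge_pair(words, pair)
print(pair, words)
```

A real 32,000-entry vocabulary is produced by running this merge loop ~32,000 times over a large text sample, typically with standard tokenizer-training tooling rather than hand-rolled code.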
Usage
This model is designed for general knowledge question-answering. You can load it using the Hugging Face transformers library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "your-username/GK-Model-150M"  # Replace with your actual repo name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def ask_question(question, model, tokenizer, max_new_tokens=256):
    # Format the prompt the same way the SFT data was formatted
    input_text = f"Question: {question}\nAnswer: "
    inputs = tokenizer(input_text, return_tensors="pt", return_attention_mask=True)

    output_sequences = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        temperature=0.7,
        top_k=50,
        top_p=0.9,
        repetition_penalty=1.15,
        do_sample=True,
    )

    generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
    # Keep only the text after the "Answer:" marker
    answer_start = generated_text.find("Answer:") + len("Answer:")
    return generated_text[answer_start:].strip()

question = "What is the capital of France?"
print(f"Q: {question}")
print(f"A: {ask_question(question, model, tokenizer)}")

question = "Who developed the theory of relativity?"
print(f"Q: {question}")
print(f"A: {ask_question(question, model, tokenizer)}")
```
Training Process
This model was trained in a multi-stage process:
- Setup & Dependencies: Environment setup and installation of necessary libraries.
- Data Download & Preprocessing: Acquisition and preparation of raw text data for pretraining and Q&A pairs for SFT.
- BPE Tokenizer Training: A custom Byte-Pair Encoding tokenizer was trained on a representative sample of the data.
- Pretraining: The base model was pretrained using next-token prediction on ~5 billion tokens from Wikipedia and OpenWebText.
- Supervised Finetuning (SFT): The pretrained model was finetuned on ~800,000 Q&A pairs to specialize in question-answering.
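Both the pretraining and SFT stages above train on next-token prediction: each sequence is shifted by one position so the model learns to predict token i+1 from tokens up to i, and SFT examples are rendered in the same prompt template the Usage section queries with. A minimal sketch (the helper names here are illustrative, not the actual pipeline code):

```python
def shift_for_next_token(token_ids):
    """Build (inputs, targets) for next-token prediction:
    the model sees token i and is trained to predict token i+1."""
    return token_ids[:-1], token_ids[1:]

def format_sft_example(question, answer):
    """Render a Q&A pair in the template used at inference time."""
    return f"Question: {question}\nAnswer: {answer}"

ids = [12, 57, 9, 301, 4]
inputs, targets = shift_for_next_token(ids)
print(inputs)   # [12, 57, 9, 301]
print(targets)  # [57, 9, 301, 4]
print(format_sft_example("What is the capital of France?", "Paris"))
```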
Acknowledgements
Developed by AuraWorx using resources from Google Colab and datasets from Hugging Face.
AuraWorx - Innovating AI Solutions