---
license: mit
datasets:
- GODELEV/BetterDataset-2M
language:
- en
pipeline_tag: text-generation
---

# Archaea-74M

Archaea-74M is a decoder-only causal language model with approximately 74 million parameters, pretrained from scratch on BetterDataset-2M. It uses a LLaMA-style architecture with Grouped Query Attention (GQA) and was trained in BF16 mixed precision.

This release is an intermediate checkpoint: it has been trained on approximately **1.23 billion tokens** of a planned **1.6 billion token** pretraining run, so it covers roughly three quarters of the intended training schedule.

---

# Model Card

| Attribute | Value |
|------------|------------|
| Model ID | GODELEV/Archaea-74M |
| Parameters | ~74 Million |
| Architecture | Decoder-only Transformer (LLaMA-style) |
| Attention | Grouped Query Attention (GQA) |
| Context Length | 1024 |
| Tokenizer | GPT-2 |
| Training Precision | BF16 |
| Framework | PyTorch + Transformers |
| License | MIT |

---

# Architecture

## Transformer Configuration

| Parameter | Value |
|------------|------------|
| Hidden Size | 512 |
| Intermediate Size | 1408 |
| Layers | 8 |
| Attention Heads | 8 |
| KV Heads | 2 |
| GQA Ratio | 4:1 |
| Activation | SiLU |
| Normalization | RMSNorm |
| Context Length | 1024 |
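
As a rough cross-check of the ~74M figure, the parameter count can be reconstructed from the configuration above. This is only a sketch under assumptions the card does not state: a GPT-2 vocabulary of 50,257 entries, untied input and output embeddings, no bias terms, and a standard LLaMA-style block (three MLP projections, two RMSNorm layers per block, one final norm). The function name `count_parameters` is illustrative.

```python
# Back-of-envelope parameter count for the configuration in the table above.
# Assumptions (not stated in the card): GPT-2 vocab of 50,257 tokens,
# untied input/output embeddings, no bias terms, LLaMA-style gated MLP.

VOCAB_SIZE = 50_257  # assumed from the GPT-2 tokenizer
HIDDEN = 512
INTERMEDIATE = 1408
LAYERS = 8
HEADS = 8
KV_HEADS = 2
HEAD_DIM = HIDDEN // HEADS  # 64


def count_parameters(tie_embeddings: bool = False) -> int:
    """Return the total weight count under the assumptions listed above."""
    embed = VOCAB_SIZE * HIDDEN
    # Queries use all heads; keys and values use only the KV heads (GQA).
    q_proj = HIDDEN * HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = HEADS * HEAD_DIM * HIDDEN
    attention = q_proj + k_proj + v_proj + o_proj
    # Gate, up, and down projections of the gated MLP.
    mlp = 3 * HIDDEN * INTERMEDIATE
    # Two RMSNorm weight vectors per block.
    norms = 2 * HIDDEN
    per_layer = attention + mlp + norms
    final_norm = HIDDEN
    lm_head = 0 if tie_embeddings else VOCAB_SIZE * HIDDEN
    return embed + LAYERS * per_layer + final_norm + lm_head


if __name__ == "__main__":
    print(f"untied embeddings: {count_parameters(False):,}")
    print(f"tied embeddings:   {count_parameters(True):,}")
```

Under these assumptions the untied count comes to about 74.0M, which matches the stated size, while tied embeddings would give roughly 48.3M.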

The model uses Grouped Query Attention: its 8 query heads share 2 key/value heads, so the KV cache is one quarter the size it would be with standard multi-head attention at the same head count, at a modest cost in attention capacity.
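
To make the memory saving concrete, the sketch below estimates the KV-cache size for a single full-length sequence. It assumes 2-byte (BF16 or FP16) cache entries, which the card does not specify, and `kv_cache_bytes` is an illustrative helper rather than part of any library.

```python
# Rough KV-cache size for one sequence, comparing GQA with full multi-head attention.
# Assumption: cache entries are stored in a 2-byte dtype (BF16 or FP16).

LAYERS = 8
HEADS = 8
KV_HEADS = 2
HEAD_DIM = 512 // HEADS  # 64
CONTEXT = 1024
BYTES_PER_VALUE = 2  # assumed cache dtype width


def kv_cache_bytes(kv_heads: int, seq_len: int = CONTEXT) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * LAYERS * kv_heads * HEAD_DIM * seq_len * BYTES_PER_VALUE


gqa_bytes = kv_cache_bytes(KV_HEADS)  # 2 KV heads, as in this model
mha_bytes = kv_cache_bytes(HEADS)     # hypothetical 8 KV heads

if __name__ == "__main__":
    print(f"GQA: {gqa_bytes / 2**20:.0f} MiB per sequence")
    print(f"MHA: {mha_bytes / 2**20:.0f} MiB per sequence")
```

At the full 1024-token context this works out to 4 MiB per sequence with 2 KV heads versus 16 MiB with 8, the 4:1 ratio listed in the table.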

---

# Training

## Dataset

Archaea-74M was pretrained on **GODELEV/BetterDataset-2M**, a multi-source corpus composed of:

- General web text
- Conversational content
- Knowledge-focused material
- Educational content
- Instruction-like examples
- Technical and programming text

The complete corpus contains approximately **1.6 billion tokens**.

### Training Progress

| Metric | Value |
|----------|----------|
| Planned Tokens | ~1.6B |
| Tokens Trained | ~1.23B |
| Completion | ~75% |
| Planned Steps | 25,000 |
| Completed Steps | 18,800 |
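
The token totals can be reproduced from the step counts. The sketch assumes that the effective batch size of 64 listed under Optimization counts sequences of 1024 tokens; the card does not define the unit, so treat that as an assumption.

```python
# Reconstruct token counts from step counts.
# Assumption: "Effective Batch Size 64" means 64 sequences of 1024 tokens per step.

SEQ_LEN = 1024
BATCH_SEQUENCES = 64  # assumed unit: sequences per optimizer step
PLANNED_STEPS = 25_000
COMPLETED_STEPS = 18_800

tokens_per_step = SEQ_LEN * BATCH_SEQUENCES
tokens_trained = COMPLETED_STEPS * tokens_per_step
tokens_planned = PLANNED_STEPS * tokens_per_step
completion = COMPLETED_STEPS / PLANNED_STEPS

if __name__ == "__main__":
    print(f"tokens per step: {tokens_per_step:,}")
    print(f"tokens trained:  {tokens_trained / 1e9:.2f}B")
    print(f"tokens planned:  {tokens_planned / 1e9:.2f}B")
    print(f"completion:      {completion:.1%}")
```

Under that assumption the checkpoint has seen about 1.23B of about 1.64B planned tokens. Computed from step counts, the completion fraction is close to 75%; figures derived from rounded token totals can differ by a point or two.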

## Optimization

| Parameter | Value |
|------------|------------|
| Optimizer | AdamW |
| Scheduler | OneCycleLR |
| Peak Learning Rate | 6e-4 |
| Weight Decay | 0.1 |
| Gradient Clipping | 1.0 |
| Sequence Length | 1024 |
| Effective Batch Size | 64 |
| Precision | BF16 |

## Training Statistics

| Metric | Value |
|------------|------------|
| Initial Loss | 10.9223 |
| Final Loss | 2.9488 |
| Best Loss | 2.8071 |
| Final Perplexity | 19.08 |
| Best Perplexity | 16.56 |
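
The perplexity values follow from the losses if the loss is the mean per-token cross-entropy in nats, in which case perplexity is simply `exp(loss)`. The short check below also compares the initial loss with the loss of a uniform guess over an assumed 50,257-token GPT-2 vocabulary.

```python
import math

# Losses from the table above, assumed to be mean per-token cross-entropy in nats.
losses = {"initial": 10.9223, "final": 2.9488, "best": 2.8071}

# Perplexity is the exponential of the cross-entropy.
perplexity = {name: math.exp(loss) for name, loss in losses.items()}

# Loss of a uniform prediction over the vocabulary (vocab size is an assumption).
uniform_loss = math.log(50_257)

if __name__ == "__main__":
    for name, value in perplexity.items():
        print(f"{name:>7} perplexity: {value:,.2f}")
    print(f"uniform-guess loss: {uniform_loss:.4f}")
```

The final and best losses reproduce the reported perplexities of 19.08 and 16.56, and the initial loss sits just above the uniform-guess value of about 10.82, as expected for a randomly initialized model.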

## Training Loss Curve

<img src="Archaea74M_Training_Loss_Curve.png" width="700"/>

## Learning Rate Schedule

<img src="Archaea74M_Learning_Rate_Schedule.png" width="700"/>
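
For readers without the image, the shape of a one-cycle schedule can be sketched in plain Python. Only the scheduler type and the 6e-4 peak come from this card. The warm-up fraction (0.3), the divisors (25 and 1e4), and cosine annealing are the documented defaults of PyTorch's `OneCycleLR`, and the sketch assumes the schedule spans all 25,000 planned steps; the actual run may have used different values.

```python
import math

# Pure-Python sketch of a one-cycle schedule with cosine annealing.
# Assumed values: PyTorch OneCycleLR defaults and a 25,000-step schedule.
TOTAL_STEPS = 25_000
MAX_LR = 6e-4            # peak learning rate from the card
PCT_START = 0.3          # assumed warm-up fraction
DIV_FACTOR = 25.0        # assumed: initial_lr = MAX_LR / DIV_FACTOR
FINAL_DIV_FACTOR = 1e4   # assumed: min_lr = initial_lr / FINAL_DIV_FACTOR


def _cosine(start: float, end: float, pct: float) -> float:
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)


def one_cycle_lr(step: int) -> float:
    """Learning rate at a zero-based optimizer step."""
    initial_lr = MAX_LR / DIV_FACTOR
    min_lr = initial_lr / FINAL_DIV_FACTOR
    warmup_end = PCT_START * TOTAL_STEPS - 1
    last_step = TOTAL_STEPS - 1
    if step <= warmup_end:
        # Phase 1: ramp up from initial_lr to the peak.
        return _cosine(initial_lr, MAX_LR, step / warmup_end)
    # Phase 2: anneal from the peak down to min_lr.
    return _cosine(MAX_LR, min_lr, (step - warmup_end) / (last_step - warmup_end))


if __name__ == "__main__":
    for step in (0, 7_499, 18_800, 24_999):
        print(f"step {step:>6}: lr = {one_cycle_lr(step):.3e}")
```

With these assumed settings the rate starts at 2.4e-5, peaks at 6e-4 after 30% of the steps, and is still well above its floor at step 18,800, where this checkpoint was taken.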

---

# Evaluation

All benchmarks were run with the EleutherAI LM Evaluation Harness.

## Benchmark Results

All scores are zero-shot.

| Benchmark | Metric | Score |
|------------|------------|------------|
| HellaSwag | acc_norm | 27.31% |
| PIQA | acc_norm | 58.54% |
| WinoGrande | acc | 51.54% |
| BoolQ | acc | 56.33% |
| ARC-Easy | acc_norm | 39.06% |
| ARC-Challenge | acc_norm | 22.70% |
| OpenBookQA | acc_norm | 26.00% |
| CommonsenseQA | acc | 19.66% |
| LAMBADA | acc | 18.01% |
| BLiMP | acc | 74.91% |
| MMLU | acc | 25.07% |
| SciQ | acc_norm | 57.70% |
| COPA | acc | 61.00% |
| RACE | acc | 24.78% |
| SWAG | acc_norm | 41.98% |
| TruthfulQA MC2 | acc | 46.46% |
| WikiText-2 | Word Perplexity | 68.06 |
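
Multiple-choice accuracies are easier to read against a random-guess baseline. The sketch below subtracts a uniform-guess baseline from a subset of the scores above, using each benchmark's nominal number of answer options. The helper name `margin_over_chance` is illustrative, and uniform guessing is only a rough reference point.

```python
# Compare zero-shot scores (percent) with a uniform random-guess baseline.
scores = {
    "HellaSwag": 27.31,
    "PIQA": 58.54,
    "WinoGrande": 51.54,
    "BoolQ": 56.33,
    "ARC-Challenge": 22.70,
    "OpenBookQA": 26.00,
    "CommonsenseQA": 19.66,
    "MMLU": 25.07,
    "COPA": 61.00,
}

# Nominal number of answer options per question for each benchmark.
num_choices = {
    "HellaSwag": 4,
    "PIQA": 2,
    "WinoGrande": 2,
    "BoolQ": 2,
    "ARC-Challenge": 4,
    "OpenBookQA": 4,
    "CommonsenseQA": 5,
    "MMLU": 4,
    "COPA": 2,
}


def margin_over_chance(task: str) -> float:
    """Score minus the uniform-guess accuracy, in percentage points."""
    return scores[task] - 100.0 / num_choices[task]


if __name__ == "__main__":
    for task in sorted(scores, key=margin_over_chance, reverse=True):
        print(f"{task:<14} {margin_over_chance(task):+6.2f} pts vs. chance")
```

Read this way, COPA and PIQA sit above chance, while ARC-Challenge, CommonsenseQA, and MMLU are at or near it. With normalized accuracy or unbalanced label distributions (BoolQ, for example) the effective baseline can differ from the uniform one.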

## Evaluation Notes

The following tasks failed in earlier evaluation runs and are not included in the results:

- SocialIQA: dataset loading incompatibility.
- Arithmetic benchmark: task identifiers changed in recent lm-eval versions.

---

# Sample Generations

The following examples were generated directly from the pretrained base model.

## Example 1

**Input**

```text
Once upon a time
```

**Output**

```text
Once upon a time

In the days after, I was told that I could not make a living. So I said to myself,
“I want to go back to my family.” I said, “This is a good one. You can take me to the doctor
or something and do it with you.”

The doctor asked me how I felt about this. He said he did. But I said,
“You can't have a baby.”
```

## Example 2

**Input**

```text
The horizon looked
```

**Output**

```text
the horizon looked like a big black hole. The sun was shining and the wind was blowing.

The sun was shining with the sun, but it was still there. The stars were on top of each other.

The sun was falling as a result of the weather. The sun was going up.
```

## Example 3

**Input**

```text
what is the future of ai
```

**Output**

```text
what is the future of Ai?

I don't know what you're talking about, but I'm sure that's a good reason for you to be optimistic.

It doesn't matter if it doesn't take away your wandering in a positive direction.

The answer is simple: I'll leave you with a feeling of uncertainty and optimism.
```

These outputs show that the model can continue prompts fluently, and they also show typical limitations of small pretrained language models: repetition, topic drift, and inconsistent factual reasoning.

---

# Usage

## Installation

```bash
pip install torch transformers accelerate
```

## Loading the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GODELEV/Archaea-74M"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Use half precision on GPU and full precision on CPU.
# device_map="auto" relies on the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)
```

## Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "GODELEV/Archaea-74M"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "The future of artificial intelligence"

# Tokenize the prompt and move the tensors to the model's device.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a continuation without tracking gradients.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.8,
        do_sample=True,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

---

# Repository Structure

```text
Archaea-74M/
├── config.json
├── generation_config.json
├── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
├── Archaea74M_Training_Loss_Curve.png
├── Archaea74M_Learning_Rate_Schedule.png
└── README.md
```

---

# Limitations

Archaea-74M is a base pretrained model and has not undergone:

- Instruction tuning
- RLHF
- Preference optimization
- Safety alignment

Known limitations:

- Hallucinations and factual inaccuracies
- Limited reasoning due to model scale
- Sensitivity to prompt phrasing
- Fixed 1024-token context window
- Not suitable for high-stakes applications
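
Because of the fixed 1024-token window, the prompt plus the requested new tokens has to fit within 1024 positions. One simple way to guarantee this is to keep only the most recent prompt tokens. The helper below, `fit_prompt`, is a hypothetical utility and not part of this repository or of Transformers; left-truncation is just one possible policy.

```python
from typing import List

CONTEXT_LENGTH = 1024  # context window from the model card


def fit_prompt(
    token_ids: List[int],
    max_new_tokens: int,
    context_length: int = CONTEXT_LENGTH,
) -> List[int]:
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    # Number of positions left for the prompt after reserving room for new tokens.
    budget = context_length - max_new_tokens
    # Slicing from the end drops the oldest tokens and leaves short prompts untouched.
    return token_ids[-budget:]


if __name__ == "__main__":
    long_prompt = list(range(2000))  # stand-in for real token ids
    trimmed = fit_prompt(long_prompt, max_new_tokens=200)
    print(len(trimmed), trimmed[0], trimmed[-1])
```

The trimmed ids can then be wrapped in a tensor and passed to `model.generate` with the same `max_new_tokens` value. Dropping the oldest tokens preserves the text closest to the continuation point, which is usually what a base model needs most.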

---

# Future Work

- Instruction tuning
- Expanded benchmark coverage
- Longer context lengths
- Improved data quality and curriculum design

---

# Citation

```bibtex
@misc{archaea74m,
  title={Archaea-74M},
  author={Akshit Kumar},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/GODELEV/Archaea-74M}
}
```