---
language:
- en
- de
- es
- fr
- pt
- it
- ru
license: other
license_name: all-rights-reserved
license_link: LICENSE
tags:
- cocoai
- base-model
- 183M
- llama
- multilingual
- wikipedia-trained
model_name: CoALa-1
model_type: llama
datasets:
- wikimedia/wikipedia
metrics:
- arc_easy
- hellaswag
model-index:
- name: CoALa-1
  results:
  - task:
      type: text-generation
      name: Knowledge & Logic Evaluation
    dataset:
      name: ARC-Easy
      type: ai2_arc
    metrics:
    - name: Accuracy (Norm)
      type: acc_norm
      value: 28.87
  - task:
      type: text-generation
      name: Common Sense Reasoning
    dataset:
      name: HellaSwag
      type: hellaswag
    metrics:
    - name: Accuracy (Norm)
      type: acc_norm
      value: 26.96
---
# CoALa-1 (183M Multilingual Llama-Base)
CoALa-1 is an efficient, multilingual base model with 183 million parameters. Built on a modern Llama-based architecture, it is designed to deliver strong performance at a compact size, placing it among the better-performing models in the sub-200M parameter class.
## Key Highlights
- Architecture: Llama-based (utilizing RoPE, RMSNorm, and SiLU) for superior stability and reasoning compared to older GPT-2 structures.
- Competitive Performance: In its weight class (<200M parameters), CoALa-1 outperforms Meta's OPT-125M on both reported benchmarks and competes directly with OpenAI's GPT-2 Small.
- Multilingual Power: Trained from scratch on high-quality Wikipedia data in 7 languages (English, German, Spanish, French, Portuguese, Italian, Russian).
- Custom Tokenizer: Features a 64,000 vocab Byte-level BPE tokenizer, optimized for multilingual efficiency.
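The RMSNorm and SiLU components mentioned above can be sketched as follows. This is a minimal NumPy illustration of the two operations, not the model's actual implementation:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm rescales by the root-mean-square of the activations.
    # Unlike LayerNorm, it does not subtract the mean (no re-centering).
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x), the activation used in Llama MLPs.
    return x * (1.0 / (1.0 + np.exp(-x)))

x = np.random.randn(4, 768)   # a batch of hidden states (hidden size 768)
y = rms_norm(x, np.ones(768))  # normalized: mean of y**2 per row is ~1
```

After RMSNorm with unit weights, the mean squared activation along the last axis is approximately 1, which is what keeps the residual stream numerically stable across layers.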
## ⚠️ Important Note: Base Model vs. Instruct Model
CoALa-1 is a Base Model (Pretrained). It has been trained to predict the next token on a massive Wikipedia corpus but has not yet undergone Instruction Fine-Tuning (SFT) or RLHF.
What this means for users:
- The model will not answer questions like a chatbot (e.g., "How are you?").
- Instead, it will continue a given text in a neutral, encyclopedic style.
## Evaluation Results
CoALa-1 was evaluated using the lm-evaluation-harness. It shows strong factual-knowledge performance relative to other models in its weight class.
| Benchmark | Metric | CoALa-1 (183M) | GPT-2 (124M) | OPT-125M |
|---|---|---|---|---|
| ARC-Easy | acc_norm | 28.87% | 27.00% | 24.50% |
| HellaSwag | acc_norm | 26.96% | 28.50% | 26.00% |
*Figure 1: Comparison of ARC-Easy (knowledge) and HellaSwag (reasoning) scores. CoALa-1 leads on ARC-Easy among the sub-200M models compared.*
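The scores above can be reproduced with a command along these lines (assuming an installed lm-evaluation-harness; exact flags may vary between versions):

```shell
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=CocoEntertainment/CoALa-1-Pretuned \
  --tasks arc_easy,hellaswag \
  --batch_size 8
```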
## Technical Specifications
- Hidden Size: 768
- Intermediate Size: 2048
- Layers: 12
- Attention Heads: 12
- Context Length: 2048 tokens
- Vocab Size: 64,000
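As a sanity check, plugging the specifications above into a Hugging Face `LlamaConfig` reproduces the stated parameter count. This is a sketch that relies on library defaults (notably untied input/output embeddings), which may differ from the released checkpoint:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Randomly initialized model built from the card's listed hyperparameters.
config = LlamaConfig(
    vocab_size=64_000,
    hidden_size=768,
    intermediate_size=2048,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=2048,  # context length
)
model = LlamaForCausalLM(config)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # ≈183M with untied embeddings
```

The 64k vocabulary accounts for roughly half the budget (two 64,000 × 768 embedding matrices), with the remaining ~85M in the 12 transformer layers.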
## Usage & Licensing
**License: All Rights Reserved**
This model is provided for private, non-commercial use only. Redistribution, modification (for the purpose of redistribution), and commercial usage are strictly prohibited.
### How to Load

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "CocoEntertainment/CoALa-1-Pretuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
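Since this is a base model, use it for text continuation rather than chat. A minimal generation sketch (the prompt and sampling settings below are illustrative, not tuned recommendations):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "CocoEntertainment/CoALa-1-Pretuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A base model continues the prompt in an encyclopedic style.
prompt = "The Eiffel Tower is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```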
