Sophon-OSS-1B-v1 🧠

โš ๏ธ Status: Model card prepared. Model weights coming soon (Q2 2026).

This repository is set up in advance. The actual model is currently in training. Follow @monodox for updates!


Sophon-OSS-1B-v1 is Monodox's first open-source language model: a 1-billion-parameter model with a primary focus on Malayalam. Built for efficiency, accessibility, and research.

🇮🇳 Built in India, for the world.


📊 Model Details

| Attribute | Value |
|---|---|
| Organization | Monodox Technologies Pvt Ltd |
| Model Type | Causal Language Model (Transformer) |
| Parameters | 1 Billion (1B) |
| Context Length | 2048 tokens |
| License | Apache 2.0 |
| Release Date | February 2026 |
| Primary Languages | Malayalam + English |

Why Malayalam First?

Malayalam is our home language and represents an underserved market in AI:

  • 50M+ speakers worldwide
  • Rich literary tradition spanning 1000+ years
  • High digital literacy in Kerala
  • Limited AI models available compared to Hindi/English
  • Complex script with unique linguistic features

Starting with Malayalam allows us to:

  • Build deep expertise in low-resource languages
  • Serve our local community first
  • Create techniques applicable to other Dravidian languages
  • Prove our approach before scaling

Future languages: Hindi, Tamil, Telugu, Kannada, Bengali (2026-2027)


🎯 Key Features

✨ Multilingual by Design

  • Native support for 2 languages
  • Strong performance on Malayalam
  • Code-switching capabilities

⚡ Efficient & Accessible

  • Runs on consumer hardware (4GB+ VRAM)
  • Mobile-friendly inference
  • Low latency generation

🔬 Research-First

  • Open weights and architecture
  • Reproducible training
  • Designed for fine-tuning

๐ŸŒ Community-Driven

  • Apache 2.0 license (commercial use allowed)
  • Active community support
  • Regular updates

🚀 Quick Start

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model
model_name = "monodox/sophon-oss-1b-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
prompt = "The future of AI in India is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Multilingual Example

# English
prompt_en = "Artificial Intelligence is"

# Malayalam
prompt_ml = "ആർട്ടിഫിഷ്യൽ ഇന്റലിജൻസ്"  # "Artificial Intelligence"

# Generate in any language
for prompt in [prompt_en, prompt_ml]:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

๐Ÿ“ Model Architecture

  • Base Architecture: GPT-style Transformer decoder
  • Layers: 24
  • Hidden Size: 2048
  • Attention Heads: 16
  • Vocabulary Size: 50,000 (multilingual)
  • Positional Encoding: Learned
  • Activation: GELU
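
As a rough sanity check, these hyperparameters are consistent with the headline parameter count. The back-of-the-envelope estimate below is approximate (it ignores biases, layer norms, and possible embedding weight tying), so treat it as an order-of-magnitude check only:

# Rough parameter estimate from the architecture figures above (approximate).
vocab_size, hidden, n_layers, context = 50_000, 2048, 24, 2048

token_embeddings = vocab_size * hidden      # ~102M
position_embeddings = context * hidden      # learned positions, ~4M
per_layer = 12 * hidden * hidden            # attention (4h^2) + MLP (8h^2)

total = token_embeddings + position_embeddings + n_layers * per_layer
print(f"~{total / 1e9:.2f}B parameters")    # ~1.31B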

📚 Training Details

Training Data

Sophon-OSS-1B-v1 was trained on a diverse multilingual corpus:

  • Web Crawl: Common Crawl, OSCAR
  • Indian Languages: IndicCorp, AI4Bharat datasets
  • Code: GitHub repositories
  • Books & Articles: Multilingual text
  • Wikipedia: All supported languages

Total Tokens: ~200 billion
Data Curation: Extensive filtering for quality, safety, and diversity

Training Configuration

  • Framework: PyTorch 2.0 + Hugging Face Transformers
  • Hardware: NVIDIA A100 GPUs
  • Training Time: 10 days
  • Batch Size: 512
  • Learning Rate: 3e-4
  • Optimizer: AdamW
  • Warmup Steps: 2000
  • Precision: Mixed precision (FP16)
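
For reference, here is a hypothetical mapping of these settings onto Hugging Face TrainingArguments. The per-device batch size and gradient accumulation are assumptions (only the global batch size of 512 is stated above), and the actual training script has not been released:

from transformers import TrainingArguments

# Illustrative only; not the official training configuration.
args = TrainingArguments(
    output_dir="./sophon-pretrain",
    learning_rate=3e-4,               # peak learning rate
    warmup_steps=2000,
    optim="adamw_torch",              # AdamW
    fp16=True,                        # mixed precision
    per_device_train_batch_size=32,   # assumption: 32 per GPU x 16 GPUs = 512 global
    gradient_accumulation_steps=1,    # assumption
)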

📊 Performance

Benchmarks

Coming soon: benchmarks on MMLU, HellaSwag, ARC, and Indian-language tasks.

Supported Languages

| Language | ISO Code | Performance |
|---|---|---|
| Malayalam | ml | ⭐⭐⭐⭐ |
| English | en | ⭐⭐⭐⭐ |

Size Comparison

| Model | Parameters | Languages | Malayalam Support |
|---|---|---|---|
| GPT-2 | 1.5B | English | ❌ None |
| Llama-2 | 7B | Multilingual | ⚠️ Limited |
| IndicBERT | 110M | 12 Indian | ✅ Basic |
| Sophon-OSS-1B | 1B | ML + EN | ✅ Native |

Sophon prioritizes quality over quantity, focusing deeply on Malayalam.


💻 Hardware Requirements

Inference

| Configuration | Min VRAM | Recommended |
|---|---|---|
| FP16 | 4GB | 8GB |
| 8-bit Quantized | 2GB | 4GB |
| 4-bit Quantized | 1GB | 2GB |
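
A minimal sketch of reaching the quantized footprints above using the bitsandbytes integration in transformers (assumes `pip install bitsandbytes accelerate`; not an official recipe):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit loading; use BitsAndBytesConfig(load_in_8bit=True) for the 8-bit row.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "monodox/sophon-oss-1b-v1",
    quantization_config=bnb_config,
    device_map="auto",  # place weights on available devices automatically
)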

Compatible Hardware

✅ RTX 3060 (12GB)
✅ RTX 4070 (12GB)
✅ Apple M1/M2 (8GB+ unified memory)
✅ Google Colab (free tier, with limitations)
✅ High-end mobile devices (quantized)


🎓 Use Cases

✅ Recommended

  • Text generation and completion
  • Conversational AI (chatbots)
  • Multilingual applications
  • Educational tools
  • Content creation assistance
  • Code completion (basic)
  • Research and experimentation

โš ๏ธ Limitations

  • Not suitable for complex reasoning tasks
  • Limited context window (2048 tokens)
  • May generate biased or incorrect content
  • Requires fact-checking for critical applications
  • Not recommended for medical/legal advice

โš–๏ธ Bias, Risks & Limitations

Known Limitations

  • Size: 1B parameters limit reasoning capability
  • Context: 2048 token limit restricts long documents
  • Knowledge Cutoff: Training data up to [date]
  • Hallucinations: May generate plausible but incorrect information

Bias Considerations

  • Trained on internet data (may contain biases)
  • Gender, cultural, and regional biases possible
  • Content filtering applied during training
  • Users should verify outputs for fairness

Safety Measures

  • Content filtering during training
  • Red-teaming for harmful outputs
  • Clear documentation of limitations
  • Community reporting mechanism

🔧 Fine-Tuning

Sophon-OSS-1B-v1 is designed for fine-tuning:

from transformers import Trainer, TrainingArguments

# Assumes `model` and `tokenizer` are loaded as in the Quick Start section.
training_args = TrainingArguments(
    output_dir="./sophon-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
    # ... your config
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized dataset; see the sketch below
)

trainer.train()
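
A minimal sketch of building the `train_dataset` used above for causal-LM fine-tuning, assuming the Hugging Face `datasets` library and a local text file (`my_corpus.txt` is a placeholder). Pass the collator to the Trainer via `data_collator=collator` so labels are generated automatically:

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling

# Placeholder corpus file; swap in your own data files or a Hub dataset.
raw = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    # Truncate to the model's 2048-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_dataset = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# mlm=False means causal LM: the collator copies input_ids into labels.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)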

Recommended for:

  • Domain-specific applications
  • Custom language pairs
  • Specialized tasks
  • Personal assistants

📖 Citation

If you use Sophon-OSS-1B-v1 in your research, please cite:

@misc{sophon-oss-1b-v1-2026,
  title={Sophon-OSS-1B-v1: A Multilingual Language Model for Indian Languages},
  author={Monodox Technologies Pvt Ltd},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/monodox/sophon-oss-1b-v1}}
}

๐Ÿค Community & Support

Contributing

We welcome contributions!

  • Fine-tune for new languages
  • Report bugs and issues
  • Share use cases
  • Improve documentation

Training Progress

We'll update this section as training progresses!

Current Status: Data collection and preprocessing
Last Updated: February 13, 2026

Timeline

  • ✅ Model architecture finalized
  • ✅ Training data collected
  • 🔄 Data preprocessing (in progress)
  • ⏳ Model training (starting Q1 2026)
  • ⏳ Evaluation and benchmarking
  • ⏳ Public release (Q2 2026)

Follow our blog for detailed progress updates.


๏ฟฝ๐Ÿ™ Acknowledgments

Special thanks to Kerala Startup Mission and supporters who made this possible.


📜 License

This model is released under the Apache 2.0 License.

You are free to:

  • ✅ Use commercially
  • ✅ Modify and distribute
  • ✅ Use privately
  • ✅ Use for research

See LICENSE for full details.


🔮 What's Next?

Upcoming Models:

  • 🔥 Sophon-Lite-7B (Q3 2026)
  • 🌏 Sophon-Indic-7B (Q4 2026)
  • 💻 Sophon-Code-3B (2027)

๐Ÿ—บ๏ธ Sophon Roadmap

2026 Q2  →  Sophon-OSS-1B-v1 (Malayalam + English)
2026 Q3  →  Add Hindi, Tamil
2026 Q4  →  Sophon-Indic-7B (10+ languages)
2027 Q1  →  Sophon-Code-3B (Coding specialist)
2027 Q2+  →  Sophon-Lite-7B (Production-ready)


Built with โค๏ธ in India ๐Ÿ‡ฎ๐Ÿ‡ณ

Research. Build. Innovate.
