
Biawak-8B-Base

  • library_name: transformers
  • base_model: Qwen/Qwen3-8B
  • tags: qwen, qwen3, causal-lm, continued-pretraining, indonesian, id, prd, dtp
  • license: apache-2.0
  • language: id, en

📌 Overview

Biawak-8B-Base is an 8-billion-parameter Large Language Model (LLM) adapted specifically for Indonesia's strategic focus areas:

  • Perlindungan Ruang Digital (PRD) – Digital Space Protection
  • Digital Talent Pool (DTP) – Workforce and digital capability development

This model was built through Continued Pre-training (CPT) of the Qwen3-8B base model on a curated Indonesian dataset.


🧠 Model Details

Model Description

  • Developed by: AITF Indonesia
  • Model Type: Causal Language Model (Base)
  • Base Model: Qwen/Qwen3-8B
  • Language: Indonesian (Primary), English (Secondary)
  • License: Apache 2.0
  • Training Method: Continued Pre-training (CPT)

🎯 Goal

To create a sovereign, domain-specialized Indonesian foundation model with strong understanding of:

  • Digital policies (UU PDP – Personal Data Protection Law; UU ITE – Electronic Information and Transactions Law)
  • Digital workforce & skill landscape (DTP)

📚 Dataset Composition

Total Dataset Size: ~214.2 Million Tokens

| Category     | Description                                                | Token Count (M) | Percentage |
|--------------|------------------------------------------------------------|-----------------|------------|
| DTP          | Digital HR, tech syllabi, certifications, job trends       | 94.0            | ~43.9%     |
| PRD          | Cybersecurity, PDP Law, content moderation, hoax prevention | 92.0            | ~42.9%     |
| Wikipedia ID | General knowledge anchor & grammar stability               | 28.2            | ~13.2%     |
| Total        |                                                            | 214.2           | 100%       |
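
At training time, a mixture like this can be reproduced by interleaving the three sources with sampling probabilities that match the token shares above. Below is a minimal sketch using the Hugging Face datasets library; the PRD/DTP file paths are placeholders, since those corpora are not published with this card:

from datasets import load_dataset, interleave_datasets

# Placeholder paths; the curated PRD/DTP corpora are not released with this model.
dtp  = load_dataset("json", data_files="dtp_corpus.jsonl", split="train")
prd  = load_dataset("json", data_files="prd_corpus.jsonl", split="train")
wiki = load_dataset("wikimedia/wikipedia", "20231101.id", split="train")

# Sampling probabilities follow the token shares in the table above
# (DTP ~43.9%, PRD ~42.9%, Wikipedia ID ~13.2%).
mixed = interleave_datasets(
    [dtp, prd, wiki],
    probabilities=[0.439, 0.429, 0.132],
    seed=42,
    stopping_strategy="all_exhausted",
)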

🧩 Intended Use

As a base model, Biawak-8B outputs text completions and can be adapted into chat/instruct variants; a few-shot prompt sketch follows the two lists below.

1. PRD (Perlindungan Ruang Digital)

  • Policy sentiment analysis
  • Misinformation pattern detection
  • Understanding legal terminology (UU ITE, UU PDP)

2. DTP (Digital Talent Pool)

  • Skill gap analysis
  • Curriculum drafting assistance
  • Job description & talent understanding
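
Because Biawak-8B-Base is a completion model, these tasks are usually framed as few-shot prompts that the model continues. A minimal sketch for policy sentiment analysis; the statements and labels are illustrative only, not taken from the training data:

# Illustrative few-shot prompt (Indonesian): classify public sentiment toward a policy statement.
few_shot_prompt = """Tentukan sentimen publik terhadap pernyataan kebijakan berikut (Positif/Negatif/Netral).

Pernyataan: Pelatihan talenta digital gratis diperluas ke seluruh provinsi.
Sentimen: Positif

Pernyataan: Banyak UMKM kesulitan memenuhi kewajiban perlindungan data dalam UU PDP.
Sentimen: Negatif

Pernyataan: Pemerintah menerbitkan panduan moderasi konten untuk platform digital.
Sentimen:"""

# Pass few_shot_prompt to model.generate() exactly as in the
# "How to Get Started" section below; the completion should be the label.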

🚀 How to Get Started

Load the model using Hugging Face Transformers:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Configuration
model_id = "ismaprasetiyadi/Biawak-8B-Base"

# 2. Load Model
# Use bfloat16 for A100/A10G, float16 for T4
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# 3. Inference Example (Completion)
# Prompt (Indonesian): "The main strategy for reducing the digital talent gap in Indonesia is"
input_text = "Strategi utama untuk mengurangi gap talenta digital di Indonesia adalah"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
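
On GPUs where even half-precision weights do not fit (an 8B model needs roughly 16 GB in bf16/fp16), the model can also be loaded in 4-bit with bitsandbytes. A sketch, assuming the bitsandbytes package is installed:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with half-precision compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)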

⚙️ Training Details

Training Procedure

The model was continued-pretrained with a causal language modeling (CLM) objective, with the aim of preserving the base model's general reasoning capabilities.

Hardware & Environment

  • GPU: NVIDIA A100 80GB (Colab Pro+)
  • Training Duration: ~36 hours
  • Frameworks: PyTorch, Transformers, Accelerate

🔧 Hyperparameters (Highlights)

  • Sequence Length: 4096
  • Optimizer: AdamW
  • Scheduler: Cosine Decay
  • Precision: bf16
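
As an illustration, these highlights map onto a Hugging Face Trainer configuration roughly as follows; the learning rate, batch size, and epoch count are placeholders, since they are not published in this card:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="biawak-8b-cpt",
    bf16=True,                       # bf16 precision
    optim="adamw_torch",             # AdamW optimizer
    lr_scheduler_type="cosine",      # cosine decay schedule
    learning_rate=2e-5,              # placeholder, not published
    per_device_train_batch_size=1,   # placeholder
    gradient_accumulation_steps=16,  # placeholder
    num_train_epochs=1,              # placeholder
    logging_steps=10,
)
# The 4096-token sequence length is applied when tokenizing and packing
# the corpus, not through TrainingArguments.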

⚠️ Limitations

  • Base Model: No SFT or RLHF has been applied; few-shot prompting may be required.
  • Web Data Bias: May inherit biases from Indonesian web sources.
  • Hallucinations: Possible incorrect factual output.

✅ Recommendations

For production use, it is recommended to:

  • Perform Supervised Fine-Tuning (SFT) for the PRD/DTP domains (see the sketch after this list)
  • Add highโ€‘quality instruction datasets
  • Apply evaluation benchmarks before deployment
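
As a sketch of the first recommendation, supervised fine-tuning could be run with TRL's SFTTrainer; the instruction dataset below is a placeholder and is assumed to contain a "text" column with formatted prompt/response pairs:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder file; substitute a curated PRD/DTP instruction dataset.
train_ds = load_dataset("json", data_files="prd_dtp_instructions.jsonl", split="train")

trainer = SFTTrainer(
    model="ismaprasetiyadi/Biawak-8B-Base",
    args=SFTConfig(output_dir="biawak-8b-sft", bf16=True),
    train_dataset=train_ds,
)
trainer.train()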
