Instructions to use plasterlabs/Qwen_0.6B-BF16_Base_Model with libraries and local apps.

Libraries

How to use plasterlabs/Qwen_0.6B-BF16_Base_Model with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="plasterlabs/Qwen_0.6B-BF16_Base_Model")

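# Load model directly (AutoModelForCausalLM includes the LM head that generate() needs)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("plasterlabs/Qwen_0.6B-BF16_Base_Model")
model = AutoModelForCausalLM.from_pretrained("plasterlabs/Qwen_0.6B-BF16_Base_Model", dtype="auto")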

vLLM

How to use plasterlabs/Qwen_0.6B-BF16_Base_Model with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "plasterlabs/Qwen_0.6B-BF16_Base_Model"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "plasterlabs/Qwen_0.6B-BF16_Base_Model",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
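
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server started above is listening on localhost:8000, and the helper names `build_payload` and `complete` are illustrative rather than part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "plasterlabs/Qwen_0.6B-BF16_Base_Model"


def build_payload(prompt, max_tokens=512, temperature=0.5):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, base_url="http://localhost:8000", **kwargs):
    """POST to the OpenAI-compatible /v1/completions endpoint and return the text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style completion responses carry the generated text in choices[0].text
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,", max_tokens=64)` should return the continuation as a string. Passing `base_url="http://localhost:30000"` would target an SGLang server started on port 30000 instead.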

Use Docker

docker model run hf.co/plasterlabs/Qwen_0.6B-BF16_Base_Model

SGLang

How to use plasterlabs/Qwen_0.6B-BF16_Base_Model with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "plasterlabs/Qwen_0.6B-BF16_Base_Model" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "plasterlabs/Qwen_0.6B-BF16_Base_Model",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "plasterlabs/Qwen_0.6B-BF16_Base_Model" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use plasterlabs/Qwen_0.6B-BF16_Base_Model with Docker Model Runner:
```
docker model run hf.co/plasterlabs/Qwen_0.6B-BF16_Base_Model
```

Qwen 0.6B BF16 Base Model

Overview

Welcome to the repository for the Qwen 0.6B BF16 Base Model. This repository provides a bfloat16 (BF16) precision version of the 0.6-billion-parameter Qwen base model, originally developed and released by the Qwen team at Alibaba Cloud.

The repository is packaged for plug-and-play use by researchers, hobbyists, and developers who want to deploy a lightweight language model or run fine-tuning experiments without the compute that larger models require.

By storing weights in bfloat16, the model needs half the memory of a float32 copy while keeping the same 8-bit exponent, and therefore essentially the same dynamic range, as standard 32-bit floating point. The trade-off is a shorter mantissa (7 bits versus 23), so individual values are less precise. The wide range helps training stability and avoids the numerical overflow issues sometimes encountered with standard float16, which makes the model a practical starting point for custom downstream task training, LoRA/QLoRA fine-tuning, and edge-device deployment.
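
The range difference between these formats can be checked with a little arithmetic. The sketch below computes the largest finite value of each format from its exponent and mantissa widths (8/23 for float32, 8/7 for bfloat16, 5/10 for float16); the helper name `max_finite` is illustrative.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-754-style binary float format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the top usable exponent equals the bias.
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias


FORMATS = {
    "float32": (8, 23),
    "bfloat16": (8, 7),
    "float16": (5, 10),
}

for name, (e_bits, m_bits) in FORMATS.items():
    print(f"{name:9s} max = {max_finite(e_bits, m_bits):.4e}  epsilon = {2.0 ** -m_bits:.1e}")
```

float16 tops out at 65504, while bfloat16 reaches roughly the same magnitude as float32. That is why activations or gradients that overflow in float16 usually stay finite in bfloat16, at the cost of a coarser epsilon.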

Important Note: This is a base (non-instruction-tuned) model. It has been trained to predict the next token in a sequence based on vast amounts of internet text. It has not been aligned using Reinforcement Learning from Human Feedback (RLHF) or instruction-tuning datasets. Therefore, it is meant to continue text rather than answer questions in a conversational chatbot format.
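
One common way to work with this behavior is to write the prompt as a document whose natural continuation is the output you want. The sketch below formats a few question/answer pairs and ends on an open "Answer:" line; the helper name `build_few_shot_prompt` and the "Question:/Answer:" layout are illustrative choices, not a format the model requires.

```python
def build_few_shot_prompt(examples, question):
    """Format Q/A pairs as a document whose natural continuation is the answer."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # End on an open "Answer:" so the next tokens the model predicts are the answer itself.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_few_shot_prompt(examples, "What is the capital of Italy?")
print(prompt)
```

A base model will typically keep writing past the answer and may invent further questions, so it helps to cap the number of generated tokens or cut the output at the first blank line.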

Detailed Model Specifications

Understanding the hardware requirements and architectural nuances of the model is critical for effective deployment and fine-tuning.

Architecture Type: Qwen Causal Language Model (Transformer-based decoder-only architecture)
Parameter Count: Approximately 0.6 Billion (~600 million) parameters.
Precision Format: bfloat16 (BF16).
Context Length: Capable of processing up to 32,768 tokens, allowing for extensive document processing and contextual understanding.
Vocabulary Size: Over 151,851 tokens, offering highly efficient tokenization across multiple languages and coding syntaxes.
Storage Format: safetensors (Fast, zero-copy loading that is significantly more secure than standard PyTorch .bin pickle files).
Framework Compatibility: PyTorch, deeply integrated with the Hugging Face transformers ecosystem.
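Memory Footprint: Requires roughly 1.2 GB to 1.5 GB of VRAM to load the base weights into GPU memory (about 0.6 billion parameters at 2 bytes each is roughly 1.2 GB, plus runtime overhead), making it accessible on consumer GPUs (e.g., NVIDIA RTX 3060, 4060) and even CPU/Mac environments. Activations and the KV cache add to this during generation.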
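
The weight-memory figure follows directly from the parameter count and the bytes per parameter. A minimal sketch of that arithmetic, assuming the approximate 0.6 billion parameter count stated above and counting weights only (no activations, KV cache, or framework overhead):

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 0.6e9  # approximate parameter count stated above

for dtype_name, nbytes in [("float32", 4), ("bfloat16", 2)]:
    print(f"{dtype_name}: ~{weight_memory_gb(NUM_PARAMS, nbytes):.1f} GB")
```

The bfloat16 result of about 1.2 GB matches the lower end of the range quoted above; the remainder is headroom for runtime overhead.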

Intended Use Cases

Due to its compact size and base nature, this model is highly versatile. Intended use cases include, but are not limited to:

Research and Prototyping: A fast, low-cost sandbox for testing new training methodologies, tokenization strategies, or alignment techniques before scaling up to 7B or 70B parameter models.
Parameter-Efficient Fine-Tuning (PEFT): An ideal base model for LoRA (Low-Rank Adaptation) and QLoRA experiments. You can easily train this model on custom, niche datasets (e.g., medical texts, legal documents, specialized coding languages) on a single consumer GPU.
Lightweight Edge Inference: Suitable for deployment on resource-constrained devices, such as mobile processors, Raspberry Pi, or local desktops without dedicated high-end accelerators.
Autocompletion Tasks: Can be integrated into IDEs or text editors to provide offline, private text and code autocompletion.
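
To give a sense of why LoRA fits on a single consumer GPU: for a weight matrix with d_out rows and d_in columns, LoRA freezes the original weights and trains two low-rank factors with rank × (d_in + d_out) parameters in total. The sketch below does that arithmetic for a hypothetical 1024 × 1024 projection; the dimensions and rank are illustrative assumptions, not values read from this model's configuration.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns an update B @ A with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 1024, 1024, 8

full = d_in * d_out
lora = lora_param_count(d_in, d_out, rank)
print(f"full matrix: {full:,} params")
print(f"LoRA (r={rank}): {lora:,} params ({100 * lora / full:.2f}% of the full matrix)")
```

Because only the low-rank factors receive gradients and optimizer state, training memory falls far below that of full fine-tuning, even though the frozen base weights still have to be loaded.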

Installation and Requirements

To utilize this model effectively, ensure your software environment meets the following minimum requirements.

Python: Version 3.9 or higher.
PyTorch: Version 2.0 or higher (built with CUDA support if you intend to run inference on an NVIDIA GPU).
Transformers: Version 4.37.0 or higher.

Set up your environment and install the required dependencies using the following shell commands:

# Core dependencies
pip install torch transformers accelerate

# Optional dependencies for fine-tuning and quantization
pip install peft bitsandbytes datasets
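
After installing, a quick check against the minimum versions listed above can catch an outdated environment before a model download starts. This is a minimal sketch using only the standard library; the helper names `parse_version` and `check_environment` are illustrative, and the version parsing is deliberately simple (it reads only the leading numeric components).

```python
import re
import sys
from importlib import metadata

# Minimum versions taken from the requirements list above.
MINIMUMS = {"torch": (2, 0, 0), "transformers": (4, 37, 0)}


def parse_version(text):
    """Return the leading 'X.Y.Z' of a version string as a tuple, e.g. '2.1.0+cu121' -> (2, 1, 0)."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", text)
    if match is None:
        return (0, 0, 0)
    return tuple(int(part or 0) for part in match.groups())


def check_environment():
    """Return a list of human-readable problems; empty means all minimums are met."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append("Python 3.9 or higher is required")
    for package, minimum in MINIMUMS.items():
        try:
            installed = parse_version(metadata.version(package))
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if installed < minimum:
            problems.append(f"{package} {installed} is older than required {minimum}")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

The script prints nothing when every requirement is satisfied.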

Inference Guide: How to Use the Model
Because this is a base model, it relies on prompting strategies that frame your desired output as a natural continuation of the input text. The example below covers loading the model and generating text with sampling.

1. Standard Text Continuation
The following Python script demonstrates how to load the model in its native BF16 precision and generate text.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Define the model repository
model_id = "plasterlabs/Qwen_0.6B-BF16_Base_Model"

# Initialize the tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Load the model directly onto the optimal device (GPU if available)
print("Loading model in BF16 precision...")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, # Maintains the native precision
    device_map="auto",          # Automatically places model on GPU/CPU
    trust_remote_code=True
)

# Design a prompt suitable for a base model
prompt = "The history of artificial intelligence dates back to the mid-20th century when"

# Tokenize the input prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate the sequence
print("Generating text...")
with torch.no_grad():
    outputs = model.generate(
        **inputs, 
        max_new_tokens=150,       # Number of tokens to generate
        do_sample=True,           # Enable probabilistic sampling
        temperature=0.7,          # Controls randomness (higher = more creative)
        top_p=0.9,                # Nucleus sampling threshold
        repetition_penalty=1.1    # Penalizes the model for looping text
    )

# Decode and print the result
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\n--- Generated Output ---\n")
print(generated_text)
