Instructions for using amusktweewt/tiny-model-700M-chat with the libraries and local apps listed below.

Libraries

How to use amusktweewt/tiny-model-700M-chat with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="amusktweewt/tiny-model-700M-chat")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("amusktweewt/tiny-model-700M-chat")
model = AutoModelForCausalLM.from_pretrained("amusktweewt/tiny-model-700M-chat")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use amusktweewt/tiny-model-700M-chat with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "amusktweewt/tiny-model-700M-chat"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "amusktweewt/tiny-model-700M-chat",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
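
The same request can be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server started above is listening on localhost:8000 and returns OpenAI-style chat completion JSON; the helper names `build_chat_request` and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL = "amusktweewt/tiny-model-700M-chat"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, base_url="http://localhost:8000"):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    # OpenAI-style responses put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` prints the reply. Passing `base_url="http://localhost:30000"` should target an SGLang server started on port 30000 in the same way.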

Use Docker

docker model run hf.co/amusktweewt/tiny-model-700M-chat

SGLang

How to use amusktweewt/tiny-model-700M-chat with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "amusktweewt/tiny-model-700M-chat" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "amusktweewt/tiny-model-700M-chat",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "amusktweewt/tiny-model-700M-chat" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "amusktweewt/tiny-model-700M-chat",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use amusktweewt/tiny-model-700M-chat with Docker Model Runner:
```
docker model run hf.co/amusktweewt/tiny-model-700M-chat
```

amusktweewt/tiny-model-700M-chat

This is a general-purpose transformer-based language model tailored for conversational tasks, story generation, and code-related interactions. It builds upon earlier models in the "tiny" series with a larger model size, improved attention efficiency, and an optimized training setup.

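On my custom benchmark (see Intelligence Score Comparison) it scores 4.43, against 2.51 for tiny-model-500M-chat-v2 and 2.08 for tiny-model-500M-chat-v2-5-exp, which is roughly 1.8 to 2.1 times the score of the 500M models, and the user experience is significantly better. It knows more facts and is the first model in this series capable of performing basic arithmetic.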

Model Details

Model Description

Model type: LlamaForCausalLM
Hidden size: 816
Layers: 26
Attention heads: 12
Key/Value heads: 6
Intermediate size: 9856
Total Parameters: 706M
Tokenizer vocab size: 32,768
Max sequence length: 2048 tokens
Rotary Positional Encoding: Dynamic (factor: 2.0)
Activation: SiLU
Attention Implementation: Flash Attention 2
Other optimizations:
- Scaled dot-product attention
- Memory-efficient attention
- No bias in MLP or attention layers
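
As a sanity check, the listed dimensions can be tied back to the 706M total with a little arithmetic. The sketch below assumes a standard Llama layer layout (gated MLP with three projections, two RMSNorm vectors per layer) and, as a labeled assumption, that the output head shares its weights with the input embeddings. Neither detail is stated in the card.

```python
# Architecture values from the model description above.
hidden_size = 816
num_layers = 26
num_heads = 12
num_kv_heads = 6
intermediate_size = 9856
vocab_size = 32_768

head_dim = hidden_size // num_heads   # per-head width
kv_dim = num_kv_heads * head_dim      # K/V width under grouped-query attention

# Attention projections (no bias): Q and O are hidden x hidden, K and V are hidden x kv_dim.
attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
# Llama-style gated MLP (no bias): gate, up, and down projections.
mlp = 3 * hidden_size * intermediate_size
# Two RMSNorm weight vectors per layer.
norms = 2 * hidden_size

per_layer = attention + mlp + norms
embeddings = vocab_size * hidden_size
final_norm = hidden_size

# ASSUMPTION: the output head is tied to the input embeddings (counted once).
total = num_layers * per_layer + embeddings + final_norm
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
print(f"{total * 4 / 1e9:.2f} GB at 4 bytes per parameter (F32)")
```

Under these assumptions the count rounds to 706M, matching the listed total. With an untied output head the count would instead be about 733M, which suggests (but does not prove) that the embeddings are tied.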

Training Details

Training Configuration

Optimizer: AdamW with 8-bit precision (adamw_bnb_8bit)
Learning rate: 8e-5
Scheduler: Cosine
Warmup ratio: 15%
Weight decay: 0.01
Batch size: 6 (train), 2 (eval) per device
Gradient accumulation: 2 steps
Mixed precision: bfloat16
Epochs: 1
Training tokens: 43.6B
Seed: 42
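
To make the scheduler settings concrete, here is a small sketch of how the learning rate would evolve. It assumes linear warmup over the first 15% of steps followed by a half-cosine decay to zero, which is the usual meaning of a cosine schedule with a warmup ratio. The exact scheduler implementation used for training is not stated here, and `TOTAL_STEPS` is a hypothetical value chosen only for illustration.

```python
import math

PEAK_LR = 8e-5        # learning rate from the configuration above
WARMUP_RATIO = 0.15   # warmup ratio from the configuration above


def lr_at(step, total_steps):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = int(WARMUP_RATIO * total_steps)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# HYPOTHETICAL step count, chosen only to make the shape easy to read.
TOTAL_STEPS = 1000
for step in (0, 75, 150, 575, 1000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

The rate climbs from zero to the 8e-5 peak at the end of warmup, passes through half the peak midway through the decay phase, and reaches zero at the final step.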

Training Hardware

Hardware: Assumed to be similar to an RTX 4090-class GPU
Torch Compile: Enabled (inductor backend)

Evaluation

Perplexity: 2.177
Eval loss: 0.7776
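
The two numbers are related: perplexity is the exponential of the mean cross-entropy loss. The short check below assumes the eval loss is reported in nats, which is the usual convention for PyTorch cross-entropy.

```python
import math

eval_loss = 0.7776            # reported eval loss (assumed to be in nats)
reported_perplexity = 2.177   # reported perplexity

perplexity = math.exp(eval_loss)
print(f"exp({eval_loss}) = {perplexity:.4f}")
print(f"difference from reported value: {abs(perplexity - reported_perplexity):.4f}")
```

The recomputed value agrees with the reported perplexity to within about 0.001, which is consistent with rounding of the reported loss or with averaging over batches.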

On my own custom-made benchmark for small models, it gets the highest score of all my models.

Intelligence Score Comparison

| Model | Intelligence Score |
|---|---|
| Gemma-3-27B (for comparison) | 8.3 |
| tiny-model-700M-chat | 4.42841 |
| tiny-model-141M-chat (unreleased) | 2.7 |
| tiny-model-500M-chat-v2 | 2.50909 |
| tiny-model-500M-chat-v2-5-exp | 2.08295 |

Usage and Applications

Direct Use

This model is suitable for:

Text and dialogue generation
Educational tasks
Code completion and explanation
Story creation

Not Recommended For

High factual precision tasks
Sensitive or critical domains without human supervision

How to Get Started

import torch
from transformers import pipeline, set_seed

# Set up the text-generation pipeline
model_name = "amusktweewt/tiny-model-700M-chat"
chatbot = pipeline(
    "text-generation",
    model=model_name,
    device=0 if torch.cuda.is_available() else -1
)

# Ensure that bos_token and eos_token are explicitly set as strings
chatbot.tokenizer.bos_token = "<sos>"
chatbot.tokenizer.eos_token = "<|endoftext|>"

# Set seed for reproducibility (optional)
set_seed(42)

print("Chatbot is ready! Type 'exit' to end the conversation.")

# Initialize the conversation history
conversation_history = []

conversation_history.append({"role": "system", "content": "You are a highly intelligent and helpful AI assistant named Tiny Chat, developed by amusktweewt. Always refer to yourself like that. Your responses should be clear, concise, and accurate. Always prioritize user needs, provide well-structured answers, and maintain a friendly yet professional tone. Adapt to the user's preferences and communication style. When needed, ask clarifying questions to ensure the best response. Be honest about limitations and avoid making assumptions. Keep interactions engaging, informative, and efficient."})

while True:
    user_input = input("You: ").strip()
    if user_input.lower() == "exit":
        print("Exiting chat. Goodbye!")
        break

    # Append user message to the conversation history
    conversation_history.append({"role": "user", "content": user_input})

    # Prepare the messages with the conversation history and an empty assistant turn
    messages = conversation_history + [{"role": "assistant", "content": ""}]

    # Use the tokenizer's apply_chat_template() method to format the prompt.
    prompt = chatbot.tokenizer.apply_chat_template(messages, tokenize=False)

    # Generate text using the formatted prompt.
    response = chatbot(
        prompt,
        do_sample=True,
        max_new_tokens=512,
        top_k=50,
        temperature=0.6,
        num_return_sequences=1,
        repetition_penalty=1.1,
        pad_token_id=chatbot.tokenizer.eos_token_id,
        min_new_tokens=20
    )

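    # The returned 'generated_text' includes the prompt plus the generation.
    full_text = response[0]["generated_text"]
    # Extract the assistant's response by removing the prompt portion.
    bot_response = full_text[len(prompt):].strip()
    print(f"Bot: {bot_response}")

    # Store the reply so the next turn sees the full conversation.
    conversation_history.append({"role": "assistant", "content": bot_response})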
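
Because the history in the loop above grows with every turn while the model's maximum sequence length is 2048 tokens, a long chat will eventually overflow the context. One possible way to handle this is to drop the oldest turns. The sketch below is illustrative only: `trim_history`, its `reserve` parameter, and the `word_count` stand-in counter are hypothetical names, not part of Transformers or of this model's code.

```python
def trim_history(history, count_tokens, max_tokens=2048, reserve=512):
    """Drop the oldest non-system turns until the history fits the token budget.

    `count_tokens` is a caller-supplied function mapping a message list to a
    token count; `reserve` leaves room for the tokens still to be generated.
    """
    budget = max_tokens - reserve
    system = [m for m in history if m["role"] == "system"]
    turns = [m for m in history if m["role"] != "system"]
    # Always keep the latest message, even if it alone exceeds the budget.
    while len(turns) > 1 and count_tokens(system + turns) > budget:
        turns.pop(0)
    # Avoid starting the kept window with an orphaned assistant reply.
    while len(turns) > 1 and turns[0]["role"] != "user":
        turns.pop(0)
    return system + turns


def word_count(messages):
    # Stand-in counter for the demo: whitespace-separated words, not real tokens.
    return sum(len(m["content"].split()) for m in messages)


history = [
    {"role": "system", "content": "You are Tiny Chat."},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five six"},
    {"role": "user", "content": "seven eight"},
]
trimmed = trim_history(history, word_count, max_tokens=10, reserve=2)
print([m["role"] for m in trimmed])
```

In the chat loop, a real counter could measure the templated prompt instead, for example the length of `chatbot.tokenizer(prompt)["input_ids"]`. The default `reserve=512` mirrors the `max_new_tokens=512` setting used above.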

Contact

Author: amusktweewt

For issues or feedback, please reach out via Hugging Face profile.

Safetensors

Model size: 0.7B params
Tensor type: F32
