Instructions to use ThingAI/Quark-135m-Bilingual with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use ThingAI/Quark-135m-Bilingual with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ThingAI/Quark-135m-Bilingual", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("ThingAI/Quark-135m-Bilingual", trust_remote_code=True, device_map="auto")

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use ThingAI/Quark-135m-Bilingual with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ThingAI/Quark-135m-Bilingual"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ThingAI/Quark-135m-Bilingual",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/ThingAI/Quark-135m-Bilingual

SGLang

How to use ThingAI/Quark-135m-Bilingual with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ThingAI/Quark-135m-Bilingual" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ThingAI/Quark-135m-Bilingual",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ThingAI/Quark-135m-Bilingual" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ThingAI/Quark-135m-Bilingual",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use ThingAI/Quark-135m-Bilingual with Docker Model Runner:
```
docker model run hf.co/ThingAI/Quark-135m-Bilingual
```

you may like to study this:

by usermma - opened Jun 5

Discussion

usermma

Jun 5

new architecture :
https://huggingface.co/datasets/usermma/HMECA-Whitepaper

ThingsAI

ModotAI org Jun 5

Thanks for sharing! Interesting concept but currently not implementable as described - runtime neural module generation is still too slow for real-time use.
Many HMECA ideas already exist in practice: MoE (routing + experts), multi-agent systems (domain modules), and chain-of-thought (task decomposition).
If you have working code, happy to benchmark against Quark!

usermma

Jun 5

Thanks for having interest for reading it, unfortunately i don't currently have any of ability of having any cloud or local computers to have it trained in real life, or maybe its just my own hallucinations...

but i may in nowadays start making the datasets that i would train it on, or the someone who have that interest, maybe some researcher or someone curious...

its the idea wheres every embedding works like a net only trained for just one specific tasks, a small or big net..., for example writing codes without any understanding about why some fishs tastes good when eating and some fishs are not, beacuse if the same net have understanded more than a thing, it may more causes hallucinations..

or maybe there is something else you didn't like, maybe i need to change the license into raw apache 2.0...

what are your suggestions?

so i get more people interested in it? or what?

i just doesn't want HMECA to be forgotten without any implementation on real life of it.... !

ThingsAI

ModotAI org Jun 7

•

edited Jun 7

Let’s address the elephant in the room and look at this from a pure hardware and engineering perspective. If you want to prevent HMECA from being forgotten and actually get researchers interested, you need to understand why the current paper raises immediate skepticism for anyone running production clusters.
The main issue with HMECA isn't the data; it's the computational overhead of your routing and runtime mechanics.
Here are the three structural flaws that conflict with modern deep learning engineering:
1 The VRAM and Hardware Bottleneck: In modern AI, passing tensors through 6 different hierarchical layers of independent micro-networks (from Nodes up to Big Embeddeds) destroys throughput. The latency caused by routing overhead, memory allocation, and kernel launches across the GPU's VRAM would make execution orders of magnitude slower than a standard dense Transformer or a optimized Mixture of Experts (MoE).
2 The Runtime Generation Fallacy: You mention runtime adaptability and dynamic generation for all embeddeds. In real-world hardware, you cannot compile, instantiate, or train neural weights on the fly while serving a live user request. Backpropagation and dynamic weight allocation require gradient descent and optimizer states that would completely freeze a real-time inference pipeline.
3 The 'Isolation' Misconception: Isolating networks so strictly ("writing code without knowing why fish tastes good") removes the core strength of deep learning: cross-domain generalization and emergent properties. Models don't hallucinate because they know too many different topics; they hallucinate due to low-quality pre-training data, poor tokenization, or flaws in probability sampling. Cross-domain weights actually help the model build better latent representations.
My suggestions to get people interested and make HMECA real:
Change the License to Apache 2.0: Do this immediately. Nobody in the open-source or research community will touch or fork a repository with restrictive or custom licenses. Apache 2.0 lowers the friction.
Write a Minimal Python Simulation (No GPUs needed): Don't focus on training a huge model right now. Write a simple, CPU-friendly Python script using PyTorch or pure NumPy that acts as a mock simulation of your Meta Controller and execution graph. Show us how a request travels through these layers mathematically.
Create a Small Proof-of-Concept Dataset: Instead of a massive dataset, create a tiny, clean sample (e.g., 5k–10k rows) that explicitly demonstrates how your custom routing tags or formatting would look in practice.
If you can provide a working script that demonstrates the mathematical and logistical feasibility of this routing without melting a server's PCIe bandwidth, people will stop calling it a concept and start looking into it. Turn the theory into executable code!

usermma

Jun 7

i have changed the license into pure-raw apache-2.0
Enjoy.

in the next few days i may finish just the small of it

usermma

Jun 7

i hope no one patent any part of it....

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment