Instructions to use jdopensource/JoyAI-LLM-Flash with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use jdopensource/JoyAI-LLM-Flash with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="jdopensource/JoyAI-LLM-Flash", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("jdopensource/JoyAI-LLM-Flash", trust_remote_code=True, dtype="auto")
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use jdopensource/JoyAI-LLM-Flash with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "jdopensource/JoyAI-LLM-Flash"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "jdopensource/JoyAI-LLM-Flash",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker
docker model run hf.co/jdopensource/JoyAI-LLM-Flash
- SGLang
How to use jdopensource/JoyAI-LLM-Flash with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "jdopensource/JoyAI-LLM-Flash" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "jdopensource/JoyAI-LLM-Flash",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "jdopensource/JoyAI-LLM-Flash" \
        --host 0.0.0.0 \
        --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "jdopensource/JoyAI-LLM-Flash",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
- Docker Model Runner
How to use jdopensource/JoyAI-LLM-Flash with Docker Model Runner:
docker model run hf.co/jdopensource/JoyAI-LLM-Flash
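The curl calls shown for vLLM and SGLang both target an OpenAI-compatible `/v1/chat/completions` route, so the same request can be issued from Python with only the standard library. The following is a minimal sketch, assuming a server is already running locally on one of the ports used above (8000 for vLLM, 30000 for SGLang) and that it returns the usual OpenAI-style `choices[0].message.content` response shape. The helper names `build_chat_request` and `chat` are illustrative and not part of either project.

```python
import json
import urllib.request

MODEL_ID = "jdopensource/JoyAI-LLM-Flash"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8000", timeout=60):
    """Send the request and return the text of the first choice.

    Assumes an OpenAI-style response body; needs a running server.
    """
    request = build_chat_request(prompt, base_url=base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?")` would target the vLLM port, and `chat("What is the capital of France?", base_url="http://localhost:30000")` the SGLang one. Keeping request construction separate from sending makes the payload easy to inspect without a live server.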
JoyAI-LLM-Flash Official Collection Notice
The official JoyAI-LLM-Flash collection by jdopensource now provides a complete lineup of models designed to support diverse hardware environments, from high-performance GPUs to memory-constrained edge devices.
All models below are part of the official Hugging Face collection.
Core Models
jdopensource/JoyAI-LLM-Flash
Full-precision flagship model for high-quality inference and advanced reasoning.
https://huggingface.co/jdopensource/JoyAI-LLM-Flash
jdopensource/JoyAI-LLM-Flash-Base
Streamlined base version for rapid experimentation and foundational deployment.
https://huggingface.co/jdopensource/JoyAI-LLM-Flash-Base
Optimized & Quantized Variants
jdopensource/JoyAI-LLM-Flash-FP8
FP8 quantization with a strong balance between performance and efficiency.
https://huggingface.co/jdopensource/JoyAI-LLM-Flash-FP8
jdopensource/JoyAI-LLM-Flash-INT4
Ultra-compact INT4 version for extremely limited VRAM environments.
https://huggingface.co/jdopensource/JoyAI-LLM-Flash-INT4
jdopensource/JoyAI-LLM-Flash-Block-INT8
Block-wise INT8 quantization for improved memory efficiency.
https://huggingface.co/jdopensource/JoyAI-LLM-Flash-Block-INT8
jdopensource/JoyAI-LLM-Flash-Channel-INT8
Channel-wise INT8 optimization for balanced throughput and accuracy.
https://huggingface.co/jdopensource/JoyAI-LLM-Flash-Channel-INT8
jdopensource/JoyAI-LLM-Flash-GGUF
GGUF format for broad compatibility (e.g., llama.cpp, Ollama, CPU inference).
https://huggingface.co/jdopensource/JoyAI-LLM-Flash-GGUF
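Because every repository in the collection shares the `jdopensource/JoyAI-LLM-Flash` prefix, a small lookup table can help scripts switch between variants without retyping IDs. This is a sketch only: the short keys (`"fp8"`, `"int4"`, and so on) and the `repo_url` helper are hypothetical names chosen for illustration, while the repository IDs themselves are the ones listed above.

```python
HF_BASE = "https://huggingface.co/"

# Short keys are hypothetical; the repository IDs come from the collection list.
VARIANTS = {
    "flagship": "jdopensource/JoyAI-LLM-Flash",
    "base": "jdopensource/JoyAI-LLM-Flash-Base",
    "fp8": "jdopensource/JoyAI-LLM-Flash-FP8",
    "int4": "jdopensource/JoyAI-LLM-Flash-INT4",
    "block-int8": "jdopensource/JoyAI-LLM-Flash-Block-INT8",
    "channel-int8": "jdopensource/JoyAI-LLM-Flash-Channel-INT8",
    "gguf": "jdopensource/JoyAI-LLM-Flash-GGUF",
}


def repo_url(variant):
    """Return the Hugging Face URL for a variant key (case-insensitive)."""
    key = variant.lower()
    if key not in VARIANTS:
        known = ", ".join(sorted(VARIANTS))
        raise ValueError(f"unknown variant {variant!r}; expected one of: {known}")
    return HF_BASE + VARIANTS[key]
```

A value from `VARIANTS` can be passed wherever the examples earlier on this page use the flagship repository ID, provided the chosen runtime supports that variant's format.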
Deploy Anywhere
Whether you are running:
- High-throughput GPU servers
- Consumer-grade GPUs
- Low-memory edge devices
- CPU-only local inference
the official JoyAI-LLM-Flash collection provides a tailored option for your hardware stack.
Choose the precision. Match your memory budget. Maximize performance.