# Memory Optimization Module

A unified memory management system that integrates with a shared Qwen model, so modules reuse one loaded instance rather than each holding its own copy.

## Module Structure

```
memory_optimization/
├── __init__.py          # Module exports and convenience functions
├── config.py            # MemoryOptimizationConfig
├── manager.py           # UnifiedMemoryManager
├── tensor_pool.py       # TensorPool
├── model_cache.py       # ModelCache (uses shared Qwen model)
├── cleanup.py           # MemoryCleanup
└── README.md            # This file
```

## Features

### Shared Model Integration
- **ModelCache**: Reuses the shared Qwen model, so no additional model memory is allocated
- Automatic fallback to cached models if the shared model is unavailable
- Prevents model duplication across modules

### CUDA Optimization
- All operations run on CUDA when available
- Efficient tensor pooling
- Adaptive memory cleanup

### Self-Contained Modules
- Each component is independent
- Easy to test and benchmark
- Clean separation of concerns

## Usage

### Basic Usage

```python
import torch

from memory_optimization import (
    UnifiedMemoryManager,
    MemoryOptimizationConfig,
    get_unified_memory_manager,
)

# Initialize with shared model
config = MemoryOptimizationConfig(
    use_shared_model=True,
    device="cuda",
)
manager = UnifiedMemoryManager(config)

# Get shared model (uses shared Qwen if available)
model = manager.get_shared_model("Qwen/Qwen3-0.6B", "transformer")

# Get a pooled tensor of the requested shape and dtype
tensor = manager.get_tensor((10, 1024), dtype=torch.float32)

# Return tensor to pool so it can be reused
manager.return_tensor(tensor)
```

### Convenience Functions

```python
from memory_optimization import (
    get_shared_model,
    get_tensor,
    return_tensor,
    clear_memory,
    get_memory_stats,
)

# Get shared model
model = get_shared_model("Qwen/Qwen3-0.6B", "transformer")

# Get tensor
tensor = get_tensor((10, 1024))

# Return tensor
return_tensor(tensor)

# Get stats
stats = get_memory_stats()

# Clear memory
clear_memory()
```

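Module-level helpers like these are commonly thin wrappers around one lazily created manager instance. The sketch below illustrates that pattern only; `_ManagerSketch`, `get_manager_sketch`, and `clear_memory_sketch` are hypothetical names, and the real `memory_optimization` package may be organized differently.

```python
import threading
from typing import Optional


class _ManagerSketch:
    """Hypothetical stand-in for UnifiedMemoryManager (not the real class)."""

    def __init__(self) -> None:
        self.cleared = 0

    def clear_memory(self) -> None:
        self.cleared += 1


_manager: Optional[_ManagerSketch] = None
_lock = threading.Lock()


def get_manager_sketch() -> _ManagerSketch:
    """Create the manager on first use, then always return the same instance."""
    global _manager
    if _manager is None:
        with _lock:
            # Re-check inside the lock so two threads cannot both create one.
            if _manager is None:
                _manager = _ManagerSketch()
    return _manager


def clear_memory_sketch() -> None:
    """Convenience wrapper: delegate to the process-wide manager."""
    get_manager_sketch().clear_memory()
```

Because every helper goes through the same accessor, all callers share one pool and one cache.
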
## Integration with Shared Model

The module automatically detects and uses the shared Qwen model:

1. **ModelCache**: Reuses the shared Qwen model for transformers (no additional model memory)
2. **ModelCache**: Reuses the shared Qwen tokenizer (no additional tokenizer memory)
3. **Automatic Fallback**: Falls back to cached models if the shared model is unavailable

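The lookup order described above (shared instance first, cached copy otherwise) can be sketched as follows. This is a minimal illustration under assumptions, not the actual `ModelCache`: the class name, the `shared_provider` callback, and the `loader` callback are all hypothetical.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class ModelCacheSketch:
    """Hypothetical illustration: prefer a shared instance, else load once and cache."""

    def __init__(
        self,
        shared_provider: Optional[Callable[[str], Any]] = None,
        loader: Optional[Callable[[str, str], Any]] = None,
    ) -> None:
        self._shared_provider = shared_provider
        # Placeholder loader; a real one would load the model from disk or the Hub.
        self._loader = loader or (lambda name, kind: object())
        self._cache: Dict[Tuple[str, str], Any] = {}

    def get(self, model_name: str, model_type: str = "transformer") -> Any:
        # 1. Try the shared instance; treat any failure as "unavailable".
        if self._shared_provider is not None:
            try:
                shared = self._shared_provider(model_name)
            except Exception:
                shared = None
            if shared is not None:
                return shared
        # 2. Fallback: load at most once per (name, type) and reuse afterwards.
        key = (model_name, model_type)
        if key not in self._cache:
            self._cache[key] = self._loader(model_name, model_type)
        return self._cache[key]
```

In either branch the caller receives a reference to an existing object, which is how duplication across modules is avoided.
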
## CUDA Compatibility

All components are CUDA-compatible:

- Automatic device detection
- Efficient GPU memory management
- Adaptive cleanup based on memory pressure

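Automatic device detection usually amounts to asking PyTorch whether CUDA is usable and falling back to the CPU otherwise. The helper below is a sketch of that idea; `detect_device` is a hypothetical name, and the module's own detection logic may differ.

```python
def detect_device(preferred: str = "cuda") -> str:
    """Return "cuda" only when it was requested and is usable, otherwise "cpu"."""
    if preferred != "cuda":
        return "cpu"
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = detect_device()
print(f"Using device: {device}")
```
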
## Configuration

See `config.py` for all configuration options. Key settings:

- `use_shared_model`: Enable shared Qwen model (default: True)
- `device`: Device to use (`"cuda"` or `"cpu"`)
- `memory_threshold`: Memory usage threshold for cleanup
- `max_pool_size`: Maximum tensor pool size
- `use_4bit_quantization`: Enable 4-bit quantization

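For orientation, the key settings could be modeled as a dataclass like the one below. Only the `use_shared_model` default comes from this README; every other default, the interpretation of `memory_threshold` as a fraction, and the validation rules are assumptions for illustration. Check `config.py` for the real definitions.

```python
from dataclasses import dataclass


@dataclass
class MemoryOptimizationConfigSketch:
    """Hypothetical mirror of the key settings; not the real MemoryOptimizationConfig."""

    use_shared_model: bool = True        # default stated in this README
    device: str = "cuda"                 # assumed default
    memory_threshold: float = 0.85       # assumed: fraction of memory in use
    max_pool_size: int = 64              # assumed: max number of pooled tensors
    use_4bit_quantization: bool = False  # assumed default

    def __post_init__(self) -> None:
        # Fail early on values the rest of the system could not honor.
        if self.device not in ("cuda", "cpu"):
            raise ValueError(f"unsupported device: {self.device!r}")
        if not 0.0 < self.memory_threshold <= 1.0:
            raise ValueError("memory_threshold must be in (0, 1]")
        if self.max_pool_size < 0:
            raise ValueError("max_pool_size must be non-negative")


cpu_config = MemoryOptimizationConfigSketch(device="cpu", max_pool_size=16)
```
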
## Memory Management

The system provides:

- **Tensor Pooling**: Reuse tensors to reduce allocations
- **Model Caching**: Share model instances across modules
- **Adaptive Cleanup**: Automatic memory cleanup based on pressure
- **Emergency Cleanup**: Force cleanup when memory is critical

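Pooling and pressure-based cleanup can be illustrated with plain byte buffers. The sketch below is not the module's `TensorPool` or `MemoryCleanup`: the class, the `threshold` and `critical` values, and the "drop half" policy are assumptions chosen to show the mechanism.

```python
from collections import defaultdict
from typing import DefaultDict, List


class BufferPoolSketch:
    """Reuses bytearrays by size, much as a tensor pool reuses tensors by shape and dtype."""

    def __init__(self, max_pool_size: int = 8) -> None:
        self.max_pool_size = max_pool_size
        self._free: DefaultDict[int, List[bytearray]] = defaultdict(list)
        self.hits = 0
        self.misses = 0

    def pooled(self) -> int:
        """Number of idle buffers currently held by the pool."""
        return sum(len(bucket) for bucket in self._free.values())

    def get(self, nbytes: int) -> bytearray:
        bucket = self._free[nbytes]
        if bucket:
            self.hits += 1           # reuse: no new allocation
            return bucket.pop()
        self.misses += 1             # nothing pooled for this size: allocate
        return bytearray(nbytes)

    def give_back(self, buf: bytearray) -> bool:
        if self.pooled() >= self.max_pool_size:
            return False             # pool full: let the buffer be garbage-collected
        self._free[len(buf)].append(buf)
        return True

    def cleanup(self, usage: float, threshold: float = 0.85, critical: float = 0.95) -> int:
        """Drop idle buffers according to memory pressure; returns how many were dropped."""
        before = self.pooled()
        if usage >= critical:
            self._free.clear()                   # emergency: drop everything
        elif usage >= threshold:
            for bucket in self._free.values():   # adaptive: keep the first half of each bucket
                del bucket[len(bucket) // 2:]
        return before - self.pooled()
```

With PyTorch, the allocation step would typically be `torch.empty(shape, dtype=dtype, device=device)` and the bucket key would be the shape and dtype rather than a byte count.
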
## Dependencies

- PyTorch (CUDA support recommended)
- Transformers
- BitsAndBytes (for 4-bit quantization, optional)

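Since BitsAndBytes is optional, code that enables 4-bit quantization may want to check for it first. The helpers below sketch one way to do that with the standard library; `optional_dependency_available` and `resolve_4bit` are hypothetical names, not part of this module.

```python
import importlib.util


def optional_dependency_available(module_name: str) -> bool:
    """True if a top-level module can be imported, without actually importing it."""
    return importlib.util.find_spec(module_name) is not None


def resolve_4bit(requested: bool) -> bool:
    """Honor a 4-bit request only when the bitsandbytes package is installed."""
    return requested and optional_dependency_available("bitsandbytes")


if not resolve_4bit(True):
    print("bitsandbytes not found; continuing without 4-bit quantization")
```
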