Commit History

refactor: enhance model unloading and memory management for improved GPU efficiency
371aac9

Patryk Studzinski committed on

refactor: enable CPU offload and adjust model loading for improved performance
9ecca89

Patryk Studzinski committed on

refactor: disable KV cache to prevent quality degradation after multiple requests
4297da2

Patryk Studzinski committed on

refactor: enable 8-bit quantization and adjust device map for improved model loading diagnostics
19175de

Patryk Studzinski committed on

refactor: disable 8-bit quantization and set device map to CPU when GPU is unavailable
31d96e8

Patryk Studzinski committed on

refactor: enable 8-bit quantization for improved memory efficiency in Transformers model loading
4a88d6f

Patryk Studzinski committed on

refactor: disable 8-bit quantization and CPU offload for optimized model loading on T4 GPUs
b95b5b2

Patryk Studzinski committed on

fix: improve error handling during model loading and fallback for quantization failures
0916214

Patryk Studzinski committed on

refactor: remove unused model configurations and streamline model creation logic
36a4581

Patryk Studzinski committed on

fix: streamline CPU offload handling in model loading for better memory management
1784558

Patryk Studzinski committed on

feat: add CPU offload support for Transformers model to optimize memory usage
f639230

Patryk Studzinski committed on

feat: add Transformers model support with GPU optimization and 8-bit quantization
470149b

Patryk Studzinski committed on

feat: add model size and polish support to model info
b31e4c3

Patryk Studzinski committed on

feat: enable GPU acceleration for Bielik GGUF models
7c2f84b

Patryk Studzinski committed on

update Dockerfile and README.md to replace Qwen2.5-3B and Gemma-2-2B with Bielik-1.5B-GGUF; adjust model loading instructions in the API documentation
812e56d

Patryk Studzinski committed on

update HuggingFaceInferenceAPI comment for clarity; change huggingface_hub version to minimum required
f4ce3a1

Patryk Studzinski committed on

add model management methods to ModelRegistry; include model listing, loading, and unloading functionalities
c50ae32

Patryk Studzinski committed on

add HuggingFace Inference API model; implement async initialization and text generation with caching
b2cbc2b

Patryk Studzinski committed on

add GBNF grammar for car advertisement gap filling; update LlamaCppModel to support loading grammar from file
c14ac43

Patryk Studzinski committed on

add GBNF grammar utilities for structured LLM output; integrate grammar in model generation
329abd1

Patryk Studzinski committed on

update LlamaCppModel initialization parameters and enable verbose logging for model loading; update llama-cpp-python requirement
fb1531e

Patryk Studzinski committed on

enhance error handling in LlamaCppModel initialization; include full traceback on failure
cdff838

Patryk Studzinski committed on

add get_info method to return model details for /models endpoint
baa08b7

Patryk Studzinski committed on

add debug logging for batch infill and model generation processes; update bielik model configuration
9d2cc15

Patryk Studzinski committed on

increase context size and improve message handling in LlamaCppModel
db4996d

Patryk Studzinski committed on

adding-bielik-gguf
8cde7d1

Patryk Studzinski committed on

fix: Make 8-bit quantization opt-in and gracefully handle missing bitsandbytes
f740ffc

Patryk Studzinski committed on

fix: Add bitsandbytes to requirements and graceful fallback for 8-bit quantization
e0c72ee

Patryk Studzinski committed on

Add 8-bit quantization support for CPU environments (4-6x speedup)
cf7d274

Patryk Studzinski committed on

Fix: Remove unsupported use_xformers_attention parameter
9153886

Patryk Studzinski committed on

Fix: Use direct model.generate() with proper KV caching instead of pipeline
eaa2e37

Patryk Studzinski committed on

Add KV caching and batch processing optimizations for 5-10x speedup
ab2e415

Patryk Studzinski committed on

Fix gemma chat template fallback
42e3538

Patryk Studzinski committed on

pre-downloading-all-models-at-startup
cf748a3

Patryk Studzinski committed on

model-lazy-loading
b50a781

Patryk Studzinski committed on

first-imrpvement-commit
a7fd202

Patryk Studzinski committed on

adding-github-files-to-spaces
9a9ec03

Patryk Studzinski committed on