Commit History

refactor: enhance model unloading and memory management for improved GPU efficiency
371aac9

Patryk Studzinski committed on

refactor: enable CPU offload and adjust model loading for improved performance
9ecca89

Patryk Studzinski committed on

refactor: disable KV cache to prevent quality degradation after multiple requests
4297da2

Patryk Studzinski committed on

refactor: enable 8-bit quantization and adjust device map for improved model loading diagnostics
19175de

Patryk Studzinski committed on

refactor: disable 8-bit quantization and set device map to CPU when GPU is unavailable
31d96e8

Patryk Studzinski committed on

refactor: enable 8-bit quantization for improved memory efficiency in Transformers model loading
4a88d6f

Patryk Studzinski committed on

refactor: disable 8-bit quantization and CPU offload for optimized model loading on T4 GPUs
b95b5b2

Patryk Studzinski committed on

fix: improve error handling during model loading and fallback for quantization failures
0916214

Patryk Studzinski committed on

refactor: remove unused model configurations and streamline model creation logic
36a4581

Patryk Studzinski committed on

refactor: remove runtime installation of llama-cpp-python, now pre-installed via requirements.txt
45df19f

Patryk Studzinski committed on

feat: Add main backup and simplified service implementations with API endpoints
9222e8a

Patryk Studzinski committed on

fix: streamline CPU offload handling in model loading for better memory management
1784558

Patryk Studzinski committed on

feat: add CPU offload support for Transformers model to optimize memory usage
f639230

Patryk Studzinski committed on

feat: add Transformers model support with GPU optimization and 8-bit quantization
470149b

Patryk Studzinski committed on

feat: add model size and polish support to model info
b31e4c3

Patryk Studzinski committed on

fix: use prebuilt CUDA wheel for llama-cpp-python
3d43242

Patryk Studzinski committed on

fix: use python3.10 instead of python3.9 for ubuntu 22.04
9cab5ee

Patryk Studzinski committed on

fix: defer model downloads to first request
6415787

Patryk Studzinski committed on

refactor: defer llama-cpp-python install to runtime
1caee5e

Patryk Studzinski committed on

fix: use symlinks instead of update-alternatives for python
ba285b0

Patryk Studzinski committed on

fix: correct Dockerfile syntax for llama-cpp-python fallback
421d61e

Patryk Studzinski committed on

refactor: consolidate to single unified Dockerfile with GPU support
afbf927

Patryk Studzinski committed on

config: add GPU Dockerfile to README frontmatter
4349abd

Patryk Studzinski committed on

feat: add GPU-enabled Dockerfile.gpu for HF Spaces CUDA support
a957e36

Patryk Studzinski committed on

fix: graceful fallback for llama-cpp-python installation on HF Spaces
21b6bfe

Patryk Studzinski committed on

fix: enable CUDA compilation for llama-cpp-python
ba31957

Patryk Studzinski committed on

perf: defer llama-cpp-python build to runtime startup
4a91398

Patryk Studzinski committed on

fix: remove invalid chown command from Dockerfile
08f73ce

Patryk Studzinski committed on

feat: enable GPU acceleration for Bielik GGUF models
7c2f84b

Patryk Studzinski committed on

update Dockerfile and README.md to replace Qwen2.5-3B and Gemma-2-2B with Bielik-1.5B-GGUF; adjust model loading instructions in the API documentation
812e56d

Patryk Studzinski committed on

update HuggingFaceInferenceAPI comment for clarity; change huggingface_hub version to minimum required
f4ce3a1

Patryk Studzinski committed on

refine GBNF grammar for car advertisement; ensure compact JSON output and improve gap-item structure
068583f

Patryk Studzinski committed on

add model management methods to ModelRegistry; include model listing, loading, and unloading functionalities
c50ae32

Patryk Studzinski committed on

add HuggingFace Inference API model; implement async initialization and text generation with caching
b2cbc2b

Patryk Studzinski committed on

add GBNF grammar for car advertisement gap filling; update LlamaCppModel to support loading grammar from file
c14ac43

Patryk Studzinski committed on

add GBNF grammar utilities for structured LLM output; integrate grammar in model generation
329abd1

Patryk Studzinski committed on

enhance infill processing to handle custom messages; return cleaned output directly when provided
89e4dfe

Patryk Studzinski committed on

update llama-cpp-python installation to version 0.3.16 for improved compatibility
3aec39a

Patryk Studzinski committed on

install llama-cpp-python at runtime to avoid build issues in HuggingFace Spaces; update requirements.txt to reflect this change
c704a06

Patryk Studzinski committed on

update LlamaCppModel initialization parameters and enable verbose logging for model loading; update llama-cpp-python requirement
fb1531e

Patryk Studzinski committed on

enhance error handling in LlamaCppModel initialization; include full traceback on failure
cdff838

Patryk Studzinski committed on

add get_info method to return model details for /models endpoint
baa08b7

Patryk Studzinski committed on

add debug logging for batch infill and model generation processes; update bielik model configuration
9d2cc15

Patryk Studzinski committed on

increase context size and improve message handling in LlamaCppModel
db4996d

Patryk Studzinski committed on

update requirements with libmetadata
d9b1571

Patryk Studzinski committed on

dockerfile fix
87ebbc6

Patryk Studzinski committed on

fixing naming for bielik gguf
858725c

Patryk Studzinski committed on

improved-index-url
821afac

Patryk Studzinski committed on

fix-docker-error-for-gguf
2d2d7ff

Patryk Studzinski committed on

adding-bielik-gguf
8cde7d1

Patryk Studzinski committed on