CLARITYWORKING.TXT
Estado: FUNCIONANDO EN HUGGING FACE SPACE CON GPU
Fecha de validacion: 2026-05-08

Resumen corto
--------------
ClarityGuardAgent quedo funcionando con llama-server precompilado, CUDA 12.6.3,
modelo ClarityGuard-v2 y projector multimodal v2. El Space arranco, detecto GPU,
cargo el modelo, cargo el mmproj multimodal, proceso texto e imagen, y respondio
HTTP 200 por /v1/chat/completions.

Ruta local del Space usado
--------------------------
/home/charlie/Documents/claritynew/ClarityGuardAgent

Repo remoto del Space
---------------------
https://huggingface.co/spaces/CharlieBonito/ClarityGuardAgent

Modelo remoto usado por app.py
------------------------------
MODEL_REPO = CharlieBonito/clarity-guard-gemma4-7b
MODEL_FILE = ClarityGuard-v2.gguf
MMPROJ_FILE = mmproj-ClarityGuard-v2.gguf

Archivos GGUF activos en Hugging Face
-------------------------------------
ClarityGuard-v2.gguf
mmproj-ClarityGuard-v2.gguf

Archivos GGUF antiguos eliminados del repo del modelo
-----------------------------------------------------
Checkpoint-375-Ollama-Clean-7.5B-Q4_K_M.gguf
mmproj-Checkpoint-375-Ollama-Clean-BF16.gguf

Dockerfile que funciono
-----------------------
Base image:
nvidia/cuda:12.6.3-runtime-ubuntu22.04

El Dockerfile NO compila llama.cpp en Hugging Face. Copia binarios precompilados:
COPY bin/llama-server /opt/llama-cpp/llama-server
COPY bin/*.so* /usr/local/lib/

Paquetes runtime principales:
python3
python3-pip
git
git-lfs
curl
libgomp1

Binario llama-server que funciono
---------------------------------
El binario fue recompilado localmente dentro de Docker con CUDA 12.6.3 devel,
no con el CUDA local de la maquina. Esto evita que el binario pida libcudart.so.13.

Build directory local:
/home/charlie/Documents/llama.cpp/build-cuda126-75-89

Arquitecturas CUDA compiladas:
CMAKE_CUDA_ARCHITECTURES=75;89

Esto cubre:
75 = NVIDIA Tesla T4
89 = NVIDIA L4 / Ada

Comando conceptual usado
------------------------
Se compilo con CUDA 12.6.3 devel y flags equivalentes a:

cmake -B build-cuda126-75-89 \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="75;89" \
  -DGGML_NATIVE=OFF \
  -DGGML_LLAMAFILE=OFF \
  -DGGML_OPENMP=ON \
  -DGGML_AVX512=OFF \
  -DGGML_AVX512_VBMI=OFF \
  -DGGML_AVX512_VNNI=OFF \
  -DGGML_AVX512_BF16=OFF \
  -DCMAKE_CXX_FLAGS="-march=x86-64-v2" \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_BUILD_TESTS=OFF \
  -DLLAMA_BUILD_EXAMPLES=OFF

Tambien se uso linker contra los CUDA stubs para poder compilar sin driver NVIDIA
dentro del contenedor de build. En runtime, Hugging Face provee libcuda.so.1 desde
el driver del host GPU.

Dependencias verificadas del binario final
------------------------------------------
El binario final pide CUDA 12, no CUDA 13:
libcudart.so.12
libcublas.so.12
libcublasLt.so.12
libnccl.so.2

Importante:
libcuda.so.1 puede aparecer como "not found" en Docker local sin NVIDIA runtime.
Eso es normal. En Hugging Face con GPU, libcuda.so.1 la provee el driver del host.

Commits importantes
-------------------
7a88bb3 Update to ClarityGuard-v2 checkpoint-750
46995ed Remove old Checkpoint-375 GGUF files
65593ac Fix: use prebuilt llama-server binary, update to ClarityGuard-v2
d27e6ea Fix: use CUDA 12.6 image for libcudart.so.13 compatibility
cc68cc4 Fix: rebuild llama-server for CUDA 12.6 L4
86be4c3 Fix: rebuild llama-server for CUDA 12.6 T4 and L4

Ultimo commit funcional conocido:
86be4c3 Fix: rebuild llama-server for CUDA 12.6 T4 and L4

Problema anterior
-----------------
El primer binario subido habia sido compilado en la maquina local contra CUDA 13.
Por eso el Space fallaba con:

libcudart.so.13: cannot open shared object file: No such file or directory

Cambiar solo el Dockerfile a CUDA 12.6 no era suficiente porque el binario seguia
enlazado a CUDA 13. La solucion real fue recompilar llama-server contra CUDA 12.6.

Senales exactas del log que confirman que funciono
--------------------------------------------------
El contenedor arranco con CUDA 12.6.3:
CUDA Version 12.6.3

El Space detecto GPU:
ggml_cuda_init: found 1 CUDA devices

GPU detectada:
Device 0: Tesla T4, compute capability 7.5, VMM: yes, VRAM: 15095 MiB

El binario contiene kernels para T4 y L4:
CUDA : ARCHS = 750,890

El modelo se cargo en GPU:
llama_model_load_from_file_impl: using device CUDA0 (Tesla T4)

Capas offload a GPU:
load_tensors: offloading output layer to GPU
load_tensors: offloading 41 repeating layers to GPU
load_tensors: offloaded 43/43 layers to GPU

Buffers principales:
CPU model buffer size = 2730.00 MiB
CUDA0 model buffer size = 2868.05 MiB
CUDA0 KV buffer size = 192.00 MiB
CUDA0 compute buffer size = 574.02 MiB

Projector multimodal cargo:
srv load_model: loaded multimodal model, '/app/models/mmproj-ClarityGuard-v2.gguf'

Vision en GPU:
clip_ctx: CLIP using CUDA0 backend
has vision encoder

Audio en GPU:
has audio encoder
clip_ctx: CLIP using CUDA0 backend

Servidor activo:
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...

Primera request texto respondio:
done request: POST /v1/chat/completions 127.0.0.1 200

Request multimodal con imagen respondio:
processing image...
image slice encoded in 264 ms
image decoded (batch 1/1) in 102 ms
image processed in 366 ms
done request: POST /v1/chat/completions 127.0.0.1 200

Rendimiento observado
---------------------
Request texto:
prompt eval: 4364 tokens en 2708.55 ms = 1611.20 tokens/s
generation: 1516 tokens en 32369.46 ms = 46.83 tokens/s
total: 5880 tokens en 35078.01 ms

Request con imagen:
prompt eval: 5699 tokens en 3776.40 ms = 1509.11 tokens/s
generation: 1456 tokens en 31612.40 ms = 46.06 tokens/s
total: 7155 tokens en 35388.80 ms

Configuracion runtime funcional
-------------------------------
CPU_THREADS=8
LLAMA_CTX=12288
LLAMA_MAX_TOKENS=8192
LLAMA_BATCH=1024
LLAMA_UBATCH=512
LLAMA_GPU_LAYERS=999
MMPROJ_OFFLOAD=True
RAG_TOP_K=4
RAG_MAX_CONTEXT_CHARS=9000

Comando efectivo de llama-server
--------------------------------
/opt/llama-cpp/llama-server \
  -m /app/models/ClarityGuard-v2.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  -c 12288 \
  -ngl 999 \
  -t 8 \
  -tb 8 \
  -np 1 \
  -b 1024 \
  -ub 512 \
  --threads-http 2 \
  --fit off \
  --no-mmap \
  --jinja \
  --mmproj /app/models/mmproj-ClarityGuard-v2.gguf

Nota sobre mensaje "CPU-only"
-----------------------------
El log de app.py dice "Lanzando llama-server CPU-only", pero ese texto esta
desactualizado. No significa que este corriendo en CPU. El comando incluye
-ngl 999 y el log de llama.cpp confirma CUDA0, offload 43/43 capas, CLIP en GPU
y CUDA ARCHS = 750,890.

Nota sobre Hugging Face token
-----------------------------
El log mostro:
Warning: You are sending unauthenticated requests to the HF Hub.

Eso no rompio el arranque. Solo puede afectar limites o velocidad de descarga.
Si se quiere evitar, configurar HF_TOKEN como secret del Space.

Conclusion
----------
La combinacion que funciono fue:
1. Docker runtime nvidia/cuda:12.6.3-runtime-ubuntu22.04.
2. llama-server precompilado con CUDA 12.6.
3. CMAKE_CUDA_ARCHITECTURES=75;89.
4. binarios subidos al Space por Git LFS.
5. modelo ClarityGuard-v2.gguf.
6. mmproj-ClarityGuard-v2.gguf.
7. LLAMA_GPU_LAYERS=999 y MMPROJ_OFFLOAD=True.

Estado final:
FUNCIONA EN GPU T4 Y DEBE FUNCIONAR TAMBIEN EN L4 PORQUE EL BINARIO INCLUYE
KERNELS CUDA PARA 75 Y 89.