Spaces:

CharlieBonito
/

ClarityGuardAgent

Sleeping

App Files Files Community

ClarityGuardAgent / CLARITYWORKING.TXT

CharlieBonito

Docs: record working ClarityGuard GPU configuration

88de52a about 1 month ago

raw

history blame contribute delete

7.31 kB

	CLARITYWORKING.TXT
	Estado: FUNCIONANDO EN HUGGING FACE SPACE CON GPU
	Fecha de validacion: 2026-05-08

	Resumen corto
	--------------
	ClarityGuardAgent quedo funcionando con llama-server precompilado, CUDA 12.6.3,
	modelo ClarityGuard-v2 y projector multimodal v2. El Space arranco, detecto GPU,
	cargo el modelo, cargo el mmproj multimodal, proceso texto e imagen, y respondio
	HTTP 200 por /v1/chat/completions.

	Ruta local del Space usado
	--------------------------
	/home/charlie/Documents/claritynew/ClarityGuardAgent

	Repo remoto del Space
	---------------------
	https://huggingface.co/spaces/CharlieBonito/ClarityGuardAgent

	Modelo remoto usado por app.py
	------------------------------
	MODEL_REPO = CharlieBonito/clarity-guard-gemma4-7b
	MODEL_FILE = ClarityGuard-v2.gguf
	MMPROJ_FILE = mmproj-ClarityGuard-v2.gguf

	Archivos GGUF activos en Hugging Face
	-------------------------------------
	ClarityGuard-v2.gguf
	mmproj-ClarityGuard-v2.gguf

	Archivos GGUF antiguos eliminados del repo del modelo
	-----------------------------------------------------
	Checkpoint-375-Ollama-Clean-7.5B-Q4_K_M.gguf
	mmproj-Checkpoint-375-Ollama-Clean-BF16.gguf

	Dockerfile que funciono
	-----------------------
	Base image:
	nvidia/cuda:12.6.3-runtime-ubuntu22.04

	El Dockerfile NO compila llama.cpp en Hugging Face. Copia binarios precompilados:
	COPY bin/llama-server /opt/llama-cpp/llama-server
	COPY bin/.so /usr/local/lib/

	Paquetes runtime principales:
	python3
	python3-pip
	git
	git-lfs
	curl
	libgomp1

	Binario llama-server que funciono
	---------------------------------
	El binario fue recompilado localmente dentro de Docker con CUDA 12.6.3 devel,
	no con el CUDA local de la maquina. Esto evita que el binario pida libcudart.so.13.

	Build directory local:
	/home/charlie/Documents/llama.cpp/build-cuda126-75-89

	Arquitecturas CUDA compiladas:
	CMAKE_CUDA_ARCHITECTURES=75;89

	Esto cubre:
	75 = NVIDIA Tesla T4
	89 = NVIDIA L4 / Ada

	Comando conceptual usado
	------------------------
	Se compilo con CUDA 12.6.3 devel y flags equivalentes a:

	cmake -B build-cuda126-75-89 \
	-DGGML_CUDA=ON \
	-DCMAKE_CUDA_ARCHITECTURES="75;89" \
	-DGGML_NATIVE=OFF \
	-DGGML_LLAMAFILE=OFF \
	-DGGML_OPENMP=ON \
	-DGGML_AVX512=OFF \
	-DGGML_AVX512_VBMI=OFF \
	-DGGML_AVX512_VNNI=OFF \
	-DGGML_AVX512_BF16=OFF \
	-DCMAKE_CXX_FLAGS="-march=x86-64-v2" \
	-DCMAKE_BUILD_TYPE=Release \
	-DLLAMA_BUILD_TESTS=OFF \
	-DLLAMA_BUILD_EXAMPLES=OFF

	Tambien se uso linker contra los CUDA stubs para poder compilar sin driver NVIDIA
	dentro del contenedor de build. En runtime, Hugging Face provee libcuda.so.1 desde
	el driver del host GPU.

	Dependencias verificadas del binario final
	------------------------------------------
	El binario final pide CUDA 12, no CUDA 13:
	libcudart.so.12
	libcublas.so.12
	libcublasLt.so.12
	libnccl.so.2

	Importante:
	libcuda.so.1 puede aparecer como "not found" en Docker local sin NVIDIA runtime.
	Eso es normal. En Hugging Face con GPU, libcuda.so.1 la provee el driver del host.

	Commits importantes
	-------------------
	7a88bb3 Update to ClarityGuard-v2 checkpoint-750
	46995ed Remove old Checkpoint-375 GGUF files
	65593ac Fix: use prebuilt llama-server binary, update to ClarityGuard-v2
	d27e6ea Fix: use CUDA 12.6 image for libcudart.so.13 compatibility
	cc68cc4 Fix: rebuild llama-server for CUDA 12.6 L4
	86be4c3 Fix: rebuild llama-server for CUDA 12.6 T4 and L4

	Ultimo commit funcional conocido:
	86be4c3 Fix: rebuild llama-server for CUDA 12.6 T4 and L4

	Problema anterior
	-----------------
	El primer binario subido habia sido compilado en la maquina local contra CUDA 13.
	Por eso el Space fallaba con:

	libcudart.so.13: cannot open shared object file: No such file or directory

	Cambiar solo el Dockerfile a CUDA 12.6 no era suficiente porque el binario seguia
	enlazado a CUDA 13. La solucion real fue recompilar llama-server contra CUDA 12.6.

	Senales exactas del log que confirman que funciono
	--------------------------------------------------
	El contenedor arranco con CUDA 12.6.3:
	CUDA Version 12.6.3

	El Space detecto GPU:
	ggml_cuda_init: found 1 CUDA devices

	GPU detectada:
	Device 0: Tesla T4, compute capability 7.5, VMM: yes, VRAM: 15095 MiB

	El binario contiene kernels para T4 y L4:
	CUDA : ARCHS = 750,890

	El modelo se cargo en GPU:
	llama_model_load_from_file_impl: using device CUDA0 (Tesla T4)

	Capas offload a GPU:
	load_tensors: offloading output layer to GPU
	load_tensors: offloading 41 repeating layers to GPU
	load_tensors: offloaded 43/43 layers to GPU

	Buffers principales:
	CPU model buffer size = 2730.00 MiB
	CUDA0 model buffer size = 2868.05 MiB
	CUDA0 KV buffer size = 192.00 MiB
	CUDA0 compute buffer size = 574.02 MiB

	Projector multimodal cargo:
	srv load_model: loaded multimodal model, '/app/models/mmproj-ClarityGuard-v2.gguf'

	Vision en GPU:
	clip_ctx: CLIP using CUDA0 backend
	has vision encoder

	Audio en GPU:
	has audio encoder
	clip_ctx: CLIP using CUDA0 backend

	Servidor activo:
	main: server is listening on http://127.0.0.1:8080
	main: starting the main loop...

	Primera request texto respondio:
	done request: POST /v1/chat/completions 127.0.0.1 200

	Request multimodal con imagen respondio:
	processing image...
	image slice encoded in 264 ms
	image decoded (batch 1/1) in 102 ms
	image processed in 366 ms
	done request: POST /v1/chat/completions 127.0.0.1 200

	Rendimiento observado
	---------------------
	Request texto:
	prompt eval: 4364 tokens en 2708.55 ms = 1611.20 tokens/s
	generation: 1516 tokens en 32369.46 ms = 46.83 tokens/s
	total: 5880 tokens en 35078.01 ms

	Request con imagen:
	prompt eval: 5699 tokens en 3776.40 ms = 1509.11 tokens/s
	generation: 1456 tokens en 31612.40 ms = 46.06 tokens/s
	total: 7155 tokens en 35388.80 ms

	Configuracion runtime funcional
	-------------------------------
	CPU_THREADS=8
	LLAMA_CTX=12288
	LLAMA_MAX_TOKENS=8192
	LLAMA_BATCH=1024
	LLAMA_UBATCH=512
	LLAMA_GPU_LAYERS=999
	MMPROJ_OFFLOAD=True
	RAG_TOP_K=4
	RAG_MAX_CONTEXT_CHARS=9000

	Comando efectivo de llama-server
	--------------------------------
	/opt/llama-cpp/llama-server \
	-m /app/models/ClarityGuard-v2.gguf \
	--host 127.0.0.1 \
	--port 8080 \
	-c 12288 \
	-ngl 999 \
	-t 8 \
	-tb 8 \
	-np 1 \
	-b 1024 \
	-ub 512 \
	--threads-http 2 \
	--fit off \
	--no-mmap \
	--jinja \
	--mmproj /app/models/mmproj-ClarityGuard-v2.gguf

	Nota sobre mensaje "CPU-only"
	-----------------------------
	El log de app.py dice "Lanzando llama-server CPU-only", pero ese texto esta
	desactualizado. No significa que este corriendo en CPU. El comando incluye
	-ngl 999 y el log de llama.cpp confirma CUDA0, offload 43/43 capas, CLIP en GPU
	y CUDA ARCHS = 750,890.

	Nota sobre Hugging Face token
	-----------------------------
	El log mostro:
	Warning: You are sending unauthenticated requests to the HF Hub.

	Eso no rompio el arranque. Solo puede afectar limites o velocidad de descarga.
	Si se quiere evitar, configurar HF_TOKEN como secret del Space.

	Conclusion
	----------
	La combinacion que funciono fue:
	1. Docker runtime nvidia/cuda:12.6.3-runtime-ubuntu22.04.
	2. llama-server precompilado con CUDA 12.6.
	3. CMAKE_CUDA_ARCHITECTURES=75;89.
	4. binarios subidos al Space por Git LFS.
	5. modelo ClarityGuard-v2.gguf.
	6. mmproj-ClarityGuard-v2.gguf.
	7. LLAMA_GPU_LAYERS=999 y MMPROJ_OFFLOAD=True.

	Estado final:
	FUNCIONA EN GPU T4 Y DEBE FUNCIONAR TAMBIEN EN L4 PORQUE EL BINARIO INCLUYE
	KERNELS CUDA PARA 75 Y 89.