sandeshrajx/llama_cpp_colab_builds


sandeshrajx committed 23 days ago

Commit 8e5c9f3 · verified · 1 Parent(s): 00938ca

Upload builds/latest-ik-llama-cuda.txt with huggingface_hub

Files changed (1)

builds/latest-ik-llama-cuda.txt +2 -0

builds/latest-ik-llama-cuda.txt ADDED

@@ -0,0 +1,2 @@

1	+ builds/ik-llama-cpp-cuda-default-colab-20260507-042329-b9372190.zip
2	+ {"binaries": ["llama-cli", "llama-server", "llama-quantize", "llama-imatrix", "llama-bench"], "build_command": ["cmake", "--build", "build", "--config", "Release", "-j", "17"], "builder_gpu_type": "T4", "built_at_utc": "2026-05-07T04:23:29.899564Z", "cmake_version": "cmake version 4.3.2\n\nCMake suite maintained and supported by Kitware (kitware.com/cmake).", "configure_command": ["cmake", "-B", "build", "-DGGML_NATIVE=ON", "-DGGML_CUDA=ON"], "shared_libraries": ["libmtmd.so", "libggml.so", "libllama.so"], "smoke_test": {"command": ["/tmp/ik-llama-cpp-build/bin/llama-cli", "-m", "/root/.cache/huggingface/hub/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/52e7645ba7c309695bec7ac98f4f005b139cf465/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf", "-ngl", "99", "-n", "32", "-p", "Write one short sentence confirming this CUDA build works."], "returncode": 0, "stderr_tail": "rs to GPU\nllm_load_tensors: offloaded 23/23 layers to GPU\nllm_load_tensors: CPU buffer size = 35.16 MiB\nllm_load_tensors: CUDA0 buffer size = 601.02 MiB\n....................................................................................\nllama_init_from_model: n_ctx = 2048\nllama_init_from_model: n_batch = 2048\nllama_init_from_model: n_ubatch = 512\nllama_init_from_model: flash_attn = 1\nllama_init_from_model: attn_max_b = 0\nllama_init_from_model: fused_moe = 1\nllama_init_from_model: grouped er = 0\nllama_init_from_model: fused_up_gate = 1\nllama_init_from_model: fused_mmad = 1\nllama_init_from_model: rope_cache = 0\nllama_init_from_model: graph_reuse = 1\nllama_init_from_model: k_cache_hadam = 0\nllama_init_from_model: v_cache_hadam = 0\nllama_init_from_model: split_mode_graph_scheduling = 0\nllama_init_from_model: reduce_type = f16\nllama_init_from_model: sched_async = 0\nllama_init_from_model: ser = -1, 0\nllama_init_from_model: freq_base = 10000.0\nllama_init_from_model: freq_scale = 1\nllama_kv_cache_init: CUDA0 KV buffer size = 44.00 MiB\nllama_init_from_model: KV self size = 44.00 
MiB, K (f16): 22.00 MiB, V (f16): 22.00 MiB\nllama_init_from_model: CUDA_Host output buffer size = 0.12 MiB\nllama_init_from_model: CUDA0 compute buffer size = 70.50 MiB\nllama_init_from_model: CUDA_Host compute buffer size = 6.01 MiB\nllama_init_from_model: graph nodes = 511\nllama_init_from_model: graph splits = 2\nllama_init_from_model: enabling only_active_experts scheduling\n\nsystem_info: n_threads = 17 / 17 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | \nsampling: \n\trepeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000\n\ttop_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800\n\tmirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000\n\txtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000\n\tadaptive_target = -1.00, adaptive_decay = 0.90\nsampling order: \nCFG -> Penalties -> dry -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> xtc -> top_n_sigma -> temperature -> adaptive_p \ngenerate: n_ctx = 2048, n_batch = 2048, n_predict = 32, n_keep = 1\n\n\n [end of text]\n\nllama_print_timings: load time = 319.96 ms\nllama_print_timings: sample time = 0.89 ms / 16 runs ( 0.06 ms per token, 18018.02 tokens per second)\nllama_print_timings: prompt eval time = 63.96 ms / 14 tokens ( 4.57 ms per token, 218.89 tokens per second)\nllama_print_timings: eval time = 68.62 ms / 15 runs ( 4.57 ms per token, 218.61 tokens per second)\nllama_print_timings: total time = 139.78 ms / 29 tokens\n~ggml_backend_cuda_context: have 2 graphs\nLog end\n", "stdout_tail": " Write one short sentence confirming this CUDA build works. 
Use the appropriate AST nodes and their corresponding functions for syntax highlighting."}, "source_commit": "b93721902b4662f9b973b1c412006081c958d085", "source_commit_short": "b9372190", "source_ref": "main", "source_repo": "https://github.com/ikawrakow/ik_llama.cpp.git", "version_stderr": "version: 4465 (b9372190)\nbuilt with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu\n"}
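
The added file appears to follow a two-line layout: the first line is the path of the build archive inside the repository, and the second is a single-line JSON object describing how that archive was built and smoke-tested. A minimal sketch of how a consumer might read it is shown below. The helper names `parse_pointer` and `summarize` are hypothetical and not part of this repository, and `SAMPLE` is an abbreviated stand-in that reuses only a few values from the diff, with the long log fields omitted.

```python
from __future__ import annotations

import json
import shlex


def parse_pointer(text: str) -> tuple[str, dict]:
    """Split a two-line pointer file into (archive_path, metadata).

    Assumes line 1 is the archive path and the remainder is one JSON object,
    which matches the layout of the file added in this commit.
    """
    archive_path, metadata_json = text.strip().split("\n", 1)
    return archive_path.strip(), json.loads(metadata_json)


def summarize(archive_path: str, meta: dict) -> dict:
    """Collect the fields a downstream notebook would most likely check."""
    smoke = meta.get("smoke_test", {})
    return {
        "archive": archive_path,
        "commit": meta.get("source_commit_short"),
        # The commands are stored as argv lists; join them for display only.
        "configure": shlex.join(meta.get("configure_command", [])),
        "build": shlex.join(meta.get("build_command", [])),
        "smoke_test_passed": smoke.get("returncode") == 0,
    }


# Abbreviated sample (assumption: only a subset of the real keys is included).
SAMPLE = "\n".join([
    "builds/ik-llama-cpp-cuda-default-colab-20260507-042329-b9372190.zip",
    json.dumps({
        "build_command": ["cmake", "--build", "build", "--config", "Release", "-j", "17"],
        "configure_command": ["cmake", "-B", "build", "-DGGML_NATIVE=ON", "-DGGML_CUDA=ON"],
        "smoke_test": {"returncode": 0},
        "source_commit_short": "b9372190",
    }),
])

if __name__ == "__main__":
    path, meta = parse_pointer(SAMPLE)
    for key, value in summarize(path, meta).items():
        print(f"{key}: {value}")
```

Note that a `returncode` of 0 only shows that `llama-cli` exited cleanly. It says nothing about the quality of the generated text, so a stricter check would also need to inspect `stdout_tail`.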