
Llama.cpp GPU Offloading Issue - Unexpected Switch to CPU

I'm reaching out to the community for help with an issue in llama.cpp. The program was previously executing on the GPU, but it has recently switched to CPU execution.

Observations:

- BLAS = 1 is set, indicating that BLAS routines are in use (likely for linear algebra operations).
- llm_load_print_meta: LF token = 13 '<0x0A>' (metadata-loading output; its exact meaning may be context-dependent).
- llm_load_tensors: ggml ctx size = 0.11 MiB (the size of the ggml context, which seems relatively small).
- llm_load_tensors: offloading 0 repeating layers to GPU (no repeating layers are being offloaded to the GPU).
- llm_load_tensors: offloaded 0/33 layers to GPU (none of the model's 33 layers have been offloaded to the GPU).
- llm_load_tensors: CPU buffer size = 7338.64 MiB (the model weights are being held in CPU buffers rather than on the GPU).

Questions:

- Has anyone else encountered llama.cpp switching from GPU to CPU execution?
- Are there any known configuration changes or environmental factors that might cause this behavior?
- Could there be specific conditions in my code that are preventing GPU offloading?

asked Dec 06 '25 by Montassar Jaziri


1 Answer

Try e.g. the parameter -ngl 100 (for the llama.cpp main binary) or --n_gpu_layers 100 (for llama-cpp-python) to offload layers to the GPU. The offload count defaults to 0, which matches the "offloaded 0/33 layers to GPU" line in your log; see the sketch below.
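For the llama-cpp-python route, a minimal sketch follows. It assumes llama-cpp-python was installed with a GPU backend (e.g. CUDA); the model path and prompt are placeholders:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",  # placeholder; point this at your GGUF model
    n_gpu_layers=100,         # offload up to 100 layers; any value >= 33 covers this model
    verbose=True,             # prints the llm_load_tensors lines so you can verify offloading
)

# Simple completion call to confirm the model runs after offloading.
out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```

Any n_gpu_layers value at or above the model's layer count (33 per the log above) offloads the whole model. If the log still reports "offloaded 0/33 layers to GPU", the underlying llama.cpp build may lack GPU support; note that BLAS=1 can also come from a CPU BLAS backend such as OpenBLAS, so it does not by itself prove a GPU build.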

answered Dec 10 '25 by Chris A.


