I'm reaching out to the community for help with an issue I'm encountering in llama.cpp. Previously, the program ran successfully on the GPU, but recently it appears to have switched to CPU execution.
Observations:
BLAS=1 is set, indicating the use of BLAS routines (likely for linear algebra operations).
llm_load_print_meta: LF token = 13 '<0x0A>' (output related to loading metadata; its specific meaning may be context-dependent).
llm_load_tensors: ggml ctx size = 0.11 MiB (the size of the ggml memory context, which seems relatively small).
llm_load_tensors: offloading 0 repeating layers to GPU (no repeating layers are being offloaded to the GPU).
llm_load_tensors: offloaded 0/33 layers to GPU (no layers have been offloaded to the GPU).
llm_load_tensors: CPU buffer size = 7338.64 MiB (a significant amount of data is being loaded into CPU buffers).
Questions:
Has anyone else encountered a similar situation with llama.cpp switching from GPU to CPU execution?
Are there any known configuration changes or environmental factors that might be causing this behavior?
Could there be specific conditions in my code that are preventing GPU offloading?
Try, for example, the parameter -ngl 100 (for the llama.cpp main binary) or --n_gpu_layers 100 (for llama-cpp-python) to offload layers to the GPU. Your log shows "offloaded 0/33 layers to GPU", which means no offload count was requested, so all layers stay on the CPU.
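For reference, here is a minimal llama-cpp-python sketch of the same idea. It assumes your llama-cpp-python build was compiled with GPU support; the model path is a placeholder you would replace with your own GGUF file.

```python
# Minimal sketch: request GPU offload via n_gpu_layers in llama-cpp-python.
# Assumes a GPU-enabled build of llama-cpp-python; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/your-model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=100,  # more than the model's 33 layers, so all layers are offloaded
)

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```

If offloading is active, the load log should report a non-zero count in the "offloaded N/33 layers to GPU" line instead of 0/33.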