I am training a YOLOv8 model on CUDA using this code:
from ultralytics import YOLO
import torch
import os

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"  # work around the duplicate libiomp error on Windows

model = YOLO("yolov8n.pt")  # load a pretrained model (recommended for training)
results = model.train(data="data.yaml", epochs=15, workers=0, batch=12)  # train on my dataset
results = model.val()        # evaluate on the validation split
model.export(format="onnx")  # export the trained model to ONNX
and I am getting NaN for all losses:
Epoch GPU_mem box_loss cls_loss dfl_loss Instances Size
1/15 1.74G nan nan nan 51 640: 4%
I have tried training the model on CPU and it worked fine; the problem appeared only after I installed CUDA and started training on the GPU.
I expected there to be an error reading the data or something similar, but everything else works fine.
I think it has something to do with memory: when I decreased the image size, the model trained fine, but when I increased the batch size at that same reduced image size it showed NaN again. So there seems to be a trade-off between image size, batch size, and memory. I am not 100% sure this is right, but it is what I found by experiment. If you have a better answer to this problem, please share it.
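For reference, this is roughly how I varied the two settings through the Python API. The imgsz and batch values here are only illustrative, not the exact numbers I used:

from ultralytics import YOLO

# Sketch of the experiment described above (values are illustrative only):
# lowering imgsz and/or batch reduces GPU memory pressure, which is when the
# NaN losses disappeared for me.
model = YOLO("yolov8n.pt")
model.train(data="data.yaml", epochs=15, workers=0, imgsz=416, batch=4)     # smaller images, small batch -> trained fine
# model.train(data="data.yaml", epochs=15, workers=0, imgsz=416, batch=24)  # same image size, larger batch -> NaN again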
I had the same issue. Even after upgrading ultralytics to its latest version (8.0.94) and lowering the batch size, it did not help.
When I set the device to CPU (device=cpu), it works perfectly fine.
So the problem was mainly with the GPU. As suggested in the GitHub issue, setting amp=False fixed it and I was able to train on the GPU:
yolo task=detect mode=train model=yolov8s.pt data="data.yaml" epochs=20 batch=2 imgsz=640 device=0 workers=8 optimizer=Adam pretrained=true val=true plots=true save=True show=true optimize=true lr0=0.001 lrf=0.01 fliplr=0.0 amp=False
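If you are using the Python API from the question instead of the CLI, the equivalent fix (a minimal sketch, assuming the same data.yaml) is to pass amp=False to model.train():

from ultralytics import YOLO

# Same fix through the Python API: disable automatic mixed precision (AMP),
# which was producing NaN losses on this GPU.
model = YOLO("yolov8n.pt")
model.train(data="data.yaml", epochs=15, workers=0, batch=12, device=0, amp=False)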