I would like to make the results of a text classification model (finBERT, a PyTorch model) available through an endpoint deployed on Kubernetes.
The whole pipeline works, but it is very slow when deployed: about 30 seconds to process one sentence. If I time the same endpoint locally, I get results in 1 or 2 seconds. Running the Docker image locally, the endpoint also takes about 2 seconds to return a result.
When I check the CPU usage of my Kubernetes instance while the request is running, it doesn't go above 35%, so I'm not sure the slowness is related to a lack of compute power.
Has anyone seen such performance issues when running a forward pass through a PyTorch model? Any clues on what I should investigate?
Any help is greatly appreciated, thank you!
I am currently using:

limits:
  cpu: "2"
requests:
  cpu: "1"

Python: 3.7
PyTorch: 1.8.1
I had the same issue. Locally, my PyTorch model would return a prediction in 25 ms, but on Kubernetes it would take 5 seconds. The problem had to do with how many threads torch had available to use. I'm not 100% sure why this works, but reducing the number of threads sped up inference significantly.
Set the following environment variable on your Kubernetes pod.
OMP_NUM_THREADS=1
After doing that, it performed on Kubernetes like it did locally: ~30 ms per call.
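In case it helps, here is a minimal sketch of how that variable could be set in a Deployment manifest (the names, image, and resource values are placeholders, not from the original setup):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: finbert-api              # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: finbert-api
  template:
    metadata:
      labels:
        app: finbert-api
    spec:
      containers:
        - name: finbert-api
          image: registry.example.com/finbert-api:latest   # placeholder image
          env:
            - name: OMP_NUM_THREADS   # cap the OpenMP threads used by PyTorch's intra-op parallelism
              value: "1"
          resources:                  # placeholder values; use your own requests/limits
            requests:
              cpu: "1"
            limits:
              cpu: "2"

If you prefer setting this in code rather than in the pod spec, calling torch.set_num_threads(1) at startup should have a similar effect on intra-op thread usage.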
These are my pod limits:

11500m

I was led to discover this from this blog post: https://www.chunyangwen.com/blog/python/pytorch-slow-inference.html