Starting up the CUDA runtime takes a certain amount of time to harmonize the UVM memory maps of the device and the host; see:
Now, it's been suggested to me that using Persistence Mode would mitigate this phenomenon significantly. In what way? I mean, what will happen, or fail to happen, when persistence mode is on, and a process using CUDA exists?
The documentation says:
Persistence Mode is the term for a user-settable driver property that keeps a target GPU initialized even when no clients are connected to it.
but - what does "keeping initialized" mean? Later, the section about the persistence daemon (which is not the same thing as persistence mode) says:
The GPU state remains loaded in the driver whenever one or more clients have the device file open. Once all clients have closed the device file, the GPU state will be unloaded unless persistence mode is enabled.
So what exactly is unloaded? To where it is unloaded? How does it relate to the memory size? And why would it take so much time to load it back if nothing significant has happend on the system?
Persistence Mode is the term for a user-settable driver property that keeps a target GPU. initialized even when no clients are connected to it.
Persistence Mode (Legacy) Persistence Mode is the term for a user-settable driver property that keeps a target GPU initialized even when no clients are connected to it.
It is strongly recommended, though not required, that the daemon be run as a non-root user for security purposes. The daemon may either be started with root privileges and the --user option, or it may be run directly as the non-root user.
The NVIDIA System Management Interface (nvidia-smi) is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
There are 2 major pieces to the GPU/CUDA start-up sequence:
A modern CUDA GPU can exist in one of several power states.  The current power state is observable via nvidia-smi or via NVML (although note that the effect of running a tool like nvidia-smi may modify the power state of a GPU.)  When the GPU is not being used for any purpose (i.e. it is idle, technically: no contexts of any kind are instantiated on the GPU) and persistence mode is not enabled, the GPU, in concert with the GPU driver, will automatically reduce its power state to a very low level, sometimes including a complete power-off scenario.
The process of moving a GPU to a lower power state will involve shutting off or modifying the behavior of various pieces of hardware. For example, reducing memory clocks, reducing core clocks, shutting off display output, shutting off the memory subsystem, shutting off various internal subsystems such as clock generators, and even major parts of the chip, such as the compute cores, caches, etc. and potentially even a "complete" power-down of the chip. A modern GPU has a controllable power delivery system, both on-chip and off-chip, to enable this behavior.
To reverse this process, the GPU driver software must carefully (in a prescribed sequence) power up modules, wait for a hardware settling time, then apply a module-level reset, then begin initializing controlling registers in the module. For example, powering up memory would involve, amongst other things, turning on the on-chip DRAM control module, turning on DRAM power, turning on the memory pin drivers, setting slew rates, turning on the memory clock, initializing the memory clock generator PLL for desired operation, and in many cases, initializing memory to some known state. For example, proper ECC usage requires that memory be initialized to a known state, which may not be simply all zeroes, but involves ECC tags which must be computed and stored. This "ECC Scrub" is one example of a "time-consuming" process mentioned in the documentation.
Depending on the exact power state, there may be any number of things that the driver must do to bring the GPU to the next higher power state (or "performance state"), P0 being the highest state. Once the perf state is above a certain level (say, P8) then the GPU may be capable of supporting certain types of contexts (e.g. a compute context) but perhaps at a reduced performance level (unless you are at P0).
These operations take time, and persistence mode will generally keep the GPU at power/perf state P2 or P0, meaning that essentially none of the above steps must be performed if it is desired that a context be opened on the GPU.
However, opening a GPU context may involve start-up costs of its own, that the GPU cannot or does not keep track of. For example, opening a compute context in a UVA regime requires, among other things, that "virtual allocations" be requested of the host OS, and that the memory maps of all processors in the system (all "visible" GPUs, plus the CPU) be "harmonized" so that everyone has a unique space to work in, and the numerical value of a 64-bit pointer in the space can be used to uniquely determine "ownership" or "meaning/introspection" of that pointer.
For the most part, activities related to opening a CUDA context (other than the process of bringing the device to a state where it can support a context) will not be impacted or benefitted by having the GPU in persistence mode.
Since both device initialization, and CUDA context creation may impact perceived "CUDA startup time", then persistence mode may improve/mitigate the overall perceived start-up time, but it cannot reduce it to zero, since some activities associated with context creation are outside of its purview.
The exact behavior of persistence mode may vary over time and by GPU type. Recently, it seems that persistence mode may still allow GPUs to move down to a power state of P8.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With