- bookworm 3.17.1-2+deb12u1
ROCM SUPPORT(1)
NAME
amdgpu_plugin - A plugin extension to CRIU to support checkpoint/restore in
userspace for AMD GPUs.
CURRENT SUPPORT
Single and Multi GPU systems (Gfx9)
Checkpoint / Restore on different system
Checkpoint / Restore inside a docker container
Pytorch
Tensorflow
Using CRIU Image Streamer
DESCRIPTION
criu is a capable tool for checkpointing and restoring running applications, but it has limitations: for example, it cannot handle applications that have device files open. To support ROCm-based workloads, criu's core functionality has to be augmented through a plugin-based extension mechanism. amdgpu_plugin provides the support criu needs to checkpoint and restore applications that use ROCm.
Dependencies
amdkfd support
criu 3.16
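The two dependencies above can be checked before attempting a checkpoint. The following is a minimal sketch, not part of the plugin: it assumes that "criu 3.16" is meant as a minimum version, and that amdkfd support shows up as the /dev/kfd device node. The helper names are hypothetical.

```python
import os

# Assumption: the "criu 3.16" dependency is read as a minimum version.
MIN_CRIU = (3, 16)
# Assumption: amdkfd support is detected via its device node; this path
# is not stated on this page.
KFD_DEVICE = "/dev/kfd"


def parse_version(text):
    """Turn a dotted version string such as '3.17.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def criu_is_new_enough(version_text, minimum=MIN_CRIU):
    """Return True if the given criu version is at least the minimum."""
    return parse_version(version_text) >= minimum


def kfd_available(path=KFD_DEVICE):
    """Return True if the KFD device node exists on this machine."""
    return os.path.exists(path)


if __name__ == "__main__":
    # The version string here is only an example input.
    print("criu ok:", criu_is_new_enough("3.17.1"))
    print("kfd present:", kfd_available())
```

The version string is passed in by the caller, so the sketch makes no assumption about how criu reports its own version.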
OPTIONS
Optional parameters can be passed as environment variables, set before executing the criu command.
KFD_FW_VER_CHECK
E.g: KFD_FW_VER_CHECK=0
KFD_SDMA_FW_VER_CHECK
E.g: KFD_SDMA_FW_VER_CHECK=0
KFD_CACHES_COUNT_CHECK
E.g: KFD_CACHES_COUNT_CHECK=0
KFD_NUM_GWS_CHECK
E.g: KFD_NUM_GWS_CHECK=0
KFD_VRAM_SIZE_CHECK
E.g: KFD_VRAM_SIZE_CHECK=0
KFD_NUMA_CHECK
E.g: KFD_NUMA_CHECK=1
KFD_CAPABILITY_CHECK
E.g: KFD_CAPABILITY_CHECK=1
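The variables listed above are read from the environment of the criu process. As a minimal sketch of setting them from a script, the hypothetical helpers below build an environment and a criu command line without running anything. The `restore`, `--images-dir` and `--shell-job` arguments are ordinary criu usage rather than something taken from this page, the image directory is a placeholder, and this page does not state what each value means beyond the examples shown.

```python
import os

# Variable names taken from the OPTIONS section of this page.
KNOWN_CHECKS = (
    "KFD_FW_VER_CHECK",
    "KFD_SDMA_FW_VER_CHECK",
    "KFD_CACHES_COUNT_CHECK",
    "KFD_NUM_GWS_CHECK",
    "KFD_VRAM_SIZE_CHECK",
    "KFD_NUMA_CHECK",
    "KFD_CAPABILITY_CHECK",
)


def criu_env(overrides, base=None):
    """Return a copy of the environment with the given KFD_* variables set."""
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        if name not in KNOWN_CHECKS:
            raise ValueError(f"not a documented option: {name}")
        env[name] = str(value)  # environment values must be strings
    return env


def criu_restore_argv(images_dir):
    """Build a criu restore command line (hypothetical helper)."""
    # Assumption: a plain restore from an image directory; adjust as needed.
    return ["criu", "restore", "--images-dir", images_dir, "--shell-job"]


if __name__ == "__main__":
    env = criu_env({"KFD_FW_VER_CHECK": 0, "KFD_NUMA_CHECK": 1})
    argv = criu_restore_argv("/tmp/checkpoint")  # placeholder directory
    print(argv)
    # To actually run it (typically requires root and a ROCm-capable system):
    #   import subprocess
    #   subprocess.run(argv, env=env, check=True)
```

Rejecting names that are not in the documented list catches misspelled variables, which would otherwise be silently ignored.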
AUTHOR
The AMDKFD team.
COPYRIGHT
Copyright (C) 2020-2021, Advanced Micro Devices, Inc. (AMD)
11/20/2024