ROCM SUPPORT(1)
NAME
criu-amdgpu-plugin - A plugin extension to CRIU to support checkpoint/restore
in userspace for AMD GPUs.
CURRENT SUPPORT
Single and Multi GPU systems (Gfx9)
Checkpoint / Restore on a different system
Checkpoint / Restore inside a Docker container
PyTorch
TensorFlow
Using CRIU Image Streamer
DESCRIPTION
Although criu is a capable tool for checkpointing and restoring running applications, it has limitations. In particular, it cannot handle applications that have device files open. To support ROCm-based workloads, criu's core functionality must be augmented through its plugin-based extension mechanism. criu-amdgpu-plugin provides the support criu needs to allow Checkpoint / Restore with ROCm.
DEPENDENCIES
amdkfd support
OPTIONS
Optional parameters can be passed as environment variables, set before executing the criu command.
KFD_FW_VER_CHECK
E.g: KFD_FW_VER_CHECK=0
KFD_SDMA_FW_VER_CHECK
E.g: KFD_SDMA_FW_VER_CHECK=0
KFD_CACHES_COUNT_CHECK
E.g: KFD_CACHES_COUNT_CHECK=0
KFD_NUM_GWS_CHECK
E.g: KFD_NUM_GWS_CHECK=0
KFD_VRAM_SIZE_CHECK
E.g: KFD_VRAM_SIZE_CHECK=0
KFD_NUMA_CHECK
E.g: KFD_NUMA_CHECK=1
KFD_CAPABILITY_CHECK
E.g: KFD_CAPABILITY_CHECK=1
KFD_MAX_BUFFER_SIZE
E.g: KFD_MAX_BUFFER_SIZE="2G"
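The options above are ordinary environment variables, so one way to apply them is to set them for a single invocation only. The following is a minimal sketch, not part of criu or the plugin: the helper name criu_with_kfd_opts and the ./images directory are hypothetical, and it assumes that a value of 0 disables a check, which the examples suggest but this page does not state.

```shell
# Hypothetical helper (not shipped with criu or the plugin): runs the
# given command with a few plugin options set for that invocation only.
criu_with_kfd_opts() {
    # Assumption: 0 turns a check off; this page does not say so explicitly.
    env KFD_FW_VER_CHECK=0 \
        KFD_SDMA_FW_VER_CHECK=0 \
        KFD_MAX_BUFFER_SIZE="2G" \
        "$@"
}

# Example invocation, left commented out because it needs criu, root
# privileges, and an existing image directory (./images is a placeholder):
#   criu_with_kfd_opts criu restore -D ./images --shell-job
```

Because the variables are passed through env rather than exported, they do not persist in the calling shell, so a later criu run without the helper uses whatever values the plugin would otherwise see.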
AUTHOR
The AMDKFD team.
COPYRIGHT
Copyright (C) 2020-2021, Advanced Micro Devices, Inc. (AMD)
01/09/2025