Thursday, January 18, 2024

Build PyTorch cpp_extension with CUDA arch error

 Error:

"cpp_extension.py", line 1561, in _get_cuda_arch_flags
    arch_list[-1] += '+PTX'

IndexError: list index out of range

Solution:

I looked into this issue after hitting the same problem in one of my Docker containers.

If you build through a setup.py, set TORCH_CUDA_ARCH_LIST="YOUR_GPUs_CC+PTX" for the build. The variable assignment goes before python, not between python and setup.py:

TORCH_CUDA_ARCH_LIST="YOUR_GPUs_CC+PTX" python setup.py install

(or, in a Dockerfile, add ARG TORCH_CUDA_ARCH_LIST="YOUR_GPUs_CC+PTX" before the RUN step that builds the extension)

More information is available here: https://pytorch.org/docs/stable/cpp_extension.html

The following bash script picks TORCH_CUDA_ARCH_LIST from the installed CUDA toolkit version:

CUDA_VERSION=$(/usr/local/cuda/bin/nvcc --version | sed -n 's/^.*release \([0-9]\+\.[0-9]\+\).*$/\1/p')
if [[ ${CUDA_VERSION} == 9.0* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;7.0+PTX"
elif [[ ${CUDA_VERSION} == 9.2* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0+PTX"
elif [[ ${CUDA_VERSION} == 10.* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5+PTX"
elif [[ ${CUDA_VERSION} == 11.0* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0+PTX"
elif [[ ${CUDA_VERSION} == 11.* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"
else
    echo "unsupported cuda version."
    exit 1
fi
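
The same lookup can be sketched in Python, for example to set the variable from inside a build script. This is only a minimal sketch and not part of PyTorch: parse_cuda_release and arch_list_for are hypothetical helper names, the table is copied from the shell script above, and the sample string assumes the usual "release X.Y" wording that the sed expression also relies on.

```python
import re

# Same table as the shell script above: CUDA release prefix -> arch list.
# Order matters: "11.0" must be tested before the broader "11." prefix.
ARCH_BY_CUDA_PREFIX = [
    ("9.0", "3.5;5.0;6.0;7.0+PTX"),
    ("9.2", "3.5;5.0;6.0;6.1;7.0+PTX"),
    ("10.", "3.5;5.0;6.0;6.1;7.0;7.5+PTX"),
    ("11.0", "3.5;5.0;6.0;6.1;7.0;7.5;8.0+PTX"),
    ("11.", "3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"),
]


def parse_cuda_release(nvcc_output):
    """Return the 'X.Y' release string found in `nvcc --version` output, or None."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def arch_list_for(release):
    """Map a CUDA release such as '11.3' to a TORCH_CUDA_ARCH_LIST value."""
    for prefix, archs in ARCH_BY_CUDA_PREFIX:
        if release.startswith(prefix):
            return archs
    raise ValueError(f"unsupported cuda version: {release}")


# Hypothetical sample line in the format nvcc prints.
sample = "Cuda compilation tools, release 11.3, V11.3.109"
print(arch_list_for(parse_cuda_release(sample)))
```

In a real build you would pass in the output of /usr/local/cuda/bin/nvcc --version (for instance via subprocess.run) and export the result as TORCH_CUDA_ARCH_LIST before calling setup.py.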

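If the GPU driver is loaded correctly, you can query the compute capability from the Python console:

>>> import torch
>>> torch.cuda.get_device_capability(0)
(6, 1)

That means TORCH_CUDA_ARCH_LIST="6.1". However, in most cases the IndexError appears because CUDA is unavailable at build time, for example because the GPU was specified incorrectly: with no visible device and no TORCH_CUDA_ARCH_LIST set, the arch list stays empty and arch_list[-1] fails.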
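
To turn the reported capability into a value for TORCH_CUDA_ARCH_LIST, something like the following could be used. It is a sketch under assumptions: arch_list_from_capabilities and visible_capabilities are hypothetical helper names, and appending +PTX to the highest architecture simply mirrors the convention used in the lists above.

```python
def arch_list_from_capabilities(capabilities):
    """Turn [(6, 1), (7, 5)] into '6.1;7.5+PTX' (highest arch also gets PTX)."""
    unique = sorted(set(capabilities))
    if not unique:
        # This is the situation behind the IndexError: nothing to build for.
        raise ValueError("no CUDA device visible; set TORCH_CUDA_ARCH_LIST by hand")
    archs = [f"{major}.{minor}" for major, minor in unique]
    archs[-1] += "+PTX"
    return ";".join(archs)


def visible_capabilities():
    """Return the capability tuples of all visible GPUs, or [] without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]


# The (6, 1) result from the console session above.
print(arch_list_from_capabilities([(6, 1)]))  # prints 6.1+PTX
```

On a machine where the GPU is visible, arch_list_from_capabilities(visible_capabilities()) gives the value to export; inside a docker build, where no GPU is usually attached, the list is empty and the value has to be supplied explicitly.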

Tuesday, July 4, 2023

"apt --fix-broken install" fail

Errors when uninstalling the NVIDIA driver:

~$ sudo apt --fix-broken install


Reading package lists... Done
Building dependency tree
Reading state information... Done
Correcting dependencies... failed.
The following packages have unmet dependencies:
nvidia-dkms-530 : Depends: nvidia-kernel-common-530 (>= 530.41.03) but 530.30.02-0ubuntu1 is installed
nvidia-driver-530 : Depends: nvidia-kernel-common-530 (>= 530.41.03) but 530.30.02-0ubuntu1 is installed
Recommends: libnvidia-compute-530:i386 (= 530.41.03-0ubuntu0.20.04.2)
Recommends: libnvidia-decode-530:i386 (= 530.41.03-0ubuntu0.20.04.2)
Recommends: libnvidia-encode-530:i386 (= 530.41.03-0ubuntu0.20.04.2)
Recommends: libnvidia-fbc1-530:i386 (= 530.41.03-0ubuntu0.20.04.2)
Recommends: libnvidia-gl-530:i386 (= 530.41.03-0ubuntu0.20.04.2)
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
E: Unable to correct dependencies


Solution:

~$ sudo dpkg --purge --force-depends nvidia-dkms-530
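
The package to purge was read off the "unmet dependencies" lines by hand. As a rough sketch, assuming the apt output keeps the "pkg : Depends: dep (constraint) but version is installed" shape shown above, those lines could also be extracted programmatically. unmet_dependencies is a hypothetical helper name.

```python
import re

# Matches lines such as:
# nvidia-dkms-530 : Depends: nvidia-kernel-common-530 (>= 530.41.03) but 530.30.02-0ubuntu1 is installed
UNMET = re.compile(
    r"^\s*(?P<package>\S+) : Depends: (?P<dependency>\S+) "
    r"\((?P<required>[^)]+)\) but (?P<installed>\S+) is installed",
    re.MULTILINE,
)


def unmet_dependencies(apt_output):
    """Return one dict per 'Depends' conflict found in the apt output."""
    return [match.groupdict() for match in UNMET.finditer(apt_output)]


sample = """\
The following packages have unmet dependencies:
nvidia-dkms-530 : Depends: nvidia-kernel-common-530 (>= 530.41.03) but 530.30.02-0ubuntu1 is installed
nvidia-driver-530 : Depends: nvidia-kernel-common-530 (>= 530.41.03) but 530.30.02-0ubuntu1 is installed
"""

for item in unmet_dependencies(sample):
    print(item["package"], "needs", item["dependency"], item["required"])
```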

To remove the CUDA toolkit:

sudo apt-get --purge remove "*cublas*" "cuda*" "nsight*" 

To remove the NVIDIA drivers:

sudo apt-get --purge remove "*nvidia*"

If you installed from source files (assuming the default location /usr/local), remove them with:

sudo rm -rf /usr/local/cuda*






Wednesday, July 20, 2022

Install cuDNN 8

Official doc:

https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#package-manager-ubuntu-install 


Download the headers and libraries

https://developer.nvidia.com/rdp/cudnn-download

Procedure

  1. Navigate to your <cudnnpath> directory containing the cuDNN tar file.
  2. Unzip the cuDNN package.
    $ tar -xvf cudnn-linux-x86_64-8.x.x.x_cudaX.Y-archive.tar.xz
  3. Copy the following files into the CUDA toolkit directory.
    $ sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include 
    $ sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64 
    $ sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
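
To check which cuDNN version ended up in the toolkit directory, the version macros in the copied header can be read back. This is a small sketch that assumes the cuDNN 8 layout, where CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL are defined in include/cudnn_version.h; cudnn_version_from_header is a hypothetical helper name and the sample values are made up.

```python
import re


def cudnn_version_from_header(header_text):
    """Return (major, minor, patch) parsed from cudnn_version.h text, or None."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(
            rf"^\s*#define\s+{name}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)


# Hypothetical header excerpt; on a real system read
# /usr/local/cuda/include/cudnn_version.h instead.
sample_header = """\
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 4
#define CUDNN_PATCHLEVEL 1
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""

print(cudnn_version_from_header(sample_header))
```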

Wednesday, June 8, 2022

Docker error: could not select device driver

Problem:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Solution:

Install nvidia-docker

Install guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker


Installing on Ubuntu and Debian

The following steps can be used to set up the NVIDIA Container Toolkit on Ubuntu LTS (16.04, 18.04, 20.04) and Debian (Stretch, Buster) distributions.

Setting up Docker

Docker-CE on Ubuntu can be set up using Docker’s official convenience script:

$ curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

See also: the official Docker instructions for more details and post-install actions.

Setting up NVIDIA Container Toolkit

Set up the package repository and the GPG key:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Note

To get access to experimental features and access to release candidates, you may want to add the experimental branch to the repository listing:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
         sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
         sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

For errors regarding the Signed-By option, see the relevant troubleshooting section.

Install the nvidia-docker2 package (and dependencies) after updating the package listing:

$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2

Restart the Docker daemon to complete the installation after setting the default runtime:

$ sudo systemctl restart docker
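
After the restart, one quick sanity check is whether Docker's daemon configuration lists an nvidia runtime. The sketch below assumes that the nvidia-docker2 package registers the runtime under "runtimes" in /etc/docker/daemon.json; that location and the sample content are assumptions to verify on your system, and has_nvidia_runtime is a hypothetical helper name.

```python
import json


def has_nvidia_runtime(daemon_json_text):
    """Return True if the daemon.json text declares a runtime named 'nvidia'."""
    config = json.loads(daemon_json_text)
    return "nvidia" in config.get("runtimes", {})


# Hypothetical daemon.json content; on a real host read /etc/docker/daemon.json.
sample_config = """
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
"""

print(has_nvidia_runtime(sample_config))  # prints True
```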





Tuesday, May 31, 2022

GPG error: "public key is not available" in Ubuntu

For both server and container:

Official solution

https://developer.nvidia.com/blog/updating-the-cuda-linux-gpg-repository-key/

Remove the outdated signing key

Debian, Ubuntu, WSL

$ sudo apt-key del 7fa2af80

Install the new key

For Debian-based distributions, including Ubuntu, you must also install the new package or manually install the new signing key.

Install the new cuda-keyring package

To avoid the need for manual key installation steps, NVIDIA is providing a new helper package to automate the installation of new signing keys for NVIDIA repositories. 

Replace $distro/$arch in the following commands with values appropriate for your OS; for example:

  • ubuntu1604/x86_64
  • ubuntu1804/cross-linux-sbsa
  • ubuntu1804/ppc64el
  • ubuntu1804/sbsa
  • ubuntu1804/x86_64
  • ubuntu2004/cross-linux-sbsa
  • ubuntu2004/sbsa
  • ubuntu2004/x86_64
  • ubuntu2204/sbsa
  • ubuntu2204/x86_64
  • wsl-ubuntu/x86_64

Debian, Ubuntu, WSL

$ wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb

Common issues and solutions on Debian-based distros

Some common errors covered in the NVIDIA post above:

Duplicate .list entries

E: Conflicting values set for option Signed-By regarding source https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /: /usr/share/keyrings/cuda-archive-keyring.gpg !=
E: The list of sources could not be read.

Solution: If you previously used add-apt-repository to enable the CUDA repository, then remove the duplicate entry.

sudo sed -i '/developer\.download\.nvidia\.com\/compute\/cuda\/repos/d' /etc/apt/sources.list

Also check for and remove cuda*.list files under the /etc/apt/sources.list.d/ directory.
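
Before deleting anything, it can help to list every apt source line that points at the CUDA repository, so the duplicates are visible. A minimal sketch, assuming the standard /etc/apt layout; cuda_repo_entries and load_apt_sources are hypothetical helper names, and the sample file names and contents are made up.

```python
from pathlib import Path

CUDA_REPO_MARKER = "developer.download.nvidia.com/compute/cuda/repos"


def cuda_repo_entries(sources):
    """sources maps file path -> file text; return (path, line_number, line) hits."""
    hits = []
    for path, text in sources.items():
        for number, line in enumerate(text.splitlines(), start=1):
            stripped = line.strip()
            if stripped.startswith("#"):
                continue  # commented-out entries do not conflict
            if CUDA_REPO_MARKER in stripped:
                hits.append((path, number, stripped))
    return hits


def load_apt_sources(root="/etc/apt"):
    """Read sources.list and sources.list.d/*.list below root into a dict."""
    base = Path(root)
    paths = [base / "sources.list", *sorted((base / "sources.list.d").glob("*.list"))]
    return {str(p): p.read_text(errors="replace") for p in paths if p.is_file()}


# Hypothetical contents illustrating a duplicate entry.
sample_sources = {
    "/etc/apt/sources.list": (
        "deb http://archive.ubuntu.com/ubuntu focal main\n"
        "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /\n"
    ),
    "/etc/apt/sources.list.d/cuda.list": (
        "deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] "
        "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /\n"
    ),
}

for path, number, line in cuda_repo_entries(sample_sources):
    print(f"{path}:{number}: {line}")
```

More than one hit for the same repository URL is the situation that triggers the Signed-By conflict; on a real system, call cuda_repo_entries(load_apt_sources()).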

----------------------------------------------

Error

Err:6 http://packages.microsoft.com/repos/azurecore focal InRelease

  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY EB3E94ADBE1229CF

Reading package lists... Done

W: GPG error: http://packages.microsoft.com/repos/azurecore focal InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY EB3E94ADBE1229CF

Solution

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EB3E94ADBE1229CF
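
The key ID passed to --recv-keys is the one apt prints after NO_PUBKEY. When several repositories fail at once, the IDs can be collected from the apt output with a sketch like this; missing_key_ids is a hypothetical helper name, and the sample text is the error shown above.

```python
import re


def missing_key_ids(apt_output):
    """Return the distinct key IDs reported as NO_PUBKEY, in first-seen order."""
    seen = []
    for key_id in re.findall(r"NO_PUBKEY ([0-9A-Fa-f]+)", apt_output):
        if key_id not in seen:
            seen.append(key_id)
    return seen


sample_output = """\
Err:6 http://packages.microsoft.com/repos/azurecore focal InRelease
  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY EB3E94ADBE1229CF
W: GPG error: http://packages.microsoft.com/repos/azurecore focal InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY EB3E94ADBE1229CF
"""

print(missing_key_ids(sample_output))  # prints ['EB3E94ADBE1229CF']
```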

The same problem inside a Docker build (in the Dockerfile; replace $distro/$arch as above):

RUN sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub

or 


RUN wget https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-keyring_1.0-1_all.deb
RUN sudo dpkg -i cuda-keyring_1.0-1_all.deb


Update 2023-07-05

Error:

gpgkeys: key F60F4B3D7FA2AF80 not found on keyserver





The solution is:
wget -qO - https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | sudo apt-key add -

as documented here: https://developer.nvidia.com/cuda-downloads -> Linux -> x86_64 -> Ubuntu -> 18.04 -> deb (network)



Conflicting Signed-By error on "sudo apt update" when the duplicate entry is not in /etc/apt/sources.list

E: Conflicting values set for option Signed-By regarding source https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /: /usr/share/keyrings/cuda-archive-keyring.gpg !=

Solution 1:
Remove /etc/apt/sources.list.d/cuda* and /etc/apt/sources.list.d/nccl*, then delete the old CUDA key (apt-key del, as shown earlier).

Solution 2:

sed -i '/developer\.download\.nvidia\.com\/compute\/cuda\/repos/d' /etc/apt/sources.list.d/*
sed -i '/developer\.download\.nvidia\.com\/compute\/machine-learning\/repos/d' /etc/apt/sources.list.d/*

Source:

https://askubuntu.com/questions/1424040/e-conflicting-values-set-for-option-signed-by-regarding-source-https-develope/1424054#1424054