Friday, May 10, 2024

SSH login with key

Reference: 

https://www.digitalocean.com/community/tutorials/how-to-configure-ssh-key-based-authentication-on-a-linux-server 


Step 1 — Creating SSH Keys on the Client

On your local computer, generate an SSH key pair by typing:

ssh-keygen

Output
Generating public/private rsa key pair.
Enter file in which to save the key (/home/username/.ssh/id_rsa):

The utility will prompt you to select a location for the keys that will be generated. By default, the keys will be stored in the ~/.ssh directory within your user’s home directory. The private key will be called id_rsa and the associated public key will be called id_rsa.pub.
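If you want to skip the prompts, everything can be passed on the command line. A minimal non-interactive sketch (the key type, comment, and file path here are just examples; consider setting a real passphrase instead of the empty -N value):

ssh-keygen -t ed25519 -C "laptop key" -f ~/.ssh/id_ed25519 -N ""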

Step 2 — Copying an SSH Public Key to Your Server

The full command will look like this:

cat ~/.ssh/id_rsa.pub | ssh username@remote_host "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"

You may see a message like this:

Output
The authenticity of host '203.0.113.1 (203.0.113.1)' can't be established.
ECDSA key fingerprint is fd:fd:d4:f9:77:fe:73:84:e1:55:00:ad:d6:6d:22:fe.
Are you sure you want to continue connecting (yes/no)? yes

This means that your local computer does not recognize the remote host. This will happen the first time you connect to a new host. Type yes and press ENTER to continue.
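If the ssh-copy-id utility is available on your local machine (it ships with OpenSSH on most Linux distributions), it performs the same copy in one step:

ssh-copy-id username@remote_host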

Thursday, January 18, 2024

Build PyTorch cpp_extension with CUDA arch error

 Error:

"cpp_extension.py", line 1561, in _get_cuda_arch_flags
    arch_list[-1] += '+PTX'

IndexError: list index out of range

Solution:

I tried to investigate this issue a bit, since I've faced the same problem in one of my Docker containers.

If you're currently running your code through a setup.py, you should first set TORCH_CUDA_ARCH_LIST="YOUR_GPUs_CC+PTX" when you run it:

TORCH_CUDA_ARCH_LIST="YOUR_GPUs_CC+PTX" python setup.py install

(or an ARG TORCH_CUDA_ARCH_LIST="YOUR_GPUs_CC+PTX" in your Dockerfile, for instance; see the sketch below)
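As a sketch, a Dockerfile that builds the extension during the image build might look like this (the base image tag and compute capability are placeholders; pick ones matching your GPU):

# Hypothetical Dockerfile fragment; base image tag and CC value are examples.
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
ARG TORCH_CUDA_ARCH_LIST="8.6+PTX"
WORKDIR /workspace
COPY . .
# ARG values are exposed as environment variables to RUN steps at build time,
# so cpp_extension picks up TORCH_CUDA_ARCH_LIST here.
RUN python setup.py install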

Additional info can be found here: https://pytorch.org/docs/stable/cpp_extension.html
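For reference, a shell snippet along these lines (similar to what PyTorch's own build scripts do) can pick an arch list to export based on the CUDA version reported by nvcc: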


CUDA_VERSION=$(/usr/local/cuda/bin/nvcc --version | sed -n 's/^.*release \([0-9]\+\.[0-9]\+\).*$/\1/p')
if [[ ${CUDA_VERSION} == 9.0* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;7.0+PTX"
elif [[ ${CUDA_VERSION} == 9.2* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0+PTX"
elif [[ ${CUDA_VERSION} == 10.* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5+PTX"
elif [[ ${CUDA_VERSION} == 11.0* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0+PTX"
elif [[ ${CUDA_VERSION} == 11.* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"
else
    echo "unsupported cuda version."
    exit 1
fi

If the GPU driver is loaded correctly, execute the following in the Python console:

>>> import torch
>>> torch.cuda.get_device_capability(0)
(6, 1)

That means TORCH_CUDA_ARCH_LIST="6.1". However, in most cases CUDA is unavailable because the GPU was specified incorrectly.
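You can also build the value directly from the device instead of hard-coding it. A small sketch, assuming torch is installed and the GPU is visible:

export TORCH_CUDA_ARCH_LIST=$(python -c 'import torch; print("%d.%d+PTX" % torch.cuda.get_device_capability(0))')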

Tuesday, July 4, 2023

"apt --fix-broken install" fail

Errors when uninstalling the NVIDIA driver:

~$ sudo apt --fix-broken install


Reading package lists... Done
Building dependency tree
Reading state information... Done
Correcting dependencies... failed.
The following packages have unmet dependencies:
nvidia-dkms-530 : Depends: nvidia-kernel-common-530 (>= 530.41.03) but 530.30.02-0ubuntu1 is installed
nvidia-driver-530 : Depends: nvidia-kernel-common-530 (>= 530.41.03) but 530.30.02-0ubuntu1 is installed
Recommends: libnvidia-compute-530:i386 (= 530.41.03-0ubuntu0.20.04.2)
Recommends: libnvidia-decode-530:i386 (= 530.41.03-0ubuntu0.20.04.2)
Recommends: libnvidia-encode-530:i386 (= 530.41.03-0ubuntu0.20.04.2)
Recommends: libnvidia-fbc1-530:i386 (= 530.41.03-0ubuntu0.20.04.2)
Recommends: libnvidia-gl-530:i386 (= 530.41.03-0ubuntu0.20.04.2)
E: Error, pkgProblemResolver::Resolve generated breaks, this may be caused by held packages.
E: Unable to correct dependencies


Solution:

~$ sudo dpkg --purge --force-depends nvidia-dkms-530

To remove cuda toolkit:

sudo apt-get --purge remove "*cublas*" "cuda*" "nsight*" 

To remove Nvidia drivers:

sudo apt-get --purge remove "*nvidia*"

If you have installed via source files (assuming the default location to be /usr/local) then remove it using:

sudo rm -rf /usr/local/cuda*
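After purging, it can help to confirm nothing was left behind and to drop orphaned dependencies (package names vary by driver version):

dpkg -l | grep -i nvidia      # list any remaining NVIDIA packages
sudo apt-get autoremove       # remove orphaned dependencies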






Wednesday, July 20, 2022

Install cudnn8

Official doc:

https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#package-manager-ubuntu-install 


Download header and libs

https://developer.nvidia.com/rdp/cudnn-download

Procedure

  1. Navigate to your <cudnnpath> directory containing the cuDNN tar file.
  2. Unzip the cuDNN package.
    $ tar -xvf cudnn-linux-x86_64-8.x.x.x_cudaX.Y-archive.tar.xz
  3. Copy the following files into the CUDA toolkit directory.
    $ sudo cp cudnn-*-archive/include/cudnn*.h /usr/local/cuda/include 
    $ sudo cp -P cudnn-*-archive/lib/libcudnn* /usr/local/cuda/lib64 
    $ sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
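To verify the files are in place, you can print the version macros from the installed header (cudnn_version.h exists in cuDNN 8; older releases kept these macros in cudnn.h):

    $ cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2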

Wednesday, June 8, 2022

Docker error: could not select device driver

Problem:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Solution:

Install nvidia-docker

Install guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker


Installing on Ubuntu and Debian

The following steps can be used to set up the NVIDIA Container Toolkit on Ubuntu LTS (16.04, 18.04, 20.04) and Debian (Stretch, Buster) distributions.

Setting up Docker

Docker-CE on Ubuntu can be set up using Docker’s official convenience script:

$ curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

See also

Follow the official instructions for more details and post-install actions.

Setting up NVIDIA Container Toolkit

Set up the package repository and the GPG key:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Note

To get access to experimental features and access to release candidates, you may want to add the experimental branch to the repository listing:

$  distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
         sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
         sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

If apt-get update later fails with an error related to the Signed-By option, see the relevant troubleshooting section of the install guide.

Install the nvidia-docker2 package (and dependencies) after updating the package listing:

$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
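Installing the package registers the NVIDIA runtime with Docker; afterwards /etc/docker/daemon.json should look roughly like this (a sketch, your file may carry additional settings):

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}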

Restart the Docker daemon to complete the installation after setting the default runtime:

$ sudo systemctl restart docker
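At this point, a working setup can be smoke-tested by running a CUDA container (the image tag is just an example):

$ sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

If everything is configured correctly, this prints the same nvidia-smi table you would see on the host.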