Showing posts with label tensorflow. Show all posts

Sunday, June 24, 2018

BUG: tensorflow tf.scatter_nd will accumulate (or produce undetermined) values when indices have duplicates

Problem:
WARNING: The order in which updates are applied is nondeterministic, so the output will be nondeterministic if indices contains duplicates.

Solution:
https://github.com/tensorflow/tensorflow/issues/8102

If you are in the unpooling business:
@teramototoya what I did as a hack: with https://www.tensorflow.org/api_docs/python/tf/unique_with_counts I counted the multiplicity of each index and divided the tensor holding the values (the tensor named 'updates' in the first comment) by this count,
so when the add() comes it undoes what the division did.
If not, i.e. your problem is general, then you have to somehow flatten your indices and then use tf.unique; see this post:
https://stackoverflow.com/questions/44117430/how-to-use-tf-scatter-nd-without-accumulation
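
The pre-division hack quoted above can be sketched without TensorFlow. This is only an illustration in plain Python on 1-D data: scatter_add_1d is a hypothetical stand-in for the accumulating behaviour of tf.scatter_nd, and collections.Counter plays the role of tf.unique_with_counts.

```python
from collections import Counter


def scatter_add_1d(indices, updates, size):
    """Accumulating scatter: updates that share an index are summed."""
    out = [0.0] * size
    for i, u in zip(indices, updates):
        out[i] += u
    return out


def scatter_mean_by_predivision(indices, updates, size):
    """Divide each update by the multiplicity of its index, then scatter-add.

    The later accumulation undoes the division, so duplicated indices end up
    holding the mean of their updates instead of the sum.
    """
    counts = Counter(indices)
    scaled = [u / counts[i] for i, u in zip(indices, updates)]
    return scatter_add_1d(indices, scaled, size)


indices = [0, 2, 2, 3]
updates = [1.0, 4.0, 6.0, 5.0]
print(scatter_add_1d(indices, updates, 5))               # index 2 is summed
print(scatter_mean_by_predivision(indices, updates, 5))  # index 2 is averaged
```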


2. Solution
Use an auxiliary count tensor.

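# Extra channel of ones: after scatter_nd it holds the number of updates per cell.
cntOnes = tf.ones_like(dmVis, tf.float32)
vals = tf.stack([dmVis, mlVis, cntOnes], axis=1)
scatter_shape = tf.constant([bs, h*upsample, w*upsample, 3])
dmc = tf.scatter_nd(locVis, vals, scatter_shape)
dm, ml, cnt = tf.unstack(dmc, axis=-1)
# epsilon is a small positive constant defined elsewhere; it avoids division by zero in empty cells.
cnt = cnt + epsilon
# Dividing the accumulated sum by the count gives the mean per cell.
dm = dm / cnt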
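
The same count-channel idea, reduced to a 1-D plain-Python sketch so it can be run without TensorFlow. The function name and the default epsilon are assumptions made for this illustration, not part of the snippet above.

```python
def scatter_mean_with_count(indices, values, size, epsilon=1e-8):
    """Scatter-add the values and a parallel channel of ones, then divide.

    The ones channel accumulates to the number of updates per cell, so
    sum / (count + epsilon) is the mean for hit cells and 0 for empty ones.
    """
    sums = [0.0] * size
    counts = [0.0] * size
    for i, v in zip(indices, values):
        sums[i] += v       # accumulated values (what scatter_nd does)
        counts[i] += 1.0   # accumulated ones (the auxiliary count)
    return [s / (c + epsilon) for s, c in zip(sums, counts)]


result = scatter_mean_with_count([0, 2, 2, 3], [1.0, 4.0, 6.0, 5.0], 5)
print(result)
```

Unlike the pre-division hack, this needs no unique/count pass over the indices; the price is one extra channel in the scattered tensor.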

Thursday, October 12, 2017

Nvidia driver version mismatch (which causes tensorflow gpu not to work)

Problem:
When using tensorflow-gpu, the following error appears (solved here on Ubuntu 16.04):

tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE

1. The cause may be an NVIDIA driver version mismatch. Check the installed driver:

nvidia-smi

Failed to initialize NVML: Driver/library version mismatch

List the installed NVIDIA driver packages:
$ dpkg --get-selections | grep nvidia
nvidia-375 install
nvidia-384 install
nvidia-opencl-icd-375 deinstall
nvidia-opencl-icd-384 install
nvidia-prime install
nvidia-settings install

dpkg -l | grep -i nvidia

ii  bbswitch-dkms                              0.8-3ubuntu1                                  amd64        Interface for toggling the power on NVIDIA Optimus video cards
ii  libcuda1-375                               375.82-0ubuntu0~gpu16.04.1                    amd64        NVIDIA CUDA runtime library
ii  nvidia-375                                 375.82-0ubuntu0~gpu16.04.1                    amd64        NVIDIA binary driver - version 375.82
ii  nvidia-opencl-icd-375                      375.82-0ubuntu0~gpu16.04.1                    amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                               0.8.2                                         amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                            384.90-0ubuntu0~gpu16.04.1                    amd64        Tool for configuring the NVIDIA graphics driver

2. Uninstall the current driver and reinstall nvidia-375

$ nvidia-uninstall

If nvidia-uninstall is not available, remove all NVIDIA driver packages and reinstall:
  1. Run sudo apt-get purge 'nvidia-*'
  2. Run sudo add-apt-repository ppa:graphics-drivers/ppa and then sudo apt-get update.
  3. Run sudo apt-get install nvidia-375.
  4. Reboot and your graphics issue should be fixed.

You can check that the driver module is loaded with the following command:
lsmod | grep nvidia
Done! nvidia-smi should now work.



3. Consider turning off Ubuntu's automatic updates.

Another helpful command lists the devices:
ubuntu-drivers devices


4. Important: prevent the driver from being auto-updated
Sometimes this step alone is enough!

$ sudo apt-mark hold nvidia-375
$ dpkg --get-selections | grep nvidia
nvidia-375 hold
nvidia-384 install
nvidia-opencl-icd-384 install
nvidia-prime install
nvidia-settings install

5. Check tensorflow

import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: GeForce GTX TITAN X
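
As a shorter check than reading the placement log, the visible GPUs can be listed directly. This is a sketch that assumes the device_lib module under tensorflow.python.client, which is an internal module rather than a documented public API; the helper name is hypothetical, and it returns None when TensorFlow cannot be imported.

```python
def list_gpu_device_names():
    """Return the names of the GPUs TensorFlow can see, or None without TensorFlow."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]


names = list_gpu_device_names()
if names is None:
    print("TensorFlow is not installed in this environment")
elif names:
    print("GPUs visible to TensorFlow:", names)
else:
    print("TensorFlow is installed but sees no GPU; check the driver")
```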

Monday, September 18, 2017

Install tensorflow by building from source

Problem when compiling TensorFlow C++ code against a direct pip install:
third_party/eigen3/unsupported/eigen/cxx11/tensor: no such file or directory

Fixed by building TensorFlow from source.
Environment used
Ubuntu 16.04
gcc 5.4.0
Cuda 8.0
Cudnn 5
python 2.7
tensorflow 1.2.1

Official doc:
https://www.tensorflow.org/install/install_sources#PrepareLinux

Clone the TensorFlow repository

Start the process of building TensorFlow by cloning a TensorFlow repository.
To clone the latest TensorFlow repository, issue the following command:
$ git clone https://github.com/tensorflow/tensorflow
The preceding git clone command creates a subdirectory named tensorflow. After cloning, you may optionally build a specific branch (such as a release branch) by invoking the following commands:
$ cd tensorflow
$ git checkout Branch # where Branch is the desired branch
For example, to work with the r1.2 release instead of the master release, issue the following command:
$ git checkout r1.2

Prepare environment for Linux

Install Bazel

If bazel is not installed on your system, install it now by following these directions.
Bug:
The latest Bazel fails to build this TensorFlow version; roll back to Bazel 0.5.2.
Download from
https://github.com/bazelbuild/bazel/releases/download/0.5.2/bazel_0.5.2-linux-x86_64.deb

Install TensorFlow Python dependencies

To install these packages for Python 2.7, issue the following command:
$ sudo apt-get install python-numpy python-dev python-pip python-wheel
To install these packages for Python 3.n, issue the following command:
$ sudo apt-get install python3-numpy python3-dev python3-pip python3-wheel

Optional: install TensorFlow for GPU prerequisites

Finally, you must also install libcupti-dev by invoking the following command:

$ sudo apt-get install libcupti-dev

Next

After preparing the environment, you must now configure the installation.
$ cd tensorflow  # cd to the top-level directory created
$ ./configure
Please specify the location of python. [Default is /usr/bin/python]: /usr/bin/python2.7
Found possible Python library paths:
  /usr/local/lib/python2.7/dist-packages
  /usr/lib/python2.7/dist-packages
Please input the desired Python library path to use.  Default is [/usr/lib/python2.7/dist-packages]

Using python library path: /usr/local/lib/python2.7/dist-packages
Do you wish to build TensorFlow with MKL support? [y/N]
No MKL support will be enabled for TensorFlow
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Do you wish to use jemalloc as the malloc implementation? [Y/n]
jemalloc enabled
Do you wish to build TensorFlow with Google Cloud Platform support? [y/N]
No Google Cloud Platform support will be enabled for TensorFlow
Do you wish to build TensorFlow with Hadoop File System support? [y/N]
No Hadoop File System support will be enabled for TensorFlow
Do you wish to build TensorFlow with the XLA just-in-time compiler (experimental)? [y/N]
No XLA support will be enabled for TensorFlow
Do you wish to build TensorFlow with VERBS support? [y/N]
No VERBS support will be enabled for TensorFlow
Do you wish to build TensorFlow with OpenCL support? [y/N]
No OpenCL support will be enabled for TensorFlow
Do you wish to build TensorFlow with CUDA support? [y/N] Y
CUDA support will be enabled for TensorFlow
Do you want to use clang as CUDA compiler? [y/N]
nvcc will be used as CUDA compiler
Please specify the Cuda SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 8.0]: 8.0
Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 6.0]: 5
Please specify the location where cuDNN 6 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: 3.0
Do you wish to build TensorFlow with MPI support? [y/N] 
MPI support will not be enabled for TensorFlow
Configuration finished
To build a pip package for TensorFlow with GPU support, invoke the following command:
$ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" 
NOTE on gcc 5 or later: the binary pip packages available on the TensorFlow website are built with gcc 4, which uses the older ABI. To make your build compatible with the older ABI, you need to add --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" to your bazel build command.


The bazel build command builds a script named build_pip_package. Running this script as follows will build a .whl file within the /tmp/tensorflow_pkg directory:

$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Install via pip in a virtual environment
$ virtualenv ~/tensorflow
$ source ~/tensorflow/bin/activate
(tensorflow) $ pip install /tmp/tensorflow_pkg/tensorflow-1.2.1-cp27-cp27mu-linux_x86_64.whl
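
The wheel file name encodes which interpreter it was built for, which is why the cp27 wheel above only installs into a Python 2.7 environment. A minimal sketch of reading those tags, assuming the standard wheel naming scheme (name, version, optional build tag, Python tag, ABI tag, platform tag); the helper names are hypothetical.

```python
import sys


def parse_wheel_filename(filename):
    """Split a wheel file name into its name, version and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between the version and the Python tag.
        del parts[2]
    if len(parts) != 5:
        raise ValueError("unexpected wheel file name: %r" % filename)
    keys = ("name", "version", "python_tag", "abi_tag", "platform_tag")
    return dict(zip(keys, parts))


def built_for_this_interpreter(tags):
    """True if the wheel's Python tag matches the running CPython version."""
    return tags["python_tag"] == "cp%d%d" % sys.version_info[:2]


tags = parse_wheel_filename("tensorflow-1.2.1-cp27-cp27mu-linux_x86_64.whl")
print(tags)
print("matches this interpreter:", built_for_this_interpreter(tags))
```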


Update 2018.01.16

Problems encountered with:

cuda 9.1
cudnn 7.0
tensorflow 1.5.0


When building with Bazel, the following problems occurred.


Problem 1

While trying to compile the latest TensorFlow (cloned from 798fa36), the following error is raised:
ERROR: /home/ubuntu/tensorflow/tensorflow/contrib/seq2seq/BUILD:64:1: error while parsing .d file: /home/ubuntu/.cache/bazel/_bazel_ubuntu/ad1e09741bb4109fbc70ef8216b59ee2/execroot/org_tensorflow/bazel-out/k8-py3-opt/bin/tensorflow/contrib/seq2seq/_objs/python/ops/_beam_search_ops_gpu/tensorflow/contrib/seq2seq/kernels/beam_search_ops_gpu.cu.pic.d (No such file or directory)
In file included from external/eigen_archive/unsupported/Eigen/CXX11/Tensor:14:0,
                 from ./third_party/eigen3/unsupported/Eigen/CXX11/Tensor:1,
                 from ./tensorflow/contrib/seq2seq/kernels/beam_search_ops.h:19,
                 from tensorflow/contrib/seq2seq/kernels/beam_search_ops_gpu.cu.cc:20:
external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/Core:59:34: fatal error: math_functions.hpp: No such file or directory
It turns out that in CUDA 9.1, math_functions.hpp is located at cuda/include/crt/math_functions.hpp rather than cuda/include/math_functions.hpp (where CUDA 9.0 has it), which leads to this error.
Creating a symbolic link fixes the problem and lets the build complete:
ln -s /usr/local/cuda/include/crt/math_functions.hpp /usr/local/cuda/include/math_functions.hpp
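
Before creating the link, it can help to confirm which layout the installed toolkit actually has. A minimal sketch, assuming the default /usr/local/cuda/include location; the function name and its return labels are made up for this example.

```python
import os


def locate_math_functions(cuda_include_dir):
    """Report where math_functions.hpp lives under a CUDA include directory."""
    top = os.path.join(cuda_include_dir, "math_functions.hpp")
    crt = os.path.join(cuda_include_dir, "crt", "math_functions.hpp")
    if os.path.exists(top):
        return "top-level"   # CUDA 9.0 layout, or the symlink already exists
    if os.path.exists(crt):
        return "crt-only"    # CUDA 9.1 layout: the symlink is still needed
    return "missing"         # no header found in either place


print(locate_math_functions("/usr/local/cuda/include"))
```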

Reference


Note on gcc version >=5: gcc uses the new C++ ABI since version 5. The binary pip packages available on the TensorFlow website are built with gcc4 that uses the older ABI. If you compile your op library with gcc>=5, add -D_GLIBCXX_USE_CXX11_ABI=0 to the command line to make the library compatible with the older abi. Furthermore if you are using TensorFlow package created from source remember to add --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" as bazel command to compile the Python package.

Problem 2

no such package '@nasm//': java.io.IOException: Error downloading [https://mirror.bazel.build/www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2, http://pkgs.fedoraproject.org/repo/pkgs/nasm/nasm-2.12.02.tar.bz2/d15843c3fb7db39af80571ee27ec6fad/nasm-2.12.02.tar.bz2]

Solution
https://github.com/tensorflow/tensorflow/issues/16862
The problem is that one of the two mirrors for nasm is dead, and the second one is problematic for some reason. The workaround is to add one more mirror, so that the urls list reads:
      urls = [
          "https://mirror.bazel.build/www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",
          "http://www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",
          "http://pkgs.fedoraproject.org/repo/pkgs/nasm/nasm-2.12.02.tar.bz2/d15843c3fb7db39af80571ee27ec6fad/nasm-2.12.02.tar.bz2",
      ]
This is the urls list of the entry that contains
"https://mirror.bazel.build/www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",

Problem 3

'Numpy dangling symbolic links' when building from source
Solution

sudo pip install --no-cache-dir --upgrade --force-reinstall numpy
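
To see whether a numpy installation really contains dangling links before reinstalling, the directory can be scanned. A sketch only: the function name is hypothetical, and the example path assumes numpy lives under the /usr/lib/python2.7/dist-packages library path shown by ./configure earlier in this post.

```python
import os


def find_dangling_symlinks(root):
    """Return the symbolic links under root whose targets do not exist."""
    dangling = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)


# Assumed location; a path that does not exist simply yields an empty list.
print(find_dangling_symlinks("/usr/lib/python2.7/dist-packages/numpy"))
```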

Update 2018.06.05

Problem for 
cuda 9.0
cudnn 7.0
tensorflow 1.7.0
bazel 0.14

When building with Bazel, linking fails with:

bazel-out/host/bin/_solib_local/_U_S_Stensorflow_Spython_Cgen_Ustring_Uops_Upy_Uwrappers_Ucc___Utensorflow/libtensorflow_framework.so: undefined reference to `cublasDsymm_v2@libcublas.so.9.0
........

Solution
1. Check $LD_LIBRARY_PATH in ~/.bashrc.
2. Check CUDA_PATH.
The solution is to use ldconfig instead of LD_LIBRARY_PATH:
echo "/usr/local/cuda/lib64" | sudo tee /etc/ld.so.conf.d/cuda.conf
sudo ldconfig
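
A rough way to check the result from Python is ctypes.util.find_library, which on Linux consults the ldconfig cache among other sources. This is a sketch under that assumption; the function name, the default library name "cublas" and the default directory are illustrative choices.

```python
import ctypes.util
import os


def cuda_library_status(library="cublas", cuda_lib_dir="/usr/local/cuda/lib64"):
    """Report whether a CUDA library resolves and whether LD_LIBRARY_PATH is involved."""
    ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    return {
        # File name of the library if the system can resolve it, else None.
        "resolved_name": ctypes.util.find_library(library),
        # True if the CUDA directory is still listed in LD_LIBRARY_PATH.
        "in_ld_library_path": cuda_lib_dir in ld_path.split(os.pathsep),
    }


print(cuda_library_status())
```

If resolved_name is not None while in_ld_library_path is False, the library is being found without relying on LD_LIBRARY_PATH.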

Updated on 2018.12.21


Reinstall bazel

Follow https://docs.bazel.build/versions/master/install-ubuntu.html
First remove the old Bazel cache and configuration:
rm -rf ~/.cache/bazel
rm -rf ~/.bazel ~/.bazelrc

Step 2: Download Bazel

Next, download the Bazel binary installer named bazel-<version>-installer-linux-x86_64.sh from the Bazel releases page on GitHub.

Step 3: Run the installer

Run the Bazel installer as follows:
chmod +x bazel-<version>-installer-linux-x86_64.sh
./bazel-<version>-installer-linux-x86_64.sh --user
The --user flag installs Bazel to the $HOME/bin directory on your system and sets the .bazelrc path to $HOME/.bazelrc. Use the --help command to see additional installation options.
Different TensorFlow versions need different Bazel versions (tested build configurations):

Linux

Version              Python version   Compiler   Build tools
tensorflow-1.12.0    2.7, 3.3-3.6     GCC 4.8    Bazel 0.15.0
tensorflow-1.11.0    2.7, 3.3-3.6     GCC 4.8    Bazel 0.15.0
tensorflow-1.10.0    2.7, 3.3-3.6     GCC 4.8    Bazel 0.15.0
tensorflow-1.9.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.11.0
tensorflow-1.8.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.10.0
tensorflow-1.7.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.10.0
tensorflow-1.6.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.9.0
tensorflow-1.5.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.8.0
tensorflow-1.4.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.5.4
tensorflow-1.3.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.4.5
tensorflow-1.2.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.4.5
tensorflow-1.1.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.4.2
tensorflow-1.0.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.4.2

Version                  Python version   Compiler   Build tools    cuDNN   CUDA
tensorflow_gpu-1.12.0    2.7, 3.3-3.6     GCC 4.8    Bazel 0.15.0   7       9
tensorflow_gpu-1.11.0    2.7, 3.3-3.6     GCC 4.8    Bazel 0.15.0   7       9
tensorflow_gpu-1.10.0    2.7, 3.3-3.6     GCC 4.8    Bazel 0.15.0   7       9
tensorflow_gpu-1.9.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.11.0   7       9
tensorflow_gpu-1.8.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.10.0   7       9
tensorflow_gpu-1.7.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.9.0    7       9
tensorflow_gpu-1.6.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.9.0    7       9
tensorflow_gpu-1.5.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.8.0    7       9
tensorflow_gpu-1.4.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.5.4    6       8
tensorflow_gpu-1.3.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.4.5    6       8
tensorflow_gpu-1.2.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.4.5    5.1     8
tensorflow_gpu-1.1.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.4.2    5.1     8
tensorflow_gpu-1.0.0     2.7, 3.3-3.6     GCC 4.8    Bazel 0.4.2    5.1     8

Update 2020.01.19
Tested on (version matters!)
tensorflow 1.8
Cuda 10.0
cudnn 7.6
python 3.6
bazel 0.15.0