This post describes the steps I used to install Theano on my Mac (OS X 10.9.5) with an NVIDIA GeForce GTX 660M graphics card.
Install Theano
Use pip to install Theano:
$ pip install Theano
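To quickly confirm the installation succeeded, you can import Theano and print its version (a minimal sanity check; any recent version string is fine):
$ python -c "import theano; print(theano.__version__)"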
Then create ~/.theanorc with the following content:
[global]
mode = FAST_RUN
floatX = float32
device = gpu

[nvcc]
fastmath = True
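Once the file is in place, you can check that Theano actually picks these settings up, since theano.config mirrors the values from ~/.theanorc. (Until the CUDA Toolkit is installed in the next step, Theano may warn that the GPU device is unavailable and fall back to the CPU.)
$ python -c "import theano; print(theano.config.device); print(theano.config.floatX)"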
Install CUDA Toolkit
Download CUDA Toolkit 7.5 from NVIDIA's official download page and run the installer; the remaining steps assume the toolkit ends up at /usr/local/cuda.
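Once the installer finishes, a quick way to confirm the toolkit is in place is to ask the CUDA compiler for its version (using the full path, since we have not added it to PATH yet):
$ /usr/local/cuda/bin/nvcc --version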
Install cuDNN
Next, we have to register as an NVIDIA developer to be able to download cuDNN, a GPU-accelerated library of primitives for deep neural networks. After downloading, uncompress the package and copy the header file and the libraries into the include and lib directories under the CUDA Toolkit root (e.g. /usr/local/cuda), respectively.
$ tar xzf cudnn-7.0-osx-x64-v3.0-prod.tgz
$ cd cuda
$ sudo cp include/cudnn.h /usr/local/cuda/include/
$ sudo cp lib/libcudnn* /usr/local/cuda/lib/
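You can verify that the header and libraries landed where Theano will look for them:
$ ls /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib/libcudnn*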
Add environment variables
Add the following environment variables to ~/.bash_profile.
# Theano
export CUDA_ROOT="/usr/local/cuda"
export THEANO_FLAGS="mode=FAST_RUN,device=gpu,floatX=float32"
# CUDA
export LD_LIBRARY_PATH="$CUDA_ROOT/lib:$LD_LIBRARY_PATH"
export PATH="$CUDA_ROOT/bin:$PATH"
You may want to run source ~/.bash_profile to apply the settings to your current shell right away.
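To confirm the new variables are visible, try:
$ echo $CUDA_ROOT
$ which nvcc
One caveat: the dynamic linker on OS X consults DYLD_LIBRARY_PATH rather than LD_LIBRARY_PATH, so if Theano later complains that it cannot load the CUDA or cuDNN libraries, try also exporting DYLD_LIBRARY_PATH with the same value.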
Testing
Now we can run a test script (adapted from Theano's documentation) to see if Theano works as expected.
test.py:
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
Let's run this code on the CPU and the GPU separately. Setting THEANO_FLAGS on the command line overrides the values from ~/.theanorc, which is how we force the CPU case below. The script tells the two devices apart by inspecting the compiled graph: a plain Elemwise op means the CPU ran the computation, while GpuElemwise (plus HostFromGpu) means the GPU did.
CPU case:
$ THEANO_FLAGS='device=cpu' python test.py
[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 14.474722 seconds
Result is [ 1.23178029 1.61879337 1.52278066 ..., 2.20771813 2.29967761
1.62323284]
Used the cpu
GPU case:
$ THEANO_FLAGS='device=gpu' python test.py
Using gpu device 0: GeForce GTX 660M (CNMeM is disabled)
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.517552 seconds
Result is [ 1.23178029 1.61879349 1.52278066 ..., 2.20771813 2.29967761
1.62323296]
Used the gpu
Note that on the first GPU run, Theano has to compile the Python graph into C++/CUDA code, which adds a one-time overhead; the timings shown above are therefore from a second run.
Finally, the runtime drops dramatically (roughly 28x here) when the GPU is used : )