# How to harness the Powers of the Cloud TPU?

**Introduction**

About a couple of months ago, I got introduced to __Google Colab__ (an enhanced version of __JuPyteR__ notebooks) and since then didn’t look back. It’s got everything and more you need as a researcher. The initial idea was to load up python notebooks and run them on it. Soon we realised we can actually run those notebooks not just on CPUs available on GCP but also __GPUs__ and __TPUs__. Also, read up a bit about it from other sources, see __Google Reveals Technical Specs and Business Rationale for TPU Processor__ (slightly dated, but definitely helpful).

**Here we go now…**

**What is a TPU?**

Just like a GPU is a graphics accelerator __ASIC__ (to help create graphics/images quickly for output to a display device) — which since long have been discovered that can be taken advantage of by using it for massive number crunching. Similarly, a TPU is an AI accelerator __ASIC__ developed by Google specifically for __Neural Network__ __Machine Learning__. One of the differences being TPUs are more about processing high-volume low-precision computation while GPUs for high-volume high-precision computation (please check the Wikipedia links for the interesting differences).

**Then what?**

**TL;DR — how the notebooks and slides came about**

To understand how these devices work, what better approach can we adopt than to benchmark them and then compare the results. Which is why we had a number of notebooks that came out of these experiments. And then we were also coincidentally preparing for __Google Cloud Next 2018__ at the Excel Center, London. And I spent an evening and a weekend preparing the slides for the talk Yaz and I were asked to give — __Harnessing the Powers of Cloud TPU__.

**What happened first, before the other thing happened?**

**TL;DR — how we work at the GDG Cloud meetup events**

It was a mere coincidence while we were meeting regularly at the __GDG Cloud meetups__ in the London chapter, where __Yaz__ would find interesting things to look at during the hack sessions (Pomodoro sessions — as he called them), suggested one session that we play with TPUs and benchmark them. Actually, I remember suggesting we do this during our sessions as an idea, but then we all got distracted with other equally interesting ideas (all saved somewhere on __GitLab__). And then I frantically started playing with two notebooks related to __GPU__ and __TPU__ respectively, provided as __examples by Colab__, which I then got to work on the TPU and then adapted it to work on the GPU (can’t remember anymore which way first). They both were doing slightly different things and measuring the performance of the GPU and TPU and I decided to make them do the same thing and measure the time taken to do it on the different device. Also, display details about the devices themselves (you will see towards the top or bottom of each notebook).

**CPU v/s GPU v/s TPU — Simple benchmarking example via Google Colab**

**CPU v/s GPU — Simple benchmarking**

The __CPU v/s GPU — Simple benchmarking__ notebook finish processing with the below output:

```
TFLOP is a bit of shorthand for “teraflop”, which is a way of
measuring the power of a computer based more on mathematical
capability than GHz. A teraflop refers to the capability of a
processor to calculate one trillion floating-point operations
per second.
```

```
CPU TFlops: 0.53
GPU speedup over CPU: 29x TFlops: 15.70
```

I was curious about the internals of the CPU and GPU so I ran some Linux commands (via Bash) that the notebooks allow (thankfully) and got these bits of info to share:

You can find all those commands and the above output in the notebook as well.

**CPU v/s TPU — Simple benchmarking**

The __TPU — Simple benchmarking__ notebook finish processing with the below output:

```
TFLOP is a bit of shorthand for “teraflop” which is a way of
measuring the power of a computer based more on mathematical
capability than GHz. A teraflop refers to the capability of a
processor to calculate one trillion floating-point operations
per second.
```

```
CPU TFlops: 0.47
TPU speedup over CPU (cold-start): 75x TFlops: 35.47
TPU speedup over CPU (after warm-up): 338x TFlops: 158.91
```

Unfortunately, I haven’t had a chance to play around with the TPU profiler yet to learn more about the internals of this fantastic device.

While there is room for errors and inaccuracies in the above figures, you might be curious about the tasks used for all the runs — it’s the below piece of code that has been making the CPU, GPU and TPU circuitry for the ** Simple benchmarking** notebooks:

```
def cpu_flops():
x = tf.random_uniform([N, N])
y = tf.random_uniform([N, N])
def _matmul(x, y):
return tf.tensordot(x, y, axes=[[1], [0]]), y
```

```
return tf.reduce_sum(
tf.contrib.tpu.repeat(COUNT, _matmul, [x, y])
)
```

**CPU v/s GPU v/s TPU — Time-series prediction via Google Colab**

Running the __TPU version of the Timeseries notebook__ gave us some issues initially, which was reported on a __StackOverflow post__ and a couple of good folks from the Google Cloud TPU team stepped in to help. But we managed to get the __GPU version of the Time-Series Prediction__ notebook to work which clearly showed a much better response than the __CPU version of Time-Series Prediction__ notebook — this version just choked half-way through the CPU-cycles and Colab asked me if I wanted to stop the process because we needed more resources (more memory)!!!

**Time series: GPU version**

Here snapshots of the notebook (** Train the Recurrent Neural Network **section), the full

__notebook can be found on Google Colab__, it’s free to download, share and extract the python code in it.

```
Epoch 1/10
9/10 [==========================>...] - ETA: 4s - loss: 0.0047WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_loss` which is not available. Available metrics are: loss,lr
10/10 [==============================] - 42s 4s/step - loss: 0.0048
Epoch 2/10
9/10 [==========================>...] - ETA: 4s - loss: 0.0041WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_loss` which is not available. Available metrics are: loss,lr
10/10 [==============================] - 42s 4s/step - loss: 0.0041
Epoch 3/10
9/10 [==========================>...] - ETA: 4s - loss: 0.0047WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_loss` which is not available. Available metrics are: loss,lr
10/10 [==============================] - 42s 4s/step - loss: 0.0046
Epoch 4/10
9/10 [==========================>...] - ETA: 4s - loss: 0.0039WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_loss` which is not available. Available metrics are: loss,lr
10/10 [==============================] - 43s 4s/step - loss: 0.0039
Epoch 5/10
9/10 [==========================>...] - ETA: 4s - loss: 0.0048WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_loss` which is not available. Available metrics are: loss,lr
10/10 [==============================] - 44s 4s/step - loss: 0.0048
Epoch 6/10
9/10 [==========================>...] - ETA: 4s - loss: 0.0036WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_loss` which is not available. Available metrics are: loss,lr
10/10 [==============================] - 44s 4s/step - loss: 0.0037
Epoch 7/10
9/10 [==========================>...] - ETA: 4s - loss: 0.0042WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_loss` which is not available. Available metrics are: loss,lr
10/10 [==============================] - 44s 4s/step - loss: 0.0041
Epoch 8/10
9/10 [==========================>...] - ETA: 4s - loss: 0.0037WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_loss` which is not available. Available metrics are: loss,lr
10/10 [==============================] - 44s 4s/step - loss: 0.0038
Epoch 9/10
9/10 [==========================>...] - ETA: 4s - loss: 0.0040WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_loss` which is not available. Available metrics are: loss,lr
10/10 [==============================] - 44s 4s/step - loss: 0.0039
Epoch 10/10
9/10 [==========================>...] - ETA: 4s - loss: 0.0036WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
WARNING:tensorflow:Reduce LR on plateau conditioned on metric `val_loss` which is not available. Available metrics are: loss,lr
10/10 [==============================] - 44s 4s/step - loss: 0.0035
CPU times: user 10min 8s, sys: 1min 22s, total: 11min 30s
Wall time: 7min 12s
```

The above run takes about ~7 mins (or total time of ~8 mins) on the Google Colab GPU:

CPU times:user 10min 8s, sys: 1min 22s, total: 11min 30s

Wall time:7min 12s

I’m still unsure on how to interpret this time-related stats but I will take ~7 mins as our execution time till this point.

We finish with the following stats:

So now you can see why I earlier chose *~7 mins* as the execution time. So it takes about 7 minutes to process this notebook — giving new predictions of temperature, pressure and wind speed and comparing it with the actual values (true values gather from post observations).

**Time series: TPU version**

Here snapshots of the notebook (** Train the Recurrent Neural Network**section), the full

__notebook can be found on Google Colab__, it’s free to download, share and extract the python code in it.

```
Found TPU at: grpc://10.118.17.162:8470
INFO:tensorflow:Querying Tensorflow master (b'grpc://10.118.17.162:8470') for TPU system metadata.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, 11845881175500857789)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 5923571607183194652)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 11085218230396215841)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 12636361223481337501)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 14151025931657390984)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, 16816909163217742616)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, 4327750408753767066)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, 504271688162314774)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, 14356678784461051119)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 6767339384180187426)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, 1879489006510593388)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 17179869184, 17850015066511710434)
WARNING:tensorflow:tpu_model (from tensorflow.contrib.tpu.python.tpu.keras_support) is experimental and may change or be removed at any time, and without warning.
Epoch 1/10
INFO:tensorflow:New input shapes; (re-)compiling: mode=train (# of cores 8), [TensorSpec(shape=(32,), dtype=tf.int32, name='core_id0'), TensorSpec(shape=(32, 1344, 20), dtype=tf.float32, name='input_10'), TensorSpec(shape=(32, 1344, 3), dtype=tf.float32, name='Dense-2_target_30')]
INFO:tensorflow:Overriding default placeholder.
INFO:tensorflow:Remapping placeholder for input
INFO:tensorflow:Started compiling
INFO:tensorflow:Finished compiling. Time elapsed: 3.394456386566162 secs
INFO:tensorflow:Setting weights on TPU model.
9/10 [==========================>...] - ETA: 1s - loss: 0.0112WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
10/10 [==============================] - 14s 1s/step - loss: 0.0115
Epoch 2/10
9/10 [==========================>...] - ETA: 0s - loss: 0.0183WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
10/10 [==============================] - 5s 501ms/step - loss: 0.0187
Epoch 3/10
9/10 [==========================>...] - ETA: 0s - loss: 0.0260WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
10/10 [==============================] - 5s 497ms/step - loss: 0.0264
Epoch 4/10
9/10 [==========================>...] - ETA: 0s - loss: 0.0324WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
10/10 [==============================] - 5s 496ms/step - loss: 0.0327
Epoch 5/10
9/10 [==========================>...] - ETA: 0s - loss: 0.0374WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
10/10 [==============================] - 5s 470ms/step - loss: 0.0376
Epoch 6/10
9/10 [==========================>...] - ETA: 0s - loss: 0.0378WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
10/10 [==============================] - 5s 471ms/step - loss: 0.0372
Epoch 7/10
9/10 [==========================>...] - ETA: 0s - loss: 0.0220WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
10/10 [==============================] - 5s 489ms/step - loss: 0.0211
Epoch 8/10
9/10 [==========================>...] - ETA: 0s - loss: 0.0084WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
10/10 [==============================] - 5s 493ms/step - loss: 0.0081
Epoch 9/10
9/10 [==========================>...] - ETA: 0s - loss: 0.0044WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
10/10 [==============================] - 5s 495ms/step - loss: 0.0044
Epoch 10/10
9/10 [==========================>...] - ETA: 0s - loss: 0.0040WARNING:tensorflow:Early stopping conditioned on metric `val_loss` which is not available. Available metrics are: loss
WARNING:tensorflow:Can save best model only with val_loss available, skipping.
10/10 [==============================] - 5s 491ms/step - loss: 0.0040
CPU times: user 10.8 s, sys: 2.7 s, total: 13.5 s
Wall time: 1min 6s
```

There are a few of things to still work on the notebook, one of them being getting rid of the warnings appearing during training on the TPU. As per our previous analysis, running on the TPUs should be way faster than GPUs, correction: TPUs are faster than GPUs even when running the Timeseries TPU version (I aligned the two notebooks i.e. GPU and TPU versions of the Timeseries notebooks and re-ran the two experiments). We haven’t been successful in being able to execute code cells till the end of the notebook due to errors in input shape which needs fixing and re-running the notebook. All of these look like good learning opportunities for me and everyone else. Our new results are more promising as you can see from above. And the whole notebook just took ~2 minutes to finish running, that’s so many times speed up over the GPU version (from the below).

**Observations**

Whilst we couldn’t see the final outcome of the __TPU version of the Timeseries notebook__, the initial ** Simple Benchmarking** related examples did help us make the below observations:

TPUs are~85x to ~312xfaster than CPUs, and GPUs are~30xfaster than CPUs

which also means that

TPUs are~3xto~10xfaster than GPUs, which in turn are~30xfaster than CPUs

**Some graphs plotting the speeds in teraflops and times between CPU v/s GPU v/s TPU:**

*Note:**that during some runs on GCP, the above numbers showed higher than noted here so please beware of that as well. Maybe the notebooks took advantage of some improvements in the GCP infrastructure.*

The different devices ran this simple task (block of code mentioned in the previous section) at different speeds, for a more complex task, the numbers would definitely be different, although we believe that their relative performances shouldn’t deviate too much.

**Conclusion**

Also, want to thank __Yaz__ for being super-encouraging at all times during this whole process including for the presentation at Google Cloud Next 2018. Not forgetting Claudio who contributed quite a bit to the __TPU version of the Timeseries notebook__, whilst we have been debugging it.

In the end, it was all great and in the end, we will solve all problems of humanity but for now, we are occupied with working on the __TPU version of the Timeseries notebook__, I welcome everyone to take a jab at it and see if you can help make it work. Please hit back with your feedback and/or contributions in any case.

**(Third generation TPU at the Google Data centre: TPU 3.0)**

**Have a read of what others (i.e.** __Jeff Hale’s article__**) are talking about the different PUs on various cloud providers and you can see GCP Cloud leading in many such areas.**

*Be ready for some more notebook fests coming up in the form of more blog posts in the near to distant future. Please share your comments, feedback or any contributions to* __@theNeomatrix369,__*you can find more about me via the* __About me page__*.*

**Resources**

- Harnessing the Powers of Cloud TPUs (slides) — __bit.ly/harnessing-tpus__

**Notebooks**

- __Time-Series Prediction — CPU version__

- __Time-Series Prediction — GPU version__

- __Time-Series Prediction — TPU version__

- __Google Reveals Technical Specs and Business Rationale for TPU Processor__

- __Best Deals in Deep Learning Cloud Providers__

**Citations**

Credits to all the images embedded (including the feature image) on this post go to the respective authors/creators/owners of the images.