Local Training


What is local training?

This is a setup that lets you train your DeepRacer without using the AWS DeepRacer Console, SageMaker or RoboMaker services, either locally or on a single remote server. AWS provide the source code of the SageMaker containers, a Jupyter Notebook that is loaded as a sample in a SageMaker Notebook to run the training, and all the setup built on top of rl_coach for both training and simulating DeepRacer. The members of the AWS DeepRacer Community have assembled these into a setup that can be run on Ubuntu Linux and can use a graphics card if one is available. History provides some information on how local training evolved over time.

When talking about local training, at least for now, three versions are being mentioned: the 2019 version, the 2020 pre-season version rolled out around re:Invent 2019, and the latest version (March 2020). Each is described in its own section below.

What is in the local training?

There are three main elements:

  • Minio is local storage implementing the S3 communication protocol. You can use tools from AWS to talk to it (see the example below). In Chris' config you start up a binary on your system, in Alex' config it's a Docker container,
  • SageMaker is a Python utility that starts up Redis to store the training data and a Tensorflow-based piece of software that does the training for you - this one uses your GPU if you have one, builds new models based on the data in Redis and saves them in your S3 bucket,
  • RoboMaker is a set of tools around the Gazebo simulation engine in which the simulated training is executed. It reads the model from the bucket, performs a simulation and stores the episode data in Redis, from which SageMaker picks it up.

This is pretty much it. Chris uses Python virtualenv to run SageMaker on your system (it then starts a Docker image), Alex wraps around it with another container (Docker in Docker) with all the tools needed.
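
For illustration, this is roughly how you can talk to Minio with the standard AWS CLI. The port, credentials and bucket name below are assumptions - use whatever your Minio was started with:

# point the AWS CLI at the local Minio instead of the real S3
export AWS_ACCESS_KEY_ID=minio_access_key
export AWS_SECRET_ACCESS_KEY=minio_secret_key
aws --endpoint-url http://localhost:9000 s3 ls s3://bucket/
aws --endpoint-url http://localhost:9000 s3 cp custom_files/reward.py s3://bucket/custom_files/reward.py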

An added extra is the log analysis tool that AWS have provided and I have expanded slightly. I used virtualenv to run it; Alex wrapped it in a container.

Why is there local training?

There are two main reasons:

  • Cost optimisation - running the training locally or on a cloud-hosted server provides significant savings compared to running through the AWS DeepRacer Console
  • Advanced capabilities - the power of the console lies in its simplicity and ease of use for new starters. As you progress to more advanced training approaches, local training lets you dive into the internals just like the SageMaker Notebook does, but exposes them at any point of the training

Where can I run local training and where can't I?

We refer to it as local training most of the time, but it doesn't have to be very local. Interestingly, running an EC2 or Spot instance is already cheaper than using the AWS DeepRacer Console. Many environments have been tried already; some worked, some didn't:

  • Ubuntu Linux 18.04 - this is the default one. Things just work. Mostly. I use Ubuntu and am happy with it, I've set the training up three times so far,
  • Other Linux distros - there have been reports of people making it work on Arch, Manjaro and CentOS. I don't think there are any notes for that. Ask if you face problems, take notes, share them,
  • AWS EC2 Instances - a number of folks are running them, so you can be fairly confident you'll get some help on the channel. Jarrett Jordaan (jarrett on the community Slack) is your go-to guy for an AMI. Demeanour has also reported that the Ubuntu version of the Deep Learning AMI has all the required tools, only gnome-terminal is missing - there may be some updates and consolidation of information in this area soon,
  • AWS EC2 Spot Instances - they come with greater savings, but you can lose one every now and then in the middle of the training. You don't lose the data though (not with EBS at least). You can get back to training later. Some instances are difficult to get a hold of. Talk to Jarrett, ask in the channel on the community Slack
  • Windows - It does not work. People have tried and there were issues. RayG has assembled his own notes with a description of the problems he faced: SettingUpDeepRacerLocalTrainingOnWin10.txt. You can team up with RayG and progress this. I know he went for dual boot and is very happy with his Ubuntu. I read that with the most recent WSL2 more can be done with Windows and Docker, but GPU support is not yet available and no one has tried it yet. Be the first,
  • Macs - some people started the training effortlessly, some are banging their head against the wall. I've seen these notes on setting up AWS DeepRacer local training on Mac - could be useful. Kevin from the community Slack also prepared a reviewed and cleaned up version - he will appreciate all feedback. Ask in the local-training-setup channel, maybe there was progress. Many people switched to Ubuntu. Also, RayG says Apple no longer supports Nvidia drivers in the latest macOS; AMD is possibly still supported,
  • Nvidia Jetson - RichardFan (I think) got this nice toy from Nvidia and tried to set it up. It has an ARM processor, which meant hardly any dependencies were available straight away. After a lot of fighting he switched to EC2, I believe,
  • Google Cloud - Shivam Garg and since then many others have managed to get it to work. Under the free account the GPU is not available; when you upgrade your account you still have the credits and you get to use the GPU, but double check that the credits will cover it (possibly yes). Finlay Macrae from the community wrote an instruction for getting GPU training going on GCP - check it out if you want to quickly swap from preemptible instances or switch region. Paul M. Reese from the Udacity Challenge Slack wrote a very detailed article on setting up local DeepRacer training on Google Cloud Platform which you might find handy if you need more details,
  • Google Colab - f-racer in community Slack is wondering if it can be used. Team up with him and you can try. Make notes please.
  • Azure - LarsLL from the community Slack has used an Azure NC6 VM with Spot pricing, which costs him €0.20 per hour of training. Read his guide to running DeepRacer training on Azure

Using a local machine

If you decide to go for the local-local setup, you might want to read this:

  • you don't need a very powerful machine. You can go slow with your CPU - it will take a long time to train something usable, but it may be fine. You're probably here to learn, so it's fine,
  • if you have a graphics card, there is a large group of Nvidia card users and a group of one using AMD. While there are more folks with Nvidia, the AMD guy is your very favourite Chris, the author of the deepracer local training repo, so there is some chance of support for either one,
  • if you have an Nvidia card, check its compute capability (see the quick check after this list) - the Tensorflow magic won't work easily on cards with compute capability < 3.1. Jouni from the community Slack (the winner of the Stockholm Summit race) prepared himself a setup for the other cards, but wasn't able to perform a single training due to lack of RAM on the card (if I understand properly). RAM sounds important; my card has 6 GB,
  • box PC > laptop. While you can do local training on a laptop, it will sound like a hairdryer. I like my laptop and it helped me do awesome stuff. I didn't want it to overheat and get damaged, so I magicked up a box PC. Even CPU-only, it has much more efficient cooling,
  • If you already have a computer, it's probably good to just work with it,
  • Some disk space will be needed. The training will generate around 500 MB per model. Sometimes more, sometimes less. And you will need around 15 GB to install all the prerequisites.
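
If you're not sure what card and how much memory you have, a quick check (with the Nvidia driver installed) is shown below; the output is just an example, and you then look the card up in Nvidia's CUDA GPUs table for its compute capability:

nvidia-smi --query-gpu=name,memory.total --format=csv
name, memory.total [MiB]
GeForce RTX 2070, 7981 MiB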

Using EC2 instance

  • c4.2xls - Nick Kuhl said it's got the CPU power that the cheaper GPUs are missing
  • there are no 2xls that are GPU accelerated and the 4xls are just too expensive (just a comment I've seen)
  • Jarrett got GPU working on g3s.xlarge, had to use some AWS guide to optimize the GPU. He uses a spot instance at $0.25/h
  • Jochem Lugtenburg used p2.xlarge. He tried g3.4xlarge and it ran Gazebo in real time for $0.4324/h (spot instance, regular is pricier). For Jarrett it kept stopping as they were in high demand. Jochem also tried g2.2xlarge
  • Bobby Stenly runs a c5.2xlarge instance and the policy training takes him 3-10 minutes (slowing down with time)

People use Spot instances, add some 50 GB of EBS and set it not to be deleted on termination (see the example at the end of this section). Be aware there can be data-out charges while you watch the stream. It is recommended to use the following arguments to keep the stream small and avoid excessive charges:

<ip>:8888/stream_viewer?topic=/racecar/deepracer/kvs_stream&quality=10&width=400&height=300

See the AWS EC2 Instances notes above for more details on the setup.
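
As an illustration only, this is roughly how you would launch a Spot instance with a 50 GB root volume that is kept on termination, using the AWS CLI. The AMI ID, key name, device name and instance type are placeholders - yours will differ:

aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type g3s.xlarge \
  --key-name my-key \
  --instance-market-options '{"MarketType":"spot"}' \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":50,"DeleteOnTermination":false}}]'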

2020 version

In 2020, AWS announced DeepRacer Evo, which has stereo cameras and a LIDAR sensor. Along with this announcement, AWS also enhanced the DeepRacer League by adding head-to-head and object avoidance races. In order to catch up with those changes, the community has updated the local training stack.

Pre-season version

AWS rolled out this version during re:Invent 2019; most changes adapt to the new DeepRacer Evo and the new racing types:

  • Race with objects on track
  • Race with AI cars on track
  • Support stereo camera
  • Support LIDAR sensor
  • Support 3-layer and 5-layer CNN
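
As a rough sketch only (the key names and values below are assumptions based on community findings, not an official reference), the sensors and the network depth end up being selected in the model's model_metadata.json, along these lines:

cat model_metadata.json
{
    "action_space": [
        {"steering_angle": -30, "speed": 1.0, "index": 0},
        {"steering_angle": 0, "speed": 2.0, "index": 1},
        {"steering_angle": 30, "speed": 1.0, "index": 2}
    ],
    "sensor": ["STEREO_CAMERAS", "LIDAR"],
    "neural_network": "DEEP_CONVOLUTIONAL_NETWORK_SHALLOW",
    "version": "3"
}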

Repositories

Latest version

In March 2020, AWS rolled out some new features for model training. The community is now working on implementing this version and consolidating efforts from different people into a single stack.

Customise Local Training

Repositories

2019 version

Quick start up guide

Prerequisites

The below are for Nvidia GPUs. Read the README of Chris' original repository for AMD.

  • Docker - latest Docker setup. Do NOT install from the default repositories as they always contain older versions. Remember to follow the steps from "Post-installation steps for Linux" in there.
  • Docker Compose - Docker Compose setup - Alex used the latest syntax of docker-compose.yml so it's preferable to have the latest version.
  • Nvidia driver - a useful guide on how to install Nvidia drivers on Ubuntu.
  • Nvidia Docker (if you have and wish to use an Nvidia GPU) - this is a tool that installs a runtime for Docker which utilises your local GPU. Follow the deprecated section of the nvidia-docker-2 README (Important: remove --upgrade-only from the command). A small note here: this is likely to change soon. In the latest Docker Engine GPUs are supported out of the box, but Docker Compose does not support that yet, so there isn't an easy way to set it up and use it. Luckily the setup just works with this integration (see the sketch after this list).
  • CUDA download (for deepracer-for-dummies this comes in a Docker container)
  • cuDNN download (for deepracer-for-dummies this comes in a Docker container)
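
For reference, a sketch of what /etc/docker/daemon.json typically looks like after installing nvidia-docker2 and setting the nvidia runtime as the default (which is what lets docker-compose containers see the GPU) - treat this as an assumption and compare with your own file:

cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}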

How to check the prerequisites are installed

If any results differ, see How to ask questions

Docker

Run

docker run hello-world 

Expected output (the pull part might be missing if the image has already been pulled):

$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
1b930d010525: Pull complete 
Digest: sha256:4fe721ccc2e8dc7362278a29dc660d833570ec2682f4e4194f4ee23e415e1064
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

Docker compose

Run

docker-compose version

The expected output (there may be version differences):

$ docker-compose version
docker-compose version 1.24.0, build 0aa59064
docker-py version: 3.7.2
CPython version: 3.6.8
OpenSSL version: OpenSSL 1.1.0j  20 Nov 2018

nvidia drivers

Run

nvidia-smi

The expected output should be something like the below (versions may not be identical and processes will differ; for older GPUs the processes may not be listed and "Not Supported" may be shown - that's still fine):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:08:00.0  On |                  N/A |
| 27%   31C    P8    23W / 215W |   7962MiB /  7981MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1338      G   /usr/lib/xorg/Xorg                            40MiB |
|    0      1390      G   /usr/bin/gnome-shell                          51MiB |
|    0      1696      G   /usr/lib/xorg/Xorg                           379MiB |
|    0      1828      G   /usr/bin/gnome-shell                         291MiB |
|    0      6521      G   /usr/lib/firefox/firefox                       3MiB |
|    0     18422      C   /usr/bin/python3.6                          7187MiB |
|    0     32185      G   /usr/lib/firefox/firefox                       5MiB |
+-----------------------------------------------------------------------------+

nvidia docker

Run

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

The expected output should be something like the nvidia drivers output above.

History

The first version of local training was shared by Chris Rhodes on his GitHub. Not long after, Alex Schultz provided a fully dockerized version of DeepRacer training, which was then extended by Autonomous Race Car Community members so that there was a UI to train DeepRacer. This was the state in which we remained until AWS re:Invent 2019. After the conference a group within the community started working on improving the local training and getting it to support the console capabilities released during the conference.

External resources

Frequently asked questions

Local Training FAQ

Troubleshooting

Local Training Troubleshooting