- 1 What is local training?
- 2 Why is there a local training?
- 3 Where can I run local training and where I cannot?
- 4 Easy setup
- 5 New version for 2020
- 6 History
- 7 External resource
- 8 Frequently asked questions
- 9 Troubleshooting
What is local training?
This is a setup that lets you train your DeepRacer without the AWS DeepRacer Console, SageMaker or RoboMaker services, either locally or on a single remote server. AWS provide the source code of the SageMaker containers, a Jupyter Notebook that is loaded as a sample in SageMaker Notebook to run the training, and all the setup built on top of rl_coach for both training and simulating DeepRacer. Members of the AWS DeepRacer Community have assembled these into a setup that runs on Ubuntu Linux and can use a graphics card if one is available. The History section describes how local training evolved over time.
When talking about local training, at least for now, four versions are mentioned:
- Chris' local training - the first version, prepared by Chris Rhodes
- Alex' DeepRacer for Dummies - prepared using Chris' stack, made simpler for use thanks to Docker Compose
- ARCC DeepRacer for Dummies - updated on top of Alex' project, with some modifications to the code and with added graphical user interface
- LarsLL's DeepRacer for Cloud - updated on top of Alex' repo with modifications to enable quick installation in a cloud environment (Azure or AWS) using native cloud features where applicable.
What is in the local training?
There are three main elements:
- Minio is local storage implementing the S3 communication protocol. You can use tools from AWS to talk to it. In Chris' config you start up a binary on your system, in Alex' config it's a Docker container,
- SageMaker is a Python utility that starts up a Redis instance to store the training data and TensorFlow-based software that does the training for you - this one uses your GPU if you have one, builds new models based on the data in Redis and saves them in your S3 bucket,
- RoboMaker is a set of tools built around the Gazebo simulation engine, in which the simulated training is executed. It reads data from the bucket, performs a simulation and stores the results in Redis, from which SageMaker takes them.
This is pretty much it. Chris uses a Python virtualenv to run SageMaker on your system (it then starts a Docker image), while Alex wraps it in another container (Docker in Docker) with all the tools needed.
An added extra is the log analysis tool that AWS have provided and I have expanded slightly. I used virtualenv to run it, Alex wrapped it in a container.
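Because Minio implements the S3 protocol, the AWS CLI can be pointed at it with its `--endpoint-url` option. The sketch below is only an illustration: the endpoint `http://localhost:9000` (Minio's default port), the function names and the `bucket` placeholder are assumptions, and the AWS CLI must be configured with the access key and secret that your Minio was started with.

```shell
#!/bin/sh
# Assumption: Minio listens on http://localhost:9000 (its default port).
# Set MINIO_ENDPOINT beforehand if your setup differs.
MINIO_ENDPOINT="${MINIO_ENDPOINT:-http://localhost:9000}"

# Print the AWS CLI command line for an S3 operation against Minio,
# so it can be reviewed before anything is run.
minio_s3_cmd() {
    echo "aws --endpoint-url $MINIO_ENDPOINT s3 $*"
}

# Run an S3 operation against Minio instead of AWS.
minio_s3() {
    aws --endpoint-url "$MINIO_ENDPOINT" s3 "$@"
}

# Usage (needs the AWS CLI and a running Minio; "bucket" is a placeholder):
#   minio_s3 ls
#   minio_s3 ls s3://bucket/
```

Keeping the endpoint in one variable means the same helpers work whether Minio runs as a binary or in a container, as long as the port is reachable.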
Why is there a local training?
There are two main reasons:
- Cost optimisation - running the training locally or on a cloud-hosted server provides significant savings compared to running through the AWS DeepRacer Console
- Advanced capabilities - the power of the console lies in simplicity and ease of use for new starters. As you progress to more advanced training approaches, local training lets you dive into the internals just like the SageMaker Notebook does, and exposes all of them at any point of training
Where can I run local training and where I cannot?
We are referring to local training most of the time, but it doesn't have to be very local. Interestingly, running an EC2 or a Spot instance is already cheaper than using the AWS DeepRacer Console. Many environments have been tried already, some worked, some didn't:
- Ubuntu Linux 18.04 - this is the default one. Things just work. Mostly. I use Ubuntu and am happy with it, I've set the training up three times so far,
- Other Linux distros - there have been reports of people making it work on Arch, Manjaro and CentOS. I don't think there are any notes for that. Ask if you face problems, take notes, share them,
- AWS EC2 Instances - there are a number of folks running them, so you can be fairly confident you'll get some help for them on the channel. Jarrett Jordaan (jarrett on the community Slack) is your go-to guy for an AMI. Demeanour has also reported that the Ubuntu version of the Deep Learning AMI has all the required tools, only gnome-terminal is missing - there may be some updates and consolidation of information in this area soon,
- AWS EC2 Spot Instances - they come with greater savings, but you can lose one every now and then in the middle of the training. You don't lose the data though (not with EBS at least). You can get back to training later. Some instances are difficult to get a hold of. Talk to Jarrett, ask in the channel on the community Slack
- Windows - It does not work. People have tried and there were issues. RayG has assembled his own notes describing the problems he faced: SettingUpDeepRacerLocalTrainingOnWin10.txt. You can team up with RayG and progress this. I know he went for dual boot and is very happy with his Ubuntu. I read that with the most recent WSL2 more can be done with Windows and Docker, but GPU support is not yet available and no one has tried it yet. Be the first,
- Macs - some people started the training effortlessly, some are banging their heads against the wall. I've seen these notes on setting up AWS DeepRacer local training on mac, which could be useful. Kevin from the community Slack has also prepared a reviewed and cleaned up version - he will appreciate all feedback. Ask in the local-training-setup channel, maybe there was progress. Many people switched to Ubuntu. RayG also says Apple no longer supports Nvidia drivers in the latest macOS; AMD is possibly still supported,
- Nvidia Jetson - RichardFan (I think) got this nice toy from Nvidia and tried to set it up. It has an ARM processor, which meant hardly any dependencies were available straight away. After a long fight he switched to EC2, I believe,
- Google Cloud - Shivam Garg, and since then many others, have managed to get it to work. Under the free account a GPU is not available. When you upgrade your account you keep the credits and get to use a GPU, but double check that the credits can be spent on it (possibly yes). Finlay Macrae from the community wrote instructions to get GPU training going on GCP, and if you want to quickly swap from a preemptible instance or switch region, check this out. Paul M. Reese from the Udacity Challenge Slack wrote a very detailed article on setting up local DeepRacer training on Google Cloud Project, which you might find handy if you need more details,
- Google Colab - f-racer in community Slack is wondering if it can be used. Team up with him and you can try. Make notes please.
- Azure - LarsLL from the community Slack has used an Azure NC6 VM with Spot pricing, which costs him 0.2€ per hour of training. Read his guide to running DeepRacer training on Azure
Using a local machine
If you decide to go for the local-local setup, you might want to read this:
- you don't need a very powerful machine. You can go slow with your CPU - it will take long to train to something usable, but it may be fine. You're probably here to learn so it's fine,
- if you have a graphics card, there is a large group of Nvidia card users and a group of one using AMD. While there are more folks with Nvidia, the AMD guy is your very favourite Chris, the author of the DeepRacer local training repo, so there is some chance of support for either one,
- if you have an Nvidia card, check its compute capability - TensorFlow magic won't work easily on cards with compute capability < 3.1. Jouni from the community Slack (the winner of the Stockholm Summit race) prepared himself a setup for the other cards, but wasn't able to perform a single training due to lack of RAM on the card (if I understand properly). RAM sounds important, my card has 6GB,
- box PC > laptop. While you can do local training on a laptop, it will sound like a hairdryer. I like my laptop and it helped me do awesome stuff. I didn't want it to overheat and get damaged, so I magicked up a box PC. Even on CPU only it has much more efficient cooling,
- If you already have a computer, it's probably good to just work with it,
- A lot of disk space will be needed. The training can generate as much as 200 GB per day. Sometimes more, sometimes less.
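To put the disk space figure into perspective, here is a rough sketch that estimates how many full days of training fit on a disk. The function names are made up for this example, the 200 GB per day default is only the upper figure quoted above, and `free_gb_at` assumes GNU `df` as shipped with Ubuntu.

```shell
#!/bin/sh
# Estimate how many full days of training fit into the free disk space.
# $1: free space in GB, $2: GB generated per day (default 200, the figure
# quoted above; your own rate may be higher or lower).
training_days_left() {
    free_gb="$1"
    gb_per_day="${2:-200}"
    echo $((free_gb / gb_per_day))
}

# Free space in whole GB on the filesystem holding the given path (GNU df).
free_gb_at() {
    df -BG --output=avail "$1" | tail -n 1 | tr -dc '0-9'
}

# Usage:
#   training_days_left "$(free_gb_at .)"
#   training_days_left 500        # a 500 GB disk at 200 GB per day
```

Integer division is deliberate here: a partly filled last day is not counted, which errs on the side of cleaning up early.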
Using EC2 instance
- c4.2xlarge - Nick Kuhl said it's got the CPU power that the cheaper GPU instances are missing
- there are no 2xlarge instances that are GPU accelerated and the 4xlarge ones are just too expensive (just a comment I've seen)
- Jarrett got the GPU working on g3s.xlarge, but had to use an AWS guide to optimize the GPU. He uses a spot instance at $0.25/h
- Jochem Lugtenburg used p2.xlarge. He tried g3.4xlarge and it ran Gazebo in real time for $0.4324/h (spot instance, regular is pricier). For Jarrett it kept stopping as they were in high demand. Jochem also tried g2.2xlarge
- Bobby Stenly runs a c5.2xlarge instance and the policy training takes him 3-10 minutes (slowing down with time)
People use Spot instances, add some 500GB EBS and set it not to be deleted on termination. See Info about EC2 above for more details on the setup.
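The hourly prices quoted above can be turned into a quick cost estimate. This is a minimal sketch: `instance_cost` is a name made up for the example, it covers instance time only (EBS storage is billed separately and is not included), and spot prices change over time.

```shell
#!/bin/sh
# Cost of running an instance: hourly price times hours, two decimals.
# awk is used because POSIX shell arithmetic is integer-only.
# $1: price per hour, $2: number of hours.
instance_cost() {
    awk -v price="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", price * hours }'
}

instance_cost 0.25 24     # g3s.xlarge spot price above, a full day: 6.00
instance_cost 0.4324 24   # g3.4xlarge spot price above, a full day: 10.38
```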
Quick start up guide
The steps below are for Nvidia GPUs. Read the README of Chris' original repository for AMD.
- Docker - Latest docker setup. Do NOT install from default repositories as they always contain older versions. Remember to follow steps from "Post-installation steps for Linux" in there.
- Docker Compose - Docker Compose setup - Alex used the latest syntax of docker-compose.yml, so it's preferable to have the latest version.
- Nvidia driver - a useful guide on how to install Nvidia drivers on Ubuntu.
- Nvidia Docker (if you have and wish to use an Nvidia GPU) - this is a tool that installs a runtime for your Docker which utilizes your local GPU. Follow the deprecated section of the nvidia-docker2 README (Important: remove --only-upgrade from the command). A small but important note: this is likely to change soon. The latest Docker Engine supports GPUs out of the box, but Docker Compose does not support that, so there isn't an easy way to set it up and use it. Luckily the setup just works with the nvidia-docker2 integration.
- CUDA download (for DeepRacer for Dummies this comes in a Docker container)
- cuDNN download (for DeepRacer for Dummies this comes in a Docker container)
How to check the prerequisites are installed
If any results differ, see How to ask questions
docker run hello-world
Expected output (the pull part might be missing if the image has already been pulled):
$ docker run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
1b930d010525: Pull complete
Digest: sha256:4fe721ccc2e8dc7362278a29dc660d833570ec2682f4e4194f4ee23e415e1064
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/
docker-compose version
The expected output (there may be version differences):
$ docker-compose version
docker-compose version 1.24.0, build 0aa59064
docker-py version: 3.7.2
CPython version: 3.6.8
OpenSSL version: OpenSSL 1.1.0j  20 Nov 2018
nvidia-smi
The expected output should be something like this (versions may not be identical and processes will be different; for older GPUs the processes may not be listed and "Not supported" may be written instead - it's still fine):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:08:00.0  On |                  N/A |
| 27%   31C    P8    23W / 215W |   7962MiB /  7981MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1338      G   /usr/lib/xorg/Xorg                            40MiB |
|    0      1390      G   /usr/bin/gnome-shell                          51MiB |
|    0      1696      G   /usr/lib/xorg/Xorg                           379MiB |
|    0      1828      G   /usr/bin/gnome-shell                         291MiB |
|    0      6521      G   /usr/lib/firefox/firefox                       3MiB |
|    0     18422      C   /usr/bin/python3.6                          7187MiB |
|    0     32185      G   /usr/lib/firefox/firefox                       5MiB |
+-----------------------------------------------------------------------------+
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
The expected output should be similar to the nvidia-smi output shown above for the Nvidia driver check.
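If you prefer to run all the checks from this section in one go, something along these lines could work. It is only a sketch: `run_check` and `check_prerequisites` are names made up for this example, and it reports pass or fail from each command's exit status rather than comparing the output with the listings above.

```shell
#!/bin/sh
# Run one check quietly and report whether it succeeded.
# $1: label to print, remaining arguments: the command to run.
run_check() {
    label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK: $label"
    else
        echo "FAIL: $label"
    fi
}

# The checks from this section. Call this function to run them all.
check_prerequisites() {
    run_check "docker" docker run hello-world
    run_check "docker-compose" docker-compose version
    run_check "nvidia driver" nvidia-smi
    run_check "nvidia docker runtime" \
        docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
}

# Usage:
#   check_prerequisites
```

A missing tool shows up as FAIL as well, because the shell returns a non-zero status for a command it cannot find. For any FAIL, run that command by hand and compare with the expected output above.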
New version for 2020
Currently, the community has built the local training stack with CPU support and is now working on a GPU version, a beginner setup and integration with log analysis.
History
The first version of local training was shared by Chris Rhodes on his GitHub. Not long after, Alex Schultz provided a fully dockerized version of DeepRacer training, which was then extended by Autonomous Race Car Community members with a UI to train DeepRacer. This was the state in which we remained until AWS re:Invent 2019. After the conference a group within the community started working on improving the local training and getting it to support the console capabilities released during the conference.
External resource
- Tony Markham's guide to set up local training
- Chris Rhodes' repo
- Richard Fan's repo
- Alex Schultz' repo
- The initial resource with a lot of information about local training, which is gradually being transferred onto this page