Local Training FAQ
- 1 Which project should I use to set up the training?
- 2 Can I run the training in an environment not listed in the knowledge base?
- 3 Why are all files owned by root? Why do I need sudo?
- 4 What is the Real Time Factor and how does it influence the training?
- 5 Can I submit a locally trained model to virtual race?
- 6 Hardware
Which project should I use to set up the training?
Each project has something useful. Chris set everything up and exposed most of the geeky stuff to us all. Alex hid some of it to make setup easier. The ARCC guys are pushing that even further. Some people are trying to remove Docker from it altogether. All of the projects are at an early stage and have the rough edges that come with low maturity. I'd say Alex' repo or the ARCC one are the best to use right now.
Can I run the training in an environment not listed in the knowledge base?
You can use whatever custom setup you want. Just remember: if you prepare it, you maintain it. Every update to the environment (bug fixes, tracks etc.) may break something, and the more users a project has, the better maintained it tends to be. Consider whether the benefits of having it your way are worth the effort of carrying those changes over to your own setup.
That said, if the benefits are there, ask in the community Slack; you might get others to help.
Why are all files owned by root? Why do I need sudo?
This comes from Docker, which runs as root. With Alex' repo all files are created as root-owned files, so you will need sudo to work with them. Some files in Chris' repo have the same problem.
What is the Real Time Factor and how does it influence the training?
It tells you how fast simulation time runs compared to real time. If it's 0.9, then 9 seconds of simulation time elapse in 10 seconds of real time. It influences your training because steps take longer and the times you see may be higher than in race submissions. That said, I have had values between 0.8 and 0.9 and it was fine. 0.9 feels fine in general.
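As a rough illustration of the arithmetic above, here is a minimal sketch. The function names are made up for this example and are not part of any of the repos; it only assumes the definition given here, i.e. simulation time divided by real time.

```python
def real_time_factor(sim_seconds: float, real_seconds: float) -> float:
    """Ratio of elapsed simulation time to elapsed real (wall-clock) time."""
    if real_seconds <= 0:
        raise ValueError("real_seconds must be positive")
    return sim_seconds / real_seconds


def wall_clock_seconds(sim_seconds: float, rtf: float) -> float:
    """Real time needed to simulate sim_seconds at a given Real Time Factor."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return sim_seconds / rtf


# The example from the text: 9 simulated seconds in 10 real seconds.
print(real_time_factor(9, 10))  # 0.9

# One simulated minute at an RTF of 0.8 takes longer than a real minute.
print(f"{wall_clock_seconds(60, 0.8):.1f}")  # 75.0
```

So at 0.8 a simulated minute costs you roughly a quarter of a minute extra, which matches the feeling that values in the 0.8 to 0.9 range are tolerable.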
Can I submit a locally trained model to virtual race?
Yes, you can. Follow the instructions of the repo you're using to find out how. Chris' repo comes with dr_utils.py (which has --help) and Alex' repo comes with scripts/training/upload-snapshot.sh.
In general the rules are as follows. A submission requires a set of files in a folder associated with an existing training in your AWS DeepRacer Console. You can create a training named, let's say, "Local-Training-Submissions" and reuse it for this. Such a training has a folder in your DeepRacer bucket with the training date in its name, and that folder contains two folders. The first one, ip, holds the input files of your training (sadly no reward file, which would be useful for bookkeeping); it doesn't matter in evaluation or submissions. The other one, model, is the one that matters. It has many files, but only five of them are important for submissions:
- model_metadata.json - your action space
- X_Step-Y.ckpt.Z - three files, where X is the checkpoint number, Y is the number of steps since the start of training (not too important; it can be changed to anything as long as the same value is also in the checkpoint file) and Z is one of three extensions: index, meta and data-00000-of-00001
- deepracer_checkpoints.json - JSON file with information about the most recent checkpoint and the best checkpoint that was computed (based on track completion).
- The model_X.pb file is not one of the five and doesn't matter here. It is only used when loading the model onto a car.
Once your five files are in the folder, submitting the model will use those files regardless of what originally went into the training. The utility scripts in the repos make this convenient: they only require a configured AWS CLI, the name of the bucket, the name of the folder in which the model is to land, and permissions to add and delete files there.
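To make the "five files" rule concrete, here is a small sketch that picks them out of a listing of the model folder. It is not how dr_utils.py or upload-snapshot.sh work; the function name and the sample listing are hypothetical, and only the file name patterns come from the list above.

```python
import re
from typing import Iterable, List

# The three checkpoint extensions named in the list above.
CHECKPOINT_EXTENSIONS = ("index", "meta", "data-00000-of-00001")
FIXED_FILES = ("model_metadata.json", "deepracer_checkpoints.json")


def submission_files(filenames: Iterable[str], checkpoint: int) -> List[str]:
    """Return the five file names needed to submit the given checkpoint.

    Raises FileNotFoundError if any of them is absent from the listing.
    """
    names = set(filenames)
    pattern = re.compile(rf"^{checkpoint}_Step-\d+\.ckpt\.(.+)$")

    # Map each checkpoint extension to the file that carries it.
    by_extension = {}
    for name in names:
        match = pattern.match(name)
        if match and match.group(1) in CHECKPOINT_EXTENSIONS:
            by_extension[match.group(1)] = name

    missing = [name for name in FIXED_FILES if name not in names]
    missing += [
        f"{checkpoint}_Step-*.ckpt.{ext}"
        for ext in CHECKPOINT_EXTENSIONS
        if ext not in by_extension
    ]
    if missing:
        raise FileNotFoundError("missing: " + ", ".join(missing))

    return sorted(list(FIXED_FILES) + list(by_extension.values()))


# Hypothetical folder listing, for illustration only.
listing = [
    "model_metadata.json",
    "deepracer_checkpoints.json",
    "12_Step-3400.ckpt.index",
    "12_Step-3400.ckpt.meta",
    "12_Step-3400.ckpt.data-00000-of-00001",
    "model_12.pb",
]
for name in submission_files(listing, checkpoint=12):
    print(name)
```

A check like this can be run before uploading to catch a missing checkpoint part early; the actual upload is best left to the script that ships with your repo.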
Hardware
What card should I get for training?
That's a tricky question. It seems that the amount of VRAM matters more than the power of the GPU itself. Some people got a GeForce GTX 1660 Ti, which has 6 GB of VRAM that is fully utilised when training. Others got a GeForce RTX 2070 or something along those lines and it seemed similar: full VRAM usage, computing power not so much.
So it doesn't have to be the latest and greatest to keep you happy in terms of training. You may need to wait a couple of seconds longer for your training to complete. This article from April 2019 says anything above a GTX 1060 should be fine. Also aim for 6 GB of VRAM or more.
How much disk space is needed?
A lot. Remember, with speed comes data. Training can generate 200 GB a day, sometimes more. Make sure you set aside enough space and clean up when possible. Jeff Klopfenstein from the community has said he set his Ubuntu partitions to 160 GB (both data and the system), which lets him train for about 6 hours uninterrupted. A 500 GB drive will hold the data for two to three days or so.
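For a back-of-the-envelope estimate, something like the sketch below may help. The 200 GB per day default is only the ballpark figure quoted above and the function name is invented for this example; your own rate may well be higher, so treat the result as an upper bound.

```python
import shutil


def hours_until_full(free_gb: float, gb_per_day: float = 200.0) -> float:
    """Estimate how many hours of training fit into the given free space."""
    if gb_per_day <= 0:
        raise ValueError("gb_per_day must be positive")
    return free_gb / gb_per_day * 24


# A 500 GB drive at roughly 200 GB a day lasts about two and a half days.
print(f"{hours_until_full(500) / 24:.1f} days")  # 2.5 days

# Estimate for the disk holding the current directory (decimal gigabytes).
free_gb = shutil.disk_usage(".").free / 1e9
print(f"{free_gb:.0f} GB free, roughly {hours_until_full(free_gb):.0f} hours of training")
```

Measuring how much your own setup writes in an hour and passing that as gb_per_day will give a far better estimate than the default.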
Can I use a laptop for training?
Yes, you can use it. Yes, it will sound like a hairdryer. No, mobile graphics cards aren't as good as full size ones.
Can I use an old GPU?
You can try. Many people struggled: even after they rebuilt the software to support their cards, there wasn't enough memory for training, even with a tiny batch size.
Can I use an external GPU?
Chris Thompson from the community Slack once reported plans to try. If you want to as well, get in touch with him.
How do I confirm I'm using the GPU?
Run nvidia-smi while training. There should be a python3.6 process in the list consuming a lot of memory.
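If you would rather check this from a script, here is a minimal sketch that asks nvidia-smi for its compute processes in CSV form and looks for a Python process holding a lot of memory. The helper names and the 1000 MiB threshold are assumptions made for this example, and depending on your Docker setup the process name may show up differently or not at all.

```python
import csv
import io
import subprocess
from typing import List, Tuple

# Ask nvidia-smi for one CSV row per compute process: pid, name, memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text: str) -> List[Tuple[int, str, int]]:
    """Parse the CSV output into (pid, process name, used MiB) tuples."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) != 3:
            continue  # blank or unexpected line
        pid, name, used_mib = (field.strip() for field in record)
        if not (pid.isdigit() and used_mib.isdigit()):
            continue  # memory can be reported as not available
        rows.append((int(pid), name, int(used_mib)))
    return rows


def python_gpu_processes(csv_text: str, min_mib: int = 1000) -> List[Tuple[int, str, int]]:
    """Keep Python processes using at least min_mib (an arbitrary threshold)."""
    return [
        row for row in parse_compute_apps(csv_text)
        if "python" in row[1] and row[2] >= min_mib
    ]


def query_gpu() -> str:
    """Run nvidia-smi and return its raw CSV output."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


# Hypothetical output, for illustration only.
sample = "1234, python3.6, 5500\n4321, some_other_app, 120\n"
print(python_gpu_processes(sample))  # [(1234, 'python3.6', 5500)]
```

While training is running, python_gpu_processes(query_gpu()) should return at least one entry; an empty list suggests the training is running on the CPU.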