Local Training Troubleshooting
- 1 Initial checks
- 2 Known specific errors when setting up local training
- 2.1 CUDA 10.2 fatal error: nvscibuf.h: No such file or directory
- 2.2 Ubuntu nvidia driver loaded but not picked up
- 2.3 Vncviewer/gazebo shuts down after a bit of training
- 2.4 There is an error in rviz in vnc viewer
- 2.5 I'm starting the training and am getting "Found a lock file ..., waiting" in the logs
- 2.6 I'm starting the training and am getting "Received termination signal from trainer. Goodbye."
Initial checks
Check that you have everything installed by running these basic checks in your terminal:
- docker run hello-world - it may show some info about pulling the image and should end with a message "Hello from Docker! (...)". If you're new to Docker, read that message to understand what just happened,
- docker-compose version - it should print something like docker-compose version 1.24.1, build 4667896b. The version may differ. If it's 1.20 or lower, I think Alex' files won't start for you because the version is unsupported,
- nvidia-smi - it should print a table from nvidia with info about your GPU, which means you have a GPU up and running on your computer. This is only needed if you intend to use a GPU for training,
- docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi - this should print the same table as above, but through a docker container. This is only needed if you intend to use a GPU for training. If you get a message about an unrecognized runtime nvidia, nvidia-docker2 is missing.
If any of the above fail, go back to the setup instructions and install the missing bits.
If all elements are installed but the training is still not working, try to narrow down where the issue occurs.
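The initial checks above can be run in one go. This is only a rough sketch: run_check and initial_checks are made-up helper names, not part of any DeepRacer tooling, and the two GPU checks are expected to fail on a machine without a GPU.

```shell
# Run a command quietly and report whether it succeeded.
# $1 is a label to print, the remaining arguments are the command itself.
run_check() {
    local label="$1"
    shift
    if "$@" > /dev/null 2>&1; then
        echo "OK: $label"
    else
        echo "FAIL: $label"
        return 1
    fi
}

# The four checks from this page, in the same order.
# The last two only matter if you intend to train on a GPU.
initial_checks() {
    run_check "docker" docker run hello-world
    run_check "docker-compose" docker-compose version
    run_check "nvidia driver (GPU only)" nvidia-smi
    run_check "nvidia runtime in docker (GPU only)" \
        docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
}
```

Paste the functions into your terminal (or source them from a file) and call initial_checks. Output is hidden to keep the summary short, so rerun any failing command by hand to see its actual message.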
Training startup checks
At this point you've most likely set everything up and started the training, but got a message that SageMaker is not running (or something else went wrong). Let's try and work out what could be happening.
- what is the output of the start script? Did you get any errors?
- in a terminal window, execute docker ps -a - this shows all containers, running and stopped, and might help narrow down the issue,
- in the output you should see a number of containers listed. Some are more interesting than others (I'm referring to them based on the NAMES column from the command above):
- minio/minio - our S3 bucket lookalike - this one usually starts and causes no issues (not shown on EC2),
- awsdeepracercommunity/deepracer-robomaker - starts up robomaker, gazebo, rviz - everything needed to simulate the environment for DeepRacer,
- awsdeepracercommunity/deepracer-rlcoach - this starts up the code that spins up a sagemaker container, and runs the Redis db that gathers experiences from episodes,
- awsdeepracercommunity/deepracer-sagemaker - this runs sagemaker,
- Do note the tags attached to the names of container images (listed after a colon, at the end),
- if the setup is fresh, it might need to pull the docker images. If you don't have any exited containers, try waiting a little longer,
- check logs for errors. Logs can be checked by running docker logs containername. Use container names from the docker ps -a output. Browse the logs for errors. Some containers might already be down, but still have logs available,
- Not everything that looks like an error is an error. Sagemaker throws some stacktrace on startup and then carries on running. Your best bet for errors is to look in containers that have already exited and towards the end of their log output
I usually start by checking robomaker (if it exits, sagemaker usually ignores it and keeps running and waiting), then rl-coach (if it errors, you won't see the sagemaker container at all), and finally sagemaker (I think this one is the least likely to fail).
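Since the most useful errors tend to sit at the end of the logs of exited containers, the steps above can be sketched as a small helper. show_exited_logs is a made-up name, and the default of 30 lines is an arbitrary choice; adjust both as you like.

```shell
# Print the last lines of the log of every exited container.
# $1 is the number of lines to show per container (defaults to 30).
show_exited_logs() {
    local lines="${1:-30}"
    # List only exited containers, one name per line
    docker ps -a --filter "status=exited" --format '{{.Names}}' |
    while read -r name; do
        echo "=== $name ==="
        # Containers log to both stdout and stderr, so merge the two
        docker logs --tail "$lines" "$name" 2>&1
    done
}
```

Call it as show_exited_logs, or show_exited_logs 100 if the default tail is too short. It prints nothing when no container has exited, which on a fresh setup may just mean the images are still being pulled.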
Known specific errors when setting up local training
CUDA 10.2 fatal error: nvscibuf.h: No such file or directory
Ubuntu nvidia driver loaded but not picked up
Vncviewer/gazebo shuts down after a bit of training
If the reward function raises an exception (has an error in execution), the whole gazebo/robomaker/vncviewer shuts down. Check the robomaker logs to see if that is the case. If it is not, go through the full set of checks described above.
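One way to look for a reward function exception in the robomaker logs is to search for a Python traceback. This is a sketch under assumptions: show_tracebacks is a made-up helper, it assumes the exception shows up in the log as a standard Python "Traceback (most recent call last)" block, and the 15 lines of context are an arbitrary choice.

```shell
# Print Python tracebacks found in a container's log, with some context.
# $1 is the container name as shown by docker ps -a.
show_tracebacks() {
    # Merge stderr into stdout so nothing is missed, then print each
    # traceback header together with the 15 lines that follow it
    docker logs "$1" 2>&1 | grep -A 15 'Traceback (most recent call last)'
}
```

Call it with the name of your robomaker container. If it prints nothing, the shutdown probably has a different cause.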
There is an error in rviz in vnc viewer
This is caused by an incomplete RViz configuration and does not cause issues for the training. Sagar from the community Slack provided hints on how to get it configured:
- In Global Options - Fixed Frame select chassis (it's a dropdown)
- Press Add in lower left part of the RViz window and select Camera
- Set Camera - Image Topic to something starting with /camera/zed/rgb (it's a dropdown)
You should see what your car sees when driving around the track. Well done on checking such detailed things :)
I'm starting the training and am getting "Found a lock file ..., waiting" in the logs
Your training stopped in the middle of writing files. Find and remove the .lock file listed, make sure the checkpoint file in there points at the last complete set of Step files (there should be three of them) and you'll be good to go. No restart is required.
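If the log message does not make it obvious where the lock file lives, a search like the one below may help. find_lock_files is a made-up helper, and the folder you pass in is whatever your setup uses for checkpoints; nothing here knows that path for you.

```shell
# List lock files under the given folder.
# $1 is the folder to search, for example your checkpoints location.
# The pattern matches a file named .lock as well as names ending in .lock.
find_lock_files() {
    find "$1" -type f -name '*.lock'
}
```

The helper only lists files and leaves the removal to you, so that you can first check that the checkpoint file next to the lock points at a complete set of Step files.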
I'm starting the training and am getting "Received termination signal from trainer. Goodbye."
This happens when your training has finished because of a NaN or a maximum reward value. It writes a .finished file in the checkpoints folder (location depends on your setup). If the file exists on startup, robomaker finishes work before it starts simulations. Remove the file and restart. The file may be present in multiple folders.
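Because the .finished file may be present in more than one folder, it can be easier to search for it than to hunt it down by hand. A minimal sketch, with made-up helper names; the folder you pass in depends on your setup.

```shell
# List every .finished marker under the given folder.
list_finished_markers() {
    find "$1" -type f -name '.finished'
}

# Print every .finished marker under the given folder, then remove it.
remove_finished_markers() {
    find "$1" -type f -name '.finished' -print -exec rm -f {} +
}
```

Run list_finished_markers on your checkpoints location first to see what would go, then remove_finished_markers on the same folder, and restart the training.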