Local Training Troubleshooting

From AWS DeepRacer Community Wiki

Initial checks

First check that everything is installed by running these basic checks in your terminal:

  • docker run hello-world - it may print some information about pulling the image and should then end with the message "Hello from Docker! (...)". If you're new to Docker, read that message to understand what just happened.
  • docker-compose version - it should print something like docker-compose version 1.24.1, build 4667896b. The version may differ. If it's 1.20 or lower, I think Alex's files won't start because the version is unsupported.
  • nvidia-smi - it should print a table from NVIDIA with information about your GPU, which means the GPU is up and running on your computer. This is only needed if you intend to use a GPU for training.
  • docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi - this should print the same table as above, but from inside a Docker container. This is only needed if you intend to use a GPU for training. If you get an error about an unrecognized runtime nvidia, nvidia-docker2 is missing.

If any of the above fail, go back to the Local Training article and install the missing bits.
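The checks above can be strung together in a small shell sketch. The helper names check and run_initial_checks are made up for this example; the commands themselves are the ones listed above. If you train on CPU only, the last two checks are expected to fail and can be ignored.

```shell
# Sketch only: check() and run_initial_checks() are hypothetical helper names.

# Run a command quietly and report whether it succeeded.
check() {
    local label="$1"
    shift
    if "$@" > /dev/null 2>&1; then
        echo "OK:   $label"
    else
        echo "FAIL: $label"
        return 1
    fi
}

run_initial_checks() {
    check "docker" docker run hello-world
    check "docker-compose" docker-compose version
    # The two checks below only matter for GPU training.
    check "nvidia driver" nvidia-smi
    check "nvidia runtime in docker" docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
}

# Call run_initial_checks to execute all four checks.
```

Any line starting with FAIL points at the component to reinstall. The sketch hides the command output, so rerun the failing command by hand to see the actual error.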

If all elements are installed but the training is still not working, try to narrow down where the issue occurs.

The most common case is that training doesn't start when using Alex's repo: you run ./start.sh and it is not clear whether anything happened. Let's work out what could be going on.

  • what is the output of the start script? Did you get any errors?
  • the script normally opens two new terminal windows, one with logs and one running the vncviewer with gazebo and rviz in it - did you get them?
  • in a terminal window, execute docker ps - this shows the running containers and might help narrow down the issue,
  • the output should list a number of containers, some more interesting than others (I refer to them by the NAMES column of docker ps):
    • minio - our S3 bucket lookalike - this one usually starts and causes no issues,
    • robomaker - starts up robomaker, gazebo and rviz,
    • rl_coach - starts the code that spins up a sagemaker container,
    • tmpsomething (a name starting with tmp) - runs the image crr0004/sagemaker-rl-tensorflow:tag, which runs sagemaker. The tag may be nvidia for GPU, amd for another GPU or console for CPU,
  • check the logs for errors by running docker logs containername, using the container names from the docker ps output. Some containers might already be down but still have logs available,
  • I'm not sure how quickly the containers get cleaned up, but sometimes docker logs works right after the failure and not later, so run it as soon as the issue occurs. If a container from the list above is missing, also run docker ps -a, which lists containers that are already down as well.
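Since logs can disappear once containers are cleaned up, it may help to save them all in one go right after a failure. This is a minimal sketch: save_all_logs is a hypothetical helper name and the default output folder ./container-logs is an assumption, while docker ps -a, its --format option and docker logs are standard Docker commands.

```shell
# Sketch only: save_all_logs() is a hypothetical helper name,
# and ./container-logs is an arbitrary default output folder.
save_all_logs() {
    local out_dir="${1:-./container-logs}"
    mkdir -p "$out_dir"
    # docker ps -a also lists containers that have already stopped.
    docker ps -a --format '{{.Names}}' | while read -r name; do
        # Capture both stdout and stderr of each container's log.
        docker logs "$name" > "$out_dir/$name.log" 2>&1
        echo "saved $out_dir/$name.log"
    done
}

# Usage: save_all_logs            (writes to ./container-logs)
#        save_all_logs /tmp/logs  (writes to a folder of your choice)
```

The saved files can then be searched at leisure, for example with grep -i error on the output folder.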

This is by no means an exhaustive list, but it may help you narrow down your issue. For instance, we have had a number of problems where robomaker would start, rl_coach would terminate and the training would not proceed; sometimes the vncviewer would also shut down after a couple of minutes. The cause was that nvidia-docker2 was not installed, which made the SageMaker container fail to start. It showed up in the rl_coach logs as an error about an unrecognized runtime nvidia.
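A quick way to look for that particular failure is to search the rl_coach logs for lines mentioning both "runtime" and "nvidia". The exact wording of the error is not recorded here, so the pattern below is a deliberately loose assumption, and has_nvidia_runtime_error is a hypothetical helper name.

```shell
# Sketch only: has_nvidia_runtime_error() is a hypothetical helper name.
# It reads log text on stdin and succeeds if any line mentions both
# "runtime" and "nvidia" (case-insensitive). The exact error wording is
# an assumption, so the match is intentionally loose.
has_nvidia_runtime_error() {
    grep -iqE 'runtime.*nvidia|nvidia.*runtime'
}

# Usage:
#   docker logs rl_coach 2>&1 | has_nvidia_runtime_error \
#       && echo "nvidia runtime problem: check that nvidia-docker2 is installed"
```

A loose match can produce false positives, so read the matching lines yourself before reinstalling anything.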

Known specific errors when setting up local training

CUDA 10.2 fatal error: nvscibuf.h: No such file or directory

Ubuntu nvidia driver loaded but not picked up

Vncviewer/gazebo shuts down after a bit of training

If the reward function raises an exception during execution, gazebo, robomaker and the vncviewer all shut down. Check the robomaker logs to see whether that is the case. If it is not, go through the full set of checks described above.
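The reward function is Python code, so an exception in it would normally appear in the logs as a Python traceback. A minimal sketch for pulling tracebacks out of the robomaker logs follows; show_tracebacks is a hypothetical helper name, and it is an assumption that the traceback reaches the container log unmodified.

```shell
# Sketch only: show_tracebacks() is a hypothetical helper name.
# It reads log text on stdin and prints each Python traceback header
# plus the lines that follow it (15 by default).
show_tracebacks() {
    grep -A "${1:-15}" 'Traceback (most recent call last)'
}

# Usage:
#   docker logs robomaker 2>&1 | show_tracebacks
#   docker logs robomaker 2>&1 | show_tracebacks 30   # more context lines
```

If nothing is printed, the shutdown probably has a different cause.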

There is an error in rviz in the VNC viewer

This is caused by incomplete rviz configuration and does not affect the training. Sagar from the community Slack provided hints on how to configure it:

  • In Global Options - Fixed Frame select chassis (it's a dropdown)
  • Press Add in the lower left part of the RViz window and select Camera
  • Set Camera - Image Topic to something starting with /camera/zed/rgb (it's a dropdown)

You should see what your car sees when driving around the track. Well done on checking such detailed things :)

I'm starting the training and am getting "Found a lock file ..., waiting" in the logs

Your training stopped in the middle of writing files. Find and remove the .lock file named in the message, then make sure the checkpoint file in the same folder points at the last complete set of Step files (there should be three of them). After that you are good to go, and no restart is required.
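If the log message does not make the location obvious, the lock file can be located with find. This is a minimal sketch: list_lock_files is a hypothetical helper name and /path/to/checkpoints is a placeholder, since the real folder depends on your setup.

```shell
# Sketch only: list_lock_files() is a hypothetical helper name.
# It prints every file whose name ends in .lock under the given folder.
list_lock_files() {
    find "$1" -type f -name '*.lock'
}

# Usage, with a placeholder path:
#   list_lock_files /path/to/checkpoints
#   cat /path/to/checkpoints/checkpoint   # should name the last complete set of Step files
```

The sketch only lists files. Remove the lock by hand with rm once you have confirmed that the checkpoint file is consistent.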

I'm starting the training and am getting "Received termination signal from trainer. Goodbye."

This happens when your previous training ended because of a NaN or a maximum reward value. In that case a file named .finished is written to the checkpoints folder (the location depends on your setup). If the file exists on startup, robomaker finishes its work before it starts any simulations. Remove the file and restart. The file may be present in more than one folder.
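Because the marker can sit in several folders, it may be easier to search for it than to look by hand. A minimal sketch follows; remove_finished_markers is a hypothetical helper name and /path/to/training/data is a placeholder for wherever your setup keeps its checkpoints.

```shell
# Sketch only: remove_finished_markers() is a hypothetical helper name.
# It prints and deletes every file named .finished under the given folder.
remove_finished_markers() {
    find "$1" -type f -name '.finished' -print -delete
}

# Usage, with a placeholder path:
#   remove_finished_markers /path/to/training/data
```

To preview what would be removed, run the same find command without -delete first.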