Local Training Troubleshooting
- 1 Initial checks
- 2 Known specific errors when setting up local training
- 2.1 CUDA 10.2 fatal error: nvscibuf.h: No such file or directory
- 2.2 Ubuntu nvidia driver loaded but not pick up
- 2.3 Vncviewer/gazebo shuts down after a bit of training
- 2.4 There is an error in rviz in vnc viewer
- 2.5 I'm starting the training and am getting "Found a lock file ..., waiting" in the logs
- 2.6 I'm starting the training and am getting "Received termination signal from trainer. Goodbye."
Initial checks
Check that you have everything installed by running these basic checks in your terminal:
- docker run hello-world - it may show some info about pulling the image and should then end with the message "Hello from Docker! (...)". If you're new to Docker, read this message to understand what just happened,
- docker-compose version - it should print something like docker-compose version 1.24.1, build 4667896b. The version may differ. If it's 1.20 or lower, I think Alex' files won't start for you because the version is unsupported,
- nvidia-smi - it should print a table from nvidia with info about your GPU, which means you have a GPU up and running on your computer. This is only needed if you intend to use a GPU for training,
- docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi - this should print the same thing as above, but through a docker container. This is only needed if you intend to use a GPU for training. If you get an error about an unrecognized runtime nvidia, nvidia-docker2 is missing.
If any of the above fail, go back to the Local Training article and install the missing bits.
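If you'd rather run the checks in one go, here is a minimal sketch that wraps the same commands. The function names run_check and run_all_checks are made up for this example; the commands themselves are the ones listed above. It only reports whether each command exited successfully, so read the real output yourself if something fails.

```shell
# Run one check quietly: print OK/FAIL with a label and return the command's status.
# (run_check is a hypothetical helper name, not part of any tool.)
run_check() {
    local label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK   $label"
    else
        echo "FAIL $label"
        return 1
    fi
}

# Run the checks from the list above. Pass "gpu" to include the GPU-only ones.
run_all_checks() {
    local failed=0
    run_check "docker" docker run hello-world || failed=1
    run_check "docker-compose" docker-compose version || failed=1
    if [ "${1:-}" = "gpu" ]; then
        run_check "nvidia-smi" nvidia-smi || failed=1
        run_check "nvidia runtime in docker" \
            docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi || failed=1
    fi
    return "$failed"
}
```

Call run_all_checks for a CPU setup or run_all_checks gpu for a GPU setup. Any FAIL line tells you which command to rerun by hand to see the actual error.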
If all elements are installed but the training is still not working, try to narrow down where the issue occurs.
The most common case is that the training doesn't start when using Alex' repo. You run ./start.sh and either something happens or not. Let's try and work out what could be happening.
- what is the output of the start script? Did you get any errors?
- the script normally opens two new terminal windows, one with logs and one that runs the vncviewer with gazebo and rviz in it - did you get them?
- in a terminal window, execute docker ps - this shows running containers and might help narrow down the issue,
- in the above you should have a number of containers listed, among them some are more interesting than others (I'm referring to them based on the NAMES column from the command above):
- minio - our S3 bucket lookalike - this one usually starts and causes no issues,
- robomaker - starts up robomaker, gazebo, rviz,
- rl_coach - this starts up the code to spin up a sagemaker container,
- tmpsomething - this runs the image crr0004/sagemaker-rl-tensorflow:tag which runs sagemaker. The tag varies, e.g. amd for another GPU,
- check logs for errors. Logs can be checked by running docker logs containername. Use container names from the docker ps output. Browse the logs for errors. Some containers might already be down, but still have logs available,
- I'm not sure how quickly the containers get cleaned up, but sometimes docker logs works right after the failure but not later. Try running it right after the issue occurs. If a container from the above list is missing, also run docker ps -a which will also list containers that are already down.
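Browsing several logs by hand gets tedious, so here is a rough sketch that loops over all containers, including stopped ones, and prints only the lines that look like errors. The helper names and the search pattern are assumptions made for this example; the pattern is a guess at what is worth looking for, not a complete list.

```shell
# Keep only the lines on stdin that look like errors (case-insensitive).
# The pattern is illustrative; extend it with whatever you see in your logs.
filter_errors() {
    grep -i -E 'error|exception|traceback|unrecognized runtime' || true
}

# Print error-looking lines from every container, running or already down.
scan_container_logs() {
    local name
    for name in $(docker ps -a --format '{{.Names}}'); do
        echo "== $name =="
        # docker logs replays both stdout and stderr of the container
        docker logs "$name" 2>&1 | filter_errors
    done
}
```

Run scan_container_logs right after a failure, before the containers get cleaned up. If a section comes back empty, still read that container's full log with docker logs, since not every problem contains one of these words.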
This is by no means an exhaustive list, but it may help you narrow down your issue. For instance, we have had a number of problems where robomaker would start, rl_coach would terminate and the training would not proceed; sometimes the vncviewer would also shut down after a couple of minutes. The cause was nvidia-docker2 not being installed, which made the SageMaker container fail to start. It showed up in the rl_coach logs as an error about an unrecognized runtime nvidia.
Known specific errors when setting up local training
CUDA 10.2 fatal error: nvscibuf.h: No such file or directory
Ubuntu nvidia driver loaded but not pick up
Vncviewer/gazebo shuts down after a bit of training
If the reward function raises an exception (has an error in execution), the whole gazebo/robomaker/vncviewer shuts down. Check the logs for robomaker if that is the case. If not, do the full checking as described above.
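To find the exception quickly, something like the sketch below may help. It assumes the reward function error appears in the robomaker logs as a standard Python traceback, which is an assumption and not something this page confirms. The helper name last_traceback is made up for this example.

```shell
# Print everything on stdin from the last Python "Traceback" line onwards.
# Prints nothing if no traceback is found.
last_traceback() {
    awk '/Traceback \(most recent call last\)/ { buf = ""; seen = 1 }
         seen { buf = buf $0 "\n" }
         END { printf "%s", buf }'
}
```

Use it as docker logs robomaker 2>&1 | last_traceback, replacing robomaker with the name shown in the NAMES column of docker ps -a.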
There is an error in rviz in vnc viewer
This is caused by an incomplete rviz configuration and does not affect the training. Sagar from the community Slack provided hints on how to get it configured:
- In Global Options - Fixed Frame select chassis (it's a dropdown)
- Press Add in lower left part of the RViz window and select Camera
- Set Camera - Image Topic to something starting with /camera/zed/rgb (it's a dropdown)
You should see what your car sees when driving around the track. Well done on checking such detailed things :)
I'm starting the training and am getting "Found a lock file ..., waiting" in the logs
Your training stopped in the middle of writing files. Find and remove the .lock file listed in the message, then make sure the checkpoint file in the same folder points at the last complete set of Step files (there should be three of them). After that you'll be good to go; no restart is required.
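If you are not sure where the lock file lives, a sketch like this can locate it. The helper names are hypothetical, and the folder you pass in is whatever your setup uses for checkpoints; the page does not name one. It matches any file whose name ends in .lock, so look at the list before removing anything.

```shell
# List lock files under the given folder (matches ".lock" and "name.lock").
find_lock_files() {
    find "$1" -type f -name '*.lock'
}

# Remove those lock files, printing each one that gets removed.
remove_lock_files() {
    find "$1" -type f -name '*.lock' -print -exec rm -f {} +
}
```

Run find_lock_files first on your checkpoint folder and compare the result with the path from the "Found a lock file" log line. Checking that the checkpoint file points at a complete set of Step files is still a manual step.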
I'm starting the training and am getting "Received termination signal from trainer. Goodbye."
This happens when a previous training run ended on a NaN or on a maximum reward value. The run writes a file named .finished in the checkpoints folder (the location depends on your setup). If that file exists on startup, robomaker finishes its work before starting any simulations. Remove the file and restart. The file may be present in multiple folders.
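Since the file can sit in more than one folder, here is a small sketch that searches several folders at once. The helper names are made up for this example, and the folders to search are an assumption you need to fill in for your own setup.

```shell
# List every .finished marker under the given folders.
list_finished_markers() {
    find "$@" -type f -name '.finished'
}

# Remove the markers, printing each one that gets removed.
remove_finished_markers() {
    find "$@" -type f -name '.finished' -print -exec rm -f {} +
}
```

Pass one or more folders, for example the folder holding your checkpoints and the one backing minio, then restart the training.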