Debugging Spark can be quite painful. The error that you see in your application, or even in your job, might not reflect the underlying problem. Here’s how I dig into the executor logs to find what might really be happening.
From the YARN job list, find your job. If it’s no longer running you’ll need to go to the History page.
From here you can navigate to the driver log. There are two here, because the job was retried.
The driver log records the Spark startup, and will log some of the information that the executors provide. Unfortunately, if the executors crash or become unresponsive, they often won’t report their status – they’ll just stop.
The driver log will tell you which executor exited, which is a nice way of saying that it failed. You will have to trawl the driver log a little to find these; there will probably be multiple per job. You now know the name of the container that failed.
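If the driver log is long, the trawl can be scripted once you have saved the log as text. The following is a minimal sketch in Python, not a definitive parser: the failure keywords in `FAILURE_HINTS` and the container ID pattern are assumptions about what your driver log looks like, so adjust both to match the lines you actually see. The function name `failed_containers` is made up for this illustration.

```python
import re

# Assumed container ID shape:
# container_[e<epoch>_]<clusterTimestamp>_<appNumber>_<attempt>_<containerNumber>
CONTAINER_ID = re.compile(r"container_(?:e\d+_)?\d+_\d+_\d+_\d+")

# Assumed keywords that mark an executor/container failure line in the driver log.
FAILURE_HINTS = ("exited", "marked as failed", "Lost executor")


def failed_containers(log_text):
    """Return container IDs found on failure-looking lines, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        if any(hint in line for hint in FAILURE_HINTS):
            for container_id in CONTAINER_ID.findall(line):
                if container_id not in seen:
                    seen.append(container_id)
    return seen
```

Calling `failed_containers(open("driver.log").read())` (with `driver.log` standing in for wherever you saved the log) gives you the list of container names to look up in the next step.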
Now search the driver log for the launch details of the problem container (copy the container name with CTRL+C, then search for it with CTRL+F in Chrome). Each executor has a “YARN executor launch context” block like the one below.
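The same lookup can be done on a saved copy of the driver log. This is a rough Python sketch under two assumptions that you should check against your own log: that the launch context lists the log URL on a line shaped like `SPARK_LOG_URL_STDERR -> <url>`, and that the URL contains the container name. The helper name `stderr_url_for` is hypothetical.

```python
import re

# Assumed line shape inside the launch context: "SPARK_LOG_URL_STDERR -> <url>"
STDERR_URL = re.compile(r"SPARK_LOG_URL_STDERR\s*->\s*(\S+)")


def stderr_url_for(log_text, container_id):
    """Return the first SPARK_LOG_URL_STDERR value mentioning container_id, or None."""
    for line in log_text.splitlines():
        match = STDERR_URL.search(line)
        # Assumption: the container name appears somewhere in the URL.
        if match and container_id in match.group(1):
            return match.group(1)
    return None
```

A return value of `None` most likely means one of the assumptions does not hold for your cluster, in which case searching by hand is still the fallback.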
If you browse to the SPARK_LOG_URL_STDERR path you will see the detailed log for that container / executor, where you will (hopefully) find a more detailed description of what went wrong. At the very least you should be able to identify the root ‘Caused by’ exception, and Google from there.
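In a Java stack trace the root cause is the last ‘Caused by’ entry in the chain, so it can be picked out mechanically from a saved copy of the stderr log. Here is a small Python sketch of that idea. It assumes conventional stack trace formatting (frames starting with `at`, elision lines starting with `...`) and treats any other line as the end of a trace, so multi-line exception messages will confuse it. The name `root_causes` is invented for this example.

```python
def root_causes(stderr_text):
    """Return the last 'Caused by:' line of each stack trace in the text."""
    roots = []
    current = None  # most recent 'Caused by:' line in the trace being read
    for line in stderr_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Caused by:"):
            # A later 'Caused by:' in the same trace is closer to the root.
            current = stripped
        elif stripped.startswith("at ") or stripped.startswith("..."):
            # Stack frame or '... N more' line: still inside the same trace.
            continue
        else:
            # Any other line ends the trace; keep its deepest cause, if any.
            if current is not None:
                roots.append(current)
                current = None
    if current is not None:
        roots.append(current)
    return roots
```

Each returned line is a reasonable starting point for a search, though the earlier exceptions in the chain sometimes carry the more useful context.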