Diagnosing Failed Batch Jobs

This guide covers common sources of errors that users encounter when running batch jobs on the Rescale platform. It shows how to diagnose these errors and how to avoid or correct them.

Job Status page

  • Review the output on the Job Status page
    • Examine the job history on the Gantt chart

      • Does the command properly pass the validation step (Validating Input) with a green check?
      • Does the Running Job time seem reasonable for the size of the simulation you are running?
    • Are there error messages present in the Job Logs section?

Results page

  • Do the number and size of the output files listed on the Results page(s) match your expectations?

process_output.log

The first step we suggest for every job, successful or otherwise, is to review the process_output.log file. This file logs the standard output from the analysis method you are running, including any error messages it prints.

  • To find the process_output.log file go to the Job Results page

  • Query for the file in the search bar

    • Usually log or process is a sufficient search term
  • View the process_output.log file using the screen icon in the Actions column

    • If the file is too large, you may first need to download the file before viewing it in a text editor
  • Carefully review this log file and look for warning or error messages

    • Most of the time, the error can be identified here
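Once the log has been downloaded, a quick keyword scan can narrow the search. The sketch below assumes a local copy named process_output.log in the current directory; the helper names and the keyword list are illustrative guesses at typical messages and should be adapted to your analysis software.

```shell
#!/bin/sh
# Print line-numbered matches for common failure keywords (case-insensitive).
# The keyword list is an assumption; adjust it for your analysis software.
scan_log() {
    grep -n -i -E 'error|warning|fatal|killed|segmentation' "$1"
}

# Show the last line that reports an exit code, if there is one.
last_exit_line() {
    grep -E 'Exited with code [0-9]+' "$1" | tail -n 1
}

# Usage: scan a downloaded copy of the log if one is present.
if [ -f process_output.log ]; then
    scan_log process_output.log || echo "no keyword matches found"
    last_exit_line process_output.log
fi
```

A clean keyword scan does not prove the job succeeded, so the full log is still worth reading when results look wrong.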

Exit Codes

One key field to look for is the "exit code" at the end of the process_output.log file. If an analysis method runs smoothly and exits cleanly without posting any error messages it should produce:

Exited with code 0

A job that completes with code 0 simply ran without the process reporting an error; this does not guarantee that the job ran as you intended. When the program encounters an explicit system error (e.g. out of memory, core dump, or out of disk space), the process produces a non-zero exit code. Common exit codes you may encounter:

Exit Code    Meaning
1            Catchall for general errors
2            Misuse of shell builtins
126          Command invoked cannot execute
127          Command not found
128          Invalid argument to exit
128+n        Fatal error signal "n"
130          Script terminated by Ctrl-C (128+2, SIGINT)
137          Killed by SIGKILL (128+9), for example by an administrator or the out-of-memory killer
255*         Exit status out of range

These exit codes are not very descriptive, but they provide a starting point for debugging.
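The 128+n rule can be applied mechanically. The sketch below is a hypothetical helper (describe_exit is not a Rescale tool) that subtracts 128 from the code and asks the shell's kill -l for the matching signal name.

```shell
#!/bin/sh
# Translate an exit code into a short hint, following the 128+n convention.
# describe_exit is a hypothetical helper name, not part of any platform tool.
describe_exit() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -gt 128 ] && [ "$code" -lt 255 ]; then
        # Codes of 128+n mean the process was ended by signal n.
        sig=$((code - 128))
        name=$(kill -l "$sig" 2>/dev/null) || name="unknown"
        echo "terminated by signal $sig ($name)"
    else
        echo "error code $code"
    fi
}

describe_exit 137
```

For example, 137 is 128 + 9, and signal 9 is SIGKILL, which is commonly sent when a process is forcibly stopped or the system runs out of memory.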

Basic Debugging Steps

Jobs can fail in many ways. Some of the most common issues, along with steps to diagnose and avoid them, are listed below.

Missing input files

  • Ensure that all required files are included in the job, either individually or in a compressed input file deck (zip, tarball, 7zip, etc.)

Incorrect file paths

  • Ensure that the compressed input file deck will expand to the proper directory paths
  • Use relative file paths in scripts, input files and other job definitions
  • Rescale will execute the software Command(s) indicated on the Software Settings page in the same working directory where the compressed files will be unpacked
    • The Rescale platform will assume that the prepared input files are packaged at the top directory level
    • Create the zip or tar archive of the input files at the directory level where the command will be executed
    • Or, if a case subdirectory is used (e.g. run01_configB), ensure that you preface the analysis software command with the proper change of directory, such as: cd run01_configB && run_analysis
      • Note that this is not the preferred workflow on the Rescale platform
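Before uploading, it can help to check how an archive will unpack. The sketch below assumes a tar-based input deck and a local tar that detects compression when listing; the function names and any paths passed to them are hypothetical.

```shell
#!/bin/sh
# List the distinct top-level entries of a tar archive.
# If the only entry is a single directory, the job command must cd into it.
top_level_entries() {
    tar -tf "$1" | sed -e 's|^\./||' -e '/^$/d' | cut -d/ -f1 | sort -u
}

# Pack the contents of a case directory so they unpack at the top level.
# Both arguments are names chosen by the caller; write the archive
# outside the directory being packed.
pack_flat() {
    src_dir=$1
    archive=$2
    tar -czf "$archive" -C "$src_dir" .
}
```

If top_level_entries prints your input files directly, the archive matches the layout the platform expects; if it prints a single directory name, either repack with pack_flat or add the cd step described above.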

Multi-node jobs that require access to a common file system

For most analysis methods, the master process handles file I/O and communication with the slave processes, and Rescale by default places user-specified input files into ~/work. Some analyses, however, launch slave processes on other nodes that also require access to a shared file system.

  • On the Rescale platform, the ~/work/shared directory is NFS mounted to all of the compute nodes in your job
    • Rescale has identified most of these analysis methods and by default starts their jobs in the ~/work/shared directory
    • Occasionally, however, a runtime customization or option means that the slave processes running on the nodes need to read input files, load runtime libraries, or write output files
  • The Command on the Software Settings page should be prepended with move and change directory commands:
mv * shared
cd shared
<run_analysis>
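One caveat with the commands above: the * pattern also matches the shared directory itself, so mv typically reports that it cannot move shared into itself and returns a non-zero status. That is harmless when the commands are separated by line breaks, but it would stop a chain joined with &&. A minimal sketch that skips the directory, assuming it runs from ~/work and that move_into_shared is a name made up for this example:

```shell
#!/bin/sh
# Move every non-hidden entry of the current directory into ./shared,
# skipping the shared directory itself so that mv does not report an error.
# Intended to run from ~/work, before "cd shared" and the analysis command.
move_into_shared() {
    for entry in *; do
        [ "$entry" = "shared" ] && continue
        [ -e "$entry" ] || continue   # the pattern matched nothing
        mv -- "$entry" shared/
    done
}
```

Hidden files (names starting with a dot) are not matched by *, in either the original command or this sketch.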

Error reading input files

  • Ensure that the input files are properly constructed as expected by your analysis software
  • Ensure that the proper software version is selected in the Version dropdown on the Software Settings page
  • Ensure that text input files are in the proper format
    • Batch compute nodes are generally Linux machines, and text editors on different operating systems encode end-of-line and end-of-file characters differently
    • Windows text editors often save files with CRLF line endings; the extra carriage return appears on Linux as a ^M character at the end of each line
      • In a text editor such as vi/Vim, you can remove these characters with the following command: :%s/^M$//
      • Note: ^M is entered by pressing ctrl-V followed by ctrl-M
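The same check and fix can be done from a shell before uploading. This is a sketch using standard grep and sed; the function names and the file names passed to them are hypothetical.

```shell
#!/bin/sh
# Count the lines that end in a carriage return (the ^M shown by vi).
count_crlf_lines() {
    cr=$(printf '\r')
    grep -c "${cr}\$" "$1"
}

# Write a copy of the file with trailing carriage returns removed.
# $1 is the input file, $2 is the cleaned output file.
strip_cr() {
    cr=$(printf '\r')
    sed "s/${cr}\$//" "$1" > "$2"
}
```

A count of 0 from count_crlf_lines means the file already uses Linux line endings and needs no conversion.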

Examine other log files from your analysis method

  • While the Rescale Platform will output standard output messages to process_output.log, some analysis methods will print critical information to other log files

  • These output files will usually have file extensions of 'log', 'out', 'live', or 'dat', but may vary depending on the analysis method. Please refer to the software vendor's documentation

  • On the "Results" page, filter your result files to find the appropriate logs

  • These log files will also generally be ASCII text files, so you can view them using the small screen icon in the right hand column next to the filename

    • As with the process_output.log file, if a log file is too large, download it to your local workstation and view it in a text editor

Missing library files

  • Ensure that the process has proper access and path definitions to any custom library files used in the job
  • Rescale Support may have to install additional libraries for your application

Out of system resources

  • Check that your simulation process has sufficient physical memory and storage during runtime
    • Some codes will change their memory footprint size during runtime depending on the analysis, so sufficient memory at start-up may not persist
    • Some codes will also generate a large amount of scratch data files that could bloat the storage footprint beyond that of the final output files
  • Refer to the "Cluster Status" on the bottom of the Job Status page for monitors of free memory and disk space
  • Reduce the size of your mesh/simulation to see if the job can run successfully
  • Run on more cores and/or nodes
  • Select specialty core-types with more physical memory or storage
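When you are connected to a compute node over ssh, the same quantities can be checked by hand. The sketch below assumes a Linux node with /proc/meminfo available; the function names are made up for this example, and the directory to inspect (for instance ~/work) is supplied by the caller.

```shell
#!/bin/sh
# Report free disk space (in KiB) on the file system holding a directory.
free_disk_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Report available memory (in KiB) from /proc/meminfo (Linux only).
free_mem_kb() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# List the five largest entries under a directory, to spot scratch data.
largest_entries() {
    du -sk "$1"/* 2>/dev/null | sort -rn | head -n 5
}

# Usage: check the current directory.
free_disk_kb .
```

Sampling these values a few times during a run gives a rough picture of how the memory and scratch footprint grow over time.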

Proper license access

  • Ensure that your license information on the Software Settings page is properly defined
  • Ensure that you are checking out features and versions authorized by your license
  • Check that your license has not expired
  • If you are hosting the floating license and connecting through a proxy:
    • Check that your license server is up and has network access
    • Make sure that the SSH tunnel is running on your side
    • Please refer to our guide on Using In-House Floating Licenses for more details

Debugging your workflow

  • Before diving in with a production sized run, set up a small test case that exercises your workflow
  • Check that your pre- and post-processing steps are properly integrated into your Analysis Options > Command
  • Run a test job interactively
    • Replace the existing Command with sleep 3600
    • ssh into the compute node once it starts
    • Change into the appropriate directory for your analysis method (~/work or ~/work/shared)
    • Attempt to launch the job interactively
    • Record all of the commands that produce a successful outcome
    • Modify your Command(s) accordingly
      • The Command input window will accept line breaks, ; or && marks to separate commands
      • Note: a command following && will only execute if the previous command completes with exit code 0, while a command following ; will always execute after the previous one
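The difference between the two separators is easy to see with a step that always fails. In this sketch, false stands in for a failing pre-processing step and the echo stands in for the step that follows it; each pair runs in its own sh -c so the two cases stay independent.

```shell
#!/bin/sh
# "false" stands in for a step that fails with a non-zero exit code.

# With &&, the second command is skipped because the first one failed.
and_chain() {
    sh -c 'false && echo "second command ran"'
}

# With ;, the second command runs regardless of the first one's result.
semicolon_chain() {
    sh -c 'false ; echo "second command ran"'
}

echo "with &&: $(and_chain)"
echo "with ; : $(semicolon_chain)"
```

Using && between steps is usually the safer choice when a later step depends on the output of an earlier one, since it avoids running the solver on incomplete inputs.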

If these common debugging steps do not resolve your problem, please contact Rescale Support and share your job with them.