Running Jobs

Machine Learning Applications

This subsection briefly describes how to run and monitor your ML jobs on the CS-3 cloud. ML users should refer to training-docs.cerebras.ai, in particular the following sections:

Note: The cloud runs an upgraded cluster software version in order to support SDK and ML development simultaneously. As a result, modelzoo users need to append --disable_version_check to their cszoo fit and cszoo eval commands.
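
As a rough illustration of the note above, a small shell wrapper can append the flag so it is not forgotten. This is only a sketch: the wrapper name cszoo_nocheck, the DRY_RUN variable, and the params.yaml argument are hypothetical and not part of the Cerebras tooling. Only the flag and the cszoo fit and cszoo eval subcommands come from the note.

```shell
# Hypothetical wrapper: appends --disable_version_check to cszoo fit/eval.
# Set DRY_RUN=1 to print the command instead of running it.
cszoo_nocheck() {
    case "$1" in
        fit|eval) ;;
        *) echo "usage: cszoo_nocheck fit|eval <args...>" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show what would be executed, without calling cszoo.
        echo "cszoo $* --disable_version_check"
    else
        cszoo "$@" --disable_version_check
    fi
}
```

For example, DRY_RUN=1 cszoo_nocheck fit params.yaml prints the full command so you can check it before running it for real.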

Explanation of Terms in the Cerebras Compile Report

When you submit a job, the Cerebras Compile Report provides insight into:

  • Model Compilation Time: Duration required for the model to be compiled.
  • Resource Allocation: CS-3 systems allocated for the job.
  • Memory Utilization: Efficiency of memory usage.
  • Execution Status: Whether the job is QUEUED, RUNNING, FAILED, or COMPLETED.
  • Optimization Suggestions: Any recommendations to enhance efficiency.

Job Submission and Monitoring Procedures

Sample commands are as follows:

Submitting a Job

Each project has a dedicated directory for training jobs, for example, /cra-XYZ/demo/trials. To submit a job:

  1. Navigate to the directory of the desired model.
  2. Run the experiment script: bash run.sh
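
The two steps above can be sketched as a small helper. The function name submit_job and the run.log file name are hypothetical, chosen only for illustration. The sketch assumes the model directory contains a run.sh as described.

```shell
# Sketch: run a model's run.sh from its own directory and keep a log.
# submit_job and the default log name run.log are hypothetical.
submit_job() {
    model_dir=$1
    log_file=${2:-run.log}
    if [ ! -f "$model_dir/run.sh" ]; then
        echo "no run.sh in $model_dir" >&2
        return 1
    fi
    # Subshell so the caller's working directory is left unchanged.
    # Output goes to the log inside the model directory.
    ( cd "$model_dir" && bash run.sh > "$log_file" 2>&1 )
}
```

Calling submit_job path/to/model writes the script output to path/to/model/run.log, which can be followed with tail -f while the job runs.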

Monitoring Jobs

To check job status: csctl get jobs -a

Running jobs will have a 'RUNNING' status, and queued jobs will have a 'QUEUED' status.
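
For a quick summary, the listing can be piped through a small counter. This is a sketch under a stated assumption: each job occupies one line of the output and that line contains its status keyword. The actual csctl output layout may differ, and the function name count_jobs is hypothetical.

```shell
# Count lines mentioning a given status in job-listing text read from stdin.
# Assumes one job per line with the status keyword on that line;
# the real csctl output layout may differ.
count_jobs() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the helper's exit status at 0.
    grep -c -w -- "$1" || true
}
```

Under that assumption, csctl get jobs -a | count_jobs RUNNING reports how many jobs are currently running.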

Monitoring with TensorBoard

To visualize training progress: tensorboard --logdir=. --bind_all --port 6006

Access TensorBoard from your browser on port 6006 of the machine where the command above is running.

Killing a Job

To terminate a running job: csctl cancel job <jobID>

To find <jobID>, use csctl get jobs -a

Resource Utilization Best Practices

  • Use tmux to avoid job termination due to disconnection: tmux new -s my_session
  • Activate the Cerebras Virtual Environment before running jobs: source /cra-XYZ/venvs/2.4.0/bin/activate
  • Submit jobs in advance if running a model for the first time, as compilation may take time.
  • Store data properly in /cra-XYZ to ensure access.
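
The first two practices can be combined with a simple guard that refuses to continue when no virtual environment is active. This is a minimal sketch: the function name require_venv is hypothetical, and it relies on the VIRTUAL_ENV variable that standard venv and virtualenv activation scripts export.

```shell
# Refuse to continue unless a Python virtual environment is active.
# Activation scripts created by venv/virtualenv export VIRTUAL_ENV.
require_venv() {
    if [ -z "${VIRTUAL_ENV:-}" ]; then
        echo "activate the Cerebras virtual environment first" >&2
        return 1
    fi
    echo "using venv: $VIRTUAL_ENV"
}
```

A typical session would start with tmux new -s my_session, then source /cra-XYZ/venvs/2.4.0/bin/activate, then require_venv && bash run.sh. After a disconnection, tmux attach -t my_session returns to the same session.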

SDK Application

To get started, you can find the basic login and access information for the CS-3 cloud at https://sdk.cerebras.net/appliance-mode