Running Jobs¶
Machine Learning Applications¶
This subsection briefly describes how to run and monitor your ML jobs on the CS-3 cloud. ML users should consult training-docs.cerebras.ai, in particular the following sections:
- Setup & Installation and the Modelzoo CLI to get up and running with basic commands.
- Modelzoo Overview to build a mental model of the library and understand where they should begin based on their skill level.
- The Converter Tool for leveraging pretrained models.
- Writing a Custom Training Loop for learning how to integrate existing workflows via cstorch.
Note: The cloud runs an upgraded cluster software version in order to support SDK and ML development simultaneously. Modelzoo users therefore need to append --disable_version_check to their cszoo fit and cszoo eval commands.
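The note above can be sketched as a small Bash wrapper so the flag is never forgotten. This is only an illustration: the function name cszoo_cloud and the CSZOO_BIN override are hypothetical and not part of Modelzoo. Only the cszoo fit and cszoo eval commands and the --disable_version_check flag come from the note itself.

```shell
# Hypothetical helper (not part of Modelzoo): run `cszoo fit` or `cszoo eval`
# and always append --disable_version_check, as the note above requires.
# CSZOO_BIN is an assumed override so the function can be dry-run with `echo`.
cszoo_cloud() {
    local subcmd="$1"
    shift
    case "$subcmd" in
        fit|eval)
            "${CSZOO_BIN:-cszoo}" "$subcmd" "$@" --disable_version_check
            ;;
        *)
            # Other subcommands are passed through unchanged.
            "${CSZOO_BIN:-cszoo}" "$subcmd" "$@"
            ;;
    esac
}
```

With this in place, `cszoo_cloud fit params.yaml` would run `cszoo fit params.yaml --disable_version_check`, where params.yaml stands in for your own config file.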
Explanation of Terms in the Cerebras Compile Report¶
When you submit a job, the Cerebras Compile Report provides insight into:
- Model Compilation Time: Duration required for the model to be compiled.
- Resource Allocation: CS-3 systems allocated for the job.
- Memory Utilization: How efficiently the job uses memory.
- Execution Status: Whether the job is QUEUED, RUNNING, FAILED, or COMPLETED.
- Optimization Suggestions: Any recommendations to enhance efficiency.
Job Submission and Monitoring Procedures¶
Sample commands are as follows:
Submitting a Job¶
Each project has a dedicated directory for training jobs, for example /cra-XYZ/demo/trials.
To submit a job:
- Navigate to the directory of the desired model.
- Run the experiment script:
bash run.sh
Monitoring Jobs¶
To check job status:
csctl get jobs -a
Running jobs have a 'RUNNING' status, and queued jobs have a 'QUEUED' status.
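To check a single job from a script, the status can be filtered out of the listing. The sketch below makes two assumptions about the listing format that this guide does not confirm: each job sits on one line that contains its ID, and the status appears on that line as one of the words named in this guide. The job IDs in the usage note are made up, and the real output may use different status words.

```shell
# Print the status word for one job, reading a job listing on stdin.
# Assumptions: one job per line, the line contains the job ID, and the status
# is one of the words named in this guide. Prints nothing if no line matches.
job_status() {
    local job_id="$1"
    grep -F -w -- "$job_id" \
        | grep -o -E 'QUEUED|RUNNING|FAILED|COMPLETED' \
        | head -n 1
}
```

A typical call would be `csctl get jobs -a | job_status <jobID>`. Reading from stdin keeps the function independent of csctl, so it can be tried against a saved listing first.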
Monitoring with TensorBoard¶
To visualize training progress:
tensorboard --logdir=. --bind_all --port 6006
Access TensorBoard from your browser:¶
- Default link:
http://cg3-us27.dfw1.cerebrascloud.com:6006
- If inaccessible, try using the IP address:
http://172.16.4.243:6006/
Killing a Job¶
To terminate a running job:
csctl cancel job <jobID>
To find the <jobID>, run csctl get jobs -a
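The two steps above (look up the ID, then cancel) can be combined into a guarded helper that refuses to cancel an ID missing from the listing. This is a sketch under assumptions: cancel_if_listed and the CSCTL override are hypothetical names, and the check only assumes that the job ID appears somewhere in the output of csctl get jobs -a.

```shell
# Cancel a job only if its ID appears in the current job listing.
# CSCTL is an assumed override (defaults to csctl) so the logic can be
# exercised against a stub before touching real jobs.
cancel_if_listed() {
    local job_id="$1"
    local csctl="${CSCTL:-csctl}"
    if "$csctl" get jobs -a | grep -q -F -w -- "$job_id"; then
        "$csctl" cancel job "$job_id"
    else
        echo "job $job_id not found in 'csctl get jobs -a' output" >&2
        return 1
    fi
}
```

The lookup guards against typos in the ID. It does not check who owns the job or whether it is still running.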
Resource Utilization Best Practices¶
- Use tmux to avoid job termination due to disconnection:
tmux new -s my_session
- Activate the Cerebras Virtual Environment before running jobs:
source /cra-XYZ/venvs/2.4.0/bin/activate
- Submit jobs well in advance when running a model for the first time, because compilation can take a while.
- Store data under /cra-XYZ to ensure it is accessible.
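The first two practices above can be checked automatically before a job is launched. The sketch below relies on two environment variables: TMUX, which tmux sets inside its sessions, and VIRTUAL_ENV, which a virtual environment's activate script sets. The function name preflight is hypothetical.

```shell
# Warn if the shell is not inside tmux or has no virtual environment active.
# Returns 0 when both checks pass and 1 otherwise.
preflight() {
    local status=0
    if [ -z "${TMUX:-}" ]; then
        echo "warning: not inside tmux; start a session with: tmux new -s my_session" >&2
        status=1
    fi
    if [ -z "${VIRTUAL_ENV:-}" ]; then
        echo "warning: no virtual environment active; run: source /cra-XYZ/venvs/2.4.0/bin/activate" >&2
        status=1
    fi
    return "$status"
}
```

One way to use it is `preflight && bash run.sh`, so the experiment script starts only when both checks pass.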
SDK Application¶
To get started, see the basic login and access information for the CS-3 cloud at https://sdk.cerebras.net/appliance-mode