Running Jobs via Pegasus
Introduction¶
This page provides researchers with the necessary resources to learn about running jobs on Neocortex using the Pegasus Workflow Management System. Pegasus streamlines complex workflows, making it easier to manage and execute computational tasks on high-performance computing (HPC) resources like Neocortex.
By utilizing Pegasus, researchers can achieve multiple benefits:
- Workflow Management Efficiency: Automate complex workflows, saving time and reducing errors.
- Scalability: Manage workflows of varying sizes efficiently on Neocortex's parallel computing resources.
- Reproducibility: Ensure the consistent and repeatable execution of workflows across different runs.
- Data Provenance Tracking: Keep track of the origin and processing steps of data used in the workflow, enhancing accuracy and transparency.
Self-Guided Learning Resources¶
Before scheduling a specialized training session with the Pegasus team, we recommend familiarizing yourself with the following resources (mainly the ACCESS Pegasus Overview item):
- ACCESS Pegasus Overview: This resource offers a high-level introduction to Pegasus, its key features, and its benefits for research workflows.
- ACCESS Pegasus Documentation: This in-depth documentation delves into the practical aspects of using Pegasus on ACCESS, including specific configuration details and code examples relevant to Neocortex.
- Pegasus User Guide: The official Pegasus User Guide provides comprehensive documentation on Pegasus's functionalities, architecture, installation, and usage. This guide is a valuable resource for researchers who want a deeper understanding of the system.
Next Steps¶
Once you have reviewed the self-guided learning resources and feel comfortable with the basic concepts of Pegasus, you can schedule a specialized training session (office hour) with the Pegasus team for more advanced topics or specific Neocortex configurations.
Launching Pegasus Jobs on Neocortex¶
The Pegasus Workflow Management System can be used on Neocortex through two different interfaces: the command line (CLI) or Jupyter Notebook.
The Neocortex team initially focused on getting the CLI version working, so please use that one while we finish deploying the Jupyter Notebook option.
Also, the easiest way to get started running jobs with Pegasus on Neocortex is to build on top of the existing examples created by the Pegasus and Neocortex teams, so please see below for how to run the examples using both the CLI and Jupyter Notebook options.
Pegasus on the CLI¶
- Step 1: To use Pegasus, you need to be on the Neocortex Pegasus machine. To get there, connect to a Neocortex terminal using either SSH or the Open OnDemand terminal.
  - If using SSH, log in to Neocortex as usual and then start an interactive shell on the pegasus partition:

        srun --partition=pegasus --pty bash

  - If using the Open OnDemand terminal, connect to Open OnDemand (OOD) following the instructions outlined in the Neocortex Open OnDemand section, then launch the terminal shown on the home page or in the Applications menu.
- Step 2: Double-check that you are indeed on the right machine with access to the Pegasus commands. You can run the condor_version or condor_q command to see what the HTCondor service returns. Running the condor_version command verifies that HTCondor is set up:

        condor_version
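If you are scripting this check, a small sketch that fails fast when the HTCondor client tools are missing (which usually means you are not on the pegasus partition) could look like this:

```shell
# Sketch: verify the HTCondor client tools are on PATH before proceeding.
# If condor_version is missing, you are likely not on the pegasus partition.
if command -v condor_version >/dev/null 2>&1; then
    condor_version
else
    echo "condor_version not found: connect to the pegasus partition first" >&2
fi
```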
- Step 3: Clone the Pegasus team's Cerebras Model Zoo workflows for use with Pegasus:

        git clone git@github.com:pegasus-isi/cerebras-modelzoo.git

  Change directories into the freshly cloned repository:

        cd cerebras-modelzoo/pt/
- Step 4: Double-check the model and dataset configuration, and customize it as needed.
  - Take a look at the README.md file for more details:

        cat README.md
  - Then run the prepare_inputs.sh script to download the dataset. Example output from the prepare_inputs.sh command:

        ./executables/prepare_inputs.sh
        Going to rsync the modelzoo repository into ./input
        sending incremental file list
        modelzoo/
        modelzoo/.gitignore
                     94 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=770/772)
        modelzoo/LICENSE
                 11,357 100%   10.83MB/s    0:00:00 (xfr#2, to-chk=769/772)
        [...] OUTPUT TRIMMED ON PURPOSE FOR CONCISENESS [...]
        modelzoo/user_scripts/csrun_wse
                 11,308 100%   19.97kB/s    0:00:00 (xfr#615, to-chk=0/772)
        rsync of the modelzoo repository into ./input completed.
        Removing params.yaml from pytorch config dir
        Tarring up the git checkout to $HOME/cerebras-modelzoo/pt/input/modelzoo-raw.tgz
  - You can double-check that it downloaded as expected by listing the files under the input/ directory:

        ls input/
  - You can also view or customize the params.yaml configuration file as needed:

        cat input/params.yaml
  - After this, you will be ready to run the example, but you need to specify the Neocortex allocation identifier to use for running this job (the charge ID to use with Slurm). You can take a look at your available ones with the groups command:

        groups
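Since the groups output can also contain non-allocation group names, one way to narrow it down is a quick filter (a sketch; the naming pattern is assumed from the example ID cis000000p below, so adjust it if your allocation IDs look different):

```shell
# List only group names that look like Slurm charge IDs, e.g. cis000000p
# (pattern assumed from the example ID; adjust if your IDs differ).
groups | tr ' ' '\n' | grep -E '^[a-z]+[0-9]+p$' || echo "no allocation-style group names found"
```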
- Step 5: Run the example.
  - As soon as you have the project ID to use, specify it using the --project= argument and run the job with the following command:

        ./cerebras-modelzoo-pt.py --project=cis000000p
  - You should get some output explaining that the job was submitted, along with commands to manage the job lifecycle. For example, to monitor the status of the job, use the pegasus-status -l command. Example output:

        pegasus-status -l $HOME/cerebras-modelzoo/pt/submit/researcher/pegasus/cerebras-model-zoo-pt/run0001
        Press Ctrl+C to exit (pid=2260123)
        Fri Dec-13-2024 16:18:44
         ID  SITE       STAT  IN_STATE  JOB
        140  local      Run      03:01  cerebras-model-zoo-pt-0 ($HOME/cerebras-modelzoo/pt/submit/researcher/pegasus/cerebras-model-zoo-pt/run0001)
        145  neocortex  Run      00:19  ┗━validate_ID0000001
        Summary: 2 Condor jobs total (R:2)

        UNREADY  READY  PRE  IN_Q  POST  DONE  FAIL  %DONE  STATE    DAGNAME
              8      0    0     1     0     4     0  30.77  Running  cerebras-model-zoo-pt-0.dag
        Summary: 1 DAG total (Running:1)
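If you want to pull a single value, such as the completion percentage, out of that numeric summary row in a script, a toy sketch with standard tools (using the sample summary row from the output above) is:

```shell
# Toy sketch: extract %DONE and STATE from the pegasus-status summary row
# shown above (fields 8 and 9 of the numeric summary line).
summary="8 0 0 1 0 4 0 30.77 Running cerebras-model-zoo-pt-0.dag"
pct_done=$(echo "$summary" | awk '{print $8}')
state=$(echo "$summary" | awk '{print $9}')
echo "workflow is $state, $pct_done% done"
# prints: workflow is Running, 30.77% done
```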
  - You can also use the condor_q command to see the job status and the HTCondor queue. Example output from the condor_q command:

        condor_q
        -- Schedd: pegasus.neocortex.psc.edu : <IP.ADD.RE.SS:9618?... @ 12/13/24 16:27:45
        OWNER       BATCH_NAME                       SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
        researcher  cerebras-model-zoo-pt-0.dag+140 12/13 16:15   10     1     _     13  150.0

        Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
        Total for researcher: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
        Total for all users: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Pegasus on Jupyter Notebook via Open OnDemand¶
- Step 1: Connect to Open OnDemand (OOD), following the instructions outlined in the Neocortex Open OnDemand section.
- Step 2: Launch Jupyter Notebook
- Once logged in, you will be directed to the OnDemand dashboard. From here, navigate to the Interactive Apps menu.
- Under Interactive Apps, select Jupyter Notebook.
- Configure your session by specifying the desired options, such as the number of CPU cores, memory, and time limit. The number of hours can be set based on your needs (for example, 1 hour), and the Number of Nodes should be set to 1.
- In the "Extra Slurm Args" field, specify the "pegasus" partition with the following argument: --partition=pegasus
- Click Launch to start the Jupyter Notebook session.
- You are now ready to work with Jupyter Notebook on the Neocortex system.