Running Jobs via Pegasus
Introduction¶
This page provides researchers with the necessary resources to learn about running jobs on Neocortex using the Pegasus Workflow Management System. Pegasus streamlines complex workflows, making it easier to manage and execute computational tasks on high-performance computing (HPC) resources like Neocortex.
By utilizing Pegasus, researchers can achieve multiple benefits:
- Workflow Management Efficiency: Automate complex workflows, saving time and reducing errors.
- Scalability: Manage workflows of varying sizes efficiently on Neocortex's parallel computing resources.
- Reproducibility: Ensure the consistent and repeatable execution of workflows across different runs.
- Data Provenance Tracking: Keep track of the origin and processing steps of data used in the workflow, enhancing accuracy and transparency.
Self-Guided Learning Resources¶
Before scheduling a specialized training session with the Pegasus team, we recommend familiarizing yourself with the following resources (mainly the ACCESS Pegasus Overview item):
- ACCESS Pegasus Overview: This resource offers a high-level introduction to Pegasus, its key features, and its benefits for research workflows.
- ACCESS Pegasus Documentation: This in-depth documentation delves into the practical aspects of using Pegasus on ACCESS, including specific configuration details and code examples relevant to Neocortex.
- Pegasus User Guide: The official Pegasus User Guide provides comprehensive documentation on Pegasus's functionalities, architecture, installation, and usage. This guide is a valuable resource for researchers who want a deeper understanding of the system.
Next Steps¶
Once you have reviewed the self-guided learning resources and feel comfortable with the basic concepts of Pegasus, you can schedule a specialized training session (office hour) with the Pegasus team for more advanced topics or specific Neocortex configurations.
Launching Pegasus Jobs on Neocortex¶
The Pegasus Workflow Management System can be used on Neocortex through two different interfaces: the command line (CLI) or Jupyter Notebook.
The Neocortex team initially focused on getting the CLI version working, so please use that one while we finish deploying the Jupyter Notebook option.
Also, the easiest way to get started running jobs with Pegasus on Neocortex is to build on top of the existing examples created by the Pegasus and Neocortex teams, so please see below for how to run the examples using both the CLI and Jupyter Notebook options.
Pegasus on the CLI¶
- Step 1: To use Pegasus, you need to be on the Neocortex Pegasus machine. To get there, connect to a Neocortex terminal using either SSH or the Open OnDemand terminal.
  - If using SSH, log in to Neocortex as usual and then start an interactive shell on the pegasus partition:

        srun --partition=pegasus --pty bash

  - If using the Open OnDemand terminal, connect to Open OnDemand (OOD) following the instructions outlined in the Neocortex Open OnDemand section, then launch the terminal shown on the home page or in the Applications menu.
- Step 2: Double-check that you are indeed on the right machine with access to the Pegasus commands. You can run the condor_version or condor_q command to see what the HTCondor service returns. Running the condor_version command verifies that HTCondor is set up:

        condor_version
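If you are scripting this check, a small sketch that fails fast when the HTCondor client tools are missing (which usually means you are not on the pegasus partition) could look like this:

```shell
# Sketch: verify the HTCondor client tools are on PATH before proceeding.
# If condor_version is missing, you are likely not on the pegasus partition.
if command -v condor_version >/dev/null 2>&1; then
    condor_version
else
    echo "condor_version not found: connect to the pegasus partition first" >&2
fi
```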
- Step 3: Clone the Pegasus team's Cerebras Model Zoo workflows for use with Pegasus:

        git clone git@github.com:pegasus-isi/cerebras-modelzoo.git

  Change directories into the freshly cloned repository:

        cd cerebras-modelzoo/pt/
- Step 4: Double-check the model and dataset configuration, and customize it as needed.
  - Take a look at the README.md file for more details:

        cat README.md
  - Then run the prepare_inputs.sh script to download the dataset. Example output from the prepare_inputs.sh command:

        ./executables/prepare_inputs.sh
        Going to rsync the modelzoo repository into ./input
        sending incremental file list
        modelzoo/
        modelzoo/.gitignore
                     94 100%    0.00kB/s    0:00:00 (xfr#1, to-chk=770/772)
        modelzoo/LICENSE
                 11,357 100%   10.83MB/s    0:00:00 (xfr#2, to-chk=769/772)
        [...] OUTPUT TRIMMED ON PURPOSE FOR CONCISENESS [...]
        modelzoo/user_scripts/csrun_wse
                 11,308 100%   19.97kB/s    0:00:00 (xfr#615, to-chk=0/772)
        rsync of the modelzoo repository into ./input completed.
        Removing params.yaml from pytorch config dir
        Tarring up the git checkout to $HOME/cerebras-modelzoo/pt/input/modelzoo-raw.tgz
  - You can double-check that it downloaded as expected by listing the files under the input/ directory:

        ls input/
  - You can also view or customize the params.yaml configuration file as needed:

        cat input/params.yaml
  - After this, you will be ready to run the example, but you need to specify the Neocortex allocation identifier to use for running this job (the charge ID to use with Slurm). You can take a look at your available ones with the groups command:

        groups
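Since the groups output can also contain non-allocation group names, one way to narrow it down is a quick filter (a sketch; the naming pattern is assumed from the example ID cis000000p below, so adjust it if your allocation IDs look different):

```shell
# List only group names that look like Slurm charge IDs, e.g. cis000000p
# (pattern assumed from the example ID; adjust if your IDs differ).
groups | tr ' ' '\n' | grep -E '^[a-z]+[0-9]+p$' || echo "no allocation-style group names found"
```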
- Step 5: Run the example.
  - As soon as you have the project ID to use, specify it using the --project= argument and run the job with the following command:

        ./cerebras-modelzoo-pt.py --project=cis000000p
  - You should get some output explaining that the job was submitted, along with commands to manage the job lifecycle. For example, to monitor the status of the job, use the pegasus-status -l command. Example output:

        pegasus-status -l $HOME/cerebras-modelzoo/pt/submit/researcher/pegasus/cerebras-model-zoo-pt/run0001
        Press Ctrl+C to exit (pid=2260123)
        Fri Dec-13-2024 16:18:44
         ID  SITE       STAT  IN_STATE  JOB
        140  local      Run      03:01  cerebras-model-zoo-pt-0 ($HOME/cerebras-modelzoo/pt/submit/researcher/pegasus/cerebras-model-zoo-pt/run0001)
        145  neocortex  Run      00:19  ┗━validate_ID0000001
        Summary: 2 Condor jobs total (R:2)

        UNREADY  READY  PRE  IN_Q  POST  DONE  FAIL  %DONE  STATE    DAGNAME
              8      0    0     1     0     4     0  30.77  Running  cerebras-model-zoo-pt-0.dag
        Summary: 1 DAG total (Running:1)
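If you want to pull a single value, such as the completion percentage, out of that numeric summary row in a script, a toy sketch with standard tools (using the sample summary row from the output above) is:

```shell
# Toy sketch: extract %DONE and STATE from the pegasus-status summary row
# shown above (fields 8 and 9 of the numeric summary line).
summary="8 0 0 1 0 4 0 30.77 Running cerebras-model-zoo-pt-0.dag"
pct_done=$(echo "$summary" | awk '{print $8}')
state=$(echo "$summary" | awk '{print $9}')
echo "workflow is $state, $pct_done% done"
# prints: workflow is Running, 30.77% done
```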
  - You can also use the condor_q command to see the job status and the HTCondor queue. Example output from the condor_q command:

        condor_q
        -- Schedd: pegasus.neocortex.psc.edu : <IP.ADD.RE.SS:9618?... @ 12/13/24 16:27:45
        OWNER       BATCH_NAME                       SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
        researcher  cerebras-model-zoo-pt-0.dag+140 12/13 16:15   10     1     _     13  150.0

        Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
        Total for researcher: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
        Total for all users: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Pegasus on Jupyter Notebook via Open OnDemand¶
- Step 1: Connect to Open OnDemand (OOD), following the instructions outlined in the Neocortex Open OnDemand section.
- Step 2: Launch Jupyter Notebook
- Once logged in, you will be directed to the OnDemand dashboard. From here, navigate to the Interactive Apps menu.
- Under Interactive Apps, select Jupyter Notebook.
- Configure your session by specifying the desired options, such as the number of CPU cores, memory, and time limit. The number of hours can be set based on your needs (for example, 1 hour), and the Number of Nodes should be set to 1.
- In the "Extra Slurm Args" field, specify the "pegasus" partition with the following argument: --partition=pegasus
- Click Launch to start the Jupyter Notebook session.
- You are now ready to work with Jupyter Notebook on the Neocortex system.