Reference Compilation Example¶
Neocortex¶
Connect to the login node.
ssh researcher@neocortex.psc.edu
researcher@neocortex.psc.edu's password: ****************
********************************* W A R N I N G ********************************
You have connected to one of the Neocortex login nodes.
LOG OFF IMMEDIATELY if you do not agree to the conditions stated in this warning
********************************* W A R N I N G ********************************
For documentation on Neocortex, please see https://portal.neocortex.psc.edu/docs/
Please contact neocortex@psc.edu with any comments/concerns.
[researcher@neocortex-login023 ~]$
Take a look at the project grants available. There are two grants listed: one for a different research project and one for Neocortex. Since the latter is the one that holds the Neocortex SUs, it should be specified for the commands that follow.
[researcher@neocortex-login023 ~]$ projects | grep "Project\|Title"
Project: CIS000000P
Title: A Very Important Project
Project: CIS123456P # << This one
Title: P99-Neocortex Research Project # << This one
Now let's take a look at the output of the groups command. Note that the group names are all lowercase, while the projects output above is not.
[researcher@neocortex-login023 ~]$ groups
cis000000p cis123456p
"cis000000p" is showing as the first in that line (leftmost). That means that it's the primary group. What we want is to have the P## group to be the primary for all of the following commands, so let's run the "newgrp" command specifying it so that happens.
[researcher@neocortex-login023 ~]$ newgrp cis123456p
Now, by running the groups command one more time, we see that the "cis123456p" group shows as primary, just like we need.
[researcher@neocortex-login023 ~]$ groups
cis123456p cis000000p
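As an additional quick check, id -gn prints only the current primary group (a standard Linux command, shown here for illustration):
[researcher@neocortex-login023 ~]$ id -gn
cis123456p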
Since we have the correct group showing as primary, we can now proceed to start a job for copying files and running the actual compilation steps. This will start the SLURM job under the correct project allocation ID (--account=GROUPID).
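If you prefer to be explicit rather than relying on the primary group, the allocation can also be passed to SLURM directly with --account. A minimal sketch, using the allocation from this example (the srun_train and sbatch examples later on do exactly this):
srun --account=cis123456p ...
#SBATCH --account=cis123456p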
We should start by setting some variables for copying the data.
[researcher@neocortex-login023 ~]$ export CEREBRAS_DIR=/ocean/neocortex/cerebras/
[researcher@neocortex-login023 ~]$ echo $PROJECT
/ocean/projects/cis123456p/researcher
In this case, we copy the files with rsync, since it updates the target directory with any changes from the source path; a plain cp does not handle an already existing target directory as gracefully.
Also, if there are no new files under $CEREBRAS_DIR/modelzoo, the output will only show "sending incremental file list" and nothing else will be transferred, since the up-to-date files are already in place.
Additionally, please keep in mind that the "modelzoo" folder being copied should belong to the correct group after running the following commands: for this specific case, to "cis123456p" and not to "cis000000p".
[researcher@sdf-1 ~]$ rsync -PaL --chmod u+w $CEREBRAS_DIR/modelzoo $PROJECT/
sending incremental file list
modelzoo/
modelzoo/LICENSE
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
[researcher@sdf-1 ~]$ ls $PROJECT/
modelzoo
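To double-check the group ownership mentioned above, you can list the copied folder and confirm that the group column shows "cis123456p" (a quick sanity check):
[researcher@sdf-1 ~]$ ls -ld $PROJECT/modelzoo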
Then change into the modelzoo folder of the model we want to evaluate/compile/train:
[researcher@neocortex-login023 ~]$ cd $PROJECT/modelzoo/fc_mnist/tf
This command will start a shell using the latest Cerebras container:
[researcher@neocortex-login023 tf]$ srun --pty --cpus-per-task=28 --kill-on-bad-exit singularity shell --cleanenv --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT /local1/cerebras/cbcore_latest.sif
Singularity>
Inside that shell, you can run the different validation and compilation commands. For example, to run a validate_only process:
Singularity> python run.py --mode train --validate_only --model_dir validate
INFO:tensorflow:TF_CONFIG environment variable: {}
Downloading and preparing dataset mnist (11.06 MiB) to cerebras/data/tfds/mnist/1.0.0...
Dl Completed...: 100%|██████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 12.50 url/s]
Extraction completed...: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.24 file/s]
Extraction completed...: 100%|█████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5.84 file/s]
Dl Size...: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 15.59 MiB/s]
Dl Completed...: 100%|██████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.23 url/s]
0 examples [00:00, ? examples/s]2021-02-17 15:54:14.174234: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
XLA Extraction Complete
=============== Starting Cerebras Compilation ===============
Cerebras compilation completed: 100%|██████████████████████████████████████████████████████████████████████████| 2/2 [00:02s, 1.12s/stages]
=============== Cerebras Compilation Completed ===============
Singularity>
In the same way, a compile_only process looks like this:
Singularity> python run.py --mode train --compile_only --model_dir compile
INFO:tensorflow:TF_CONFIG environment variable: {}
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
XLA Extraction Complete
=============== Starting Cerebras Compilation ===============
Cerebras compilation completed: | | 19/? [00:26s, 1.37s/stages]
=============== Cerebras Compilation Completed ===============
Singularity>
Now, different parameter files can be specified for the validation/compilation/training processes.
Let's say you want to use not the default "configs/params.yaml" file but one in a different (custom) directory (--params custom_configs/params.yaml). You can do this by copying the original "params.yaml" file and adjusting the values there; the output can also be written to a different path (--model_dir custom_output_dir):
Singularity> cp -r configs custom_configs
Singularity> vi custom_configs/params.yaml
Singularity> python run.py --mode train --compile_only --params custom_configs/params.yaml --model_dir custom_output_dir
INFO:tensorflow:TF_CONFIG environment variable: {}
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
XLA Extraction Complete
=============== Starting Cerebras Compilation ===============
Cerebras compilation completed: | | 19/? [00:25s, 1.34s/stages]
=============== Cerebras Compilation Completed ===============
Singularity>
The custom_configs and custom_output_dir directories contain the parameters used and the output of this example compilation.
Please note that the group ownership still points to the correct group ("cis123456p" for this example), since the account information to use was automatically passed to SLURM.
Singularity> ls -lash | grep custom
4.0K drwxr-sr-x 2 researcher cis123456p 4.0K Feb 17 17:07 custom_configs
4.0K drwxr-sr-x 3 researcher cis123456p 4.0K Feb 17 17:07 custom_output_dir
Singularity> ls -lsh custom*
custom_configs:
total 4.0K
4.0K -rw-r--r-- 1 researcher cis123456p 1.3K Feb 17 17:07 params.yaml
custom_output_dir:
total 16K
12K drwxr-sr-x 4 researcher cis123456p 12K Feb 17 17:08 cs_518e82fcc3928d8e9da4ffc039506e6f0019b41b46bc53085af34c080de4054e
4.0K -rw-r--r-- 1 researcher cis123456p 534 Feb 17 17:07 params.txt
Now, to train the model (since it compiles without issues), we can create wrapper scripts that save time and make sure the right syntax is used for SLURM, Singularity, and the Python training command inside the container.
These wrappers are tailored to the Neocortex setup:
[researcher@neocortex-login023 tf]$ vim srun_train
#!/usr/bin/bash
srun --account=cis123456p --gres=cs:cerebras:1 --ntasks=7 --cpus-per-task=14 --kill-on-bad-exit singularity exec --bind /local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,$PROJECT /local1/cerebras/cbcore_latest.sif ./run_train "$@"
[researcher@neocortex-login023 tf]$ vim run_train
#!/usr/bin/bash
python run.py --cs_ip ${CS_IP_ADDR} --mode train "$@"
[researcher@neocortex-login023 tf]$ chmod +x srun_train run_train
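Because both wrapper scripts forward any extra arguments through "$@", additional run.py flags can be appended at launch time. For example, a hypothetical invocation reusing the custom parameter file from the compile step above:
[researcher@neocortex-login023 tf]$ ./srun_train --model_dir custom_output_dir --params custom_configs/params.yaml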
[researcher@neocortex-login023 tf]$ ./srun_train --model_dir training_example
INFO:tensorflow:TF_CONFIG environment variable: {'cluster': {'chief': ['sdf-1:23111'], 'worker': ['sdf-1:23112', 'sdf-1:23113', 'sdf-1:23114', 'sdf-1:23115', 'sdf-1:23116', 'sdf-1:23117']}, 'task': {'type': 'chief', 'index': 0}}
WARNING:tensorflow:From /cb/toolchains/buildroot/monolith-default/202010061651-75-61959232/rootfs-x86_64/usr/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /cb/toolchains/buildroot/monolith-default/202010061651-75-61959232/rootfs-x86_64/usr/lib/python3.7/site-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2021-02-20 17:58:56.115455: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2021-02-20 17:58:56.132002: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2700000000 Hz
2021-02-20 17:58:56.133795: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4800490 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-02-20 17:58:56.133815: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-02-20 17:58:56.133907: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
WARNING:root:[input_fn] - flat_map(): use map() instead of flat_map() to improve performance and parallelize reads. If you are not calling `flat_map` directly, check if you are using: from_generator, TextLineDataset, TFRecordDataset, or FixedLenthRecordDataset. If so, set `num_parallel_reads` to > 1 or tf.data.experimental.AUTOTUNE, and TF will use map() automatically.
2021-02-20 17:58:57.534881: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX512F
2021-02-20 17:58:57.561731: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2700000000 Hz
2021-02-20 17:58:57.563606: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6af0720 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-02-20 17:58:57.563627: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-02-20 17:58:57.622282: I tensorflow/tools/xla_extract/tf_graph_to_xla_lib.cc:267] number of function defs:1
2021-02-20 17:58:57.622307: I tensorflow/tools/xla_extract/tf_graph_to_xla_lib.cc:268] cluster_9063863211648629377
2021-02-20 17:58:57.622313: I tensorflow/tools/xla_extract/tf_graph_to_xla_lib.cc:269] xla args number:23
2021-02-20 17:58:57.622317: I tensorflow/tools/xla_extract/tf_graph_to_xla_lib.cc:270] fdef_args number:23
2021-02-20 17:58:57.622321: I tensorflow/tools/xla_extract/tf_graph_to_xla_lib.cc:275] fdef output mapping signature -> node_def:
2021-02-20 17:58:57.622325: I tensorflow/tools/xla_extract/tf_graph_to_xla_lib.cc:277] "mean_1_0_retval" -> "Mean_1:output:0"
2021-02-20 17:58:57.688709: W tensorflow/compiler/tf2xla/kernels/random_ops.cc:52] Warning: Using tf.random.uniform with XLA compilation will ignore seeds; consider using tf.random.stateless_uniform instead if reproducible behavior is desired.
XLA Extraction Complete
=============== Starting Cerebras Compilation ===============
Cerebras compilation completed: | | 19/? [00:25s, 1.35s/stages]2021-02-20 17:59:25.699705: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
WARNING:tensorflow:From /cbcore/py_root/cerebras/tf/cs_estimator.py:558: Variable.load (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2021-02-20 17:59:25.862027: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into training_example/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:Programming CS-2 fabric. This may take couple of minutes - please do not interrupt.
INFO:tensorflow:Fabric programmed
INFO:tensorflow:Coordinator fully up. Waiting for Streaming (using 0.42% out of 308274 cores on the fabric)
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Waiting for 6 streamer(s) to prime the data pipeline
INFO:tensorflow:Streamers are ready
INFO:tensorflow:global step 1: loss = 2.3671875 (1.19 steps/sec)
INFO:tensorflow:global step 100: loss = 0.2467041015625 (89.75 steps/sec)
INFO:tensorflow:global step 200: loss = 0.1527099609375 (167.0 steps/sec)
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
INFO:tensorflow:global step 99700: loss = 0.0003674030303955078 (471.25 steps/sec)
INFO:tensorflow:global step 99800: loss = 0.038543701171875 (471.75 steps/sec)
INFO:tensorflow:global step 99900: loss = 0.0 (472.0 steps/sec)
INFO:tensorflow:Training finished with 25600000 samples in 211.863 seconds, 120832.56 samples / second
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 100000...
INFO:tensorflow:Saving checkpoints for 100000 into training_example/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 100000...
INFO:tensorflow:global step 100000: loss = 0.0823974609375 (471.75 steps/sec)
INFO:tensorflow:global step 100000: loss = 0.0823974609375 (471.75 steps/sec)
INFO:tensorflow:Loss for final step: 0.0824.
=============== Cerebras Compilation Completed ===============
[researcher@neocortex-login023 tf]$
Finally, if you want to perform these steps in batch mode instead of interactively, you can run all of them from a single sbatch file, like this:
[researcher@neocortex-login023 tf]$ vim mnist.sbatch
#!/usr/bin/bash
#SBATCH --gres=cs:cerebras:1
#SBATCH --ntasks=7
#SBATCH --cpus-per-task=14
#SBATCH --account=cis123456p
newgrp cis123456p
cp ${0} slurm-${SLURM_JOB_ID}.sbatch
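# Paths for the training data, the staged Model Zoo, the Singularity bind mounts, and the container image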
YOUR_DATA_DIR=${LOCAL}/cerebras/data
YOUR_MODEL_ROOT_DIR=${PROJECT}/modelzoo/
YOUR_ENTRY_SCRIPT_LOCATION=${YOUR_MODEL_ROOT_DIR}/fc_mnist/tf
BIND_LOCATIONS=/local1/cerebras/data,/local2/cerebras/data,/local3/cerebras/data,/local4/cerebras/data,${YOUR_DATA_DIR},${YOUR_MODEL_ROOT_DIR}
CEREBRAS_CONTAINER=/ocean/neocortex/cerebras/cbcore_latest.sif
cd ${YOUR_ENTRY_SCRIPT_LOCATION}
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --validate_only --model_dir validate
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --compile_only --model_dir compile
srun --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --model_dir training_example --cs_ip ${CS_IP_ADDR}
[researcher@neocortex-login023 tf]$ sbatch mnist.sbatch
Submitted batch job 345
[researcher@neocortex-login023 tf]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
345 sdf mnist.sb researcher R 0:02 1 sdf-1
[researcher@neocortex-login023 tf]$ tail -f slurm-345.out
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
INFO:tensorflow:Cached compilation found for this model configuration
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 100000...
INFO:tensorflow:Saving checkpoints for 100000 into training_example/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 100000...
INFO:tensorflow:Programming CS-2 fabric. This may take couple of minutes - please do not interrupt.
INFO:tensorflow:Fabric programmed
INFO:tensorflow:Coordinator fully up. Waiting for Streaming (using 0.42% out of 308274 cores on the fabric)
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Waiting for 6 streamer(s) to prime the data pipeline
INFO:tensorflow:Streamers are ready
INFO:tensorflow:global step 100001: loss = 0.0 (0.43 steps/sec)
INFO:tensorflow:global step 100100: loss = 0.0 (37.72 steps/sec)
INFO:tensorflow:global step 100200: loss = 0.0019931793212890625 (72.94 steps/sec)
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
INFO:tensorflow:global step 199700: loss = 0.0 (470.75 steps/sec)
INFO:tensorflow:global step 199800: loss = 3.814697265625e-06 (471.25 steps/sec)
INFO:tensorflow:global step 199900: loss = 3.814697265625e-06 (469.0 steps/sec)
INFO:tensorflow:Training finished with 25600000 samples in 213.044 seconds, 120162.88 samples / second
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 200000...
INFO:tensorflow:Saving checkpoints for 200000 into training_example/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 200000...
INFO:tensorflow:global step 200000: loss = 0.0 (469.0 steps/sec)
INFO:tensorflow:global step 200000: loss = 0.0 (469.0 steps/sec)
INFO:tensorflow:Loss for final step: 0.0.
As can be seen in the output above, the previous training was resumed: the earlier run saved its final checkpoint at step 100000, and the batch job was submitted with the same model output directory (--model_dir training_example).
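To confirm that the run resumed and then advanced to step 200000, the checkpoint files in the model directory can be listed (a sketch; TensorFlow names them model.ckpt-<step>.*, and the exact files present depend on the checkpointing settings):
[researcher@neocortex-login023 tf]$ ls training_example/ | grep ckpt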
Bridges-2¶
Connect to the login node.
ssh researcher@bridges2.psc.edu
researcher@bridges2.psc.edu's password: ****************
********************************* W A R N I N G ********************************
You have connected to br012.ib.bridges2.psc.edu, a login node of Bridges 2.
LOG OFF IMMEDIATELY if you do not agree to the conditions stated in this warning
********************************* W A R N I N G ********************************
[---OUTPUT SNIPPED---]
Projects
------------------------------------------------------------
Project: cis000000p PI: Paola Buitrago ***** default charging project *****
Extreme Memory 1,000 SU remain of 1,000 SU active: Yes
GPU AI 2,500 SU remain of 2,500 SU active: Yes
Regular Memory 49,999 SU remain of 50,000 SU active: Yes
Ocean /ocean/projects/cis000000p 14.43G used of 1000G
Project: cis123456p PI: Paola Buitrago
Extreme Memory 1,000 SU remain of 1,000 SU active: Yes
GPU AI 2,500 SU remain of 2,500 SU active: Yes
Regular Memory 50,000 SU remain of 50,000 SU active: Yes
Ocean /ocean/projects/cis123456p 26.97G used of 1000G
[researcher@bridges2-login012 ~]$
Note
Please keep in mind that your Neocortex allocation/account has access to more than one Bridges-2 partition: RM (Regular Memory) and EM (Extreme Memory).
RM should be the default choice, since it has more SUs available; only switch to EM if needed, and after testing everything on RM, so your SUs don't run out prematurely. Additionally, the EM nodes do not allow running commands interactively via the "interact" command, so you will need to either submit in batch mode or use the srun command shown below.
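For reference, the partition is selected with --partition (or -p) in both modes; a minimal sketch using the allocation from this example:
interact -A cis123456p -p RM    # interactive session on the RM partition
#SBATCH --partition=EM          # in a batch script, to target the EM nodes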
Take a look at the project grants available. There are two grants listed: one for a different research project and one for Neocortex. Since the latter is the one that holds the Neocortex SUs, it should be specified for the commands that follow.
[researcher@bridges2-login012 ~]$ projects | grep "Project\|Title"
Project: CIS000000P
Title: A Very Important Project
Project: CIS123456P # << This one
Title: P99-Neocortex Research Project # << This one
Now let's take a look at the output of the groups command. Note that the group names are all lowercase, while the projects output above is not.
[researcher@bridges2-login012 ~]$ groups
cis000000p cis123456p
"cis000000p" is showing as the first in that line (leftmost). That means that it's the primary group. What we want is to have the P## group to be the primary for all of the following commands, so let's run the "newgrp" command specifying it so that happens.
[researcher@bridges2-login012 ~]$ newgrp cis123456p
Now, by running the groups command one more time, we see that the "cis123456p" group shows as primary, just like we need.
[researcher@bridges2-login012 ~]$ groups
cis123456p cis000000p
Since we have the correct group showing as primary, we can now proceed to start a job for copying files and running the actual compilation steps. This will start the SLURM job under the correct project allocation ID (--account=GROUPID).
We should start by running a simple interact job while specifying the allocation to use, like this:
[researcher@bridges2-login012 ~]$ CEREBRAS_DIR=/ocean/neocortex/cerebras/
[researcher@bridges2-login012 ~]$ interact -A cis123456p -p RM
A command prompt will appear when your session begins
"Ctrl+d" or "exit" will end your session
--partition=RM
salloc -J Interact --partition=RM
salloc: Pending job allocation 312345
salloc: job 312345 queued and waiting for resources
salloc: job 312345 has been allocated resources
salloc: Granted job allocation 312345
salloc: Waiting for resource configuration
salloc: Nodes r051 are ready for job
[researcher@r051 ~]$
Note
Please remember that the interactive mode can only be used for RM nodes. For EM nodes, the batch mode has to be used.
As seen from the previous output, the prompt changed from the "bridges2-login012" login node to "r051" on the RM partition. It's now time to set some variables for copying the data.
[researcher@r051 ~]$ CEREBRAS_DIR=/ocean/neocortex/cerebras/
[researcher@r051 ~]$ echo $PROJECT
/ocean/projects/cis123456p/researcher
In this case, we copy the files with rsync, since it updates the target directory with any changes from the source path; a plain cp does not handle an already existing target directory as gracefully.
Also, if there are no new files under $CEREBRAS_DIR/modelzoo, the output will only show "sending incremental file list" and nothing else will be transferred, since the up-to-date files are already in place.
Additionally, please keep in mind that the "modelzoo" folder being copied should belong to the correct group after running the following commands: for this specific case, to "cis123456p" and not to "cis000000p".
[researcher@r051 ~]$ rsync -PaL --chmod u+w $CEREBRAS_DIR/modelzoo $PROJECT/
sending incremental file list
modelzoo/
modelzoo/LICENSE
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
[researcher@r051 ~]$ ls $PROJECT/
modelzoo
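If you want to preview what a subsequent rsync would transfer without actually copying anything, the dry-run flag can be added (a sketch; -n makes no changes on disk):
[researcher@r051 ~]$ rsync -PaLn --chmod u+w $CEREBRAS_DIR/modelzoo $PROJECT/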
Since the files are already in place, we should exit this simple interactive session and start the actual compilation with more resources. You can exit the interactive mode by typing exit or pressing Ctrl+D.
[researcher@r051 ~]$ exit
exit
salloc: Relinquishing job allocation 312345
[researcher@bridges2-login012 ~]$
Then change into the modelzoo folder of the model we want to evaluate/compile/train:
[researcher@bridges2-login012 ~]$ cd $PROJECT/modelzoo/fc_mnist/tf
This command will start a shell using the latest Cerebras container. Please keep in mind that it might take a while for the job to start:
[researcher@bridges2-login012 tf]$ srun --pty --cpus-per-task=28 --account=cis123456p --partition=RM --kill-on-bad-exit singularity shell --cleanenv --bind $CEREBRAS_DIR/data,$PROJECT $CEREBRAS_DIR/cbcore_latest.sif
srun: job 345678 queued and waiting for resources
srun: job 345678 has been allocated resources
Singularity>
Inside that shell, you can run the different validation and compilation commands. For example, to run a validate_only process:
Singularity> python run.py --mode train --validate_only --model_dir validate
INFO:tensorflow:TF_CONFIG environment variable: {}
Downloading and preparing dataset mnist (11.06 MiB) to cerebras/data/tfds/mnist/1.0.0...
Dl Completed...: 100%|█████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 23.65 url/s]
Extraction completed...: 100%|█████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.29 file/s]
Extraction completed...: 100%|█████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5.81 file/s]
Dl Size...: 100%|██████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 15.72 MiB/s]
Dl Completed...: 100%|█████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 6.28 url/s]
0 examples [00:00, ? examples/s]2021-03-01 17:53:53.757037: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2245750000 Hz
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
XLA Extraction Complete
=============== Starting Cerebras Compilation ===============
Cerebras compilation completed: 100%|██████████████████████████████████████████████████| 2/2 [00:04s, 2.03s/stages]
=============== Cerebras Compilation Completed ===============
Singularity>
In the same way, a compile_only process looks like this:
Singularity> python run.py --mode train --compile_only --model_dir compile
INFO:tensorflow:TF_CONFIG environment variable: {}
WARNING:root:[input_fn] - flat_map(): use map() instead of flat_map() to improve performance and parallelize reads. If you are not calling `flat_map` directly, check if you are using: from_generator, TextLineDataset, TFRecordDataset, or FixedLenthRecordDataset. If so, set `num_parallel_reads` to > 1 or tf.data.experimental.AUTOTUNE, and TF will use map() automatically.
WARNING:tensorflow:From /cb/toolchains/buildroot/monolith-default/202010061651-75-61959232/rootfs-x86_64/usr/lib/python3.7/site-packages/tensorflow/python/ops/resource_variable_ops.py:1666: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2021-03-01 17:56:29.146050: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2245750000 Hz
2021-03-01 17:56:29.151928: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6308140 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-01 17:56:29.151990: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2021-03-01 17:56:29.182298: I tensorflow/tools/xla_extract/tf_graph_to_xla_lib.cc:267] number of function defs:1
2021-03-01 17:56:29.182327: I tensorflow/tools/xla_extract/tf_graph_to_xla_lib.cc:268] cluster_9063863211648629377
2021-03-01 17:56:29.182337: I tensorflow/tools/xla_extract/tf_graph_to_xla_lib.cc:269] xla args number:23
2021-03-01 17:56:29.182344: I tensorflow/tools/xla_extract/tf_graph_to_xla_lib.cc:270] fdef_args number:23
2021-03-01 17:56:29.182350: I tensorflow/tools/xla_extract/tf_graph_to_xla_lib.cc:275] fdef output mapping signature -> node_def:
2021-03-01 17:56:29.182357: I tensorflow/tools/xla_extract/tf_graph_to_xla_lib.cc:277] "mean_1_0_retval" -> "Mean_1:output:0"
2021-03-01 17:56:29.187951: W tensorflow/compiler/tf2xla/kernels/random_ops.cc:52] Warning: Using tf.random.uniform with XLA compilation will ignore seeds; consider using tf.random.stateless_uniform instead if reproducible behavior is desired.
XLA Extraction Complete
INFO:tensorflow:Cached compilation found for this model configuration
Singularity>
Now, different parameter files can be specified for the validation/compilation/training processes.
Let's say you want to use not the default "configs/params.yaml" file but one in a different (custom) directory (--params custom_configs/params.yaml). You can do this by copying the original "params.yaml" file and adjusting the values there; the output can also be written to a different path (--model_dir custom_output_dir):
Singularity> cp -r configs custom_configs
Singularity> vi custom_configs/params.yaml
Singularity> python run.py --mode train --compile_only --params custom_configs/params.yaml --model_dir custom_output_dir
INFO:tensorflow:TF_CONFIG environment variable: {}
[--- OUTPUT SNIPPED FOR KEEPING THIS EXAMPLE SHORT ---]
XLA Extraction Complete
=============== Starting Cerebras Compilation ===============
Cerebras compilation completed: | | 19/? [00:31s, 1.63s/stages]
=============== Cerebras Compilation Completed ===============
Singularity>
The custom_configs and custom_output_dir directories contain the parameters used and the output of this example compilation.
Please note that the group ownership still points to the correct group ("cis123456p" for this example), since the account information to use was automatically passed to SLURM.
Singularity> ls -lash | grep custom
4.0K drwxr-sr-x 2 researcher cis123456p 4.0K Mar 1 17:57 custom_configs
4.0K drwxr-sr-x 3 researcher cis123456p 4.0K Mar 1 17:58 custom_output_dir
Singularity> ls -lsh custom*
custom_configs:
total 4.0K
4.0K -rw-r--r-- 1 researcher cis123456p 1.3K Mar 1 17:57 params.yaml
custom_output_dir:
total 16K
12K drwxr-sr-x 4 researcher cis123456p 12K Mar 1 17:58 cs_518e82fcc3928d8e9da4ffc039506e6f0019b41b46bc53085af34c080de4054e
4.0K -rw-r--r-- 1 researcher cis123456p 534 Mar 1 17:58 params.txt
Now, regarding training the model (since it compiles without issues): training cannot be done on Bridges-2. You will have to connect to Neocortex and follow the training steps shown in the Neocortex section of this Reference Compilation Example.
Finally, if you want to perform these steps in batch mode instead of interactively via srun, you can run all of them from a single sbatch file. This also makes it possible to use the Extreme Memory nodes in the EM partition. (The Model Zoo can alternatively be obtained with git clone git@github.com:Cerebras/modelzoo.git, although the sbatch file below already stages it with rsync.) Like this:
[researcher@bridges2-login012 tf]$ vim mnist.sbatch
#!/usr/bin/bash
#SBATCH --cpus-per-task=28
#SBATCH --account=cis123456p
#SBATCH --partition=EM
#SBATCH --time=60:00
newgrp cis123456p
cp ${0} slurm-${SLURM_JOB_ID}.sbatch
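# Stage the Model Zoo into $PROJECT, then set the paths for the data, model code, bind mounts, and the container image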
CEREBRAS_DIR=/ocean/neocortex/cerebras/
rsync -PaL --chmod u+w $CEREBRAS_DIR/modelzoo $PROJECT/
YOUR_DATA_DIR=$CEREBRAS_DIR/data
YOUR_MODEL_ROOT_DIR=${PROJECT}/
YOUR_ENTRY_SCRIPT_LOCATION=${YOUR_MODEL_ROOT_DIR}/modelzoo/fc_mnist/tf
BIND_LOCATIONS=${YOUR_DATA_DIR},${YOUR_MODEL_ROOT_DIR},/local
CEREBRAS_CONTAINER=$CEREBRAS_DIR/cbcore_latest.sif
cd ${YOUR_ENTRY_SCRIPT_LOCATION}
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --validate_only --model_dir validate
srun --ntasks=1 --kill-on-bad-exit singularity exec --bind ${BIND_LOCATIONS} ${CEREBRAS_CONTAINER} python run.py --mode train --compile_only --model_dir compile
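Submit and monitor the job the same way as in the Neocortex example above (the job ID and output file name will differ; <jobid> is a placeholder):
[researcher@bridges2-login012 tf]$ sbatch mnist.sbatch
[researcher@bridges2-login012 tf]$ squeue -u researcher
[researcher@bridges2-login012 tf]$ tail -f slurm-<jobid>.out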
Note
If you run into problems when running jobs on Bridges-2, please remember to also take a look at the Bridges-2 User Guide.