High Performance Computing Cluster Setup: A Tutorial
Pub. online: 26 November 2024
Type: Computing In Data Science
Open Access
Received
26 May 2024
Accepted
30 October 2024
Published
26 November 2024
Abstract
When computations such as statistical simulations need to be carried out on a high performance computing (HPC) cluster, typical questions arise among researchers and practitioners. How do I interact with an HPC cluster? Do I need to type a long host name and a password on every single login or file transfer? Why does code that works locally no longer run on the HPC cluster? How can I install the latest versions of software on an HPC cluster to match my local setup? How can I submit a job and monitor its progress? This tutorial answers these questions with experiments on an example HPC cluster.
Supplementary material
The Supplementary Material contains detailed information on how R can be installed entirely in one’s home directory without the permission to write to system directories. We also provide more details about the options specified in our starter.sh Bash script.