slurm_sweep: A Lightweight Utility for Hyperparameter Sweeps on SLURM 🧪🔍

Connecting Weights & Biases with SLURM job arrays for easier hyperparameter optimization

For those of you working on SLURM clusters who struggle with running hyperparameter sweeps, I’ve released a small utility package called slurm_sweep that might save you some time and effort.

What is slurm_sweep? 🤔

slurm_sweep is a command-line utility that bridges the gap between Weights & Biases (W&B) hyperparameter sweeps and SLURM job arrays. It solves a specific workflow problem I kept encountering: efficiently parallelizing hyperparameter sweeps on cluster infrastructure while keeping experiment tracking organized.

How it works: Bringing two powerful tools together 🔗

The package combines two key ingredients:

  1. Weights & Biases (W&B) ๐Ÿ“ˆ - A robust experiment tracking platform that provides basic hyperparameter optimization strategies. W&B handles the parameter space exploration, tracking metrics, and visualizing results.

  2. simple_slurm 🖥️ - A Python interface to SLURM that makes it easy to generate SLURM job scripts programmatically, without having to deal with bash scripts directly.
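
To make the W&B side concrete, here is roughly what a sweep definition looks like. This is a minimal sketch, independent of slurm_sweep; the project name, metric, and parameter ranges are placeholders:

import wandb

# Hypothetical sweep definition; "method" can be "grid", "random", or "bayes".
sweep_config = {
    "method": "random",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [32, 64, 128]},
    },
}

# Registers the sweep with W&B and returns the ID that agents use to pull trials.
sweep_id = wandb.sweep(sweep_config, project="sweep-demo")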

By combining these components through a Python-based CLI, slurm_sweep eliminates the need for custom boilerplate code that connects hyperparameter selection with cluster job management.
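
On the SLURM side of that combination, simple_slurm’s programmatic interface looks roughly like this. Again a minimal sketch with placeholder resource values; slurm_sweep generates the equivalent submission logic for you from the YAML config:

from simple_slurm import Slurm

# Hypothetical resources; each kwarg maps to the corresponding #SBATCH option.
slurm = Slurm(
    array=range(0, 8),  # 8 array tasks -> up to 8 workers in parallel
    job_name='sweep',
    cpus_per_task=4,
    time='01:00:00',
    output=f'{Slurm.JOB_ARRAY_MASTER_ID}_{Slurm.JOB_ARRAY_ID}.out',
)
slurm.sbatch('python train.py')  # writes the job script and submits it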

The workflow 🔄

The basic workflow is straightforward:

  1. Create a YAML configuration file with your W&B and SLURM settings ⚙️
  2. Write your training script that uses W&B (a minimal sketch follows this list) 📝
  3. Use slurm-sweep to validate your config and generate a submission script ✅
  4. Submit your job array to SLURM 🚀
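
For step 2, the training script only needs the standard W&B calls. Here is a minimal sketch; the hyperparameter names, metric, and stubbed training loop are placeholders for your own code:

import random
import wandb

def train_one_epoch(lr: float, batch_size: int, epoch: int) -> float:
    """Stand-in for your actual training code."""
    return random.random() / (epoch + 1)

def main():
    # Inside a sweep, the W&B agent launches this script and fills
    # run.config with the sampled hyperparameters.
    run = wandb.init()
    lr = run.config.learning_rate
    batch_size = run.config.batch_size
    for epoch in range(10):
        val_loss = train_one_epoch(lr, batch_size, epoch)
        run.log({"epoch": epoch, "val_loss": val_loss})
    run.finish()

if __name__ == "__main__":
    main()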

Strengths and limitations ⚖️

What slurm_sweep does well ✅

  • Quick setup: Get up and running with just a YAML config and your training script 🏎️
  • Efficient parallelization: Leverages SLURM job arrays to run many trials in parallel (see the sketch after this list) ⚡
  • Minimal overhead: Lightweight implementation with few dependencies 🪶
  • Good integration: Combines the experiment tracking capabilities of W&B with SLURM’s resource management 🔄
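
The sketch referenced in the parallelization bullet: what follows is the general pattern this kind of setup builds on, not slurm_sweep’s actual generated script. Each SLURM array task typically starts one W&B agent, and every agent independently pulls trials from the sweep server; the sweep path and trial count below are placeholders:

import os
import wandb

def train():
    # Placeholder objective; see the training-script sketch above.
    run = wandb.init()
    run.log({"val_loss": 0.0})
    run.finish()

# N array tasks -> N agents running trials in parallel.
task_id = os.environ.get("SLURM_ARRAY_TASK_ID", "0")
print(f"array task {task_id}: starting a W&B agent")
wandb.agent("entity/project/sweep_id", function=train, count=5)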

Current limitations ⚠️

  • Search algorithm variety: slurm_sweep relies on W&B’s search algorithms, which are more limited than those of specialized tools
    • W&B offers grid search, random search, Bayesian optimization, and a basic hyperband implementation
    • Tools like Optuna provide many more specialized algorithms

Your choice between slurm_sweep and other tools will depend on whether you value simplicity and W&B integration over having access to more advanced search algorithms and features.

Example usage 💻

# Validate your config
slurm-sweep validate_config config.yaml

# Generate a submission script
slurm-sweep configure-sweep config.yaml

# Submit the job array
sbatch submit.sh

Alternative approaches 🔄

If you’re looking for hyperparameter optimization solutions, there are several excellent alternatives worth considering:

  • Optuna 🔮: A powerful optimization framework that supports various search algorithms and has built-in visualization tools.
  • Hydra 💧: A framework for elegantly configuring complex applications through dynamic, composable configurations.
  • Ray Tune ☀️: A scalable hyperparameter tuning library with advanced scheduling algorithms and integrations with various ML frameworks.
  • SEML 🧪: A framework from TUM designed specifically for managing ML experiments on SLURM clusters.

Each of these tools has its own strengths, and your choice might depend on your specific workflow needs, the scale of your experiments, and your preferred optimization strategies.

If you’re interested in trying out slurm_sweep, you can install it with pip:

pip install slurm_sweep

For more details and examples, check out the GitHub repository.
