slurm_sweep: A Lightweight Utility for Hyperparameter Sweeps on SLURM
Connecting Weights & Biases with SLURM job arrays for easier hyperparameter optimization
For those of you working on SLURM clusters who struggle with running hyperparameter sweeps, I’ve released a small utility package called slurm_sweep that might save you some time and effort.
What is slurm_sweep?
slurm_sweep is a command-line utility that bridges the gap between Weights & Biases (W&B) hyperparameter sweeps and SLURM job arrays. It solves a specific workflow problem I kept encountering: efficiently parallelizing hyperparameter sweeps on cluster infrastructure while keeping experiment tracking organized.
How it works: Bringing two powerful tools together
The package combines two key ingredients:
- Weights & Biases (W&B) - A robust experiment tracking platform that provides built-in hyperparameter optimization strategies. W&B handles the parameter space exploration, tracking metrics, and visualizing results.
- simple_slurm - A Python interface to SLURM that makes it easy to generate SLURM job scripts programmatically, without having to deal with bash scripts directly.
By combining these components through a Python-based CLI, slurm_sweep eliminates the need for custom boilerplate code that connects hyperparameter selection with cluster job management.
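To give a sense of the second ingredient, here is roughly what simple_slurm looks like on its own. This is a minimal sketch based on the library’s documented interface; the job name and resource values are placeholders:

import datetime
from simple_slurm import Slurm

# Each array index becomes one SLURM task, i.e. one parallel run
slurm = Slurm(
    job_name='sweep',
    array=range(10),                  # 10 parallel tasks
    cpus_per_task=4,
    time=datetime.timedelta(hours=2),
    output=f'{Slurm.JOB_ARRAY_MASTER_ID}_{Slurm.JOB_ARRAY_ID}.out',
)

# Every array task runs the same command; SLURM distinguishes tasks
# via the $SLURM_ARRAY_TASK_ID environment variable
slurm.sbatch('python train.py')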
The workflow
The basic workflow is straightforward:
- Create a YAML configuration file with your W&B and SLURM settings
- Write your training script that uses W&B (a minimal sketch follows this list)
- Use slurm-sweep to validate your config and generate a submission script
- Submit your job array to SLURM
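For step 2, the training script is an ordinary W&B run: it calls wandb.init(), reads the hyperparameters the sweep supplies through the run config, and logs the metric the sweep optimizes. A minimal sketch, where the hyperparameter names (lr, epochs) and the dummy training loop are purely illustrative:

import wandb

def train_one_epoch(lr, epoch):
    # Stand-in for real training logic; returns a fake, decreasing loss
    return 1.0 / (100 * lr + epoch + 1)

def main():
    run = wandb.init()       # the sweep injects the sampled hyperparameters
    lr = run.config.lr       # illustrative names; match your sweep config
    epochs = run.config.epochs

    for epoch in range(epochs):
        loss = train_one_epoch(lr, epoch)
        run.log({'val_loss': loss, 'epoch': epoch})

    run.finish()

if __name__ == '__main__':
    main()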
Strengths and limitations
What slurm_sweep does well
- Quick setup: Get up and running with just a YAML config and your training script
- Efficient parallelization: Leverages SLURM job arrays for highly efficient parallel execution
- Minimal overhead: Lightweight implementation with few dependencies
- Good integration: Combines the experiment tracking capabilities of W&B with SLURM’s resource management
Current limitations
- Search algorithm variety: slurm_sweep relies on W&B’s search algorithms, which are more limited than those of specialized tools
- W&B offers grid search, random search, Bayesian optimization, and a basic Hyperband implementation (illustrated in the sketch after this list)
- Tools like Optuna provide many more specialized algorithms
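For context, and independently of slurm_sweep, this is roughly what selecting a search method looks like when defining a W&B sweep in Python. The syntax is standard W&B sweep configuration; the metric name, parameter names, and project name are placeholders:

import wandb

# 'method' selects the search algorithm: 'grid', 'random', or 'bayes';
# 'early_terminate' enables W&B's Hyperband-style early stopping
sweep_config = {
    'method': 'bayes',
    'metric': {'name': 'val_loss', 'goal': 'minimize'},
    'parameters': {
        'lr': {'distribution': 'log_uniform_values', 'min': 1e-5, 'max': 1e-1},
        'epochs': {'values': [5, 10, 20]},
    },
    'early_terminate': {'type': 'hyperband', 'min_iter': 3},
}

sweep_id = wandb.sweep(sweep_config, project='my-project')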
Your choice between slurm_sweep and other tools will depend on whether you value simplicity and W&B integration over having access to more advanced search algorithms and features.
Example usage
# Validate your config
slurm-sweep validate_config config.yaml
# Generate a submission script
slurm-sweep configure-sweep config.yaml
# Submit the job array
sbatch submit.sh
Alternative approaches
If you’re looking for hyperparameter optimization solutions, there are several excellent alternatives worth considering:
- Optuna: A powerful optimization framework that supports various search algorithms and has built-in visualization tools.
- Hydra: A framework for elegantly configuring complex applications with dynamic configurations.
- Ray Tune: A scalable hyperparameter tuning library with advanced scheduling algorithms and integration with various ML frameworks.
- SEML: A framework from TUM designed specifically for ML experiment management on SLURM clusters.
Each of these tools has its own strengths, and your choice might depend on your specific workflow needs, the scale of your experiments, and your preferred optimization strategies.
If you’re interested in trying out slurm_sweep, you can install it with pip:
pip install slurm_sweep
For more details and examples, check out the GitHub repository.