Issue
I have simplified my issue as much as possible and I still get the error.
The whole idea is that I want to execute (inside a much more complex workflow) the command:
gmx mdrun -nt 12 -deffnm emin -cpi
on a cluster. For that I have a conda environment with GROMACS and Snakemake.
Done the traditional way, I have a jobscript (traditional_job.sh) with:
#!/bin/bash
#SBATCH --partition=uds-hub
#SBATCH --nodes=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=5000
#SBATCH --time=15:00
#SBATCH --job-name=reproduce_error
#SBATCH --output=reproduce_error.o
#SBATCH --error=reproduce_error.e
gmx mdrun -nt 12 -deffnm emin -cpi
And everything works as expected after sbatch traditional_job.sh. However, if I try to use Snakemake instead, the problems start.
My Snakefile is:
rule gmx:
    input:
        tpr = "emin.tpr"
    output:
        out = 'emin.gro'
    shell:
        '''
        gmx mdrun -nt 12 -deffnm emin -cpi
        '''
And my job.sh:
#!/bin/bash
snakemake \
--jobs 10000 \
--verbose \
--debug-dag \
--latency-wait 50 \
--cluster-cancel scancel \
--rerun-incomplete \
--keep-going \
--cluster '
sbatch \
--partition=uds-hub \
--nodes=1 \
--cpus-per-task=12 \
--mem=5000 \
--time=15:00 \
--job-name=reproduce_error \
--output=reproduce_error.o \
--error=reproduce_error.e '
After running ./job.sh, the GROMACS error (written to reproduce_error.e) is:
Program: gmx mdrun, version 2022.2-conda_forge
Source file: src/gromacs/taskassignment/resourcedivision.cpp (line 220)
Fatal error:
When using GPUs, setting the number of OpenMP threads without specifying the
number of ranks can lead to conflicting demands. Please specify the number of
thread-MPI ranks as well (option -ntmpi).
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
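For context, the message itself suggests pinning the rank count explicitly. A sketch of that variant (flag values are illustrative and not verified on this cluster):

```shell
# Variant the error message suggests: specify the thread-MPI rank
# count explicitly instead of only the total thread count.
# -ntmpi 1 keeps a single thread-MPI rank; -ntomp 12 gives that rank
# 12 OpenMP threads.
gmx mdrun -ntmpi 1 -ntomp 12 -deffnm emin -cpi
```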
I noticed that a different shell (#!/bin/sh) appears in the output of snakemake. But honestly, I do not know whether that could be the problem, nor how to solve it if that is the case:
Jobscript:
#!/bin/sh
# properties = {"type": "single", "rule": "gmx", "local": false, "input": ["emin.tpr"], "output": ["emin.gro"], "wildcards": {}, "params": {}, "log": [], "threads": 1, "resources": {"mem_mb": 1000, "mem_mib": 954, "disk_mb": 1000, "disk_mib": 954, "tmpdir": "<TBD>"}, "jobid": 0, "cluster": {}}
cd '/home/uds_alma015/GIT/BindFlow/ideas/reproduce error' && /home/uds_alma015/.conda/envs/abfe/bin/python3.9 -m snakemake --snakefile '/home/uds_alma015/GIT/BindFlow/ideas/reproduce error/Snakefile' --target-jobs 'gmx:' --allowed-rules 'gmx' --cores 'all' --attempt 1 --force-use-threads --resources 'mem_mb=1000' 'mem_mib=954' 'disk_mb=1000' 'disk_mib=954' --wait-for-files '/home/uds_alma015/GIT/BindFlow/ideas/reproduce error/.snakemake/tmp.lfq5jq4u' 'emin.tpr' --force --keep-target-files --keep-remote --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --rerun-triggers 'software-env' 'params' 'input' 'mtime' 'code' --skip-script-cleanup --conda-frontend 'mamba' --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 50 --scheduler 'ilp' --scheduler-solver-path '/home/uds_alma015/.conda/envs/abfe/bin' --default-resources 'mem_mb=max(2*input.size_mb, 1000)' 'disk_mb=max(2*input.size_mb, 1000)' 'tmpdir=system_tmpdir' --mode 2 && touch '/home/uds_alma015/GIT/BindFlow/ideas/reproduce error/.snakemake/tmp.lfq5jq4u/0.jobfinished' || (touch '/home/uds_alma015/GIT/BindFlow/ideas/reproduce error/.snakemake/tmp.lfq5jq4u/0.jobfailed'; exit 1)
P.S. ChatGPT goes in circles with this question :-)
Update
I isolated the error even further. The following other_job.sh script (submitted to the cluster as sbatch other_job.sh) also gave the same error:
#!/bin/bash
#SBATCH --partition=uds-hub
#SBATCH --nodes=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=5000
#SBATCH --time=15:00
#SBATCH --job-name=reproduce_error
#SBATCH --output=reproduce_error.o
#SBATCH --error=reproduce_error.e
/home/uds_alma015/.conda/envs/abfe/bin/python3.9 -m snakemake --snakefile '/home/uds_alma015/GIT/BindFlow/ideas/reproduce_error/Snakefile' --target-jobs 'gmx:' --allowed-rules 'gmx' --cores 'all' --attempt 1 --force-use-threads --resources 'mem_mb=1000' 'mem_mib=954' 'disk_mb=1000' 'disk_mib=954' --wait-for-files '/home/uds_alma015/GIT/BindFlow/ideas/reproduce_error/.snakemake/tmp.mqan6qbp' 'emin.tpr' --force --keep-target-files --keep-remote --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --rerun-triggers 'mtime' 'software-env' 'params' 'code' 'input' --skip-script-cleanup --conda-frontend 'mamba' --wrapper-prefix 'https://github.com/snakemake/snakemake-wrappers/raw/' --latency-wait 50 --scheduler 'ilp' --scheduler-solver-path '/home/uds_alma015/.conda/envs/abfe/bin' --default-resources 'mem_mb=max(2*input.size_mb, 1000)' 'disk_mb=max(2*input.size_mb, 1000)' 'tmpdir=system_tmpdir' --mode 2
And this is the command built by snakemake. It looks like that command somehow does not interact with the SBATCH definitions. But I am still not sure.
Solution
I had a similar issue with Snakemake, but with a different program. In the end I had to explicitly unset some environment variables, because Snakemake sets them for each instantiated shell. In particular, OMP_NUM_THREADS was the root cause of my problem.
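The mechanism, as I understand it: an exported OMP_NUM_THREADS in the parent environment is inherited by every shell the workflow spawns, and GROMACS then sees a conflicting OpenMP setting. A minimal sketch of the leak and the fix (plain sh, no GROMACS needed):

```shell
#!/bin/sh
# An exported OMP_NUM_THREADS leaks into every child shell, which is
# what happens inside the job scripts Snakemake generates.
export OMP_NUM_THREADS=1
sh -c 'echo "child sees OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"'

# Unsetting it before launching the program restores its own defaults;
# in the Snakefile's shell block this would go just before gmx mdrun.
unset OMP_NUM_THREADS
sh -c 'echo "child sees OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"'
```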
Make sure to compare which environment variables are set within Snakemake and in your regular script; this might give you a hint to find the culprit variable.
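One quick way to do that comparison (file names here are arbitrary): dump the environment from each context into a file, then diff the two dumps. In practice you would run one dump inside the Snakemake rule's shell block and the other inside the plain sbatch script:

```shell
#!/bin/sh
# Capture the sorted environment from each context into a file.
# (Both dumps come from the same shell here, so the diff is empty;
# across the two real contexts the diff highlights any injected
# variables such as OMP_NUM_THREADS.)
env | sort | grep -v '^_=' > env_sbatch.txt
env | sort | grep -v '^_=' > env_snakemake.txt
diff env_sbatch.txt env_snakemake.txt && echo "no difference"
rm -f env_sbatch.txt env_snakemake.txt
```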
Answered By - Chris Answer Checked By - Mary Flores (WPSolving Volunteer)