Friday, October 7, 2022

[SOLVED] Is there a way to introduce a temporary wildcard in Snakemake from a list?

Issue

I recently started using Snakemake. I am building a pipeline where the goal is to separate fractions of reads (based on their taxID by classification by Kraken2). I am having problems understanding how I may introduce a new wildcard based on a list of taxIDs.

I have:

  1. A list of taxIDs that I want to examine further.
  2. A script to extract reads from my original .fastq file based on taxID and the Kraken2 classification

The code displayed below is a subset of the pipeline but captures the problem. The R-script extract_microbial_taxids.R generates a file with each taxID in a separate line. The function SplitGenusList() splits this file into an array with each element being a taxID.

I would like to split an input .fastq-file and introduce taxID as a new wildcard - likely based on the taxID list from the R-script. Any clues as to how I may achieve this? The wildcard will then be used for a couple of rules, before only the {sample} wildcard is relevant again.

rule all:
    input:
        expand("data/microbial_taxids/{sample}/genus_list.txt", sample = config["SAMPLE"])

rule extract_microbial_taxids:
    input:
        "data/kraken2_classification/{sample}/{sample}.report",
        config["Rlibpath"]
    output:
        "data/microbial_taxids/{sample}/genus_list.txt"
    script:
        "code/extract_microbial_taxids.R"

    
def SplitGenusList(genus_list):
        genus_list = open(genus_list)
        contents = genus_list.read()
        contents_split = contents.splitlines()
        return contents_split    
    
rule extract_microbial_reads:
        input:
            taxID_list = "data/microbial_taxids/{sample}/genus_list.txt",
            kraken_file = "data/kraken2_classification/{sample}/{sample}.kraken",
            reads = config["reads"]+"{sample}.fastq"
        output:
            taxID_reads = expand("data/microbial_reads/{sample}/{taxID}.fastq", taxID = SplitGenusList(input.taxID_list))
        shell:
            """
            for taxID in genus_list
            do
            python code/extract_kraken_reads.py --kraken {input.kraken_file} -s {input.reads} -o "data/microbial_reads/{sample}/$taxID\_.fastq" -t $taxID --include-children
            done
            """

UPDATE

After a lot of help from @Oliver and @dariober, I managed to construct a workflow that does as intended. I learned a few things from this:

1: Rule variables (such as {input.taxID_list}) should be written differently for python, input.taxID_list, and shell {input.taxID_list}.

2: the run command allows to combine python and shell (with the shell()function suggested by @dariober) in a way that seems very handy.

Here is the working code:

rule all:
    input:
        expand("data/microbial_taxids/{sample}/genus_list.txt", sample = config["SAMPLE"]),
     expand("data/microbial_reads/{sample}/extracting_microbial_reads.done", sample = config["SAMPLE"])


rule extract_microbial_taxids:
    input:
        "data/kraken2_classification/{sample}/{sample}.report",
        config["Rlibpath"]
    output:
        "data/microbial_taxids/{sample}/genus_list.txt"
    script:
        "code/extract_microbial_taxids.R"

def SplitGenusList(genus_list):
    genus_list = open(genus_list)
    contents = genus_list.read()
    contents_split = contents.splitlines()
    return contents_split

rule extract_microbial_reads:
    input:
        taxID_list="data/microbial_taxids/{sample}/genus_list.txt",
        kraken_file="data/kraken2_classification/{sample}/{sample}.kraken",
        reads=config["reads"] + "{sample}.fastq",
        report_file="data/kraken2_classification/{sample}/{sample}.report"
    output:
        touch("data/microbial_reads/{sample}/extracting_microbial_reads.done")
    run:
        import os
        genus_list = SplitGenusList(input.taxID_list)
        for taxID in genus_list:
            shell('python code/extract_kraken_reads.py -k {input.kraken_file} -s {input.reads} -o "data/microbial_reads/{wildcards.sample}/' + taxID + '_.fastq" -t ' + taxID + ' -r {input.report_file} --include-children') 

    

Solution

I am not sure whether I understood you correctly, but essentially your just looking for a way to reference in your rule the variable genus_list, which is supposed to hold the taxIDs or interest, right?

Well, I think one major problem is here, that Snakemake needs to be able to derive concrete outputs in order to create the DAG, since it determines the order of rule executions. So even though you place taxID_reads = expand(...) in the output, still Snakemake will try to figure out the actual results for that. Hence, the function SplitGenusList is called, where then an error is produced due to non-existence of the file genus_list.

Thus, I would propose not to call this function in the input or output, but simply in the code block. In order to document a successful execution of the given rule, you can simply create a dummy-file.

My suggestion:

rule all:
    input:
        expand("data/microbial_reads/{sample}/extracting_microbial_reads.done", sample=config["SAMPLE"])

rule extract_microbial_taxids:
    input:
        "data/kraken2_classification/{sample}/{sample}.report",
        config["Rlibpath"]
    output:
        "data/microbial_taxids/{sample}/genus_list.txt"
    script:
        "code/extract_microbial_taxids.R"


def SplitGenusList(genus_list):
    genus_list = open(genus_list)
    contents = genus_list.read()
    contents_split = contents.splitlines()
    return contents_split


rule extract_microbial_reads:
    input:
        taxID_list="data/microbial_taxids/{sample}/genus_list.txt",
        kraken_file="data/kraken2_classification/{sample}/{sample}.kraken",
        reads=config["reads"] + "{sample}.fastq"
    output:
        touch("data/microbial_reads/{sample}/extracting_microbial_reads.done")
    run:
        import os
        genus_list = SplitGenusList("data/microbial_taxids/" + wildcards.sample + "/genus_list.txt")
        for taxID in genus_list:
            os.system('python code/extract_kraken_reads.py --kraken ' + input.kraken_file + ' -s ' + input.reads +
                      ' -o "data/microbial_reads/' + wildcards.sample + '/' + taxID + '_.fastq" ' +
                      '-t ' + taxID + ' --include-children')


Answered By - Oliver Küchler
Answer Checked By - Gilberto Lyons (WPSolving Admin)