I have a list of samples going through Snakemake. When I arrive at my fastqc step, I suddenly have two files per sample (an R1 and R2 file). Consider the following rule:
rule fastqc:
input:
os.path.join(fastq_dir, '{sample}_R1_001.fastq.gz'),
os.path.join(fastq_dir, '{sample}_R2_001.fastq.gz')
output:
os.path.join(fastq_dir, '{sample}_R1_fastq.html'),
os.path.join(fastq_dir, '{sample}_R2_fastq.html')
conda:
"../envs/fastqc.yaml"
shell:
'''
#!/bin/bash
fastqc {input} --outdir={fastqc_dir}
'''
This does not work. I also tried the following:
rule fastqc:
input:
expand([os.path.join(fastq_dir, '{sample}_R{read}_001.fastq.gz')], read=['1', '2']
output:
expand([os.path.join(fastq_dir, '{sample}_R{read}_fastq.html')], read=['1', '2']
conda:
"../envs/fastqc.yaml"
shell:
'''
#!/bin/bash
fastqc {input} --outdir={fastqc_dir}
'''
Which also does not work, I get:
No values given for wildcard 'sample'.
Then I tried:
rule fastqc:
input:
expand([os.path.join(fastq_dir, '{sample}_R{read}_001.fastq.gz')], read=['1', '2'], sample=samples['samples'])
output:
expand([os.path.join(fastqc_dir, '{sample}_R{read}_fastqc.html')], read=['1', '2'], sample=samples['samples'])
conda:
"../envs/fastqc.yaml"
shell:
'''
#!/bin/bash
fastqc {input} --outdir={fastqc_dir}
'''
But this feeds all fastq files there are into one shell script, it seems.
How should I properly "loop" over multiple inputs for 1 sample?
Highest regards.
Edit:
My rule all looks like this, probably I should also change that, right (see the last 2 lines for fastqc)?
# Rule all is a pseudo-rule that tells snakemake what final files to generate.
rule all:
input:
expand([os.path.join(analyzed_dir, '{sample}.genes.results'),
os.path.join(rseqc_dir, '{sample}.bam_stat.txt'),
os.path.join(rseqc_dir, '{sample}.clipping_profile.xls'),
os.path.join(rseqc_dir, '{sample}.deletion_profile.txt'),
os.path.join(rseqc_dir, '{sample}.infer_experiment.txt'),
os.path.join(rseqc_dir, '{sample}.geneBodyCoverage.txt'),
os.path.join(rseqc_dir, '{sample}.inner_distance.txt'),
os.path.join(rseqc_dir, '{sample}.insertion_profile.xls'),
os.path.join(rseqc_dir, '{sample}.junction.xls'),
os.path.join(rseqc_dir, '{sample}.junctionSaturation_plot.r'),
os.path.join(rseqc_dir, '{sample}.mismatch_profile.xls'),
os.path.join(rseqc_dir, '{sample}.read_distribution.txt'),
os.path.join(rseqc_dir, '{sample}.pos.DupRate.xls'),
os.path.join(rseqc_dir, '{sample}.seq.DupRate.xls'),
os.path.join(rseqc_dir, '{sample}.GC.xls'),
os.path.join(rseqc_dir, '{sample}.NVC.xls'),
os.path.join(rseqc_dir, '{sample}.qual.r'),
os.path.join(rseqc_dir, '{sample}.RNA_fragment_size.txt'),
os.path.join(rseqc_dir, '{sample}.STAR.genome.sorted.summary.txt'),
os.path.join(fastq_dir, '{sample}_R1_fastq.html'),
os.path.join(fastq_dir, '{sample}_R2_fastq.html')],
sample=samples['samples'])
Yes, this one I figured out by "myself". The magic is in the "rule all" part.
This combination of rules works:
reads = ['1', '2']
# Rule all is a pseudo-rule that tells snakemake what final files to generate.
rule all:
input:
expand([os.path.join(analyzed_dir, '{sample}.genes.results'),
os.path.join(rseqc_dir, '{sample}.bam_stat.txt'),
os.path.join(rseqc_dir, '{sample}.clipping_profile.xls'),
os.path.join(rseqc_dir, '{sample}.deletion_profile.txt'),
os.path.join(rseqc_dir, '{sample}.infer_experiment.txt'),
os.path.join(rseqc_dir, '{sample}.geneBodyCoverage.txt'),
os.path.join(rseqc_dir, '{sample}.inner_distance.txt'),
os.path.join(rseqc_dir, '{sample}.insertion_profile.xls'),
os.path.join(rseqc_dir, '{sample}.junction.xls'),
os.path.join(rseqc_dir, '{sample}.junctionSaturation_plot.r'),
os.path.join(rseqc_dir, '{sample}.mismatch_profile.xls'),
os.path.join(rseqc_dir, '{sample}.read_distribution.txt'),
os.path.join(rseqc_dir, '{sample}.pos.DupRate.xls'),
os.path.join(rseqc_dir, '{sample}.seq.DupRate.xls'),
os.path.join(rseqc_dir, '{sample}.GC.xls'),
os.path.join(rseqc_dir, '{sample}.NVC.xls'),
os.path.join(rseqc_dir, '{sample}.qual.r'),
os.path.join(rseqc_dir, '{sample}.RNA_fragment_size.txt'),
os.path.join(rseqc_dir, '{sample}.STAR.genome.sorted.summary.txt'),
os.path.join(fastqc_dir, '{sample}_R{read}_001_fastqc.html')],
sample=samples['samples'], read=reads)
Notice the simple addition of {read} to the otherwise the same fastqc part and the definition or "reads" at the top (samples is a standard samples list).
The I use this fastqc rule:
rule fastqc:
input:
os.path.join(fastq_dir, '{sample}_R{read}_001.fastq.gz')
output:
os.path.join(fastqc_dir, '{sample}_R{read}_001_fastqc.html')
conda:
"../envs/fastqc.yaml"
shell:
'''
#!/bin/bash
fastqc {input} --outdir={fastqc_dir}
'''
It has the same line as the "rule all" (as usual). This works, thanx for the upvotes, Freek out.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With