20220727
- Ran qPCRs using 8 primer sets on Dorothy’s cDNA from yesterday.
20220726
- In lab to finish Dorothy’s mussel RNA extraction and reverse transcription.
20220725
- Read Ch. 7 of “Fresh Banana Leaves” for book club.
- Lab meeting discussing Ch. 7 of “Fresh Banana Leaves.”
20220721
- More testing with `EpiDiverse/wgbs` and `EpiDiverse/snp` pipelines (conda environments). This time ran trimming with an additional 5’ 10bp hard trim, per Bismark recommendations (they actually recommend 8bp).
- Did a lot of reading about Nextflow and trying to get a grasp on how it all works, so we might be able to develop our own pipelines and/or feel confident modifying existing pipelines (e.g. modify the `EpiDiverse/snp` pipeline to analyze all BAMs in a directory without the need to explicitly declare how many at run time). A rough sketch of that kind of change follows.
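To make that concrete, here's a minimal, hypothetical Nextflow sketch of the kind of change meant above. This is not the pipeline's actual code; `params.input` and the channel name are illustrative:

```
// Hypothetical sketch (not the actual EpiDiverse/snp code): build the input
// channel from a glob so every BAM in params.input is picked up automatically,
// with no need to declare the number of files at run time.
Channel
    .fromPath("${params.input}/*.bam")
    .ifEmpty { error "No BAM files found in: ${params.input}" }
    .map { bam -> tuple(bam.baseName, bam) }
    .set { bam_inputs }
```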
20220720
- More testing with the `EpiDiverse/wgbs` pipeline. Per this GitHub Issue, I tried using the conda test profile and things ran smoothly without any errors in the `bam_statistics` part of the pipeline. Steps:
  - Create conda environment: `conda env create --name epidiverse-wgbs-current -f /home/shared/epidiverse-pipelines/wgbs-current`
  - Activate conda environment: `conda activate epidiverse-wgbs-current`
  - Run conda test profile: `NXF_VER=20.07.1 /home/shared/nextflow run /home/shared/epidiverse-pipelines/wgbs-current -profile test,conda`
- Did the same conda test process for `EpiDiverse/snp` and it also ran without issue.
- Started adding instructions for using Nextflow EpiDiverse pipelines on Raven and Mox to the Roberts Lab Handbook.
- Decided to re-run the `EpiDiverse/wgbs` pipeline using the conda environment and enable the FastQC process for post-trimming assessment. This will allow me to compare with our previous trimming results and try to determine why our trimming generates different results when passing our FastQs into the `EpiDiverse/snp` pipeline. A hedged command sketch follows this list.
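For the record, a re-run along these lines would look something like the sketch below. This is hypothetical: it assumes the pipeline exposes a `--fastqc` flag to enable its FastQC process (worth confirming against the pipeline's help output), and the input/reference paths are illustrative placeholders reusing paths from other entries:

```
# Hypothetical sketch of the planned re-run; the --fastqc flag is an assumption
# (confirm against the pipeline's help output) and paths are illustrative.
conda activate epidiverse-wgbs-current

NXF_VER=20.07.1 /home/shared/nextflow run /home/shared/epidiverse-pipelines/wgbs-current \
-profile conda \
--fastqc \
--input /home/shared/8TB_HDD_01/sam/data/O_lurida/BSseq/ \
--reference /home/shared/8TB_HDD_01/sam/data/O_lurida/genomes/Olurida_v081.fa
```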
20220719
- Assisted Dorothy with prep for big RNA extraction from mussel gills. Homogenized 10 control and 10 heat-treated gill samples in 1mL RNAzol RT and stored @ -80C. Extraction to happen next Tuesday.
20220718
- Ran some more tests of the `EpiDiverse/wgbs` and `EpiDiverse/snp` pipelines to compare the Oly trimmed and untrimmed FastQs as inputs.
20220717
- Ran some more tests of the `EpiDiverse/wgbs` and `EpiDiverse/snp` pipelines to compare the Oly trimmed and untrimmed FastQs as inputs.
20220716
- Ran some more tests of the `EpiDiverse/wgbs` and `EpiDiverse/snp` pipelines to compare the Oly trimmed and untrimmed FastQs as inputs.
20220715
- Managed to resolve the `EpiDiverse/wgbs` test issue I encountered yesterday. See the link for details (involves binding a local directory to the Singularity container); a hedged config sketch follows this list.
- Successfully ran the `EpiDiverse/wgbs` pipeline on a subset of the Ostrea lurida BSseq data! Now to test this in the `EpiDiverse/snp` pipeline to see if the output is the same as when running Bismark BAMs through the `EpiDiverse/snp` pipeline…
- Tested Bismark BAMs from a different data set (Panopea generosa: https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/1231/) in the `EpiDiverse/snp` pipeline to see how the results compare to the Ostrea lurida Bismark BAMs.
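For reference, a minimal sketch of that kind of Singularity bind in a Nextflow config, assuming the data live under `/home/shared` (the path is illustrative; see the linked entry for the actual fix):

```
// Minimal sketch: bind a host directory into the Singularity container so the
// pipeline can read local data. The path is illustrative.
singularity {
    enabled    = true
    runOptions = "--bind /home/shared"
}
```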
20220714
- Installed Singularity on Raven (required Go, so installed that, too); a rough from-source install sketch follows this list.
- Tried to run the `EpiDiverse/wgbs` test run on Raven, but it failed. See the GitHub Issue I posted for deets.
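For posterity, a from-source SingularityCE install generally follows the pattern below; the version number is an assumption, not necessarily what went onto Raven:

```
# Illustrative from-source SingularityCE install (version is an assumption,
# not necessarily the one installed on Raven). Requires Go.
export VERSION=3.10.0
wget https://github.com/sylabs/singularity/releases/download/v${VERSION}/singularity-ce-${VERSION}.tar.gz
tar -xzf singularity-ce-${VERSION}.tar.gz
cd singularity-ce-${VERSION}

./mconfig                      # configure the build
make -C builddir               # compile
sudo make -C builddir install  # install system-wide
```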
20220713
- Got a response to my `EpiDiverse/snp` GitHub issue regarding the pipeline not processing all BAM files in a directory! Will re-run the pipeline and add the `--take <INTEGER>` parameter to handle all 18 BAM files.
- Fixed an issue with the CEABIGR transcript count stats. The original code assumed that calculations were only performed on the input data, and did not process newly added data generated by a `mutate()` call. Fixed that by using the `-any_of()` function to skip a vector of column names. The `-` preceding `any_of()` is the instruction to exclude the columns entered in the `any_of()` function. A small example follows this list.
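To illustrate the pattern, here's a minimal, self-contained R sketch; the objects and column names are made up, not the actual CEABIGR code:

```
# Hypothetical sketch of the -any_of() exclusion pattern (not the actual
# CEABIGR code). Selecting by exclusion means columns added later by mutate()
# are still included in downstream calculations.
library(dplyr)

counts <- tibble(
  gene_id  = c("g1", "g2"),
  sample_A = c(10, 5),
  sample_B = c(7, 12)
)

# Columns to exclude; any_of() won't error if a listed column is absent.
skip_cols <- c("gene_id", "gene_name")

counts %>%
  mutate(total = sample_A + sample_B) %>%       # newly added column
  summarise(across(-any_of(skip_cols), mean))   # skips only the ID columns
```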
20220712
- In lab: walked Dorothy through the basics of RNA isolation (RNAzol and Direct-zol Mini Kit) and quantification (Qubit RNA BR Assay). Processed two fresh mussel ctenidia samples.
20220711
- Lab meeting: discussed Ch. 6 of “Fresh Banana Leaves.”
- Oyster gene expression meeting: Steven did some plotting of Crassostrea virginica (Eastern oyster) transcript count data.
- Nextflow `epidiverse/snp`: opted to try a subsetted version of the Olurida_v081 genome and BSseq alignments (GitHub Issue):

Looks like it’s working (got past the previous hangups on 20220708)!
```
N E X T F L O W ~ version 20.07.1
Launching `epidiverse/snp` [sick_hoover] - revision: 9c814703c6 [master]
================================================
E P I D I V E R S E - S N P P I P E L I N E
================================================
~ version 1.0
input dir     : /home/shared/8TB_HDD_01/sam/data/O_lurida/BSseq/bams/070322-olymerge-snp
reference     : /home/shared/8TB_HDD_01/sam/data/O_lurida/genomes/Olurida_v081-mergecat98.fa
output dir    : snps
variant calls : enabled
clustering    : enabled
================================================
RUN NAME: sick_hoover
executor >  local (60)
[2c/3853d0] process > SNPS:preprocessing (zr1394_9_s456_trimmed_bismark_bt2.deduplicated) [100%] 10 of 10 ✔
[90/dea51f] process > SNPS:masking (zr1394_9_s456_trimmed_bismark_bt2.deduplicated - clustering) [100%] 20 of 20 ✔
[8a/ce2b44] process > SNPS:extracting (zr1394_9_s456_trimmed_bismark_bt2.deduplicated) [100%] 10 of 10 ✔
[01/ef9503] process > SNPS:khmer (zr1394_6_s456_trimmed_bismark_bt2.deduplicated) [ 40%] 4 of 10
[-        ] process > SNPS:kwip -
[-        ] process > SNPS:clustering -
[83/238cfa] process > SNPS:sorting (zr1394_9_s456_trimmed_bismark_bt2.deduplicated) [100%] 10 of 10 ✔
[03/1b910d] process > SNPS:freebayes (zr1394_7_s456_trimmed_bismark_bt2.deduplicated) [  0%] 0 of 10
[-        ] process > SNPS:bcftools -
[-        ] process > SNPS:plot_vcfstats -
```
Well, well, well! It worked!!
```
N E X T F L O W ~ version 20.07.1
Launching `epidiverse/snp` [sick_hoover] - revision: 9c814703c6 [master]
================================================
E P I D I V E R S E - S N P P I P E L I N E
================================================
~ version 1.0
input dir     : /home/shared/8TB_HDD_01/sam/data/O_lurida/BSseq/bams/070322-olymerge-snp
reference     : /home/shared/8TB_HDD_01/sam/data/O_lurida/genomes/Olurida_v081-mergecat98.fa
output dir    : snps
variant calls : enabled
clustering    : enabled
================================================
RUN NAME: sick_hoover
executor >  local (92)
[2c/3853d0] process > SNPS:preprocessing (zr1394_9_s456_trimmed_bismark_bt2.deduplicated) [100%] 10 of 10 ✔
[90/dea51f] process > SNPS:masking (zr1394_9_s456_trimmed_bismark_bt2.deduplicated - clustering) [100%] 20 of 20 ✔
[8a/ce2b44] process > SNPS:extracting (zr1394_9_s456_trimmed_bismark_bt2.deduplicated) [100%] 10 of 10 ✔
[a4/dc11ce] process > SNPS:khmer (zr1394_9_s456_trimmed_bismark_bt2.deduplicated) [100%] 10 of 10 ✔
[95/79be9f] process > SNPS:kwip [100%] 1 of 1 ✔
[c6/24b89b] process > SNPS:clustering [100%] 1 of 1 ✔
[83/238cfa] process > SNPS:sorting (zr1394_9_s456_trimmed_bismark_bt2.deduplicated) [100%] 10 of 10 ✔
[63/2c5bc5] process > SNPS:freebayes (zr1394_9_s456_trimmed_bismark_bt2.deduplicated) [100%] 10 of 10 ✔
[00/f4af8b] process > SNPS:bcftools (zr1394_9_s456_trimmed_bismark_bt2.deduplicated) [100%] 10 of 10 ✔
[df/1e78de] process > SNPS:plot_vcfstats (zr1394_9_s456_trimmed_bismark_bt2.deduplicated) [100%] 10 of 10 ✔

Pipeline execution summary
---------------------------
Name         : sick_hoover
Profile      : docker
Launch dir   : /home/shared/8TB_HDD_01/sam/analyses/20220707-olur-epidiverse
Work dir     : /home/shared/8TB_HDD_01/sam/analyses/20220707-olur-epidiverse/work (cleared)
Status       : success
Error report : -
Completed at : 11-Jul-2022 18:47:04
Duration     : 3h 17m 49s
CPU hours    : 140.1
Succeeded    : 92
```
I’ll obviously share this with Steven, but will also see if I can get the original dataset to run on Mox, which has twice as much memory available as Raven…
Actually, I glanced at the data and it didn’t really work. It only seems to have analyzed some of the BAMs in the directory! There are 18 BAMs, but it only processed 10 of them… Odd (and annoying). I’ll re-run to see if it does the same thing.
20220710
- Created an `erne-bs5` genome index to attempt the full EpiDiverse pipeline, starting with `EpiDiverse/wgbs`:

```
/home/shared/erne-2.1.1-linux/bin/erne-create \
--methyl-hash \
--fasta Olurida_v081.fa \
--output-prefix Olurida_v081
```
- Realized I had experienced these issues previously (see this GitHub Issue and this GitHub Issue), which is not encouraging.
- Tried to run the test protocol for this pipeline “locally” (i.e. offline) using a Mox sbatch script. To do so, I downloaded all of the input files listed in `test.config`. I also downloaded the Singularity image (`singularity pull docker://epidiverse/wgbs:1.0`) and changed the `nextflow.config` file to specify the Singularity image location, like so:
```
// -profile singularity
singularity {
    includeConfig "${baseDir}/config/base.config"
    singularity.enabled = true
    process.container = '/gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/work/singularity/wgbs_1.0.sif'
}
```
It seemed like that should be all that was needed, but when I execute the test command (`NXF_VER=20.07.1 /gscratch/srlab/programs/nextflow-21.10.6-all run /gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/wgbs-1.0 -profile test,singularity`), it fails with this error:
```
executor >  local (10)
[c4/79070c] process > INDEX:erne_bs5_indexing [100%] 1 of 1 ✔
[30/202688] process > INDEX:segemehl_indexing [100%] 1 of 1 ✔
[07/dc2230] process > WGBS:read_trimming (sampleB) [100%] 8 of 8, failed: 8...
[-        ] process > WGBS:read_merging -
[-        ] process > WGBS:fastqc -
[-        ] process > WGBS:erne_bs5 -
[-        ] process > WGBS:segemehl -
[-        ] process > WGBS:erne_bs5_processing -
[-        ] process > WGBS:segemehl_processing -
[-        ] process > WGBS:bam_merging -
[-        ] process > WGBS:bam_subsetting -
[-        ] process > WGBS:bam_filtering -
[-        ] process > WGBS:bam_statistics -
[-        ] process > CALL:bam_processing -
[-        ] process > CALL:Picard_MarkDuplicates -
[-        ] process > CALL:MethylDackel -
[-        ] process > CALL:conversion_rate_estima... -

Pipeline execution summary
---------------------------
Name         : infallible_mccarthy
Profile      : test,singularity
Launch dir   : /gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test
Work dir     : /gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/work
Status       : failed
Error report : Error executing process > 'WGBS:read_trimming (sampleA)'

Caused by:
  Process `WGBS:read_trimming (sampleA)` terminated with an error exit status (1)

Command executed:
  mkdir fastq fastq/logs
  cutadapt -j 2 -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
  -q 20 -m 36 -O 3 \
  -o fastq/merge.null \
  -p fastq/merge.g null g \
  > fastq/logs/cutadapt.sampleA.merge.log 2>&1

Command exit status:
  1

Command output:
  (empty)

Work dir:
  /gscratch/srlab/sam/analyses/20220710-olu-epidiverse_wgbs-test/work/12/6ee9cc7a7372a97f34f21a4f79efb3

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
```
When I look at the Cutadapt log file, this is what is shown:
```
cat cutadapt.sampleA.merge.log

This is cutadapt 2.10 with Python 3.6.7
Command line parameters: -j 2 -a AGATCGGAAGAGC -A AGATCGGAAGAGC -q 20 -m 36 -O 3 -o fastq/merge.null -p fastq/merge.g null g
Processing reads on 2 cores in paired-end mode ...
ERROR: Traceback (most recent call last):
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/cutadapt/pipeline.py", line 477, in run
    with xopen(self.file, 'rb') as f:
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/xopen/__init__.py", line 407, in xopen
    return open(filename, mode)
IsADirectoryError: [Errno 21] Is a directory: 'null'
ERROR: Traceback (most recent call last):
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/cutadapt/pipeline.py", line 477, in run
    with xopen(self.file, 'rb') as f:
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/xopen/__init__.py", line 407, in xopen
    return open(filename, mode)
IsADirectoryError: [Errno 21] Is a directory: 'null'
ERROR: Traceback (most recent call last):
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/cutadapt/pipeline.py", line 540, in run
    raise e
IsADirectoryError: [Errno 21] Is a directory: 'null'
Traceback (most recent call last):
  File "/opt/conda/envs/wgbs/bin/cutadapt", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/cutadapt/__main__.py", line 855, in main
    stats = r.run()
  File "/opt/conda/envs/wgbs/lib/python3.6/site-packages/cutadapt/pipeline.py", line 770, in run
    raise e
IsADirectoryError: [Errno 21] Is a directory: 'null'
```
I posted an Issue, but I’m not really expecting to get a response.
20220709
- Nextflow `epidiverse/snp`: let it run overnight and woke up to the same timeout error message and kernel memory warnings, despite the changes in the config file… Although, I had been running it with the `-resume` flag… I’ll remove all previous `snps/` and `work/` dirs and run with the following command:

```
NXF_VER=20.07.1 /home/shared/nextflow run epidiverse/snp \
-c nextflow-docker_permissions.config \
-profile docker \
--input /home/shared/8TB_HDD_01/sam/data/O_lurida/BSseq/ \
--reference /home/shared/8TB_HDD_01/sam/data/O_lurida/genomes/Olurida_v081.fa
```
- Well, that didn’t work… I’m going to try again with a limited data set, as the `Olurida_v081.fa` has a TON of contigs, which has caused issues with other bisulfite analysis software.
20220708
- Spent time at Science Hour with Steven visualizing some of the Ostrea lurida (Olympia oyster) transcript count stats in R - I mostly just watched and nodded. :)
- Nextflow `epidiverse/snp` problems. Got to deal with these shenanigans:
```
================================================
E P I D I V E R S E - S N P P I P E L I N E
================================================
~ version 1.0
input dir     : /home/shared/8TB_HDD_01/sam/data/O_lurida/BSseq
reference     : /home/shared/8TB_HDD_01/sam/data/O_lurida/genomes/Olurida_v081.fa
output dir    : snps
variant calls : enabled
clustering    : enabled
================================================
RUN NAME: nostalgic_bhabha
executor >  local (34)
[d7/390a8d] process > SNPS:preprocessing (zr1394_all_s456_trimmed_bismark_bt2.deduplicated.sorted) [100%] 10 of 10 ✔
[49/25512a] process > SNPS:masking (zr1394_11_s456_trimmed_bismark_bt2.deduplicated.sorted - variants) [ 60%] 24 of 40, failed: 24, retries: 20
[-        ] process > SNPS:extracting -
[-        ] process > SNPS:khmer -
[-        ] process > SNPS:kwip -
[-        ] process > SNPS:clustering -
[-        ] process > SNPS:sorting -
[-        ] process > SNPS:freebayes -
[-        ] process > SNPS:bcftools -
[-        ] process > SNPS:plot_vcfstats -
Error executing process > 'SNPS:masking (zr1394_17_s456_trimmed_bismark_bt2.deduplicated.sorted - clustering)'

Caused by:
  Process exceeded running time limit (4h)

Command executed:
  change_sam_queries.py -Q -T 16 -t . -G calmd.bam clustering.bam || exit $?
  find -mindepth 1 -maxdepth 1 -type d -exec rm -r {} \;

Command exit status:
  -

Command output:
  (empty)

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.

Work dir:
  /home/shared/8TB_HDD_01/sam/analyses/20220707-olur-epidiverse/work/8b/636e263c8924aa275cd5f1fc7787e8

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
```
Updated the custom config file (which was originally used solely to address permissions issues on the output files generated by the Docker container) to include improved time limits:
```
docker {
    fixOwnership = true
}

process {
    // Process-specific resource requirements
    withLabel:process_low {
        time = 30.d
    }
    withLabel:process_medium {
        time = 30.d
    }
    withLabel:process_high {
        time = 30.d
    }
    withLabel:process_long {
        time = 30.d
    }
}
```
Seemed to fix the timeout issue, as this ran for nearly 6 hrs without a timeout error, but I was still getting the kernel memory warning in the log files and constant failures at the masking step. Added this line to the config file:

```
docker.runOptions = "--memory-swap '-1'"
```
The config file now looks like this (grabbed the idea to use this line to resolve the kernel issue from here):
```
docker {
    docker.runOptions = "--memory-swap '-1'"
    fixOwnership = true
}

process {
    // Process-specific resource requirements
    withLabel:process_low {
        time = 30.d
    }
    withLabel:process_medium {
        time = 30.d
    }
    withLabel:process_high {
        time = 30.d
    }
    withLabel:process_long {
        time = 30.d
    }
}
```
20220707
- In order to start Nextflow `epidiverse/snp`, the genome FastA file requires a FastA index file; this is not documented, but the pipeline throws an error when a FastA index is not found. A quick way to generate one is sketched below.
- Started Nextflow `epidiverse/snp` on Raven.
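For reference, the standard way to generate a FastA index (`.fai`) is `samtools faidx`; a minimal sketch, reusing the genome path from earlier entries:

```
# Create the FastA index (.fai) the pipeline expects, alongside the FastA.
samtools faidx /home/shared/8TB_HDD_01/sam/data/O_lurida/genomes/Olurida_v081.fa

# Produces Olurida_v081.fa.fai in the same directory.
```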
20220706
- Fixed the Nextflow `epidiverse/snp` problem (GitHub Issue).
- Dealt with lake trout BioProject PRJNA316738 data retrieval and QC (GitHub Issue).
20220705
CEABIGR meeting:
- Added gene names to transcript counts files; a hedged sketch of this kind of join follows this list.
- Added code to cover all desired iterations of output files (i.e. the missing controls/males files).
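Purely for illustration, adding gene names to a counts table typically amounts to a join like the dplyr sketch below; the object and column names are hypothetical, not the actual CEABIGR code:

```
# Hypothetical sketch of adding gene names to a transcript counts table
# (object/column names are made up, not the actual CEABIGR code).
library(dplyr)

counts      <- tibble(transcript_id = c("t1", "t2"), count = c(100, 42))
annotations <- tibble(transcript_id = c("t1", "t2"), gene_name = c("geneA", "geneB"))

counts_named <- counts %>%
  left_join(annotations, by = "transcript_id")  # adds the gene_name column
```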