Jay asked me to help get his A. elegantissima (aggregating anemone) Nanopore gDNA sequencing data submitted to the NCBI Sequence Read Archive (SRA). He sent a hard drive (HDD) with all the Nanopore sequencing Fast5 files. The HDD was received on 2/2/2021. Here are the details provided in the readme file in the Ae_ONT directory.
Readme file:
samb@computer:/media/samb/SeagatePortableDrive/Ae_ONT$ cat readme.txt
This directory contains genomic data for the sea anemone Anthopleura elegantissima from multiple Oxford Nanopore MinION DNA sequencing runs, as well as a genome assembly.
Folders with the 'bar' prefix contain data for a particular barcoded sample. Each barcode folder contains the merged fastq file plus a folder with raw fast5 data. Three libraries with one aposymbiotic individual and one symbiotic individual each, both individually barcoded, were prepared using the PCR-free ONT Ligation Sequencing Kit with the Native Barcoding Expansion Kit (SQK-LSK109 and EXP-NBD103), following the manufacturer’s 1D native barcoding gDNA protocol. Each library was run on a separate FLO-MIN 106D R9.4 flow cell. Basecalling and demultiplexing was performed with ONT Albacore Sequencing Pipeline Software v. 2.3.3, and sequencing adapters were removed with Porechop v. 0.2.4 (Wick et al. 2017).
Sample info
bar01 = anemone A4 (aposymbiotic)
bar02 = anemone G2 (symbiotic)
bar03 = anemone A3 (aposymbiotic)
bar04 = anemone G4 (symbiotic)
bar05 = anemone A1 (aposymbiotic)
bar06 = anemone G3 (symbiotic)
The 'Genome_data' folder contains data specific to the genome assembly, which was generated using exclusively aposymbiotic anemone samples. The file 'merged2.fq.zip' is a fastq file containing all reads from aposymbiotic anemones used for the genome assembly. However, in addition to containing data from the Ligation sequencing libraries described above, this file contains additional reads from two flow cell runs of another aposymbiotic sample prepared using the PCR-free, transposase-based ONT Rapid Sequencing Kit (SQK-RAD004) following manufacturer guidelines. Sequencing was done on two FLO-MIN 106D R9.4 flow cells.
Within the 'Genome_data' folder, the folder 'wtdbg2_genome' contains the genome assembly generated by the software program wtdbg2. The wtdbg2-generated draft genome comprises 243 Mb, including 5359 contigs with an N50 of 87 kb and N90 of 19.2 kb. All aposymbiotic sequences were used to generate the draft genome, providing a total of 5.6 Gb and an estimated coverage of 23x.
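As a quick sanity check on the readme's numbers, the stated 23x coverage is consistent with dividing the total sequenced bases (5.6 Gb) by the draft assembly size (243 Mb). The readme does not say how coverage was estimated, so treating it as bases divided by assembly size is an assumption on my part, and `estimate_coverage` is just an illustrative helper name.

```shell
# Illustrative helper (not from the readme): coverage = total bases / assembly size.
# Assumes "estimated coverage" was computed against the 243 Mb draft assembly.
estimate_coverage() {
  awk -v bases="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bases / size }'
}

# 5.6 Gb of aposymbiotic reads over a 243 Mb assembly.
estimate_coverage 5600000000 243000000
```

This prints a value of roughly 23, matching the figure in the readme.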
Due to a few factors (no write permissions on the HDD, insufficient space on the local HDD), it took me a while to get this data processed. I ended up rsync-ing just the Fast5 files to dedicated directories on Gannet. After that, I compressed them into gzipped tarballs (tar.gz), per the requirements for submitting ONT Fast5 files to the NCBI Sequence Read Archive (SRA). I generated checksums for these gzipped tarballs and then rsync’ed them to our Nightingales sequencing repository on Owl. Additionally, the file transfers themselves took quite some time, as they constitute a large amount of data (>300GB).
rsync command:
samb@computer:/media/samb/SeagatePortableDrive/Ae_ONT$ time for dir in bar0*
do
  cd "${dir}" || exit
  if [ "${dir}" = "bar01" ]; then
    # Print dividing line
    printf '=%.0s' {1..50}
    echo ""
    echo "Syncing ${dir}."
    # Run rsync to only sync Fast5 files.
    # Utilizes a "named pipe" to send output of find as a files list for rsync to use
    # The command will only copy files and will not replicate directory structure.
    rsync -0t --no-r --no-R --no-dirs --files-from=<(find . -type f -name "*.fast5" -print0) "$PWD"/ "gannet:/volume2/web/tmp/aele_A4_aposymb/"
    echo "Finished syncing ${dir}."
    echo ""
  elif [ "${dir}" = "bar02" ]; then
    # Print dividing line
    printf '=%.0s' {1..50}
    echo ""
    echo "Syncing ${dir}"
    # Run rsync to only sync Fast5 files.
    # Utilizes a "named pipe" to send output of find as a files list for rsync to use
    # The command will only copy files and will not replicate directory structure.
    rsync -0t --no-r --no-R --no-dirs --files-from=<(find . -type f -name "*.fast5" -print0) "$PWD"/ "gannet:/volume2/web/tmp/aele_G2_symb/"
    echo "Finished syncing ${dir}."
    echo ""
  elif [ "${dir}" = "bar03" ]; then
    # Print dividing line
    printf '=%.0s' {1..50}
    echo ""
    echo "Syncing ${dir}"
    # Run rsync to only sync Fast5 files.
    # Utilizes a "named pipe" to send output of find as a files list for rsync to use
    # The command will only copy files and will not replicate directory structure.
    rsync -0t --no-r --no-R --no-dirs --files-from=<(find . -type f -name "*.fast5" -print0) "$PWD"/ "gannet:/volume2/web/tmp/aele_A3_aposymb/"
    echo "Finished syncing ${dir}."
    echo ""
  elif [ "${dir}" = "bar04" ]; then
    # Print dividing line
    printf '=%.0s' {1..50}
    echo ""
    echo "Syncing ${dir}"
    # Run rsync to only sync Fast5 files.
    # Utilizes a "named pipe" to send output of find as a files list for rsync to use
    # The command will only copy files and will not replicate directory structure.
    rsync -0t --no-r --no-R --no-dirs --files-from=<(find . -type f -name "*.fast5" -print0) "$PWD"/ "gannet:/volume2/web/tmp/aele_G4_symb/"
    echo "Finished syncing ${dir}."
    echo ""
  elif [ "${dir}" = "bar05" ]; then
    # Print dividing line
    printf '=%.0s' {1..50}
    echo ""
    echo "Syncing ${dir}"
    # Run rsync to only sync Fast5 files.
    # Utilizes a "named pipe" to send output of find as a files list for rsync to use
    # The command will only copy files and will not replicate directory structure.
    rsync -0t --no-r --no-R --no-dirs --files-from=<(find . -type f -name "*.fast5" -print0) "$PWD"/ "gannet:/volume2/web/tmp/aele_A1_aposymb/"
    echo "Finished syncing ${dir}."
    echo ""
  elif [ "${dir}" = "bar06" ]; then
    # Print dividing line
    printf '=%.0s' {1..50}
    echo ""
    echo "Syncing ${dir}"
    # Run rsync to only sync Fast5 files.
    # Utilizes a "named pipe" to send output of find as a files list for rsync to use
    # The command will only copy files and will not replicate directory structure.
    rsync -0t --no-r --no-R --no-dirs --files-from=<(find . -type f -name "*.fast5" -print0) "$PWD"/ "gannet:/volume2/web/tmp/aele_G3_symb/"
    echo "Finished syncing ${dir}."
    echo ""
  fi
  cd "${ext_hdd}" || exit
done
==================================================
Syncing bar01.
Finished syncing bar01.
==================================================
Syncing bar02
Finished syncing bar02.
==================================================
Syncing bar03
Finished syncing bar03.
==================================================
Syncing bar04
Finished syncing bar04.
==================================================
Syncing bar05
Finished syncing bar05.
==================================================
Syncing bar06
Finished syncing bar06.
real 885m51.779s
user 24m10.217s
sys 78m51.111s
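For future runs, the six near-identical branches could probably be collapsed into a lookup plus one rsync call. The following is only a sketch of that idea, not what I actually ran: `dest_for`, `count_duplicate_fast5_names`, and `sync_fast5` are hypothetical helper names, and the file list is piped to rsync on stdin (`--files-from=-`) instead of being passed via `<(...)`. Since the sync flattens the directory structure, the sketch also includes a check for Fast5 files that share a basename, which would otherwise end up on the same destination path.

```shell
# Look up the destination directory name for a barcode directory.
# Sample IDs and symbiotic states are taken from the readme.
dest_for() {
  case "$1" in
    bar01) echo "aele_A4_aposymb" ;;
    bar02) echo "aele_G2_symb" ;;
    bar03) echo "aele_A3_aposymb" ;;
    bar04) echo "aele_G4_symb" ;;
    bar05) echo "aele_A1_aposymb" ;;
    bar06) echo "aele_G3_symb" ;;
    *) return 1 ;;
  esac
}

# Count Fast5 basenames that occur more than once under a directory.
# A non-zero count means the flattened copy would have name collisions.
count_duplicate_fast5_names() {
  find "$1" -type f -name "*.fast5" | sed 's|.*/||' | sort | uniq -d | wc -l | tr -d ' '
}

# Sync only the Fast5 files from one barcode directory.
# $1 = barcode directory, $2 = remote base, e.g. gannet:/volume2/web/tmp
# The body runs in a subshell, so the cd does not affect the caller.
sync_fast5() (
  dest="$(dest_for "$1")" || { echo "No destination defined for $1" >&2; exit 1; }
  cd "$1" || exit 1
  echo "Syncing $1"
  # NUL-delimited file list from find is read by rsync on stdin.
  find . -type f -name "*.fast5" -print0 \
    | rsync -0t --no-r --no-R --no-dirs --files-from=- ./ "$2/${dest}/" \
    || exit 1
  echo "Finished syncing $1."
)
```

Usage would then look something like `for dir in bar0*; do sync_fast5 "${dir}" "gannet:/volume2/web/tmp"; done`, after confirming that `count_duplicate_fast5_names "${dir}"` reports 0 for each barcode directory.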
Files were compressed into gzipped tarballs, MD5 checksums were generated, the tarballs were rsync’d to Owl, and the checksums were verified.
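The exact commands for this step are not recorded in this entry. A minimal sketch of the compress-and-checksum part, assuming GNU `tar` and `md5sum` are available and that it is run from the directory containing the per-sample Fast5 directories, might look like the following. `make_tarball` and `verify_tarball` are hypothetical helper names, and the transfer to Owl is left out because the destination path is not given here.

```shell
# Tar and gzip one directory of Fast5 files, then write an MD5 checksum
# file next to the tarball. $1 = sample directory (e.g. aele_A4_aposymb)
make_tarball() {
  tar -czf "$1.tar.gz" "$1" || return 1
  md5sum "$1.tar.gz" > "$1.tar.gz.md5"
}

# Re-compute the checksum and compare it to the recorded one.
# Run this in the directory that holds both the tarball and its .md5 file
# (for example, on the destination after the transfer).
# $1 = tarball name (e.g. aele_A4_aposymb.tar.gz)
verify_tarball() {
  md5sum -c "$1.md5"
}
```

For example, `make_tarball aele_A4_aposymb` followed by `verify_tarball aele_A4_aposymb.tar.gz` should report the tarball as OK if it is intact.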
I updated our Nightingales Google Sheet.