Data Received - Olympia oyster PacBio Data

Back in December 2016, we sent off Ostrea lurida DNA to the UW PacBio sequencing facility. This is an attempt to fill in the gaps left from the BGI genome sequencing project.

See the GitHub Wiki dedicated to this for an overview of this UW PacBio sequencing.

I downloaded the data to https://owl.fish.washington.edu/nightingales/O_lurida/20170323_pacbio/ using the required browser plugin, Aspera Connect. Technically, saving the data to a subfolder within a given species’ data folder goes against our data management plan (DMP) for high-throughput sequencing data, but the sequencing data output is far different than what we normally receive from an Illumina sequencing run. Instead of a just FASTQ files, we received the following from each PacBio SMRT cell we had run (we had 10 SMRT cells run):

├── Analysis_Results
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.bax.h5
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.bax.h5
│   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.bax.h5
│   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.bas.h5
├── filter
│   ├── data
│   │   ├── control_reads.cmp.h5
│   │   ├── control_results_by_movie.csv
│   │   ├── data.items.json
│   │   ├── data.items.pickle
│   │   ├── filtered_regions
│   │   │   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.rgn.h5
│   │   │   ├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.rgn.h5
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.rgn.h5
│   │   ├── filtered_regions.fofn
│   │   ├── filtered_subread_summary.csv
│   │   ├── filtered_subreads.fasta
│   │   ├── filtered_subreads.fastq
│   │   ├── filtered_summary.csv
│   │   ├── nocontrol_filtered_subreads.fasta
│   │   ├── post_control_regions.chunk001of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.1.rgn.h5
│   │   ├── post_control_regions.chunk002of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.3.rgn.h5
│   │   ├── post_control_regions.chunk003of003
│   │   │   └── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.2.rgn.h5
│   │   ├── post_control_regions.fofn
│   │   └── slots.pickle
│   ├── index.html
│   ├── input.fofn
│   ├── input.xml
│   ├── log
│   │   ├── P_Control
│   │   │   ├── align.cmpH5.Gather.log
│   │   │   ├── align.plsFofn.Scatter.log
│   │   │   ├── align_001of003.log
│   │   │   ├── align_002of003.log
│   │   │   ├── align_003of003.log
│   │   │   ├── noControlSubreads.log
│   │   │   ├── summaryCSV.log
│   │   │   ├── updateRgn.noCtrlFofn.Gather.log
│   │   │   ├── updateRgn_001of003.log
│   │   │   ├── updateRgn_002of003.log
│   │   │   └── updateRgn_003of003.log
│   │   ├── P_ControlReports
│   │   │   └── statsJsonReport.log
│   │   ├── P_Fetch
│   │   │   ├── adapterRpt.log
│   │   │   ├── overviewRpt.log
│   │   │   └── toFofn.log
│   │   ├── P_Filter
│   │   │   ├── filter.rgnFofn.Gather.log
│   │   │   ├── filter.summary.Gather.log
│   │   │   ├── filter_001of003.log
│   │   │   ├── filter_002of003.log
│   │   │   ├── filter_003of003.log
│   │   │   ├── subreadSummary.log
│   │   │   ├── subreads.subreadFastq.Gather.log
│   │   │   ├── subreads.subreads.Gather.log
│   │   │   ├── subreads_001of003.log
│   │   │   ├── subreads_002of003.log
│   │   │   └── subreads_003of003.log
│   │   ├── P_FilterReports
│   │   │   ├── loadingRpt.log
│   │   │   ├── statsRpt.log
│   │   │   └── subreadRpt.log
│   │   ├── master.log
│   │   └── smrtpipe.log
│   ├── metadata.rdf
│   ├── results
│   │   ├── adapter_observed_insert_length_distribution.png
│   │   ├── adapter_observed_insert_length_distribution_thumb.png
│   │   ├── control_non-control_readlength.png
│   │   ├── control_non-control_readlength_thumb.png
│   │   ├── control_non-control_readquality.png
│   │   ├── control_non-control_readquality_thumb.png
│   │   ├── control_report.html
│   │   ├── control_report.json
│   │   ├── filter_reports_adapters.html
│   │   ├── filter_reports_adapters.json
│   │   ├── filter_reports_filter_stats.html
│   │   ├── filter_reports_filter_stats.json
│   │   ├── filter_reports_filter_subread_stats.html
│   │   ├── filter_reports_filter_subread_stats.json
│   │   ├── filter_reports_loading.html
│   │   ├── filter_reports_loading.json
│   │   ├── filtered_subread_report.png
│   │   ├── filtered_subread_report_thmb.png
│   │   ├── overview.html
│   │   ├── overview.json
│   │   ├── post_filter_readlength_histogram.png
│   │   ├── post_filter_readlength_histogram_thumb.png
│   │   ├── post_filterread_score_histogram.png
│   │   ├── post_filterread_score_histogram_thumb.png
│   │   ├── pre_filter_readlength_histogram.png
│   │   ├── pre_filter_readlength_histogram_thumb.png
│   │   ├── pre_filterread_score_histogram.png
│   │   └── pre_filterread_score_histogram_thumb.png
│   ├── toc.xml
│   └── workflow
│       ├── P_Control
│       │   ├── align.cmpH5.Gather.sh
│       │   ├── align.plsFofn.Scatter.sh
│       │   ├── align_001of003.sh
│       │   ├── align_002of003.sh
│       │   ├── align_003of003.sh
│       │   ├── noControlSubreads.sh
│       │   ├── summaryCSV.sh
│       │   ├── updateRgn.noCtrlFofn.Gather.sh
│       │   ├── updateRgn_001of003.sh
│       │   ├── updateRgn_002of003.sh
│       │   └── updateRgn_003of003.sh
│       ├── P_ControlReports
│       │   └── statsJsonReport.sh
│       ├── P_Fetch
│       │   ├── adapterRpt.sh
│       │   ├── overviewRpt.sh
│       │   └── toFofn.sh
│       ├── P_Filter
│       │   ├── filter.rgnFofn.Gather.sh
│       │   ├── filter.summary.Gather.sh
│       │   ├── filter_001of003.sh
│       │   ├── filter_002of003.sh
│       │   ├── filter_003of003.sh
│       │   ├── subreadSummary.sh
│       │   ├── subreads.subreadFastq.Gather.sh
│       │   ├── subreads.subreads.Gather.sh
│       │   ├── subreads_001of003.sh
│       │   ├── subreads_002of003.sh
│       │   └── subreads_003of003.sh
│       ├── P_FilterReports
│       │   ├── loadingRpt.sh
│       │   ├── statsRpt.sh
│       │   └── subreadRpt.sh
│       ├── Workflow.details.dot
│       ├── Workflow.details.html
│       ├── Workflow.details.svg
│       ├── Workflow.profile.html
│       ├── Workflow.rdf
│       ├── Workflow.summary.dot
│       ├── Workflow.summary.html
│       └── Workflow.summary.svg
├── filtered_subreads.fasta.gz
├── filtered_subreads.fastq.gz
├── m170211_224036_42134_c101073082550000001823236402101737_s1_X0.metadata.xml
└── nocontrol_filtered_subreads.fasta.gz

That’s 20 directories and 127 files - for a single SMRT cell!

Granted, there is the familiar FASTQ file (filtered_subreads.fastq), which is what will likely be used for downstream analysis, but it’s hard to make a decision on how we manage this data under the guidelines of our current DMP. It’s possible we might separate data files from the numerous other files (the other files are, essentially, metadata), but we need to decide which file type(s) (e.g. .h5 files, .fastq files) will server as the data files people will rely on for analysis. So, for the time being, this will be how the data is stored.

I’ll update the readme file to reflect the addition of the top level folders (e.g. ../20170323_pacbio/170210_PCB-CC_MS_EEE_20kb_P6v2_D01_1/).

I’ll also update the GitHub Wiki