Initial analysis, comparing barcode identification methods, revealed the following info about demultiplexing on untrimmed sequences:
Using grep:
long barcodes: Found in ~12% of all reads
short barcodes: Found in ~25% of all reads
Using fastx_barcode_splitter:
long barcodes, beginning of line: Found in ~15% of all reads
long barcodes, end of line: Found in < 0.008% of all reads (yes, that is actually percentage)
short barcodes, beginning of line: Found in ~1.3% of all reads
short barcodes, end of line: Found in ~2.7% of all reads
Decided to determine what percentage of the sequences in this FASTQ file have just the beginning of the adaptor sequence (up to the 6bp barcode/index):
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
This was done to see if the numbers increased without the barcode index (i.e. see if majority of sequences are being generated from “empty” adaptors lacking barcodes).
The analysis was performed in a Jupyter (IPython) notebook and the notebook is linked, and embedded, below.
NBViewer: 20150316_LSU_OilSpill_Adapter_ID.ipynb
Results:
Using grep:
15% of the sequences match
That’s about 3% more than when the adaptor and barcode are searched as one sequence.
Using fastx_barcode_splitter:
beginning of line - 17% match
end of line - 0.06% match
The beginning of line matches are ~2% higher than when the adaptor and barcode are searched as one sequence.
Will contact Univ. of Oregon to see if they can shed any light and/or help with the demultiplexing dilemma we have here. Lots of sequence, but how did it get generated if adaptors aren’t present on all of the reads?