Data Received - Bisulfite-treated Illumina Sequencing from Genewiz

Received notice from Genewiz that the sequencing data was ready for the samples submitted 20151222.

Downloaded the FASTQ files from the Genewiz project directory:

<code>wget -r -np -nc -A "*.gz" ftp://username:password@ftp2.genewiz.com/Project_BS1512183</code>

Since two species were sequenced (C.gigas & O.lurida), the corresponding files are in the following locations:

https://owl.fish.washington.edu/nightingales/O_lurida/

https://owl.fish.washington.edu/nightingales/C_gigas/

In order to process the files, I needed to identify just the FASTQ files from this project and save the list of files to a bash variable called ‘bsseq’:

<code>bsseq=$(ls | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")</code>

Explanation:

<code>bsseq=</code>

This assigns the output of the command following the equals sign to a variable called “bsseq”.

<code>$(ls | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")</code>

This lists (ls) all files and pipes (|) them to grep, which keeps the names that begin (^) with a digit ([0-9]\{1\}) optionally followed by underscores (_*). Because _* also matches zero underscores, two-digit prefixes such as 10_ pass as well. Those results are piped (|) to a second grep, which excludes (-v) any names containing the text “2bRAD”.
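
To see what the pattern actually keeps, here is a minimal sketch that runs it on a few file names instead of a real directory listing. Only the first two names come from this project; the others are made up for illustration. The stricter pattern at the end is an alternative, an assumption on my part and not the command used above.

```shell
# Stand-in for the output of `ls`; the last three names are hypothetical.
names='1_ATCACG_L001_R1_001.fastq.gz
10_TAGCTT_L001_R1_001.fastq.gz
3_2bRAD_example.fastq.gz
readme.md
2015_other_project.fastq.gz'

# Pattern from this post: any name starting with a digit, minus 2bRAD files.
loose=$(printf '%s\n' "$names" | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")

# Stricter alternative (assumption, not the command used above):
# one or two digits at the start, immediately followed by an underscore.
strict=$(printf '%s\n' "$names" | grep '^[0-9]\{1,2\}_' | grep -v "2bRAD")

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern also keeps the hypothetical 2015_other_project.fastq.gz, while the stricter one keeps only the two project files, so the loose form is safe only when no other digit-prefixed files share the directory.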

FILENAME	SAMPLE NAME	SPECIES
1_ATCACG_L001_R1_001.fastq.gz	1NF11	O.lurida
2_CGATGT_L001_R1_001.fastq.gz	1NF15	O.lurida
3_TTAGGC_L001_R1_001.fastq.gz	1NF16	O.lurida
4_TGACCA_L001_R1_001.fastq.gz	1NF17	O.lurida
5_ACAGTG_L001_R1_001.fastq.gz	2NF5	O.lurida
6_GCCAAT_L001_R1_001.fastq.gz	2NF6	O.lurida
7_CAGATC_L001_R1_001.fastq.gz	2NF7	O.lurida
8_ACTTGA_L001_R1_001.fastq.gz	2NF8	O.lurida
9_GATCAG_L001_R1_001.fastq.gz	M2	C.gigas
10_TAGCTT_L001_R1_001.fastq.gz	M3	C.gigas
11_GGCTAC_L001_R1_001.fastq.gz	NF2_6	O.lurida
12_CTTGTA_L001_R1_001.fastq.gz	NF_18	O.lurida

I wanted to add some information about the project to the readme file, such as the total number of sequencing reads generated and the number of reads in each FASTQ file.

Here’s how to count the total number of reads generated in this project:

<code>totalreads=0; for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads</code>

Total reads = 138,530,448

C.gigas reads: 22,249,631

O.lurida reads: 116,280,817

Code explanation:

<code>totalreads=0;</code>

Creates variable called “totalreads” and initializes value to 0.

<code>for i in $bsseq;</code>

Initiates a for loop to process the list of files stored in the $bsseq variable. The FASTQ files have been compressed with gzip and end with the .gz extension.

<code>do linecount=</code>

Creates a variable called “linecount” that stores the result of the following command:

<code>`gunzip -c "$i" | wc -l`;</code>

Decompresses the current file ($i) to stdout (-c), leaving the compressed file on disk untouched. The output is piped to the word count command with the line flag (wc -l) to count the number of lines in the file.

<code>readcount=$((linecount/4));</code>

Divides the value stored in linecount by 4, because an entry for a single Illumina read comprises four lines. The result is stored in the “readcount” variable.

<code>totalreads=$((readcount+totalreads));</code>

Adds the readcount for the current file to the running total stored in totalreads.

<code>done;</code>

Ends the for loop.

<code>echo $totalreads</code>

Prints the value of totalreads to the screen.
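
The same counting logic can also be wrapped in a small function so it is reusable. This is only a sketch, not the command used above: `count_reads` is a hypothetical helper name, the demo files are tiny made-up FASTQ files written to a temporary directory, and it assumes every read occupies exactly four lines.

```shell
# count_reads is a hypothetical helper: prints the number of reads in one
# gzipped FASTQ file, assuming four lines per read.
count_reads() {
  local lines
  lines=$(gunzip -c "$1" | wc -l)
  echo $((lines / 4))
}

# Build two tiny made-up FASTQ files (2 reads and 1 read) so the sketch
# runs without the real data.
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > "$tmpdir/1_demo.fastq.gz"
printf '@r3\nGGCC\n+\nIIII\n' | gzip > "$tmpdir/2_demo.fastq.gz"

# Sum the per-file read counts into a running total.
totalreads=0
for f in "$tmpdir"/*.fastq.gz; do
  n=$(count_reads "$f")
  totalreads=$((totalreads + n))
done
echo "$totalreads"

rm -r "$tmpdir"
```

Looping over a glob instead of a space-separated string also keeps file names with spaces intact, although none of the files in this project have any.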

Next, I wanted to generate a list of the FASTQ files and their corresponding read counts, and append this information to the readme file.

<code>for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); printf "%s\t%s\n%s\t\t\n" "$i" "$readcount" >> readme.md; done</code>

Code explanation:

<code>for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4));</code>

Same for loop as above that calculates the number of reads in each FASTQ file.

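<code>printf "%s\t%s\n%s\t\t\n" "$i" "$readcount"</code>

This formats the printed output. Each %s is a placeholder for a string argument, \t inserts a tab, and \n a newline, so the file name ($i) and its read count ($readcount) are written tab-separated on one line. The third %s has no matching argument and prints as an empty string, so each entry is followed by a line containing only two tabs.

<code>>> readme.md; done</code>

This appends the result from each loop to the readme.md file and ends the for loop (done).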
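
To check what a format string produces before appending to the real readme, the output can be sent to a scratch file first. A minimal sketch: the file name and read count below are placeholders, not real results, and the format is a simpler two-column `%s\t%s\n`, which drops the trailing `%s\t\t\n` from the command above.

```shell
# Placeholder values, not real counts from this project.
i="example_L001_R1_001.fastq.gz"
readcount=12345

# Write to a temporary scratch file instead of readme.md.
scratch=$(mktemp)

# One tab-separated line per file: name, then read count.
printf "%s\t%s\n" "$i" "$readcount" >> "$scratch"

# Read the line back to inspect it, then clean up.
row=$(cat "$scratch")
echo "$row"
rm "$scratch"
```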
