Data Received - Bisulfite-treated Illumina Sequencing from Genewiz

Received notice the sequencing data was ready from Genewiz for the samples submitted 20151222.

Download the FASTQ files from Genewiz project directory:

<code>wget -r -np -nc -A "*.gz"</code>

Since two species were sequenced (C.gigas & O.lurida), the corresponding files are in the following locations:

In order to process the files, I needed to identify just the FASTQ files from this project and save the list of files to a bash variable called ‘bsseq’:

<code>bsseq=$(ls | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")</code>


  • This initializes a variable called “bsseq” to the values contained in the command following the equals sign.

    $(ls | grep '^[0-9]\{1\}_*' | grep -v "2bRAD")

  • This lists (ls) all files, pipes them to the grep command ( ), grep finds those files that begin with (^) one or two digits followed by an underscore ([0-9{1}_*), pipes those results ( ) to another grep command which excludes (-v) any results containing the text “2bRAD”.
<td data-sheets-value="[null,2,"1_ATCACG_L001_R1_001.fastq.gz"]" >FILENAME </td> <td data-sheets-value="[null,2,"1_ATCACG_L001_R1_001.fastq.gz"]" >SAMPLE NAME </td> <td data-sheets-value="[null,2,"1_ATCACG_L001_R1_001.fastq.gz"]" >SPECIES </td> <td data-sheets-value="[null,2,"1_ATCACG_L001_R1_001.fastq.gz"]" >1_ATCACG_L001_R1_001.fastq.gz </td> <td data-sheets-value="[null,2,"1_ATCACG_L001_R1_001.fastq.gz"]" >1NF11 </td> <td data-sheets-value="[null,2,"1_ATCACG_L001_R1_001.fastq.gz"]" >O.lurida </td> <td data-sheets-value="[null,2,"2_CGATGT_L001_R1_001.fastq.gz"]" >2_CGATGT_L001_R1_001.fastq.gz </td> <td data-sheets-value="[null,2,"2_CGATGT_L001_R1_001.fastq.gz"]" >1NF15 </td> <td data-sheets-value="[null,2,"2_CGATGT_L001_R1_001.fastq.gz"]" >O.lurida </td> <td data-sheets-value="[null,2,"3_TTAGGC_L001_R1_001.fastq.gz"]" >3_TTAGGC_L001_R1_001.fastq.gz </td> <td data-sheets-value="[null,2,"3_TTAGGC_L001_R1_001.fastq.gz"]" >1NF16 </td> <td data-sheets-value="[null,2,"3_TTAGGC_L001_R1_001.fastq.gz"]" >O.lurida </td> <td data-sheets-value="[null,2,"4_TGACCA_L001_R1_001.fastq.gz"]" >4_TGACCA_L001_R1_001.fastq.gz </td> <td data-sheets-value="[null,2,"4_TGACCA_L001_R1_001.fastq.gz"]" >1NF17 </td> <td data-sheets-value="[null,2,"4_TGACCA_L001_R1_001.fastq.gz"]" >O.lurida </td> <td data-sheets-value="[null,2,"5_ACAGTG_L001_R1_001.fastq.gz"]" >5_ACAGTG_L001_R1_001.fastq.gz </td> <td data-sheets-value="[null,2,"5_ACAGTG_L001_R1_001.fastq.gz"]" >2NF5 </td> <td data-sheets-value="[null,2,"5_ACAGTG_L001_R1_001.fastq.gz"]" >O.lurida </td> <td data-sheets-value="[null,2,"6_GCCAAT_L001_R1_001.fastq.gz"]" >6_GCCAAT_L001_R1_001.fastq.gz </td> <td data-sheets-value="[null,2,"6_GCCAAT_L001_R1_001.fastq.gz"]" >2NF6 </td> <td data-sheets-value="[null,2,"6_GCCAAT_L001_R1_001.fastq.gz"]" >O.lurida </td> <td data-sheets-value="[null,2,"7_CAGATC_L001_R1_001.fastq.gz"]" >7_CAGATC_L001_R1_001.fastq.gz </td> <td data-sheets-value="[null,2,"7_CAGATC_L001_R1_001.fastq.gz"]" >2NF7 </td> <td data-sheets-value="[null,2,"7_CAGATC_L001_R1_001.fastq.gz"]" >O.lurida </td> <td data-sheets-value="[null,2,"8_ACTTGA_L001_R1_001.fastq.gz"]" >8_ACTTGA_L001_R1_001.fastq.gz </td> <td data-sheets-value="[null,2,"8_ACTTGA_L001_R1_001.fastq.gz"]" >2NF8 </td> <td data-sheets-value="[null,2,"8_ACTTGA_L001_R1_001.fastq.gz"]" >O.lurida </td> <td data-sheets-value="[null,2,"9_GATCAG_L001_R1_001.fastq.gz"]" >9_GATCAG_L001_R1_001.fastq.gz </td> <td data-sheets-value="[null,2,"9_GATCAG_L001_R1_001.fastq.gz"]" >M2 </td> <td data-sheets-value="[null,2,"9_GATCAG_L001_R1_001.fastq.gz"]" >C.gigas </td> <td data-sheets-value="[null,2,"10_TAGCTT_L001_R1_001.fastq.gz"]" >10_TAGCTT_L001_R1_001.fastq.gz </td> <td data-sheets-value="[null,2,"10_TAGCTT_L001_R1_001.fastq.gz"]" >M3 </td> <td data-sheets-value="[null,2,"10_TAGCTT_L001_R1_001.fastq.gz"]" >C.gigas </td> <td data-sheets-value="[null,2,"11_GGCTAC_L001_R1_001.fastq.gz"]" >11_GGCTAC_L001_R1_001.fastq.gz </td> <td data-sheets-value="[null,2,"11_GGCTAC_L001_R1_001.fastq.gz"]" >NF2_6 </td> <td data-sheets-value="[null,2,"11_GGCTAC_L001_R1_001.fastq.gz"]" >O.lurida </td> <td data-sheets-value="[null,2,"12_CTTGTA_L001_R1_001.fastq.gz"]" >12_CTTGTA_L001_R1_001.fastq.gz </td> <td data-sheets-value="[null,2,"12_CTTGTA_L001_R1_001.fastq.gz"]" >NF_18 </td> <td data-sheets-value="[null,2,"12_CTTGTA_L001_R1_001.fastq.gz"]" >O.lurida </td>

I wanted to add some information about the project to the readme file, like total number of sequencing reads generated and the number of reads in each FASTQ file.

Here’s how to count the total of all reads generated in this project

<code>totalreads=0; for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$((linecount/4)); totalreads=$((readcount+totalreads)); done; echo $totalreads</code>

Total reads = 138,530,448

C.gigas reads: 22,249,631

O.lurida reads: 116,280,817

Code explanation:

  • Creates variable called “totalreads” and initializes value to 0.

    for i in $bsseq;

  • Initiates a for loop to process the list of files stored in $bsseq variable. The FASTQ files have been compressed with gzip and end with the .gz extension.

    do linecount=

  • Creates variable called “linecount” that stores the results of the following command:

    `gunzip -c "$i" | wc -l`;

  • Unzips the files ($i) to stdout (-c) instead of actually uncompressing them. This is piped to the word count command, with the line flag (wc -l) to count the number of lines in the files.


  • Divides the value stored in linecount by 4. This is because an entry for a single Illumina read comprises four lines. This value is stored in the “readcount” variable.


  • Adds the readcount for the current file and adds the value to totalreads.


  • End the for loop.

    echo $totalreads

  • Prints the value of totalreads to the screen.

Next, I wanted to generate list of the FASTQ files and corresponding read counts, and append this information to the readme file.

<code>for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4)); printf "%s\t%s\n%s\t\t\n" "$i" "$readcount" >>; done</code>

Code explanation:

<code>for i in $bsseq; do linecount=`gunzip -c "$i" | wc -l`; readcount=$(($linecount/4));</code>
  • Same for loop as above that calculates the number of reads in each FASTQ file.

    printf "%s\t%s\n\n" "$i" "$readcount" >>;

  • This formats the the printed output. The “%s\t%s\n\n” portion prints the value in $i as a string (%s), followed by a tab (\t), followed by the value in $readcount as a string (%s), followed by two consecutive newlines (\n\n) to provide an empty line between the entries. See the readme file linked above to see how the output looks.

    >>; done

  • This appends the result from each loop to the file and ends the for loop (done).