Steven tasked me with processing ~90 FastA files containing gene sequences from C.virginica in this GitHub Issue. He needed to determine the Observed/Expected (O/E) ratio of CpGs in each FastA. He provided this example code and this link to all the files. Additionally, today, he tasked Kaitlyn with merging all of the output CpG O/E values for each sample in to a single file, but I decided to tackle it anyway.
The CpG O/E determination was done in a Jupyter Notebook:
- 20190225_swoose_CpG_OE.ipynb (GitHub)
Interestingly, the processing (which relied on awk
) required the use of gawk
, due to the high number of output fields. The default implementation of awk
on the version of Ubuntu I was using was not gawk
.
The creation of a single file with all of the CpG O/E info is detailed in this bash
script:
#!/bin/bash
## Script to append sample-specific headers to each ID_CpG
## file and join all ID_CpG files.
## Run file from within this directory.
# Temp file placeholder
tmp=$(mktemp)
# Create array of subdirectories.
array=(*/)
# Create column headers for ID_CpG files using sample name from directory name.
# Expects directory names like this:
# Combined.SNP.TRSdp5g95FnDNAmaf05.sorted.ANACfill.HC_VA_3_GENE_analysis
# Command will remove the "_GENE_analysis" portion.
for file in "${array[@]}"
do
gene=$(echo "${file}" \
| awk -F[.] '{print $6}' \
| rev \
| cut -d "_" -f3- \
| rev)
sed "1iID\t${gene}" "${file}"ID_CpG \
> "${file}"ID_CpG_labelled
done
# Create initial file for joining
cp "${array[0]}"ID_CpG_labelled ID_CpG_labelled_all.tab
# Loop through array and performs joins.
# Looping starts at array index 1, since index 0 was processed above.
# Assumes input files were already sorted by chromosome, in column 1.
# Outputs tab-delimited file
for file in "${array[@]:1}"
do
join \
--nocheck-order \
ID_CpG_labelled_all.tab "${file}"ID_CpG_labelled \
| tr ' ' '\t' \
> "${tmp}" \
&& mv "${tmp}" ID_CpG_labelled_all.tab
done
All data was rsync
’d to my folder on Gannet.
RESULTS
Output folder:
The output folder contains a subdirectory for each of the ~90 genes that were processed. Each contains the following files (text):
C
(count of Cs in sequences)CG
(count of CGs in sequences)*_tab
(FastA to tab file created withseqkit
)*_tab2
(Just sequences from each entry within the FastA)G
(count of Gs in sequences)ID_CpG
(initial CpG O/E determination - no header)ID_CpG_labelled
(CpG O/E determination - with header)comb
(Tabbed FastA with seq lengths, CG/C/G counts)
The combined CpG O/E file (text):