I previously identified variants across the four Day 2 DEG pooled RNAseq samples (380822, 380823, 380824, 380825) on 20210909 within the cbai_transcriptome_v3.1
transcriptome assembly. Now, I needed to actually do some analysis on the SNPs for the revisions to the Tanner crab gene expression paper (GitHub Repo).
I created an R Project and an R Markdown file to perform the analysis:
The R Markdown file (20211005_cbai_SNP_stats/20211005_cbai_SNP_stats.Rmd)will allow for complete reproduction of the analysis, including downloading necessary input files, as well as generating the final figure. Dependencies (i.e. external programs for VCF manipulation, R libraries, etc) are detailed at the beginning of the R Markdown file.
RESULTS
The knitr
to markdown rendering of the completed R Markdown file is here:
Rendering is pasted below:
Explore and extract SNP data from C.bairdi transcritpome assembly v3.1
REQUIRES Linux-based system to run all cells properly; some cells will not work on Mac OS!
REQUIRES the following programs:
REQUIRES the following R libraries and dependencies:
goslim_generic.obo
Downloaded from http://geneontology.org/docs/go-subset-guide/- then i moved it to the R library for GSEABase in the extdata folder in addition to using the command here - I think they’re both required.
GSEABase
(BioConductor)tidyverse
Display system info
## TODAY'S DATE:
## Thu 07 Oct 2021 12:54:49 PM PDT
## ------------
##
## No LSB modules are available.
## Distributor ID: Ubuntu
## Description: Ubuntu 20.04.3 LTS
## Release: 20.04
## Codename: focal
##
## ------------
## HOSTNAME:
## computer
##
## ------------
## Computer Specs:
##
## Architecture: x86_64
## CPU op-mode(s): 32-bit, 64-bit
## Byte Order: Little Endian
## Address sizes: 45 bits physical, 48 bits virtual
## CPU(s): 2
## On-line CPU(s) list: 0,1
## Thread(s) per core: 1
## Core(s) per socket: 1
## Socket(s): 2
## NUMA node(s): 1
## Vendor ID: GenuineIntel
## CPU family: 6
## Model: 165
## Model name: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
## Stepping: 2
## CPU MHz: 2400.007
## BogoMIPS: 4800.01
## Hypervisor vendor: VMware
## Virtualization type: full
## L1d cache: 64 KiB
## L1i cache: 64 KiB
## L2 cache: 512 KiB
## L3 cache: 32 MiB
## NUMA node0 CPU(s): 0,1
## Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
## Vulnerability L1tf: Not affected
## Vulnerability Mds: Not affected
## Vulnerability Meltdown: Not affected
## Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
## Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
## Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
## Vulnerability Srbds: Not affected
## Vulnerability Tsx async abort: Not affected
## Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities
##
## ------------
##
## Memory Specs
##
## total used free shared buff/cache available
## Mem: 54Gi 5.6Gi 43Gi 135Mi 5.3Gi 48Gi
## Swap: 2.0Gi 0B 2.0Gi
User-defined bash variables
Set program paths for your own computer.
Variables are saved to a “dot file” and that file needs to be sourced in each Bash chunk to have access to the Bash variables across Bash chunks.
{
echo "# CPU threads"
echo 'export threads=8'
echo ""
echo "# Programs"
echo 'export seqtk="/home/sam/programs/seqtk-1.3/seqtk"'
echo 'export bcftools="/home/sam/programs/bcftools-1.13/bcftools"'
echo 'export samtools="/home/sam/programs/samtools-1.12/samtools"'
echo ""
} > .rvars
Input/output files variables
{
echo "# SNP coverage"
echo 'export SNP_coverage=10'
echo ""
echo "# SNP quality"
echo 'export SNP_quality=30'
echo ""
echo "# Input files"
echo ""
echo "## Transcriptome assembly"
echo 'export orig_fasta_url_dir="https://owl.fish.washington.edu/halfshell/genomic-databank"'
echo 'export transcriptome_fasta="cbai_transcriptome_v3.1.fasta"'
echo ""
echo "## Transcriptome md5 checksum"
echo 'export transcriptome_fasta_md5="aeec8ffbf8fa44fb1750caee6abaf68a"'
echo ""
echo "## Transcriptome GO terms"
echo 'export cbai_v3_1_GO_url="https://gannet.fish.washington.edu/Atumefaciens/20200828_cbai_trinotate_transcriptome-v3.1"'
echo 'export cbai_v3_1_GO="20200828.cbai_transcriptome_v3.1.fasta.trinotate.go_annotations.txt"'
echo ""
echo "## VCF with variant calls"
echo 'export orig_vcf_url_dir="https://gannet.fish.washington.edu/Atumefaciens/20210909-cbai-bcftools-snp_calling"'
echo 'export orig_vcf="cbai_v3.1-SNPS.vcf"'
echo ""
echo "# Output files"
echo 'export transcriptome_SNPS_fasta="cbai_transcriptome_v3.1_SNPs-${SNP_quality}Q-${SNP_coverage}x.fasta"'
echo 'export contigs_list="cbai_v3.1-SNPS_${SNP_quality}Q-${SNP_coverage}x_contig-IDs.txt"'
echo 'export genes_list="cbai_v3.1-SNPS_${SNP_quality}Q-${SNP_coverage}x_gene-IDs.txt"'
echo 'export vcf_filtered="cbai_v3.1-SNPS-${SNP_quality}Q-${SNP_coverage}x.vcf"'
echo 'export genes_GO_list"=cbai_v3.1-SNPS_${SNP_quality}Q-${SNP_coverage}x_GO.tab"'
echo 'export flattened_GO="cbai_v3_1-SNPS_${SNP_quality}Q-${SNP_coverage}x_GO.flattened-go.txt"'
echo ""
echo "# Print formatting"
echo 'export line="-------------------------------------------------------------------------------------------------"'
echo ""
} >> .rvars
Confirm variables are accessible.
# Confirm contents of .rvars
cat .rvars
# Load contents of .rvars into the environment
source .rvars
echo ""
echo "Confirm variables are accessible."
echo "Checking the variable \$line:"
echo "${line}"
## # CPU threads
## export threads=8
##
## # Programs
## export seqtk="/home/sam/programs/seqtk-1.3/seqtk"
## export bcftools="/home/sam/programs/bcftools-1.13/bcftools"
## export samtools="/home/sam/programs/samtools-1.12/samtools"
##
## # SNP coverage
## export SNP_coverage=10
##
## # SNP quality
## export SNP_quality=30
##
## # Input files
##
## ## Transcriptome assembly
## export orig_fasta_url_dir="https://owl.fish.washington.edu/halfshell/genomic-databank"
## export transcriptome_fasta="cbai_transcriptome_v3.1.fasta"
##
## ## Transcriptome md5 checksum
## export transcriptome_fasta_md5="aeec8ffbf8fa44fb1750caee6abaf68a"
##
## ## Transcriptome GO terms
## export cbai_v3_1_GO_url="https://gannet.fish.washington.edu/Atumefaciens/20200828_cbai_trinotate_transcriptome-v3.1"
## export cbai_v3_1_GO="20200828.cbai_transcriptome_v3.1.fasta.trinotate.go_annotations.txt"
##
## ## VCF with variant calls
## export orig_vcf_url_dir="https://gannet.fish.washington.edu/Atumefaciens/20210909-cbai-bcftools-snp_calling"
## export orig_vcf="cbai_v3.1-SNPS.vcf"
##
## # Output files
## export transcriptome_SNPS_fasta="cbai_transcriptome_v3.1_SNPs-${SNP_quality}Q-${SNP_coverage}x.fasta"
## export contigs_list="cbai_v3.1-SNPS_${SNP_quality}Q-${SNP_coverage}x_contig-IDs.txt"
## export genes_list="cbai_v3.1-SNPS_${SNP_quality}Q-${SNP_coverage}x_gene-IDs.txt"
## export vcf_filtered="cbai_v3.1-SNPS-${SNP_quality}Q-${SNP_coverage}x.vcf"
## export genes_GO_list"=cbai_v3.1-SNPS_${SNP_quality}Q-${SNP_coverage}x_GO.tab"
## export flattened_GO="cbai_v3_1-SNPS_${SNP_quality}Q-${SNP_coverage}x_GO.flattened-go.txt"
##
## # Print formatting
## export line="-------------------------------------------------------------------------------------------------"
##
##
## Confirm variables are accessible.
## Checking the variable $line:
## -------------------------------------------------------------------------------------------------
Get VCF
# Load contents of .rvars into the environment
source .rvars
# Download with wget. Use --no-check-certificate to avoid issues with Gannet certificate
# Use --quiet option to prevent wget output from printing too many lines to notebook
wget --continue --no-check-certificate --quiet ${orig_vcf_url_dir}/${orig_vcf} \
--directory-prefix ./data
wget --continue --no-check-certificate --quiet ${orig_vcf_url_dir}/checksums.md5 \
--directory-prefix ./data
# Confirm checksum for transcriptome FastA is good
cd ./data
md5sum --check checksums.md5 | grep "${orig_vcf}"
## md5sum: 20210909-cbai-bcftools-snp_calling.sh: No such file or directory
## md5sum: input_bam_checksums.md5: No such file or directory
## md5sum: slurm-2194047.out: No such file or directory
## md5sum: WARNING: 3 listed files could not be read
## cbai_v3.1-SNPS.vcf: OK
Get transcriptome
# Load contents of .rvars into the environment
source .rvars
# Download with wget. Use --no-check-certificate to avoid issues with Gannet certificate
# Use --quiet option to prevent wget output from printing too many lines to notebook
wget --continue --no-check-certificate --quiet ${orig_fasta_url_dir}/${transcriptome_fasta} \
--directory-prefix ./data
# Confirm checksum for transcriptome FastA is good
# Uses grep to highlight the desired file.
if [ "$(md5sum ./data/${transcriptome_fasta} | awk '{print $1}')" = "${transcriptome_fasta_md5}" ]; then echo "Checksums match"; fi
## Checksums match
Get transcriptome GO annotations file
# Load contents of .rvars into the environment
source .rvars
# Download with wget. Use --no-check-certificate to avoid issues with Gannet certificate
# Use --quiet option to prevent wget output from printing too many lines to notebook
wget --continue --no-check-certificate --quiet ${cbai_v3_1_GO_url}/${cbai_v3_1_GO} \
--directory-prefix ./data
echo ""
head ./data/${cbai_v3_1_GO}
##
## TRINITY_DN100045_c0_g1 GO:0000323,GO:0001508,GO:0002027,GO:0003279,GO:0003283,GO:0003674,GO:0005198,GO:0005200,GO:0005488,GO:0005515,GO:0005575,GO:0005739,GO:0005764,GO:0005768,GO:0005769,GO:0005773,GO:0005829,GO:0005856,GO:0005886,GO:0005911,GO:0006810,GO:0006811,GO:0006812,GO:0006816,GO:0006873,GO:0006874,GO:0006875,GO:0006888,GO:0006897,GO:0006937,GO:0006942,GO:0007009,GO:0007043,GO:0007154,GO:0007165,GO:0008016,GO:0008092,GO:0008104,GO:0008150,GO:0009893,GO:0009987,GO:0010033,GO:0010468,GO:0010522,GO:0010604,GO:0010628,GO:0010646,GO:0010880,GO:0010881,GO:0010882,GO:0010927,GO:0010959,GO:0014704,GO:0015031,GO:0015833,GO:0016020,GO:0016043,GO:0016192,GO:0016323,GO:0016324,GO:0019222,GO:0019722,GO:0019725,GO:0019899,GO:0019900,GO:0019901,GO:0019932,GO:0022607,GO:0022898,GO:0023051,GO:0023052,GO:0030001,GO:0030003,GO:0030018,GO:0030054,GO:0030315,GO:0030507,GO:0030674,GO:0030913,GO:0031410,GO:0031430,GO:0031647,GO:0031672,GO:0031982,GO:0032409,GO:0032411,GO:0032412,GO:0032414,GO:0032501,GO:0032502,GO:0032844,GO:0032879,GO:0032970,GO:0033036,GO:0033267,GO:0033292,GO:0033365,GO:0034329,GO:0034330,GO:0034394,GO:0034613,GO:0034762,GO:0034764,GO:0034765,GO:0034767,GO:0035556,GO:0035637,GO:0036309,GO:0036371,GO:0042221,GO:0042383,GO:0042391,GO:0042592,GO:0042886,GO:0043034,GO:0043194,GO:0043226,GO:0043227,GO:0043228,GO:0043229,GO:0043231,GO:0043232,GO:0043266,GO:0043268,GO:0043269,GO:0043270,GO:0044057,GO:0044093,GO:0044291,GO:0044325,GO:0044422,GO:0044424,GO:0044425,GO:0044444,GO:0044449,GO:0044456,GO:0044459,GO:0044463,GO:0044464,GO:0044699,GO:0044700,GO:0044707,GO:0044763,GO:0044767,GO:0044802,GO:0045121,GO:0045184,GO:0045211,GO:0045216,GO:0046907,GO:0048193,GO:0048518,GO:0048522,GO:0048646,GO:0048856,GO:0048878,GO:0050789,GO:0050794,GO:0050801,GO:0050821,GO:0050896,GO:0051049,GO:0051050,GO:0051117,GO:0051179,GO:0051234,GO:0051239,GO:0051270,GO:0051279,GO:0051282,GO:0051597,GO:0051641,GO:0051649,GO:0051899,GO:0051924,GO:0051928,GO:0055037,GO:0055065,GO:0055074,GO:0055080,GO:0055082,GO:0055117,GO:0060090,GO:0060255,GO:0060306,GO:0060307,GO:0060341,GO:0061024,GO:0061337,GO:0065007,GO:0065008,GO:0065009,GO:0070296,GO:0070727,GO:0070838,GO:0070972,GO:0071702,GO:0071705,GO:0071840,GO:0072503,GO:0072507,GO:0072511,GO:0072657,GO:0072659,GO:0086001,GO:0086002,GO:0086004,GO:0086005,GO:0086010,GO:0086012,GO:0086014,GO:0086015,GO:0086046,GO:0086065,GO:0086066,GO:0086070,GO:0086091,GO:0090257,GO:0090279,GO:0097060,GO:0097458,GO:0097708,GO:0098589,GO:0098590,GO:0098771,GO:0098857,GO:0098900,GO:0098901,GO:0098907,GO:0098910,GO:0099623,GO:0140031,GO:1901016,GO:1901018,GO:1901019,GO:1901021,GO:1901379,GO:1901381,GO:1902578,GO:1902580,GO:1903115,GO:1903169,GO:1903522,GO:1903779,GO:1904062,GO:1904064,GO:1904427,GO:1990778,GO:2000021,GO:2001257,GO:2001259
## TRINITY_DN10007_c0_g1 GO:0003674,GO:0003824,GO:0004497,GO:0005488,GO:0005506,GO:0005575,GO:0006082,GO:0006629,GO:0006631,GO:0008150,GO:0008152,GO:0009987,GO:0016021,GO:0016491,GO:0016705,GO:0016712,GO:0016713,GO:0018685,GO:0019752,GO:0020037,GO:0031224,GO:0032787,GO:0043167,GO:0043169,GO:0043436,GO:0044237,GO:0044238,GO:0044255,GO:0044281,GO:0044425,GO:0044699,GO:0044710,GO:0044763,GO:0046872,GO:0046906,GO:0046914,GO:0055114,GO:0070330,GO:0071704,GO:0097159,GO:1901363
## TRINITY_DN10008_c0_g1 GO:0003674,GO:0003824,GO:0005488,GO:0005506,GO:0005575,GO:0005634,GO:0005654,GO:0005739,GO:0005829,GO:0006139,GO:0006259,GO:0006281,GO:0006304,GO:0006307,GO:0006725,GO:0006807,GO:0006950,GO:0006974,GO:0008150,GO:0008152,GO:0008198,GO:0008283,GO:0009451,GO:0009987,GO:0016070,GO:0016491,GO:0016705,GO:0016706,GO:0016787,GO:0032451,GO:0033554,GO:0034641,GO:0035510,GO:0035511,GO:0035513,GO:0035514,GO:0035515,GO:0035552,GO:0035553,GO:0043167,GO:0043169,GO:0043170,GO:0043226,GO:0043227,GO:0043229,GO:0043231,GO:0043412,GO:0043734,GO:0044237,GO:0044238,GO:0044260,GO:0044422,GO:0044424,GO:0044428,GO:0044444,GO:0044446,GO:0044464,GO:0044699,GO:0044710,GO:0044728,GO:0044763,GO:0046483,GO:0046872,GO:0046914,GO:0050896,GO:0051213,GO:0051716,GO:0051747,GO:0055114,GO:0070988,GO:0070989,GO:0071704,GO:0080111,GO:0090304,GO:0090305,GO:1901360,GO:1990930
## TRINITY_DN10009_c0_g1 GO:0003674,GO:0005488,GO:0005515,GO:0005575,GO:0005768,GO:0006810,GO:0008104,GO:0008150,GO:0015031,GO:0015833,GO:0016192,GO:0016197,GO:0031410,GO:0031982,GO:0033036,GO:0042886,GO:0043226,GO:0043227,GO:0043229,GO:0044424,GO:0044444,GO:0044464,GO:0045184,GO:0051179,GO:0051234,GO:0071702,GO:0071705,GO:0097708,GO:0098876,GO:1990126
## TRINITY_DN1000_c1_g1 GO:0005575,GO:0005634,GO:0005737,GO:0008150,GO:0010494,GO:0010717,GO:0030529,GO:0032991,GO:0035770,GO:0036464,GO:0043226,GO:0043227,GO:0043228,GO:0043229,GO:0043231,GO:0043232,GO:0044424,GO:0044444,GO:0044464,GO:0045595,GO:0050789,GO:0050793,GO:0050794,GO:0051239,GO:0065007,GO:1990904,GO:2000026
## TRINITY_DN10011_c0_g1 GO:0002082,GO:0003674,GO:0003824,GO:0004129,GO:0005215,GO:0005575,GO:0005739,GO:0005746,GO:0006140,GO:0006461,GO:0006996,GO:0007005,GO:0008150,GO:0008324,GO:0009055,GO:0009987,GO:0015002,GO:0015075,GO:0015077,GO:0015078,GO:0016021,GO:0016043,GO:0016491,GO:0016675,GO:0016676,GO:0019219,GO:0019220,GO:0019222,GO:0022607,GO:0022857,GO:0022890,GO:0022891,GO:0022892,GO:0031224,GO:0031323,GO:0034622,GO:0042325,GO:0043226,GO:0043227,GO:0043229,GO:0043231,GO:0043467,GO:0043623,GO:0043933,GO:0044422,GO:0044424,GO:0044425,GO:0044429,GO:0044444,GO:0044446,GO:0044455,GO:0044464,GO:0050789,GO:0050794,GO:0051171,GO:0051174,GO:0065003,GO:0065007,GO:0070469,GO:0071822,GO:0071840,GO:0080090,GO:0097250,GO:1900542,GO:1903578
## TRINITY_DN100166_c0_g1 GO:0003674,GO:0003676,GO:0003677,GO:0005488,GO:0005575,GO:0005634,GO:0006355,GO:0008150,GO:0009889,GO:0010468,GO:0010556,GO:0019219,GO:0019222,GO:0031323,GO:0031326,GO:0043167,GO:0043169,GO:0043226,GO:0043227,GO:0043229,GO:0043231,GO:0044424,GO:0044464,GO:0046872,GO:0050789,GO:0050794,GO:0051171,GO:0051252,GO:0060255,GO:0065007,GO:0080090,GO:0097159,GO:1901363,GO:1903506,GO:2000112,GO:2001141
## TRINITY_DN10017_c0_g1 GO:0000375,GO:0000377,GO:0000381,GO:0000398,GO:0003674,GO:0003676,GO:0003723,GO:0005488,GO:0005575,GO:0006139,GO:0006396,GO:0006397,GO:0006725,GO:0006807,GO:0007275,GO:0008150,GO:0008152,GO:0008380,GO:0009987,GO:0010468,GO:0016070,GO:0016071,GO:0016604,GO:0016607,GO:0019219,GO:0019222,GO:0031323,GO:0032501,GO:0032502,GO:0034641,GO:0043170,GO:0043484,GO:0044237,GO:0044238,GO:0044260,GO:0044422,GO:0044424,GO:0044428,GO:0044446,GO:0044451,GO:0044464,GO:0044699,GO:0044707,GO:0044767,GO:0045292,GO:0046483,GO:0048024,GO:0048856,GO:0050684,GO:0050789,GO:0050794,GO:0051171,GO:0051252,GO:0060255,GO:0065007,GO:0071704,GO:0080090,GO:0090304,GO:0097159,GO:1901360,GO:1901363,GO:1903311
## TRINITY_DN1001_c0_g1 GO:0000096,GO:0000097,GO:0003674,GO:0003824,GO:0003962,GO:0004121,GO:0004123,GO:0005488,GO:0005575,GO:0005737,GO:0006082,GO:0006520,GO:0006534,GO:0006555,GO:0006790,GO:0006807,GO:0008150,GO:0008152,GO:0008652,GO:0009058,GO:0009066,GO:0009067,GO:0009069,GO:0009070,GO:0009086,GO:0009092,GO:0009987,GO:0016053,GO:0016740,GO:0016765,GO:0016829,GO:0016846,GO:0019343,GO:0019344,GO:0019346,GO:0019752,GO:0019842,GO:0030170,GO:0036094,GO:0043167,GO:0043168,GO:0043436,GO:0044237,GO:0044238,GO:0044249,GO:0044272,GO:0044281,GO:0044283,GO:0044424,GO:0044464,GO:0044699,GO:0044710,GO:0044711,GO:0044763,GO:0046394,GO:0047804,GO:0048037,GO:0050667,GO:0070279,GO:0071704,GO:0097159,GO:1901363,GO:1901564,GO:1901566,GO:1901576,GO:1901605,GO:1901607
## TRINITY_DN10023_c0_g1 GO:0005575,GO:0005634,GO:0005730,GO:0005737,GO:0008150,GO:0022613,GO:0042254,GO:0043226,GO:0043227,GO:0043228,GO:0043229,GO:0043231,GO:0043232,GO:0044085,GO:0044422,GO:0044424,GO:0044428,GO:0044446,GO:0044464,GO:0071840
Inspect original VCF
# Load contents of .rvars into the environment
source .rvars
echo "'head' view of ${orig_vcf}:"
echo ""
head "./data/${orig_vcf}"
echo ""
echo "End of 'head' view of ${orig_vcf}"
echo ""
echo "${line}"
echo ""
echo "VCF header info:"
echo ""
# Capture first line of header (skipping list of contigs)
begin=$("${bcftools}" view --header-only ./data/${orig_vcf} \
| grep --line-number "##ALT=<ID" \
| awk -F":" '{print $1}')
# Caputure last line of header
end=$("${bcftools}" view --header-only ./data/${orig_vcf} | wc -l)
# Use sed to print range of lines
sed --quiet "${begin},${end} p" ./data/${orig_vcf}
echo ""
echo "${line}"
echo ""
echo "List of samples in VCF:"
echo ""
${bcftools} query --list-samples ./data/${orig_vcf}
## 'head' view of cbai_v3.1-SNPS.vcf:
##
## ##fileformat=VCFv4.2
## ##FILTER=<ID=PASS,Description="All filters passed">
## ##bcftoolsVersion=1.13+htslib-1.13
## ##bcftoolsCommand=mpileup --fasta-ref /gscratch/srlab/sam/data/C_bairdi/transcriptomes/cbai_transcriptome_v3.1.fasta --threads 40 --output-type u /gscratch/scrubbed/samwhite/outputs/20210908-cbai-hisat2-cbai_transcriptome_v3.1/380822.sorted.bam /gscratch/scrubbed/samwhite/outputs/20210908-cbai-hisat2-cbai_transcriptome_v3.1/380823.sorted.bam /gscratch/scrubbed/samwhite/outputs/20210908-cbai-hisat2-cbai_transcriptome_v3.1/380824.sorted.bam /gscratch/scrubbed/samwhite/outputs/20210908-cbai-hisat2-cbai_transcriptome_v3.1/380825.sorted.bam
## ##reference=file:///gscratch/srlab/sam/data/C_bairdi/transcriptomes/cbai_transcriptome_v3.1.fasta
## ##contig=<ID=TRINITY_DN5604_c0_g2_i1,length=2325>
## ##contig=<ID=TRINITY_DN9_c4_g1_i10,length=1025>
## ##contig=<ID=TRINITY_DN9_c4_g1_i9,length=999>
## ##contig=<ID=TRINITY_DN38_c0_g3_i1,length=2357>
## ##contig=<ID=TRINITY_DN81_c0_g1_i9,length=1451>
##
## End of 'head' view of cbai_v3.1-SNPS.vcf
##
## -------------------------------------------------------------------------------------------------
##
## VCF header info:
##
## ##ALT=<ID=*,Description="Represents allele(s) other than observed.">
## ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
## ##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
## ##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
## ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
## ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
## ##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
## ##INFO=<ID=MQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality Bias (closer to 0 is better)">
## ##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
## ##INFO=<ID=MQSBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality vs Strand Bias (closer to 0 is better)">
## ##INFO=<ID=SCBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Soft-Clip Length Bias (closer to 0 is better)">
## ##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
## ##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
## ##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
## ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
## ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
## ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
## ##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
## ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
## ##bcftools_callVersion=1.13+htslib-1.13
## ##bcftools_callCommand=call --output-type v --multiallelic-caller --variants-only --threads 40; Date=Fri Sep 10 21:20:44 2021
## #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT d2_uninfected_decreased-temp d2_infected_decreased-temp d2_uninfected_elevated-temp d2_infected_elevated-temp
## TRINITY_DN5604_c0_g2_i1 35 . A C 37.0646 . DP=3;VDB=0.14;SGB=-0.314421;RPBZ=-1.22474;FS=0;MQ0F=0;AC=3;AN=4;DP4=1,0,2,0;MQ=60 GT:PL ./.:0,0,0 0/1:31,0,31 ./.:0,0,0 1/1:37,3,0
## TRINITY_DN5604_c0_g2_i1 391 . C G 71.5526 . DP=3;VDB=0.6;SGB=-0.314421;FS=0;MQ0F=0;AC=4;AN=4;DP4=0,0,0,2;MQ=60 GT:PL ./.:0,0,0 1/1:60,3,0 ./.:0,0,0 1/1:37,3,0
##
## -------------------------------------------------------------------------------------------------
##
## List of samples in VCF:
##
## d2_uninfected_decreased-temp
## d2_infected_decreased-temp
## d2_uninfected_elevated-temp
## d2_infected_elevated-temp
Subset VCF to minimum coverage and quality
# Load contents of .rvars into the environment
source .rvars
# Subset VCF to only SNPs with ${SNP_coverage}x raw read coverage
# and quality >= ${SNP_qual}
"${bcftools}" filter \
--include "TYPE='snp' & MIN(DP)>=${SNP_coverage} & QUAL>=${SNP_quality}" \
--threads ${threads} \
./data/${orig_vcf} \
> ./analyses/${vcf_filtered}
Inspect filtered VCF
# Load contents of .rvars into the environment
source .rvars
echo "'head' view of ${vcf_filtered}:"
echo ""
head "./analyses/${vcf_filtered}"
echo ""
echo "End of 'head' view of ${vcf_filtered}"
echo ""
echo "${line}"
echo ""
echo "VCF header info:"
echo ""
# Capture first line of header (skipping list of contigs)
begin=$("${bcftools}" view --header-only ./analyses/${vcf_filtered} \
| grep --line-number "##ALT=<ID" \
| awk -F":" '{print $1}')
# Caputure last line of header
end=$("${bcftools}" view --header-only ./analyses/${vcf_filtered} | wc -l)
# Use sed to print range of lines
sed --quiet "${begin},${end} p" ./analyses/${vcf_filtered}
echo ""
echo "${line}"
echo ""
echo "List of samples in VCF:"
echo ""
${bcftools} query --list-samples ./analyses/${vcf_filtered}
## 'head' view of cbai_v3.1-SNPS-30Q-10x.vcf:
##
## ##fileformat=VCFv4.2
## ##FILTER=<ID=PASS,Description="All filters passed">
## ##bcftoolsVersion=1.13+htslib-1.13
## ##bcftoolsCommand=mpileup --fasta-ref /gscratch/srlab/sam/data/C_bairdi/transcriptomes/cbai_transcriptome_v3.1.fasta --threads 40 --output-type u /gscratch/scrubbed/samwhite/outputs/20210908-cbai-hisat2-cbai_transcriptome_v3.1/380822.sorted.bam /gscratch/scrubbed/samwhite/outputs/20210908-cbai-hisat2-cbai_transcriptome_v3.1/380823.sorted.bam /gscratch/scrubbed/samwhite/outputs/20210908-cbai-hisat2-cbai_transcriptome_v3.1/380824.sorted.bam /gscratch/scrubbed/samwhite/outputs/20210908-cbai-hisat2-cbai_transcriptome_v3.1/380825.sorted.bam
## ##reference=file:///gscratch/srlab/sam/data/C_bairdi/transcriptomes/cbai_transcriptome_v3.1.fasta
## ##contig=<ID=TRINITY_DN5604_c0_g2_i1,length=2325>
## ##contig=<ID=TRINITY_DN9_c4_g1_i10,length=1025>
## ##contig=<ID=TRINITY_DN9_c4_g1_i9,length=999>
## ##contig=<ID=TRINITY_DN38_c0_g3_i1,length=2357>
## ##contig=<ID=TRINITY_DN81_c0_g1_i9,length=1451>
##
## End of 'head' view of cbai_v3.1-SNPS-30Q-10x.vcf
##
## -------------------------------------------------------------------------------------------------
##
## VCF header info:
##
## ##ALT=<ID=*,Description="Represents allele(s) other than observed.">
## ##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
## ##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
## ##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
## ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
## ##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
## ##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
## ##INFO=<ID=MQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality Bias (closer to 0 is better)">
## ##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
## ##INFO=<ID=MQSBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality vs Strand Bias (closer to 0 is better)">
## ##INFO=<ID=SCBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Soft-Clip Length Bias (closer to 0 is better)">
## ##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
## ##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
## ##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
## ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
## ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
## ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
## ##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
## ##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
## ##bcftools_callVersion=1.13+htslib-1.13
## ##bcftools_callCommand=call --output-type v --multiallelic-caller --variants-only --threads 40; Date=Fri Sep 10 21:20:44 2021
## ##bcftools_filterVersion=1.13+htslib-1.13
## ##bcftools_filterCommand=filter --include 'TYPE='snp' & MIN(DP)>=10 & QUAL>=30' --threads 8 ./data/cbai_v3.1-SNPS.vcf; Date=Thu Oct 7 12:55:06 2021
## #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT d2_uninfected_decreased-temp d2_infected_decreased-temp d2_uninfected_elevated-temp d2_infected_elevated-temp
## TRINITY_DN9_c4_g1_i10 661 . T C 125.719 PASS DP=139;VDB=3.781e-05;SGB=-11.5772;RPBZ=-0.349139;MQBZ=0.473562;MQSBZ=0.168907;BQBZ=3.68777;SCBZ=-3.2251;FS=0;MQ0F=0.00719424;AC=2;AN=8;DP4=44,55,5,6;MQ=58 GT:PL 0/1:89,0,255 0/0:0,60,255 0/1:77,0,196 0/0:0,18,255
## TRINITY_DN9_c4_g1_i10 662 . G A 125.924 PASS DP=139;VDB=3.781e-05;SGB=-11.5772;RPBZ=-0.349216;MQBZ=0.473562;MQSBZ=0.168907;BQBZ=3.85055;SCBZ=-3.22016;FS=0;MQ0F=0.00719424;AC=2;AN=8;DP4=44,55,5,6;MQ=58 GT:PL 0/1:89,0,255 0/0:0,60,255 0/1:77,0,199 0/0:0,11,255
##
## -------------------------------------------------------------------------------------------------
##
## List of samples in VCF:
##
## d2_uninfected_decreased-temp
## d2_infected_decreased-temp
## d2_uninfected_elevated-temp
## d2_infected_elevated-temp
Extract Transcripts Having SNPs with ${SNP_coverage}x
Read Coverage and Quality >= ${SNP_qual}
The resulting FastA is useful to use in IGV, so that we don’t have to deal with browsing contigs with no variants.
# Load contents of .rvars into the environment
source .rvars
# List of FastA IDs with ${SNP_coverage}x SNP coverage
# Uses awk to skip all lines beginning with '#'
awk '/^[^#]/{print $1}' ./analyses/${vcf_filtered} \
| sort -u \
> ./analyses/${contigs_list}
# Use seqtk to generate a FastA from original transcriptome assembly and list of contigs with ${SNP_coverage}x SNP coverage
"${seqtk}" subseq ./data/${transcriptome_fasta} ./analyses/${contigs_list} > ./data/${transcriptome_SNPS_fasta}
# Generate FastA index file for new FastA
"${samtools}" faidx ./data/${transcriptome_SNPS_fasta}
ls -ltrh ./analyses
## total 22M
## -rw-rw-r-- 1 sam sam 22M Oct 7 12:55 cbai_v3.1-SNPS-30Q-10x.vcf
## -rw-rw-r-- 1 sam sam 423K Oct 7 12:55 cbai_v3.1-SNPS_30Q-10x_contig-IDs.txt
Compare number of original transcripts with number of those with SNPs
# Load contents of .rvars into the environment
source .rvars
for fasta in ./data/*.fasta
do
grep --count --with-filename "^>" ${fasta}
done | column -t -s ":"
## ./data/cbai_transcriptome_v3.1.fasta 78649
## ./data/cbai_transcriptome_v3.1_SNPs-30Q-10x.fasta 17680
SNP Stats
Summary stats
# Load contents of .rvars into the environment
source .rvars
# Shows summary stats for all samples
${bcftools} stats \
--samples - \
./analyses/${vcf_filtered} \
| head -n 31
## # This file was produced by bcftools stats (1.13+htslib-1.13) and can be plotted using plot-vcfstats.
## # The command line was: bcftools stats --samples - ./analyses/cbai_v3.1-SNPS-30Q-10x.vcf
## #
## # Definition of sets:
## # ID [2]id [3]tab-separated file names
## ID 0 ./analyses/cbai_v3.1-SNPS-30Q-10x.vcf
## # SN, Summary numbers:
## # number of records .. number of data rows in the VCF
## # number of no-ALTs .. reference-only sites, ALT is either "." or identical to REF
## # number of SNPs .. number of rows with a SNP
## # number of MNPs .. number of rows with a MNP, such as CC>TT
## # number of indels .. number of rows with an indel
## # number of others .. number of rows with other type, for example a symbolic allele or
## # a complex substitution, such as ACT>TCGA
## # number of multiallelic sites .. number of rows with multiple alternate alleles
## # number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
## #
## # Note that rows containing multiple types will be counted multiple times, in each
## # counter. For example, a row with a SNP and an indel increments both the SNP and
## # the indel counter.
## #
## # SN [2]id [3]key [4]value
## SN 0 number of samples: 4
## SN 0 number of records: 79753
## SN 0 number of no-ALTs: 0
## SN 0 number of SNPs: 79753
## SN 0 number of MNPs: 0
## SN 0 number of indels: 0
## SN 0 number of others: 0
## SN 0 number of multiallelic sites: 785
## SN 0 number of multiallelic SNP sites: 785
Transitions/Transversions and Substitution types
# Load contents of .rvars into the environment
source .rvars
${bcftools} stats \
--samples - \
./analyses/${vcf_filtered} \
| grep "ST"
## # TSTV, transitions/transversions:
## # TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
## TSTV 0 50173 30371 1.65 49905 29848 1.67
## # ST, Substitution types:
## # ST [2]id [3]type [4]count
## ST 0 A>C 4141
## ST 0 A>G 10450
## ST 0 A>T 3428
## ST 0 C>A 4425
## ST 0 C>G 3420
## ST 0 C>T 14934
## ST 0 G>A 14055
## ST 0 G>C 3373
## ST 0 G>T 3944
## ST 0 T>A 3635
## ST 0 T>C 10734
## ST 0 T>G 4005
Individual sample stats
# Load contents of .rvars into the environment
source .rvars
# Shows samples stats.
# Uses sed/column/tail to format output nicely
${bcftools} stats \
--samples - \
./analyses/${vcf_filtered} \
| grep "PSC" \
| sed 's/^# *//' \
| sed 's/average depth/avg_dp/' \
| column -t \
| tail -n 5
## PSC [2]id [3]sample [4]nRefHom [5]nNonRefHom [6]nHets [7]nTransitions [8]nTransversions [9]nIndels [10]avg_dp [11]nSingletons [12]nHapRef [13]nHapAlt [14]nMissing
## PSC 0 d2_uninfected_decreased-temp 23566 20691 28356 29276 19771 0 0.0 5658 0 0 7140
## PSC 0 d2_infected_decreased-temp 24704 20754 30361 30595 20520 0 0.0 4725 0 0 3934
## PSC 0 d2_uninfected_elevated-temp 22246 19306 33690 31942 21054 0 0.0 5052 0 0 4511
## PSC 0 d2_infected_elevated-temp 22970 20575 34095 32990 21680 0 0.0 5057 0 0 2113
Percentage of transcripts with SNPS
# Load contents of .rvars into the environment
source .rvars
# Count number of transcripts in transcriptome assembly
transcripts_count=$(grep -c "^>" ./data/${transcriptome_fasta})
printf "%s\t%s\n" "Original transcriptome transcripts:" "${transcripts_count}"
# Count number of transcripts with SNPs
# Parses Trinity ID and counts unique Trinity IDs
snp_transcripts_count=$(awk '/^[^#]/{print $1}' ./analyses/${vcf_filtered} | sort -u| wc -l)
printf "%s\t%s\n" "Transcripts with SNPs:" "${snp_transcripts_count}"
echo ""
# Calculate percentage
printf "%s\t%s\n" "Percentage of transcripts with SNPs:" "$(bc <<< "scale=4; ( ${snp_transcripts_count} / ${transcripts_count} * 100)")"
## Original transcriptome transcripts: 78649
## Transcripts with SNPs: 17680
##
## Percentage of transcripts with SNPs: 22.4700
Get max/min/mean number of SNPs per transcript
# Load contents of .rvars into the environment
source .rvars
awk '/^[^#]/{print $1}' ./analyses/${vcf_filtered} \
| sort \
| uniq -c \
| sort -k1n,1 \
| awk '{sum+=$1;cnt++;max=$1;min=cnt==1?$1:min} END{print "min="min, "max="max, "mean="sum/cnt}'
## min=1 max=334 mean=4.51092
Extract FastA IDs
Strips Trinity isoform designations from end of ID to match gene IDs in GO term annotation file
# Load contents of .rvars into the environment
source .rvars
awk '/^[^#]/{print $1}' ./analyses/${vcf_filtered} \
| awk 'BEGIN { FS = "_"; OFS = "_" } {print $1, $2, $3, $4}' \
| sort -u \
> ./analyses/${genes_list}
wc -l ./analyses/"${genes_list}"
echo ""
echo "${line}"
echo ""
head ./analyses/"${genes_list}"
## 10332 ./analyses/cbai_v3.1-SNPS_30Q-10x_gene-IDs.txt
##
## -------------------------------------------------------------------------------------------------
##
## TRINITY_DN1000_c1_g1
## TRINITY_DN10011_c0_g1
## TRINITY_DN10019_c0_g1
## TRINITY_DN1001_c0_g1
## TRINITY_DN10023_c0_g1
## TRINITY_DN10023_c1_g1
## TRINITY_DN10023_c2_g2
## TRINITY_DN10025_c0_g1
## TRINITY_DN1002_c2_g1
## TRINITY_DN10030_c0_g1
Extract genes with SNPs from transcriptome annotation file
# Load contents of .rvars into the environment
source .rvars
# Uses a list of Trinity gene IDs to extract gene IDs and corresponding GO accessions.
# Expects Trinotate annotations file as input.
# Uses sed to convert commas to tabs in preparation for subsequent "flattening"
while read -r line
do
grep "${line}" ./data/${cbai_v3_1_GO}
done < ./analyses/${genes_list} \
| sed $'s/,/\t/g' \
> ./analyses/${genes_GO_list}
# Check number of records
wc -l ./analyses/${genes_GO_list}
echo ""
echo "${line}"
echo ""
head ./analyses/${genes_GO_list}
## 6891 ./analyses/cbai_v3.1-SNPS_30Q-10x_GO.tab
##
## -------------------------------------------------------------------------------------------------
##
## TRINITY_DN1000_c1_g1 GO:0005575 GO:0005634 GO:0005737 GO:0008150 GO:0010494 GO:0010717 GO:0030529 GO:0032991 GO:0035770 GO:0036464 GO:0043226 GO:0043227 GO:0043228 GO:0043229 GO:0043231 GO:0043232 GO:0044424 GO:0044444 GO:0044464 GO:0045595 GO:0050789 GO:0050793 GO:0050794 GO:0051239 GO:0065007 GO:1990904 GO:2000026
## TRINITY_DN10011_c0_g1 GO:0002082 GO:0003674 GO:0003824 GO:0004129 GO:0005215 GO:0005575 GO:0005739 GO:0005746 GO:0006140 GO:0006461 GO:0006996 GO:0007005 GO:0008150 GO:0008324 GO:0009055 GO:0009987 GO:0015002 GO:0015075 GO:0015077 GO:0015078 GO:0016021 GO:0016043 GO:0016491 GO:0016675 GO:0016676 GO:0019219 GO:0019220 GO:0019222 GO:0022607 GO:0022857 GO:0022890 GO:0022891 GO:0022892 GO:0031224 GO:0031323 GO:0034622 GO:0042325 GO:0043226 GO:0043227 GO:0043229 GO:0043231 GO:0043467 GO:0043623 GO:0043933 GO:0044422 GO:0044424 GO:0044425 GO:0044429 GO:0044444 GO:0044446 GO:0044455 GO:0044464 GO:0050789 GO:0050794 GO:0051171 GO:0051174 GO:0065003 GO:0065007 GO:0070469 GO:0071822 GO:0071840 GO:0080090 GO:0097250 GO:1900542 GO:1903578
## TRINITY_DN1001_c0_g1 GO:0000096 GO:0000097 GO:0003674 GO:0003824 GO:0003962 GO:0004121 GO:0004123 GO:0005488 GO:0005575 GO:0005737 GO:0006082 GO:0006520 GO:0006534 GO:0006555 GO:0006790 GO:0006807 GO:0008150 GO:0008152 GO:0008652 GO:0009058 GO:0009066 GO:0009067 GO:0009069 GO:0009070 GO:0009086 GO:0009092 GO:0009987 GO:0016053 GO:0016740 GO:0016765 GO:0016829 GO:0016846 GO:0019343 GO:0019344 GO:0019346 GO:0019752 GO:0019842 GO:0030170 GO:0036094 GO:0043167 GO:0043168 GO:0043436 GO:0044237 GO:0044238 GO:0044249 GO:0044272 GO:0044281 GO:0044283 GO:0044424 GO:0044464 GO:0044699 GO:0044710 GO:0044711 GO:0044763 GO:0046394 GO:0047804 GO:0048037 GO:0050667 GO:0070279 GO:0071704 GO:0097159 GO:1901363 GO:1901564 GO:1901566 GO:1901576 GO:1901605 GO:1901607
## TRINITY_DN10023_c0_g1 GO:0005575 GO:0005634 GO:0005730 GO:0005737 GO:0008150 GO:0022613 GO:0042254 GO:0043226 GO:0043227 GO:0043228 GO:0043229 GO:0043231 GO:0043232 GO:0044085 GO:0044422 GO:0044424 GO:0044428 GO:0044446 GO:0044464 GO:0071840
## TRINITY_DN10023_c2_g2 GO:0001822 GO:0002376 GO:0003674 GO:0005488 GO:0005515 GO:0005575 GO:0005634 GO:0005829 GO:0006325 GO:0006464 GO:0006473 GO:0006475 GO:0006508 GO:0006511 GO:0006515 GO:0006807 GO:0006950 GO:0006974 GO:0006996 GO:0007130 GO:0007165 GO:0007276 GO:0007283 GO:0007420 GO:0008104 GO:0008150 GO:0008152 GO:0008630 GO:0009056 GO:0009057 GO:0009892 GO:0009894 GO:0009895 GO:0009987 GO:0010033 GO:0010243 GO:0010498 GO:0010605 GO:0010941 GO:0016043 GO:0018193 GO:0018205 GO:0018393 GO:0018394 GO:0019222 GO:0019538 GO:0019941 GO:0022402 GO:0022414 GO:0022607 GO:0030154 GO:0030162 GO:0030163 GO:0030324 GO:0030433 GO:0031323 GO:0031324 GO:0031329 GO:0031330 GO:0031593 GO:0031647 GO:0031982 GO:0032182 GO:0032268 GO:0032269 GO:0032403 GO:0032434 GO:0032435 GO:0032502 GO:0032991 GO:0033036 GO:0033554 GO:0034613 GO:0034976 GO:0035556 GO:0036211 GO:0036503 GO:0042176 GO:0042177 GO:0042221 GO:0042771 GO:0042981 GO:0043021 GO:0043022 GO:0043067 GO:0043130 GO:0043161 GO:0043170 GO:0043226 GO:0043227 GO:0043229 GO:0043230 GO:0043231 GO:0043234 GO:0043412 GO:0043543 GO:0043632 GO:0044237 GO:0044238 GO:0044248 GO:0044260 GO:0044265 GO:0044267 GO:0044421 GO:0044424 GO:0044444 GO:0044445 GO:0044464 GO:0044699 GO:0044702 GO:0044710 GO:0044712 GO:0044763 GO:0044767 GO:0044877 GO:0045048 GO:0045184 GO:0045861 GO:0045995 GO:0048232 GO:0048513 GO:0048519 GO:0048523 GO:0048609 GO:0048856 GO:0048869 GO:0050789 GO:0050793 GO:0050794 GO:0050821 GO:0050896 GO:0051171 GO:0051172 GO:0051179 GO:0051205 GO:0051234 GO:0051239 GO:0051246 GO:0051248 GO:0051276 GO:0051603 GO:0051641 GO:0051716 GO:0060255 GO:0061136 GO:0061857 GO:0065007 GO:0065008 GO:0070059 GO:0070062 GO:0070192 GO:0070193 GO:0070628 GO:0070727 GO:0071704 GO:0071712 GO:0071816 GO:0071818 GO:0071840 GO:0072331 GO:0072332 GO:0072379 GO:0072657 GO:0080090 GO:0090150 GO:0097190 GO:0097193 GO:1901564 GO:1901565 GO:1901575 GO:1901698 GO:1901799 GO:1903046 GO:1903050 GO:1903051 GO:1903362 GO:1903363 GO:1903561 GO:2000026
## TRINITY_DN1002_c2_g1 GO:0003674 GO:0003676 GO:0003723 GO:0003729 GO:0005488 GO:0005515 GO:0005575 GO:0005634 GO:0006139 GO:0006396 GO:0006725 GO:0006807 GO:0008150 GO:0008152 GO:0009987 GO:0016070 GO:0031050 GO:0034641 GO:0035196 GO:0043170 GO:0043226 GO:0043227 GO:0043229 GO:0043231 GO:0044237 GO:0044238 GO:0044260 GO:0044424 GO:0044464 GO:0046483 GO:0070918 GO:0071704 GO:0090304 GO:0097159 GO:1901360 GO:1901363
## TRINITY_DN10030_c0_g1 GO:0000166 GO:0000346 GO:0003674 GO:0003676 GO:0003677 GO:0003723 GO:0005488 GO:0005575 GO:0005730 GO:0006405 GO:0006406 GO:0006810 GO:0006913 GO:0008150 GO:0008284 GO:0008327 GO:0009893 GO:0010604 GO:0010638 GO:0015931 GO:0016604 GO:0016607 GO:0019222 GO:0030529 GO:0031056 GO:0031058 GO:0031060 GO:0031062 GO:0031323 GO:0031325 GO:0031399 GO:0031401 GO:0032268 GO:0032270 GO:0032781 GO:0032991 GO:0033043 GO:0033044 GO:0035770 GO:0036094 GO:0036464 GO:0042127 GO:0043085 GO:0043226 GO:0043228 GO:0043229 GO:0043232 GO:0043234 GO:0043462 GO:0043565 GO:0044093 GO:0044422 GO:0044424 GO:0044428 GO:0044444 GO:0044446 GO:0044451 GO:0044464 GO:0046907 GO:0048518 GO:0048522 GO:0050657 GO:0050658 GO:0050789 GO:0050790 GO:0050794 GO:0051028 GO:0051095 GO:0051096 GO:0051128 GO:0051130 GO:0051168 GO:0051169 GO:0051171 GO:0051173 GO:0051179 GO:0051234 GO:0051236 GO:0051246 GO:0051247 GO:0051336 GO:0051345 GO:0051641 GO:0051649 GO:0060255 GO:0065007 GO:0065009 GO:0071702 GO:0071705 GO:0080090 GO:0097159 GO:1901265 GO:1901363 GO:1902275 GO:1905269 GO:1990904 GO:2001252
## TRINITY_DN10030_c2_g1 GO:0005575 GO:0005789 GO:0008150 GO:0009987 GO:0016020 GO:0016021 GO:0019725 GO:0031224 GO:0042592 GO:0044422 GO:0044424 GO:0044425 GO:0044432 GO:0044444 GO:0044446 GO:0044464 GO:0044699 GO:0044763 GO:0045454 GO:0050789 GO:0050794 GO:0065007 GO:0065008
## TRINITY_DN10034_c0_g1 GO:0000166 GO:0000226 GO:0003674 GO:0003676 GO:0003677 GO:0003678 GO:0003824 GO:0004386 GO:0005488 GO:0005524 GO:0005575 GO:0005634 GO:0006139 GO:0006259 GO:0006260 GO:0006270 GO:0006725 GO:0006807 GO:0006996 GO:0007010 GO:0007017 GO:0007051 GO:0007052 GO:0008150 GO:0008152 GO:0009058 GO:0009059 GO:0009987 GO:0016043 GO:0016462 GO:0016787 GO:0016817 GO:0016818 GO:0016887 GO:0017076 GO:0017111 GO:0022402 GO:0030554 GO:0032553 GO:0032555 GO:0032559 GO:0032991 GO:0034641 GO:0034645 GO:0035639 GO:0036094 GO:0042555 GO:0043167 GO:0043168 GO:0043170 GO:0043226 GO:0043227 GO:0043229 GO:0043231 GO:0043234 GO:0044237 GO:0044238 GO:0044249 GO:0044260 GO:0044424 GO:0044464 GO:0046483 GO:0071704 GO:0071840 GO:0090304 GO:0097159 GO:0097367 GO:1901265 GO:1901360 GO:1901363 GO:1901576 GO:1902850 GO:1903047
## TRINITY_DN10038_c0_g2 GO:0003674 GO:0003824 GO:0016301 GO:0016740 GO:0016772
Flatten ${genes_GO_list}
to have one GO accession per line.
# Load contents of .rvars into the environment
source .rvars
# Identify first field containing a GO term.
# Search file with grep for "GO:" and pipe to awk.
# Awk sets tab as field delimiter (-F'\t'), runs a for loop that looks for "GO:" (~/GO:/), and then prints the field number).
# Awk results are piped to sort, which sorts unique by number (-ug).
# Sort results are piped to head to retrieve the lowest value (i.e. the top of the list; "-n1").
begin_goterms=$(grep "GO:" ./analyses/"${genes_GO_list}" | awk -F'\t' '{for (i=1;i<=NF;i++) if($i ~/GO:/) print i}' | sort -ug | head -n1)
# Flatten GO terms annotation file.
# Expects tab-delimited input file where:
## First field (column) is Trinity gene IDs.
## Remaining fields (columns) are each individual GO accessions.
while read -r line
do
# Capture maximum number of fields to handle differing number of GO terms.
max_field=$(echo "$line" | awk -F'\t' '{print NF}')
# Set which fields are "fixed" (i.e. Trinity gene IDs)
fixed_fields=$(echo "$line" | cut -f1)
# Identifies if a line has GO accessions,
# reads them into an array,
# and then prints the Trinity Id and single GO accession on each line.
if (( "$max_field" < "$begin_goterms" )); then
printf "%s\n" "$line"
else
# Set range of GO accessions for each line
goterms=$(echo "$line" | cut -f"$begin_goterms"-"$max_field")
# Set Internal Field Separator to a <tab> and read $goterms into array.
IFS=$'\t' read -r -a array <<< "$goterms"
# Loop through array and print tab-delimited file of Trinity ID and single GO accession.
for element in "${!array[@]}"
do printf "%s\t%s\n" "$fixed_fields" "${array[$element]}"
done
fi
done < ./analyses/"${genes_GO_list}" > ./analyses/"${flattened_GO}"
# Check number of records
wc -l ./analyses/${flattened_GO}
echo ""
echo "${line}"
echo ""
head ./analyses/${flattened_GO}
## 550412 ./analyses/cbai_v3_1-SNPS_30Q-10x_GO.flattened-go.txt
##
##
##
## TRINITY_DN1000_c1_g1 GO:0005575
## TRINITY_DN1000_c1_g1 GO:0005634
## TRINITY_DN1000_c1_g1 GO:0005737
## TRINITY_DN1000_c1_g1 GO:0008150
## TRINITY_DN1000_c1_g1 GO:0010494
## TRINITY_DN1000_c1_g1 GO:0010717
## TRINITY_DN1000_c1_g1 GO:0030529
## TRINITY_DN1000_c1_g1 GO:0032991
## TRINITY_DN1000_c1_g1 GO:0035770
## TRINITY_DN1000_c1_g1 GO:0036464
Create file with path to ${flattened_GO}
file
Used to pass Bash variable to R.
source .rvars
echo "./analyses/${flattened_GO}" > .flattened_GO
Create R string with path to Bash variable, ${flattened_GO}
string <- paste(readLines(".flattened_GO"), collapse=" ")
print(string)
## [1] "./analyses/cbai_v3_1-SNPS_30Q-10x_GO.flattened-go.txt"
library(GSEABase)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: annotate
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: XML
## Loading required package: graph
##
## Attaching package: 'graph'
## The following object is masked from 'package:XML':
##
## addNode
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.4 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x stringr::boundary() masks graph::boundary()
## x dplyr::collapse() masks IRanges::collapse()
## x dplyr::combine() masks Biobase::combine(), BiocGenerics::combine()
## x dplyr::desc() masks IRanges::desc()
## x tidyr::expand() masks S4Vectors::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks S4Vectors::first()
## x dplyr::lag() masks stats::lag()
## x ggplot2::Position() masks BiocGenerics::Position(), base::Position()
## x purrr::reduce() masks IRanges::reduce()
## x dplyr::rename() masks S4Vectors::rename()
## x dplyr::select() masks AnnotationDbi::select()
## x dplyr::slice() masks IRanges::slice()
# Script to retrieve GOslims from Trinity-based, EdgeR GOseq differential gene expression.
# Identifies enriched and depleted output files.
# Requires "goslim_generic.obo" from http://geneontology.org/docs/go-subset-guide
## Get max number of fields
# Needed to handle reading in file with different number of columns in each row
max_fields <- max(na.omit((count.fields(string, sep = "\t", blank.lines.skip = TRUE))))
## Read in tab-delimited GOseq file
# Use "max_fields" to populate all columns with a sequentially numbered header
go_seqs <- read.table(string,
sep = "\t",
header = FALSE,
col.names = paste0("V",seq_len(max_fields)),
fill = TRUE)
## Grab just the individual GO terms from the "2nd" column)
goterms <- as.character(go_seqs$V2)
### Use GSEA to map GO terms to GOslims
## Store goterms as GSEA object
myCollection <- GOCollection(goterms)
##
## Use generic GOslim file to create a GOslim collection
# I downloaded goslim_generic.obo from http://geneontology.org/docs/go-subset-guide
# then i moved it to the R library for GSEABase in the extdata folder
# in addition to using the command here - I think they're both required.
slim <- getOBOCollection("~/data/goslim_generic.obo")
## Map GO terms to GOslims and select Biological Processes group
slims <- goSlim(myCollection, slim, "BP", verbose = TRUE)
# Rename first column
slims <- slims %>% rownames_to_column(var = "GOslim")
## Write output file
write.csv(slims, file = "./analyses/S10-SNPs_GO-GOslims.csv", quote = FALSE, row.names = FALSE)
# Remove GOslim accession for the generic "biological_process" term to improve visualization of other terms.
slims <- slims[slims$GOslim != "GO:0008150",]
# Create bar plot.
# "Open" PNG file for saving subsequent plot
pdf("./figures/S9-SNPs_GO-GOslims_barplot.pdf", height = 10, width = 12)
ggplot(data = slims, aes(x=slims$Percent, y=slims$Term)) +
labs(title = "",
caption = "Supplemental Figure 1. Barplot of percentages of gene ontology assignments to GOslims for transcripts containing at least on SNP.
Excludes the generic \"biological_process\" GOslim (55.53% of all GO terms) to aid in visualization of other GOslim categories.",
x = "Percent GO terms assigned to GOslim",
y = "GOslim") +
geom_bar(stat = "identity") +
theme(plot.caption = element_text(hjust = 0)
)
## Warning: Use of `slims$Percent` is discouraged. Use `Percent` instead.
## Warning: Use of `slims$Term` is discouraged. Use `Term` instead.
# Close PNG file
dev.off()
## png
## 2