Code Snippets
A few useful code chunks.
Shell Basics
Most commands are for bash (shell) scripts.
Also, assumes usage of bash >=4.0.
R Markdown
Use Bash variables across chunks
Variables are saved to a “dot file” and that file needs to be sourced in each Bash chunk to have access to the Bash variables across Bash chunks.
The Bash variables set in the example below are:
-
${threads}
-
${my_fasta}
-
${samtools}
{bash save-bash-variables-to-rvars-file}
# Send text to export Bash variables to .rvars file
{
echo "# CPU threads"
echo 'export threads=8'
echo ""
echo "# Programs"
echo 'export my_fasta="~/data/temporary.fasta"'
echo 'export samtools="~/programs/samtools-1.12/samtools"'
echo ""
} > .rvars
In subsequent Bash chunks, load the variables into memory to use them:
{bash load-bash-variables}
# Load contents of .rvars into memory so varaibles are accessible
source .rvars
# Create FastA index file
"${samtools} faidx "${my_fasta}"
Git
Add files >100MB to .gitignore file
find ./* -size +100M | cat >> .gitignore
Run this from top directory of your repo.
This finds all files in your current directory (presumably a Git repo) greater than 100MB and writes the paths to those files in your .gitignore file.
FastQ files
Create separate arrays for R1 and R2 reads
-
With a for loop
# Declare arrays R1_array=() R2_array=() # Populate arrays for fastq in *R1.fq do R1_array+=(${fastq}) done for fastq in *R2.fq do R2_array+=(${fastq}) done
-
Using "globbing"
# Declare arrays R1_array=() R2_array=() # Populate arrays R1_array=(*R1.fq) R2_array=(*R2.fq)
-
Create comma-separated lists of FastQ reads
(E.g. This is useful when running
bowtie2
orTrinity
)R1_list=$(echo "${R1_array[@]}" | tr " " ",") R2_list=$(echo "${R2_array[@]}" | tr " " ",")
Creating single array with paired reads
## Assumes there is only a single set of paired reads per sample
# Declare array
fastq_array=()
# Populate array
# Corresponding reads will be placed next to each other in array
# (e.g. sample01_R1.fq sample01_R2.fq sample02_R1.fq samples02_R2.fq)
fastq_array=(*.fq)
-
Loop through single array of paired reads
## Assumes there is only a single set of paired reads per sample # Declare array fastq_array=() # Populate array fastq_array=(*.fq) # Loop through read pairs # Increment by 2 to process next pair of FastQ files for (( i=0; i<${#fastq_array[@]} ; i+=2 )) do echo "Read 1: ${fastq_array[i]}" echo "Read 2: ${fastq_array[i+1]}" done
-
Create comma-separated lists of paired FastQ reads
(E.g This is useful when running
bowtie2
orTrinity
)# Create comma-separated lists of FastQ reads # Loop through read pairs # Increment by 2 to process next pair of FastQ files for (( i=0; i<${#fastq_array[@]} ; i+=2 )) do # Check array length for even number (i.e. paire end FastQs) if [[ $(( "${#fastq_array[@]}" % 2 )) -ne 0 ]]; then echo "FastQ array contains uneven number of files." exit fi # Handle "fence post" problem # associated with comma placement if [[ ${i} -eq 0 ]]; then R1_list="${fastq_array[${i}]}," R2_list="${fastq_array[${i}+1]}," elif [[ ${i} -eq $(( ${#fastq_array[@]} - 1 )) ]]; then R1_list="${R1_list}${fastq_array[${i}]}" R2_list="${R2_list}${fastq_array[${i}+1]}" else R1_list="${R1_list}${fastq_array[${i}]}," R2_list="${R2_list}${fastq_array[${i}+1]}," fi done
File Transfers
Backing up Mox files
/volume2/web/seashell/bu-mox$
rsync -avz --exclude '*_to_*' --exclude 'CHG_*.txt' --exclude 'CHH_*.txt' --exclude 'CpG_*txt' \
--progress sr320@mox.hyak.uw.edu:/gscratch/scrubbed/sr320/ scrubbed/
/volume2/web/seashell/bu-mox$
rsync -avz --progress sr320@mox.hyak.uw.edu:/gscratch/srlab/sr320/ .
Backing up Raven files
/home/shared/8TB_HDD_01/sr320/github$
rsync -avz . \
sr320@gannet.fish.washington.edu:/volume2/web/seashell/bu-github/
wget a lot of files from url
wget -r \
--no-directories --no-parent \
-P . \
-A "*_001_val_1.fq.gz" https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/analyses/20190806_TrimGalore/
git clone website
when in public_html
mkdir temp
cd temp
git clone https://github.com/sr320/lab-website.git
cd ..
cp -r temp/lab-website/docs/* .
rm -f -r temp
echo "now done"
Transfer sequencing files to Owl
Standard rsync
procedure:
rsync --archive --progress --verbose *.fastq.gz <owl_username>@owl.fish.washington.edu:/volume1/web/nightingales/<species_directory>
-
Replace
<owl_username_>
with whatever username you use to login to owl (even replace the<
and the>
). -
Replace
<species_directory>
with whatever species you're working with (even replace the<
and the>
). Example directory name format:P_generosa
. -
If it doesn't work, Sam may need to change your user settings on Owl, so please post an issue in https://github.com/RobertsLab/resources/issues/
Using rsync
list of files:
rsync -avP --files-from=:/volume1/web/nightingales/P_generosa/rsync_list.txt owl:/volume1/web/ .
head rsync_list.txt
nightingales/P_generosa/Geoduck-ctenidia-RNA-1_S3_L001_R2_001.fastq.gz
nightingales/P_generosa/Geoduck-ctenidia-RNA-2_S11_L002_R2_001.fastq.gz
nightingales/P_generosa/Geoduck-ctenidia-RNA-3_S19_L003_R2_001.fastq.gz
nightingales/P_generosa/Geoduck-ctenidia-RNA-4_S27_L004_R2_001.fastq.gz
nightingales/P_generosa/Geoduck-ctenidia-RNA-5_S35_L005_R2_001.fastq.gz
nightingales/P_generosa/Geoduck-ctenidia-RNA-6_S43_L006_R2_001.fastq.gz
nightingales/P_generosa/Geoduck-ctenidia-RNA-7_S51_L007_R2_001.fastq.gz
nightingales/P_generosa/Geoduck-ctenidia-RNA-8_S59_L008_R2_001.fastq.gz
nightingales/P_generosa/Geoduck-gonad-RNA-1_S1_L001_R2_001.fastq.gz
nightingales/P_generosa/Geoduck-gonad-RNA-2_S9_L002_R2_001.fastq.gz
Confirm MD5 checksums
Multiple MD5 checksum files (Linux)
for checksum_file in *.md5
do
md5sum --check ${checksum_file}
done
Multiple MD5 checksum files (Mac OS)
for checksum_file in *.md5
do
# Gets filename without any suffixes
filename=$(basename -s .md5 ${checksum_file})
# Generates MD5 checksum and compares to provided checksum in MD5 file
diff <(md5 "${filename}.fastq.gz" | awk '{print $4}') <(awk ' {print $1}' ${checksum_file})
done
Download file from Google Drive
Install gdown
.
Ideally, a checksum for the file hosted on Google Drive exists and be can be subsequently verified after downloading.
gdown -O PGA_assembly.fasta https://drive.google.com/uc?id=1Yanmb5yBXn-D4b_fzkR2GSxP
Transfer files to/from Mox using Globus Connect Personal
-
Log into Mox.
-
Activate anaconda (this might fail, let me know if it does and don't bother going to the next step):
conda activate
-
Setup Globus collection:
/gscratch/srlab/programs/globusconnectpersonal-3.1.4/globusconnectpersonal -setup --no-gui
-
Follow the instructions (copy/paste URL into browser, get code from webpage, enter code in Mox terminal, provide name for collection).
-
Add desired Mox directory to config file and set permissions. Here's an example:
$cat ~/.globusonline/lta/config-paths
~/,0,1
/gscratch/scrubbed/samwhite/,0,1
The config file does two things:
~/,0,1
: Makes your home directory readable/writeable by Globus.-
/gscratch/scrubbed/samwhite/,0,1
: Makes my directory on/gscratch/scrubbed/
readable/writeable by Globus. -
Start Globus Connect Personal:
/gscratch/srlab/programs/globusconnectpersonal-3.1.4/globusconnectpersonal -start
. Nothing will happen after you hit enter. The cursor will simply flash - this is good. -
Login to your Globus Connect Personal account via a web browser.
-
Click on Collections and you should now see your collection (name provided in Step 4), and it should have a green stack of papers(?) next to it; the green indicates that the connection is activate.
-
Click on the collection name.
-
Click on "Open in File Manager" (on the right side of the screen).
-
Navigate to the directory you setup in Step 5. NOTE: You'll have to navigate up a directory out of your home directory in order to get to the
/gscratch
partition. -
Transfer data from other Globus Endpoint to Mox!
FastA
Filter FastA File by Minimum Sequence Length
Just change the number "200" in the code below to your desired minimum sequence length.
$ awk '!/^>/ { next } { getline seq } length(seq) >= 200 { print $0 "\n" seq }' InputFastaFile.fasta
Code explanation:
!/^>/ { next }
:
- If a line (i.e. record) begins with a “>”, go to the next line (record). The "!" tells awk to skip the regular expression that immediatley follows. The "^" tells awk that the regular expression it's looking for should only match if it's at the beginning of a line. Finally, the regular expression we're looking for in this example is the ">", which denotes the sequence descriptor portion of FASTA files.
{ getline seq }
:
- “getline” reads the next record and assigns the entire record to a variable called “seq”
length(seq) >=200
:
- If the length of the “seq” record is greater than, or equal to, 200 then…
{print $0 "\n" seq>}
:
- Print all records (
$0
) of the variable “seq” in the file that matched our conditions, each on a new line (“\n”)
fasta to tab-delimited
!perl -e '$count=0; $len=0; while(<>) {s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) {print "\n"} s/ |$/\t/; $count++; $_ .= "\t";} else {s/ //g; $len += length($_)} print $_;} print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n";' \
../data/GCF_000297895.1_oyster_v9_cds_from_genomic.fna > ../analyses/GCF_000297895.1_oyster_v9_cds_from_genomic.tab
FastQC
Pass space-delimited list of FastQ files to FastQC
# Set CPU threads to use
threads=20
# Populate array with FastQ files
fastq_array=(*.fq.gz)
# Pass array contents to new variable
fastqc_list=$(echo "${fastq_array[*]}")
# Run FastQC
# NOTE: Do NOT quote ${fastqc_list}
fastqc \
--threads ${threads} \
--outdir ${output_dir} \
${fastqc_list}
BLAST
Applications/bioinfo/ncbi-blast-2.11.0+/bin/blastx \
-query ../data/GCF_000297895.1_oyster_v9_cds_from_genomic.fna \
-db ../blastdb/Caenorhabditis_elegans.WBcel235.pep \
-out ../analyses/Cg-WBcel235_blastx.tab \
-evalue 1E-05 \
-num_threads 4 \
-max_target_seqs 1 \
-max_hsps 1 \
-outfmt "6 qaccver saccver evalue"
Tips & Tricks
Remove spaces from filenames in a directory
for file in *; do mv "$file" ${file// /}; done
Explanation:
-
for file in *;
-
A for loop that looks at all files in the current directory. The word
file
is a variable that takes on the value of each file name in the directory (one file name per loop). The;
is needed for bash for loop formatting. -
do mv "$file" ${file// /};
-
Tells bash to use the move command (
mv
) and use the current contents of the variable$file
as the initial filename. The${file// /}
is a substitution command that tells bash to use the contents of thefile
variable and replace all spaces (//
; note - there should be a space after the last slash here) with nothing (/
- you can add text after this slash to replace with information of your choice). The;
is needed for bash for loop formatting. -
done
- Ends the for loop