GitHub Curation – Sam’s Notebook

Updated a couple of GitHub Wikis:

Created a new repo in the RobertsLab Organization GitHub account with a wiki to provide an overview of how to use of Hyak (mox) computing node. This was lightly modified from what Sean already had in his personal repo.

As a quick test, I updated all the md files in the sr320/LabDocs/code md files to format headers for GitHub’s newest interpretation of headers. The headers (represented by a series of ‘#’) require a space between them and the subsequent text. I used the following command in bash:

for i in *.md; do sed -i.bak 's/^#*/& /g' "$i"; done

The code works as follows:

Run for loop on all .md files in the directory
Use sed to edit the files in place: -i.bak (this command structure is needed for Mac OS X).
's/^#*/& /g': Performs a substitution by identifying all lines beginning (^) with a pound symbol (#) and match zero or more occurrences of the pound symbol (*), then substituting the same pattern that was matched and adding a space at the end of the pattern (& ). Do this for all occurrences found within the document (g).

Since this worked, I’ll probably run this through all of the md files in all of our various repos to quickly and easily fix header formatting issues.

Working on updating the Genome-sequencing-December-2016-(UW-PacBio) wiki, but need to work out the kinks on any easy, documentable way to rename and move some files around in order to make files/organization compliant with our [data management plan (DMP)(https://github.com/sr320/LabDocs/wiki/Data-Management#ngs-data-management-plan).

Current strategy:

Generate MD5 checksums for fastq files for each of the SMRT cell runs.
Copy file names from the the .xml file in the top level of each SMRT cell run folder to an array.
Use parameter substitution (in bash) to strip path and suffix from each index of the array (results likely stored in a secondary or tertiary array).
Use bash find command to copy the filtered_subreads.fastq.gz from each SMRT cell run folder, and append each of the corresponding stripped filenames in the final array to the beginning of the fastq file, to the owl/nightingales/O_lurida directory.
Generate new MD5 checksums on the copied files and compare to original MD5 checksums. This will confirm two things: 1 - The data did not get corrupted when copied. 2 - The new filenames correspond to the correct, original filtered_subreads.fastq.gz file (renaming a file doesn’t alter the MD5 checksum).
Archive the original SMRT cell run folders (which contain a ton of metdata files)