Data Management - High-throughput Sequencing Data

We’ve had a recent influx of sequencing data, which is great, but it created a bit of a backlog in documenting what we’ve received.

I updated our Google Sheet (Nightingales) with the geoduck genome sequencing data from BGI, the Olympia oyster genome sequencing data from BGI, and the MBD bisulfite sequencing data from ZymoResearch.

I also fixed the “FileLocation” column by replacing the “HYPERLINK” function with “CONCATENATE”.

Google Sheet: Nightingales

After updating the Nightingales Google Sheet, I updated the corresponding Google Fusion Table (also called Nightingales).

To update the Fusion Table, you have to do the following:

  • Delete all rows in the Nightingales Google Fusion Table (Edit > Delete all rows)

  • Import data from the Nightingales Google Sheet (File > Import more rows…)

Fusion Table: Nightingales

At first glance, the Fusion Table looks the same as the Google Sheet. However, if you follow the link to the full Fusion Table, it offers some unique ways to visually explore the data.

After that, I decided to deal with the fact that many of the directories on Owl lack readme files and, consequently, any information about the sequencing files in those folders.

So, I took an inordinate amount of time to write a script that would automate as much of the process as I could think of.

The script is here (GitHub):

The goal of the script is to perform the following:

  • Identify folders that do not have readme files.

  • Identify folders that do not have checksum files.

  • Create readme files in those directories lacking readme files.

  • Append the directory path to each new readme file.

  • Append sequencing file names and corresponding read counts to the new readme files.
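
For anyone who wants to see the general shape of those steps, here is a rough Python sketch. It is not the script linked above, just an illustration of the same goals. The file names `readme.md` and `checksums.md5`, the list of FASTQ extensions, and all function names are assumptions made for the example, and read counts assume standard four-line FASTQ records.

```python
import gzip
from pathlib import Path

# Hypothetical names: the real readme/checksum file names may differ.
README_NAME = "readme.md"
CHECKSUM_NAME = "checksums.md5"
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def count_reads(fastq_path):
    """Count reads in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if fastq_path.name.endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4


def dirs_missing(root, filename):
    """Return root and all nested directories that do not contain filename."""
    root = Path(root)
    candidates = [root] + [p for p in root.rglob("*") if p.is_dir()]
    return sorted(d for d in candidates if not (d / filename).is_file())


def write_readme(directory):
    """Create a readme listing the directory path and per-file read counts."""
    directory = Path(directory)
    lines = [
        f"Directory: {directory.resolve()}",
        "",
        "Sequencing files and read counts:",
    ]
    for f in sorted(directory.iterdir()):
        if f.is_file() and f.name.endswith(FASTQ_SUFFIXES):
            lines.append(f"{f.name}\t{count_reads(f)}")
    readme = directory / README_NAME
    # Append mode, so an unexpected existing file is never overwritten.
    with open(readme, "a") as out:
        out.write("\n".join(lines) + "\n")
    return readme


def document_directories(root):
    """Report directories lacking readme/checksum files; create the readmes."""
    missing_readme = dirs_missing(root, README_NAME)
    missing_checksums = dirs_missing(root, CHECKSUM_NAME)
    for d in missing_readme:
        write_readme(d)
    return missing_readme, missing_checksums
```

Calling `document_directories("/path/to/sequencing/data")` (a placeholder path) returns the two lists of directories, so the ones missing checksum files can be followed up separately. Counting lines and dividing by four is quick, but it will miscount any FASTQ file whose sequences wrap across multiple lines.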

Will run the script. Hope it works…