Genomics Pipelines, Tools, and Hackathons

Modified: February 1, 2025

A list of my genomics pipelines and tools developed at SCRI, Fred Hutch, and through external collaborations such as hackathons. These pipelines and tools were used to reproducibly process data and generate figures for many of the published journal articles on projects I supported.

Genomics Data Pipelines

1. CUT&RUN Nextflow workflow

  • Custom pipeline developed from the nf-core workflow template, following most nf-core best practices

  • Quality trimming of reads with Trim Galore, Bowtie2 alignment, SEACR peak calling, and optional MACS2 peak calling. MACS2 requires an effective genome size to call peaks, which can be provided directly or calculated on the fly with unique-kmers.py (see the sketch below). Coverage tracks are produced for visualization in IGV.

  • Generates general QC statistics on the FASTQs with FastQC, and on the alignments, peak calls, and sample similarity with deepTools. Finally, the QC reports are collected into a single report using MultiQC.

  • Fully documented using a parameterized Rmd that knits to HTML and GitHub-flavored Markdown formats.

  • Integrated for CI/CD with an Atlassian Bamboo agent that runs functional tests on small toy datasets and rebuilds the documentation.

  • The public version can be found here, but the most recent releases are in private repositories on SCRI Bitbucket.

DAG of the CUT&RUN pipeline
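As an illustration of the optional MACS2 step, here is a minimal DSL2 process sketch. The process name, the parameters (params.effective_genome_size, params.read_length), and the parsing of the unique-kmers.py report are assumptions for this sketch, not the pipeline's actual code.

```groovy
// Hypothetical sketch: call MACS2, estimating the effective genome size
// with unique-kmers.py (khmer) when it is not supplied as a parameter.
process MACS2_PEAKS {

    input:
    tuple val(sample_id), path(bam)
    path genome_fasta

    output:
    tuple val(sample_id), path("*_peaks.narrowPeak")

    script:
    def k = params.read_length ?: 50   // k-mer size ~ read length (assumption)
    """
    if [ -n "${params.effective_genome_size ?: ''}" ]; then
        GSIZE=${params.effective_genome_size}
    else
        # unique-kmers.py estimates the number of unique k-mers, a proxy for
        # the mappable genome size; this parse of its report is illustrative.
        GSIZE=\$(unique-kmers.py -k ${k} ${genome_fasta} 2>&1 \\
                 | awk '/unique/ {n=\$NF} END {print n}')
    fi
    macs2 callpeak -t ${bam} -g \$GSIZE -n ${sample_id} --outdir .
    """
}
```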

2. RNA-seq Quantification Nextflow workflow

  • Custom pipeline developed from the nf-core workflow template, following most nf-core best practices

  • Designed to output gene expression counts from bulk RNA-seq using the STAR aligner with --quantMode GeneCounts (sketched below). It also generates general QC statistics on the FASTQs with FastQC and on the alignments with RSeQC. Finally, the QC reports are collected into a single report using MultiQC.

  • Fully documented using a parameterized Rmd that knits to HTML and GitHub-flavored Markdown formats.

  • The public version can be found here, but the most recent releases are in private repositories on SCRI Bitbucket.

DAG of the RNA-seq pipeline
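A minimal sketch of the alignment-plus-quantification step, assuming a DSL2 process; the process name, channel shapes, and non-essential options are illustrative rather than the pipeline's exact code.

```groovy
// Hypothetical sketch: STAR alignment with built-in gene counting.
process STAR_ALIGN {

    input:
    tuple val(sample_id), path(reads)   // e.g. [ id, [ R1.fastq.gz, R2.fastq.gz ] ]
    path star_index

    output:
    tuple val(sample_id), path("*Aligned.sortedByCoord.out.bam"), emit: bam
    tuple val(sample_id), path("*ReadsPerGene.out.tab"),          emit: counts

    script:
    """
    STAR --runThreadN ${task.cpus} \\
         --genomeDir ${star_index} \\
         --readFilesIn ${reads} \\
         --readFilesCommand zcat \\
         --quantMode GeneCounts \\
         --outSAMtype BAM SortedByCoordinate \\
         --outFileNamePrefix ${sample_id}.
    """
}
```

With --quantMode GeneCounts, STAR writes per-gene read counts to ReadsPerGene.out.tab alongside the alignment, so no separate counting step is needed.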

3. RNA-seq Fusion Detection Nextflow workflow

  • Custom pipeline developed using Nextflow DSL2 syntax with module files.

  • Bulk RNA-seq fusion detection pipeline with the STAR aligner, STAR-Fusion, and FusionInspector. It retains the most relevant output files, such as SJ.out.tab, the aligned BAM, and the chimeric junctions file, as well as the FusionInspector HTML report.

  • The workflow also includes the CICERO fusion detection algorithm, which is run on the aligned BAM from the STAR output.

  • In addition, the FASTQ files undergo quality control checks, and the QC reports are aggregated into a MultiQC report.

  • Configured for SLURM, PBS Pro, and AWS Batch executors with containerized software (see the configuration sketch below). Outputs can be saved locally or configured to upload to AWS S3 buckets.

  • The public version can be found here, but the most recent releases are in private repositories on SCRI Bitbucket.

The STAR-Fusion algorithm
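A sketch of the kind of executor configuration this describes, written as Nextflow profiles; the queue names, region, and S3 paths are placeholders, not the pipeline's actual settings.

```groovy
// Hypothetical nextflow.config sketch: one profile per executor.
profiles {
    slurm {
        process.executor = 'slurm'
        process.queue    = 'general'              // placeholder queue name
    }
    pbspro {
        process.executor = 'pbspro'
    }
    awsbatch {
        process.executor = 'awsbatch'
        process.queue    = 'my-batch-queue'       // placeholder AWS Batch queue
        aws.region       = 'us-west-2'            // placeholder region
        workDir          = 's3://my-bucket/work'  // intermediates on S3
    }
}

// Software runs in containers regardless of executor
docker.enabled = true

// Results can stay local or be published to S3, e.g.
// params.outdir = 's3://my-bucket/results'
```

Selecting a profile at runtime (for example, nextflow run main.nf -profile awsbatch) is what switches the same pipeline between a local HPC scheduler and AWS Batch.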

Other Pipelines

Other pipelines built with Nextflow and nf-core tools include:

  • CUT&RUN and ChIP-seq heatmaps (also known as tornado plots; see the figure and the sketch below)

  • VCF variant annotation

  • Splitting ATAC-seq BAMs by fragment size

Tornado plots (Figure panels C, D, and E)
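As a minimal sketch of the heatmap step, assuming deepTools inside a DSL2 process; the process name, window sizes, and sort option are illustrative.

```groovy
// Hypothetical sketch of a tornado plot: coverage around region centers is
// summarized with computeMatrix, then rendered with plotHeatmap.
process TORNADO_PLOT {

    input:
    tuple val(sample_id), path(bigwig)
    path regions_bed

    output:
    path "${sample_id}_tornado.png"

    script:
    """
    computeMatrix reference-point \\
        --referencePoint center \\
        -S ${bigwig} \\
        -R ${regions_bed} \\
        -b 2000 -a 2000 \\
        -o ${sample_id}_matrix.gz

    plotHeatmap \\
        -m ${sample_id}_matrix.gz \\
        -o ${sample_id}_tornado.png \\
        --sortRegions descend
    """
}
```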

CI/CD and Reproducible Development Environments

I use GitHub Actions and Gitpod with Dev Containers to automate tasks, such as testing and building applications or websites, and to create shareable, reproducible development environments.

Reproducible development environments allow team members to actively contribute to a shared code repository, with each team member working in an identical, containerized compute environment defined by devcontainers.

GitHub Actions

My personal website can be found here

  • Uses an ‘on push’ trigger with GitHub-hosted runners
    1. the repository is checked out and cloned onto the Ubuntu build machine (a GitHub-hosted runner)
    2. Quarto is installed on the build machine
    3. quarto render builds the website
    4. the workflow artifacts (a tar archive of the rendered website files) are uploaded to a temporary server
    5. the artifacts (the rendered website) are deployed to the GitHub Pages URL

Gitpod Development Environment

I am working through the DevOps for Data Science lab exercises and recording the work in a GitHub repository using Gitpod:

  • the environment is generated directly from a clone of the public repository
  • a containerized environment is then built with Docker, orchestrated by Gitpod Desktop
  • automations install the required dependencies:
    • a reproducible R environment is managed with the renv package to install R packages
    • a reproducible Python environment is managed with the venv module to install Python packages
  • the entire project and an identical compute environment can be collaboratively developed using this link

Technical Skills and Courses

Additional details about my technical skills, including completed and in-progress courses:

  • roadmap.sh

    • Git and GitHub

    • Docker

    • Linux

    • Python

    • SQL

    • AI and Data Scientist

    • Data Analyst

  • freeCodeCamp

  • Codecademy

    • SQL programming course
      • Introduction to SQL Queries and relational databases
      • Completed: Nov 2024
    • Introduction to Linux: Users and Permissions
      • Useful refresher course
      • Completed: Jan 2025

Hackathon Projects

R Packages

  1. RNA-seq and multi-omics data analysis
  • DeGSEA

  • An R package developed for use in the Meshinchi Lab to help streamline the association of clinical covariates with RNA-seq and miRNA-seq expression data.

  2. RNA-seq fusion breakpoint data analysis
  • fusBreakpoint

  • An R package for sequence search in BAM files using R/Bioconductor

Bioinformatics / Statistical Analysis Notebooks

The analysis notebooks (primarily R Markdown) for all analyses from Fred Hutch can be found at Meshinchi Lab. Selected analysis notebooks and scripts from SCRI are hosted at the Research Scientific Computing GitHub repository.

References

These pipelines and tools were used to reproducibly process data for many of the published journal articles on projects I supported; see the publications page for details.