GATK4 CNV Analysis in Glioblastoma

Summary

Built an end-to-end, HPC-optimized copy number variation (CNV) pipeline for dbGaP-protected cancer RNA-seq data using GATK4 CNV, automating SRA-to-BAM processing and genome-wide CNV calling across large patient cohorts on the USF RRA cluster.

Project Highlights

Designed a secure, large-scale workflow to convert SRA accessions to aligned, sorted BAM files (SRA → FASTQ → BAM) using Slurm arrays, fasterq-dump, BWA-mem2, and samtools.
Parallelized FASTQ extraction and alignment on scratch storage with timeout handling, enabling stable processing of thousands of samples without manual restarts.
Constructed autosome-wide interval lists and built a Panel of Normals (PoN) from normal samples with GATK CreateReadCountPanelOfNormals to reduce technical noise in copy-ratio estimates.
Automated CNV calling with GATK CollectReadCounts → DenoiseReadCounts → ModelSegments, generating standardized copy-ratio tracks and high-resolution CNV segments for each tumor sample.
Implemented Python-based post-processing to map CNV segments to genes and assemble cohort-level gene-by-sample CNV matrices ready for survival analysis, immune feature integration, and downstream ML.
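
A minimal sketch of how the per-sample calling chain (CollectReadCounts → DenoiseReadCounts → ModelSegments) could be assembled for one tumor sample. The function name, file-naming scheme, and example paths are illustrative assumptions, not the project's actual code; the GATK flags shown are the standard long-form options for these tools, and the interval-merging rule is the one commonly used with binned CNV intervals.

```python
from pathlib import Path


def gatk_cnv_commands(sample_id, bam, intervals, pon, out_dir):
    """Build the three GATK command lines (as argument lists) for one sample.

    All names and the output file layout are hypothetical; only the tool
    names and flags come from GATK itself.
    """
    out = Path(out_dir)
    counts = out / f"{sample_id}.counts.hdf5"
    standardized = out / f"{sample_id}.standardizedCR.tsv"
    denoised = out / f"{sample_id}.denoisedCR.tsv"
    return [
        # 1. Count reads per interval in the tumor BAM.
        ["gatk", "CollectReadCounts",
         "-I", str(bam), "-L", str(intervals),
         "--interval-merging-rule", "OVERLAPPING_ONLY",
         "-O", str(counts)],
        # 2. Standardize and denoise the counts against the Panel of Normals.
        ["gatk", "DenoiseReadCounts",
         "-I", str(counts),
         "--count-panel-of-normals", str(pon),
         "--standardized-copy-ratios", str(standardized),
         "--denoised-copy-ratios", str(denoised)],
        # 3. Segment the denoised copy ratios.
        ["gatk", "ModelSegments",
         "--denoised-copy-ratios", str(denoised),
         "--output-prefix", sample_id,
         "-O", str(out)],
    ]


# Hypothetical inputs, printed rather than executed.
commands = gatk_cnv_commands(
    "TUMOR_001", "bam/TUMOR_001.sorted.bam",
    "ref/autosomes.interval_list", "pon/cnv.pon.hdf5", "cnv/TUMOR_001",
)
for cmd in commands:
    print(" ".join(cmd))
```

In a Slurm array job, each task could pass one of these lists to `subprocess.run(cmd, check=True)` so that a failing step stops that sample without affecting the others.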
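
The segment-to-gene step can be illustrated with a small, self-contained sketch. It assumes segments have already been parsed into `(contig, start, end, log2_copy_ratio)` tuples and that each gene is summarized by the length-weighted mean of the overlapping segments; that summary rule, the function names, and the toy coordinates are assumptions made for illustration, not necessarily what the project used.

```python
def gene_log2_ratio(gene, segments):
    """Length-weighted mean log2 copy ratio of the segments overlapping a gene.

    gene:     (contig, start, end), 1-based inclusive coordinates
    segments: iterable of (contig, start, end, log2_copy_ratio)
    Returns None when no segment overlaps the gene.
    """
    contig, g_start, g_end = gene
    weighted_sum, covered = 0.0, 0
    for s_contig, s_start, s_end, log2cr in segments:
        if s_contig != contig:
            continue
        # Overlap length in bases for 1-based inclusive intervals.
        overlap = min(g_end, s_end) - max(g_start, s_start) + 1
        if overlap > 0:
            weighted_sum += overlap * log2cr
            covered += overlap
    return weighted_sum / covered if covered else None


def build_gene_matrix(genes, segments_by_sample):
    """Return {gene: {sample: log2 ratio or None}} for the whole cohort."""
    return {
        name: {
            sample: gene_log2_ratio(coords, segs)
            for sample, segs in segments_by_sample.items()
        }
        for name, coords in genes.items()
    }


# Toy data with made-up coordinates, for illustration only.
genes = {
    "GENE_A": ("chr1", 100, 199),
    "GENE_B": ("chr1", 300, 399),
}
segments_by_sample = {
    "T1": [("chr1", 1, 149, 1.0), ("chr1", 150, 500, 0.0)],
    "T2": [("chr2", 1, 1000, -1.0)],
}
matrix = build_gene_matrix(genes, segments_by_sample)
print(matrix["GENE_A"]["T1"])  # 0.5: half the gene lies in each segment
```

Genes with no overlapping segment are kept as `None` rather than 0, so downstream survival or ML code can distinguish "no call" from "copy-neutral".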

GATK4 CNV, CNV Analysis, RNA-Seq, HPC, Slurm, BWA-mem2, samtools, Bash, Python, dbGaP

GitHub Blog (coming soon)