GATK4 CNV Analysis in Glioblastoma
Summary
Built an end-to-end, HPC-optimized copy number variation (CNV) pipeline for dbGaP-protected cancer RNA-seq data using GATK4 CNV, automating SRA-to-BAM processing and genome-wide CNV calling across large patient cohorts on the USF RRA cluster.
Project Highlights
- Designed a secure, large-scale workflow to convert SRA accessions to aligned, sorted BAM files (SRA → FASTQ → BAM) using Slurm job arrays, fasterq-dump, bwa-mem2, and samtools.
- Parallelized FASTQ extraction and alignment on scratch storage with timeout handling, enabling stable processing of thousands of samples without manual restarts.
- Constructed autosome-wide interval lists and a robust Panel of Normals from normal cohorts using GATK CreateReadCountPanelOfNormals to reduce technical noise in copy-ratio estimates.
- Automated CNV calling with GATK CollectReadCounts → DenoiseReadCounts → ModelSegments, generating standardized copy-ratio tracks and high-resolution CNV segments for each tumor sample.
- Implemented Python-based post-processing to map CNV segments to genes and assemble cohort-level gene-by-sample CNV matrices ready for survival analysis, immune feature integration, and downstream ML.
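
The segment-to-gene mapping step in the last highlight can be sketched as follows. This is a minimal illustration, not the project's actual code. It assumes segments have already been parsed into in-memory tuples of (contig, start, end, log2 copy ratio) with 1-based inclusive coordinates, and that a gene's value is the overlap-weighted mean of the segments covering it. The function names, sample IDs, and toy coordinates are all hypothetical.

```python
from collections import defaultdict


def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two 1-based, inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def gene_level_log2(segments, gene):
    """Overlap-weighted mean log2 copy ratio for one gene.

    segments: iterable of (contig, start, end, log2_copy_ratio)
    gene:     (name, contig, start, end)
    Returns None when no segment overlaps the gene.
    """
    _, g_contig, g_start, g_end = gene
    weighted_sum = 0.0
    covered = 0
    for s_contig, s_start, s_end, log2_cr in segments:
        if s_contig != g_contig:
            continue
        bp = overlap_bp(s_start, s_end, g_start, g_end)
        weighted_sum += bp * log2_cr
        covered += bp
    return weighted_sum / covered if covered else None


def build_gene_matrix(segments_by_sample, genes):
    """Assemble a nested dict {gene_name: {sample_id: log2 copy ratio or None}}."""
    matrix = defaultdict(dict)
    for sample, segments in segments_by_sample.items():
        for gene in genes:
            matrix[gene[0]][sample] = gene_level_log2(segments, gene)
    return dict(matrix)


if __name__ == "__main__":
    # Toy coordinates and values for illustration only (not real annotations).
    genes = [("EGFR", "chr7", 1000, 1999), ("CDKN2A", "chr9", 500, 599)]
    segments_by_sample = {
        "S1": [("chr7", 1, 1499, 0.0), ("chr7", 1500, 3000, 2.0),
               ("chr9", 1, 1000, -1.0)],
        "S2": [("chr7", 1, 3000, 0.5)],
    }
    for gene, row in build_gene_matrix(segments_by_sample, genes).items():
        print(gene, row)
```

Genes with no overlapping segment are kept as None rather than 0, so missing coverage is not mistaken for a copy-neutral call in downstream survival or ML analyses.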