Projects
01
Single-Nucleus RNA-seq of Hepatoblastoma
Single-nucleus RNA-seq analysis of hepatoblastoma using Seurat, covering PCA-based dimensionality reduction and differential expression analysis.
Seurat, Differential Expression, PCA, Dimensionality Reduction
02
Tumor vs Normal TCR Classification
This project applies deep learning to classify tumor versus normal immune repertoires using TRB CDR3 sequences from lung cancer patients in the CPTAC dataset. The pipeline covers end-to-end preprocessing, variable-length sequence modeling, and interpretability.
Deep Learning, LSTM, PyTorch, Mean Pooling
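The variable-length handling can be illustrated with a small, framework-free sketch of masked mean pooling. This is not the project's PyTorch code: the vocabulary, the helper names (`encode`, `pad_batch`, `masked_mean_pool`), and the toy hidden states are all assumptions made for illustration. The idea is that padded positions must be excluded when averaging per-position outputs, so short and long CDR3 sequences are pooled fairly.

```python
# Illustrative sketch only: plain-Python stand-in for the padding and
# mean-pooling steps that would surround an LSTM in a real model.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = 0  # index reserved for padding (assumption)
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(seq):
    """Map a CDR3 amino-acid string to integer token ids."""
    return [AA_TO_INDEX[aa] for aa in seq]


def pad_batch(encoded):
    """Right-pad sequences to a common length and build a 0/1 mask."""
    max_len = max(len(s) for s in encoded)
    ids = [s + [PAD] * (max_len - len(s)) for s in encoded]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in encoded]
    return ids, mask


def masked_mean_pool(hidden, mask):
    """Average per-position vectors, ignoring padded positions.

    hidden: [batch][time][dim] nested lists (e.g. LSTM outputs)
    mask:   [batch][time] with 1 for real tokens, 0 for padding
    """
    pooled = []
    for states, keep in zip(hidden, mask):
        n_real = sum(keep)
        dim = len(states[0])
        pooled.append([
            sum(vec[d] for vec, k in zip(states, keep) if k) / n_real
            for d in range(dim)
        ])
    return pooled
```

In a PyTorch model the same masking would typically be done with tensor operations on the LSTM outputs; the pooled vector then feeds the tumor-versus-normal classifier head.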
03
SRA to BAM: Immune Receptor Extraction Pipeline
This modular Nextflow pipeline automates immune receptor region extraction from public sequencing datasets. It processes SRA accessions end to end, from FASTQ conversion and genome alignment to BAM slicing and cleanup, enabling high-throughput analysis of TCR/BCR regions for immunogenomic studies.
Nextflow, Conda, NCBI SRA Toolkit, STAR
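As a rough sketch of the stages described above, the function below assembles the command line for each step without running anything. It is not the actual Nextflow workflow: the function name `build_commands`, the directory layout, paired-end input, the use of `samtools` for slicing, and the example region are assumptions for illustration. The region string in the usage example is a placeholder, not a real receptor locus.

```python
# Illustrative sketch only: builds (does not execute) one command per stage.
def build_commands(accession, genome_dir, region, workdir="work", threads=4):
    """Return the per-stage commands for one SRA accession.

    Assumes paired-end reads and an existing STAR index in genome_dir.
    """
    out = f"{workdir}/{accession}"
    fastq_1 = f"{out}/{accession}_1.fastq"
    fastq_2 = f"{out}/{accession}_2.fastq"
    prefix = f"{out}/{accession}."
    # STAR appends this suffix when writing a coordinate-sorted BAM.
    aligned = f"{prefix}Aligned.sortedByCoord.out.bam"
    sliced = f"{out}/{accession}.receptor.bam"
    return [
        # 1. SRA accession -> FASTQ
        ["fasterq-dump", accession, "--split-files", "--outdir", out],
        # 2. FASTQ -> sorted BAM
        ["STAR", "--runThreadN", str(threads), "--genomeDir", genome_dir,
         "--readFilesIn", fastq_1, fastq_2,
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--outFileNamePrefix", prefix],
        # 3. Index, then slice out the region of interest
        ["samtools", "index", aligned],
        ["samtools", "view", "-b", "-o", sliced, aligned, region],
        # 4. Cleanup of large intermediates
        ["rm", "-f", fastq_1, fastq_2, aligned],
    ]


if __name__ == "__main__":
    # "chr7:1-1000" is a placeholder region, not a real locus coordinate.
    for cmd in build_commands("SRR000001", "/ref/star_index", "chr7:1-1000"):
        print(" ".join(cmd))
```

Keeping each stage as a separate command mirrors the modular layout of a workflow manager, where every step becomes its own process with declared inputs and outputs.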
04
Boston House Price Prediction
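This project builds an end-to-end machine learning pipeline with Python’s scikit-learn to predict Boston housing prices. It covers data cleaning, feature engineering, multivariable regression, diagnostic evaluation, and result interpretation, with visualizations in Matplotlib and Seaborn.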
Scikit-learn, Linear Regression, Docker, Machine Learning
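For readers unfamiliar with the fit-then-evaluate loop behind a regression model, here is a minimal dependency-free sketch. It uses a single feature and closed-form least squares as a simplified stand-in for the scikit-learn model in the project; the function names and the toy numbers are made up for illustration and are not taken from the Boston dataset.

```python
# Illustrative sketch only: one-feature ordinary least squares plus R^2.
def fit_line(x, y):
    """Closed-form least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Fraction of variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    rooms = [4.0, 5.0, 6.0, 7.0]       # toy feature values (assumption)
    prices = [15.0, 19.0, 24.0, 28.0]  # toy target values (assumption)
    slope, intercept = fit_line(rooms, prices)
    print(slope, intercept, r_squared(rooms, prices, slope, intercept))
```

A multivariable model generalizes the same idea to several features at once, and diagnostics such as residual plots check whether the linear assumption holds.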
05
Intrinsic Disorder Analysis of Immune Receptor CDR3 Regions
This project investigates the structural flexibility of TCR and BCR CDR3 regions by predicting their intrinsic disorder using metapredict. I built a custom pipeline to extract receptor sequences from cancer patient BAM files, run disorder analysis, and compare CDR3 segments to V and J regions to highlight unique biophysical properties.
metapredict, SAMtools, GDC (Data Transfer Tool), HPC
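The comparison step can be sketched as follows, assuming per-residue disorder scores have already been produced by a predictor such as metapredict. The function name `mean_disorder_by_segment`, the boundary convention, and the example scores are hypothetical and only illustrate how CDR3 residues might be compared with the flanking V and J residues.

```python
# Illustrative sketch only: summarizes precomputed per-residue disorder scores.
def mean_disorder_by_segment(scores, cdr3_start, cdr3_end):
    """Average disorder score for the V, CDR3, and J parts of one receptor.

    scores:     per-residue disorder scores for the full sequence
    cdr3_start: index of the first CDR3 residue (0-based, inclusive)
    cdr3_end:   index just past the last CDR3 residue (exclusive)
    """
    segments = {
        "V": scores[:cdr3_start],
        "CDR3": scores[cdr3_start:cdr3_end],
        "J": scores[cdr3_end:],
    }
    # Skip empty segments so a missing flank does not cause a division by zero.
    return {name: sum(vals) / len(vals) for name, vals in segments.items() if vals}


if __name__ == "__main__":
    toy_scores = [0.1, 0.1, 0.5, 0.7, 0.2, 0.2]  # made-up values
    print(mean_disorder_by_segment(toy_scores, cdr3_start=2, cdr3_end=4))
```

Aggregating these per-receptor means across many sequences is one simple way to ask whether CDR3 segments tend to be more disordered than their V and J neighbors.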
06
Tumor Severity Prediction from RNA-Seq using ML & Deep Learning
I developed a reproducible ML/DL pipeline to classify tumor severity in lung adenocarcinoma from TCGA-LUAD RNA-seq data. I applied an ANOVA F-test for gene selection, benchmarked classical models against an Optuna-tuned CNN, and achieved 79% accuracy. The project is Dockerized and includes a Nextflow pipeline intended for scaling up optimization in future work.
Keras, Optuna, CNN, RNA-Seq
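The gene-selection step can be illustrated with a plain-Python one-way ANOVA F statistic. This is a sketch under assumptions, not the project's code: the function names, the dictionary layout of the expression data, and the toy values are made up. Genes whose expression differs most between severity classes relative to within-class spread get the highest F scores and are kept.

```python
# Illustrative sketch only: one-way ANOVA F statistic and top-N gene ranking.
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        # No within-class spread: perfectly separated or entirely constant.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def top_genes(expression, labels, n_keep):
    """Rank genes by F score across classes and keep the n_keep best.

    expression: {gene_name: [value per sample]}
    labels:     class label per sample, in the same sample order
    """
    classes = sorted(set(labels))
    scores = {}
    for gene, values in expression.items():
        groups = [[v for v, lab in zip(values, labels) if lab == c] for c in classes]
        scores[gene] = anova_f(groups)
    return sorted(scores, key=scores.get, reverse=True)[:n_keep]
```

In practice a library routine would normally handle this filtering; writing it out shows why genes with large between-class differences and small within-class variance are the ones that survive.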