Genes and Geography - A Bioinformatics Project

OMGenomics via YouTube

Overview

Follow a bioinformatics project walkthrough that explores the relationship between genes and geography by analyzing population genotype data. Learn to run Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) on genetic data from the 1000 Genomes Project. Step-by-step instructions cover downloading and inspecting VCF files, parsing them with pysam, converting alleles to numbers for a NumPy array, and saving the data with pandas. Along the way, see when to work in a Python script versus Google Colab, and plot the results with both matplotlib and Altair. Color the data points by ancestry label using the sample panel file, then merge in additional population information from igsr_population.tsv. The video closes with an exercise on performing PCA on the SNPs and the origin story behind the project.
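
The allele-to-number and PCA steps can be sketched roughly as follows. This is a minimal illustration and not the code from the video: the toy genotypes are made up, the helper names `encode_genotypes` and `pca` are hypothetical, and PCA is done with a plain NumPy SVD instead of whichever library the walkthrough uses. It assumes each genotype arrives as a tuple of allele indices such as `(0, 1)`, which is the form pysam typically reports for a sample's `GT` field, with `None` for a missing call.

```python
import numpy as np


def encode_genotypes(genotypes):
    """Convert allele tuples into alternate-allele counts (0, 1, or 2).

    `genotypes` is a list of variants; each variant is a list with one
    allele tuple per sample. Missing alleles (None) are counted as
    reference here, which is a simplification for this sketch.
    """
    return np.array(
        [
            [sum(1 for allele in gt if allele is not None and allele > 0) for gt in row]
            for row in genotypes
        ],
        dtype=float,
    )


def pca(matrix, n_components=2):
    """Project the rows of `matrix` onto their top principal components via SVD."""
    centered = matrix - matrix.mean(axis=0)  # center each column (feature)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Made-up toy data: 4 variants (rows) x 4 samples (columns).
toy = [
    [(0, 0), (0, 1), (1, 1), (1, 1)],
    [(0, 0), (0, 0), (1, 1), (0, 1)],
    [(0, 1), (0, 0), (1, 1), (1, 1)],
    [(0, 0), (0, 0), (0, 1), (1, 1)],
]

counts = encode_genotypes(toy)  # shape: (variants, samples)
coords = pca(counts.T)          # transpose so each row is one sample
```

Each row of `coords` gives one sample's position on the first two components, which is what would be plotted and then colored by population label.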

Syllabus

Intro
Hunting for data
Inspecting the VCF
Finding population labels for the samples
Parsing VCF with pysam
Going from alleles to numbers for a numpy array
When to work in Colab versus a Python script
Saving data with pandas
Adding population labels from the panel file
To Colab!
PCA
First plot! Mission accomplished
Using Altair for plotting with labels
Second plot with population labels!
Merging with the igsr_population.tsv data
TSNE
Exercise: PCA on the SNPs
Conclusion and origin story for this project

Taught by

OMGenomics
