Big Data Science with the BD2K-LINCS Data Coordination and Integration Center
Icahn School of Medicine at Mount Sinai via Coursera
Overview
The Library of Integrated Network-based Cellular Signatures (LINCS) was an NIH Common Fund program that ran for 10 years, from 2012 to 2021. The idea behind the LINCS program was to perturb different types of human cells with many kinds of perturbations: drugs and other small molecules; genetic manipulations such as single-gene knockdown, knockout, or overexpression; manipulation of the extracellular microenvironment, for example by growing cells on different surfaces; and more. These perturbations were applied to various types of human cells, including cancer cell lines and induced pluripotent stem cells (iPSCs) from patients that were differentiated into lineages such as neurons or cardiomyocytes. Then, to better understand the molecular networks affected by these perturbations, changes in the levels of many different molecules within the cells were measured, including mRNAs, proteins, and metabolites, as well as cellular phenotypic changes such as cell morphology. The BD2K-LINCS Data Coordination and Integration Center (DCIC) was commissioned to organize, analyze, visualize, and integrate this data with other relevant publicly available resources.
In this course, we introduce the LINCS DCIC and the various Data and Signature Generation Centers (DSGCs) that collected data for LINCS. We then cover the LINCS metadata and how it is linked to ontologies and dictionaries. Next, we present the data processing and normalization methods used to clean and harmonize the LINCS data, followed by a discussion of how the LINCS data is served through RESTful APIs. Most importantly, the course covers computational bioinformatics methods that can be applied to other multi-omics datasets and projects, including dimensionality reduction, clustering, gene-set enrichment analysis, interactive data visualization, and supervised learning.
Finally, we introduce crowdsourcing/citizen-science projects where students can work together in teams to extract gene expression signatures from public databases, and then query such collections of signatures against the LINCS data for predicting small molecules as potential therapeutics for a collection of complex human diseases.
Syllabus
- The Library of Integrated Network-based Cellular Signatures (LINCS) Program Overview
- This module provides an overview of the concept behind the LINCS program and tutorials on how to get started with the LINCS L1000 dataset.
- Metadata and Ontologies
- This module gives a broad, high-level description of the concepts behind metadata and ontologies and how they are applied to LINCS datasets.
- Serving Data with APIs
- In this module we explain the concept of accessing data through an application programming interface (API).
- Bioinformatics Pipelines
- This module describes the important concept of a bioinformatics pipeline.
- The Harmonizome
- This module describes a project that integrates many resources that contain knowledge about genes and proteins. The project is called the Harmonizome, and it is implemented as a web-server application available at: http://amp.pharm.mssm.edu/Harmonizome/
- Data Normalization
- This module describes the mathematical concepts behind data normalization.
- Data Clustering
- This module describes the mathematical concepts behind data clustering, a form of unsupervised learning: the identification of patterns within data without considering the labels associated with the data.
- Midterm Exam
- The Midterm Exam consists of 45 multiple-choice questions that cover modules 1-7. Some of the questions may require you to analyze new datasets with the methods you learned throughout the course.
- Enrichment Analysis
- This module introduces the important concept of performing gene set enrichment analyses. Enrichment analysis is the process of querying gene sets from genomics and proteomics studies against annotated gene sets collected from prior biological knowledge.
- Machine Learning
- This module describes the mathematical concepts of supervised machine learning, the process of making predictions from examples that associate observations/features/attributes with one or more properties that we wish to learn/predict.
- Benchmarking
- This module discusses how bioinformatics pipelines can be compared and evaluated.
- Interactive Data Visualization
- This module provides programming examples on how to get started with creating interactive web-based data visualization elements/figures.
- Crowdsourcing Projects
- This final module describes opportunities to work on LINCS-related projects that go beyond the course.
- Final Exam
- The Final Exam consists of 60 multiple-choice questions that cover all of the modules of the course. Some of the questions may require you to analyze new datasets with the methods you learned throughout the course.
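To give a flavor of the Enrichment Analysis module listed in the syllabus above, here is a minimal sketch of querying one gene set against one annotated gene set. It is an illustration only, not the course's or the LINCS tools' implementation: it assumes a right-tailed hypergeometric test (one common choice for overlap significance), the function name `enrichment_p_value` is hypothetical, and the gene lists and the background size of 20,000 genes are toy values chosen for the example.

```python
from math import comb


def enrichment_p_value(query, annotated, background_size):
    """Return (overlap size, right-tailed hypergeometric p-value).

    The p-value is the probability of seeing an overlap at least this large
    if the query genes were drawn at random from the background.
    """
    query, annotated = set(query), set(annotated)
    k = len(query & annotated)   # observed overlap
    n = len(query)               # genes in the query set
    K = len(annotated)           # genes in the annotated set
    N = background_size          # assumed number of genes in the background
    total = comb(N, n)
    # Sum the probabilities of every overlap size from k up to the maximum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return k, tail / total


# Toy data (hypothetical): a query gene set from an experiment and an
# annotated gene set representing prior biological knowledge.
query_genes = ["TP53", "MDM2", "CDKN1A", "BAX", "GAPDH"]
annotated_genes = ["TP53", "MDM2", "CDKN1A", "BAX", "BBC3", "ATM"]

overlap, p_value = enrichment_p_value(query_genes, annotated_genes, 20000)
print(f"overlap={overlap}, p-value={p_value:.3g}")
```

In practice the query would be compared against many annotated gene sets, so the resulting p-values would also need a multiple-testing correction, which this sketch omits.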
Taught by
Avi Ma’ayan