Basics of Data Science

RWTH Aachen University via edX

Details

Provider: edX
Pricing: Free Online Course (Audit)
Languages: English
Certificate: $99.00 Certificate Available
Duration & workload: 9 weeks, 5-11 hours a week
Sessions: On-Demand
Level: Intermediate
Subtitles: English

Overview

"Basics of Data Science" is designed to provide participants with a comprehensive overview of the fundamental challenges, concepts and tools of data science. The content can be organized in three main areas of data science:

Initially, a brief overview is given of data science infrastructure, which is concerned with volume and velocity. Topics include instrumentation, big data infrastructures and distributed systems, databases and data management. The main challenge is to make things scalable and instant.

The main focus of the course is on data analysis concerned with extracting knowledge from data. Key topics covered are data exploration and visualization, data preprocessing, data quality issues and transformations, various supervised learning techniques with a focus on their evaluation, unsupervised learning, clustering, pattern mining, process mining and text mining. The main challenge of data analysis is to provide answers to known and unknown unknowns.

Finally, data science affects people, organizations, and society. The course concludes by discussing challenges and providing guidelines and techniques for applying data science responsibly, with a focus on confidentiality and fairness. Topics include ethics & privacy, IT law, human-technology interaction, operations management, business models and entrepreneurship. The main challenge is to do all of the above in a responsible manner.

Throughout the course, the ideas and concepts conveyed in the videos are complemented by hands-on exercises using Python (Jupyter notebooks). Participants will be guided to apply the presented techniques on artificial and real-life data sets to gain valuable hands-on experience.

After the course, participants should have a good overview of the best practices, challenges, goals and concepts of the broader data science field, providing a strong foundation for further study or professional development in this rapidly evolving field. Combined with hands-on experience with commonly used Python libraries, this will enable participants to conceptualize and implement various basic data analysis techniques in their own projects and to accurately evaluate and interpret analysis results.

Syllabus

Week 1: Introduction, Data Exploration & Visualization

In the first half of the week, we will provide an overview of the course and illustrate the advantages and challenges when applying data science techniques. Students will get an overview of the data science pipeline, data sources and data types, data analysis techniques and challenges related to their application.

The second half of the week focuses on basic data exploration, visualization and transformation techniques.

Week 2: Supervised Learning Techniques

In the first half of this week, students will delve into data analysis using decision trees. We introduce the basic ID3 Algorithm and its extension to different notions of information gain, as well as pruning techniques, random forests and the applicability of decision trees to continuous data.
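The information gain criterion that ID3 uses to choose a split can be illustrated with a minimal sketch. The toy weather data and the function names below are hypothetical and are not taken from the course material.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the partitions created by the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" not at all.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")
```

ID3 would pick the attribute with the highest gain at each node, here "outlook", and recurse on the resulting partitions.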

The second half of the week is dedicated to a brief overview of other supervised learning techniques (students interested in details are referred to the "Basics of Machine Learning" course which is also part of the BridgingAI course series). These techniques include Linear Regression, Logistic Regression, Support Vector Machines (SVMs), Neural Networks and Naive Bayesian Classification.

Week 3: Evaluation of Supervised Learning, Data Quality & Preprocessing

The first half of this week is dedicated to the evaluation of supervised learning techniques and the models they produce. We introduce the confusion matrix, ROC curve, R² coefficient and cross-validation, including their extension and adaptation to specific goals or contexts. Furthermore, challenges and pitfalls regarding the evaluation and interpretation of supervised learning techniques are highlighted.
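As a rough illustration of the confusion matrix for a binary classifier, the following sketch counts the four cells and derives accuracy, precision and recall from them. The labels and predictions are made-up values chosen for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, prediction in zip(y_true, y_pred):
        if prediction == positive and truth == positive:
            tp += 1
        elif prediction == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical ground truth and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Precision and recall are undefined when their denominators are zero; a real evaluation would need to handle that case explicitly.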

In the second half of the week, students will learn about data quality issues, their causes and avoidance strategies, as well as possible approaches to dealing with outliers or missing values. Furthermore, an overview of data transformation, data reduction and normalization techniques is given.

Week 4: Clustering, Frequent Itemsets

In the first half of this week clustering is introduced as the first unsupervised learning technique. In particular, we present various similarity measures, the k-means and k-medoids algorithms, density-based clustering (DBSCAN) and give an overview of agglomerative clustering techniques and self-organizing maps (SOM).
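The k-means idea of alternating between assigning points to the nearest centroid and recomputing centroids can be sketched in plain Python. This is a simplified version with fixed, hand-picked initial centroids and squared Euclidean distance; the data points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop changing."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        updated = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Hypothetical 2D data with two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
```

In practice the result depends on the initial centroids, which is why library implementations typically run several random initializations.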

The second half of the week focuses on the introduction of frequent itemsets. Two algorithms to compute such itemsets are explained: the straightforward Apriori approach as well as the more efficient FP-Growth algorithm.
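A compact sketch of the level-wise Apriori idea is shown below: count candidates of size k, keep the frequent ones, then build size k+1 candidates only from itemsets whose subsets are all frequent. The shopping-basket transactions and the absolute support threshold are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for all itemsets occurring in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        current = {
            c
            for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = apriori(transactions, min_support=3)
```

With a threshold of 3, all four single items and the pair {bread, milk} are frequent; no triple can be, because no other pair survives.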

Week 5: Association Rule Mining, Sequence Mining

In this week, we build upon the concepts of frequent itemsets to generate and evaluate association rules. Furthermore, we use association rules to illustrate Simpson's paradox.
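Association rules are commonly evaluated with support, confidence and lift. The following sketch computes these three measures for a single rule; the transactions and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    joint = support(transactions, set(antecedent) | set(consequent))
    confidence = joint / support(transactions, antecedent)
    lift = confidence / support(transactions, consequent)
    return joint, confidence, lift


# Hypothetical transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
rule_support, rule_confidence, rule_lift = rule_metrics(transactions, {"bread"}, {"milk"})
```

A lift below 1, as in this example, indicates that the antecedent makes the consequent slightly less likely than its overall frequency, even though the confidence looks high.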

The second half of the week revolves around sequence mining, in particular the AprioriAll algorithm. The relationships between frequent itemsets, association rules, sequence mining and process mining (introduced in Week 6) are clarified.

Week 6: Process Mining

The whole week is dedicated to various aspects of process mining. We start out with an extensive introduction to the topic, including various types of models, tools and applications. Next, various approaches to process discovery are presented as the most prominent example of unsupervised learning in the context of process mining. Finally, supervised problems in process mining are discussed with the main focus on conformance checking techniques.

Week 7: Text Mining

In this week we explore the topic of text mining. Various approaches to text preprocessing are discussed, including corpus annotation, tokenization, stop word removal, token normalization, stemming and lemmatization, followed by an overview of modelling techniques, i.e., bag of words (BoW), the document-term matrix and TF-IDF scoring. We briefly discuss the inclusion of semantics using public databases (Linked Open Data) before proceeding with a detailed introduction to N-grams and their application to word prediction and text generation. These concepts are then extended in the discussion of word embeddings, particularly autoencoders, Word2vec, CBoW and Doc2vec.
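The use of N-grams for word prediction can be sketched with a bigram model that counts which word follows which and predicts the most frequent follower. Tokenization here is plain lowercasing and whitespace splitting, which is a simplification, and the tiny corpus is made up.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, how often each other word directly follows it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for previous, following in zip(tokens, tokens[1:]):
            model[previous][following] += 1
    return model


def predict_next(model, word):
    """Return the most frequent follower of a word, or None if the word is unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical corpus.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
]
model = train_bigrams(corpus)
next_after_the = predict_next(model, "the")
next_after_sat = predict_next(model, "sat")
```

Repeatedly feeding the predicted word back into `predict_next` gives a very crude form of text generation.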

Week 8: Responsible Data Science

In this week we discuss challenges and solution approaches to confidentiality and fairness in data science. The first half of the week is dedicated to confidentiality. We give a brief overview of data encryption before introducing various techniques to anonymize data while maintaining its usefulness for analysis and to objectively evaluate the level of anonymization.
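One widely used way to quantify the level of anonymization is k-anonymity. The course description does not name the measures it covers, so this choice is an assumption. The sketch below computes k as the size of the smallest group of records sharing the same quasi-identifier values; the records and column names are hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records with identical quasi-identifier values."""
    groups = Counter(tuple(record[q] for q in quasi_identifiers) for record in records)
    return min(groups.values())


# Hypothetical records with generalized age and ZIP code.
records = [
    {"age": "20-29", "zip": "520**", "diagnosis": "flu"},
    {"age": "20-29", "zip": "520**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "521**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "521**", "diagnosis": "diabetes"},
    {"age": "30-39", "zip": "521**", "diagnosis": "flu"},
]
k = k_anonymity(records, quasi_identifiers=["age", "zip"])
```

A higher k means each individual hides in a larger group, usually at the cost of coarser and therefore less useful attribute values.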

The second part of the week, focusing on fairness, introduces various metrics to objectively measure fairness and explores approaches to decrease discrimination of data science models and techniques. We conclude with a discussion of the potential trade-offs between model performance and model fairness.
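As one possible example of a fairness metric, the sketch below computes the demographic parity difference, i.e. the gap in positive prediction rates between two groups. The course description does not say which metrics are covered, so this metric, the predictions and the group labels are all assumptions for illustration.

```python
def positive_rate(predictions, groups, group):
    """Share of positive (1) predictions among members of one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    return sum(selected) / len(selected)


def demographic_parity_difference(predictions, groups, group_a, group_b):
    """Absolute difference in positive prediction rates between two groups."""
    rate_a = positive_rate(predictions, groups, group_a)
    rate_b = positive_rate(predictions, groups, group_b)
    return abs(rate_a - rate_b)


# Hypothetical binary predictions and protected-group membership.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap = demographic_parity_difference(predictions, groups, "a", "b")
```

A gap of 0 would mean both groups receive positive predictions at the same rate; reducing the gap can lower predictive performance, which is the trade-off mentioned above.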

Week 9: The Bigger Picture

In the final week, we briefly recap the contents of the course and discuss connections, trade-offs, conflicts and interactions between the various topics as well as their context and impact within the bigger picture of data science. An outlook on further perspectives and on topics omitted in this introductory course is given.