Getting and Cleaning Data

Johns Hopkins University via Coursera

Details

Provider: Coursera
Pricing: Free Online Course (Audit)
Languages: English
Certificate: Paid Certificate Available
Duration & workload: 20 hours
Sessions: On-Demand
Subtitles: Arabic, French, Portuguese, Chinese, Italian, German, Russian, English, Spanish, Korean, Thai, Indonesian, Kazakh, Hindi, Swedish, Greek, Ukrainian, Japanese, Polish, Dutch, Turkish, Hungarian, Bengali, Pashto, Urdu, Azerbaijani, Farsi

Overview

Before you can work with data you have to get some. This course will cover the basic ways that data can be obtained. The course will cover obtaining data from the web, from APIs, from databases and from colleagues in various formats. It will also cover the basics of data cleaning and how to make data “tidy”. Tidy data dramatically speed downstream data analysis tasks. The course will also cover the components of a complete data set including raw data, processing instructions, codebooks, and processed data. The course will cover the basics needed for collecting, cleaning, and sharing data.

Syllabus

Week 1

In this first week of the course, we look at finding data and reading different file types.

Week 2

Welcome to Week 2 of Getting and Cleaning Data! The primary goal is to introduce you to the most common data storage systems and the appropriate tools to extract data from the web or from databases like MySQL.

Week 3

Welcome to Week 3 of Getting and Cleaning Data! This week the lectures will focus on organizing, merging and managing the data you have collected using the lectures from Weeks 1 and 2.

Week 4

Welcome to Week 4 of Getting and Cleaning Data! This week we finish up with lectures on text and date manipulation in R. In this final week we will also focus on peer grading of Course Projects.

Taught by

Jeff Leek

Reviews

3.4 rating, based on 58 Class Central reviews

4.5 rating at Coursera based on 8064 ratings

Life is Study

Getting and Cleaning Data is the third course in the first wave of Johns Hopkins's Data Science specialization track on Coursera. It is recommended that you take this course after The Data Scientist's Toolbox and R Programming courses.

The title of the course pretty well sums up the content: the entire class is about loading data into R and cleaning it up so that it can be used for data analysis. You'll learn how to load various data formats into R, such as JSON, XML, CSV and Excel files, and get data from other sources like MySQL and web APIs. The course also discusses subsetting data, adding variables, merging data, regular expressions and working with dates.

This course is a good summary of many of the things that are useful to know when trying to access and prepare data for analysis. Similar to R programming, it suffers from overuse of static slides with voice-overs, a lack of instructor face time and a lack of interactive content or in-lecture quizzes to help you learn and retain as you go along. You'll be introduced to many R packages and syntax that you probably won't remember after a week or two, but you'll be exposed to many common data formats so that you can refer back to the course materials or other web resources to deal with them in the future.
Anonymous

I'm a fresh beginner to R and my only experience with it is from the previous 2 courses in this specialization.

The lectures aren't so bad... they're a little bit boring and not engaging since they rarely are more than just a voiceover and slides. If that's important to you, don't take this class. However, I do think the instructors explain the lecture topics well and there is some value in their short walkthroughs.

Unfortunately... this only applies to the lecture topics... which are often only a small part of the quizzes and programming assignments. The previous course in the specialization, R Programming, was MUCH worse in this regard. That said, if your R background is minimal and you utilize outside sources (Stack Exchange, forums, etc.), you WILL learn a LOT. But at times you may feel a bit that the course itself didn't play a large role in the learning process, aside from giving you assignments.

Overall, the material is quite difficult for someone with no background (aside from the other courses in the specialization). I have to give a lot of props to the people in the discussion forums and the CTAs: these people really help close the gap and it's because of them that I keep pushing through!
Stephen B

Class information is very sparse. There's a huge gap between the (minimal) content provided in the lectures and the class project required for completion of the course. This is the worst constructed college course and worst MOOC I have ever encountered. I've completed 12 MOOCs, 2 bachelor's degrees, and several graduate courses at Stanford, so that is a distinction earned by Johns Hopkins U from among a very wide field. A complete overhaul of this course and series is desperately needed.
Anonymous

Dropping this course because there is such a disconnect between what is taught and what is expected to complete the project and quizzes. I found myself using external sources to learn all of the material necessary. Many of the questions are vague, leaving you spending hours trying to complete tasks only to realize that the objective is different and just not communicated effectively. There is no coherent order to how they deliver the material, teaching basic concepts in week 3 which should have been covered in week 1 or the prior course in R programming. So, I will just use others' tutorials to learn data science in R. Ridiculous that I wasted so much time on this!
Anonymous

Extremely frustrating class. I spent tons of time wondering what it is that I am actually supposed to do...

I am considering dropping the specialization.
Anonymous

Course is lacking any kind of logic or structure. It's simply methods/functions thrown one after another. Complete lack of perspective.
Anonymous

A rather poor and confusing course. The lectures are not so great. I'm rather disappointed with it. Normally these courses are rather good, but not this one.
Syed Aslam

I didn't learn much from the course lectures or materials; rather, I learned most from Stack Overflow. Really a big disappointment.
Anonymous

This is the third course in the series, and it's taken me this long to realize that everything I learn comes from external sources and not the course itself. If you do this, you'll learn something. If you don't, you'll lose your mind and waste a ton of time in the process.

I started out by watching the videos, taking copious notes and then realizing that I didn't have the information I needed to complete the assignments. I was very stressed about it until my friend -- who uses R programming regularly for work -- shrugged and said, "That's how it works in the real world. You search Stack Overflow or Github to find others who have already solved your problem. Don't go reinventing the wheel." This took a huge amount of pressure off.

I find that I look at others' solutions, work through them line by line to figure out the how and why of it, test to see if they're correct (about 40% of the time they don't appear to be), and learn by doing. I listen to the course videos in the background, and try to tie what they're talking about to actual code that I'm seeing, which makes a big difference.
Brandt Pence

This is the third course in the Data Science specialization. The course is all about how to read data of different formats into R and how to create tidy datasets (one variable per column, one observation per row, one observational unit type per table). There are brief introductions to reading datasets from online resources such as XML files, website APIs, and MySQL, and the quizzes for weeks 1 and 2 require you to work with these tools. Week 3 introduces subsetting and reshaping data and tools like dplyr, and week 4 introduces working with text strings and regular expressions.

I found this course to be quite a bit easier than the prerequisite R Programming. The quoted time commitment of 4-9 hours/week seems pretty reasonable, and I was probably at the lower end of that even though I front-loaded everything and finished by the middle of the second week of the course. The course project was fairly straightforward but also open-ended, and there was some concern on the discussion boards about how certain aspects of the project (e.g. descriptive variable names) would be evaluated.

All of the quizzes required a fair bit of programming, but nothing was too difficult. There were some technical hurdles in several of the quizzes that caused people problems. For example, R cannot read XML files over an https connection, and that caused some problems for quiz 1, although several solutions were quickly posted. Quiz 2 contained probably the most difficult programming task in the course, which required reading information from the instructor's Github account using the Github API. With some searching I found solutions online, but if you're having trouble and can't find good answers elsewhere, the forum will eventually help once a sufficient number of people get around to taking the quiz.

Overall, four stars. The course was fairly straightforward, and the information here may or may not be valuable to you depending on the type of data analyses you plan to perform and where/how your data are stored. I had previous experience with dplyr from the first course in the EdX PH525 series, so the most valuable portion of the course for me was the section on working with regular expressions.
Ramesh Natarajan

This course just provides an outline of the subject. It's up to you to figure out how to get the assignment done. Google and Stack Overflow are your instructors. Really! To make things worse, the course assignment instructions are very ambiguous, and you spend tons of time trying to understand the problem rather than solving it. If that's the intent of this course, they have succeeded in it, but when you have a course deadline (and a full-time job, as many of you do), it's extremely frustrating.
Andari Reksi

What was taught in the video materials is nothing compared to the quizzes and final project. At this point I'm still re-reading my final project assignment data, and although I can sense some things that need to be done to finish this project, it has taken me hours on Stack Overflow or other R blogs (just to make sure the command/formula I type is right). Very frustrating compared to other Coursera modules I finished. After this I may drop the Data Science specialisation altogether.
Anonymous

There is a complete disconnect between what is taught and what is expected in the project and tests. The course is pretty bad. I was considering doing the specialization in Data Science and this course is making me re-think this goal.

I understand that you need to be good at 'hacking' to be a good data scientist, but if that's the case, then what's the point of paying money to have to Google everything?
Mohd Azzani

It's not free at all.
Providing a demo doesn't mean free.
I tried enrolling in the so-called free course and couldn't do it without providing a credit card.
It provides a free demo, but the course itself is not free at all.
Hongmei Li

There is a significant gap between the video lectures and the assignments/quizzes.
Very horrible... I paid for the course certificate, and I can't retake it for free.
Michal

The course is part of a very good 'data science with R' program (I don't know the current name because it changes) available at Coursera.

The program is quite massive: it contains about 8 courses, but it is really thorough and well presented. It is designed with even complete beginners in mind, so you may start it without any prior knowledge.
Jason Michael Cherry

This course teaches a lot of extremely important skills in data science. No matter what you end up doing, dealing with data quality is going to be a part of it. This is a challenging class, and rightly so, as the work is tedious, but oh-so-important! The lectures do get a bit bland, but are informative.
Scott orr

Getting and Cleaning Data promises to teach students how to extract data from common data storage formats (including databases, specifically SQL, as well as XML, JSON, and HDF5), and from the web using APIs and web scraping. The syllabus also includes tips on using R to clean and recode data, and, in the last lecture, a long list of links to sources of data. It's also worth noting that the style of the video lectures is a bit different from those of other classes I've taken: there's never any video of the instructor, just the instructor's voice over the lecture notes.
Jevgeni Martjushev
Daniel Rosquete

OK, this course is really helpful!

Nothing in it is wasted; this course is a must for a data scientist!