If You’ve Got Data, Mine It Yourself: Ian Witten on Data Mining, Weka, and his MOOC
Not only is “big data” hot right now, but so is talking about it. An October 2012 Harvard Business Review article predicted ‘data scientist’ to be one of the hottest jobs of the 21st Century, and at this point, that prediction looks to have a fair chance of being right. There is even some backlash, and many people think ‘big data’ has become an overused buzzword. Buzzword or not, there is no question that we live in a world where the amount of data being collected is growing dramatically (witness U.S. NSA surveillance programs). Ian Witten drives home this point:
The Current Data Explosion
“The biggest thing that’s changed about the world is data. Past 20 years. It’s a whole new phenomenon, and people are just starting to think about it.”
Ian Witten is a professor of Computer Science at the University of Waikato in New Zealand. He is the original creator of Weka, a popular open-source data mining tool (downloaded a total of 4.9 million times so far) which allows end users to analyze their own data. Ian is teaching a new session of the free MOOC, ‘Data Mining with Weka’, starting on March 3, 2014, which lasts five weeks, and later in April will teach a continuation course, ‘More Data Mining with Weka’. His first ‘Data Mining with Weka’ MOOC was held in the fourth quarter of 2013 with 6,500 people signed up: about half were active, and about 1,050 earned certificates of completion. Charlie Chung from Class Central spoke with Ian about his thoughts on data mining and his MOOC.
First we talked about the different terms we hear: data mining, machine learning, statistical analysis. The language changes over time, but the terms refer to very similar things. Ian cited a quip that “data mining = statistics + marketing”, implying that if you can describe statistics in a way that grabs people’s attention, then what you are doing is data mining. Call it what you will, Weka is enabling new audiences to perform a wide variety of analyses (beyond its initial use by scientists crunching research data). Ian gives two examples to show the variety of applications, at opposite ends of the life-death continuum: embryologists use Weka to crunch data on over 60 variables to help select human embryos to implant during in-vitro fertilization to maximize the chances for a viable birth, and New Zealand cattle ranchers use Weka to select which cows to kill off during the low-grass season (apparently in NZ they have this quaint tradition of feeding cows their natural food source, grass). In between life and death, Weka is being used for all kinds of applications.
Weka enables domain experts to mine data
Weka is a Java-based environment for analyzing large datasets via a user interface geared toward end users, that is, domain experts. Who are these domain experts? They might be statisticians or programmers, but they need not be. They are people working in a field who have a large amount of data, understand how it’s collected, and have the context to properly interpret any results. But most important, they have well-formed questions or hypotheses that they want to examine by looking at the data. Ian explains the importance of domain expertise:
The Importance of Domain Knowledge
“To work with the data, you’ve got to understand the domain. My mission is to move data mining into the hands of the domain expert”
Those interested in data analysis may be aware that there are many other MOOCs on machine learning, some of them very popular, but Ian notes that their focus differs from his course, and there is not much participant overlap. Other MOOCs focus on how to implement machine learning algorithms, whereas the Weka MOOC focuses on utilizing them. But we pushed Ian on this point: isn’t it necessary to work through the implementation (or the math) of these algorithms in order to fully understand and use them properly? Ian wasn’t fazed by the question and gave the following analogy: “You drove your car in to work today without knowing the details of what’s happening inside that internal combustion engine–and certainly not down to the atomic level. You’ve got to stop somewhere. I think it’s more important to know about the limitations of what you’re doing than to understand those little intricate details”. Okay, that’s a fair point.
Approach to Mining Data
But data mining can be done well or poorly, and putting a powerful tool like Weka in inexperienced hands can surely lead to bad analysis at times. Ian acknowledges this and says “in some research papers [researchers using Weka] have come to some flaky conclusions…in a way that really isn’t really contributing in any way to science.” But he doesn’t lose sleep over this, as this would be true for any new tool or method, and instead prefers to focus on the many people that are enabled to gain new insights from it.
In terms of how data mining should be approached, Class Central asked Ian a much-discussed question: is it better to approach a data set with specific hypotheses, or to explore the data without preconceived notions to see what emerges? Ian firmly supports the former camp:
Using a Question-Driven Approach to Data Mining
“It’s not usually productive to just look at the data and see if you can find something interesting in it. A large part of the data mining problem is coming up with the question that you’re asking and refining that question”
About the Weka MOOC and the follow-on MOOC
Ian feels confident that taking the ‘Data Mining with Weka’ MOOC will adequately prepare participants to use Weka to do data analysis. Not all of Weka’s features will be covered, of course, but there is the ‘More Data Mining with Weka’ MOOC scheduled to start in late April (and if that goes well, Ian hinted that an ‘Advanced’ course may be planned in the future). The MOOC consists of a series of 5-10 minute videos, but the learning really occurs while doing the exercises. MOOC participants download the Weka environment and will receive several data sets to analyze (though if you happen to have your own data you want to analyze, you could probably apply what you’re learning, and look for any help you need on the discussion boards). The MOOC will also have TAs, former students from the first session, to help answer questions.
There’s also something that Ian says is unique about his data analysis MOOC: a discussion of ethics. He states, “I think it is extremely important. When you give people powerful tools, you need to make them aware of some of the issues that are involved in applying these tools…we at least raise the issue about data, and some different international perspectives in different countries about what you can do with data you’ve collected.” This aspect of data analysis surely is important in a post-Snowden, post-NSA world.
On Teaching the MOOC
When we asked Ian how his experience was teaching MOOCs, he basically said it was a lot of work–but in a good way. We’ll let him explain:
The Joy of Adequate Preparation Teaching a MOOC
“I re-learned the joy of having enough time to prepare teaching material. People can walk away, and with each video I ask, will this be the one that turns them off? It’s great pressure for an educator to have, this is the way it should be”
Well, in our book, Ian Witten scores a touchdown (American football–we don’t know the relevant rugby terms), which is 6 points, for being a passionate educator, something that was evident throughout the interview. But then he gets an extra two-point conversion on top of that: one point for creating Weka and giving it to the open-source community 20 years ago, and another point for teaching these MOOCs to train thousands of people who want to learn how to mine their own data and harness the power of Weka to draw out new and valuable insights.
Professor Ian Witten’s MOOC, ‘Data Mining with Weka’, starts its next session on March 3, 2014, and you can sign up to take it for free.
(You can find the full interview here on Class Central’s YouTube channel).
Vinod Satpute
Which algorithms should I use to predict energy consumption in an industry?