Mining of Massive DataSets - A Review - cat /dev/random

If you’ve been keeping track, I’ve hardly published a single article in the last month. In my defense, the last 7 months have been extremely hectic. High work pressure and a rather dense online course kept most of the drafts from seeing the light of day. Before I jump into reviewing that course, Mining Massive DataSets (MMDS), here’s a short story for some context.

I first stumbled onto MMDS, or CS246 as it’s called at Stanford, a graduate-level course on (you guessed it) data mining, in early 2012, when I had recently finished Andrew Ng’s course on Machine Learning. With professors like Anand Rajaraman (of Amazon) and Jeff Ullman teaching the course and making their book freely available, I got quite interested and wished that it would be offered on Coursera some day. Fast forward 2 years and I see a mail from Coursera informing me that the course is up for grabs. Without hesitating, I signed up and waited eagerly for the course to start.

The course lasted around 8 weeks and comprised long lectures, quizzes and a final exam, which I took just a couple of days back. MMDS came up on HN, where I also posted a similar review. I’m reposting it here with minor changes. For the impatient ones, a TL;DR -

If you’re interested in Machine Learning and Data Mining and want to learn what kinds of challenges huge datasets pose when applying standard algorithms, then you’ll find this course extremely valuable.

For the rest of you, here are a few things that I liked -

Faculty

Like most MOOCs, MMDS is taught by some of the best faculty in the field. I had been an avid follower of Anand Rajaraman’s blog before I joined this course, and I have to say the enthusiasm of the faculty is infectious and their expertise with the material is markedly evident.

Difficulty

MMDS is a CS graduate-level course (CS246) from Stanford. That means the topics are not trivial, the lectures are dense and you as a student are expected to invest significant time into understanding the material. On average I spent around 6-8 hours per week on the lectures and quizzes. Because the material is hard, grasping the concepts and getting the quizzes right is quite gratifying. There’s also an advanced section for students who want to challenge themselves more. As an incentive, a certificate of achievement with distinction is awarded to these students.

Material

The syllabus and the topics covered in this course are extremely relevant for anyone aspiring to work in the data mining / machine learning field. Having done Andrew Ng’s ML course, I found that this course acts as a perfect supplement and covers a lot of the practical aspects of implementing the algorithms when they are applied to massive data sets. For example, a recent lecture talked about how the BFR algorithm for finding clusters works better than k-means for a very large dataset.
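To make the BFR example a little more concrete, here is a minimal Python sketch of the idea that lets it scale: each cluster is kept as a small summary (point count, per-dimension sum, per-dimension sum of squares) instead of a list of points, so memory does not grow with the dataset. This is an illustration only, not code from the course or the book; the class name `ClusterSummary` and its method names are made up for this sketch, and it assumes every dimension has non-zero variance.

```python
import math


class ClusterSummary:
    """Constant-size summary of a cluster: N, SUM and SUMSQ per dimension."""

    def __init__(self, dims):
        self.n = 0                    # number of points absorbed so far
        self.sum = [0.0] * dims       # per-dimension sum of coordinates
        self.sumsq = [0.0] * dims     # per-dimension sum of squared coordinates

    def add(self, point):
        """Absorb one point; the point itself can then be discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def merge(self, other):
        """Combine two summaries by adding their statistics."""
        self.n += other.n
        self.sum = [a + b for a, b in zip(self.sum, other.sum)]
        self.sumsq = [a + b for a, b in zip(self.sumsq, other.sumsq)]

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Per-dimension variance: E[x^2] - (E[x])^2
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]

    def mahalanobis(self, point):
        """Distance to the centroid, scaled by each dimension's std deviation.

        Assumes non-zero variance in every dimension (an assumption of this
        sketch; a real implementation would need to handle that case).
        """
        c, v = self.centroid(), self.variance()
        return math.sqrt(sum((x - ci) ** 2 / vi
                             for x, ci, vi in zip(point, c, v)))


# Usage: summarise three 2-D points, then score a new point against the cluster.
summary = ClusterSummary(dims=2)
for p in [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]:
    summary.add(p)
distance = summary.mahalanobis((4.0, 4.0))
```

A new point would be assigned to a cluster when this distance falls below some threshold; the threshold value is a tuning choice and is left out of the sketch.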

Book

The accompanying MMDS book is just awesome and the lectures build upon the content and examples from it. If you find the book a bit too challenging (perhaps because your math is a bit rusty), the lectures make the material quite approachable.

Final Exam

This was my first course with a final exam and, in my opinion, it made the experience more rewarding. Two exams of 3 hours and 2 hours did take a toll, but revising the content at the end helped me build a mental model of the concepts and grasp the big picture better, which made the learning more fruitful.

And a few things which could’ve been better -

Theoretical

The course is primarily theoretical in both its presentation and exercises. This is not to say that algorithms are presented without examples, but that the examples (and the quizzes even more so) are trivial and do not do a great job of illustrating the issues with implementing or applying various algorithms to real-life datasets.

Programming Assignments

In sharp contrast to Andrew Ng’s course, there are no compulsory programming assignments. The exercises are all quizzes which check how well you have understood the concepts. There is just one programming assignment, and it is optional.

Conclusion

Overall, I’m really glad I did this course. The professors make a point of citing industry examples wherever relevant (the PageRank algorithm and Google’s accompanying implementation were covered over 3 lectures), which is a welcome change from other CS courses. Along with the book, I believe the course is a wonderful primer to the field of Data Mining.

PRAKHAR SRIVASTAV
