Mining of Massive Datasets (MMD)
by Anand Rajaraman and Jeffrey David Ullman
Publication Date: December 30, 2011 | ISBN-10: 1107015359 | ISBN-13: 978-1107015357
Data-Intensive Text Processing with MapReduce
by Jimmy Lin and Chris Dyer
Morgan & Claypool Publishers, 2010.
Hadoop Real World Solutions Cookbook (HRW)
by Jonathan R. Owens, Brian Femiano, and Jon Lentz
Publication Date: February 7, 2013 | ISBN-10: 1849519129 | ISBN-13: 978-1849519120

Supported by an AWS in Education Grant award and Hortonworks University


The field of computer science is experiencing a transition from computation-intensive to data-intensive problems, wherein data is produced in massive amounts by large sensor networks, new data acquisition techniques, simulations, and social networks. Efficiently extracting, interpreting, and learning from very large datasets requires a new generation of scalable algorithms as well as new data management technologies.

In this course we explore key data analysis and management techniques which, when applied to massive datasets, are the cornerstone of real-time decision making in distributed environments, business intelligence on the Web, and scientific discovery at large scale. In particular, we examine the MapReduce parallel computing paradigm and associated technologies such as distributed file systems, NoSQL databases, and stream computing engines. Additionally, we review machine learning methods that make it possible to analyze large volumes of data efficiently in near real time.

This course is highly interactive and based on the problem-based learning philosophy; students are expected to make use of said technologies to design highly scalable systems that can process and analyze Big Data for a variety of scientific, social, and environmental challenges.

Core topics:

The course is divided into three core topics:

  • Introduction to the Big Data problem. Current challenges, trends, and applications.
  • Algorithms for Big Data analysis. Mining and learning algorithms that have been developed specifically to deal with large datasets.
  • Technologies for Big Data management. Big Data technology and tools, with special attention to the MapReduce paradigm and the Hadoop ecosystem.

Course objectives:

By the end of this course, the student will be familiar with the fundamental concepts of Big Data management and analytics; will be competent in recognizing the challenges faced by applications that deal with very large volumes of data and in proposing scalable solutions for them; and will understand how Big Data impacts business intelligence, scientific discovery, and our day-to-day life.


Projects are the most important learning tool of this class. There will be a series of small projects (challenges) assigned by the instructor during the semester, and one final project defined by the student.


We will have multiple labs during the semester. These labs are based on the Hortonworks material and we will use their virtual machine for most of them. Labs are due exactly one week after they are assigned.

Class project

The final project is entirely at the discretion of the student (subject to instructor approval). Students are free to explore a problem of their interest and propose their own solution. The project has the following deliverables:

  • Proposal. A project proposal of at most 2 pages explaining why the problem is important, what has been done so far in the field, and what the expected outcomes are.
  • Presentations. Expect 3 presentations during the semester. Each one will detail a different aspect of your project, and preliminary results are expected.
  • Poster and report. A report of at most 10 pages consisting of the traditional sections: introduction, motivation, method, results, and conclusion.

During the course we will hold bi-weekly brainstorming sessions to discuss and strengthen every proposed project.

Projects will be done in teams of 3 graduate students, or in teams of 4 students if the team includes at least 1 undergraduate student.

Daily assignments and quizzes:

The student can expect simple exercises and quizzes at every meeting. Some of these daily assignments will be done in groups specified by the instructor, and they will count toward the participation grade of the course. Make-up assignments will be allowed only if the instructor or TA was informed of a documented absence before the quiz took place.


Class attendance:

Class attendance is expected and note-taking is encouraged. Important information (about assignments, projects, policies) may be communicated only in the lectures. We may also cover additional material (not available in the notes) during the lecture. If you miss a lecture, you should find out what material was covered and whether any announcements were made.


Exams are this course's formal evaluation tool. In the exams, students will be tested on the learning goals of this course. Exams will comprise a mix of practical exercises and conceptual questions. There will be only one midterm exam, about three-quarters of the way through the semester. The exam is open notes, but only handwritten notes are allowed.


Participation is the barometer of the class. It lets me determine whether the pace of the course is too fast or too slow, it helps me spot pitfalls and misconceptions, and it helps you reinforce the material you have learned.

Participation accounts for 15% of your final grade and will not be awarded automatically. You are required to participate either in class or electronically (through Piazza).


To facilitate interaction between students and to promote broader participation, I created a Piazza group. This is a discussion forum for the class, and members are expected to conduct themselves with respect by posting comments and replies only in the context of the course. Use the Piazza group to ask general questions about the homework, exams, and lectures. You can also paste small snippets of code to clarify an idea. Students are encouraged to answer each other's questions. Recall that your thoughtful participation in this forum counts toward your final grade.


I value students' opinions regarding the course and I will take them into consideration to make this course as exciting and engaging as possible. Thus, throughout the semester I will ask students for formal and informal feedback. Formal feedback includes short surveys on my teaching effectiveness, preferred teaching methods, and pace of the class. Informal feedback will be in the form of polls or in-class questions regarding learning preferences. You can also leave anonymous feedback in the form of a note in my departmental mail box, or using this form. Remember that it is in the best interest of the class that you bring it to my attention if something is not working properly (e.g., the pace of the class is too slow, the projects are boring, my teaching style is not effective) so that I can take corrective steps.


  • Participation 15 pts
  • Labs 25 pts
  • Project presentations 30 pts
  • Poster 5 pts
  • Final report 10 pts
  • Exam 15 pts

Grades will be based on your earned points, following the grade scale below. You need to earn the specified number of points or more to obtain the grade in the same column. Scores will be rounded to the closest integer value.

A	A-	B+	B	B-	C+	C	C-	D+	D	D-	F
95	90	87	83	80	77	73	70	67	63	60	<60
  • An Incomplete can be assigned only for a documented medical reason.
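
As a rough illustration of how the point totals and the grade scale combine, the lookup can be sketched in Python. This is only a sketch and not an official grading tool: the names MAX_POINTS, GRADE_SCALE, and letter_grade are made up for this example, and because the syllabus does not say how a score ending in exactly .5 is rounded, the sketch assumes halves round up.

```python
import math

# Maximum points per component, copied from the grading list (they sum to 100).
MAX_POINTS = {
    "participation": 15,
    "labs": 25,
    "presentations": 30,
    "poster": 5,
    "report": 10,
    "exam": 15,
}

# Minimum rounded score for each letter grade, copied from the grade scale,
# ordered from the highest threshold to the lowest.
GRADE_SCALE = [
    (95, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"), (60, "D-"),
]


def letter_grade(earned):
    """Map a dict of earned points per component to a letter grade."""
    total = sum(earned.values())
    # Assumption: ties such as 89.5 round up. Python's built-in round()
    # rounds halves to the nearest even integer, so it is not used here.
    rounded = math.floor(total + 0.5)
    for minimum, letter in GRADE_SCALE:
        if rounded >= minimum:
            return letter
    return "F"


example = {
    "participation": 14,
    "labs": 23.5,
    "presentations": 27,
    "poster": 5,
    "report": 9,
    "exam": 11,
}
print(letter_grade(example))
```

In this example the components add up to 89.5 points, which rounds to 90 under the round-half-up assumption and therefore falls in the A- column.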


In accordance with University Policy 2310 and the Americans with Disabilities Act (ADA), academic accommodations may be made for any student who notifies the instructor of the need for an accommodation. If you have a disability, either permanent or temporary, contact the Accessibility Resource Center at 277-3506 for additional information.


Introduction & Overview of available infrastructure (CARC Galles, Amazon EC2)

Big Data applications MMD Ch. 1

The MapReduce paradigm & Hadoop and HDFS overview MMD Ch. 2

Social media analysis

Challenge 1

PageRank

Apache Spark

Recommender systems MMD Ch. 9

Mahout, clustering, and classification MMD Ch. 12

Searching, indexing, and their implications for memory management HRW Ch. 1

Hadoop ecosystem and EC2 practice & Performance considerations and best practices HRW Ch. 3

Challenge 2

Finding similar items MMD Ch. 3

Minhashing & Locality-Sensitive Hashing (LSH) MMD Ch. 6

Frequent itemsets & Mining data streams MMD Ch. 4 & 6



Challenge 3

BigTable, Hive and Pig HRW Ch. 4

Architecting for the cloud HRW Ch. 2 & 9

Database evolution & NoSQL databases and MongoDB HRW Ch. 2

Visualization as a complementary data analysis tool HRW Ch. 1

Other approaches for performance improvement: MPI

Other approaches for performance improvement: CUDA

(TBD) Poster presentations

The schedule is under construction!