Trilce Estrada

CS 467/567 Introduction to Big Data

Course Information

Class time: T/R 11:00 - 12:15

Classroom: ME 300

Prerequisites: Fluency in at least one of Python, Java, or Scala

Preferred: Background in Machine Learning, Data Mining, or Statistics

Piazza link: piazza.com/unm/fall2019/cs467567

Instructor: Trilce Estrada

Office: FEC 2390

Office hours: Tuesday 1:00 to 3:00

Appointments: https://calendly.com/trilce-estrada

Teaching Assistant: TBD

Office: TBD

Office hours: TBD

Course Description

The field of computer science is experiencing a transition from computation-intensive to data-intensive problems, wherein data is produced in massive amounts by large sensor networks, new data acquisition techniques, simulations, and social networks. Efficiently extracting, interpreting, and learning from very large datasets requires a new generation of scalable algorithms as well as new data management technologies.

In this course we explore key data analysis and management techniques which, when applied to massive datasets, are the cornerstone that enables real-time decision making in distributed environments, business intelligence on the Web, and scientific discovery at large scale. In particular, we examine the MapReduce parallel computing paradigm and associated technologies such as distributed file systems, NoSQL databases, and stream computing engines. Additionally, we review machine learning methods that make possible the efficient analysis of large volumes of data in near real time.

This course is highly interactive and based on the problem-based learning philosophy; students are expected to make use of said technologies to design highly scalable systems that can process and analyze Big Data for a variety of scientific, social, and environmental challenges.

Core Topics

The course is divided into three core topics: (1) Introduction to the Big Data problem: current challenges, trends, and applications. (2) Algorithms for Big Data analysis: mining and learning algorithms that have been developed specifically to deal with large datasets. (3) Technologies for Big Data management: Big Data technology and tools, with special attention to the MapReduce paradigm and the Hadoop ecosystem.

Course Objectives

By the end of this course, students will be familiar with the fundamental concepts of Big Data management and analytics; will be competent in recognizing the challenges faced by applications dealing with very large volumes of data, as well as in proposing scalable solutions for them; and will understand how Big Data impacts business intelligence, scientific discovery, and our day-to-day life.

Text Books

Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman

Spark: The Definitive Guide - Big Data Processing Made Simple by Bill Chambers and Matei Zaharia

Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia

Coursework

Participation

Participation is the barometer of the class. It lets me determine whether the pace of the course is too fast or too slow, it helps me spot pitfalls and misconceptions, and it helps you reinforce the material you have learned.

Students can expect frequent, simple exercises. Some of these daily assignments will be done in groups specified by the instructor, and they will account for the participation grade of the course. Make-up assignments will be allowed only if the instructor or TA was informed of a documented absence before the quiz took place.

Participation accounts for 15% of your final grade and will not be awarded automatically. You are required to participate either in class or electronically (through Piazza).

Homework

There will be a series of coding homework assignments during the semester. For every assignment, students will turn in a two-page report and well-documented code through UNM Learn only. No emailed assignments will be graded and no late assignments will be accepted.

Exam

Exams are this course's formal evaluation tool. They test students with respect to the learning goals of the course and comprise a mix of practical exercises and concepts. There will be only one midterm exam, held about three quarters of the way through the semester. The exam is open notes, but only handwritten notes are allowed.

Undergraduate students will be graded on 80% of the exam; graduate students will be graded on 100% of the exam.

Project

The final project is entirely at the discretion of the student (subject to instructor approval). Students are free to explore a problem of their interest and propose their own solution. The project has the following deliverables:

Proposal. A project proposal of at most 2 pages explaining why the problem is important, what has been done so far in the field, and what the expected outcomes are.

Presentations. Expect 2 presentations during the semester. Each one will detail different aspects of your project, and preliminary results are expected.

Poster and report. A report of at most 10 pages consisting of the traditional sections: introduction, motivation, method, results, and conclusion.

During the course we will hold bi-weekly brainstorming sessions to discuss and strengthen every proposed project.

Projects will be done in teams of 3 graduate students, or 4 students if the team includes at least 1 undergraduate student.

Grading

Grades will be based on your earned points, following the grade scale below. You need to earn the specified number of points or more to obtain the corresponding grade. Scores will be rounded to the closest integer value.

A grade of Incomplete can be assigned only for a documented medical reason. A change of grade to CR/NC after the semester deadline will be granted ONLY under special, documented extenuating circumstances.

A (95), A- (90), B+ (87), B (83), B- (80), C+ (77), C (73), C- (70), D+ (67), D (63), D- (60), F (below 60)

Participation: 15%

Homework: 15%

Exam: 30%

Project: 30%

Final presentation: 10%
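As an illustration only, the weights and grade scale above can be combined as in the following Python sketch. It is not an official grade calculator. It assumes each component is scored from 0 to 100, that "closest integer" rounds halves up, and that F means a rounded score below 60; the function names and the example scores are hypothetical.

```python
import math

# Component weights from the syllabus (percent of the final grade).
WEIGHTS = {
    "participation": 15,
    "homework": 15,
    "exam": 30,
    "project": 30,
    "final_presentation": 10,
}

# Minimum rounded score for each letter grade, highest first.
SCALE = [
    ("A", 95), ("A-", 90), ("B+", 87), ("B", 83), ("B-", 80),
    ("C+", 77), ("C", 73), ("C-", 70), ("D+", 67), ("D", 63), ("D-", 60),
]


def final_score(scores):
    """Weighted total of component scores, each assumed to be on a 0-100 scale."""
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100
    # Assumption: halves round up. Python's built-in round() would instead
    # round halves to the nearest even integer.
    return math.floor(total + 0.5)


def letter_grade(points):
    """Return the first letter whose cutoff is met; anything below 60 is F."""
    for letter, cutoff in SCALE:
        if points >= cutoff:
            return letter
    return "F"


# Hypothetical student scores, for illustration only.
example = {
    "participation": 100,
    "homework": 90,
    "exam": 80,
    "project": 85,
    "final_presentation": 90,
}
points = final_score(example)
print(points, letter_grade(points))  # prints: 87 B+
```

In this example the weighted total is 15 + 13.5 + 24 + 25.5 + 9 = 87 points, which meets the B+ cutoff but not the A- cutoff.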

Policies

Academic Honesty

Unless otherwise specified, you must write/code your own homework assignments. You may not use the web to find answers to any assignment. If you do not have time to complete an assignment, it is better to submit your partial solutions than to get answers from someone else. Cheating students will be prosecuted according to University guidelines. Students should get acquainted with their rights and responsibilities as explained in the Student Code of Conduct:

http://dos.unm.edu/student-conduct/academic-integrityhonesty.html

Any and all acts of plagiarism will result in an immediate dismissal from the course and an official report to the dean of students.

Instances of plagiarism include, but are not limited to: downloading code and snippets from the Internet without explicit permission from the instructor and/or without proper acknowledgment, citation, or license use; using code from a classmate or any other past or present student; quoting text directly or slightly paraphrasing from a source without proper reference; any other act of copying material and trying to make it look like it is yours.

Note that dismissal from the class means that the student will be dropped with an F from the course.

The best way to avoid plagiarism is to start your assignments early. Whenever you feel that you cannot keep up with the course material, your instructor is happy to find a way to help you. Make an appointment or come to office hours, but DO NOT plagiarize; it is not worth it!

Attendance

Class attendance is expected (read: mandatory) and note-taking is encouraged. Important information (about exams, assignments, projects, policies) may be communicated only during lecture time. We may also cover additional material (not available in the book or in slides) during the lecture.

If you miss a lecture, you should find out what material was covered and whether any announcements were made. Unexcused absences may result in participation points being deducted. Excused absences include sickness, attending conferences, job interviews, and similar. Even if your absence is excused, it is your responsibility to find out what material you missed. The professor is happy to answer specific questions regarding the lecture, but cannot go through all of the missed material on a one-to-one basis.

The TA and instructor must be notified of excused absences (through a Piazza private post) at least 24 hours in advance. Sickness must be justified with a doctor's note.

Communication

To facilitate interaction between students and to promote broader participation, I have created a Piazza group. Use the Piazza public group to ask general questions about homework, exams, projects, and lectures. You can also paste small snippets of code to clarify an idea. Students are encouraged to answer each other's questions. Recall that your thoughtful participation in this forum counts toward your final grade. Use Piazza private posts to ask for excused absences and other personal matters. Always cc the class TA in those cases. Piazza is a discussion forum for the class, and members are expected to conduct themselves with respect by posting comments and replies only in the context of the course.

Feedback

I value students' opinions regarding the course and I will take them into consideration to make this course as exciting and engaging as possible. Thus, throughout the semester I will ask students for formal and informal feedback. Formal feedback includes short surveys on my teaching effectiveness, preferred teaching methods, and the pace of the class. Informal feedback will be in the form of polls or in-class questions regarding learning preferences. You can also leave anonymous feedback in the form of a note in my departmental mailbox, under my office door, or using this form. Remember that it is in the best interest of the class that you bring it to my attention if something is not working properly (e.g., the pace of the class is too slow, the projects are boring, my teaching style is not effective) so that I can take corrective steps.

ADA

In accordance with University Policy 2310 and the Americans with Disabilities Act (ADA), academic accommodations may be made for any student who notifies the instructor of the need for an accommodation. If you have a disability, either permanent or temporary, contact the Accessibility Resource Center at 277-3506 for additional information.

Schedule

Topic	Subtopics	Readings
Mining Big Data and Applications	The evolution of Big Data; Technologies contributing to its rise; Statistical Limits on Data Mining; Applications of Big Data and its future; Things useful to know: recap	MMD:CH1
Systems foundations of Big Data	Distributed systems; Distributed file systems	Tanenbaum:CH1,CH10
MapReduce and the New Software Stack	Map-Reduce; Algorithms and complexity; Extensions to MR	MMD:CH2
Introduction to Apache Spark	Apache Spark's philosophy; Running Spark; Spark architecture; Language APIs; Spark sessions, dataframes, transformations, and actions; HW: Simple Spark example	SDG:CH1, SDG:CH2
Spark's Toolset and How Spark Runs on a Cluster	Datasets: type-safe structured APIs; Structured streaming; Machine Learning and advanced analytics; Lower-Level APIs; Spark’s ecosystem and packages; The life cycle of a Spark application; Execution details	SDG:CH3, SDG:CH15
Mining Data Streams	Advantages and challenges of stream processing; Stream processing design points; The stream data model; Sampling data and filtering streams; Estimating moments	MMD:CH4, SDG:CH16
Mining Data Streams Practice	Architecture and abstraction; Streaming transformations; Output operations and input sources; Event-time and stateful processing; HW: Spark streaming (in ch 22)	LSPK:CH10, SDG:CH22
Clustering	Clustering techniques; Clustering in non-Euclidean spaces; Clustering for Streams and Parallelism; HW: clustering with Spark	MMD:CH7
Finding similar items	Applications of Set Similarity; Shingling of Documents; Similarity-Preserving Summaries of Sets; Locality-Sensitive Hashing for Documents; Distance Measures; Applications of Locality-Sensitive Hashing; Methods for High Degrees of Similarity	MMD:CH3
Link Analysis	PageRank; Efficient computation of PageRank; Topic-sensitive PageRank; Links and authorities; HW: PageRank ch4 in LSPK	MMD:CH5, LSPK:CH4
Mining Social-Network Graphs	Clustering of Social-Network Graphs; Direct Discovery of Communities and Simrank; Partitioning of Graphs; Counting Triangles; Neighborhood Properties of Graphs; HW: GraphX	MMD:CH10
Recommender Systems	A Model for Recommendation Systems; Content-Based Recommendations; Collaborative Filtering; Spark practice: recommendation systems with ALS; HW: Recommender system with MovieLens	MMD:CH9, SDG:CH28
Large-Scale Machine Learning	The Machine Learning model; Learning from Nearest Neighbors; Linear regression; Support-Vector Machines; Decision trees; Perceptrons; Comparison of Learning Methods; HW: Page Rank ch4 in LSPK	MMD:CH12
Machine Learning with MLlib	Data types; Working with vectors; Feature extraction; Classification and regression; Model evaluation; HW: Spam classifier	LSPK:CH11, SDG:CH26
Neural Networks	Introduction to Neural Nets; Dense Feedforward Networks; Backpropagation and Gradient Descent; Regularization; HW: Intro to	MMD:CH13
More Neural Networks	Recurrent Neural Networks; Long Short-Term Memory (LSTM); HW: RNN	MMD:CH13

MMD: Mining of Massive Datasets

SDG: Spark: The Definitive Guide - Big Data Processing Made Simple

LSPK: Learning Spark: Lightning-Fast Big Data Analysis