Instructor: Abdullah Mueen

Time: 12:30 pm - 1:45 pm

Room: Centennial Engineering Center B146A

Office Hours: Tuesday and Thursday, 10:00AM-12:00PM

Office: FEC 340 (Knock if the door is closed)

TA: Nikan Chavoshi

Office Hours: Tue-Thu, 3:00PM-4:30PM

Office: FEC 345(B).

- Poster session will be on Monday 8th December, 2014, 12:00PM-2:00PM at the Stamm room 1044 in CEC.
- Assignment Four (extra credit, worth 3% of the class) has been posted. Due: Dec 9th by 11:59PM.
- Final Exam will be on Thursday, Dec 4th, 2014 in the class.
- Assignment 3 has been posted, Due Nov 30th, Sunday, by 11:59PM.
- Homework 4 posted. Due Nov 20th, Thursday, in the Class.
- Homework 3 has been posted. Due date is Oct 30th, Thursday, in the class.
- Assignment 2 has been posted. Due date is Friday, Oct 24th, by 11:59PM (revised from Oct 19th).
- Project proposal details posted. Due: Sunday, Oct 12th, 2014 by 11:59PM (extended from Friday, Oct 10th).
- Homework 2 posted. Due: Tuesday, 09/30/2014 in the class.
- Assignment 1 posted. Due: Friday, 09/19/2014 by 11:59PM.
- Project Proposals are due on Oct 10th, 2014 by 11:59PM.
- Midterm will be on Tuesday, Oct 7th, 2014 in the class.
- Homework 1 posted. Due: 9/11/2014, Thursday in the class.
- The email address for the teachers is cs591.teachers@gmail.com
- The Google group for the class is cs491-591-fall2014
- Groups are due by Tuesday 09/02/2014, 12:00PM

Syllabus

Description: This course covers data mining topics from basic to advanced level. Topics include data cleaning, clustering, classification, outlier detection, association-rule discovery, tools and technologies for data mining and algorithms for mining complex data such as graphs, text and sequences. Students will work on a data mining project to gather hands-on experience.

The course learning objectives include

- Learning basic data mining algorithms and their applications
- Learning about the tools and technologies available for analyzing various types of data
- Gaining hands-on experience in cleaning, managing and processing complex data.

Book: Data Mining: Concepts and Techniques, 3rd ed.

Lecture Schedule: Here

Grading: There will be two exams: a midterm on topics from weeks 1-7 and a final on the remainder of the topics. The exams are worth 25% each. Students will pick group projects and apply mining algorithms. The project is worth 20%. There will be three to five homework sets, together worth 10% of the class. There will be four assignments worth 5% each. Homework will focus on understanding the algorithms and techniques. The assignments will be on applying different techniques to real data selected by the instructor.

Academic Integrity:

For the assignments in this class, discussion of concepts with others is encouraged, but

Academic Calendar: For a list of dates to enroll in, change, or withdraw from classes and a list of holidays, go here.

Project: Each group will do one project. A group can have at most two
students. Students in the CS 491 section can have groups of three students. A project consists of two phases with equal weights.

- Data Preprocessing and Cleaning:
Each group will propose a data source or pick a dataset from a given list. Each
group will propose data mining tasks, a set of algorithms/tools and
success measures. Groups will clean the data for the projects and submit
the written proposals by Oct 12th, 2014.

Details: Here is a proposal from last year that was well structured. I need the following sections: Title, Introduction, Data (collection and preprocessing), Hypothesis, Proposed Method, Validation and Conclusion. I need clear answers to the following questions: What data will you be using? How is it formatted? What is the size of the data? How will you clean the data? How will you process the data?

What hypothesis/hope do you have? How would you prove or disprove your hypothesis?

What methods will you use? What software tools will you use? How much programming does it need?

How do you validate that your method is working? How does that relate to proving your hypothesis?

- Implementation and Presentation:
Each group will implement the project and write up the methods and results
in the final project report. The groups will present and demonstrate
their projects in the class or in a poster session. A poster template is here.

Details: Poster session will be on Monday, 8th December, 2014, 12:00PM-2:00PM. Students are advised to print their posters well ahead of time to avoid long queues at the printer. Poster session will be in the Centennial Engineering Center’s Stamm Room 1044. We will provide velcro stickers for hanging. I will visit your posters and grade them. Do NOT leave the room until I see your poster. If you have questions, email me.

Homework:

Homework 1: Here Due: Thursday, 09/11/2014, beginning of the lecture. No electronic submission. Only paper-based submission. You have to show steps clearly to convince us that you did it yourself. Solution

Homework 2: Here Due: Tuesday, 09/30/2014, in the class. No electronic submission. Only paper-based submission. You have to show steps clearly to convince us that you did it yourself.

Homework 3: Exercises 10.2, 10.7 and 10.8. Due Oct 30th, Thursday, in the class. Only paper-based submission.

Homework 4: Here Due: November 20th, Thursday in the class. Only paper submission.

Assignments:

Assignment 1: Here. Due: Friday, 09/19/2014 by 11:59PM. Only electronic submissions to the teachers' email address. We will not open submissions sent to our personal inboxes.

Assignment 2: Due: Friday Oct 24, 2014 by 11:59PM (revised from Oct 19). Only electronic submissions to the teachers' email address. Use the dataset from the previous assignment. Submit your code so I can reproduce the reported numbers for the classifiers.

a) Label the first 5000 rows as class 1 and the remaining rows as class 2. Use SVM and Neural Network to classify the data and report 10-fold cross-validated accuracy. Describe the parameters of your classifiers.

b) Label the rows [1:500,1001:1500,2001:2500,3001:3500,4001:4500,5001:5500,6001:6500,7001:7500,8001:8500,9001:9500]

as class 1 and the remaining rows as class 2. Use SVM and Neural Network to classify the data and report 10-fold cross-validated accuracy. Describe the parameters of your classifiers.
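The labeling and the 10-fold evaluation described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the course's reference solution: the function names are hypothetical, rows are numbered from 1 as in the assignment text, and `fit_predict` is a placeholder for whichever SVM or neural network implementation you choose (it takes training rows, training labels and test rows, and returns predicted labels).

```python
import random

def labels_part_a(n_rows):
    """Rows 1..5000 are class 1, all remaining rows are class 2."""
    return [1 if row <= 5000 else 2 for row in range(1, n_rows + 1)]

def labels_part_b(n_rows):
    """Rows 1:500, 1001:1500, ..., 9001:9500 are class 1, the rest class 2."""
    return [
        1 if row <= 9500 and ((row - 1) // 500) % 2 == 0 else 2
        for row in range(1, n_rows + 1)
    ]

def k_fold_indices(n_rows, folds=10, seed=0):
    """Shuffle the row indices and deal them into `folds` disjoint test sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[f::folds] for f in range(folds)]

def cross_validated_accuracy(X, y, fit_predict, folds=10, seed=0):
    """Train on 9 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for test_idx in k_fold_indices(len(X), folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predictions = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct += sum(p == y[i] for p, i in zip(predictions, test_idx))
    return correct / len(X)
```

Shuffling before splitting matters here because both labelings are ordered by row number; without it, a fold in part (a) could contain only one class.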

SVM code snippet from the class.

Assignment 3: Due November 30th by 11:59PM. Only electronic submissions to the teachers' email address. Submit your code and plot. Describe any assumptions that you had to make.

a) Implement the Local Outlier Factor algorithm to find the LOFs of all the points in the dataset from Assignment 1.

b) Produce a plot for different values of k (i.e. 1 to 100) that shows the number of outliers. Use a threshold of 2 for deciding if a point is an outlier.
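For orientation, the standard Local Outlier Factor definition (k-distance, reachability distance, local reachability density) can be sketched as follows. This is a simplified illustration, not a reference solution: it assumes Euclidean distance, a brute-force neighbor search, and exactly k neighbors per point (ties at the k-distance are broken arbitrarily), and it does not handle duplicate points specially. The function names are hypothetical.

```python
import math

def k_nearest(points, i, k):
    """Return (index, distance) pairs for the k nearest neighbors of points[i]."""
    dists = [(j, math.dist(points[i], q)) for j, q in enumerate(points) if j != i]
    dists.sort(key=lambda pair: pair[1])
    return dists[:k]

def local_outlier_factors(points, k):
    """Compute the LOF score of every point (requires k < len(points))."""
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_distance = [nbrs[-1][1] for nbrs in neighbors]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach(p, o) = max(k_distance(o), d(p, o)).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for j, d in neighbors[i]]
        mean_reach = sum(reach) / len(reach)
        lrd.append(1.0 / mean_reach if mean_reach > 0 else math.inf)

    # LOF: average ratio of the neighbors' densities to the point's own density.
    return [
        sum(lrd[j] / lrd[i] for j, _ in neighbors[i]) / len(neighbors[i])
        for i in range(n)
    ]

def count_outliers(points, k, threshold=2.0):
    """Count the points whose LOF exceeds the threshold."""
    return sum(score > threshold for score in local_outlier_factors(points, k))
```

Calling `count_outliers(points, k)` for each k from 1 to 100 gives the values to plot in part (b). The brute-force search is quadratic in the number of points, so a spatial index or vectorized distance computation may be needed for the full dataset.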

Assignment 4: Due Dec 9th by 11:59PM. Online submissions only. For the given dataset, use a locality sensitive hashing scheme to search for approximate nearest neighbors. Use the following queryset. You can use any parameter choices to obtain the nearest neighbors.

Deliverables: 1. The approximate nearest neighbors of the queries.

2. Describe all the parameters and the reason for choosing them.

3. The code for building the hash table and searching the tables.
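One possible shape for the table-building and search code is sketched below. The assignment leaves the hashing scheme open, so this sketch assumes random-hyperplane (sign) hashing, which targets cosine similarity; the function names and the default values for `n_tables` and `n_bits` are illustrative assumptions, not recommended settings.

```python
import math
import random

def hash_key(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(sum(v * w for v, w in zip(vec, plane)) >= 0 for plane in planes)

def build_tables(data, n_tables=8, n_bits=6, seed=0):
    """Build n_tables hash tables, each keyed by n_bits random hyperplanes."""
    rng = random.Random(seed)
    dim = len(data[0])
    tables = []
    for _ in range(n_tables):
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        buckets = {}
        for idx, vec in enumerate(data):
            buckets.setdefault(hash_key(vec, planes), []).append(idx)
        tables.append((planes, buckets))
    return tables

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def query(tables, data, q, top=1):
    """Collect candidates from the matching bucket of every table, then rank
    only those candidates by exact cosine similarity to the query."""
    candidates = set()
    for planes, buckets in tables:
        candidates.update(buckets.get(hash_key(q, planes), []))
    ranked = sorted(candidates, key=lambda i: cosine(data[i], q), reverse=True)
    return ranked[:top]
```

More bits per table make buckets smaller (fewer candidates to rank, but more missed neighbors), while more tables raise the chance that a true neighbor shares a bucket with the query in at least one table. A query can return an empty list if none of its buckets are occupied.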

Data: Links to some data sources (in no order) you can use for the course projects. You are welcome to suggest any dataset of your choice preferably large, noisy and (semi/un)structured.

- Social Network Graph of Twitter
- GPS Trajectories from Microsoft Research
- Tiny Images Dataset from MIT.
- Remote Sensing Data from NASA. Direct download link for the product MOD09CMG.005.
- 83 million Tweets from Twitter
- Daily Currency Conversion Rates between USD and others.
- Daily Values of Stock Tickers
- CMU Motion Capture Database
- MIR FLICKR
- Geo-tagged image data
- Video with GPS ground-truth
- ABQ Data

Tools:

- Weka
- VW
- MATLAB - Student version is available
- Hadoop
- t-SNE
- Microsoft Azure Machine Learning
- Machine Learning Video Library

- Lecture 1: Overview
- Lecture 2: Data Types and Similarities
- Lecture 3: Data Transformation and Reduction
- Lecture 4: Frequent Pattern Mining
- Lecture 5: Basic Classification
- Lecture 6: Classification: ANN and SVM
- Lecture 7: Basic Clustering
- Lecture 8: Advanced Clustering (Fuzzy and Co-Clustering)
- Lecture 9: Outlier Detection
- Lecture 10: Time Series Mining SAX and LB_Keogh
- Lecture 11: Graph Algorithms
- Lecture 12: Locality Sensitive Hashing