This is a supporting page for the paper:

ClearView: Data Cleaning for Online Review Mining, published in the ASONAM 2016 proceedings.
In this paper, we describe three types of noisy and abnormal reviews, discuss methods to detect and filter them, and, finally, show the effectiveness of our cleaning process by improving the overall distributional characteristics of review datasets.

by:

Amanda J. Minnich, Noor Abu-El-Rub, Maya Gokhale, Ronald Minnich, and Abdullah Mueen

Overview and motivation

How can we automatically clean and curate online reviews to better mine them for knowledge discovery? Typical online reviews are full of noise and abnormalities, which lead to a poor customer experience and hinder knowledge discovery. Abnormalities include non-standard characters, unstructured punctuation, multiple languages, and misspelled words. Worse still, people leave "junk" reviews that are completely nonsensical, spam, or fraudulent. In this paper, we describe three types of noisy and abnormal reviews, discuss methods to detect and filter them, and, finally, show the effectiveness of our cleaning process by improving the overall distributional characteristics of review datasets.
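To make the kinds of abnormalities listed above concrete, here is a minimal Python sketch of a few surface-level checks. It is not the method from the paper: the function names, flag names, and thresholds are all hypothetical and chosen only for illustration, and a real pipeline would need language identification and spell checking on top of heuristics like these.

```python
import string
import unicodedata


def non_latin_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are not Latin letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    non_latin = [ch for ch in letters if "LATIN" not in unicodedata.name(ch, "")]
    return len(non_latin) / len(letters)


def punctuation_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are ASCII punctuation."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(ch in string.punctuation for ch in chars) / len(chars)


def repetition_ratio(text: str) -> float:
    """1 - (unique words / total words); values near 1 mean repeated text."""
    words = text.lower().split()
    if not words:
        return 0.0
    return 1.0 - len(set(words)) / len(words)


def noise_flags(text: str,
                max_non_latin: float = 0.5,
                max_punct: float = 0.3,
                max_repeat: float = 0.7) -> list:
    """Return a list of heuristic flags; thresholds are illustrative guesses."""
    flags = []
    if non_latin_ratio(text) > max_non_latin:
        flags.append("non_latin_script")
    if punctuation_ratio(text) > max_punct:
        flags.append("heavy_punctuation")
    if repetition_ratio(text) > max_repeat:
        flags.append("repeated_text")
    return flags


if __name__ == "__main__":
    samples = [
        "Great hotel, friendly staff and a clean room.",
        "good good good good good good",
        "Отличный отель",
        "!!!???...!!!",
    ]
    for review in samples:
        print(repr(review), noise_flags(review))
```

Returning a list of flags rather than a single yes/no decision keeps the checks independent, so each threshold can be tuned or dropped separately for a given dataset.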

PDF:

A PDF version of the paper is available here.

Slides:

The slides from the talk at ASONAM 2016 are available here. Feel free to use them for a reading group, but please email me at aminnich AT cs DOT unm DOT edu to let me know you are doing so.

Data:

Two datasets were used for the experiments described in the paper. The first dataset consists of reviews from TripAdvisor.com. We collected all the reviews and associated information for almost all of the US hotels on this site. For the second dataset, we collected reviews and their associated information from nearly all of the apps in the Google Play Store. We have annotated 10,000 randomly sampled reviews from each dataset with three sentiment scores given by Amazon Mechanical Turk workers. This dataset is password protected; please email aminnich AT cs DOT unm DOT edu to request access.

Code:

The code is available on my GitHub.

Figures:

(Top row) Unintelligible reviews and a review in Russian. (Bottom row) Repeated text, positive and negative twisters, and non-Unicode text.

The distribution of semantic scores.

Iterative training results for TripAdvisor.

Left: Distribution of the number of words in Google Play reviews before and after filtering.
Middle: Distribution of the number of characters in TripAdvisor titles before and after filtering.
Right: Rating distributions of Google Play reviews before and after filtering.