- Incorrect baseline: Comparing a method against a crippled baseline to make one's method look better. Example: showing that one's periodicity-detection model beats an autoregressive model, which is not designed to detect seasonality or periodicity.
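A minimal sketch in plain Python (hypothetical data) of why an AR(1) forecaster is a strawman for periodic data: even a naive seasonal baseline that simply repeats the last observed period beats it on a pure sine wave, so outperforming the AR model says little about the periodicity detector.

```python
import math

# Hypothetical periodic series: a pure sine wave with period 50.
period = 50
x = [math.sin(2 * math.pi * t / period) for t in range(500)]
train, test = x[:400], x[400:]

# Strawman baseline: AR(1) fitted by least squares, iterated forward.
phi = sum(a * b for a, b in zip(train, train[1:])) / sum(a * a for a in train[:-1])
last = train[-1]
ar_forecast = []
for _ in test:
    last = phi * last  # iterated AR(1) forecasts decay toward zero
    ar_forecast.append(last)

# A fairer baseline that at least attempts periodicity: repeat the last period.
seasonal_forecast = [train[len(train) - period + (h % period)] for h in range(len(test))]

def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

rmse_ar = rmse(ar_forecast, test)              # large: AR(1) cannot track the cycle
rmse_seasonal = rmse(seasonal_forecast, test)  # near zero on a pure sine
```

Beating the seasonal-naive baseline would be evidence; beating AR(1) is not, because AR(1) was never meant to model the cycle.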
- Incorrect assumptions: The problem is right, but the formulation is wrong because of incorrect assumptions. Example: using patient mortality as a surrogate for disease severity. This is a convenient but unrealistic assumption with many hidden attributes.
- Unnecessary problem: The problem does not exist in any application domain. Example: speeding up a quadratic algorithm when the data in the target applications is too small to require that speedup.
- Incorrect validation: Validation using low-quality labeling. In the name of a user study, authors generate the labeled data themselves and thus bias the results. A user study with three subjects is, in all likelihood, done by the authors or their friends.
- Absence of significance: An experimental comparison of methods without a significance analysis is not acceptable. For example, if we use RMSE to compare two methods, we must show how significant the reduction in error by the better method is. A 4-watt reduction in root-mean-squared error at a megawatt power plant is not worth re-engineering an algorithm. Note that RMSE has a unit.
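One way to report statistical significance, sketched in plain Python with hypothetical per-case errors: a paired permutation test on the error differences between two methods. Practical significance, i.e. the size of the reduction in the unit of the error, still has to be argued separately.

```python
import random

def paired_permutation_test(errors_a, errors_b, n_perm=10000, seed=0):
    """Two-sided test: could the observed mean difference in per-case
    errors between methods A and B have arisen by chance?"""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis, each paired difference could go either way.
        perm = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(perm) / len(diffs) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical per-case errors of two competing methods on ten test cases.
errors_a = [2.1, 1.9, 2.0, 2.2, 1.8, 2.1, 2.0, 1.9, 2.2, 2.0]
errors_b = [1.1, 0.9, 1.0, 1.2, 0.8, 1.1, 1.0, 0.9, 1.2, 1.0]
p_value = paired_permutation_test(errors_a, errors_b)  # small: reduction is significant
```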
- Discovering known knowledge using known knowledge: Using biased features that have hidden pathways to the predictions. Example: using a doctor's notes to predict a patient's disease. If a test patient has already gone to a doctor for a note, he does not need a prediction for the disease; he can just ask the doctor.
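A tiny illustration of such a hidden pathway, with hypothetical data: when a feature is effectively derived from the label, as a doctor's note is, a trivial model scores perfectly, and the "discovery" is circular.

```python
import random

rng = random.Random(0)

# Hypothetical patients: the label is whether the patient has the disease.
patients = [{"disease": rng.random() < 0.3} for _ in range(1000)]

# Leaky feature: the doctor's note flags the disease because the doctor
# already diagnosed it -- a hidden pathway from label to feature.
for p in patients:
    p["note_flag"] = 1 if p["disease"] else 0

# A "model" that just reads the leaked feature is perfect, and useless.
accuracy = sum((p["note_flag"] == 1) == p["disease"] for p in patients) / len(patients)
```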
- Incorrect use of synthetic data: Using synthetic data without an explanation of how the data was generated to challenge the method. For example, if we generate a sine wave for periodicity detection and a classic random walk for an autoregressive model, the synthetic data form trivial scenarios. Note that a sine wave has exactly one period, and a random walk has exactly one coefficient to estimate in the AR model, which is 1.
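The last point can be checked in a few lines of plain Python: fitting AR(1) by least squares to a simulated random walk recovers a coefficient of essentially 1, so such synthetic data poses no real challenge to the model.

```python
import random

rng = random.Random(42)

# A classic random walk: x[t] = x[t-1] + noise.
x = [0.0]
for _ in range(5000):
    x.append(x[-1] + rng.gauss(0.0, 1.0))

# Least-squares estimate of phi in the AR(1) model x[t] = phi * x[t-1].
num = sum(x[t - 1] * x[t] for t in range(1, len(x)))
den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
phi = num / den  # essentially 1: the walk has exactly one trivial parameter
```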
- Complexity of methods: Complexity is relative to the reader; still, authors must make every effort to present their methods simply. It is a crime to present a method in a complex way when a simpler way exists. Mathematical notation is useful for conveying complex ideas; however, it is an art to find a simpler description instead of a laundry list of notations or maze-like plate representations.
- A machine trained with another machine: Papers often use automatically calculated or derived scores such as reputation, helpfulness, or trustworthiness as labels to train algorithms and report performance. This is equivalent to training a classifier to behave like another classifier. It is easier to make a copy of the original classifier; the trivial, non-scientific challenge is finding the money to obtain it.
- Mismatch between motivation and optimization: Often the motivational example of a paper does not match the optimization actually done in the paper. Example: imagine an objective that adds a set of edges to a graph so as to reduce the average shortest-path distance the most. This objective does not motivate a reduction in the number of stops in air routing (nodes are airports, edges are flights) if the formulation does not consider the passenger load on each edge.
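A minimal sketch of that objective on a toy graph (plain Python, hypothetical network): pick the single new edge that most reduces the average shortest-path distance. Nothing in the objective sees passenger load, which is exactly the mismatch with the air-routing motivation.

```python
from collections import deque
from itertools import combinations

def avg_shortest_path(nodes, edges):
    """Average BFS (hop-count) distance over all ordered node pairs."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = pairs = 0
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        for v in nodes:
            if v != s:
                total += dist[v]
                pairs += 1
    return total / pairs

# Toy "air network": six airports in a path.
nodes = list(range(6))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

# The stated objective: add the edge that minimizes average distance.
candidates = [e for e in combinations(nodes, 2) if e not in edges]
best_edge = min(candidates, key=lambda e: avg_shortest_path(nodes, edges + [e]))
# The choice never consults how many passengers actually fly each leg.
```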
- Implementation bias: Performance improvements of important algorithms are often the result of implementation bias, such as comparing a C++ implementation with a MATLAB implementation. Empirical evaluation of performance should always be on the same platform, between the BEST implementations of the competing methods, under the BEST compilers, and so on. Clearly, I prefer avoiding such problems to solving them appropriately.
- Averaging across datasets: Algorithms are often tested on multiple datasets, and authors report an average error across datasets without the variance or the individual numbers. A dataset from a real domain comes with a large set of dependencies on various domain-specific parameters such as operating conditions, time of the year, etc. Averaging error/accuracy over many datasets and showing that the average is better than some other method's is an incorrect way of claiming general superiority.
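A toy illustration with hypothetical numbers of why the average can mislead: below, method A has the lower average RMSE yet loses on four of the five datasets; its average is carried entirely by one dataset.

```python
# Hypothetical per-dataset RMSE for two methods on five datasets.
rmse_a = [2.0, 2.0, 2.0, 2.0, 0.1]
rmse_b = [1.0, 1.0, 1.0, 1.0, 20.0]

mean_a = sum(rmse_a) / len(rmse_a)  # 1.62
mean_b = sum(rmse_b) / len(rmse_b)  # 4.8
wins_a = sum(a < b for a, b in zip(rmse_a, rmse_b))  # A wins on 1 dataset
wins_b = sum(b < a for a, b in zip(rmse_a, rmse_b))  # B wins on 4 datasets
# Reporting only mean_a < mean_b hides that B is better on most datasets;
# the per-dataset numbers (or at least the variance) must accompany the mean.
```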