↑ Data

One of the most frequently mentioned issues with mining personal information from an individual’s consumption behaviour, in order to recommend products and services to that individual, is that it can cause pain and embarrassment, and it can deliver noise.

  • What if you permanently use a wheelchair and, out of curiosity or while buying a gift for a family member or friend, look at products online that someone in a wheelchair cannot use, and then get offered similar products on your screen for days?

  • What if you are a teenage girl visiting a website that sells baby products, and the company’s application sends promotional baby products to your home address?

In short, companies seem to think that if more information is disclosed and used, sales of products will automatically increase. Data gathered in large pools of information is increasingly a focus of technology competition.

What is required instead are better approaches to interpreting data and models, and an understanding of the limitations of both, in order to produce better output. Data without sound approaches becomes noise.

Better Data != ↑ Data

  • Better data does not mean more data; sometimes it means less (data cleansing, outlier removal, stratified sampling).

  • Content-based features (or different features in general) might be able to improve accuracy in many cases, but not always.

  • High bias situations (a model that is too simple to explain the data) will not benefit from more training examples, but might indeed benefit from more features.

  • High variance situations (a model that is too complicated for the amount of data, so the training error is much lower than the test error) lead to model over-fitting, and can be addressed by reducing the number of features and by increasing the number of data points.

  • Complex algorithms can limit the ability to scale up to a larger number of features.
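
The high bias and high variance cases above can be told apart by comparing training error with test error. The following is a minimal sketch of that check, not a definitive rule: the function name `diagnose` and the thresholds `acceptable_error` and `gap_tolerance` are hypothetical and would need tuning for any real problem.

```python
def diagnose(train_error: float, test_error: float,
             acceptable_error: float = 0.10,
             gap_tolerance: float = 0.05) -> str:
    """Classify a fit from its training and test error (lower is better).

    Both thresholds are illustrative assumptions, not standard values.
    """
    gap = test_error - train_error
    if gap > gap_tolerance:
        # Training error is much lower than test error: over-fitting.
        return "high variance: reduce features or add data points"
    if train_error > acceptable_error:
        # The model cannot even fit the training set: it is too simple.
        return "high bias: add features; more examples will not help"
    return "ok"


if __name__ == "__main__":
    print(diagnose(0.30, 0.32))  # both errors high, small gap
    print(diagnose(0.02, 0.25))  # large gap between train and test
```

The gap is checked first, so a model with both a large gap and a high training error is reported as high variance; reversing the order would be an equally defensible choice.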