Robots, Art, Offshore Finance, and Life 

fast2value

MACHINE LEARNING BASIC PARADIGM

BASIC PARADIGM

Observe a set of examples (the "Training Data").
Infer something about the process that generated the data.
Use that inference to make predictions about previously unseen data (the "Test Data").
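
As a minimal sketch of these three steps, the example below observes a few made-up (x, y) pairs, infers a straight line through them by ordinary least squares, and uses that line to predict values for unseen inputs. The data and the names `fit_line` and `predict` are illustrative assumptions, and a straight line is only one of many possible inferences.

```python
# Step 1: observe training data. These (x, y) pairs are invented for illustration.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]


def fit_line(points):
    """Step 2: infer slope and intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)


def predict(x):
    """Step 3: use the inferred line on previously unseen inputs."""
    return slope * x + intercept


test_inputs = [5.0, 6.0]  # hypothetical test data
predictions = [predict(x) for x in test_inputs]
print(predictions)
```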

VARIATIONS ON THE PARADIGM

Supervised Learning:

Given a set of feature/label pairs, find a rule that predicts the label associated with a previously unseen input.
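
One simple rule of this kind, shown here only as an illustration, is "nearest neighbour": give a new input the label of the closest training example. The training pairs and the name `predict_label` are assumptions made for this sketch.

```python
import math

# Hypothetical labelled training data: (feature vector, label) pairs.
training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]


def predict_label(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, label = min(training, key=lambda pair: math.dist(pair[0], features))
    return label


print(predict_label((1.2, 1.1)))  # an input that was not in the training data
```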

Unsupervised Learning:

Given a set of feature vectors (without labels), group them into “Natural Clusters” or create labels for the groups.


CLUSTERING EXAMPLES INTO GROUPS

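Clustering decides how similar examples are to one another, with the goal of separating them into distinct, natural groups.  Similarity is expressed as a "Distance Measure": the smaller the distance between two examples, the more similar they are.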

  • We know that there are “k” different groups in the training data, but don’t know the labels.
  • Pick “k” samples as the initial exemplars
  • Cluster the remaining samples by minimising the distance between samples in the same cluster (the "Objective Function"): put each sample in the group with the closest exemplar
  • Find the median example in each cluster and use it as the new exemplar
  • Repeat until the exemplars no longer change
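
The steps above can be sketched as follows. This is a rough illustration under stated assumptions: Euclidean distance is the distance measure, the first "k" samples serve as the starting exemplars, and the "median example" is taken to be the cluster member with the smallest total distance to the other members. The function name `cluster` and the sample points are hypothetical.

```python
import math


def cluster(samples, k, max_rounds=100):
    """Group samples around k exemplars, following the steps listed above."""
    exemplars = list(samples[:k])  # assumption: first k samples are the starting exemplars
    for _ in range(max_rounds):
        # Put each sample in the group with the closest exemplar.
        groups = [[] for _ in exemplars]
        for s in samples:
            nearest = min(range(k), key=lambda i: math.dist(s, exemplars[i]))
            groups[nearest].append(s)
        # New exemplar: the member with the smallest total distance to its group.
        # An empty group keeps its old exemplar.
        new_exemplars = [
            min(g, key=lambda c: sum(math.dist(c, other) for other in g)) if g else e
            for g, e in zip(groups, exemplars)
        ]
        if new_exemplars == exemplars:  # repeat until no change
            break
        exemplars = new_exemplars
    return exemplars, groups


# Made-up two-dimensional samples forming two visibly separate groups.
samples = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.9, 8.1)]
exemplars, groups = cluster(samples, 2)
print(exemplars)
print(groups)
```

The result depends on which samples are picked as starting exemplars, so a poor starting choice can leave the procedure in a poor grouping.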


MACHINE LEARNING METHODS

Learn models based on unlabelled data by clustering training data into groups of nearby points

  • Resulting clusters can assign labels to new data


Learn models that separate labelled groups of similar data from other groups.

  • May not be possible to perfectly separate groups without “Over Fitting”
  • But can make decisions with respect to trading off “False Positives” versus “False Negatives”
  • Resulting classifiers can assign labels to new data.
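
To illustrate the trade-off between false positives and false negatives, the sketch below assumes a classifier that outputs a score for each example and calls it positive when the score reaches a threshold. The scores, labels and the name `errors_at` are invented for the example.

```python
# Hypothetical classifier scores paired with the true label (1 = positive, 0 = negative).
scored = [(0.95, 1), (0.80, 1), (0.70, 0), (0.60, 1), (0.40, 0), (0.30, 1), (0.10, 0)]


def errors_at(threshold):
    """Count (false positives, false negatives) when score >= threshold means positive."""
    false_pos = sum(1 for score, truth in scored if score >= threshold and truth == 0)
    false_neg = sum(1 for score, truth in scored if score < threshold and truth == 1)
    return false_pos, false_neg


for threshold in (0.25, 0.50, 0.75):
    fp, fn = errors_at(threshold)
    print(f"threshold={threshold:.2f}  false positives={fp}  false negatives={fn}")
```

On this data, raising the threshold removes false positives at the cost of more false negatives, and lowering it does the opposite.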


FEATURES REPRESENTATION

Features never fully describe the situation: “all models are wrong, but some are useful”.

Feature engineering: consider, as a simple example, recruiting a project manager and using ML to determine the suitability of prospective candidates. The approach is to represent examples by feature vectors that will facilitate generalisation.

  • Use 100 past examples described by relevant features (on-time delivery; on-budget delivery; achieved earned value management goals) to predict which features can be used to identify the best candidates.
  • Some additional features may be useful indicators (numeracy, domain experience, use of methodologies).
  • Other features may cause over-fitting (eye colour, star sign, etc.).
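
A feature vector for this example might be built as in the sketch below. The `Candidate` fields and the choice to encode each feature as a number between 0 and 1 are assumptions made for illustration. The irrelevant attribute is recorded but deliberately left out of the vector.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record of a past project manager."""
    on_time_rate: float    # fraction of projects delivered on time
    on_budget_rate: float  # fraction of projects delivered on budget
    met_evm_goals: bool    # achieved earned value management goals
    eye_colour: str        # recorded, but not a relevant feature


def to_feature_vector(candidate):
    """Keep only the features expected to generalise, encoded as numbers."""
    return [
        candidate.on_time_rate,
        candidate.on_budget_rate,
        1.0 if candidate.met_evm_goals else 0.0,
    ]


print(to_feature_vector(Candidate(0.9, 0.8, True, "brown")))
```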


FALSE POSITIVES / NEGATIVES

An ideal model would classify every example correctly, giving 100% true positives and 100% true negatives. In practice, the goal is a good model with an understood, acceptable level of false positives and false negatives.

  • False negative: When a data point is classified as a negative example (say class A) but it is actually a positive example (belongs to class B).
  • False positive: When a data point is classified as a positive example (say class B) but it is actually a negative example (belongs to class A).
  • True negative: When a data point is classified as a negative example (say class A) and it is actually a negative example (belongs to class A).
  • True positive: When a data point is classified as a positive example (say class B) and it is actually a positive example (belongs to class B).
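
The four definitions above can be counted directly from a list of actual and predicted classes. In this sketch class B is treated as the positive class, as in the definitions. The function name `confusion_counts` and the sample labels are made up.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="B"):
    """Tally true/false positives and negatives, treating `positive` as the positive class."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


# Invented labels for six data points.
actual = ["B", "B", "A", "A", "B", "A"]
predicted = ["B", "A", "A", "B", "B", "A"]
counts = confusion_counts(actual, predicted)
print(dict(counts))
```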