Machine Learning for Design

Lecture 5 - Part a

Training and Evaluation

Previously on ML4D

CRISP-DM Methodology

Model Development Lifecycle

Dataset Splitting

Split your data

  • Training set
    • used to train the model
  • Validation set
    • used to fine-tune and select the model
  • Test set
    • used to evaluate the final model (see the sketch below)
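
A minimal sketch of such a three-way split, assuming scikit-learn and hypothetical 60/20/20 proportions (the toy `X` and `y` stand in for a real dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset: 100 items, 5 features, binary labels
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# First split off the test set (20%), then carve a validation set out of the rest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)

# 0.25 of the remaining 80% gives 60% train / 20% validation / 20% test
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```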

Avoid leakages

  • Data items
    • training items must not also appear in the validation or test sets
  • Features
    • avoid features highly correlated with the prediction target (label leakage)
    • avoid features that are not present in the production environment (see the sketch below)
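
A sketch of how a leaky, label-correlated feature might be spotted; the table, column names, and 0.9 threshold below are all hypothetical, but a feature that correlates almost perfectly with the target usually deserves suspicion:

```python
import numpy as np
import pandas as pd

# Hypothetical training table: "was_refunded" is only known after the outcome,
# so using it as a feature would leak the label into training.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=200),
    "basket_value": rng.random(200) * 100,
    "was_refunded": rng.integers(0, 2, size=200),
})
df["label"] = df["was_refunded"]  # toy setup that makes the leak visible

# Flag features suspiciously correlated with the target (0.9 is an arbitrary threshold).
correlations = df.drop(columns="label").corrwith(df["label"]).abs()
print(correlations[correlations > 0.9])  # here: was_refunded -> 1.0
```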

Cross-validation

  • Cycle training and validation data several times
    • Useful when the dataset is small
  • Split the data into n portions
    • Train the model n times, each time using n-1 portions for training and the remaining portion for validation
    • Average the results (see the sketch below)
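
A minimal k-fold sketch of this procedure, assuming scikit-learn, n = 5 portions, and a hypothetical logistic-regression model on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Toy data standing in for a small dataset
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])               # train on n-1 portions
    scores.append(model.score(X[val_idx], y[val_idx]))  # validate on the held-out portion

print(np.mean(scores))  # average the n results
```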

Evaluation

How to Evaluate?

  • Metric
    • How to measure errors?
    • Both training and testing
  • Training
    • How to “help” the ML model to perform well?
  • Validation
    • How to pick the best ML model?
  • Evaluation
    • How to “help” the ML model to generalize?

Let errors guide you

  • Errors are almost inevitable!
  • How to measure errors?
  • Select an evaluation procedure (a “metric”)

Model Training Process

Errors

  • These are the most common questions:
    • How is the prediction wrong?
    • How often is the prediction wrong?
    • What is the cost of wrong predictions?
    • How does the cost vary by the wrong prediction type?
    • How can costs be minimised?

Regression

Mean absolute error

MAE = \frac{1}{N}\sum^N_{j=1}|p_j - v_j|

Average of the absolute difference between the expected value (v_j) and the predicted value (p_j)
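
A small numeric sketch of the formula, with hypothetical predictions `p` and expected values `v`:

```python
import numpy as np

p = np.array([2.5, 0.0, 2.0, 8.0])   # predicted values (hypothetical)
v = np.array([3.0, -0.5, 2.0, 7.0])  # expected values (hypothetical)

mae = np.mean(np.abs(p - v))  # (1/N) * sum of |p_j - v_j|
print(mae)  # 0.5
```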

Mean square error

MSE = \frac{1}{2N}\sum^N_{j=1}(p_j - v_j)^2

Average of the squared difference between the expected value (v_j) and the predicted value (p_j)

The square is easier to use during the training process (its derivative is simple to work with; the 1/2 factor cancels the 2 that appears when differentiating)

Larger errors are penalised more heavily
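
The same hypothetical values with the slide's 1/(2N) scaling; squaring makes the larger error on the last item dominate the score:

```python
import numpy as np

p = np.array([2.5, 0.0, 2.0, 8.0])   # predicted values (hypothetical)
v = np.array([3.0, -0.5, 2.0, 7.0])  # expected values (hypothetical)

mse = np.mean((p - v) ** 2) / 2  # (1/2N) * sum of (p_j - v_j)^2
print(mse)  # 0.1875
```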

Classification

Confusion Matrix

Describes the complete performance of the model
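
A hedged sketch with scikit-learn's `confusion_matrix` on hypothetical binary labels (0 = negative, 1 = positive); for the binary case, the four cells give the TN/FP/FN/TP counts used by the metrics that follow:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground truth (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (hypothetical)

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```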

Type I and Type II errors

A Type I error is a false positive; a Type II error is a false negative.

Accuracy

\frac{TP+TN}{TP+TN+FP+FN}

The percentage of times that a model is correct

The model with the highest accuracy is not necessarily the best

Some errors (e.g., False Negative) can be more costly than others
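
With the hypothetical counts from the confusion-matrix sketch above, accuracy is just the fraction of correct predictions:

```python
tp, tn, fp, fn = 3, 3, 1, 1  # hypothetical counts

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.75
```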

Errors are not equal

  • Detecting the “Alexa” command?
  • Pregnancy detection
    • Cost of “false negatives”?
    • Cost of “false positives”?
  • Covid testing
    • Cost of “false negatives”?
    • Cost of “false positives”?
  • Law enforcement?

Balanced Accuracy

\frac{\frac{TP}{TP+FN}+\frac{TN}{FP+TN}}{2}

Average of single-class performance

Good to use when the distribution of data items in classes is imbalanced
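
The same hypothetical counts plugged into the formula above (scikit-learn's `balanced_accuracy_score` computes the equivalent quantity directly from labels):

```python
tp, tn, fp, fn = 3, 3, 1, 1  # hypothetical counts

tpr = tp / (tp + fn)  # performance on the positive class (sensitivity)
tnr = tn / (fp + tn)  # performance on the negative class (specificity)
balanced_accuracy = (tpr + tnr) / 2
print(balanced_accuracy)  # 0.75
```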

Balanced Accuracy Weighted

\frac{\frac{TP}{(TP+FN)*w}+\frac{TN}{(FP+TN)*(1-w)}}{2}

Weighted average of single-class performance

The weight depends on how frequent each class is.

Precision

\frac{TP}{TP + FP}

Among the examples we classified as positive, how many did we correctly classify?

Recall

\frac{TP}{TP + FN}

Among the positive examples, how many did we correctly classify?
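
Both quantities from the same hypothetical counts:

```python
tp, fp, fn = 3, 1, 1  # hypothetical counts

precision = tp / (tp + fp)  # of the items predicted positive, how many are truly positive
recall = tp / (tp + fn)     # of the truly positive items, how many did we find
print(precision, recall)  # 0.75 0.75
```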

F_1-Score

F_1 = 2 * \frac{1}{\frac{1}{P}+\frac{1}{R}}

The harmonic mean between precision and recall

What is the implicit assumption about the costs of errors?
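
A one-line sketch of the harmonic mean, using the hypothetical precision and recall from above:

```python
precision, recall = 0.75, 0.75  # hypothetical values

f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of precision and recall
print(f1)  # 0.75
```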

Sensitivity (true positive rate)

\frac{TP}{FN + TP}

Identification of the positively labeled data items

Same as recall

Specificity (true negative rate)

\frac{TN}{FP + TN}

Identification of the negatively labeled data items

Not the same as precision

Medical Test Model

  • Recall and sensitivity
    • How many were correctly diagnosed as sick among the sick people (positives)?
  • Precision
    • Among the people diagnosed as sick, how many were sick?
  • Specificity
    • Among the healthy people (negatives), how many were correctly diagnosed as healthy?

Spam Detection Model

  • Recall and sensitivity
    • How many were correctly deleted among the spam emails (positives)?
  • Precision
    • Among the deleted emails, how many were spam?
  • Specificity
    • Among the good emails (negatives), how many were correctly sent to the inbox?

Search Engine

  • Constraint: high precision
    • False positives are tolerable but should be minimised
  • Among the available models, pick the one with the highest recall
    • Metric: Recall at Precision = x% (see the sketch below)
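
A hedged sketch of that selection rule, assuming scikit-learn metrics and a hypothetical precision floor: among candidate models, discard those below the floor and keep the one with the highest recall.

```python
from sklearn.metrics import precision_score, recall_score

PRECISION_FLOOR = 0.95  # hypothetical value for the "Precision = x%" constraint

def pick_model(candidates, X_val, y_val):
    """Return the candidate with the highest recall among those meeting the precision floor."""
    best_model, best_recall = None, -1.0
    for model in candidates:
        y_pred = model.predict(X_val)
        if precision_score(y_val, y_pred) < PRECISION_FLOOR:
            continue  # constraint violated: too many false positives
        r = recall_score(y_val, y_pred)
        if r > best_recall:
            best_model, best_recall = model, r
    return best_model
```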

Metrics are also designed in a multi-stakeholder context

  • One team builds the model
    • Data scientists / ML engineers
  • Many teams will make use of it
    • e.g., product team, management team

Credits

Grokking Machine Learning. Luis G. Serrano. Manning, 2021

CIS 419/519 Applied Machine Learning. Eric Eaton, Dinesh Jayaraman.

Deep Learning Patterns and Practices. Andrew Ferlitsch. Manning, 2021

Machine Learning Design Patterns. Lakshmanan, Robinson, Munn. O'Reilly, 2020