Machine Learning models

This introductory part is about models.

Linear regression

Learning objectives:

  • Explain a loss function and how it works.
  • Define and describe how gradient descent finds the optimal model parameters.
  • Describe how to tune hyperparameters to efficiently train a linear model.

Linear regression is a statistical technique used to find the relationship between variables. In an ML context, linear regression finds the relationship between features and a label.

For example, suppose we want to predict a car's fuel efficiency in miles per gallon based on how heavy the car is, and we have the following dataset:

Pounds in 1000s (feature) | Miles per gallon (label)
3.5  | 18
3.69 | 15
3.44 | 18
3.43 | 16
4.34 | 15
4.42 | 14
2.37 | 24

If we plotted these points, we'd get the following graph:

IMAGE: Figure 1. Data points showing downward-sloping trend from left to right.

Figure 1. Car heaviness (in pounds) versus miles per gallon rating. As a car gets heavier, its miles per gallon rating generally decreases.

We could create our own model by drawing a best fit line through the points:

IMAGE: Figure 2. Data points with a best fit line drawn through them representing the model.

Figure 2. A best fit line drawn through the data from the previous figure.

Linear regression equation

In algebraic terms, the model would be defined as \( y = mx + b \), where

  • \(y\) is miles per gallon—the value we want to predict.
  • \(m\) is the slope of the line.
  • \( x \) is pounds—our input value.
  • \( b \) is the y-intercept.

In ML, we write the equation for a linear regression model as follows:

\[ y' = b + w_1x_1 \]

where:

  • \( y' \) is the predicted label—the output.
  • \( b \) is the bias of the model. Bias is the same concept as the y-intercept in the algebraic equation for a line. In ML, bias is sometimes referred to as \( w_0 \). Bias is a parameter of the model and is calculated during training.
  • \( w_1 \) is the weight of the feature. Weight is the same concept as the slope \( m \) in the algebraic equation for a line. Weight is a parameter of the model and is calculated during training.
  • \( x_1 \) is a feature — the input.

During training, the model calculates the weight and bias that produce the best model.

IMAGE: Figure 3. The equation y' = b + w1x1, with each component annotated with its purpose.

Figure 3. Mathematical representation of a linear model.

In our example, we'd calculate the weight and bias from the line we drew. The bias is 34 (where the line intersects the y-axis), and the weight is –4.6 (the slope of the line). The model would be defined as \( y' = 34 + (-4.6)(x_1) \), and we could use it to make predictions. For instance, using this model, a 4,000-pound car would have a predicted fuel efficiency of 15.6 miles per gallon.

IMAGE: Figure 4. Same graph as Figure 2, with the point (4, 15.6) highlighted.

Figure 4. Using the model, a 4,000-pound car has a predicted fuel efficiency of 15.6 miles per gallon.
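
As a quick check on this arithmetic, here is a minimal Python sketch of the hand-drawn model (the function name predict_mpg is our own, not part of the course):

```python
# Minimal sketch (ours): the hand-drawn model y' = 34 + (-4.6) * x1,
# where x1 is the car's weight in thousands of pounds.
def predict_mpg(pounds_in_1000s: float) -> float:
    bias = 34.0        # y-intercept read off the best fit line
    weight = -4.6      # slope read off the best fit line
    return bias + weight * pounds_in_1000s

print(predict_mpg(4.0))  # 4,000-pound car -> 15.6 miles per gallon
```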

Models with multiple features

Although the example in this section uses only one feature—the heaviness of the car—a more sophisticated model might rely on multiple features, each having a separate weight (\( w_1 \), \( w_2 \), etc.). For example, a model that relies on five features would be written as follows:

\( y' = b + w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5 \)

For example, a model that predicts gas mileage could additionally use features such as the following:

  • Engine displacement
  • Acceleration
  • Number of cylinders
  • Horsepower

This model would be written as follows:

IMAGE: Figure 5. Linear regression equation with five features.

Figure 5. A model with five features to predict a car's miles per gallon rating.
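
A model with multiple features is just a weighted sum plus a bias. The sketch below shows that structure; the weights and feature values are made-up placeholders, not fitted values:

```python
# Sketch (ours) of a five-feature linear model as a weighted sum plus a bias.
# The weights and feature values below are hypothetical, not fitted.
def predict(features, weights, bias):
    return bias + sum(w * x for w, x in zip(weights, features))

weights = [-4.6, -0.01, 0.3, -0.5, -0.02]   # hypothetical w1..w5
features = [2.37, 97.0, 18.5, 4.0, 66.0]    # hypothetical x1..x5 for one car
print(predict(features, weights, bias=34.0))
```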

By graphing a couple of these additional features, we can see that they also have a linear relationship to the label, miles per gallon:

IMAGE: Figure 6. Displacement in cubic centimeters graphed against miles per gallon showing a negative linear relationship.

Figure 6. A car's displacement in cubic centimeters and its miles per gallon rating. As a car's engine gets bigger, its miles per gallon rating generally decreases.

IMAGE: Figure 7. Acceleration from zero to sixty in seconds graphed against miles per gallon showing a positive linear relationship.

Figure 7. A car's acceleration and its miles per gallon rating. As a car's acceleration takes longer, the miles per gallon rating generally increases.


Linear regression: Loss

Loss is a numerical metric that describes how wrong a model's predictions are. Loss measures the distance between the model's predictions and the actual labels. The goal of training a model is to minimize the loss, reducing it to its lowest possible value.

In the following image, you can visualize loss as arrows drawn from the data points to the model. The arrows show how far the model's predictions are from the actual values.

IMAGE: Figure 8. Loss lines connect the data points to the model.

Figure 8. Loss is measured from the actual value to the predicted value.

Distance of loss

In statistics and machine learning, loss measures the difference between the predicted and actual values. Loss focuses on the distance between the values, not the direction. For example, if a model predicts 2, but the actual value is 5, we don't care that the loss is negative (\( 2-5=-3 \)). Instead, we care that the distance between the values is \( 3 \). Thus, all methods for calculating loss remove the sign.

The two most common methods to remove the sign are the following:

  • Take the absolute value of the difference between the actual value and the prediction.
  • Square the difference between the actual value and the prediction.

Types of loss

In linear regression, there are five main types of loss, which are outlined in the following table.

Loss type | Definition | Equation
L1 loss | The sum of the absolute values of the difference between the predicted values and the actual values. | \( \sum |actual\ value - predicted\ value| \)
Mean absolute error (MAE) | The average of L1 losses across a set of N examples. | \( \frac{1}{N} \sum |actual\ value - predicted\ value| \)
L2 loss | The sum of the squared difference between the predicted values and the actual values. | \( \sum (actual\ value - predicted\ value)^2 \)
Mean squared error (MSE) | The average of L2 losses across a set of N examples. | \( \frac{1}{N} \sum (actual\ value - predicted\ value)^2 \)
Root mean squared error (RMSE) | The square root of the mean squared error (MSE). | \( \sqrt{\frac{1}{N} \sum (actual\ value - predicted\ value)^2} \)

The functional difference between L1 loss and L2 loss (or between MAE and MSE) is squaring. When the difference between the prediction and label is large, squaring makes the loss even larger. When the difference is small (less than 1), squaring makes the loss even smaller.

Loss metrics like MAE and RMSE may be preferable to L2 loss or MSE in some use cases because they tend to be more human-interpretable, as they measure error using the same scale as the model's predicted value.

Note: MAE and RMSE can differ quite widely. MAE represents the average prediction error, whereas RMSE represents the "spread" of the errors, and is more skewed by larger errors.

When processing multiple examples at once, we recommend averaging the losses across all the examples, whether using MAE, MSE, or RMSE.
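
To make these definitions concrete, the following sketch (ours, using NumPy) evaluates each metric for the hand-drawn model \( y' = 34 + (-4.6)(x_1) \) on the seven-example car dataset:

```python
import numpy as np

# Sketch (ours): the loss metrics from the table, evaluated for the hand-drawn
# model y' = 34 - 4.6 * x1 on the seven-example car dataset.
pounds = np.array([3.5, 3.69, 3.44, 3.43, 4.34, 4.42, 2.37])
mpg = np.array([18, 15, 18, 16, 15, 14, 24])

errors = mpg - (34 - 4.6 * pounds)       # actual value - predicted value

l1 = np.sum(np.abs(errors))              # L1 loss
mae = np.mean(np.abs(errors))            # mean absolute error
l2 = np.sum(errors ** 2)                 # L2 loss
mse = np.mean(errors ** 2)               # mean squared error
rmse = np.sqrt(mse)                      # root mean squared error

print(f"L1={l1:.2f} MAE={mae:.2f} L2={l2:.2f} MSE={mse:.2f} RMSE={rmse:.2f}")
```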

Calculating loss example

Using the previous best fit line, we'll calculate L2 loss for a single example. From the best fit line, we had the following values for weight and bias:

  • \( \small{Weight: -4.6} \)
  • \( \small{Bias: 34} \)

If the model predicts that a 2,370-pound car gets 23.1 miles per gallon, but it actually gets 26 miles per gallon, we would calculate the L2 loss as follows:

Note: The formula uses 2.37 because the feature is expressed in thousands of pounds.

Value | Equation | Result
Prediction | \( bias + (weight * feature\ value) = 34 + (-4.6 * 2.37) \) | \( 23.1 \)
Actual value | \( label \) | \( 26 \)
L2 loss | \( (actual\ value - predicted\ value)^2 = (26 - 23.1)^2 \) | \( 8.41 \)

In this example, the L2 loss for that single data point is 8.41.

Choosing a loss

Deciding whether to use MAE or MSE can depend on the dataset and the way you want to handle certain predictions. Most feature values in a dataset typically fall within a distinct range. For example, cars are normally between 2,000 and 5,000 pounds and get between 8 and 50 miles per gallon. An 8,000-pound car, or a car that gets 100 miles per gallon, is outside the typical range and would be considered an outlier.

An outlier can also refer to how far off a model's predictions are from the real values. For instance, 3,000 pounds is within the typical car-weight range, and 40 miles per gallon is within the typical fuel-efficiency range. However, a 3,000-pound car that gets 40 miles per gallon would be an outlier in terms of the model's prediction because the model would predict that a 3,000-pound car would get around 20 miles per gallon.

When choosing the best loss function, consider how you want the model to treat outliers. For instance, MSE moves the model more toward the outliers, while MAE doesn't. L2 loss incurs a much higher penalty for an outlier than L1 loss. For example, the following images show a model trained using MAE and a model trained using MSE. The red line represents a fully trained model that will be used to make predictions. The outliers are closer to the model trained with MSE than to the model trained with MAE.

IMAGE: Figure 9. The model is tilted more toward the outliers.

Figure 9. MSE loss moves the model closer to the outliers.

IMAGE: Figure 10. The model is tilted further away from the outliers.

Figure 10. MAE loss keeps the model farther from the outliers.

Note the relationship between the model and the data:

  • MSE. The model is closer to the outliers but further away from most of the other data points.
  • MAE. The model is further away from the outliers but closer to most of the other data points.
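
One simple way to see this pull numerically: for a constant prediction, MSE is minimized by the mean of the labels and MAE by the median, so an outlier drags the MSE-optimal value much further. The 100-mpg car below is an invented outlier for illustration:

```python
import numpy as np

# Analogy sketch (ours): for a constant prediction, MSE is minimized by the mean
# and MAE by the median, so an outlier pulls the MSE-optimal value much further.
mpg = np.array([18, 15, 18, 16, 15, 14, 24])
mpg_with_outlier = np.append(mpg, 100)            # hypothetical 100-mpg outlier

print(np.mean(mpg), np.median(mpg))                            # ~17.1 and 16.0
print(np.mean(mpg_with_outlier), np.median(mpg_with_outlier))  # 27.5 and 17.0
```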

Choose MSE:

  • If you want to heavily penalize large errors.
  • If you believe the outliers are important and indicative of true data variance that the model should account for.

Note: The mathematical properties of MSE often make optimization smoother. Root Mean Squared Error (RMSE) is often used to get the error back into the same units as the label.

Choose MAE:

  • If your dataset has significant outliers that you don't want to overly influence the model. MAE is more robust.
  • If you prefer a loss function that is more directly interpretable as the average error magnitude.

In practice, your metric choice can also depend on the specific business problem and what kind of errors are more costly.

Check Your Understanding

Consider the following two plots of a linear model fit to a dataset:

IMAGE: A plot of 10 points. A line runs through 6 of the points. 2 points are 1 unit above the line; 2 other points are 1 unit below the line.

IMAGE: A plot of 10 points. A line runs through 8 of the points. 1 point is 2 units above the line; 1 other point is 2 units below the line.

Which of the two linear models shown in the preceding plots has the higher Mean Squared Error (MSE) when evaluated on the plotted data points?

The model on the left.

Click to answer The six examples on the line incur a total loss of 0. The four examples not on the line are not very far off the line, so even squaring their offset still yields a low value: $$MSE = \frac{0^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2 + 0^2 + 0^2} {10} = 0.4$$

The model on the right.

Click to answer The eight examples on the line incur a total loss of 0. However, although only two points lie off the line, both of those points are twice as far off the line as the outlier points in the left figure. Squared loss amplifies those differences, so an offset of two incurs a loss four times as great as an offset of one: $$MSE = \frac{0^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} {10} = 0.8$$


Linear regression: Gradient descent

Gradient descent is a mathematical technique that iteratively finds the weights and bias that produce the model with the lowest loss. Gradient descent finds the best weight and bias by repeating the following process for a number of user-defined iterations.

The model begins training with randomized weights and biases near zero, and then repeats the following steps:

  1. Calculate the loss with the current weight and bias.
  2. Determine which direction to move the weights and bias to reduce loss.
  3. Move the weight and bias values a small amount in the direction that reduces loss.
  4. Return to step one and repeat the process until the model can't reduce the loss any further.

The diagram below outlines the iterative steps gradient descent performs to find the weights and bias that produce the model with the lowest loss.

IMAGE: Figure 11. Illustration of the gradient descent process.

Figure 11. Gradient descent is an iterative process that finds the weights and bias that produce the model with the lowest loss.

The following section walks through the math behind gradient descent in more detail.

At a concrete level, we can walk through the gradient descent steps using a small dataset with seven examples for a car's heaviness in pounds and its miles per gallon rating:

Pounds in 1000s (feature) | Miles per gallon (label)
3.5  | 18
3.69 | 15
3.44 | 18
3.43 | 16
4.34 | 15
4.42 | 14
2.37 | 24

  1. The model starts training by setting the weight and bias to zero:

\[ \small{Weight:\ 0} \] \[ \small{Bias:\ 0} \] \[ \small{y = 0 + 0(x_1)} \]

  2. Calculate MSE loss with the current model parameters:

\[ \small{Loss = \frac{(18-0)^2 + (15-0)^2 + (18-0)^2 + (16-0)^2 + (15-0)^2 + (14-0)^2 + (24-0)^2}{7}} \] \[ \small{Loss= 303.71} \]

  3. Calculate the slope of the tangent to the loss function at each weight and the bias:

\[ \small{Weight\ slope: -119.7} \] \[ \small{Bias\ slope: -34.3} \]

Here's how those slopes are calculated.

To get the slope for the lines tangent to the weight and bias, we take the derivative of the loss function with respect to the weight and the bias, and then solve the equations.

We'll write the equation for making a prediction as:
\( f_{w,b}(x) = (w*x)+b \).

We'll write the actual value as: \( y \).

We'll calculate MSE using:
\( \frac{1}{M} \sum_{i=1}^{M} (f_{w,b}(x_{(i)}) - y_{(i)})^2 \)
where \(i\) represents the \(ith\) training example and \(M\) represents the number of examples.

Weight derivative

The derivative of the loss function with respect to the weight is written as:
\( \frac{\partial }{\partial w} \frac{1}{M} \sum_{i=1}^{M} (f_{w,b}(x_{(i)}) - y_{(i)})^2 \)

and evaluates to:
\( \frac{1}{M} \sum_{i=1}^{M} (f_{w,b}(x_{(i)}) - y_{(i)}) * 2x_{(i)} \)

First we sum each predicted value minus the actual value and then multiply it by two times the feature value. Then we divide the sum by the number of examples. The result is the slope of the line tangent to the value of the weight.

If we solve this equation with a weight and bias equal to zero, we get -119.7 for the line's slope.

Bias derivative

The derivative of the loss function with respect to the bias is written as:
\( \frac{\partial }{\partial b} \frac{1}{M} \sum_{i=1}^{M} (f_{w,b}(x_{(i)}) - y_{(i)})^2 \)

and evaluates to:
\( \frac{1}{M} \sum_{i=1}^{M} (f_{w,b}(x_{(i)}) - y_{(i)}) * 2 \)

First we sum each predicted value minus the actual value and then multiply it by two. Then we divide the sum by the number of examples. The result is the slope of the line tangent to the value of the bias.

If we solve this equation with a weight and bias equal to zero, we get -34.3 for the line's slope.

  4. Move a small amount in the direction of the negative slope to get the next weight and bias. For now, we'll arbitrarily define the "small amount" as 0.01:

\[ \small{New\ weight = old\ weight - (small\ amount * weight\ slope)} \] \[ \small{New\ bias = old\ bias - (small\ amount * bias\ slope)} \] \[ \small{New\ weight = 0 - (0.01)*(-119.7)} \] \[ \small{New\ bias = 0 - (0.01)*(-34.3)} \] \[ \small{New\ weight = 1.2} \] \[ \small{New\ bias = 0.34} \]

Use the new weight and bias to calculate the loss and repeat. Completing the process for six iterations, we'd get the following weights, biases, and losses:

Iteration | Weight | Bias | Loss (MSE)
1 | 0    | 0    | 303.71
2 | 1.20 | 0.34 | 170.84
3 | 2.05 | 0.59 | 103.17
4 | 2.66 | 0.78 | 68.70
5 | 3.09 | 0.91 | 51.13
6 | 3.40 | 1.01 | 42.17

You can see that the loss gets lower with each updated weight and bias. In this example, we stopped after six iterations. In practice, a model trains until it converges. When a model converges, additional iterations don't reduce loss more because gradient descent has found the weights and bias that nearly minimize the loss.
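
The following NumPy sketch is our transcription of this walkthrough, not the course's code; its rounded output should match the iteration table above:

```python
import numpy as np

# Sketch (ours) of the walkthrough above: full-batch gradient descent with a
# learning rate ("small amount") of 0.01. Rounded output should match the table.
x = np.array([3.5, 3.69, 3.44, 3.43, 4.34, 4.42, 2.37])
y = np.array([18, 15, 18, 16, 15, 14, 24])

w, b, learning_rate = 0.0, 0.0, 0.01
for iteration in range(1, 7):
    predictions = w * x + b
    loss = np.mean((predictions - y) ** 2)        # MSE with current parameters
    print(f"{iteration}  w={w:.2f}  b={b:.2f}  loss={loss:.2f}")
    w_slope = np.mean(2 * (predictions - y) * x)  # derivative w.r.t. the weight
    b_slope = np.mean(2 * (predictions - y))      # derivative w.r.t. the bias
    w -= learning_rate * w_slope                  # step against the slope
    b -= learning_rate * b_slope
```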

If the model continues to train past convergence, loss begins to fluctuate in small amounts as the model continually updates the parameters around their lowest values. This can make it hard to verify that the model has actually converged. To confirm the model has converged, you'll want to continue training until the loss has stabilized.

Model convergence and loss curves

When training a model, you'll often look at a loss curve to determine if the model has converged. The loss curve shows how the loss changes as the model trains. The following is what a typical loss curve looks like. Loss is on the y-axis and iterations are on the x-axis:

IMAGE: Figure 12. Graph of loss curve showing a steep decline and then a gentle decline.

Figure 12. Loss curve showing the model converging around the 1,000th-iteration mark.

You can see that loss dramatically decreases during the first few iterations, then gradually decreases before flattening out around the 1,000th-iteration mark. After 1,000 iterations, we can be mostly certain that the model has converged.

In the following figures, we draw the model at three points during the training process: the beginning, the middle, and the end. Visualizing the model's state at snapshots during the training process solidifies the link between updating the weights and bias, reducing loss, and model convergence.

In the figures, we use the derived weights and bias at a particular iteration to represent the model. In the graph with the data points and the model snapshot, blue loss lines from the model to the data points show the amount of loss. The longer the lines, the more loss there is.

In the following figure, we can see that around the second iteration the model would not be good at making predictions because of the high amount of loss.

IMAGE: Figure 13. Loss curve and corresponding graph of the model, which tilts away from the data points.

Figure 13. Loss curve and snapshot of the model at the beginning of the training process.

At around the 400th-iteration, we can see that gradient descent has found the weight and bias that produce a better model.

IMAGE: Figure 14. Loss curve and corresponding graph of the model, which cuts through the data points but not at the optimal angle.

Figure 14. Loss curve and snapshot of model about midway through training.

And at around the 1,000th-iteration, we can see that the model has converged, producing a model with the lowest possible loss.

IMAGE: Figure 15. Loss curve and corresponding graph of the model, which fits the data well.

Figure 15. Loss curve and snapshot of the model near the end of the training process.

Exercise: Check your understanding

What's the role of gradient descent in linear regression?

Gradient descent is an iterative process that finds the best weights and bias that minimize the loss.

Click to answer Correct.

Gradient descent helps to determine what type of loss to use when training a model, for example, L1 or L2.

Click to answer Wrong. Gradient descent is not involved in the selection of a loss function for model training.

Gradient descent removes outliers from the dataset to help the model make better predictions.

Click to answer Wrong. Gradient descent doesn't change the dataset.

Convergence and convex functions

The loss functions for linear models always produce a convex surface. As a result of this property, when a linear regression model converges, we know the model has found the weights and bias that produce the lowest loss.

If we graph the loss surface for a model with one feature, we can see its convex shape. The following is the loss surface for a hypothetical miles per gallon dataset. Weight is on the x-axis, bias is on the y-axis, and loss is on the z-axis:

IMAGE: Figure 16. 3-D graph of loss surface.

Figure 16. Loss surface that shows its convex shape.

In this example, a weight of -5.44 and bias of 35.94 produce the lowest loss at 5.54:

IMAGE: Figure 17. 3-D graph of loss surface, with (-5.44, 35.94, 5.54) at the bottom.

Figure 17. Loss surface showing the weight and bias values that produce the lowest loss.

A linear model converges when it's found the minimum loss. Therefore, additional iterations only cause gradient descent to move the weight and bias values in very small amounts around the minimum. If we graphed the weights and bias points during gradient descent, the points would look like a ball rolling down a hill, finally stopping at the point where there's no more downward slope.

IMAGE: Figure 18. Convex 3-D loss surface with gradient descent points moving to the lowest point.

Figure 18. Loss graph showing gradient descent points stopping at the lowest point on the graph.

Notice that the black loss points create the exact shape of the loss curve: a steep decline before gradually sloping down until they've reached the lowest point on the loss surface.

It's important to note that the model almost never finds the exact minimum for each weight and bias, but instead finds a value very close to it. It's also important to note that the minimum for the weights and bias doesn't correspond to zero loss, only to the values that produce the lowest loss for those parameters.

Using the weight and bias values that produce the lowest loss—in this case a weight of -5.44 and a bias of 35.94—we can graph the model to see how well it fits the data:

IMAGE: Figure 19. Graph of pounds in 1000s vs miles per gallon, with the model fitting the data.

Figure 19. Model graphed using the weight and bias values that produce the lowest loss.

This would be the best model for this dataset because no other weight and bias values produce a model with lower loss.


Linear regression: Hyperparameters

Hyperparameters are variables that control different aspects of training. Three common hyperparameters are:

  • Learning rate
  • Batch size
  • Epochs

In contrast, parameters are the variables, like the weights and bias, that are part of the model itself. In other words, hyperparameters are values that you control; parameters are values that the model calculates during training.

Learning rate

Learning rate is a floating point number you set that influences how quickly the model converges. If the learning rate is too low, the model can take a long time to converge. However, if the learning rate is too high, the model never converges, but instead bounces around the weights and bias that minimize the loss. The goal is to pick a learning rate that's neither too high nor too low so that the model converges quickly.

The learning rate determines the magnitude of the changes to make to the weights and bias during each step of the gradient descent process. The model multiplies the gradient by the learning rate and subtracts the result from the current parameters (the weight and bias values) to get the parameters for the next iteration. In the third step of gradient descent, the "small amount" to move in the direction of negative slope refers to the learning rate.

The size of each parameter update is proportional to the slope of the loss function: if the slope is large, the model takes a large step; if the slope is small, it takes a small step. For example, if the gradient's magnitude is 2.5 and the learning rate is 0.01, then the model will change the parameter by 0.025.

The ideal learning rate helps the model to converge within a reasonable number of iterations. In Figure 20, the loss curve shows the model significantly improving during the first 20 iterations before beginning to converge:

IMAGE: Figure 20. Loss curve that shows a steep slope before flattening out.

Figure 20. Loss graph showing a model trained with a learning rate that converges quickly.

In contrast, a learning rate that's too small can take too many iterations to converge. In Figure 21, the loss curve shows the model making only minor improvements after each iteration:

IMAGE: Figure 21. Loss curve that shows an almost 45-degree slope.

Figure 21. Loss graph showing a model trained with a small learning rate.

A learning rate that's too large never converges because each iteration either causes the loss to bounce around or continually increase. In Figure 22, the loss curve shows the model decreasing and then increasing loss after each iteration, and in Figure 23 the loss increases at later iterations:

IMAGE: Figure 22. Loss curve that shows jagged up-and-down line.

Figure 22. Loss graph showing a model trained with a learning rate that's too big, where the loss curve fluctuates wildly, going up and down as the iterations increase.

IMAGE: Figure 23. Loss curve that shows the loss increasing at later iterations

Figure 23. Loss graph showing a model trained with a learning rate that's too big, where the loss curve drastically increases in later iterations.
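
To see the three regimes numerically, here is a toy sketch (ours, not the course's example) that runs gradient descent on the one-parameter loss \( w^2 \), whose gradient is \( 2w \):

```python
# Toy sketch (ours): gradient descent on loss(w) = w**2, whose gradient is 2 * w,
# starting at w = 10 with three different learning rates.
def final_weight(learning_rate, steps=10, w=10.0):
    for _ in range(steps):
        w -= learning_rate * 2 * w   # gradient descent update
    return w

print(final_weight(0.001))  # too small: w has barely moved toward the minimum at 0
print(final_weight(0.1))    # reasonable: w shrinks toward 0 within a few steps
print(final_weight(1.1))    # too large: w overshoots and its magnitude keeps growing
```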

Exercise: Check your understanding

What is the ideal learning rate?

The ideal learning rate is problem-dependent.

Click to answer Each model and dataset will have its own ideal learning rate.

0.01

Click to answer Wrong.

1.0

Click to answer Wrong.

Batch size

Batch size is a hyperparameter that refers to the number of examples the model processes before updating its weights and bias. You might think that the model should calculate the loss for every example in the dataset before updating the weights and bias. However, when a dataset contains hundreds of thousands or even millions of examples, using the full batch isn't practical.

Two common techniques to get the right gradient on average without needing to look at every example in the dataset before updating the weights and bias are stochastic gradient descent and mini-batch stochastic gradient descent:

  • Stochastic gradient descent (SGD): Stochastic gradient descent uses only a single example (a batch size of one) per iteration. Given enough iterations, SGD works but is very noisy. "Noise" refers to variations during training that cause the loss to increase rather than decrease during an iteration. The term "stochastic" indicates that the one example comprising each batch is chosen at random.

    Notice in the following image how loss slightly fluctuates as the model updates its weights and bias using SGD, which can lead to noise in the loss graph:

    IMAGE: Figure 24. Steep loss curve that flattens out, but with a lot of tiny fluctuations.

    Figure 24. Model trained with stochastic gradient descent (SGD) showing noise in the loss curve.

    Note that using stochastic gradient descent can produce noise throughout the entire loss curve, not just near convergence.

  • Mini-batch stochastic gradient descent (mini-batch SGD): Mini-batch stochastic gradient descent is a compromise between full-batch and SGD. For \( N \) number of data points, the batch size can be any number greater than 1 and less than \( N \). The model chooses the examples included in each batch at random, averages their gradients, and then updates the weights and bias once per iteration.

    Determining the number of examples for each batch depends on the dataset and the available compute resources. In general, small batch sizes behave like SGD, and larger batch sizes behave like full-batch gradient descent.

    IMAGE: Figure 25. Steep loss curve that begins to flatten out, with much smaller fluctuations near convergence.

    Figure 25. Model trained with mini-batch SGD.

When training a model, you might think that noise is an undesirable characteristic that should be eliminated. However, a certain amount of noise can be a good thing. In later modules, you'll learn how noise can help a model generalize better and find the optimal weights and bias in a neural network.
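
The following sketch (ours) shows the mechanics of mini-batch SGD on the seven-example car dataset; the batch size of 3 and the 100 iterations are arbitrary choices for illustration:

```python
import numpy as np

# Sketch (ours) of mini-batch SGD on the car dataset. The batch size of 3 and
# the 100 iterations are arbitrary choices for illustration.
rng = np.random.default_rng(seed=0)
x = np.array([3.5, 3.69, 3.44, 3.43, 4.34, 4.42, 2.37])
y = np.array([18, 15, 18, 16, 15, 14, 24])

w, b, learning_rate, batch_size = 0.0, 0.0, 0.01, 3
for _ in range(100):
    idx = rng.choice(len(x), size=batch_size, replace=False)  # random mini-batch
    error = w * x[idx] + b - y[idx]
    w -= learning_rate * np.mean(2 * error * x[idx])  # averaged batch gradient
    b -= learning_rate * np.mean(2 * error)
print(w, b, np.mean((w * x + b - y) ** 2))  # final parameters and full-dataset MSE
```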

Epochs

During training, an epoch means that the model has processed every example in the training set once. For example, given a training set with 1,000 examples and a mini-batch size of 100 examples, it will take the model 10 iterations to complete one epoch.

Training typically requires many epochs. That is, the system needs to process every example in the training set multiple times.

The number of epochs is a hyperparameter you set before the model begins training. In many cases, you'll need to experiment with how many epochs it takes for the model to converge. In general, more epochs produce a better model, but training also takes more time.

IMAGE: Figure 26. A full batch is the entire dataset, a mini batch is a subset of the dataset, and an epoch is a full pass through ten mini batches.

Figure 26. Full batch versus mini batch.

The following table describes how batch size and epochs relate to the number of times a model updates its parameters.

Batch type | When weights and bias updates occur
Full batch | After the model looks at all the examples in the dataset. For instance, if a dataset contains 1,000 examples and the model trains for 20 epochs, the model updates the weights and bias 20 times, once per epoch.
Stochastic gradient descent | After the model looks at a single example from the dataset. For instance, if a dataset contains 1,000 examples and trains for 20 epochs, the model updates the weights and bias 20,000 times.
Mini-batch stochastic gradient descent | After the model looks at the examples in each batch. For instance, if a dataset contains 1,000 examples, the batch size is 100, and the model trains for 20 epochs, the model updates the weights and bias 200 times.
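
The update counts in the table are simple arithmetic; this short sketch just spells out the calculation:

```python
# Arithmetic from the table above (ours): number of parameter updates for a
# 1,000-example dataset trained for 20 epochs.
examples, epochs = 1_000, 20

full_batch_updates = epochs                       # one update per epoch -> 20
sgd_updates = epochs * examples                   # batch size 1 -> 20,000
mini_batch_updates = epochs * (examples // 100)   # batch size 100 -> 200

print(full_batch_updates, sgd_updates, mini_batch_updates)
```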

Exercise: Check your understanding

  1. What's the best batch size when using mini-batch SGD?

It depends

Click to answer The ideal batch size depends on the dataset and the available compute resources.

10 examples per batch

Click to answer Wrong.

100 examples per batch

Click to answer Wrong.

  2. Which of the following statements is true?

Larger batches are unsuitable for data with many outliers.

Click to answer This statement is false. By averaging more gradients together, larger batch sizes can help reduce the negative effects of having outliers in the data.

Doubling the learning rate can slow down training.

Click to answer This statement is true. Doubling the learning rate can result in a learning rate that is too large, and therefore cause the weights to "bounce around," increasing the amount of time needed to converge. As always, the best hyperparameters depend on your dataset and available compute resources.


Logistic Regression

Learning Objectives

  • Identify use cases for performing logistic regression.
  • Explain how logistic regression models use the sigmoid function to calculate probability.
  • Compare linear regression and logistic regression.
  • Explain why logistic regression uses log loss instead of squared loss.
  • Explain the importance of regularization when training logistic regression models.

This module introduces a new type of regression model called logistic regression that is designed to predict the probability of a given outcome.


Logistic regression: Calculating a probability with the sigmoid function

Sigmoid function

You might be wondering how a logistic regression model can ensure its output represents a probability, always outputting a value between 0 and 1. As it happens, there's a family of functions called logistic functions whose output has those same characteristics. The standard logistic function, also known as the sigmoid function (sigmoid means "s-shaped"), has the formula:

\[f(x) = \frac{1}{1 + e^{-x}}\]

where:

  • f(x) is the output of the sigmoid function.
  • e is Euler's number: a mathematical constant ≈ 2.71828.
  • x is the input to the sigmoid function.

Figure 1 shows the corresponding graph of the sigmoid function.

IMAGE: Sigmoid (s-shaped) curve plotted on the Cartesian coordinate plane, centered at the origin.

Figure 1. Graph of the sigmoid function. The curve approaches 0 as x values decrease to negative infinity, and 1 as x values increase toward infinity.

As the input, x, increases, the output of the sigmoid function approaches but never reaches 1. Similarly, as the input decreases, the sigmoid function's output approaches but never reaches 0.

The table below shows the output values of the sigmoid function for input values in the range –7 to 7. Note how quickly the sigmoid approaches 0 for decreasing negative input values, and how quickly the sigmoid approaches 1 for increasing positive input values.

However, no matter how large or how small the input value, the output will always be greater than 0 and less than 1.

Input | Sigmoid output
-7 | 0.001
-6 | 0.002
-5 | 0.007
-4 | 0.018
-3 | 0.047
-2 | 0.119
-1 | 0.269
0 | 0.500
1 | 0.731
2 | 0.881
3 | 0.952
4 | 0.982
5 | 0.993
6 | 0.997
7 | 0.999
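
The table can be reproduced by transcribing the sigmoid formula directly (a sketch, not library code):

```python
import math

# Direct transcription (ours) of the sigmoid formula; printing inputs -7..7
# reproduces the table above, rounded to three decimal places.
def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

for x in range(-7, 8):
    print(f"{x:>3}  {sigmoid(x):.3f}")
```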

Transforming linear output using the sigmoid function

The following equation represents the linear component of a logistic regression model:

\[z = b + w_1x_1 + w_2x_2 + \ldots + w_Nx_N\]

where:

  • z is the output of the linear equation, also called the log odds.
  • b is the bias.
  • The w values are the model's learned weights.
  • The x values are the feature values for a particular example.

To obtain the logistic regression prediction, the z value is then passed to the sigmoid function, yielding a value (a probability) between 0 and 1:

\[y' = \frac{1}{1 + e^{-z}}\]

where:

  • y' is the output of the logistic regression model.
  • e is Euler's number: a mathematical constant ≈ 2.71828.
  • z is the linear output (as calculated in the preceding equation).

In the equation \(z = b + w_1x_1 + w_2x_2 + \ldots + w_Nx_N\), z is referred to as the log-odds because if you start with the following sigmoid function (where \(y\) is the output of a logistic regression model, representing a probability):

\[y = \frac{1}{1 + e^{-z}}\]

And then solve for z:

\[ z = \ln\left(\frac{y}{1-y}\right) \]

Then z is defined as the natural logarithm of the ratio of the probabilities of the two possible outcomes: y and 1 – y.

Figure 2 illustrates how linear output is transformed to logistic regression output using these calculations.

IMAGE: Left: Line with the points (-7.5, –10), (-2.5, 0), and (0, 5) highlighted. Right: Sigmoid curve with the corresponding transformed points (-10, 0.00004), (0, 0.5), and (5, 0.9933) highlighted.

Figure 2. Left: graph of the linear function z = 2x + 5, with three points highlighted. Right: Sigmoid curve with the same three points highlighted after being transformed by the sigmoid function.

In Figure 2, a linear equation becomes input to the sigmoid function, which bends the straight line into an s-shape. Notice that the linear equation can output very big or very small values of z, but the output of the sigmoid function, y', is always between 0 and 1, exclusive. For example, the yellow square on the left graph has a z value of –10, but the sigmoid function in the right graph maps that –10 into a y' value of 0.00004.
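
The following sketch (ours) checks the three highlighted points from Figure 2, assuming the same linear function z = 2x + 5:

```python
import math

# Sketch (ours) of the transformation in Figure 2: the linear part z = 2x + 5
# followed by the sigmoid, checked at the three highlighted points.
def sigmoid(z: float) -> float:
    return 1 / (1 + math.exp(-z))

for x in (-7.5, -2.5, 0.0):
    z = 2 * x + 5                # linear output (the log odds)
    print(x, z, sigmoid(z))      # y' stays strictly between 0 and 1
# Expected: (-7.5, -10, ~0.00004), (-2.5, 0, 0.5), (0.0, 5, ~0.9933)
```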

Exercise: Check your understanding

A logistic regression model with three features has the following bias and weights:

\[\begin{align} b &= 1 \\ w_1 &= 2 \\ w_2 &= -1 \\ w_3 &= 5 \end{align}\]

Given the following input values:

\[\begin{align} x_1 &= 0 \\ x_2 &= 10 \\ x_3 &= 2 \end{align}\]

Answer the following two questions.

  1. What is the value of z for these input values?

–1

Click to answer Wrong.

0

Click to answer Wrong.

0.731

Click to answer Wrong.

1

Click to answer Correct! The linear equation defined by the weights and bias is \( z = 1 + 2x_1 - x_2 + 5x_3 \). Plugging the input values into the equation produces \( z = 1 + (2)(0) - (10) + (5)(2) = 1 \).

  2. What is the logistic regression prediction for these input values?

0.268

Click to answer Wrong

0.5

Click to answer Wrong

0.731

Click to answer As calculated in question 1 above, the log-odds for the input values is 1. Plugging that value for \( z \) into the sigmoid function:

\( y = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-1}} = \frac{1}{1 + 0.368} = \frac{1}{1.368} = 0.731 \)

1

Click to answer Wrong

Remember, the output of the sigmoid function will always be greater than 0 and less than 1.


Logistic regression: Loss and regularization

Log Loss

In the Linear regression module, you used squared loss (also called L2 loss) as the loss function. Squared loss works well for a linear model where the rate of change of the output values is constant. For example, given the linear model \(y' = b + 3x_1\), each time you increment the input value \(x_1\) by 1, the output value \(y'\) increases by 3.

However, the rate of change of a logistic regression model is not constant. As you saw in Calculating a probability, the sigmoid curve is s-shaped rather than linear. When the log-odds (\(z\)) value is closer to 0, small increases in \(z\) result in much larger changes to \(y\) than when \(z\) is a large positive or negative number. The following table shows the sigmoid function's output for input values from 5 to 10, as well as the corresponding precision required to capture the differences in the results.

Input | Logistic output | Required digits of precision
5 | 0.993 | 3
6 | 0.997 | 3
7 | 0.999 | 3
8 | 0.9997 | 4
9 | 0.9999 | 4
10 | 0.99998 | 5

If you used squared loss to calculate errors for the sigmoid function, as the output got closer and closer to 0 and 1, you would need more memory to preserve the precision needed to track these values.

Instead, the loss function for logistic regression is Log Loss. The Log Loss equation returns the logarithm of the magnitude of the change, rather than just the distance from data to prediction. Log Loss is calculated as follows:

\(\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i\log(y_i') + (1 - y_i)\log(1 - y_i') \right]\)

where:

  • \(N\) is the number of labeled examples in the dataset
  • \(i\) is the index of an example in the dataset (e.g., \((x_3, y_3)\) is the third example in the dataset)
  • \(y_i\) is the label for the \(i\)th example. Since this is logistic regression, \(y_i\) must either be 0 or 1.
  • \(y_i'\) is your model's prediction for the \(i\)th example (somewhere between 0 and 1), given the set of features in \(x_i\).

This form of the Log Loss function calculates the mean Log Loss across all points in the dataset. Using mean Log Loss (as opposed to total Log Loss) is desirable in practice, because it enables us to decouple tuning of the batch size and the learning rate.
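
Here is a minimal transcription of the mean Log Loss formula (ours); the labels and predictions are made-up values, and the helper assumes every prediction lies strictly between 0 and 1:

```python
import numpy as np

# Minimal transcription (ours) of mean Log Loss. Labels must be 0 or 1, and
# predictions must lie strictly between 0 and 1; the example values are made up.
def log_loss(labels, predictions):
    y = np.asarray(labels, dtype=float)
    y_pred = np.asarray(predictions, dtype=float)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

print(log_loss([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4]))  # lower is better
```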

Regularization in logistic regression

Regularization, a mechanism for penalizing model complexity during training, is extremely important in logistic regression modeling. Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in cases where the model has a large number of features. Consequently, most logistic regression models use one of the following two strategies to decrease model complexity:

  • L2 regularization
  • Early stopping, that is, limiting the number of training steps to halt training while loss is still decreasing.

Note: You'll learn more about regularization in the Datasets, Generalization, and Overfitting module of the course.

Classification

Learning objectives

  • Determine an appropriate threshold for a binary classification model.
  • Calculate and choose appropriate metrics to evaluate a binary classification model.
  • Interpret ROC and AUC.

Prerequisites:

This module assumes you are familiar with the concepts covered in the following modules:

In the Logistic regression module, you learned how to use the sigmoid function to convert raw model output to a value between 0 and 1 to make probabilistic predictions—for example, predicting that a given email has a 75% chance of being spam. But what if your goal is not to output probability but a category—for example, predicting whether a given email is "spam" or "not spam"?

Classification is the task of predicting which of a set of classes (categories) an example belongs to. In this module, you'll learn how to convert a logistic regression model that predicts a probability into a binary classification model that predicts one of two classes. You'll also learn how to choose and calculate appropriate metrics to evaluate the quality of a classification model's predictions. Finally, you'll get a brief introduction to multi-class classification problems, which are discussed in more depth later in the course.


Thresholds and the confusion matrix

Let's say you have a logistic regression model for spam-email detection that predicts a value between 0 and 1, representing the probability that a given email is spam. A prediction of 0.50 signifies a 50% likelihood that the email is spam, a prediction of 0.75 signifies a 75% likelihood that the email is spam, and so on.

You'd like to deploy this model in an email application to filter spam into a separate mail folder. But to do so, you need to convert the model's raw numerical output (e.g., 0.75) into one of two categories: "spam" or "not spam."

To make this conversion, you choose a threshold probability, called a classification threshold. Examples with a probability above the threshold value are then assigned to the positive class, the class you are testing for (here, spam). Examples with a lower probability are assigned to the negative class, the alternative class (here, not spam).

You may be wondering: what happens if the predicted score is equal to the classification threshold (for instance, a score of 0.5 where the classification threshold is also 0.5)? Handling for this case depends on the particular implementation chosen for the classification model. The Keras library predicts the negative class if the score and threshold are equal, but other tools/frameworks may handle this case differently.

Suppose the model scores one email as 0.99, predicting that email has a 99% chance of being spam, and another email as 0.51, predicting it has a 51% chance of being spam. If you set the classification threshold to 0.5, the model will classify both emails as spam. If you set the threshold to 0.95, only the email scoring 0.99 will be classified as spam.

While 0.5 might seem like an intuitive threshold, it's not a good idea if the cost of one type of wrong classification is greater than the other, or if the classes are imbalanced. If only 0.01% of emails are spam, or if misfiling legitimate emails is worse than letting spam into the inbox, labeling anything the model considers at least 50% likely to be spam as spam produces undesirable results.
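
A small sketch (ours) of applying these two thresholds to the two scores above; as noted earlier, tie handling varies by implementation, and here a score equal to the threshold falls in the negative class:

```python
# Sketch (ours) of applying a classification threshold to the two scores above.
# Tie handling varies by implementation; here a score equal to the threshold
# falls in the negative class.
def classify(score: float, threshold: float) -> str:
    return "spam" if score > threshold else "not spam"

for threshold in (0.5, 0.95):
    print(threshold, [classify(score, threshold) for score in (0.99, 0.51)])
# 0.5  -> both emails classified as spam
# 0.95 -> only the 0.99 email classified as spam
```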

Confusion matrix

The probability score is not reality, or ground truth. There are four possible outcomes for each output from a binary classifier. For the spam classifier example, if you lay out the ground truth as columns and the model's prediction as rows, the following table, called a confusion matrix, is the result:

                   | Actual positive | Actual negative
Predicted positive | True positive (TP): A spam email correctly classified as a spam email. These are the spam messages automatically sent to the spam folder. | False positive (FP): A not-spam email misclassified as spam. These are the legitimate emails that wind up in the spam folder.
Predicted negative | False negative (FN): A spam email misclassified as not-spam. These are spam emails that aren't caught by the spam filter and make their way into the inbox. | True negative (TN): A not-spam email correctly classified as not-spam. These are the legitimate emails that are sent directly to the inbox.

Notice that the total in each row gives all predicted positives (TP + FP) and all predicted negatives (FN + TN), regardless of validity. The total in each column, meanwhile, gives all real positives (TP + FN) and all real negatives (FP + TN) regardless of model classification.

When the total of actual positives is not close to the total of actual negatives, the dataset is imbalanced. An instance of an imbalanced dataset might be a set of thousands of photos of clouds, where the rare cloud type you are interested in, say, volutus clouds, only appears a few times.
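
A confusion matrix is just four counts. The sketch below (ours, with made-up labels and predictions) tallies them, treating 1 as the positive (spam) class:

```python
# Sketch (ours): tallying the four confusion-matrix counts from ground-truth
# labels and thresholded predictions, with 1 as the positive (spam) class.
# The example labels and predictions are made up.
def confusion_counts(actual, predicted):
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    return tp, fp, fn, tn

actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(actual, predicted))  # (3, 1, 1, 3)
```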

Effect of threshold on true and false positives and negatives

Different thresholds usually result in different numbers of true and false positives and true and false negatives. The following video explains why this is the case.

Try changing the threshold yourself here.

Check your understanding

  1. Imagine a phishing or malware classification model where phishing and malware websites are in the class labeled 1 (true) and harmless websites are in the class labeled 0 (false). This model mistakenly classifies a legitimate website as malware. What is this called?

A false positive

Click to answer A negative example (legitimate site) has been wrongly classified as a positive example (malware site).

A true positive

Click to answer A true positive would be a malware site correctly classified as malware.

A false negative

Click to answer A false negative would be a malware site incorrectly classified as a legitimate site.

A true negative

Click to answer A true negative would be a legitimate site correctly classified as a legitimate site.
  2. In general, what happens to the number of false positives when the classification threshold increases? What about true positives? Experiment with the slider above.

Both true and false positives decrease.

Click to answer As the threshold increases, the model will likely predict fewer positives overall, both true and false. A spam classifier with a threshold of .9999 will only label an email as spam if it considers the classification to be at least 99.99% likely, which means it is highly unlikely to mislabel a legitimate email, but also likely to miss actual spam email.

Both true and false positives increase.

Click to answer Using the slider above, try setting the threshold to 0.1, then dragging it to 0.9. What happens to the number of false positives and true positives?

True positives increase. False positives decrease.

Click to answer Using the slider above, try setting the threshold to 0.1, then dragging it to 0.9. What happens to the number of false positives and true positives?
  3. In general, what happens to the number of false negatives when the classification threshold increases? What about true negatives? Experiment with the slider above.

Both true and false negatives increase.

Click to answer As the threshold increases, the model will likely predict more negatives overall, both true and false. At a very high threshold, almost all emails, both spam and not-spam, will be classified as not-spam.

Both true and false negatives decrease.

Click to answer Using the slider above, try setting the threshold to 0.1, then dragging it to 0.9. What happens to the number of false negatives and true negatives?

True negatives increase. False negatives decrease.

Click to answer Using the slider above, try setting the threshold to 0.1, then dragging it to 0.9. What happens to the number of false negatives and true negatives?


Classification: Accuracy, recall, precision, and related metrics

True and false positives and negatives are used to calculate several useful metrics for evaluating models. Which evaluation metrics are most meaningful depends on the specific model and the specific task, the cost of different misclassifications, and whether the dataset is balanced or imbalanced.

All of the metrics in this section are calculated at a single fixed threshold, and change when the threshold changes. Very often, the user tunes the threshold to optimize one of these metrics.

Accuracy

Accuracy is the proportion of all classifications that were correct, whether positive or negative. It is mathematically defined as:

\[\text{Accuracy} = \frac{\text{correct classifications}}{\text{total classifications}} = \frac{TP+TN}{TP+TN+FP+FN}\]

In the spam classification example, accuracy measures the fraction of all emails correctly classified.

A perfect model would have zero false positives and zero false negatives, and therefore an accuracy of 1.0, or 100%.

Because it incorporates all four outcomes from the confusion matrix (TP, FP, TN, FN), given a balanced dataset, with similar numbers of examples in both classes, accuracy can serve as a coarse-grained measure of model quality. For this reason, it is often the default evaluation metric used for generic or unspecified models carrying out generic or unspecified tasks.

However, when the dataset is imbalanced, or where one kind of mistake (FN or FP) is more costly than the other, which is the case in most real-world applications, it's better to optimize for one of the other metrics instead.

For heavily imbalanced datasets, where one class appears very rarely, say 1% of the time, a model that predicts negative 100% of the time would score 99% on accuracy, despite being useless.

Note: In machine learning (ML), words like recall, precision, and accuracy have mathematical definitions that may differ from, or be more specific than, more commonly used meanings of the word.

Recall, or true positive rate

The true positive rate (TPR), or the proportion of all actual positives that were classified correctly as positives, is also known as recall.

Recall is mathematically defined as:

\[\text{Recall (or TPR)} = \frac{\text{correctly classified actual positives}}{\text{all actual positives}} = \frac{TP}{TP+FN}\]

False negatives are actual positives that were misclassified as negatives, which is why they appear in the denominator. In the spam classification example, recall measures the fraction of spam emails that were correctly classified as spam. This is why another name for recall is probability of detection: it answers the question "What fraction of spam emails are detected by this model?"

A hypothetical perfect model would have zero false negatives and therefore a recall (TPR) of 1.0, which is to say, a 100% detection rate.

In an imbalanced dataset where the number of actual positives is very low, recall is a more meaningful metric than accuracy because it measures the ability of the model to correctly identify all positive instances. For applications like disease prediction, correctly identifying the positive cases is crucial. A false negative typically has more serious consequences than a false positive. For a concrete example comparing recall and accuracy metrics, see the notes in the definition of recall.

False positive rate

The false positive rate (FPR) is the proportion of all actual negatives that were classified incorrectly as positives, also known as the probability of false alarm. It is mathematically defined as:

\[\text{FPR} = \frac{\text{incorrectly classified actual negatives}} {\text{all actual negatives}} = \frac{FP}{FP+TN}\]

False positives are actual negatives that were misclassified, which is why they appear in the denominator. In the spam classification example, FPR measures the fraction of legitimate emails that were incorrectly classified as spam, or the model's rate of false alarms.

A perfect model would have zero false positives and therefore a FPR of 0.0, which is to say, a 0% false alarm rate.

In an imbalanced dataset where the number of actual negatives is very, very low, say 1-2 examples in total, FPR is less meaningful and less useful as a metric.

Precision

Precision is the proportion of all the model's positive classifications that are actually positive. It is mathematically defined as:

\[\text{Precision} = \frac{\text{correctly classified actual positives}} {\text{everything classified as positive}} = \frac{TP}{TP+FP}\]

In the spam classification example, precision measures the fraction of emails classified as spam that were actually spam.

A hypothetical perfect model would have zero false positives and therefore a precision of 1.0.

In an imbalanced dataset where the number of actual positives is very, very low, say 1-2 examples in total, precision is less meaningful and less useful as a metric.

Precision improves as false positives decrease, while recall improves when false negatives decrease. But as seen in the previous section, increasing the classification threshold tends to decrease the number of false positives and increase the number of false negatives, while decreasing the threshold has the opposite effects. As a result, precision and recall often show an inverse relationship, where improving one of them worsens the other.

Try it yourself here.

What does NaN mean in the metrics?

NaN, or "not a number," appears when dividing by 0, which can happen with any of these metrics. When TP and FP are both 0, for example, the formula for precision has 0 in the denominator, resulting in NaN. While in some cases NaN can indicate perfect performance and could be replaced by a score of 1.0, it can also come from a model that is practically useless. A model that never predicts positive, for example, would have 0 TPs and 0 FPs and thus a calculation of its precision would result in NaN.

Choice of metric and tradeoffs

The metric(s) you choose to prioritize when evaluating the model and choosing a threshold depend on the costs, benefits, and risks of the specific problem. In the spam classification example, it often makes sense to prioritize recall, nabbing all the spam emails, or precision, trying to ensure that spam-labeled emails are in fact spam, or some balance of the two, above some minimum accuracy level.

Metric | Guidance
Accuracy | Use as a rough indicator of model training progress/convergence for balanced datasets. For model performance, use only in combination with other metrics. Avoid for imbalanced datasets; consider using another metric.
Recall (true positive rate) | Use when false negatives are more expensive than false positives.
False positive rate | Use when false positives are more expensive than false negatives.
Precision | Use when it's very important for positive predictions to be accurate.

(Optional, advanced) F1 score

The F1 score is the harmonic mean (a kind of average) of precision and recall.

Mathematically, it is given by:

\[\text{F1}=2*\frac{\text{precision * recall}}{\text{precision + recall}} = \frac{2\text{TP}}{2\text{TP + FP + FN}}\]

This metric balances the importance of precision and recall, and is preferable to accuracy for class-imbalanced datasets. When precision and recall both have perfect scores of 1.0, F1 will also have a perfect score of 1.0. More broadly, when precision and recall are close in value, F1 will be close to their value. When precision and recall are far apart, F1 will be similar to whichever metric is worse.
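
The following sketch (ours) computes all of the metrics in this section from the four confusion-matrix counts, using the counts from the first exercise question below as a check:

```python
# Sketch (ours): the metrics in this section computed from the four counts.
# The counts below come from the first exercise question that follows.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)                  # true positive rate
    fpr = fp / (fp + tn)                     # false positive rate
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, recall, fpr, precision, f1

print(metrics(tp=5, tn=6, fp=3, fn=2))  # recall = 5/7 ≈ 0.714, precision = 5/8 = 0.625
```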

Exercise: Check your understanding

A model outputs 5 TP, 6 TN, 3 FP, and 2 FN. Calculate the recall.

0.714

Click to answer Recall is calculated as \(\frac{TP}{TP+FN}=\frac{5}{7}\).

0.455

Click to answer Recall considers all actual positives, not all correct classifications. The formula for recall is \(\frac{TP}{TP+FN}\).

0.625

Click to answer Recall considers all actual positives, not all positive classifications. The formula for recall is \(\frac{TP}{TP+FN}\).

A model outputs 3 TP, 4 TN, 2 FP, and 1 FN. Calculate the precision.

0.6

Click to answer Precision is calculated as \(\frac{TP}{TP+FP}=\frac{3}{5}\).

0.75

Click to answer Precision considers all positive classifications, not all actual positives. The formula for precision is \(\frac{TP}{TP+FP}\).

0.429

Click to answer Precision considers all positive classifications, not all correct classifications. The formula for precision is \(\frac{TP}{TP+FP}\).

You're building a binary classifier that checks photos of insect traps for whether a dangerous invasive species is present. If the model detects the species, the entomologist (insect scientist) on duty is notified. Early detection of this insect is critical to preventing an infestation. A false alarm (false positive) is easy to handle: the entomologist sees that the photo was misclassified and marks it as such. Assuming an acceptable accuracy level, which metric should this model be optimized for?

Recall

Click to answer In this scenario, false alarms (FP) are low-cost, and false negatives are highly costly, so it makes sense to maximize recall, or the probability of detection.

False positive rate (FPR)

Click to answer In this scenario, false alarms (FP) are low-cost. Trying to minimize them at the risk of missing actual positives doesn't make sense.

Precision

Click to answer In this scenario, false alarms (FP) aren't particularly harmful, so trying to improve the correctness of positive classifications doesn't make sense.


Classification: ROC and AUC

The previous section presented a set of model metrics, all calculated at a single classification threshold value. But if you want to evaluate a model's quality across all possible thresholds, you need different tools.

Receiver operating characteristic curve (ROC)

The ROC curve is a visual representation of model performance across all thresholds. The long version of the name, receiver operating characteristic, is a holdover from WWII radar detection.

The ROC curve is drawn by calculating the true positive rate (TPR) and false positive rate (FPR) at every possible threshold (in practice, at selected intervals), then plotting TPR (y-axis) against FPR (x-axis). A perfect model, which at some threshold has a TPR of 1.0 and an FPR of 0.0, can be represented either by a point at (0, 1), if all other thresholds are ignored, or by the following:

IMAGE: Figure 1. A graph of TPR (y-axis) against FPR (x-axis) showing the performance of a perfect model: a line from (0,1) to (1,1).

Figure 1. ROC and AUC of a hypothetical perfect model.

Area under the curve (AUC)

The area under the ROC curve (AUC) represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative.

The ROC curve of the perfect model above encloses a square with sides of length 1, so its area under the curve (AUC) is 1.0. This means there is a 100% probability that the model will correctly rank a randomly chosen positive example higher than a randomly chosen negative example. In other words, looking at the spread of data points below, AUC gives the probability that the model will place a randomly chosen square to the right of a randomly chosen circle, independent of where the threshold is set.

IMAGE: Figure 2. Visualization of a classifier with AUC = 1.0, where all positive examples are ranked to the right of negative examples.

Figure 2. A spread of predictions for a binary classification model. AUC is the chance a randomly chosen square is positioned to the right of a randomly chosen circle.

In more concrete terms, a spam classifier with AUC of 1.0 always assigns a random spam email a higher probability of being spam than a random legitimate email. The actual classification of each email depends on the threshold that you choose.

For a binary classifier, a model that does exactly as well as random guesses or coin flips has an ROC curve that is a diagonal line from (0,0) to (1,1). Its AUC is 0.5, representing a 50% probability of correctly ranking a random positive and negative example.

In the spam classifier example, a spam classifier with AUC of 0.5 assigns a random spam email a higher probability of being spam than a random legitimate email only half the time.

IMAGE: Figure 3. A graph of TPR (y-axis) against FPR (x-axis) showing the performance of a random 50-50 guesser: a diagonal line from (0,0) to (1,1).

Figure 3. ROC and AUC of completely random guesses.

(Optional, advanced) Precision-recall curve

AUC and ROC work well for comparing models when the dataset is roughly balanced between classes. When the dataset is imbalanced, precision-recall curves (PRCs) and the area under those curves may offer a better comparative visualization of model performance. Precision-recall curves are created by plotting precision on the y-axis and recall on the x-axis across all thresholds.

IMAGE: Precision-recall curve example with downward convex curve from (0,1) to (1,0)
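As a rough illustration, the following sketch (scikit-learn, hypothetical labels and scores) computes the points that trace an ROC curve, the AUC, and a precision-recall curve:

```python
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve

# Hypothetical ground-truth labels and model scores (probability of the positive class).
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.65, 0.3]

# TPR and FPR at each candidate threshold; these points trace the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print("AUC:", roc_auc_score(y_true, y_scores))

# Precision and recall at each candidate threshold; these points trace the PR curve.
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_scores)
print(list(zip(recall, precision)))
```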

AUC and ROC for choosing model and threshold

AUC is a useful measure for comparing the performance of two different models, as long as the dataset is roughly balanced. The model with greater area under the curve is generally the better one.

IMAGE: Figure 4.a. ROC/AUC graph of a model with AUC=0.65. IMAGE: Figure 4.b. ROC/AUC graph of a model with AUC=0.93.

Figure 4. ROC and AUC of two hypothetical models. The curve on the right, with a greater AUC, represents the better of the two models.

The points on a ROC curve closest to (0,1) represent a range of the best-performing thresholds for the given model. As discussed in the Thresholds, Confusion matrix and Choice of metric and tradeoffs sections, the threshold you choose depends on which metric is most important to the specific use case. Consider the points A, B, and C in the following diagram, each representing a threshold:

IMAGE: Figure 5. A ROC curve of AUC=0.84 showing three points on the convex part of the curve closest to (0,1) labeled A, B, C in order.

Figure 5. Three labeled points representing thresholds.

If false positives (false alarms) are highly costly, it may make sense to choose a threshold that gives a lower FPR, like the one at point A, even if TPR is reduced. Conversely, if false positives are cheap and false negatives (missed true positives) highly costly, the threshold for point C, which maximizes TPR, may be preferable. If the costs are roughly equivalent, point B may offer the best balance between TPR and FPR.
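One simple heuristic, sketched below with scikit-learn and hypothetical scores, is to pick the threshold whose ROC point lies closest to (0, 1); this roughly corresponds to a balanced point like B:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and scores.
y_true = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.65, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Distance of each ROC point from the ideal corner (FPR=0, TPR=1).
distances = np.sqrt(fpr**2 + (1 - tpr)**2)
best = np.argmin(distances)

print("Balanced threshold:", thresholds[best], "TPR:", tpr[best], "FPR:", fpr[best])
# If false positives are costly, you'd instead favor a point with lower FPR;
# if false negatives are costly, a point with higher TPR.
```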


Exercise: Check your understanding

In practice, ROC curves are much less regular than the illustrations given above. Which of the following models, represented by their ROC curve and AUC, has the best performance?

IMAGE: ROC curve that is approximately a straight line from (0,0) to (1,1), with a few zig-zags. The curve has an AUC of 0.508.

This model's AUC of 0.508 is barely better than chance, so it isn't the best performer.

IMAGE: ROC curve that zig-zags up and to the right from (0,0) to (1,1). The curve has an AUC of 0.623.

An AUC of 0.623 is better than chance, but another model has a higher AUC.

IMAGE: ROC curve that arcs upward and then rightward from (0,0) to (1,1). The curve has an AUC of 0.77.

This model has the highest AUC, which corresponds with the best performance.

IMAGE: ROC curve that arcs rightward and then upward from (0,0) to (1,1). The curve has an AUC of 0.31.

With an AUC of 0.31, this model performs worse than chance.

Which of the following models performs worse than chance?

IMAGE: ROC curve that arcs rightward and then upward from (0,0) to (1,1). The curve has an AUC of 0.32.

This model has an AUC lower than 0.5, which means it performs worse than chance.

IMAGE: ROC curve that is approximately a straight line from (0,0) to (1,1), with a few zig-zags. The curve has an AUC of 0.508.

This model performs slightly better than chance.

IMAGE: ROC curve that is a diagonal straight line from (0,0) to (1,1). The curve has an AUC of 0.5.

This model performs the same as chance.

IMAGE: ROC curve that is composed of two perpendicular lines: a vertical line from (0,0) to (0,1) and a horizontal line from (0,1) to (1,1). This curve has an AUC of 1.0.

This is a hypothetical perfect classifier.

(Optional, advanced) Bonus question

Which of the following changes can be made to the worse-than-chance model in the previous question to cause it to perform better than chance?

Reverse the predictions, so predictions of 1 become 0, and predictions of 0 become 1.

If a binary classifier reliably puts examples in the wrong classes more often than chance, switching the class label immediately makes its predictions better than chance without having to retrain the model.

Have it always predict the negative class.

This may or may not improve performance above chance. Also, as discussed in the Accuracy section, this isn't a useful model.

Have it always predict the positive class.

This may or may not improve performance above chance. Also, as discussed in the Accuracy section, this isn't a useful model.

Imagine a situation where it's better to allow some spam to reach the inbox than to send a business-critical email to the spam folder. You've trained a spam classifier for this situation where the positive class is spam and the negative class is not-spam. Which of the following points on the ROC curve for your classifier is preferable?

IMAGE: A ROC curve of AUC=0.84 showing three points on the convex part of the curve that are close to (0,1). Point A is at approximately (0.25, 0.75). Point B is at approximately (0.30, 0.90), and is the point that maximizes TPR while minimizing FPR. Point C is at approximately (0.4, 0.95).

Point A

In this use case, it's better to minimize false positives, even if true positives also decrease.

Point B

This threshold balances true and false positives.

Point C

This threshold maximizes true positives (flags more spam) at a cost of more false positives (more legitimate emails flagged as spam).


Classification: Prediction bias

Calculating prediction bias is a quick check that can flag issues with the model or training data early on.

Prediction bias is the difference between the mean of a model's predictions and the mean of ground-truth labels in the data. A model trained on a dataset where 5% of the emails are spam should predict, on average, that 5% of the emails it classifies are spam. In other words, the mean of the labels in the ground-truth dataset is 0.05, and the mean of the model's predictions should also be 0.05. If this is the case, the model has zero prediction bias. Of course, the model might still have other problems.

If the model instead predicts that 50% of emails are spam, then something is wrong with the training dataset, the new dataset the model is applied to, or with the model itself. Any significant difference between the two means suggests that the model has some prediction bias.
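The check itself is just a comparison of two means. A minimal NumPy sketch with hypothetical labels and predictions:

```python
import numpy as np

# Hypothetical ground-truth labels (1 = spam) and model-predicted probabilities.
labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1] * 100)  # 10% spam
predictions = np.random.default_rng(0).uniform(0.0, 0.3, size=labels.size)

prediction_bias = predictions.mean() - labels.mean()
print(f"mean label: {labels.mean():.3f}, "
      f"mean prediction: {predictions.mean():.3f}, "
      f"prediction bias: {prediction_bias:+.3f}")
# A value far from zero suggests a problem with the data, the pipeline, or the model.
```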

Prediction bias can be caused by:

  • Biases or noise in the data, including biased sampling for the training set
  • Too-strong regularization, meaning that the model was oversimplified and lost some necessary complexity
  • Bugs in the model training pipeline
  • The set of features provided to the model being insufficient for the task


Classification: Multi-class classification

Multi-class classification can be treated as an extension of binary classification to more than two classes. If each example can only be assigned to one class, then the classification problem can be handled as a binary classification problem, where one class contains one of the multiple classes, and the other class contains all the other classes put together. The process can then be repeated for each of the original classes.

For example, in a three-class multi-class classification problem, where you're classifying examples with the labels A, B, and C, you could turn the problem into two separate binary classification problems. First, you might create a binary classifier that categorizes examples using the label A+B and the label C. Then, you could create a second binary classifier that reclassifies the examples that are labeled A+B using the label A and the label B.
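A minimal sketch of that decomposition, using scikit-learn's LogisticRegression as an arbitrary binary classifier and hypothetical data (an illustration, not the course's reference implementation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 2 features, labels A, B, or C.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.array(['A', 'B', 'C'])[rng.integers(0, 3, size=300)]

# Stage 1: binary classifier separating "A or B" from "C".
stage1 = LogisticRegression().fit(X, np.where(y == 'C', 'C', 'A+B'))

# Stage 2: binary classifier separating A from B, trained only on A/B examples.
ab_mask = y != 'C'
stage2 = LogisticRegression().fit(X[ab_mask], y[ab_mask])

def predict(x):
    # Route each example through the two binary classifiers.
    x = np.asarray(x).reshape(1, -1)
    if stage1.predict(x)[0] == 'C':
        return 'C'
    return stage2.predict(x)[0]

print(predict([0.5, -1.2]))
```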

An example of a multi-class problem is a handwriting classifier that takes an image of a handwritten digit and decides which digit, 0-9, is represented.

If class membership isn't exclusive, which is to say, an example can be assigned to multiple classes, this is known as a multi-label classification problem.


Data

Working with numerical data

Learning objectives

  • Understand feature vectors.
  • Explore your dataset's potential features visually and mathematically.
  • Identify outliers.
  • Understand four different techniques to normalize numerical data.
  • Understand binning and develop strategies for binning numerical data.
  • Understand the characteristics of good continuous numerical features.

ML practitioners spend far more time evaluating, cleaning, and transforming data than building models. Data is so important that this course devotes three entire units to the topic:

This unit focuses on numerical data, meaning integers or floating-point values that behave like numbers. That is, they are additive, countable, ordered, and so on. The next unit focuses on categorical data, which can include numbers that behave like categories. The third unit focuses on how to prepare your data to ensure high-quality results when training and evaluating your model.

Examples of numerical data include:

  • Temperature
  • Weight
  • The number of deer wintering in a nature preserve

In contrast, US postal codes, despite being five-digit or nine-digit numbers, don't behave like numbers or represent mathematical relationships. Postal code 40004 (in Nelson County, Kentucky) is not twice the quantity of postal code 20002 (in Washington, D.C.). These numbers represent categories, specifically geographic areas, and are considered categorical data.


Numerical data: How a model ingests data using feature vectors

Until now, we've given you the impression that a model acts directly on the rows of a dataset; however, models actually ingest data somewhat differently.

For example, suppose a dataset provides five columns, but only two of those columns (b and d) are features in the model. When processing the example in row 3, does the model simply grab the contents of the highlighted two cells (3b and 3d) as follows?

IMAGE: Figure 1. A model ingesting an example directly from a dataset. Columns b and d of Row 3 are highlighted.

Figure 1. Not exactly how a model gets its examples.

In fact, the model actually ingests an array of floating-point values called a feature vector. You can think of a feature vector as the floating-point values comprising one example.

IMAGE: Figure 2. The feature vector is an intermediary between the dataset and the model.

Figure 2. Closer to the truth, but not realistic.

However, feature vectors seldom use the dataset's raw values. Instead, you must typically process the dataset's values into representations that your model can better learn from. So, a more realistic feature vector might look something like this:

IMAGE: Figure 3. The feature vector contains two floating-point values: 0.13 and 0.47. A more realistic feature vector.

Figure 3. A more realistic feature vector.

Wouldn't a model produce better predictions by training from the actual values in the dataset than from altered values? Surprisingly, the answer is no.

You must determine the best way to represent raw dataset values as trainable values in the feature vector. This process is called feature engineering, and it is a vital part of machine learning. The most common feature engineering techniques are:

  • Normalization: Converting numerical values into a standard range.
  • Binning (also referred to as bucketing): Converting numerical values into buckets of ranges.

This unit covers normalizing and binning. The next unit, Working with categorical data, covers other forms of preprocessing, such as converting non-numerical data, like strings, to floating point values.

Every value in a feature vector must be a floating-point value. However, many features are naturally strings or other non-numerical values. Consequently, a large part of feature engineering is representing non-numerical values as numerical values. You'll see a lot of this in later modules.


Numerical data: First steps

Before creating feature vectors, we recommend studying numerical data in two ways:

  • Visualize your data in plots or graphs.
  • Get statistics about your data.

Visualize your data

Graphs can help you find anomalies or patterns hiding in the data. Therefore, before getting too far into analysis, look at your data graphically, either as scatter plots or histograms. View graphs not only at the beginning of the data pipeline, but also throughout data transformations. Visualizations help you continually check your assumptions.

We recommend working with pandas for visualization.
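For example, a minimal sketch using pandas plotting (which requires matplotlib) on a hypothetical dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset with one numeric feature and one numeric label.
df = pd.DataFrame({
    "temperature": [12.0, 18.5, 21.0, 25.5, 30.0, 8.0, 15.5, 22.0],
    "shoppers":    [120, 480, 560, 610, 300, 90, 350, 590],
})

df["temperature"].plot.hist(bins=5)              # histogram of one feature
df.plot.scatter(x="temperature", y="shoppers")   # feature versus label
plt.show()
```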

Note that certain visualization tools are optimized for certain data formats. A visualization tool that helps you evaluate protocol buffers may or may not be able to help you evaluate CSV data.

Statistically evaluate your data

Beyond visual analysis, we also recommend evaluating potential features and labels mathematically, gathering basic statistics such as:

  • mean and median
  • standard deviation
  • the values at the quartile divisions: the 0th, 25th, 50th, 75th, and 100th percentiles. The 0th percentile is the minimum value of this column; the 100th percentile is the maximum value of this column. (The 50th percentile is the median.)
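A minimal pandas sketch that gathers these statistics for a hypothetical feature, including the quartile deltas used later to hint at outliers:

```python
import pandas as pd

# Hypothetical numeric feature; 200 is a suspicious value.
values = pd.Series([5, 7, 8, 9, 10, 11, 12, 14, 200])

print(values.describe())        # count, mean, std, min, 25%, 50%, 75%, max
print("median:", values.median())

# A large asymmetry between the lower and upper quartile deltas hints at outliers.
q0, q25, q75, q100 = values.quantile([0.0, 0.25, 0.75, 1.0])
print("lower delta:", q25 - q0, "upper delta:", q100 - q75)
```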

Find outliers

An outlier is a value distant from most other values in a feature or label. Outliers often cause problems in model training, so finding outliers is important.

When the delta between the 0th and 25th percentiles differs significantly from the delta between the 75th and 100th percentiles, the dataset probably contains outliers.

Note: Don't over-rely on basic statistics. Anomalies can also hide in seemingly well-balanced data.

Outliers can fall into any of the following categories:

  • The outlier is due to a mistake. For example, perhaps an experimenter mistakenly entered an extra zero, or perhaps an instrument that gathered data malfunctioned. You'll generally delete examples containing mistake outliers.
  • The outlier is a legitimate data point, not a mistake. In this case, will your trained model ultimately need to infer good predictions on these outliers?
    • If yes, keep these outliers in your training set. After all, outliers in certain features sometimes mirror outliers in the label, so the outliers could actually help your model make better predictions. Be careful: extreme outliers can still hurt your model.
    • If no, delete the outliers or apply more invasive feature engineering techniques, such as clipping.


Numerical data: Normalization

After examining your data through statistical and visualization techniques, you should transform your data in ways that will help your model train more effectively. The goal of normalization is to transform features to be on a similar scale. For example, consider the following two features:

  • Feature X spans the range 154 to 24,917,482.
  • Feature Y spans the range 5 to 22.

These two features span very different ranges. Normalization might manipulate X and Y so that they span a similar range, perhaps 0 to 1.

Normalization provides the following benefits:

  • Helps models converge more quickly during training. When different features have different ranges, gradient descent can "bounce" and slow convergence. That said, more advanced optimizers like Adagrad and Adam protect against this problem by changing the effective learning rate over time.
  • Helps models infer better predictions. When different features have different ranges, the resulting model might make somewhat less useful predictions.
  • Helps avoid the "NaN trap" when feature values are very high. NaN is an abbreviation for not a number. When a value in a model exceeds the floating-point precision limit, the system sets the value to NaN instead of a number. When one number in the model becomes a NaN, other numbers in the model also eventually become a NaN.
  • Helps the model learn appropriate weights for each feature. Without feature scaling, the model pays too much attention to features with wide ranges and not enough attention to features with narrow ranges.

We recommend normalizing numeric features covering distinctly different ranges (for example, age and income). We also recommend normalizing a single numeric feature that covers a wide range, such as city population.

Warning: If you normalize a feature during training, you must also normalize that feature when making predictions.

Consider the following two features:

  • Feature A's lowest value is -0.5 and highest is +0.5.
  • Feature B's lowest value is -5.0 and highest is +5.0.

Feature A and Feature B have relatively narrow spans. However, Feature B's span is 10 times wider than Feature A's span. Therefore:

  • At the start of training, the model assumes that Feature B is ten times more "important" than Feature A.
  • Training will take longer than it should.
  • The resulting model may be suboptimal.

The overall damage due to not normalizing will be relatively small; however, we still recommend normalizing Feature A and Feature B to the same scale, perhaps -1.0 to +1.0.

Now consider two features with a greater disparity of ranges:

  • Feature C's lowest value is -1 and highest is +1.
  • Feature D's lowest value is +5000 and highest is +1,000,000,000.

If you don't normalize Feature C and Feature D, your model will likely be suboptimal. Furthermore, training will take much longer to converge or even fail to converge entirely!

This section covers three popular normalization methods:

  • linear scaling
  • Z-score scaling
  • log scaling

This section additionally covers clipping. Although not a true normalization technique, clipping does tame unruly numerical features into ranges that produce better models.

Linear scaling

Linear scaling (more commonly shortened to just scaling) means converting floating-point values from their natural range into a standard range—usually 0 to 1 or -1 to +1.


Use the following formula to scale to the standard range 0 to 1, inclusive:

\[ x' = (x - x_{min}) / (x_{max} - x_{min}) \]

where:

  • \(x'\) is the scaled value.
  • \(x\) is the original value.
  • \(x_{min}\) is the lowest value in the dataset of this feature.
  • \(x_{max}\) is the highest value in the dataset of this feature.

For example, consider a feature named quantity whose natural range spans 100 to 900. Suppose the natural value of quantity in a particular example is 300. Therefore, you can calculate the normalized value of 300 as follows:

  • \(x\) = 300
  • \(x_{min}\) = 100
  • \(x_{max}\) = 900
  x' = (300 - 100) / (900 - 100)
  x' = 200 / 800
  x' = 0.25
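The same calculation as a minimal NumPy sketch, using hypothetical raw values of the quantity feature:

```python
import numpy as np

# Hypothetical raw values of the quantity feature (natural range 100 to 900).
quantity = np.array([100.0, 300.0, 450.0, 900.0])

x_min, x_max = quantity.min(), quantity.max()
quantity_scaled = (quantity - x_min) / (x_max - x_min)

print(quantity_scaled)   # [0.     0.25   0.4375 1.    ]
```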

Linear scaling is a good choice when all of the following conditions are met:

  • The lower and upper bounds of your data don't change much over time.
  • The feature contains few or no outliers, and those outliers aren't extreme.
  • The feature is approximately uniformly distributed across its range. That is, a histogram would show roughly even bars for most values.

Suppose human age is a feature. Linear scaling is a good normalization technique for age because:

  • The approximate lower and upper bounds are 0 to 100.
  • age contains a relatively small percentage of outliers. Only about 0.3% of the population is over 100.
  • Although certain ages are somewhat better represented than others, a large dataset should contain sufficient examples of all ages.

Note: Most real-world features do not meet all of the criteria for linear scaling. Z-score scaling is typically a better normalization choice than linear scaling.

Exercise: Check your understanding

Suppose your model has a feature named net_worth that holds the net worth of different people. Would linear scaling be a good normalization technique for net_worth? Why or why not?

**Answer:** Linear scaling would be a poor choice for normalizing `net_worth`. This feature contains many outliers, and the values are not uniformly distributed across its primary range. Most people would be squeezed within a very narrow band of the overall range.

Z-score scaling

A Z-score is the number of standard deviations a value is from the mean. For example, a value that is 2 standard deviations greater than the mean has a Z-score of +2.0. A value that is 1.5 standard deviations less than the mean has a Z-score of -1.5.

Representing a feature with Z-score scaling means storing that feature's Z-score in the feature vector. For example, the following figure shows two histograms:

  • On the left, a classic normal distribution.
  • On the right, the same distribution normalized by Z-score scaling.

IMAGE: Figure 4. Two histograms: both showing normal distributions with the identical distribution. The first histogram, which contains raw data, has a mean of 200 and a standard deviation of 30. The second histogram, which contains a Z-score version of the first distribution, has a mean of 0 and a standard deviation of 1.

Figure 4. Raw data (left) versus Z-score (right) for a normal distribution.

Z-score scaling is also a good choice for data like that shown in the following figure, which has only a vaguely normal distribution.

IMAGE: Figure 5. Two histograms of identical shape, each showing a steep rise to a plateau and then a relatively quick descent followed by gradual decay. One histogram illustrates the distribution of the raw data; the other histogram illustrates the distribution of the raw data when normalized by Z-score scaling. The values on the X-axis of the two histograms are very different. The raw data histogram spans the domain 0 to 29,000, while the Z-score scaled histogram ranges from -1 to about +4.8

Figure 5. Raw data (left) versus Z-score scaling (right) for a non-classic normal distribution.

Use the following formula to normalize a value, \(x\), to its Z-score:

\[ x' = (x - \mu) / \sigma \]

where:

  • \(x'\) is the Z-score.
  • \(x\) is the raw value; that is, \(x\) is the value you are normalizing.
  • \(\mu\) is the mean.
  • \(\sigma\) is the standard deviation.

For example, suppose:

  • mean = 100
  • standard deviation = 20
  • original value = 130

Therefore:

  Z-score = (130 - 100) / 20
  Z-score = 30 / 20
  Z-score = +1.5
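The same calculation as a minimal NumPy sketch (hypothetical values chosen so that the mean is 100 and the standard deviation is 20, matching the worked example):

```python
import numpy as np

# Hypothetical raw feature values.
x = np.array([70.0, 100.0, 130.0, 90.0, 110.0])

mu = x.mean()        # 100.0
sigma = x.std()      # 20.0 (population std; use ddof=1 for the sample std)
z = (x - mu) / sigma

print(z)             # 130 maps to a Z-score of +1.5, matching the example above
```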

In a classic normal distribution:
  • About 68.27% of data has a Z-score between -1.0 and +1.0.
  • About 95.45% of data has a Z-score between -2.0 and +2.0.
  • About 99.73% of data has a Z-score between -3.0 and +3.0.
  • About 99.994% of data has a Z-score between -4.0 and +4.0.

So, data points with a Z-score less than -4.0 or more than +4.0 are rare, but are they truly outliers? Since outlier is a concept without a strict definition, no one can say for sure. Note that a dataset with a sufficiently large number of examples will almost certainly contain at least a few of these "rare" examples. For example, a feature with one billion examples conforming to a classic normal distribution could have as many as 60,000 examples with a Z-score outside the range -4.0 to +4.0.

Z-score is a good choice when the data follows a normal distribution or a distribution somewhat like a normal distribution.

Note that some distributions might be normal within the bulk of their range, but still contain extreme outliers. For example, nearly all of the points in a net_worth feature might fit neatly into 3 standard deviations, but a few examples of this feature could be hundreds of standard deviations away from the mean. In these situations, you can combine Z-score scaling with another form of normalization (usually clipping) to handle this situation.

Exercise: Check your understanding

Suppose your model trains on a feature named height that holds the adult heights of ten million women. Would Z-score scaling be a good normalization technique for height? Why or why not?

**Answer:** Z-score scaling would be a good normalization technique for `height` because this feature conforms to a normal distribution. Ten million examples implies a lot of outliers, probably enough for the model to learn patterns on very high or very low Z-scores.

Log scaling

Log scaling computes the logarithm of the raw value. In theory, the logarithm could be any base; in practice, log scaling usually calculates the natural logarithm (ln).

Use the following formula to normalize a value, \(x\), to its log:

\[ x' = ln(x) \]

where:

  • \(x'\) is the natural logarithm of \(x\).
  • \(x\) is the original value.

For example, suppose the original value is 54.598. Therefore, the log of the original value is about 4.0:

  4.0 = ln(54.598)

Log scaling is helpful when the data conforms to a power law distribution. Casually speaking, a power law distribution looks as follows:

  • Low values of X have very high values of Y.
  • As the values of X increase, the values of Y quickly decrease. Consequently, high values of X have very low values of Y.

Movie ratings are a good example of a power law distribution. In the following figure, notice:

  • A few movies have lots of user ratings. (Low values of X have high values of Y.)
  • Most movies have very few user ratings. (High values of X have low values of Y.)

Log scaling changes the distribution, which helps train a model that will make better predictions.

IMAGE: Figure 6. Two graphs comparing raw data versus the log of raw data. The raw data graph shows a lot of user ratings in the head, followed by a long tail. The log graph has a more even distribution.

Figure 6. Comparing a raw distribution to its log.

As a second example, book sales conform to a power law distribution because:

  • Most published books sell a tiny number of copies, maybe one or two hundred.
  • Some books sell a moderate number of copies, in the thousands.
  • Only a few bestsellers will sell more than a million copies.

Suppose you are training a linear model to find the relationship of, say, book covers to book sales. A linear model training on raw values would have to find something about the covers of books that sell a million copies that is 10,000 times more powerful than the covers of books that sell only 100 copies. However, log scaling all the sales figures makes the task far more feasible. For example, the log of 100 is:

  ~4.6 = ln(100)

while the log of 1,000,000 is:

  ~13.8 = ln(1,000,000)

So, the log of 1,000,000 is only about three times larger than the log of 100. You probably could imagine a bestseller book cover being about three times more powerful (in some way) than a tiny-selling book cover.

Clipping

Clipping is a technique to minimize the influence of extreme outliers. In brief, clipping usually caps (reduces) the value of outliers to a specific maximum value. Clipping is a strange idea, and yet, it can be very effective.

For example, imagine a dataset containing a feature named roomsPerPerson, which represents the number of rooms (total rooms divided by number of occupants) for various houses. The following plot shows that over 99% of the feature values conform to a normal distribution (roughly, a mean of 1.8 and a standard deviation of 0.7). However, the feature contains a few outliers, some of them extreme:

IMAGE: Figure 7. A plot of roomsPerPerson in which nearly all the values are clustered between 0 and 4, but there's a verrrrry long tail reaching all the way out to 17 rooms per person

Figure 7. Mainly normal, but not completely normal.

How can you minimize the influence of those extreme outliers? Well, the histogram is not an even distribution, a normal distribution, or a power law distribution. What if you simply cap or clip the maximum value of roomsPerPerson at an arbitrary value, say 4.0?

IMAGE: A plot of roomsPerPerson in which all values lie between 0 and 4.0. The plot is bell-shaped, but there's an anomalous hill at 4.0

Figure 8. Clipping feature values at 4.0.

Clipping the feature value at 4.0 doesn't mean that your model ignores all values greater than 4.0. Rather, it means that all values that were greater than 4.0 now become 4.0. This explains the peculiar hill at 4.0. Despite that hill, the scaled feature set is now more useful than the original data.

Wait a second! Can you really reduce every outlier value to some arbitrary upper threshold? When training a model, yes.

You can also clip values after applying other forms of normalization. For example, suppose you use Z-score scaling, but a few outliers have absolute values far greater than 3. In this case, you could:

  • Clip Z-scores greater than 3 to become exactly 3.
  • Clip Z-scores less than -3 to become exactly -3.
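A minimal NumPy sketch of both kinds of clipping described above (hypothetical values):

```python
import numpy as np

# Clipping raw feature values at an arbitrary maximum (as with roomsPerPerson at 4.0).
rooms_per_person = np.array([1.1, 1.8, 2.5, 3.9, 17.0])
print(np.clip(rooms_per_person, None, 4.0))   # [1.1 1.8 2.5 3.9 4. ]

# Clipping Z-scores to the range [-3, +3] after Z-score scaling.
z_scores = np.array([-5.2, -1.0, 0.3, 2.4, 8.7])
print(np.clip(z_scores, -3.0, 3.0))           # [-3.  -1.   0.3  2.4  3. ]
```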

Clipping prevents your model from overindexing on unimportant data. However, some outliers are actually important, so clip values carefully.

Summary of normalization techniques

The best normalization technique is one that works well in practice, so try new ideas if you think they'll work well on your feature distribution.

| Normalization technique | Formula | When to use |
|---|---|---|
| Linear scaling | \( x' = \frac{x - x_{min}}{x_{max} - x_{min}} \) | When the feature is roughly uniformly distributed across its range (flat-shaped). |
| Z-score scaling | \( x' = \frac{x - \mu}{\sigma} \) | When the feature is normally distributed, with a peak close to the mean (bell-shaped). |
| Log scaling | \( x' = \log(x) \) | When the feature distribution is heavily skewed, with a long tail on one or both sides (heavy-tail-shaped). |
| Clipping | If \(x > max\), set \(x' = max\); if \(x < min\), set \(x' = min\) | When the feature contains extreme outliers. |

Exercise: Test your knowledge

Which technique would be most suitable for normalizing a feature with the following distribution?

IMAGE: A histogram showing a cluster of data with values in the range 0 to 200,000. The number of data points gradually increases for the range from 0 to 100,000 and then gradually decreases from 100,000 to 200,000.

Z-score scaling

The data points generally conform to a normal distribution, so Z-score scaling will force them into the range –3 to +3.

Linear scaling

Review the discussions of the normalization techniques on this page, and try again.

Log scaling

Review the discussions of the normalization techniques on this page, and try again.

Clipping

Review the discussions of the normalization techniques on this page, and try again.

Suppose you are developing a model that predicts a data center's productivity based on the temperature measured inside the data center. Almost all of the temperature values in your dataset fall between 15 and 30 (Celsius), with the following exceptions:

  • Once or twice per year, on extremely hot days, a few values between 31 and 45 are recorded in temperature.
  • Every 1,000th point in temperature is set to 1,000 rather than the actual temperature.

Which would be a reasonable normalization technique for temperature?

Clip the outlier values between 31 and 45, but delete the outliers with a value of 1,000

The values of 1,000 are mistakes, and should be deleted rather than clipped.

The values between 31 and 45 are legitimate data points. Clipping would probably be a good idea for these values, assuming the dataset doesn't contain enough examples in this temperature range to train the model to make good predictions. Note, however, that after clipping, the model would make the same prediction during inference for a temperature of 45 as for a temperature of 35.

Clip all the outliers

Review the discussions of the normalization techniques on this page, and try again.

Delete all the outliers

Review the discussions of the normalization techniques on this page, and try again.

Delete the outlier values between 31 and 45, but clip the outliers with a value of 1,000.

Review the discussions of the normalization techniques on this page, and try again.


Numerical data: Binning

Binning (also called bucketing) is a feature engineering technique that groups different numerical subranges into bins or buckets. In many cases, binning turns numerical data into categorical data. For example, consider a feature named X whose lowest value is 15 and highest value is 425. Using binning, you could represent X with the following five bins:

  • Bin 1: 15 to 34
  • Bin 2: 35 to 117
  • Bin 3: 118 to 279
  • Bin 4: 280 to 392
  • Bin 5: 393 to 425

Bin 1 spans the range 15 to 34, so every value of X between 15 and 34 ends up in Bin 1. A model trained on these bins will react no differently to X values of 17 and 29 since both values are in Bin 1.

The feature vector represents the five bins as follows:

| Bin number | Range | Feature vector |
|---|---|---|
| 1 | 15-34 | [1.0, 0.0, 0.0, 0.0, 0.0] |
| 2 | 35-117 | [0.0, 1.0, 0.0, 0.0, 0.0] |
| 3 | 118-279 | [0.0, 0.0, 1.0, 0.0, 0.0] |
| 4 | 280-392 | [0.0, 0.0, 0.0, 1.0, 0.0] |
| 5 | 393-425 | [0.0, 0.0, 0.0, 0.0, 1.0] |

Even though X is a single column in the dataset, binning causes a model to treat X as five separate features. Therefore, the model learns separate weights for each bin.
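A minimal pandas sketch (hypothetical values of X) that maps raw values into the five bins above and one-hot encodes them:

```python
import pandas as pd

# Hypothetical raw values of feature X (range 15 to 425).
x = pd.Series([17, 29, 115, 250, 400])

# Bin edges corresponding to the five bins described above.
bins = [15, 35, 118, 280, 393, 426]
labels = ["bin1", "bin2", "bin3", "bin4", "bin5"]

binned = pd.cut(x, bins=bins, labels=labels, right=False)
one_hot = pd.get_dummies(binned).astype(float)

print(one_hot)
# 17 and 29 both land in bin1, so the model treats them identically.
```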

Binning is a good alternative to scaling or clipping when either of the following conditions is met:

  • The overall linear relationship between the feature and the label is weak or nonexistent.
  • The feature values are clustered.

Binning can feel counterintuitive, given that the model in the previous example treats the values 37 and 115 identically. But when a feature appears more clumpy than linear, binning is a much better way to represent the data.

Binning example: number of shoppers versus temperature

Suppose you are creating a model that predicts the number of shoppers by the outside temperature for that day. Here's a plot of the temperature versus the number of shoppers:

IMAGE: Figure 9. A scatter plot of 45 points. The 45 points naturally cluster into three groups.

Figure 9. A scatter plot of 45 points.

The plot shows, not surprisingly, that the number of shoppers was highest when the temperature was most comfortable.

You could represent the feature as raw values: a temperature of 35.0 in the dataset would be 35.0 in the feature vector. Is that the best idea?

During training, a linear regression model learns a single weight for each feature. Therefore, if temperature is represented as a single feature, then a temperature of 35.0 would have five times the influence (or one-fifth the influence) in a prediction as a temperature of 7.0. However, the plot doesn't really show any sort of linear relationship between the label and the feature value.

The graph suggests three clusters in the following subranges:

  • Bin 1 is the temperature range 4-11.
  • Bin 2 is the temperature range 12-26.
  • Bin 3 is the temperature range 27-36.

IMAGE: Figure 10. The same scatter plot of 45 points as in the previous figure, but with vertical lines to make the bins more obvious.

Figure 10. The scatter plot divided into three bins.

The model learns separate weights for each bin.

While it's possible to create more than three bins, even a separate bin for each temperature reading, this is often a bad idea for the following reasons:

  • A model can only learn the association between a bin and a label if there are enough examples in that bin. In the given example, each of the 3 bins contains at least 10 examples, which might be sufficient for training. With 33 separate bins, none of the bins would contain enough examples for the model to train on.
  • A separate bin for each temperature results in 33 separate temperature features. However, you typically should minimize the number of features in a model.

Exercise: Check your understanding

The following plot shows the median home price for each 0.2 degrees of latitude for the mythical country of Freedonia:

IMAGE: Figure 11. A plot of home values per latitude. The lowest house value is about 327 and the highest is 712. The latitudes span 41.0 to 44.8, with a dot representing the median house value for every 0.2 degrees of latitude. The pattern is highly irregular, but with two distinct clusters (one cluster between latitude 41.0 and 41.8, and another cluster between latitude 42.6 and 43.4).

Figure 11. Median home value per 0.2 degrees latitude.

The graphic shows a nonlinear pattern between home value and latitude, so representing latitude as its floating-point value is unlikely to help a model make good predictions. Perhaps bucketing latitudes would be a better idea?

What would be the best bucketing strategy?

Don't bucket.

Given the randomness of most of the plot, this is probably the best strategy.

Create four buckets:

  • 41.0 to 41.8
  • 42.0 to 42.6
  • 42.8 to 43.4
  • 43.6 to 44.8

It would be hard for a model to find a single predictive weight for all the homes in the second bin or the fourth bin, which contain few examples.

Make each data point its own bucket.

This would only be helpful if the training set contains enough examples for each 0.2 degrees of latitude. In general, homes tend to cluster near cities and be relatively scarce in other places.

Quantile bucketing

Quantile bucketing creates bucketing boundaries such that the number of examples in each bucket is exactly or nearly equal. Quantile bucketing mostly hides the outliers.

To illustrate the problem that quantile bucketing solves, consider the equally spaced buckets shown in the following figure, where each bucket represents a span of exactly 10,000 dollars. Notice that the bucket from 0 to 10,000 contains dozens of examples, but the bucket from 50,000 to 60,000 contains only 5 examples. Consequently, the model has enough examples to train on for the 0 to 10,000 bucket, but not enough for the 50,000 to 60,000 bucket.

IMAGE: Figure 13. A plot of car price versus the number of cars sold at that price. The number of cars sold peaks at a price of 6,000. Above a price of 6,000, the number of cars sold generally decreases, with very few cars sold between a price of 40,000 to 60,000. The plot is divided into 6 equally-sized buckets, each with a range of 10,000. So, the first bucket contains all the cars sold between a price of 0 and a price of 10,000, the second bucket contains all the cars sold between a price of 10,001 and 20,000, and so on. The first bucket contain many examples; each subsequent bucket contains fewer examples.

Figure 13. Some buckets contain a lot of cars; other buckets contain very few cars.

In contrast, the following figure uses quantile bucketing to divide car prices into bins with approximately the same number of examples in each bucket. Notice that some of the bins encompass a narrow price span while others encompass a very wide price span.

IMAGE: Figure 14. Same as previous figure, except with quantile buckets. That is, the buckets now have different sizes. The first bucket contains the cars sold from 0 to 4,000, the second bucket contains the cars sold from 4,001 to 6,000. The sixth bucket contains the cars sold from 25,001 to 60,000. The number of cars in each bucket is now about the same.

Figure 14. Quantile bucketing gives each bucket about the same number of cars.

Bucketing with equal intervals works for many data distributions. For skewed data, however, try quantile bucketing. Equal intervals give extra information space to the long tail while compacting the large torso into a single bucket. Quantile buckets give extra information space to the large torso while compacting the long tail into a single bucket.
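A minimal pandas sketch contrasting equal-interval buckets with quantile buckets on a hypothetical skewed price distribution:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed car prices: many cheap cars, a long expensive tail.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=9.0, sigma=0.6, size=1_000))

# Equal-interval buckets: most examples crowd into the first bucket or two.
equal_width = pd.cut(prices, bins=6)
print(equal_width.value_counts().sort_index())

# Quantile buckets: roughly the same number of examples in every bucket.
quantile = pd.qcut(prices, q=6)
print(quantile.value_counts().sort_index())
```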

Numerical data: Scrubbing

Apple trees produce a mixture of great fruit and wormy messes. Yet the apples in high-end grocery stores display 100% perfect fruit. Between orchard and grocery, someone spends significant time removing the bad apples or spraying a little wax on the salvageable ones. As an ML engineer, you'll spend enormous amounts of your time tossing out bad examples and cleaning up the salvageable ones. Even a few bad apples can spoil a large dataset.

Many examples in datasets are unreliable due to one or more of the following problems:

| Problem category | Example |
|---|---|
| Omitted values | A census taker fails to record a resident's age. |
| Duplicate examples | A server uploads the same logs twice. |
| Out-of-range feature values | A human accidentally types an extra digit. |
| Bad labels | A human evaluator mislabels a picture of an oak tree as a maple. |

You can write a program or script to detect any of the following problems:

  • Omitted values
  • Duplicate examples
  • Out-of-range feature values

For example, the following dataset contains six repeated values:

IMAGE: Figure 15. The first six values are repeated. The final eight values are not.

Figure 15. The first six values are repeated.

As another example, suppose the temperature range for a certain feature must be between 10 and 30 degrees, inclusive. But accidents happen—perhaps a thermometer is temporarily exposed to the sun which causes a bad outlier. Your program or script must identify temperature values less than 10 or greater than 30:

IMAGE: Figure 16. Nineteen in-range values and one out-of-range value.

Figure 16. An out-of-range value.
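A minimal pandas sketch of such a script, applied to a hypothetical temperature feature that must lie between 10 and 30:

```python
import pandas as pd

# Hypothetical dataset with a temperature feature that must lie between 10 and 30.
df = pd.DataFrame({
    "temperature": [12.1, 12.1, 18.4, 27.0, 58.3, 21.5, None],
    "label":       [0,    0,    1,    1,    0,    1,    0],
})

print("duplicate rows:\n", df[df.duplicated()])
print("omitted values:\n", df[df["temperature"].isna()])

out_of_range = (df["temperature"] < 10) | (df["temperature"] > 30)
print("out-of-range values:\n", df[out_of_range])
```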

When labels are generated by multiple people, we recommend statistically determining whether each rater generated equivalent sets of labels. Perhaps one rater was a harsher grader than the other raters or used a different set of grading criteria?

Once detected, you typically "fix" examples that contain bad features or bad labels by removing them from the dataset or imputing their values. For details, see the Data characteristics section of the Datasets, generalization, and overfitting module.

Numerical data: Qualities of good numerical features

This unit has explored ways to map raw data into suitable feature vectors. Good numerical features share the qualities described in this section.

Clearly named

Each feature should have a clear, sensible, and obvious meaning to any human on the project. For example, the meaning of the following feature value is confusing:

Not recommended

house_age: 851472000

In contrast, the following feature name and value are far clearer:

Recommended

house_age_years: 27

Note: Although your co-workers will rebel against confusing feature and label names, the model won't care (assuming you normalize values properly).

Checked or tested before training

Although this module has devoted a lot of time to outliers, the topic is important enough to warrant one final mention. In some cases, bad data (rather than bad engineering choices) causes unclear values. For example, the following user_age_in_years came from a source that didn't check for appropriate values:

Not recommended

user_age_in_years: 224

But people can be 24 years old:

Recommended

user_age_in_years: 24

Check your data!

Sensible

A "magic value" is a purposeful discontinuity in an otherwise continuous feature. For example, suppose a continuous feature named watch_time_in_seconds can hold any floating-point value between 0 and 30 but represents the absence of a measurement with the magic value -1:

Not recommended

watch_time_in_seconds: -1

A watch_time_in_seconds of -1 would force the model to try to figure out what it means to watch a movie backwards in time. The resulting model would probably not make good predictions.

A better technique is to create a separate Boolean feature that indicates whether or not a watch_time_in_seconds value is supplied. For example:

Recommended

watch_time_in_seconds: 4.82
is_watch_time_in_seconds_defined=True

watch_time_in_seconds: 0
is_watch_time_in_seconds_defined=False

This is a way to handle a continuous feature with missing values. Now consider a discrete numerical feature, like product_category, whose values must belong to a finite set of values. In this case, when a value is missing, signify that missing value using a new value in the finite set. With a discrete feature, the model will learn a different weight for each value, including a weight for the missing-value category.

For example, we can imagine possible values fitting in the set:

{0: 'electronics', 1: 'books', 2: 'clothing', 3: 'missing_category'}.
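A minimal pandas sketch of both patterns, using the hypothetical watch_time_in_seconds and product_category features:

```python
import pandas as pd

df = pd.DataFrame({
    # Continuous feature with a missing value.
    "watch_time_in_seconds": [4.82, 12.5, None, 7.1],
    # Discrete feature with a missing value.
    "product_category": ["books", None, "clothing", "electronics"],
})

# Continuous feature: add a Boolean indicator instead of a magic value like -1.
df["is_watch_time_in_seconds_defined"] = df["watch_time_in_seconds"].notna()
df["watch_time_in_seconds"] = df["watch_time_in_seconds"].fillna(0.0)

# Discrete feature: represent missing values as their own category.
df["product_category"] = df["product_category"].fillna("missing_category")

print(df)
```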


Numerical data: Polynomial transforms

Sometimes, when the ML practitioner has domain knowledge suggesting that one variable is related to the square, cube, or other power of another variable, it's useful to create a synthetic feature from one of the existing numerical features.

Consider the following spread of data points, where pink circles represent one class or category (for example, a species of tree) and green triangles another class (or species of tree):

IMAGE: Figure 17. y=x^2 spread of data points, with triangles below the curve and circles above the curve.

Figure 17. Two classes that can't be separated by a line.

It's not possible to draw a straight line that cleanly separates the two classes, but it is possible to draw a curve that does so:

IMAGE: Figure 18. Same image as Figure 17, only this time with y=x^2 overlaid to create a clear boundary between the triangles and circles.

Figure 18. Separating the classes with \(y = x^2\).

As discussed in the Linear regression module, a linear model with one feature, \(x_1\), is described by the linear equation:

\[y = b + w_1x_1\]

Additional features are handled by the addition of terms \(w_2x_2\), \(w_3x_3\), and so on.

Gradient descent finds the weight \(w_1\) (or weights \(w_1\), \(w_2\), \(w_3\), in the case of additional features) that minimizes the loss of the model. But the data points shown cannot be separated by a line. What can be done?

It's possible to keep the linear equation and still allow nonlinearity by defining a new term, \(x_2\), that is simply \(x_1\) squared:

\[x_2 = x_1^2\]

This synthetic feature, called a polynomial transform, is treated like any other feature. The previous linear formula becomes:

\[y = b + w_1x_1 + w_2x_2\]

This can still be treated like a linear regression problem, and the weights determined through gradient descent, as usual, despite containing a hidden squared term, the polynomial transform. Without changing how the linear model trains, the addition of a polynomial transform allows the model to separate the data points using a curve of the form \(y = b + w_1x + w_2x^2\).
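A minimal sketch (scikit-learn, hypothetical data) showing that adding the synthetic squared feature lets an ordinary linear regression recover a quadratic relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data that follows a roughly quadratic relationship.
rng = np.random.default_rng(0)
x1 = rng.uniform(-3, 3, size=200)
y = 1.5 + 0.8 * x1 + 2.0 * x1**2 + rng.normal(scale=0.3, size=200)

# Polynomial transform: the synthetic feature x2 is just x1 squared.
x2 = x1**2
X = np.column_stack([x1, x2])

model = LinearRegression().fit(X, y)   # still plain linear regression
print("bias:", model.intercept_)       # roughly 1.5
print("weights:", model.coef_)         # roughly [0.8, 2.0]
```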

Usually the numerical feature of interest is multiplied by itself, that is, raised to some power. Sometimes an ML practitioner can make an informed guess about the appropriate exponent. For example, many relationships in the physical world are related to squared terms, including acceleration due to gravity, the attenuation of light or sound over distance, and elastic potential energy.

If you transform a feature in a way that changes its scale, you should consider experimenting with normalizing it as well. Normalizing after transforming might make the model perform better. For more information, see Numerical Data: Normalization.

A related concept in categorical data is the feature cross, which more frequently synthesizes two different features.


Working with categorical data

Categorical data has a specific set of possible values. For example:

  • The different species of animals in a national park
  • The names of streets in a particular city
  • Whether or not an email is spam
  • The colors that house exteriors are painted
  • Binned numbers, which are described in the Working with Numerical Data module

Numbers can also be categorical data

True numerical data can be meaningfully multiplied. For example, consider a model that predicts the value of a house based on its area. Note that a useful model for evaluating house prices typically relies on hundreds of features. That said, all else being equal, a house of 200 square meters should be roughly twice as valuable as an otherwise similar house of 100 square meters.

Oftentimes, you should represent features that contain integer values as categorical data instead of numerical data. For example, consider a postal code feature in which the values are integers. If you represent this feature numerically rather than categorically, you're asking the model to find a numeric relationship between different postal codes. That is, you're telling the model to treat postal code 20004 as twice (or half) as large a signal as postal code 10002. Representing postal codes as categorical data lets the model weight each individual postal code separately.

Encoding

Encoding means converting categorical or other data to numerical vectors that a model can train on. This conversion is necessary because models can only train on floating-point values; models can't train on strings such as "dog" or "maple". This module explains different encoding methods for categorical data.



Categorical data: Vocabulary and one-hot encoding

The term dimension is a synonym for the number of elements in a feature vector. Some categorical features are low dimensional. For example:

| Feature name | # of categories | Sample categories |
|---|---|---|
| snowed_today | 2 | True, False |
| skill_level | 3 | Beginner, Practitioner, Expert |
| season | 4 | Winter, Spring, Summer, Autumn |
| day_of_week | 7 | Monday, Tuesday, Wednesday |
| planet | 8 | Mercury, Venus, Earth |

When a categorical feature has a low number of possible categories, you can encode it as a vocabulary. With a vocabulary encoding, the model treats each possible categorical value as a separate feature. During training, the model learns different weights for each category.

For example, suppose you are creating a model to predict a car's price based, in part, on a categorical feature named car_color. Perhaps red cars are worth more than green cars. Since manufacturers offer a limited number of exterior colors, car_color is a low-dimensional categorical feature. The following illustration suggests a vocabulary (possible values) for car_color:

IMAGE: Figure 1. Each color in the palette is represented as a separate feature. That is, each color is a separate feature in the feature vector. For instance, 'Red' is a feature, 'Orange' is a separate feature, and so on.

Figure 1. A unique feature for each category.

Exercise: Check your understanding

True or False: A machine learning model can train directly on raw string values, like "Red" and "Black", without converting these values to numerical vectors.

True

During training, a model can only manipulate floating-point numbers. The string "Red" is not a floating-point number. You must convert strings like "Red" to floating-point numbers.

False

A machine learning model can only train on features with floating-point values, so you'll need to convert those strings to floating-point values before training.

Index numbers

Machine learning models can only manipulate floating-point numbers. Therefore, you must convert each string to a unique index number, as in the following illustration:

IMAGE: Figure 2. Each color is associated with a unique integer value. For example, 'Red' is associated with the integer 0, 'Orange' with the integer 1, and so on.

Figure 2. Indexed features.

After converting strings to unique index numbers, you'll need to process the data further to represent it in ways that help the model learn meaningful relationships between the values. If the categorical feature data is left as indexed integers and loaded into a model, the model would treat the indexed values as continuous floating-point numbers. The model would then treat "purple" (index 6) as six times as large a signal as "orange" (index 1).

One-hot encoding

The next step in building a vocabulary is to convert each index number to its one-hot encoding. In a one-hot encoding:

  • Each category is represented by a vector (array) of N elements, where N is the number of categories. For example, if car_color has eight possible categories, then the one-hot vector representing car_color will have eight elements.
  • Exactly one of the elements in a one-hot vector has the value 1.0; all the remaining elements have the value 0.0.

For example, the following table shows the one-hot encoding for each color in car_color:

| Feature | Red | Orange | Blue | Yellow | Green | Black | Purple | Brown |
|---|---|---|---|---|---|---|---|---|
| "Red" | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "Orange" | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| "Blue" | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| "Yellow" | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| "Green" | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| "Black" | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| "Purple" | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| "Brown" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

It is the one-hot vector, not the string or the index number, that gets passed to the feature vector. The model learns a separate weight for each element of the feature vector.

Note: In a true one-hot encoding, only one element has the value 1.0. In a variant known as multi-hot encoding, multiple values can be 1.0.

The following illustration suggests the various transformations in the vocabulary representation:

IMAGE: Figure 3. Diagram of the end-to-end process to map categories to feature vectors. In the diagram, the input features are 'Yellow', 'Orange', 'Blue', and 'Blue' a second time. The system uses a stored vocabulary ('Red' is 0, 'Orange' is 1, 'Blue' is 2, 'Yellow' is 3, and so on) to map the input value to an ID. Thus, the system maps 'Yellow', 'Orange', 'Blue', and 'Blue' to 3, 1, 2, 2. The system then converts those values to a one-hot feature vector. For example, given a system with eight possible colors, 3 becomes 0, 0, 0, 1, 0, 0, 0, 0.

Figure 3. The end-to-end process to map categories to feature vectors.
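A minimal pandas sketch of this end-to-end mapping for the hypothetical car_color feature (an illustration, not the course's implementation):

```python
import pandas as pd

colors = ["Red", "Orange", "Blue", "Blue"]

# Restrict the vocabulary to the known categories so every vector has eight elements.
vocabulary = ["Red", "Orange", "Blue", "Yellow", "Green", "Black", "Purple", "Brown"]
colors = pd.Series(pd.Categorical(colors, categories=vocabulary))

one_hot = pd.get_dummies(colors).astype(float)
print(one_hot)
# Each row is a one-hot vector: exactly one 1.0, everything else 0.0.
```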

Sparse representation

A feature whose values are predominantly zero (or empty) is termed a sparse feature. Many categorical features, such as car_color, tend to be sparse features. Sparse representation means storing the position of the 1.0 in a sparse vector. For example, the one-hot vector for "Blue" is:

[0, 0, 1, 0, 0, 0, 0, 0]

Since the 1 is in position 2 (when starting the count at 0), the sparse representation for the preceding one-hot vector is:

2

Notice that the sparse representation consumes far less memory than the eight-element one-hot vector. Importantly, the model must train on the one-hot vector, not the sparse representation.

Note: The sparse representation of a multi-hot encoding stores the positions of all the nonzero elements. For example, the sparse representation of a car that is both "Blue" and "Black" is 2, 5.

Outliers in categorical data

Like numerical data, categorical data also contains outliers. Suppose car_color contains not only the popular colors, but also some rarely used outlier colors, such as "Mauve" or "Avocado". Rather than giving each of these outlier colors a separate category, you can lump them into a single "catch-all" category called out-of-vocabulary (OOV). In other words, all the outlier colors are binned into a single outlier bucket. The system learns a single weight for that outlier bucket.

Encoding high-dimensional categorical features

Some categorical features have a high number of dimensions, such as those in the following table:

Feature name          | # of categories | Sample categories
words_in_english      | ~500,000        | "happy", "walking"
US_postal_codes       | ~42,000         | "02114", "90301"
last_names_in_Germany | ~850,000        | "Schmidt", "Schneider"

When the number of categories is high, one-hot encoding is usually a bad choice. Embeddings, detailed in a separate Embeddings module, are usually a much better choice. Embeddings substantially reduce the number of dimensions, which benefits models in two important ways:

  • The model typically trains faster.
  • The built model typically infers predictions more quickly. That is, the model has lower latency.

Hashing (also called the hashing trick) is a less common way to reduce the number of dimensions.

Click here to learn about hashing

In brief, hashing maps a category (for example, a color) to a small integer—the number of the "bucket" that will hold that category.

In detail, you implement a hashing algorithm as follows:

  1. Choose the number of bins, N, where N is less than the total number of categories. As an arbitrary example, say N = 100.
  2. Choose a hash function. (Often, you will choose the range of hash values as well.)
  3. Pass each category (for example, a particular color) through that hash function, generating a hash value, say 89237.
  4. Assign the category to the bin whose index is the hash value modulo N. In this case, where N is 100 and the hash value is 89237, the bin index is 37 because 89237 % 100 is 37.
  5. Create a one-hot encoding from these bin indexes rather than from the original category indexes.

For more details about hashing data, see the Randomization section of the Production machine learning systems module.
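The following is a minimal sketch of those steps, using Python's built-in hashlib as the hash function and N = 100 bins; both choices are arbitrary assumptions, not a prescribed implementation.

```python
import hashlib
import numpy as np

N = 100  # Number of bins; chosen arbitrarily, as in step 1 above.

def hash_bucket(category: str, num_bins: int = N) -> int:
    """Map a category string to a bucket index in [0, num_bins)."""
    # Use a stable hash; Python's built-in hash() is randomized per run.
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_bins

def hashed_one_hot(category: str, num_bins: int = N) -> np.ndarray:
    """One-hot encode the bucket rather than the raw category."""
    vector = np.zeros(num_bins, dtype=np.float32)
    vector[hash_bucket(category, num_bins)] = 1.0
    return vector

# Different categories can collide in the same bucket and share a weight.
print(hash_bucket("Mauve"), hash_bucket("Avocado"))
```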

Key terms:

Categorical data: Common issues

Numerical data is often recorded by scientific instruments or automated measurements. Categorical data, on the other hand, is often categorized by human beings or by machine learning (ML) models. Who decides on categories and labels, and how they make those decisions, affects the reliability and usefulness of that data.

Human raters

Data manually labeled by human beings is often referred to as gold labels, and is considered more desirable than machine-labeled data for training models, due to relatively better data quality.

This doesn't necessarily mean that any set of human-labeled data is of high quality. Human errors, bias, and malice can be introduced at the point of data collection or during data cleaning and processing. Check for them before training.

Any two human beings may label the same example differently. The degree to which human raters' decisions agree on the same examples is called inter-rater agreement. You can get a sense of the variance in raters' opinions by using multiple raters per example and measuring inter-rater agreement.

Click to learn about inter-rater agreement metrics

The following are ways to measure inter-rater agreement:

  • Cohen's kappa and variants
  • Intra-class correlation (ICC)
  • Krippendorff's alpha

For details on Cohen's kappa and intra-class correlation, see Hallgren 2012. For details on Krippendorff's alpha, see Krippendorff 2011.
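If two raters' labels for the same examples are already lined up, scikit-learn (assumed to be available here) can compute Cohen's kappa directly; a minimal sketch:

```python
from sklearn.metrics import cohen_kappa_score

# Labels that two raters assigned to the same ten examples.
rater_a = ["spam", "spam", "not_spam", "spam", "not_spam",
           "not_spam", "spam", "not_spam", "not_spam", "spam"]
rater_b = ["spam", "not_spam", "not_spam", "spam", "not_spam",
           "not_spam", "spam", "spam", "not_spam", "spam"]

# 1.0 is perfect agreement; 0.0 is agreement no better than chance.
print(cohen_kappa_score(rater_a, rater_b))
```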

Machine raters

Machine-labeled data, where categories are automatically determined by one or more classification models, is often referred to as silver labels. Machine-labeled data can vary widely in quality. Check it not only for accuracy and biases but also for violations of common sense, reality, and intention. For example, if a computer-vision model mislabels a photo of a chihuahua as a muffin, or a photo of a muffin as a chihuahua, models trained on that labeled data will be of lower quality.

Similarly, a sentiment analyzer that scores neutral words as -0.25, when 0.0 is the neutral value, might be scoring all words with an additional negative bias that is not actually present in the data. An oversensitive toxicity detector may falsely flag many neutral statements as toxic. Try to get a sense of the quality and biases of machine labels and annotations in your data before training on it.

High dimensionality

Categorical data tends to produce high-dimensional feature vectors; that is, feature vectors having a large number of elements. High dimensionality increases training costs and makes training more difficult. For these reasons, ML experts often seek ways to reduce the number of dimensions prior to training.

For natural-language data, the main method of reducing dimensionality is to convert feature vectors to embedding vectors. This is discussed in the Embeddings module later in this course.

Key terms:

Categorical data: Feature crosses

Feature crosses are created by crossing (taking the Cartesian product of) two or more categorical or bucketed features of the dataset. Like polynomial transforms, feature crosses allow linear models to handle nonlinearities. Feature crosses also encode interactions between features.

For example, consider a leaf dataset with the categorical features:

  • edges, containing values smooth, toothed, and lobed
  • arrangement, containing values opposite and alternate

Assume the order above is the order of the feature columns in a one-hot representation, so that a leaf with smooth edges and opposite arrangement is represented as {(1, 0, 0), (1, 0)}.

The feature cross, or Cartesian product, of these two features would be:

{Smooth_Opposite, Smooth_Alternate, Toothed_Opposite, Toothed_Alternate, Lobed_Opposite, Lobed_Alternate}

where the value of each term is the product of the base feature values, such that:

  • Smooth_Opposite = edges[0] * arrangement[0]
  • Smooth_Alternate = edges[0] * arrangement[1]
  • Toothed_Opposite = edges[1] * arrangement[0]
  • Toothed_Alternate = edges[1] * arrangement[1]
  • Lobed_Opposite = edges[2] * arrangement[0]
  • Lobed_Alternate = edges[2] * arrangement[1]

For example, if a leaf has a lobed edge and an alternate arrangement, the feature-cross vector will have a value of 1 for Lobed_Alternate, and a value of 0 for all other terms:

{0, 0, 0, 0, 0, 1}

This dataset could be used to classify leaves by tree species, since these characteristics do not vary within a species.
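Numerically, the feature cross is just the flattened outer product of the two one-hot vectors. Here is a minimal NumPy sketch of the leaf example above:

```python
import numpy as np

edges_vocab = ["smooth", "toothed", "lobed"]
arrangement_vocab = ["opposite", "alternate"]

def one_hot(value, vocabulary):
    """Return the one-hot vector for a value in the given vocabulary."""
    vector = np.zeros(len(vocabulary), dtype=np.float32)
    vector[vocabulary.index(value)] = 1.0
    return vector

edges = one_hot("lobed", edges_vocab)                   # (0, 0, 1)
arrangement = one_hot("alternate", arrangement_vocab)   # (0, 1)

# The feature cross is the flattened outer product:
# [Smooth_Opposite, Smooth_Alternate, Toothed_Opposite, ...]
cross = np.outer(edges, arrangement).flatten()
print(cross)  # [0. 0. 0. 0. 0. 1.] -> Lobed_Alternate
```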

Click here to compare polynomial transforms with feature crosses

Feature crosses are somewhat analogous to Polynomial transforms. Both combine multiple features into a new synthetic feature that the model can train on to learn nonlinearities. Polynomial transforms typically combine numerical data, while feature crosses combine categorical data.

When to use feature crosses

Domain knowledge can suggest a useful combination of features to cross. Without that domain knowledge, it can be difficult to determine effective feature crosses or polynomial transforms by hand. It's often possible, if computationally expensive, to use neural networks to automatically find and apply useful feature combinations during training.

Be careful—crossing two sparse features produces an even sparser new feature than the two original features. For example, if feature A is a 100-element sparse feature and feature B is a 200-element sparse feature, a feature cross of A and B yields a 20,000-element sparse feature.

Key terms:

Datasets, generalization, and overfitting

Introduction

This module begins with a leading question. Choose one of the following answers:

If you had to prioritize improving one of the following areas in your machine learning project, which would have the most impact?

Improving the quality of your dataset

Data trumps all. The quality and size of the dataset matters much more than which shiny algorithm you use to build your model.

Applying a more clever loss function to training your model

True, a better loss function can help a model train faster, but it's still a distant second to another item in this list.

And here's an even more leading question:

Take a guess: In your machine learning project, how much time do you typically spend on data preparation and transformation?

More than half of the project time

Yes, ML practitioners spend the majority of their time constructing datasets and doing feature engineering.

Less than half of the project time

Plan for more! Typically, 80% of the time on a machine learning project is spent constructing datasets and transforming data.

In this module, you'll learn more about the characteristics of machine learning datasets, and how to prepare your data to ensure high-quality results when training and evaluating your model.

Datasets: Data characteristics

A dataset is a collection of examples.

Many datasets store data in tables (grids), for example, as comma-separated values (CSV) or directly from spreadsheets or database tables. Tables are an intuitive input format for machine learning models. You can imagine each row of the table as an example and each column as a potential feature or label. That said, datasets may also be derived from other formats, including log files and protocol buffers.

Regardless of the format, your ML model is only as good as the data it trains on. This section examines key data characteristics.

Types of data

A dataset could contain many kinds of datatypes, including but certainly not limited to:

  • numerical data, which is covered in a separate unit
  • categorical data, which is covered in a separate unit
  • human language, including individual words and sentences, all the way up to entire text documents
  • multimedia (such as images, videos, and audio files)
  • outputs from other ML systems
  • embedding vectors, which are covered in a later unit

Quantity of data

As a rough rule of thumb, your model should train on at least an order of magnitude (or two) more examples than trainable parameters. However, good models generally train on substantially more examples than that.

Models trained on large datasets with few features generally outperform models trained on small datasets with a lot of features. Google has historically had great success training simple models on large datasets.

Different machine learning problems may require wildly different numbers of examples to build a useful model. For some relatively simple problems, a few dozen examples might be sufficient. For other problems, a trillion examples might be insufficient.

It's possible to get good results from a small dataset if you are adapting an existing model already trained on large quantities of data from the same schema.

Quality and reliability of data

Everyone prefers high quality to low quality, but quality is such a vague concept that it could be defined many different ways. This course defines quality pragmatically:

A high-quality dataset helps your model accomplish its goal. A low-quality dataset inhibits your model from accomplishing its goal.

A high-quality dataset is usually also reliable. Reliability refers to the degree to which you can trust your data. A model trained on a reliable dataset is more likely to yield useful predictions than a model trained on unreliable data.

In measuring reliability, you must determine:

  • How common are label errors? For example, if your data is labeled by humans, how often did your human raters make mistakes?
  • Are your features noisy? That is, do the values in your features contain errors? Be realistic—you can't purge your dataset of all noise. Some noise is normal; for example, GPS measurements of any location always fluctuate a little, week to week.
  • Is the data properly filtered for your problem? For example, should your dataset include search queries from bots? If you're building a spam-detection system, then likely the answer is yes. However, if you're trying to improve search results for humans, then no.

The following are common causes of unreliable data in datasets:

  • Omitted values. For example, a person forgot to enter a value for a house's age.
  • Duplicate examples. For example, a server mistakenly uploaded the same log entries twice.
  • Bad feature values. For example, someone typed an extra digit, or a thermometer was left out in the sun.
  • Bad labels. For example, a person mistakenly labeled a picture of an oak tree as a maple tree.
  • Bad sections of data. For example, a certain feature is very reliable, except for that one day when the network kept crashing.

We recommend using automation to flag unreliable data. For example, unit tests that define or rely on an external formal data schema can flag values that fall outside of a defined range.
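As a minimal sketch of such a check, the following assumes a pandas DataFrame with a temperature_c column and an illustrative valid range of -50 to 60; the column name and range are hypothetical, not part of any real schema.

```python
import pandas as pd

df = pd.DataFrame({"temperature_c": [21.4, 19.8, 250.0, -3.1, 18.9]})

# Flag values that fall outside the range the schema allows.
valid = df["temperature_c"].between(-50, 60)
suspect_rows = df[~valid]

if not suspect_rows.empty:
    print("Suspect values found:")
    print(suspect_rows)
```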

Note: Any sufficiently large or diverse dataset almost certainly contains outliers that fall outside your data schema or unit test bands. Determining how to handle outliers is an important part of machine learning. The Numerical data unit details how to handle numeric outliers.

Complete vs. incomplete examples

In a perfect world, each example is complete; that is, each example contains a value for each feature.

IMAGE: Figure 1. An example containing values for all five of its features.

Figure 1. A complete example.

Unfortunately, real-world examples are often incomplete, meaning that at least one feature value is missing.

IMAGE: Figure 2. An example containing values for four of its five features. One feature is marked missing.

Figure 2. An incomplete example.

Don't train a model on incomplete examples. Instead, fix or eliminate incomplete examples by doing one of the following:

  • Delete incomplete examples.
  • Impute missing values; that is, convert the incomplete example to a complete example by providing well-reasoned guesses for the missing values.

IMAGE: Figure 3. A dataset containing three examples, two of which are incomplete examples. Someone has stricken these two incomplete examples from the dataset.

Figure 3. Deleting incomplete examples from the dataset.

IMAGE: Figure 4. A dataset containing three examples, two of which were incomplete examples containing missing data. Some entity (a human or imputation software) has imputed values that replaced the missing data.

Figure 4. Imputing missing values for incomplete examples.

If the dataset contains enough complete examples to train a useful model, then consider deleting the incomplete examples. Similarly, if only one feature is missing a significant amount of data and that one feature probably can't help the model much, then consider deleting that feature from the model inputs and seeing how much quality is lost by its removal. If the model works just or almost as well without it, that's great. Conversely, if you don't have enough complete examples to train a useful model, then you might consider imputing missing values.

It's fine to delete useless or redundant examples, but it's bad to delete important examples. Unfortunately, it can be difficult to differentiate between useless and useful examples. If you can't decide whether to delete or impute, consider building two datasets: one formed by deleting incomplete examples and the other by imputing. Then, determine which dataset trains the better model.

Clever algorithms can impute some pretty good missing values; however, imputed values are rarely as good as the actual values. Therefore, a good dataset tells the model which values are imputed and which are actual. One way to do this is to add an extra Boolean column to the dataset that indicates whether a particular feature's value is imputed. For example, given a feature named temperature, you could add an extra Boolean feature named something like temperature_is_imputed. Then, during training, the model will probably gradually learn to trust examples containing imputed values for feature temperature less than examples containing actual (non-imputed) values.


Imputation is the process of generating well-reasoned data, not random or deceptive data. Be careful: good imputation can improve your model; bad imputation can hurt your model.

One common algorithm is to use the mean or median as the imputed value. Consequently, when you represent a numerical feature with Z-scores, the imputed value is typically 0 (because 0 is the mean Z-score).
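Here is a minimal pandas sketch of median imputation plus the Boolean indicator column described above; the temperature column name is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"temperature": [12.0, 18.0, None, 24.0, 38.0]})

# Record which values were missing *before* imputing them.
df["temperature_is_imputed"] = df["temperature"].isna()

# Impute missing values with the median of the observed values.
df["temperature"] = df["temperature"].fillna(df["temperature"].median())

print(df)
```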

Exercise: Check your understanding

A sorted dataset, like the one in the following exercise, can sometimes simplify imputation. However, it is a bad idea to train on a sorted dataset. So, after imputation, randomize the order of examples in the training set.

Here are two columns of a dataset sorted by Timestamp.

Timestamp          | Temperature
June 8, 2023 09:00 | 12
June 8, 2023 10:00 | 18
June 8, 2023 11:00 | missing
June 8, 2023 12:00 | 24
June 8, 2023 13:00 | 38

Which of the following would be a reasonable value to impute for the missing value of Temperature?

23

Probably. 23 is the mean of the adjacent values (12, 18, 24, and 38). However, we aren't seeing the rest of the dataset, so it is possible that 23 would be an outlier for 11:00 on other days.

31

Unlikely. The limited part of the dataset that we can see suggests that 31 is much too high for the 11:00 Temperature. However, we can't be sure without basing the imputation on a larger number of examples.

51

Very unlikely. 51 is much higher than any of the displayed values (and, therefore, much higher than the mean).

Key terms:

Datasets: Labels

This section focuses on labels.

Direct versus proxy labels

Consider two different kinds of labels:

  • Direct labels, which are labels identical to the prediction your model is trying to make. That is, the prediction your model is trying to make is exactly present as a column in your dataset. For example, a column named bicycle owner would be a direct label for a binary classification model that predicts whether or not a person owns a bicycle.
  • Proxy labels, which are labels that are similar—but not identical—to the prediction your model is trying to make. For example, a person subscribing to Bicycle Bizarre magazine probably—but not definitely—owns a bicycle.

Direct labels are generally better than proxy labels. If your dataset provides a possible direct label, you should probably use it. Oftentimes though, direct labels aren't available.

Proxy labels are always a compromise—an imperfect approximation of a direct label. However, some proxy labels are close enough approximations to be useful. Models that use proxy labels are only as useful as the connection between the proxy label and the prediction.

Recall that every label must be represented as a floating-point number (because machine learning is fundamentally just a huge amalgam of mathematical operations). Sometimes, a direct label exists but can't be easily represented as a floating-point number. In this case, use a proxy label.

Exercise: Check your understanding

Your company wants to do the following:

Mail coupons ("Get 15% off a new bicycle helmet") to bicycle owners.

So, your model must do the following:

Predict which people own a bicycle.

Unfortunately, the dataset doesn't contain a column named bike owner. However, the dataset does contain a column named recently bought a bicycle.

Would recently bought a bicycle be a good proxy label or a poor proxy label for this model?

Good proxy label

The column recently bought a bicycle is a relatively good proxy label. After all, most of the people who buy bicycles now own bicycles. Nevertheless, like all proxy labels, even very good ones, recently bought a bicycle is imperfect. After all, the person buying an item isn't always the person using (or owning) that item. For example, people sometimes buy bicycles as a gift.

Poor proxy label

Like all proxy labels, recently bought a bicycle is imperfect (some bicycles are bought as gifts and given to others). However, recently bought a bicycle is still a relatively good indicator that someone owns a bicycle.

Human-generated data

Some data is human-generated; that is, one or more humans examine some information and provide a value, usually for the label. For example, one or more meteorologists could examine pictures of the sky and identify cloud types.

Alternatively, some data is automatically-generated. That is, software (possibly, another machine learning model) determines the value. For example, a machine learning model could examine sky pictures and automatically identify cloud types.

This section explores the advantages and disadvantages of human-generated data.

Advantages

  • Human raters can perform a wide range of tasks that even sophisticated machine learning models may find difficult.
  • The process forces the owner of the dataset to develop clear and consistent criteria.

Disadvantages

  • You typically pay human raters, so human-generated data can be expensive.
  • To err is human. Therefore, multiple human raters might have to evaluate the same data.

Think through these questions to determine your needs:

  • How skilled must your raters be? (For example, must the raters know a specific language? Do you need linguists for dialogue or NLP applications?)
  • How many labeled examples do you need? How soon do you need them?
  • What's your budget?

Always double-check your human raters. For example, label 1000 examples yourself, and see how your results match other raters' results. If discrepancies surface, don't assume your ratings are the correct ones, especially if a value judgment is involved. If human raters have introduced errors, consider adding instructions to help them and try again.

Click the plus icon to learn more about human-generated data.

Looking at your data by hand is a good exercise regardless of how you obtained your data. Andrej Karpathy did this on ImageNet and wrote about the experience.

Models can train on a mix of automated and human-generated labels. However, for most models, an extra set of human-generated labels (which can become stale) is generally not worth the extra complexity and maintenance. That said, sometimes the human-generated labels can provide extra information not available in the automated labels.


Key terms:

Datasets: Class-imbalanced datasets

This section explores the following three questions:

  • What's the difference between class-balanced datasets and class-imbalanced datasets?
  • Why is training an imbalanced dataset difficult?
  • How can you overcome the problems of training imbalanced datasets?

Class-balanced datasets versus class-imbalanced datasets

Consider a dataset containing a categorical label whose value is either the positive class or the negative class. In a class-balanced dataset, the number of examples of the positive class and the negative class is about equal. For example, a dataset containing 235 positive examples and 247 negative examples is class-balanced.

In a class-imbalanced dataset, one label is considerably more common than the other. In the real world, class-imbalanced datasets are far more common than class-balanced datasets. For example, in a dataset of credit card transactions, fraudulent purchases might make up less than 0.1% of the examples. Similarly, in a medical diagnosis dataset, the number of patients with a rare virus might be less than 0.01% of the total examples. In a class-imbalanced dataset, the more common label is called the majority class and the less common label is called the minority class.

The difficulty of training severely class-imbalanced datasets

Training aims to create a model that successfully distinguishes the positive class from the negative class. To do so, batches need a sufficient number of both positive classes and negative classes. That's not a problem when training on a mildly class-imbalanced dataset since even small batches typically contain sufficient examples of both the positive class and the negative class. However, a severely class-imbalanced dataset might not contain enough minority class examples for proper training.

For example, consider the class-imbalanced dataset illustrated in Figure 6 in which:

  • 200 labels are in the majority class.
  • 2 labels are in the minority class.

IMAGE: Figure 6. A dataset with 202 examples. 200 of the examples have a sunflower label and 2 of the examples have a rose label.

Figure 6. A highly imbalanced floral dataset containing far more sunflowers than roses.

If the batch size is 20, most batches won't contain any examples of the minority class. If the batch size is 100, each batch will contain an average of only one minority class example, which is insufficient for proper training. Even a much larger batch size will still yield such an imbalanced proportion that the model might not train properly.

Note: Accuracy is usually a poor metric for assessing a model trained on a class-imbalanced dataset. See Classification: Accuracy, recall, precision, and related metrics for details.

Training a class-imbalanced dataset

During training, a model should learn two things:

  • What each class looks like; that is, what feature values correspond to what class?
  • How common each class is; that is, what is the relative distribution of the classes?

Standard training conflates these two goals. In contrast, the following two-step technique, called downsampling and upweighting the majority class, separates them, enabling the model to achieve both.

Note: Many students read the following section and say some variant of, "That just can't be right." Be warned that downsampling and upweighting the majority class is somewhat counterintuitive.

Step 1: Downsample the majority class

Downsampling means training on a disproportionately low percentage of majority class examples. That is, you artificially force a class-imbalanced dataset to become somewhat more balanced by omitting many of the majority class examples from training. Downsampling greatly increases the probability that each batch contains enough examples of the minority class to train the model properly and efficiently.

For example, the class-imbalanced dataset shown in Figure 6 consists of 99% majority class and 1% minority class examples. Downsampling the majority class by a factor of 25 artificially creates a more balanced training set (80% majority class to 20% minority class) suggested in Figure 7:

IMAGE: Figure 7. 10 examples, 8 of which are sunflowers and 2 of which are roses.

Figure 7. Downsampling the majority class by a factor of 25.

Step 2: Upweight the downsampled class

Downsampling introduces a prediction bias by showing the model an artificial world where the classes are more balanced than in the real world. To correct this bias, you must "upweight" the majority class by the same factor by which you downsampled it. Upweighting means multiplying the loss on each remaining majority class example by that factor, so that each one stands in for the examples you omitted.

For example, we downsampled the majority class by a factor of 25, so we must upweight the majority class by a factor of 25. That is, when the model makes a prediction error on a majority class example, treat the loss as if it were 25 errors (multiply the regular loss by 25).

IMAGE: Figure 8. The loss for a bad prediction on the minority class is treated normally. However, the loss for a bad prediction on the majority class is treated 25 times more harshly.

Figure 8. Upweighting the majority class by a factor of 25.

How much should you downsample and upweight to rebalance your dataset? To determine the answer, you should experiment with different downsampling and upweighting factors just as you would experiment with other hyperparameters.
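Here is a minimal pandas sketch of both steps on a toy dataset shaped like Figure 6 (200 majority examples, 2 minority examples), using the same downsampling factor of 25; the column names and helper function are illustrative.

```python
import pandas as pd

def downsample_and_upweight(df, label_col="label",
                            majority=0, factor=25, seed=42):
    """Keep 1/factor of the majority class and weight it by factor."""
    majority_rows = df[df[label_col] == majority]
    minority_rows = df[df[label_col] != majority]

    # Step 1: downsample the majority class.
    downsampled = majority_rows.sample(frac=1.0 / factor, random_state=seed)
    balanced = pd.concat([downsampled, minority_rows])

    # Step 2: upweight the downsampled (majority) class.
    balanced["example_weight"] = balanced[label_col].apply(
        lambda y: float(factor) if y == majority else 1.0)

    # Reshuffle so batches mix the classes.
    return balanced.sample(frac=1.0, random_state=seed)

# Toy dataset: 200 majority (0) examples and 2 minority (1) examples.
df = pd.DataFrame({"label": [0] * 200 + [1] * 2})
print(downsample_and_upweight(df)["label"].value_counts())  # 8 vs. 2
```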

Benefits of this technique

Downsampling and upweighting the majority class brings the following benefits:

  • Better model: The resultant model "knows" both of the following:
    • The connection between features and labels
    • The true distribution of the classes
  • Faster convergence: During training, the model sees the minority class more often, which helps the model converge faster.

Key terms:

Datasets: Dividing the original dataset

All good software engineering projects devote considerable energy to testing their apps. Similarly, we strongly recommend testing your ML model to determine the correctness of its predictions.

Training, validation, and test sets

You should test a model against a different set of examples than those used to train the model. As you'll learn a little later, testing on different examples is stronger proof of your model's fitness than testing on the same set of examples. Where do you get those different examples? Traditionally in machine learning, you get those different examples by splitting the original dataset. You might assume, therefore, that you should split the original dataset into two subsets:

IMAGE: Figure 8. A horizontal bar divided into two pieces: ~80% of which is the training set and ~20% is the test set.

Figure 8. Not an optimal split.

Exercise: Check your intuition

Suppose you train on the training set and evaluate on the test set over multiple rounds. In each round, you use the test set results to guide how to update hyperparameters and the feature set. Can you see anything wrong with this approach? Pick only one answer.

Doing many rounds of this procedure might cause the model to implicitly fit the peculiarities of the test set.

Yes! The more often you use the same test set, the more likely the model closely fits the test set. Like a teacher "teaching to the test," the model inadvertently fits the test set, which might make it harder for the model to fit real-world data.

This approach is fine. After all, you're training on the training set and evaluating on a separate test set.

Actually, there's a subtle issue here. Think about what might gradually go wrong.

This approach is computationally inefficient. Don't change hyperparameters or feature sets after each round of testing.

Frequent testing is expensive but critical. However, frequent testing is far less expensive than additional training. Optimizing hyperparameters and the feature set can dramatically improve model quality, so always budget time and computational resources to work on these.

Dividing the dataset into two sets is a decent idea, but a better approach is to divide the dataset into three subsets. In addition to the training set and the test set, the third subset is:

  • A validation set, which performs the initial testing on the model as it is being trained.

IMAGE: Figure 9. A horizontal bar divided into three pieces: 70% of which is the training set, 15% the validation set, and 15% the test set

Figure 9. A much better split.

Use the validation set to evaluate results from the training set. After repeated use of the validation set suggests that your model is making good predictions, use the test set to double-check your model.
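Here is a minimal sketch of a 70/15/15 split (matching Figure 9) using scikit-learn's train_test_split twice; the toy DataFrame is illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"feature": range(1000), "label": range(1000)})

# First carve off 30% for validation + test, then split that half and half.
train_df, holdout_df = train_test_split(df, test_size=0.30, random_state=42)
val_df, test_df = train_test_split(holdout_df, test_size=0.50, random_state=42)

print(len(train_df), len(val_df), len(test_df))  # 700 150 150
```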

The following figure suggests this workflow. In the figure, "Tweak model" means adjusting anything about the model, from changing the learning rate, to adding or removing features, to designing a completely new model from scratch.

IMAGE: Figure 10. A workflow diagram

Figure 10. A good workflow for development and testing.

Note: When you transform a feature in your training set, you must make the same transformation in the validation set, test set, and real-world dataset.

The workflow shown in Figure 10 is optimal, but even with that workflow, test sets and validation sets still "wear out" with repeated use. That is, the more you use the same data to make decisions about hyperparameter settings or other model improvements, the less confidence that the model will make good predictions on new data. For this reason, it's a good idea to collect more data to "refresh" the test set and validation set. Starting anew is a great reset.

Exercise: Check your intuition

You shuffled all the examples in the dataset and divided the shuffled examples into training, validation, and test sets. However, the loss value on your test set is so staggeringly low that you suspect a mistake. What might have gone wrong?

Many of the examples in the test set are duplicates of examples in the training set.

Yes. This can be a problem in a dataset with a lot of redundant examples. We strongly recommend deleting duplicate examples from the test set before testing.

Training and testing are nondeterministic. Sometimes, by chance, your test loss is incredibly low. Rerun the test to confirm the result.

Although loss does vary a little on each run, it shouldn't vary so much that you think you won the machine learning lottery.

By chance, the test set just happened to contain examples that the model performed well on.

The examples were well shuffled, so this is extremely unlikely.

Additional problems with test sets

As the previous question illustrates, duplicate examples can affect model evaluation. After splitting a dataset into training, validation, and test sets, delete any examples in the validation set or test set that are duplicates of examples in the training set. The only fair test of a model is against new examples, not duplicates.

For example, consider a model that predicts whether an email is spam, using the subject line, email body, and sender's email address as features. Suppose you divide the data into training and test sets, with an 80-20 split. After training, the model achieves 99% precision on both the training set and the test set. You'd probably expect a lower precision on the test set, so you take another look at the data and discover that many of the examples in the test set are duplicates of examples in the training set. The problem is that you neglected to scrub duplicate entries for the same spam email from your input database before splitting the data. You've inadvertently trained on some of your test data.
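Here is a minimal pandas sketch of that scrub: drop any test example whose feature columns exactly match a training example. The column names are illustrative.

```python
import pandas as pd

train_df = pd.DataFrame({"subject": ["win cash now", "meeting at 3pm"],
                         "label":   [1, 0]})
test_df = pd.DataFrame({"subject": ["win cash now", "lunch tomorrow?"],
                        "label":   [1, 0]})

# Keep only test examples whose features don't appear in the training set.
feature_cols = ["subject"]
train_keys = train_df[feature_cols].apply(tuple, axis=1)
test_keys = test_df[feature_cols].apply(tuple, axis=1)
deduped_test_df = test_df[~test_keys.isin(train_keys)]

print(deduped_test_df)  # only "lunch tomorrow?" remains
```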

In summary, a good test set or validation set meets all of the following criteria:

  • Large enough to yield statistically significant testing results.
  • Representative of the dataset as a whole. In other words, don't pick a test set with different characteristics than the training set.
  • Representative of the real-world data that the model will encounter as part of its business purpose.
  • Zero examples duplicated in the training set.

Exercises: Check your understanding

Given a single dataset with a fixed number of examples, which of the following statements is true?

Every example used in testing the model is one less example used in training the model.

Dividing examples into train/test/validation sets is a zero-sum game. This is the central trade-off.

The number of examples in the test set must be greater than the number of examples in the validation set.

In theory, the validation set and the test set should contain the same number of examples, or nearly so.

The number of examples in the test set must be greater than the number of examples in the validation set or training set.

The number of examples in the training set is usually greater than the number of examples in the validation set or test set; however, there are no percentage requirements for the different sets.

Suppose your test set contains enough examples to perform a statistically significant test. Furthermore, testing against the test set yields low loss. However, the model performed poorly in the real world. What should you do?

Determine how the original dataset differs from real-life data.

Yes. Even the best datasets are only a snapshot of real-life data; the underlying ground truth tends to change over time. Although your test set matched your training set well enough to suggest good model quality, your dataset probably doesn't adequately match real-world data. You might have to retrain and retest against a new dataset.

Retest on the same test set. The test results might have been an anomaly.

Although retesting might yield slightly different results, this tactic probably isn't very helpful.

How many examples should the test set contain?

Enough examples to yield a statistically significant test.

Yes. How many examples is that? You'll need to experiment.

At least 15% of the original dataset.

15% may or may not be enough examples.

Key terms:

Datasets: Transforming data

Machine learning models can only train on floating-point values. However, many dataset features are not naturally floating-point values. Therefore, one important part of machine learning is transforming non-floating-point features to floating-point representations.

For example, suppose street names is a feature. Most street names are strings, such as "Broadway" or "Vilakazi". Your model can't train on "Broadway", so you must transform "Broadway" to a floating-point number. The Categorical Data module explains how to do this.

Additionally, you should transform even most floating-point features. This transformation process, called normalization, converts floating-point numbers to a constrained range, which improves model training. The Numerical Data module explains how to do this.
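As a minimal sketch of one common normalization, the following computes Z-scores with NumPy; the Numerical Data module covers the options in detail, and the values below are illustrative.

```python
import numpy as np

values = np.array([12.0, 18.0, 21.0, 24.0, 38.0])

# Z-score normalization: subtract the mean, divide by the standard deviation.
z_scores = (values - values.mean()) / values.std()
print(z_scores.round(2))
```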

Sample data when you have too much of it

Some organizations are blessed with an abundance of data. When the dataset contains too many examples, you must select a subset of examples for training. When possible, select the subset that is most relevant to your model's predictions.

Filter examples containing PII

Good datasets omit examples containing Personally Identifiable Information (PII). This policy helps safeguard privacy but can influence the model.

See the Safety and Privacy module later in the course for more on these topics.

Key terms:

Generalization

Watch the video on the source page (https://developers.google.com/machine-learning/crash-course/overfitting/generalization) to learn about a common problem ML practitioners face when training a model on their training dataset.


Overfitting

Overfitting means creating a model that matches (memorizes) the training set so closely that the model fails to make correct predictions on new data. An overfit model is analogous to an invention that performs well in the lab but is worthless in the real world.

Tip: Overfitting is a common problem in machine learning, not an academic hypothetical.

In Figure 11, imagine that each geometric shape represents a tree's position in a square forest. The blue diamonds mark the locations of healthy trees, while the orange circles mark the locations of sick trees.

IMAGE: Figure 11. This figure contains about 60 dots, half of which are healthy trees and the other half sick trees. The healthy trees are mainly in the northeast quadrant, though a few healthy trees sneak into the northwest quadrants. The sick trees are mainly in the southeast quadrant, but a few of the sick trees spill into other quadrants.

Figure 11. Training set: locations of healthy and sick trees in a square forest.

Mentally draw any shapes—lines, curves, ovals...anything—to separate the healthy trees from the sick trees. Then, expand the next line to examine one possible separation.

IMAGE: Figure 12. The same arrangement of healthy and sick trees as in Figure 11. However, a model of complex geometric shapes separates nearly all of the healthy trees from the sick trees.

Figure 12. A complex model for distinguishing sick from healthy trees.

The complex shapes shown in Figure 12 successfully categorized all but two of the trees. If we think of the shapes as a model, then this is a fantastic model.

Or is it? A truly excellent model successfully categorizes new examples. Figure 13 shows what happens when that same model makes predictions on new examples from the test set:

IMAGE: Figure 13. A new batch of healthy and sick trees overlaid on the model shown in Figure 12. The model miscategorizes many of the trees.

Figure 13. Test set: a complex model for distinguishing sick from healthy trees.

So, the complex model shown in Figure 12 did a great job on the training set but a pretty bad job on the test set. This is a classic case of a model overfitting to the training set data.

Fitting, overfitting, and underfitting

A model must make good predictions on new data. That is, you're aiming to create a model that "fits" new data.

As you've seen, an overfit model makes excellent predictions on the training set but poor predictions on new data. An underfit model doesn't even make good predictions on the training data. If an overfit model is like a product that performs well in the lab but poorly in the real world, then an underfit model is like a product that doesn't even do well in the lab.

IMAGE: Figure 14. Cartesian plot.

Figure 14. Underfit, fit, and overfit models.

Generalization is the opposite of overfitting. That is, a model that generalizes well makes good predictions on new data. Your goal is to create a model that generalizes well to new data.

Detecting overfitting

The following curves help you detect overfitting:

  • loss curves
  • generalization curves

A loss curve plots a model's loss against the number of training iterations. A graph that shows two or more loss curves is called a generalization curve. The following generalization curve shows two loss curves:

IMAGE: Figure 15. The loss function for the training set gradually declines. The loss function for the validation set also declines, but then it starts to rise after a certain number of iterations.

Figure 15. A generalization curve that strongly implies overfitting.

Notice that the two loss curves behave similarly at first and then diverge. That is, after a certain number of iterations, loss declines or holds steady (converges) for the training set, but increases for the validation set. This suggests overfitting.

In contrast, a generalization curve for a well-fit model shows two loss curves that have similar shapes.

What causes overfitting?

Very broadly speaking, overfitting is caused by one or both of the following problems:

  • The training set doesn't adequately represent real life data (or the validation set or test set).
  • The model is too complex.

Generalization conditions

A model trains on a training set, but the real test of a model's worth is how well it makes predictions on new examples, particularly on real-world data. While developing a model, your test set serves as a proxy for real-world data. Training a model that generalizes well implies the following dataset conditions:

  • Examples must be independently and identically distributed, which is a fancy way of saying that your examples can't influence each other.
  • The dataset is stationary, meaning the dataset doesn't change significantly over time.
  • The dataset partitions have the same distribution. That is, the examples in the training set are statistically similar to the examples in the validation set, test set, and real-world data.

Explore the preceding conditions through the following exercises.

Exercises: Check your understanding

Consider the following dataset partitions.

IMAGE: A horizontal bar divided into three pieces.

What should you do to ensure that the examples in the training set have a similar statistical distribution to the examples in the validation set and the test set?

Shuffle the examples in the dataset extensively before partitioning them.

Yes. Good shuffling of examples makes partitions much more likely to be statistically similar.

Sort the examples from earliest to most recent.

If the examples in the dataset are not stationary, then sorting makes the partitions less similar.

Do nothing. Given enough examples, the law of averages naturally ensures that the distributions will be statistically similar.

Unfortunately, this is not the case. The examples in certain sections of the dataset may differ from those in other sections.

A streaming service is developing a model to predict the popularity of potential new television shows for the next three years. The streaming service plans to train the model on a dataset containing hundreds of millions of examples, spanning the previous ten years. Will this model encounter a problem?

Probably. Viewers' tastes change in ways that past behavior can't predict.

Yes. Viewer tastes are not stationary. They constantly change.

Definitely not. The dataset is large enough to make good predictions.

Unfortunately, viewers' tastes are nonstationary.

Probably not. Viewers' tastes change in predictably cyclical ways. Ten years of data will enable the model to make good predictions on future trends.

Although certain aspects of entertainment are somewhat cyclical, a model trained from past entertainment history will almost certainly have trouble making predictions about the next few years.

A model aims to predict the time it takes for people to walk a mile based on weather data (temperature, dew point, and precipitation) collected over one year in a city whose weather varies significantly by season. Can you build and test a model from this dataset, even though the weather readings change dramatically by season?

Yes

Yes, it is possible to build and test a model from this dataset. You just have to ensure that the data is partitioned equally, so that data from all four seasons is distributed equally into the different partitions.

No

Assuming this dataset contains enough examples of temperature, dew point, and precipitation, then you can build and test a model from this dataset. You just have to ensure that the data is partitioned equally, so that data from all four seasons is distributed equally into the different partitions.

Challenge exercise

You are creating a model that predicts the ideal date for riders to buy a train ticket for a particular route. For example, the model might recommend that users buy their ticket on July 8 for a train that departs July 23. The train company updates prices hourly, basing their updates on a variety of factors but mainly on the current number of available seats. That is:

  • If a lot of seats are available, ticket prices are typically low.
  • If very few seats are available, ticket prices are typically high.

Your model exhibits low loss on the validation set and the test set but sometimes makes terrible predictions on real-world data. Why?

Click here to see the answer

Answer: The model deployed in the real world is struggling with a feedback loop.

For example, suppose the model recommends that users buy tickets on July 8. Some riders who use the model's recommendation buy their tickets at 8:30 in the morning on July 8. At 9:00, the train company raises prices because fewer seats are now available. Riders using the model's recommendation have altered prices. By evening, ticket prices might be much higher than in the morning.

Key terms:

Overfitting: Model complexity

The previous unit introduced the following model, which miscategorized a lot of trees in the test set:

IMAGE: Figure 16. The same image as Figure 13. This is a complex shape that miscategorizes many trees.

Figure 16. The misbehaving complex model from the previous unit.

The preceding model contains a lot of complex shapes. Would a simpler model handle new data better? Suppose you replace the complex model with a ridiculously simple model: a straight line.

IMAGE: Figure 17. A straight line model that does an excellent job separating the sick trees from the healthy trees.

Figure 17. A much simpler model.

The simple model generalizes better than the complex model on new data. That is, the simple model made better predictions on the test set than the complex model.

Simplicity has been beating complexity for a long time. In fact, the preference for simplicity dates back to ancient Greece. Centuries later, a fourteenth-century friar named William of Occam formalized the preference for simplicity in a philosophy known as Occam's razor. This philosophy remains an essential underlying principle of many sciences, including machine learning.

Note: Complex models typically outperform simple models on the training set. However, simple models typically outperform complex models on the test set (which is more important).

Exercises: Check your understanding

You are developing a physics equation. Which of the following formulas conform more closely to Occam's Razor?

A formula with three variables.

Three variables is more Occam-friendly than twelve variables.

A formula with twelve variables.

Twelve variables seems overly complicated, doesn't it? The two most famous physics formulas of all time (\( F = ma \) and \( E = mc^2 \)) each involve only three variables.

You're on a brand-new machine learning project, about to select your first features. How many features should you pick?

Pick 1–3 features that seem to have strong predictive power.

It's best for your data collection pipeline to start with only one or two features. This will help you confirm that the ML model works as intended. Also, when you build a baseline from a couple of features, you'll feel like you're making progress!

Pick 4–6 features that seem to have strong predictive power.

You might eventually use this many features, but it's still better to start with fewer. Fewer features usually means fewer unnecessary complications.

Pick as many features as you can, so you can start observing which features have the strongest predictive power.

Start smaller. Every new feature adds a new dimension to your training dataset. When the dimensionality increases, the volume of the space increases so fast that the available training data become sparse. The sparser your data, the harder it is for a model to learn the relationship between the features that actually matter and the label. This phenomenon is called "the curse of dimensionality."

Regularization

Machine learning models must simultaneously meet two conflicting goals:

  • Fit data well.
  • Fit data as simply as possible.

One approach to keeping a model simple is to penalize complex models; that is, to force the model to become simpler during training. Penalizing complex models is one form of regularization.

A regularization analogy: Suppose every student in a lecture hall had a little buzzer that emitted a sound that annoyed the professor. Students would press the buzzer whenever the professor's lecture became too complicated. Annoyed, the professor would be forced to simplify the lecture. The professor would complain, "When I simplify, I'm not being precise enough." The students would counter with, "The only goal is to explain it simply enough that I understand it." Gradually, the buzzers would train the professor to give an appropriately simple lecture, even if the simpler lecture isn't as precise.

Loss and complexity

So far, this course has suggested that the only goal when training was to minimize loss; that is:

\[\text{minimize(loss)}\]

As you've seen, models focused solely on minimizing loss tend to overfit. A better training optimization algorithm minimizes some combination of loss and complexity:

\[\text{minimize(loss + complexity)}\]

Unfortunately, loss and complexity are typically inversely related. As complexity increases, loss decreases. As complexity decreases, loss increases. You should find a reasonable middle ground where the model makes good predictions on both the training data and real-world data. That is, your model should find a reasonable compromise between loss and complexity.

What is complexity?

You've already seen a few different ways of quantifying loss. How would you quantify complexity? Start your exploration through the following exercise:

Exercise: Check your intuition

So far, we've been pretty vague about what complexity actually is. Which of the following ideas do you think would be reasonable complexity metrics?

Complexity is a function of the model's weights.

Yes, this is one way to measure some models' complexity. This metric is called L1 regularization.

Complexity is a function of the square of the model's weights.

Yes, you can measure some models' complexity this way. This metric is called L2 regularization.

Complexity is a function of the biases of all the features in the model.

Bias doesn't measure complexity.

Key terms:

Overfitting: L2 regularization

L2 regularization is a popular regularization metric, which uses the following formula:

\[L_2\text{ regularization } = {w_1^2 + w_2^2 + ... + w_n^2}\]

For example, the following table shows the calculation of L2 regularization for a model with six weights:

Weight | Value | Squared value
w1     |  0.2  | 0.04
w2     | -0.5  | 0.25
w3     |  5.0  | 25.0
w4     | -1.2  | 1.44
w5     |  0.3  | 0.09
w6     | -0.1  | 0.01
       |       | 26.83 = total

Notice that weights close to zero don't affect L2 regularization much, but large weights can have a huge impact. For example, in the preceding calculation:

  • A single weight (w3) contributes about 93% of the total complexity.
  • The other five weights collectively contribute only about 7% of the total complexity.

L2 regularization encourages weights toward 0, but never pushes weights all the way to zero.
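A quick NumPy check of the preceding table: summing the squared weights reproduces the 26.83 total, and w3 alone accounts for roughly 93% of it.

```python
import numpy as np

weights = np.array([0.2, -0.5, 5.0, -1.2, 0.3, -0.1])

l2_penalty = np.sum(weights ** 2)
print(l2_penalty)                    # 26.83
print(weights[2] ** 2 / l2_penalty)  # ~0.93, so w3 dominates the penalty
```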

Exercises: Check your understanding

If you use L2 regularization while training a model, what will typically happen to the overall complexity of the model?

The overall complexity of the system will probably drop.

Since L2 regularization encourages weights towards 0, the overall complexity will probably drop.

The overall complexity of the model will probably stay constant.

This is very unlikely.

The overall complexity of the model will probably increase.

This is unlikely. Remember that L2 regularization encourages weights towards 0.

If you use L2 regularization while training a model, some features will be removed from the model.

True

Although L2 regularization may make some weights very small, it will never push any weights all the way to zero. Consequently, all features will still contribute something to the model.

False

L2 regularization never pushes weights all the way to zero.

Regularization rate (lambda)

As noted, training attempts to minimize some combination of loss and complexity:

\[\text{minimize(loss} + \text{ complexity)}\]

Model developers tune the overall impact of complexity on model training by multiplying its value by a scalar called the regularization rate. The Greek character lambda typically symbolizes the regularization rate.

That is, model developers aim to do the following:

\[\text{minimize(loss} + \lambda \text{ complexity)}\]
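Here is a minimal sketch of how lambda scales the complexity term, assuming a squared-error data loss and L2 complexity; the example values are illustrative.

```python
import numpy as np

def regularized_loss(y_true, y_pred, weights, lam):
    """Return loss + lambda * complexity, with L2 complexity."""
    loss = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)  # data loss (MSE)
    complexity = np.sum(np.asarray(weights) ** 2)                   # L2 penalty
    return loss + lam * complexity

y_true = [1.0, 0.0, 1.0]
y_pred = [0.9, 0.2, 0.7]
weights = [0.2, -0.5, 5.0]

# A larger regularization rate makes the complexity term dominate.
for lam in [0.0, 0.01, 0.1]:
    print(lam, regularized_loss(y_true, y_pred, weights, lam))
```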

A high regularization rate:

  • Strengthens the influence of regularization, thereby reducing the chances of overfitting.
  • Tends to produce a histogram of model weights having the following characteristics:
    • a normal distribution
    • a mean weight of 0.

A low regularization rate:

  • Lowers the influence of regularization, thereby increasing the chances of overfitting.
  • Tends to produce a histogram of model weights with a flat distribution.

For example, the histogram of model weights for a high regularization rate might look as shown in Figure 18.

IMAGE: Figure 18. Histogram of a model's weights with a mean of zero and a normal distribution.

Figure 18. Weight histogram for a high regularization rate. Mean is zero. Normal distribution.

In contrast, a low regularization rate tends to yield a flatter histogram, as shown in Figure 19.

IMAGE: Figure 19. Histogram of a model's weights with a mean of zero that is somewhere between a flat distribution and a normal distribution.

Figure 19. Weight histogram for a low regularization rate. Mean may or may not be zero.

Note: Setting the regularization rate to zero removes regularization completely. In this case, training focuses exclusively on minimizing loss, which poses the highest possible overfitting risk.

Picking the regularization rate

The ideal regularization rate produces a model that generalizes well to new, previously unseen data. Unfortunately, that ideal value is data-dependent, so you must do some tuning.

Early stopping: an alternative to complexity-based regularization

Early stopping is a regularization method that doesn't involve a calculation of complexity. Instead, early stopping simply means ending training before the model fully converges. For example, you end training when the loss curve for the validation set starts to increase (slope becomes positive).

Although early stopping usually increases training loss, it can decrease test loss.

Early stopping is a quick, but rarely optimal, form of regularization. The resulting model is very unlikely to be as good as a model trained thoroughly on the ideal regularization rate.
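As a minimal sketch of the idea (the train_one_epoch and validation_loss callables are caller-supplied placeholders for whatever training framework you use, and the patience parameter is a common refinement on "stop as soon as validation loss rises"):

  def train_with_early_stopping(train_one_epoch, validation_loss,
                                max_epochs=100, patience=3):
      # Stop when validation loss hasn't improved for `patience` epochs in a row.
      # train_one_epoch and validation_loss are caller-supplied placeholders.
      best_loss = float("inf")
      epochs_without_improvement = 0
      for epoch in range(max_epochs):
          train_one_epoch()
          loss = validation_loss()
          if loss < best_loss:
              best_loss = loss
              epochs_without_improvement = 0
          else:
              epochs_without_improvement += 1
              if epochs_without_improvement >= patience:
                  print(f"Stopping early at epoch {epoch}")
                  break
      return best_loss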

Finding equilibrium between learning rate and regularization rate

Learning rate and regularization rate tend to pull weights in opposite directions. A high learning rate often pulls weights away from zero; a high regularization rate pulls weights towards zero.

If the regularization rate is high with respect to the learning rate, the weak weights tend to produce a model that makes poor predictions. Conversely, if the learning rate is high with respect to the regularization rate, the strong weights tend to produce an overfit model.
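One way to see these opposing pulls is to write out a single gradient descent step when complexity is the squared-L2 term (a standard derivation, with \( \eta \) denoting the learning rate):

\[ w \leftarrow w - \eta \left( \frac{\partial \text{loss}}{\partial w} + 2\lambda w \right) = (1 - 2\eta\lambda)\,w - \eta \frac{\partial \text{loss}}{\partial w} \]

The \( (1 - 2\eta\lambda) \) factor shrinks the weight toward zero on every step, while the loss-gradient term can push it away from zero; how the two effects balance depends on both \( \eta \) and \( \lambda \).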

Your goal is to find the equilibrium between learning rate and regularization rate. This can be challenging. Worst of all, once you find that elusive balance, you may ultimately have to change the learning rate. And when you change the learning rate, you'll again have to find the ideal regularization rate.


Overfitting: Interpreting loss curves

Machine learning would be much simpler if all your loss curves looked like this the first time you trained your model:

IMAGE: Figure 20. A plot showing the ideal loss curve when training a machine learning model.

Figure 20. An ideal loss curve.

Unfortunately, loss curves are often challenging to interpret. Use your intuition about loss curves to solve the exercises on this page.

Exercise 1. Oscillating loss curve

IMAGE: Figure 21. A loss curve (loss on the y-axis; number of training steps on the x-axis) in which the loss doesn't flatten out. Instead, loss oscillates erratically.

Figure 21. Oscillating loss curve.

What three things could you do to try to improve the loss curve shown in Figure 21?

Check your data against a data schema to detect bad examples, and then remove the bad examples from the training set.

Yes, this is a good practice for all models.

Reduce the learning rate.

Yes, reducing learning rate is often a good idea when debugging a training problem.

Reduce the training set to a tiny number of trustworthy examples.

Although this technique sounds artificial, it is actually a good idea. Assuming that the model converges on the small set of trustworthy examples, you can then gradually add more examples, perhaps discovering which examples cause the loss curve to oscillate.

Increase the number of examples in the training set.

This is a tempting idea, but it is extremely unlikely to fix the problem.

Increase the learning rate.

In general, avoid increasing the learning rate when a model's learning curve indicates a problem.

Exercise 2. Loss curve with a sharp jump

IMAGE: Figure 22. A loss curve plot that shows the loss decreasing up to a certain number of training steps and then suddenly increasing with further training steps.

Figure 22. Sharp rise in loss.

Which two of the following statements identify possible reasons for the exploding loss shown in Figure 22?

The input data contains one or more NaNs—for example, a value caused by a division by zero.

This is more common than you might expect.

The input data contains a burst of outliers.

Sometimes, due to improper shuffling of batches, a batch might contain a lot of outliers.

The learning rate is too low.

A very low learning rate might increase training time, but it is not the cause of the strange loss curve.

The regularization rate is too high.

It's true that a very high regularization rate could prevent a model from converging; however, it won't cause the strange loss curve shown in Figure 22.

Exercise 3. Test loss diverges from training loss

IMAGE: Figure 23. The training loss curve appears to converge, but the validation loss begins to rise after a certain number of training steps.

Figure 23. Sharp rise in validation loss.

Which one of the following statements best identifies the reason for this difference between the loss curves of the training and test sets?

The model is overfitting the training set.

Yes, it probably is. Possible solutions:

  • Make the model simpler, possibly by reducing the number of features.
  • Increase the regularization rate.
  • Ensure that the training set and test set are statistically equivalent.

The learning rate is too high.

If the learning rate were too high, the loss curve for the training set would likely not have behaved as it did.

Exercise 4. Loss curve gets stuck

IMAGE: Figure 24. A plot of a loss curve showing the loss beginning to converge with training but then displaying repeated patterns that look like a rectangular wave.

Figure 24. Chaotic loss after a certain number of steps.

Which one of the following statements is the most likely explanation for the erratic loss curve shown in Figure 24?

The training set is not shuffled well.

This is a possibility. For example, a training set that contains 100 images of dogs followed by 100 images of cats may cause loss to oscillate as the model trains. Ensure that you shuffle examples sufficiently; see the sketch after this exercise.

The regularization rate is too high.

This is unlikely to be the cause.

The training set contains too many features.

This is unlikely to be the cause.
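As a quick sketch of the shuffling point from the first answer above (using NumPy; the arrays are illustrative stand-ins for a real training set), reshuffling the examples before each epoch keeps any one class from arriving in a long unbroken run:

  import numpy as np

  # Illustrative stand-ins for a real training set:
  # 100 "dog" examples followed by 100 "cat" examples.
  features = np.arange(200).reshape(200, 1)
  labels = np.array([0] * 100 + [1] * 100)

  rng = np.random.default_rng(seed=42)
  for epoch in range(3):
      # One permutation per epoch, applied to features and labels together,
      # so every batch mixes both classes.
      permutation = rng.permutation(len(labels))
      shuffled_features = features[permutation]
      shuffled_labels = labels[permutation]
      # ...iterate over batches of (shuffled_features, shuffled_labels) here...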
