From regression to neural networks

While ML is all the rage right now, it is important to understand the fundamentals powering the incredible breakthroughs of recent years in fields like natural language processing (NLP), computer vision (CV), and generative AI.

While there are many resources available, Andrew Ng’s courses at deeplearning.ai have been a popular choice for people getting into deep learning, myself included. In this guide, I want to build an intuition for artificial neural networks (or neural networks, in short) connecting regression to neural networks in the process. This content is mostly based on the aforementioned courses and material available online, so please make sure to check out Andrew’s courses, where he takes more time to explain everything outlined in this post in more detail.

Linear Regression

I first heard about linear regression in my undergrad courses on Statistics, Operations Research, and Production and Logistics. The easiest way to think about regression is that you have a function that, given an input, predicts an output based on examples it has seen in the past. In math terms, linear regression is a simple linear function of the following shape: $f(x)=wx+b$ , where $w$ is the weight (the slope of the line) and $b$ is the bias term (the y-axis intercept). In past lectures, you might have seen these two parameters written with other symbols, such as $\alpha$ and $\beta$ ; note that $\alpha$ denotes the learning rate later in this post.

Imagine you want to estimate the price $\hat{y}$ for a train ticket in euros based on the distance covered in km $x$ . You have collected two examples, ( $x$ : 100, $y$ : 30) and ( $x$ : 500, $y$ : 80), which will be our training data set. In this notation, the first value represents the $x$ value or feature (distance covered in km), while the second value is the actual output/target value $y$ (price in €). The goal now is to find (fit) a line that best matches the two values to predict new values $\hat{y}$ for given $x$ .

What does best mean in this case? If you think about different possible lines, you want to find a slope and y-axis intercept that are “closest” to your training data, that is, the two examples above. One metric we can employ is the mean squared error, which averages the squared differences between the actual values (e.g. 30€ for an input $x$ of 100km) and the predicted values $\hat{y}=f(x)$ .

One way to write this is as a cost function $J$ that, given a weight $w$ and bias $b$ , determines a cost value for the training data set. The lower this cost, the more aligned the linear function is to your data. You can imagine the cost function as a surface in a plot where $w$ is on one axis, $b$ on another, and the cost on the third. The $w$ and $b$ values at the lowest point of this surface yield the lowest cost.

\min_{w,b} J(w,b) \quad \text{where} \quad J(w,b)={1 \over 2m}\sum_{i=0}^{m-1}(f_{w,b}(x^{(i)})-y^{(i)})^2
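
To make the cost function less abstract, here is a minimal sketch in plain Python that evaluates $J(w,b)$ for the two train ticket examples. The function names `predict` and `cost` and the two candidate lines are chosen for illustration only.

```python
def predict(x, w, b):
    """Linear model f(x) = w * x + b."""
    return w * x + b


def cost(xs, ys, w, b):
    """J(w, b): sum of squared errors over the m examples, divided by 2m."""
    m = len(xs)
    return sum((predict(x, w, b) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Training data from the train ticket example: distance in km, price in euros.
xs = [100, 500]
ys = [30, 80]

# Two candidate lines (illustrative choices):
cost_flat = cost(xs, ys, w=0.0, b=0.0)      # a line that always predicts 0
cost_exact = cost(xs, ys, w=0.125, b=17.5)  # the line through both points
```

The flat line ends up with a cost of 1825, while the line that passes through both points has a cost of 0, which is the lowest value the cost can take.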

Having this function is great and all, but how do you minimize the cost to make sure your model is properly fit to the training data? And more importantly, how do you make sure the model works for values it hasn’t seen before? To explain this, we start by looking at gradient descent, a procedure to iteratively minimize your cost function by changing the weight $w$ and bias $b$ accordingly.

Gradient descent

To gain an intuition of gradient descent, imagine you’re standing on top of a mountain and want to get downhill as fast as possible. Given multiple paths, you would choose the one with the steepest slope (also called gradient) to reach the valley in the smallest amount of time. This is exactly what gradient descent does: Given a cost function and initial weights and bias, it determines the steepest slope and adjusts weights in that direction, minimizing the cost as fast as possible.

Not all functions necessarily have a single minimum, but for this guide, we are considering convex cost functions, which are bowl-shaped with a single local and global minimum. If you think about your cost function as a bowl-shaped curve, a good intuition would be to progressively move to the lowest point of the curve to get to the minimum. If we substitute the definition of $f(x)$ into the cost function, we end up with the following term:

J(w,b)={1 \over 2m}\sum_{i=0}^{m-1}({(wx^{(i)}+b)}-y^{(i)})^2

If our goal is to find the values for $w,b$ with the minimum associated cost, we can iteratively increase or decrease $w,b$ . In which direction and how far we should change our weights depends on where we currently are on the cost function: If we are far away from the minimum on a steep part of the curve, we want to move “downhill” a lot. If we are close to the minimum, small steps will get us to our goal. Thinking back to differential calculus, one way to identify the slope of our cost function is by finding the derivative, more precisely the partial derivatives of $J$ with respect to $w$ and $b$ . Don’t worry about the math if you haven’t thought about the chain rule in a long time. The most important takeaway is an intuition for why we use gradient descent: by repeatedly updating $w$ and $b$ , we progressively move towards the minimum of our cost curve.

f_{w,b}(x^{(i)})=wx^{(i)}+b \tag{1}

{\partial J(w,b) \over \partial b} = {1 \over m} \sum_{i=0}^{m-1}{(f_{w,b}(x^{(i)})-y^{(i)})} \tag{2}

{\partial J(w,b) \over \partial w} = {1 \over m} \sum_{i=0}^{m-1}{(f_{w,b}(x^{(i)})-y^{(i)})}*x^{(i)} \tag{3}

With the partial derivatives $(2)$ and $(3)$ at hand, we can now complete the optimization algorithm known as gradient descent. A gradient is the direction of the fastest increase, and we try to move in the opposite direction of the gradient at any given point, as this is the direction of the steepest descent. We select a learning rate $\alpha$ , determining how fast we should move along the cost function. You might wonder why we choose a fixed learning rate that could lead to adverse effects, and you’d be right. We’ll cover alternative optimization methods with adaptive learning rates in a subsequent section.

\begin{align*}& \text{repeat until convergence:} \; \lbrace \newline \; & b := b - \alpha \frac{\partial J(w,b)}{\partial b} \newline \; & w := w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{4} \; \newline & \rbrace\end{align*}

Setting a high learning rate can lead to “bouncing” back and forth along the cost curve without ever converging at the minimum. A low learning rate can lead to little progress. Gradient descent is run for a predefined number of iterations or epochs, converging on a cost value.

Applying gradient descent multiple times progressively updates the weights until the curve fits our training data to an acceptable degree. If we try to input different distance values for our train ticket price estimator, we can gradually observe more aligned price values with each iteration.

The process of running multiple iterations of gradient descent on a training data set is called training.
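
To make update rule $(4)$ concrete, here is a minimal sketch in plain Python. It assumes the distances are rescaled to units of 100 km (so 100 km becomes 1.0), a choice made here only so that a fixed learning rate of 0.1 converges quickly. The function names, learning rate, and iteration count are illustrative assumptions as well.

```python
def gradients(xs, ys, w, b):
    """Partial derivatives (2) and (3) of the cost J with respect to b and w."""
    m = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
    dj_db = sum(errors) / m
    return dj_dw, dj_db


def gradient_descent(xs, ys, alpha, iterations):
    """Repeat the simultaneous update of w and b for a fixed number of steps."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        dj_dw, dj_db = gradients(xs, ys, w, b)
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b


# Assumption: distance expressed in units of 100 km instead of km.
xs = [1.0, 5.0]
ys = [30.0, 80.0]

w, b = gradient_descent(xs, ys, alpha=0.1, iterations=2000)
price_300km = w * 3.0 + b  # predicted price for a 300 km trip
```

With only two training examples, the loop settles on the line through both points, roughly 12.5 € per 100 km plus a base price of 17.5 €. Without rescaling the distances, a much smaller learning rate would be needed to avoid the “bouncing” described above.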

While it was nice to predict continuous output values like the price for a train ticket, what if we want to estimate categorical values, such as whether a customer is expected to buy a product or not based on a price? This shifts the problem from regression to classification or, put differently, from the task of predicting a numeric value to predicting one of multiple known classes or categories. In binary classification, the output value can take the values 0 (customer does not buy the product) or 1 (customer buys the product), but never both at the same time. To see how we can solve this case, let’s look into logistic regression, an alternative approach to linear regression used for binary classification.

Logistic regression for binary classification

Logistic regression is analogous to linear regression with a key difference: Instead of predicting a continuous output directly, logistic regression passes it through the logistic function or sigmoid function to obtain a probability, which is then turned into an output of either 0 or 1 for a specific input, a process known as binary classification. If the last sentence sounded like I just made up words, don’t worry, we’ll work our way through it in a second!

While you could run linear regression to estimate whether a customer buys a product or not, it is not the best fit for classification tasks, in which you want to predict a categorical output value from a range of possible solutions given an input. The reason for this is that fitting a straight line is rarely sufficient to model the difference between two classes or categories of values. An S-shaped curve, on the other hand, can yield better results. This is precisely why logistic regression works better for classification tasks.

One function modeling an S-shaped curve is the logistic function or sigmoid function. The sigmoid function maps continuous input values to output values in the range between 0 and 1, making it ideal for modeling a probability, which also lies between 0 and 1.

To give an example, consider the case where we want to check whether a customer buys a product given its price. In our training data, we have collected multiple historical instances of $x/y$ -pairs: $(30, 1), (40, 1), (55, 0), (70, 1), (130, 0)$ . To calculate the probability $g(z)$ of whether the customer would buy a product for a given price $x$ , the following function is used:

z = f_{w,b}(x)=wx+b \tag{1}

g(z)={1 \over 1+e^{-z}} \tag{2}

While $(1)$ should be familiar as we compute the value for $z$ using linear regression, $g(z)$ is the logistic function. Applying $g(z)$ yields a probability that the output $y$ is 1 for a given input $x$ . As a final step, we add a threshold after which $y$ is assumed to be 1, normally choosing 0.5.
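
The two steps, computing a probability and applying a threshold, can be sketched as follows. The parameters `w` and `b` are hypothetical values picked by hand for illustration (they are not fitted to the training data), and the function names are my own.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Probability that y = 1 for input x: g(w * x + b)."""
    return sigmoid(w * x + b)


def predict_class(x, w, b, threshold=0.5):
    """Turn the probability into a class label using a threshold."""
    return 1 if predict_proba(x, w, b) >= threshold else 0


# Hypothetical parameters: higher prices make a purchase less likely.
w, b = -0.1, 6.0

cheap = predict_class(30, w, b)       # z = 3, probability well above 0.5
expensive = predict_class(130, w, b)  # z = -7, probability well below 0.5
```

With these hand-picked parameters, a price of 30 is classified as 1 (buys) and a price of 130 as 0 (does not buy).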

The main implementation difference between linear regression and logistic regression is that logistic regression applies $g(z)$ to the regression output and chooses the resulting class using a threshold value. To update and fit weights, we still use gradient descent. However, the cost function undergoes a slight change:

J(w,b)={1 \over m}\sum_{i=0}^{m-1}[\text{loss}(f_{w,b}(x^{(i)}), y^{(i)})]

You might be wondering what the loss function is doing here. A loss function describes the cost for a single data point in our training set. Knowing the loss function, the cost is the average of the losses over all data points. For logistic regression, the loss function is the log loss function:

\text{loss}(f_{w,b}(x^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{w,b}\left(x^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{w,b}\left(x^{(i)} \right) \right)

If you look closely, you can see that, much like an if-statement used for branching, one of the two terms always ends up being 0, because $y$ can only be 0 or 1. You might also know this loss function as the negative log-likelihood or binary cross-entropy. The loss approaches 0 when the expected output $y$ is 0 and the predicted probability is close to 0, or when $y$ is 1 and the predicted probability is close to 1. For predicted probabilities between 0 and 1, the logarithm is negative, and the leading minus sign turns it into a positive loss. If the expected value is 0 and the predicted probability is exactly 1, or vice versa, the log function receives 0 and the loss is undefined (it grows towards infinity). Taking the logarithm maps our probability $g(z)$ onto a log scale, so confident but wrong predictions are penalized heavily.
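
A minimal sketch of the log loss for a single data point could look like this. It takes the predicted probability `p` as given, and it clips `p` by a small `eps` so that the logarithm never receives exactly 0. The clipping is a common safeguard I am assuming here, not something the formula itself prescribes.

```python
import math


def log_loss(p, y, eps=1e-12):
    """Log loss for one example with predicted probability p and label y (0 or 1)."""
    # Assumption: keep p strictly inside (0, 1) to avoid log(0).
    p = min(max(p, eps), 1.0 - eps)
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)


confident_right = log_loss(0.9, 1)  # only the first term is active, about 0.105
confident_wrong = log_loss(0.1, 1)  # same label, poor prediction, about 2.303
undecided = log_loss(0.5, 0)        # only the second term is active, about 0.693
```

The numbers show the branching behavior: for $y=1$ only the first term contributes, for $y=0$ only the second, and the loss grows quickly as the predicted probability moves away from the label.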

Please note, however, that

f_{w,b}(x^{(i)})=g(w*x^{(i)}+b)

where the function $g$ is the sigmoid function. This is an important difference. If we derive the partial derivatives of the cost function with respect to our weights and bias, we end up with the same gradient descent formulas we used for linear regression, with the big difference being the application of the sigmoid function in $f_{w,b}$ for logistic regression.

\frac{\partial J(w,b)}{\partial w_j} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x_{j}^{(i)}

\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})

\begin{align*}& \text{repeat until convergence:} \; \lbrace \newline \; & b := b - \alpha \frac{\partial J(w,b)}{\partial b} \newline \; & w_j := w_j - \alpha \frac{\partial J(w,b)}{\partial w_j} \; & \text{for j := 0..n-1}\newline & \rbrace\end{align*}

That’s already it for logistic regression. Remember that you can use logistic regression for binary classification because it uses the logistic or sigmoid function to return the probabilities of the target value being 1. The logistic loss function computes the deviation from the expected probability. Applying a threshold to the $\hat{y}$ or the probability of $y=1$ yields a value of 0 or 1, which is the respective class predicted by logistic regression.
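
Putting the pieces together, the following sketch trains a logistic regression model on the five price examples from above with plain gradient descent. The prices are divided by 100 (an assumed rescaling so that a fixed learning rate behaves well), and the learning rate and iteration count are illustrative choices rather than tuned values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(xs, ys, w, b):
    """Average log loss over the training set."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def train(xs, ys, alpha=0.5, iterations=5000):
    """Gradient descent with the same update rule as in linear regression."""
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
        dj_db = sum(errors) / m
        w, b = w - alpha * dj_dw, b - alpha * dj_db
    return w, b


# Training data from the example, with prices rescaled by 1/100 (assumption).
prices = [0.30, 0.40, 0.55, 0.70, 1.30]
bought = [1, 1, 0, 1, 0]

initial_cost = cost(prices, bought, 0.0, 0.0)
w, b = train(prices, bought)
final_cost = cost(prices, bought, w, b)
```

After training, the cost is lower than at the starting point and the learned weight is negative, so higher prices lead to a lower predicted probability of buying. Because the examples are not perfectly separable by price, some predictions remain wrong even at the minimum.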

Multiple regression

While our previous examples worked with one input or feature value, usually you’d like to use multiple features in a feature vector to make predictions. For estimating the price of train tickets, maybe in addition to the distance in km $x_1$ the current capacity $x_2$ of the train or price reductions $x_3$ should be taken into account. A general model of linear regression allows using $n$ features in multiple regression.

\hat{y}^{(i)}=w_1*x^{(i)}_1+w_2*x^{(i)}_2+\dots+w_n*x^{(i)}_n+b

A generalization for our vectorized regression function is

f(\vec{x})=\vec{w} \cdot \vec{x}+b

What changes in multiple regression is that in addition to receiving a vector $\vec{x}$ of inputs $x_1^{(i)},x_2^{(i)},...,x_n^{(i)}$ for each value $i$ in the training data instead of a single input, we calculate the dot product of our feature vector with a new vector of weights. Instead of one weight $w$ , we compute weights $w_1,w_2,...,w_n$ for each feature, to account for the varying impact on the output for each feature.

What changes in the training process is that we have to update all weights in gradient descent instead of a single weight. That’s it.
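
As a small sketch of the vectorized form, the prediction becomes a dot product and the gradient gets one entry per weight. The feature values and weights below are made up for illustration (distance in units of 100 km, train capacity as a fraction, and a discount flag), and the helper names are my own.

```python
def dot(u, v):
    """Dot product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))


def predict(x, w, b):
    """f(x) = w . x + b for a feature vector x."""
    return dot(w, x) + b


def gradients(X, y, w, b):
    """One partial derivative per weight w_j, plus one for the bias."""
    m, n = len(X), len(w)
    errors = [predict(x_i, w, b) - y_i for x_i, y_i in zip(X, y)]
    dj_dw = [sum(errors[i] * X[i][j] for i in range(m)) / m for j in range(n)]
    dj_db = sum(errors) / m
    return dj_dw, dj_db


# Hypothetical example: [distance / 100 km, capacity fraction, discount flag]
x = [3.0, 0.8, 1.0]
w = [12.0, 10.0, -5.0]
b = 15.0

price = predict(x, w, b)  # 36 + 8 - 5 + 15
```

Each weight is then updated with its own gradient entry, exactly as in the single-feature case.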

Regularization against overfitting

When you fit a model, be it with linear or logistic regression, you may end up overfitting the model to the training data. Ultimately, you want to use your model to make predictions for inputs not seen in the training data before. The model should use the training data to encode information and generalize for further usage as opposed to remembering just the training data but not working for new inputs.

One way to fight overfitting is by adding a regularization term to your cost function $J$ like the following:

J(w,b)=...+{\lambda \over 2m} \sum_{j=1}^{n}{w_j^2}

As you can see, the cost function is increased by the sum of the squared weights over all $n$ features, scaled by the regularization parameter $\lambda$ . This forces weights to be small, adapting the cost function to penalize large values of $w$ in addition to fitting to data. The regularization term must also be added to the weight updates in gradient descent:

{\partial{J(w,b)} \over \partial w_j}=...+{\lambda \over m}w_j

We don’t have to regularize our bias.
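
Here is a minimal sketch of how the regularization term could be added to the cost and gradients, using the squared error cost of linear regression as the part abbreviated by “...” above. The function names and the tiny data set are assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def regularized_cost(X, y, w, b, lam):
    """Squared error cost plus lam / (2m) times the sum of squared weights."""
    m = len(X)
    fit = sum((dot(w, x_i) + b - y_i) ** 2 for x_i, y_i in zip(X, y)) / (2 * m)
    penalty = lam / (2 * m) * sum(w_j ** 2 for w_j in w)
    return fit + penalty


def regularized_gradients(X, y, w, b, lam):
    """Gradients of the regularized cost; the bias gets no penalty term."""
    m, n = len(X), len(w)
    errors = [dot(w, x_i) + b - y_i for x_i, y_i in zip(X, y)]
    dj_dw = [
        sum(errors[i] * X[i][j] for i in range(m)) / m + lam / m * w[j]
        for j in range(n)
    ]
    dj_db = sum(errors) / m
    return dj_dw, dj_db


# One made-up example that the weights fit perfectly (prediction 5.0).
X, y = [[1.0, 2.0]], [5.0]
w, b = [1.0, 2.0], 0.0

plain = regularized_cost(X, y, w, b, lam=0.0)      # no penalty: 0.0
penalized = regularized_cost(X, y, w, b, lam=2.0)  # penalty only: 5.0
```

Even though the fit is perfect, the penalized cost is greater than zero, and the gradient now pushes both weights towards smaller values.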

Other approaches to fighting overfitting include adding more training data and adding or removing different features.

Supervised learning, deep learning, machine learning

Many fields work on processing large volumes of data to make better predictions, creating models to understand multi-modal content such as images and videos, and pursuing artificial general intelligence. While the term machine learning has been used historically, deep learning is a trend that started around ten years ago: building multi-layered networks that are trained to solve problems in various areas, from natural language processing to computer vision, speech recognition, and more.

Supervised learning describes one training paradigm, where inputs are provided with associated targets or labels, telling the model about the “correct” or expected output and aligning the predictions with the training data. In contrast, unsupervised learning algorithms detect patterns in unlabeled data, without further inputs. There are further paradigms such as reinforcement learning, which go beyond the scope of this post, even though they are incredibly interesting.

Artificial Neural Networks

Previously, we learned about regression to create models for predictions based on training by iteratively updating weights using gradient descent. If we think about it, this was a form of supervised learning, where we provided labeled inputs and expected outputs in a training data set, helping the algorithm to improve.

If we follow this intuition and scale it up to multiple units performing regression, we arrive at artificial neural networks. While this is a simplification, a neural network is indeed a chain or directed graph of functions performing regression and converging to produce an output for a given input.

\text{network}: input \rarr \text{hidden layers} \rarr \text{output layer} \rarr output

Artificial networks are split into multiple layers. Receiving an input, individual units in each layer compute an output which is then passed on to the next layer up until the final layer, where the output value is returned. Intermediate layers (excluding the final layer) are also called hidden layers.

\text{neuron}: \text{input } \vec{x} \rarr g(z)=g(\vec{w}\cdot \vec{x} + b) \rarr \text{output } a

The actual magic happens in each unit of each layer, also called a neuron due to the original idea of modeling the human brain, though most of the brain’s inner workings remain to be discovered. In a so-called dense layer, each neuron receives an input vector, calculates the dot product with its stored weights (a vector of the same dimension as the input), adds its stored bias, and then applies an activation function. The activation function can be the sigmoid function we discovered earlier to perform logistic regression, a linear function to perform linear regression, or a different function which we will learn about in a bit.
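
A single neuron can be sketched in a few lines of Python: a dot product, a bias, and an activation function passed in as an argument. The weights and inputs are made up for illustration, and `neuron`, `sigmoid`, and `linear` are names of my own choosing.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(z):
    """Identity activation, as used for plain linear regression."""
    return z


def neuron(x, w, b, activation):
    """Compute a = g(w . x + b) for one neuron."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return activation(z)


# Made-up input and weights: z = 0.5 * 1.0 - 0.25 * 2.0 = 0.0
x = [1.0, 2.0]
w = [0.5, -0.25]

a_sigmoid = neuron(x, w, 0.0, sigmoid)  # g(0) = 0.5
a_linear = neuron(x, w, 1.0, linear)    # 0.0 + 1.0 = 1.0
```

Swapping the activation function is all it takes to switch the same neuron between logistic and linear regression.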

In a dense layer or fully-connected layer, every neuron is connected to every neuron in the previous layer and each connection has a dedicated weight. This is in contrast to convolutional layers in Convolutional Neural Networks (CNNs), where each neuron is connected only to a local region of the input and the same set of weights (i.e., a filter or kernel) is used for every such local region. While CNNs have an advantage in processing grid-like data structures (such as images) by capturing local spatial patterns and having fewer parameters, fully connected layers are generic and can model any kind of relationship, but often come with a much higher number of parameters and therefore increased computational cost and risk of overfitting.

A network constructed using multiple dense layers exclusively is also known as a multilayer perceptron.

Activation functions are critical to introducing non-linearity into a network. While this sounds a bit cryptic, let’s think about what would happen if we were to design a neural network with multiple layers and multiple neurons in each layer, each running linear regression. If we just returned the output value without further modification, we would end up running a chain of linear regression. This chain could be reduced to a single linear regression step, rendering the layers unnecessary.
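
The collapse of stacked linear layers can be checked directly. The sketch below chains two single-neuron “layers” with hand-picked parameters (all values are illustrative) and compares them with one combined linear function. It then inserts the non-linearity $\max(0, z)$ between the layers to show that the shortcut no longer works.

```python
def relu(z):
    """Example non-linearity: max(0, z)."""
    return max(0.0, z)


# Hand-picked parameters for two one-neuron layers (illustrative values).
W1, B1 = 2.0, -1.0
W2, B2 = -3.0, 0.5


def two_linear_layers(x):
    hidden = W1 * x + B1
    return W2 * hidden + B2


def collapsed(x):
    """One linear function with weight W2 * W1 and bias W2 * B1 + B2."""
    return (W2 * W1) * x + (W2 * B1 + B2)


def two_layers_with_relu(x):
    hidden = relu(W1 * x + B1)
    return W2 * hidden + B2


inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
same_without_activation = all(two_linear_layers(x) == collapsed(x) for x in inputs)
same_with_activation = all(two_layers_with_relu(x) == collapsed(x) for x in inputs)
```

Without an activation function the two layers match the single collapsed function on every input, while with the non-linearity in between they differ, for example at $x=-2$ .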

The choice of an activation function depends on the type of problem to be solved. For instance, for binary classification, the sigmoid activation function is often used in the last layer to produce the probability of the output being 1. For regression, often no activation function (or a linear activation) is used in the output layer.

\text{ReLU}: f(x)=\max(0,x)

For hidden layers, an activation function called rectified linear unit (ReLU) is often used as it tends to achieve better performance and train faster.

While it’s possible to use mixed activation functions in the same layer, for the sake of simplicity and efficient training dynamics (enabling performant gradient flows in backpropagation) it is common to choose a single activation function per layer.

Activation colloquially refers to a neuron producing a significant (usually positive) output, so in the context of a ReLU, a neuron is said to be "activated" if it's outputting a positive value, and "not activated" if it's outputting a value of zero.

Softmax

The softmax or softargmax function converts a vector of values to a probability distribution of possible outcomes. This is especially useful in multiclass classification, where you want to assign one discrete outcome for each input.

Building a neural network for multiclass classification thus involves the usual configuration of multiple ReLU hidden layers, but instead of using the sigmoid activation function in the final layer to produce a probability for binary classification, the softmax function is used in the output layer to produce a probability distribution, assigning each outcome a probability between 0 and 1. The output layer should have one unit for each possible output class $z_j$ for $j=1,...,N$ possible outputs. Thus, each neuron calculates the probability for the output class being $j$ .

z_j=\vec{w}_j \cdot \vec{x}+b_j

\text{softmax}: a_j= {{e^{z_j}} \over \sum_{k=1}^N e^{z_k}}

All the $a_1,a_2,...,a_N$ should sum up to 1.
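
A minimal softmax sketch in plain Python is shown below. It subtracts the largest value before exponentiating, a common trick I am adding here to avoid overflow for large inputs; it does not change the result because the factor cancels out in the fraction.

```python
import math


def softmax(z):
    """Convert a list of scores z_1..z_N into probabilities a_1..a_N."""
    z_max = max(z)  # assumption: shift by the maximum for numerical stability
    exps = [math.exp(z_j - z_max) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]


a = softmax([2.0, 1.0, 0.1])          # made-up scores for three classes
large = softmax([1000.0, 1000.0])     # would overflow without the shift
```

The outputs keep the order of the input scores and sum up to 1, so the largest score receives the highest probability.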

For numeric precision, softmax should be combined with loss during training, which can be achieved by removing the activation function in the output layer (setting it to a linear layer) and configuring the loss function to apply the softmax operation in its calculation. This allows for an optimized implementation and more stable and accurate results.

The loss function for a predicted outcome $a$ and the actual target category $y$ in multiclass classification is the categorical cross-entropy or log loss, which is defined as

L(a,y)=\begin{cases} -\log(a_1), & \text{if $y=1$}\\ &\vdots\\ -\log(a_N), & \text{if $y=N$} \end{cases}

For the numerically precise cost function, it follows that the cost is the average of the losses over all training examples. The indicator function below makes sure that, for each example, only the term for the expected output category contributes, as the terms for all other categories are multiplied by 0. Instead of the previously computed activation $a_j$ , the softmax expression is written out inside the logarithm for greater precision.

\mathbf{1}\{y == n\} = \begin{cases} 1, & \text{if $y=n$}.\\ 0, & \text{otherwise}. \end{cases}

J(\mathbf{w},b) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{N} 1\left\{y^{(i)} == j\right\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^N e^{z^{(i)}_k} }\right]
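
One way to sketch this cost in code is to work on the raw scores $z$ (often called logits) and compute the logarithm of the softmax directly. The log-sum-exp formulation used below is an assumption about how such a combined implementation might look, and classes are indexed from 0 instead of 1 to match Python lists.

```python
import math


def log_softmax(z):
    """log(softmax(z)) computed without forming the probabilities first."""
    z_max = max(z)
    log_sum = z_max + math.log(sum(math.exp(z_j - z_max) for z_j in z))
    return [z_j - log_sum for z_j in z]


def cross_entropy_cost(logits, labels):
    """Average loss; the indicator simply selects the entry of the true class."""
    m = len(logits)
    return -sum(log_softmax(z)[y] for z, y in zip(logits, labels)) / m


uniform = cross_entropy_cost([[0.0, 0.0, 0.0]], [0])  # log(3), about 1.0986
right = cross_entropy_cost([[1000.0, 0.0]], [0])      # confident and correct
wrong = cross_entropy_cost([[1000.0, 0.0]], [1])      # confident and wrong
```

Computing the probabilities first and taking the logarithm afterwards would fail for the last case, because the probability of the true class rounds to 0. Working on the scores keeps the loss finite.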

Information flows in neural networks

In classic (feedforward) neural networks, information flows through the network from the first layer to the last layer just once, producing the output value by invoking a chain of neurons. This flow can be represented as a directed acyclic graph, as values flow from one end to another in forward propagation.

To adjust the model’s parameters during training, gradient descent can be used. For gradient descent, we need to compute the partial derivatives of the cost function with respect to the model’s weights and biases. These are obtained in a process called backpropagation. First, in the forward pass, the model is traversed from the input to the output layer through the hidden layers to compute the hidden layers’ outputs and the final output value. Then, in the backward pass, the derivative of the cost function is calculated at the output layer and propagated back through the hidden layers towards the input. Knowing the intermediate results from forward propagation, backpropagation thus obtains the partial derivatives by applying the chain rule to each node in a so-called computation graph, which is made up of each operation performed to get to the output value.

In addition to feedforward neural networks where values flow from one end to the other just once in forward propagation, model architectures such as recurrent neural networks feed values into neurons multiple times. In this case, the graph becomes cyclic and an internal state is created for each neuron, which acts as a memory of earlier steps. This family of architectures includes models like LSTMs (long short-term memory), which can retain information over thousands of discrete steps and are particularly effective in speech recognition.

One remaining problem in training neural networks is that gradients may get so small that weights barely change during training. This is called the vanishing gradient problem. It arises in backpropagation because the chain rule multiplies many factors together, and in deep networks this product can shrink towards zero. One possible solution to this problem is adding skip connections, which first appeared in the Highway Network (one of the first fully functional deep neural networks). Skip connections are also known as information highways, as gradients can flow along them while bypassing intermediate layers, and values from a neuron in one layer can be passed directly into a neuron in a much deeper layer. Residual neural networks (ResNets) are a well-known family of networks built on skip connections.
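
To see the chain rule at work, here is a sketch of backpropagation for a tiny network with one sigmoid hidden neuron and one linear output neuron, trained with a squared error loss. The network shape, parameter values, and function names are assumptions for illustration. A numerical gradient is included only to check the hand-derived one.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x, y):
    """Forward pass: returns the loss and the intermediate values."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)       # hidden neuron
    z2 = w2 * a1 + b2               # linear output neuron
    loss = 0.5 * (z2 - y) ** 2
    return loss, (a1, z2)


def backward(params, x, y):
    """Backward pass: apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, (a1, z2) = forward(params, x, y)
    d_z2 = z2 - y                   # dLoss/dz2
    d_w2, d_b2 = d_z2 * a1, d_z2    # output layer parameters
    d_a1 = d_z2 * w2                # pass the gradient to the hidden layer
    d_z1 = d_a1 * a1 * (1.0 - a1)   # derivative of the sigmoid
    d_w1, d_b1 = d_z1 * x, d_z1     # hidden layer parameters
    return [d_w1, d_b1, d_w2, d_b2]


def numerical_gradient(params, x, y, eps=1e-6):
    """Central differences, used here only as a sanity check."""
    grads = []
    for k in range(len(params)):
        up, down = list(params), list(params)
        up[k] += eps
        down[k] -= eps
        grads.append((forward(up, x, y)[0] - forward(down, x, y)[0]) / (2 * eps))
    return grads


params = [0.5, -0.2, 1.5, 0.1]  # made-up values for w1, b1, w2, b2
analytic = backward(params, x=2.0, y=1.0)
numeric = numerical_gradient(params, x=2.0, y=1.0)
```

Both gradients agree up to numerical precision. Frameworks automate exactly this bookkeeping over much larger computation graphs.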

If you’re still unsure about backpropagation, feel free to check out Andrew Ng’s video on the topic to get an intuition.

Networks encode information

In deeper neural networks, training iteratively compresses and encodes a state of the world into the model. This has been shown for image classification networks, where the first layer starts distinguishing edges while subsequent layers identify concepts like faces or letters. This separation of concepts into layers is, luckily, not something you as an engineer have to come up with, it just so happens to be the most effective way to reduce loss and is thus picked up by the model at training time.

Vectorization and matrix multiplication

While we’ve seen a single neuron in action, we can compute an entire layer at once using matrix multiplication. Each neuron accepts the input vector $\vec{x}$ and calculates its output $a$ using its weights $\vec{w}$ and bias $b$ . We can bundle the weights and biases of all neurons in a layer into a matrix $W$ (and $B$ respectively) and combine all the inputs into a matrix $X$ as well. Applying the activation function $g$ of the given layer element-wise then yields the matrix of activations $A$ :

A=g(W X+B)

Matrix multiplication is significantly more efficient than running operations on each neuron independently and allows scaling up training and inference for large models.
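
The sketch below spells out the shapes involved in computing $g(WX+B)$ for one dense layer, using nested lists and a hand-written `matmul` so that it runs without extra packages. This only illustrates the structure: the speed-up mentioned above comes from optimized matrix libraries and hardware, not from this pure Python loop. All values and names are made up for illustration.

```python
def matmul(A, B):
    """Matrix product of A (r x k) and B (k x c) as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def relu(z):
    return max(0.0, z)


def dense_layer(W, X, b, activation):
    """W: units x features, X: features x examples, b: one bias per unit."""
    Z = matmul(W, X)
    return [[activation(z + b_u) for z in row] for row, b_u in zip(Z, b)]


# Three neurons with two inputs each; two examples stored as columns of X.
W = [[1.0, -1.0],
     [0.5, 0.5],
     [-2.0, 0.0]]
b = [0.0, 1.0, 0.5]
X = [[1.0, 2.0],
     [3.0, 0.0]]

A = dense_layer(W, X, b, relu)  # one row per neuron, one column per example
```

Each row of the result holds one neuron’s activations for all examples, so a whole layer and a whole batch are handled by a single matrix product.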

SGD, Adam, and other optimizers

While we’ve learned about the gradient descent optimizer, there are different methods to decrease the cost or loss of your model during training.

Stochastic gradient descent (SGD) is an iterative method approximating gradient descent by estimating the gradient from a randomly selected subset of the training data instead of the entire data set. This makes each update cheaper to compute for large data sets and models, however at the cost of a lower convergence rate. While SGD by default still uses a fixed learning rate or step size, it has been extended with concepts like momentum, which computes the weight update based on a linear combination of the previous update and the current gradient.

The adaptive gradient algorithm (AdaGrad) is a modified SGD with per-parameter learning rates, often improving convergence performance over standard stochastic gradient descent in settings where data is sparse and sparse parameters are more informative.

Root mean square propagation (RMSProp) is a method in which the learning rate is adapted for each of the parameters, dividing the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight.

One of the most widely used optimizers is the Adaptive Moment Estimation optimizer (Adam) and its extension AdamW. Adam combines the RMSProp optimizer with the Momentum method, using running averages with exponential forgetting of both the gradients and the second moments of the gradients.
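
To compare the update rules, here is a sketch of momentum and Adam applied to a toy cost $J(\theta)=(\theta-3)^2$ with a single parameter. The toy cost, the hyperparameter values, and the function names are assumptions for illustration, and the exact gradient is used instead of a stochastic estimate to keep the example deterministic.

```python
import math


def grad(theta):
    """Gradient of the toy cost J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def sgd_momentum(theta, alpha=0.1, beta=0.9, steps=500):
    """Update = linear combination of the previous update and the gradient."""
    update = 0.0
    for _ in range(steps):
        update = beta * update - alpha * grad(theta)
        theta += update
    return theta


def adam(theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Running averages of the gradient (m) and the squared gradient (v)."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta


theta_momentum = sgd_momentum(0.0)
theta_adam_first_step = adam(0.0, steps=1)
theta_adam = adam(0.0)
```

Both optimizers end up close to the minimum at $\theta=3$ . Adam’s first step has a size of roughly the learning rate regardless of how large the gradient is, which illustrates the per-parameter scaling by the running average of squared gradients.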

Now that we’ve explored the underlying concepts of neural networks, let’s talk about the requirements for proper training and model evaluation.

Splitting training data

Ideally, your model fits well to the training data and generalizes well to inputs it has not seen before. To evaluate this performance, it is insufficient to use all the available data for training and then hope it generalizes well. This forces you to split your data into training data and test data. Training data are used to tune the model parameters $W$ and $B$ in training or fitting, whereas test data are used to test the model after tuning to gauge performance on new data.

This way, you can train your model on the training data set until you reach an acceptable loss, and then validate that it performs similarly well on the test data. If your model works well on the training data but not at all on the test data, you have an overfitted model or a case of high variance. If your training loss itself is too high, your model may be too simple or underfit, which is a case of high bias. When your model is underfit, there is little point in adding more training data, but you can try getting additional features or decreasing the regularization parameter. When your model has high variance, you should get more training examples, try increasing the regularization parameter $\lambda$ , or try smaller sets of features.

There is a tradeoff between a very simple model with high bias and a very complex model with high variance, and you want to find the sweet spot which works best for training and test data.

Usually, however, you don’t just build a single model, and you might want to experiment with different hyperparameters (number of layers or units (neurons) in each layer, regularization, the degree of polynomials, etc.). In this case, you should split your training data into training and cross-validation data. The cross-validation data set is used to compare the resulting cost for different model variations without touching your test data. Otherwise, you would end up overfitting your model choices to the test data, which is the same reason we split our data set into training and test data before. This way, you can fine-tune your model on the validation set and, in a final step, evaluate how your model performs on unseen data.

As a rule of thumb, training data usually takes up 60%, cross-validation data 20%, and test data the remaining 20% of the original data set. Always make sure you split up the data set into training, validation, and test data sets as early as possible, and never, ever train your model on test data.
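
A 60/20/20 split can be sketched with the standard library alone. Shuffling before splitting and using a fixed seed for reproducibility are assumptions on my part, as are the function and parameter names.

```python
import random


def split_dataset(examples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle once, then cut into training, cross-validation, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split every run
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains
    return train, val, test


examples = list(range(100))  # stand-in for 100 labeled examples
train, val, test = split_dataset(examples)
```

Every example lands in exactly one of the three sets, so nothing from the test set can leak into training. For data with a time component or grouped records, a purely random split may not be appropriate.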

This was a very brief introduction to the topics of regression and deep learning to get a rough intuition and cover the most important concepts such as gradient descent, backpropagation, dense layers, activation functions, and vectorization.

Feel free to let me know about any feedback, questions, or suggestions by sending a mail!