Error backpropagation method - let's start writing backpropagation

In deep learning, backpropagation is probably the most difficult part to understand when reading source code.

In the source code, matrices get transposed, derivatives appear, and inputs, outputs, activated outputs, weights, and biases are all mixed together. On top of that, the expression for calculating the slope differs between the final output and the intermediate outputs.

What does it even mean to differentiate a conversion function from m inputs to n outputs? It is nothing like high school math. The Jacobian matrix? In the end, is deep learning impossible to understand without advanced mathematics?

At first glance at the source code, you may wonder, "What on earth is this doing?"

I felt the same way. However, as I worked through the source code, I realized that it is quite different from what is written on the Web.

Deep learning can be understood with "high school mathematics", "how to find the slope of a multi-stage function", and "an understanding of repeated for statements". Matrix calculations, it turns out, are there for performance and for mathematical abstraction in the software.

Each element can be read concretely; the same operation is simply repeated, and that repetition can be computed easily and quickly with a matrix.

These days, even people working in non-technical fields feel the need to deepen their understanding of AI and deep learning. There is a demand to understand what AI really is a little more correctly, so that we do not make hasty decisions based on misunderstandings. At the same time, I feel that the advanced mathematics has become a barrier that makes it hard to understand.

Introduction to Deep Learning with Perl is a site designed so that you can understand deep learning with if statements, for statements, and high school mathematics (an understanding of slopes), plus a little extra.

Definition of differentiation

Let's understand just the small amount of simple differentiation, within the range of high school mathematics, that is necessary for understanding error backpropagation.

The definition of derivative in mathematics is as follows.

# Definition of the derivative --- in the following equation, make Δx as close to 0 as possible
f'(x) = (f(x + Δx) - f(x)) / ((x + Δx) - x)

"Ratio of change in output to small change in input" is the definition of derivative. Intuitively, it shows how sensitive the output changes to a change in an input.

Differentiation of constant function

The derivative of a constant function is "0".

# Constant function
f(x) = 5;

# Differentiation of constant function
f'(x) = 0;

The "=" above represents the equivalence of mathematical symbols. Please note that it is not an assignment in the program.

Let's calculate it according to the definition, written as a Perl program so that it is easy to follow. A mathematical definition alone can make a software engineer's eyes glaze over.

Since it is a constant function, the input value is arbitrary; it is only there to match the definition. Think of "$x_delta" as the "Δx made as close to 0 as possible" in the mathematical definition.

use strict;
use warnings;

my $x = 3;
my $x_delta = 0.0000001;

# Differentiation of constant function
my $const_derivative = (const($x + $x_delta) - const($x)) / (($x + $x_delta) - $x);

# Constant function (the input is ignored; the output is always 5)
sub const {
  return 5;
}

print $const_derivative;

Make sure the result is "0" no matter what the value of "$x" is.

That is, the derivative of the constant function is expressed as follows.

# Differentiation of constant function
sub const_derivative {
  return 0;
}

Relationship between differentiating a function, the derivative, and the slope

A derivative means exactly the same thing as "the differentiation of a function". On this site, "the differentiation of a function" is called the derivative. The derivative is explained as a function for finding the ratio of the change in output to a small change in a given input, that is, the slope (also called the differential coefficient).

The difference between the derivative (a function) and the slope (also called the differential coefficient or gradient) is that the derivative, being a function, abstracts over the input in general, whereas the slope is the concrete value obtained for a specific input.

As a software engineer, understand that the derivative is the function for finding the slope, and the slope is the value actually found.
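
To make the distinction concrete, here is a small sketch (the function f(x) = x * x and its derivative 2x are just an example added here for illustration): the derivative is a sub, and the slope is the value you get by calling it for a specific input.

use strict;
use warnings;

# The derivative of f(x) = x * x is itself a function: it returns 2x
sub f_derivative {
  my ($x) = @_;
  
  return 2 * $x;
}

# The slope is the concrete value obtained for a specific input
my $slope = f_derivative(3);

print "$slope\n";    # 6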

Differentiation of linear function

The derivative of a linear function is the coefficient of x. A linear function is a function in which the degree of x is 1, such as "3x + 2".

# Linear function
f(x) = 3x + 2;

# Differentiation of linear function (same value as the coefficient of x)
f'(x) = 3;

The "=" above represents the equivalence of mathematical symbols. Please note that it is not an assignment in the program.

Let's calculate it according to the definition, written as a Perl program so that it is easy to follow. A mathematical definition alone can make a software engineer's eyes glaze over.

Since it is a linear function, the input value can be anything convenient. Think of "$x_delta" as the "Δx made as close to 0 as possible" in the mathematical definition.

use strict;
use warnings;

my $x = 3;
my $x_delta = 0.0000001;

# Differentiation of linear function
my $linear_derivative = (linear($x + $x_delta) - linear($x)) / (($x + $x_delta) - $x);

# Linear function
sub linear {
  my ($x) = @_;
  
  my $y = 3 * $x + 2;
  
  return $y;
}

print $linear_derivative;

Make sure the result is "3" no matter what the value of "$x" is. In this case, there was an error and it was "3.00000000444089". When "$x_delta" approaches 0 to the limit, it becomes 3.

In other words, the derivative of the linear function "3x + 2" is expressed as follows.

# Differentiation of linear function
sub linear_derivative {
  return 3;
}
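
The same "ratio of change" calculation can be written once as a reusable sub that takes any function as a code reference. This is just a sketch added here for illustration; the name numeric_derivative does not appear elsewhere on this site.

use strict;
use warnings;

# Numerical differentiation by the definition (a sketch)
# $func is a code reference, $x is the point where we want the slope
sub numeric_derivative {
  my ($func, $x) = @_;
  
  my $x_delta = 0.0000001;
  
  return ($func->($x + $x_delta) - $func->($x)) / (($x + $x_delta) - $x);
}

# Slope of the linear function 3x + 2 at x = 3 (approximately 3)
print numeric_derivative(sub { 3 * $_[0] + 2 }, 3), "\n";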

Differentiation of activation function

In deep learning, there is a function called the activation function, which is said to express how strongly a neuron fires. In practice, though, it may be better to think of it as something introduced so that, in error backpropagation, the slope of the loss function with respect to each parameter can be found, rather than as an expression of neuron activity. Maybe.

We will not derive the differentiation of the activation function, that is, its derivative, here. What you need to understand as a software engineer is that the derivative is a function for finding the slope.

When a derivative appears in the source code, think, "Ah, the slope is being computed here."

#Activation function
sub activate {
  my ($input) = @_;
  
  # ...
  
  return $output;
}

# Differentiation of activation function
sub activate_derivative {
  my ($input) = @_;
  
  # ...
  
  return $output;
}
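
The bodies above are left as "# ...". As one concrete example (an assumption for illustration, not the only possible choice), the sigmoid function and its derivative could be written like this.

# Sigmoid activation function: squashes the input into the range 0 to 1
sub sigmoid {
  my ($x) = @_;
  
  return 1 / (1 + exp(-$x));
}

# Derivative of the sigmoid function: sigmoid(x) * (1 - sigmoid(x))
sub sigmoid_derivative {
  my ($x) = @_;
  
  my $s = sigmoid($x);
  
  return $s * (1 - $s);
}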

Differentiation of loss function

In deep learning, the parameters are adjusted with the goal of reducing the value of the loss function, which is an indicator of error. Consider the derivative of the loss function.

There are multiple inputs to the loss function. The output of the loss function is one.

The loss function differs from the functions we've seen so far in that it returns a single output for multiple inputs.

# Loss function
sub cost {
  my ($inputs) = @_;
  
  my $output;
  
  # ...
  
  return $output;
}
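
As one concrete example (a sketch for illustration; the quadratic cost and the argument names $vec_a for the activated outputs and $vec_y for the expected outputs are assumptions here), a loss function really does take many inputs and return a single number.

# Quadratic cost: many inputs, one output (a sketch)
sub quadratic_cost {
  my ($vec_a, $vec_y) = @_;
  
  my $output = 0;
  for (my $i = 0; $i < @$vec_a; $i++) {
    my $diff = $vec_a->[$i] - $vec_y->[$i];
    $output += 0.5 * $diff * $diff;
  }
  
  return $output;
}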

Here you may think, "Hmm?" We never learned how to differentiate a function that produces one output from multiple inputs. What is that in the first place?

To state the conclusion first, the derivative of the loss function produces multiple outputs for multiple inputs.

#Differentiation of loss function
sub cost_derivative {
  my ($inputs) = @_;
  
  my $outputs = [];
  
  # ...
  
  return $outputs;
}

However, it only looks that way; in fact it merely bundles together one output per input. Let's rewrite the derivative of the loss function using a function called cost_derivative_each that produces one output for one input.

sub cost_derivative {
  my ($inputs) = @_;
  
  my $outputs = [];
  for (my $i = 0; $i < @$inputs; $i++) {
    $outputs->[$i] = cost_derivative_each($inputs->[$i]);
  }
  
  return $outputs;
}

# Differentiation of the loss function for one input
sub cost_derivative_each {
  my ($input) = @_;
  
  my $output;
  
  # ...
  
  return $output;
}

At the risk of being misunderstood: the inputs other than the one you are paying attention to are treated like constants, and they have no effect on that element of the derivative of the loss function.

Take a look at the cross-entropy function, one of the loss functions. The formula looks complicated, but it is enough to get a feel for the fact that the result for each input is independent of the others.

# Cross-entropy cost derivative
sub cross_entropy_cost_derivative {
  my ($vec_a, $vec_y) = @_;
  
  my $vec_out = [];
  for (my $i = 0; $i < @$vec_a; $i++) {
    $vec_out->[$i] = $vec_a->[$i] - $vec_y->[$i];
  }
  
  return $vec_out;
}
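
For example, calling it with activated outputs and expected outputs (made-up values for illustration) gives one slope per output, and each element depends only on the pair at the same index.

my $vec_a = [0.8, 0.1, 0.3];    # activated outputs
my $vec_y = [1, 0, 0];          # expected outputs

my $grads = cross_entropy_cost_derivative($vec_a, $vec_y);
# $grads is approximately [-0.2, 0.1, 0.3]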

What is partial differentiation?

Partial differentiation is just differentiation (laughs). It is differentiation that treats every variable other than the input variable of interest as a constant.
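
A small sketch (the function and the values are made up for illustration): for f(x0, x1) = 3 * x0 + 5 * x1, the partial derivative with respect to x0 treats x1 as a constant, so its value is 3 no matter what x1 is.

use strict;
use warnings;

sub f {
  my ($x0, $x1) = @_;
  
  return 3 * $x0 + 5 * $x1;
}

my $x0 = 2;
my $x1 = 7;
my $x_delta = 0.0000001;

# Partial derivative with respect to $x0: only $x0 moves, $x1 stays fixed
my $partial_x0 = (f($x0 + $x_delta, $x1) - f($x0, $x1)) / $x_delta;

print "$partial_x0\n";    # approximately 3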

Find the slope of the loss function with respect to the last bias

Let's find the slope of the loss function with respect to the final bias.

You need to understand how to find the slope of a multi-stage function.

The slope of a multi-stage function can be calculated by the following formula.

Slope of multi-stage function = Slope of function 1 * Slope of function 2 * Slope of function 3 * ... * Slope of function n-1 * Slope of function n

The slope of a multi-stage function is the product of the slopes of each function. In other words, if you know the slope of each function, you can find the overall slope by multiplying them together.

In this equation, think of the loss function as the function n.
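
You can check this numerically with the definition of the derivative. The two functions below are made up for illustration: g(x) = 3x + 2 has slope 3, h(u) = 5u has slope 5, so the two-stage function h(g(x)) should have slope 3 * 5 = 15.

use strict;
use warnings;

sub g { my ($x) = @_; return 3 * $x + 2; }
sub h { my ($u) = @_; return 5 * $u; }

my $x = 2;
my $x_delta = 0.0000001;

# Slope of the two-stage function h(g(x)), computed by the definition
my $slope = (h(g($x + $x_delta)) - h(g($x))) / $x_delta;

print "$slope\n";    # approximately 15 = 3 * 5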

Let's focus on one bias of the last m-to-n conversion function. Seen from that one bias, the expression containing the bias passes through the last m-to-n conversion function, the activation function, and the loss function.

# Find the output from the expression containing the bias (last m-to-n conversion function)
my $output0 = $weight0_0 * $input0 + $weight0_1 * $input1 + $bias0;

# Apply the activation function
my $activated_output0 = activate($output0);

# Apply the loss function (for convenience the arguments are written as the $activated_output0 of interest plus the other arguments)
my $cost = cost($activated_output0, $other_args);

Focusing on one bias, the m-to-n conversion function is just a linear expression. The answer to "what is the derivative of a function that converts m inputs to n outputs?" turns out to be "focus on one bias, and it becomes a mere linear expression."

Everything other than the bias of interest can be regarded as a constant. Written in an easy-to-read way, it looks like this.

# Find the output from the expression containing the bias (last m-to-n conversion function)
my $output0 = $bias0 + C;

What happens if we differentiate this? C is a constant, so its derivative is 0. Since the coefficient of $bias0 is "1", the derivative is "1". That is, the slope is "1".

my $m_to_n_grad_for_bias0 = 1;

Next, what about the slope of the activation function? We use the derivative of the activation function.

# Slope of the activation function
my $activate_grad_for_output0 = activate_derivative($output0);

Next, what about the slope of the loss function? We use the derivative of the loss function.

# Slope of the loss function
my $cost_grad_for_activated_output0 = cost_derivative($activated_output0, $other_args);

The slope of the loss function with respect to the bias is obtained by multiplying these slopes together.

# Slope of the loss function with respect to one bias
my $cost_grad_for_bias0 = $m_to_n_grad_for_bias0 * $activate_grad_for_output0 * $cost_grad_for_activated_output0;

There we have it. The same method works for the other biases. If we find them all, we get the following array.

[
  $cost_grad_for_bias0,
  $cost_grad_for_bias1,
  $cost_grad_for_bias2,
]
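
Putting the pieces together as a loop (a sketch: $outputs, $activated_outputs and $expected_outputs are assumed to be the arrays produced by the forward pass for the last layer; they are not defined on this page).

# Slope of the loss function for every output element
my $cost_grads = cross_entropy_cost_derivative($activated_outputs, $expected_outputs);

# Slope of the loss function with respect to every bias of the last layer
my $cost_grads_biases = [];
for (my $i = 0; $i < @$outputs; $i++) {
  # slope w.r.t. the bias (1) * slope of the activation function * slope of the loss function
  $cost_grads_biases->[$i] = 1 * activate_derivative($outputs->[$i]) * $cost_grads->[$i];
}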

Find the slope of the loss function with respect to the last weight

Let's find the slope of the loss function with respect to the last weights. After the bias, we take on the weights.

Let's focus on one weight of the last m-to-n conversion function. Seen from that one weight, the expression containing the weight passes through the last m-to-n conversion function, the activation function, and the loss function.

# Find the output from the expression containing the weight (last m-to-n conversion function)
my $output0 = $weight0_0 * $input0 + $weight0_1 * $input1 + $bias0;

# Apply the activation function
my $activated_output0 = activate($output0);

# Apply the loss function (for convenience the arguments are written as the $activated_output0 of interest plus the other arguments)
my $cost = cost($activated_output0, $other_args);

Focusing on one weight, the m-to-n conversion function is again just a linear expression.

Everything other than the weight of interest can be regarded as a constant. Written in an easy-to-read way, it looks like this. Notice that the input is also written as "INPUT0" and is regarded as a constant. (In the actual code it remains an ordinary variable.)

# Find the output from the expression containing the weight (last m-to-n conversion function)
my $output0 = $weight0_0 * INPUT0 + C;

What happens if we differentiate this? C is a constant, so its derivative is 0. Since the coefficient of $weight0_0 is "INPUT0", the derivative is "INPUT0". That is, the slope is "INPUT0".

my $m_to_n_grad_for_weight = INPUT0;

Next, what about the slope of the activation function? We use the derivative of the activation function.

# Slope of the activation function
my $activate_grad_for_output0 = activate_derivative($output0);

Next, what about the slope of the loss function? We use the derivative of the loss function.

# Slope of the loss function
my $cost_grad_for_activated_output0 = cost_derivative($activated_output0, $other_args);

The slope of the loss function with respect to the weight is obtained as the product of these slopes.

# Slope of the loss function with respect to one weight
my $cost_grad_for_weight = $m_to_n_grad_for_weight * $activate_grad_for_output0 * $cost_grad_for_activated_output0;

There we have it. The same method works for the other weights.

Let's look at the relationship between the slope of the loss function with respect to a weight and the slope with respect to the corresponding bias. The coefficient for the weight is "INPUT0" and the coefficient for the bias is "1", so the slope for the weight is "INPUT0" times the slope for the bias.

This means that once you have calculated the slope of the loss function with respect to the bias, you can multiply it by "INPUT0" to get the slope with respect to the weight.

my $cost_grad_for_weight = $cost_grad_for_bias0 * INPUT0;

Assuming the slopes of the loss function with respect to all the weights have been found, the array looks like this. It is expressed as a 3-row, 2-column matrix stored in column-major order.

[
  $cost_grad_for_weight0_0,
  $cost_grad_for_weight1_0,
  $cost_grad_for_weight2_0,

  $cost_grad_for_weight0_1,
  $cost_grad_for_weight1_1,
  $cost_grad_for_weight2_1,
]

Using the relationship with the slope with respect to the biases, the slopes of the loss function with respect to the weights can be written as follows.

[
  $cost_grad_for_bias0 * INPUT0,
  $cost_grad_for_bias1 * INPUT0,
  $cost_grad_for_bias2 * INPUT0,

  $cost_grad_for_bias0 * INPUT1,
  $cost_grad_for_bias1 * INPUT1,
  $cost_grad_for_bias2 * INPUT1,
]

Matrix representation of the slope of the loss function with respect to the weights

Let's express the slope of the loss function with respect to the weights as a matrix. The slope for each weight can be obtained one by one, and the matrix is simply a way of representing all of them at once.

It is the product of the bias-gradient matrix and the transpose of the input matrix.

my $cost_grads_biases = [
  $cost_grad_for_bias0,
  $cost_grad_for_bias1,
  $cost_grad_for_bias2,
];

my $inputs = [
  $input0,
  $input1
];

# Bias-gradient matrix * transpose of the input matrix
my $cost_grads_weights = mul_mat(to_mat($cost_grads_biases), transpose(to_mat($inputs)));

The result is as follows. Confirm that what we worked out element by element matches the result of the matrix calculation.

[
  $cost_grad_for_bias0 * $input0,
  $cost_grad_for_bias1 * $input0,
  $cost_grad_for_bias2 * $input0,

  $cost_grad_for_bias0 * $input1,
  $cost_grad_for_bias1 * $input1,
  $cost_grad_for_bias2 * $input1,
]
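
One way to check it: build the same column-major array with two plain loops (a sketch using the $cost_grads_biases and $inputs arrays defined above) and compare it with the result of the matrix calculation.

# Same result written as plain loops (column-major: the column index is the outer loop)
my $cost_grads_weights_check = [];
for (my $col = 0; $col < @$inputs; $col++) {
  for (my $row = 0; $row < @$cost_grads_biases; $row++) {
    $cost_grads_weights_check->[$col * @$cost_grads_biases + $row]
      = $cost_grads_biases->[$row] * $inputs->[$col];
  }
}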

Find the slope of the loss function with respect to the penultimate bias

Let's find the slope of the loss function with respect to one bias of the penultimate m-to-n conversion function.

Seen from that one bias, the expression containing the bias passes through the penultimate m-to-n conversion function, the activation function, and then the last m-to-n conversion function.

# Find the output from the expression containing the bias (penultimate m-to-n conversion function)
my $output0 = $weight0_0 * $input0 + $weight0_1 * $input1 + $bias0;

# Apply the activation function
my $activated_output0 = activate($output0);

# The activated output becomes the next input
my $forward_input0 = $activated_output0;

# $forward_input0 becomes an input of the next m-to-n conversion function
my $forward_output0 = $forward_weight0_0 * $forward_input0 + $forward_weight0_1 * $forward_input1 + $forward_bias0;
my $forward_output1 = $forward_weight1_0 * $forward_input0 + $forward_weight1_1 * $forward_input1 + $forward_bias1;
my $forward_output2 = $forward_weight2_0 * $forward_input0 + $forward_weight2_1 * $forward_input1 + $forward_bias2;

First of all, everything except the bias of interest can be regarded as a constant.

# Find the output from the expression containing the bias (penultimate m-to-n conversion function)
my $output0 = $bias0 + C;

What happens if we differentiate this? C is a constant, so its derivative is 0. Since the coefficient of $bias0 is "1", the derivative is "1". That is, the slope is "1".

my $m_to_n_grad_for_bias0 = 1;

Next, what about the slope of the activation function? We use the derivative of the activation function.

# Slope of the activation function
my $activate_grad_for_output0 = activate_derivative($output0);

Next, look carefully at the fact that the activated output becomes an input of the next m-to-n conversion function.

# The activated output becomes the next input
my $forward_input0 = $activated_output0;

# $forward_input0 becomes an input of the next m-to-n conversion function
my $forward_output0 = $forward_weight0_0 * $forward_input0 + $forward_weight0_1 * $forward_input1 + $forward_bias0;
my $forward_output1 = $forward_weight1_0 * $forward_input0 + $forward_weight1_1 * $forward_input1 + $forward_bias1;
my $forward_output2 = $forward_weight2_0 * $forward_input0 + $forward_weight2_1 * $forward_input1 + $forward_bias2;

Because the activated output is used as an input in multiple expressions, the slope is the sum over those expressions. When one value feeds into several expressions, the slope is the sum of the slopes calculated through each of them.

Slope of the loss function when the output feeds into multiple expressions = slope of the loss function through expression 0 with respect to input 0 + slope of the loss function through expression 1 with respect to input 0 + slope of the loss function through expression 2 with respect to input 0

Now, for expression 0, let's regard everything other than the "$forward_input0" we are paying attention to as a constant. The coefficient of $forward_input0 is "FORWARD_WEIGHT0_0".

my $forward_output0 = FORWARD_WEIGHT0_0 * $forward_input0 + C;

Now, haven't we seen this shape somewhere before? It is similar to the following.

# Find the output from the expression containing the weight (last m-to-n conversion function)
my $output0 = $weight0_0 * INPUT0 + C;

The slope of the loss function with respect to a last-layer weight was the slope of the loss function with respect to the corresponding last-layer bias multiplied by INPUT0.

$cost_grad_for_bias0 * INPUT0,

Since the form of the expression is the same and only the constant part differs, "INPUT0" should be replaced by "FORWARD_WEIGHT0_0". Seen from the current position, the bias involved is the one a layer ahead, so "bias" also changes to "forward_bias".

$cost_grad_for_forward_bias0 * FORWARD_WEIGHT0_0,

The important point is that "$cost_grad_for_forward_bias0" has already been computed when we found the slope of the loss function with respect to the final bias.

Now that expression 0 is done, let's do expression 1 and expression 2.

$cost_grad_for_forward_bias0 * FORWARD_WEIGHT0_0,
$cost_grad_for_forward_bias1 * FORWARD_WEIGHT1_0,
$cost_grad_for_forward_bias2 * FORWARD_WEIGHT2_0,

Multiplying in the slope of the activation function as well, the sum of the slopes through expression 0, expression 1, and expression 2 is calculated as follows.

my $cost_grad_for_bias0
  = $cost_grad_for_forward_bias0 * FORWARD_WEIGHT0_0 * $activate_grad_for_output0 * $m_to_n_grad_for_bias0
  + $cost_grad_for_forward_bias1 * FORWARD_WEIGHT1_0 * $activate_grad_for_output0 * $m_to_n_grad_for_bias0
  + $cost_grad_for_forward_bias2 * FORWARD_WEIGHT2_0 * $activate_grad_for_output0 * $m_to_n_grad_for_bias0;

Let's find the slope of the loss function with respect to the other biases in the same way (the pattern below assumes the penultimate layer has a third output, so the last layer also has a corresponding third input and third column of weights).

my $cost_grad_for_bias1
  = $cost_grad_for_forward_bias0 * FORWARD_WEIGHT0_1 * $activate_grad_for_output1 * $m_to_n_grad_for_bias1
  + $cost_grad_for_forward_bias1 * FORWARD_WEIGHT1_1 * $activate_grad_for_output1 * $m_to_n_grad_for_bias1
  + $cost_grad_for_forward_bias2 * FORWARD_WEIGHT2_1 * $activate_grad_for_output1 * $m_to_n_grad_for_bias1;

my $cost_grad_for_bias2
  = $cost_grad_for_forward_bias0 * FORWARD_WEIGHT0_2 * $activate_grad_for_output2 * $m_to_n_grad_for_bias2
  + $cost_grad_for_forward_bias1 * FORWARD_WEIGHT1_2 * $activate_grad_for_output2 * $m_to_n_grad_for_bias2
  + $cost_grad_for_forward_bias2 * FORWARD_WEIGHT2_2 * $activate_grad_for_output2 * $m_to_n_grad_for_bias2;
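
Written as loops, the same calculation looks like this (a sketch: $outputs is assumed to be the penultimate layer's outputs, $forward_outputs_length the number of outputs of the last layer, $forward_weights the last layer's weights as a column-major array, and $cost_grads_forward_biases the slopes already calculated for the last layer's biases; none of these names are defined on this page).

# Slope of the loss function with respect to every bias of the penultimate layer
my $cost_grads_biases = [];
for (my $i = 0; $i < @$outputs; $i++) {
  # Sum over every forward expression that uses this activated output as an input
  my $sum = 0;
  for (my $k = 0; $k < $forward_outputs_length; $k++) {
    # Column-major: weight connecting forward input $i to forward output $k
    my $forward_weight = $forward_weights->[$i * $forward_outputs_length + $k];
    $sum += $cost_grads_forward_biases->[$k] * $forward_weight;
  }
  
  # * slope of the activation function * slope with respect to the bias (1)
  $cost_grads_biases->[$i] = $sum * activate_derivative($outputs->[$i]) * 1;
}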

Related information