
Loss Functions and Activation Functions

Gradient Descent

  • An example of fitting with the $y$-axis representing Height and the $x$-axis representing Weight; how to find $kx + b$ during gradient descent?

    • Suppose we already know the value of $k$ and are looking for the optimal intercept:

      • First, select an initial value for the intercept, calculate the residuals, and use multiple residuals to compute the sum of squared residuals (loss function). Plot this against the intercept.
      • By changing the intercept and re-plotting, we trace out a curve similar to the SSR curves that appear later in the backpropagation section.
    • Gradient Descent: when repeatedly calculating and plotting the loss for different intercepts to find the lowest point, we take large steps when far from the optimal solution (fewer computations wasted) and small steps when close to it (more careful computations). This keeps the search accurate near the optimal solution without wasting too many calculations far from it.

      ![[Pasted image 20240906064557.png]]

    • Instead of manually adjusting the intercept each time, we can write the squared residual of each point as a function of the intercept alone. Each residual is $(observed - predicted)^2$, i.e. $(observed - (intercept + slope \times x))^2$; since the observed value, the slope, and the $x$-coordinate are all fixed, every point reduces to the same form: $(constant - (intercept + constant))^2$.

    • By differentiating this curve, we can find the slope at any intercept. The derivative is the sum of the derivatives of each individual $(observed - predicted)^2$ term.

      • For example, $\frac{d}{dintercept}(1.3 - (intercept + 0.64 \times 5))^2$ simplifies to $\frac{d}{dintercept}(1.3 - (intercept + 3.2))^2$. Using the chain rule to differentiate:

        $$\begin{aligned} \frac{d}{dintercept}(1.3 - (intercept + 3.2))^2 &= 2(1.3 - (intercept + 3.2)) \times \frac{d}{dintercept}(1.3 - (intercept + 3.2)) \\ &= 2(1.3 - (intercept + 3.2)) \times (0 - (1 + 0)) \\ &= -2(1.3 - (intercept + 3.2)) \\ &= -2(-1.9 - intercept) \\ &= 3.8 + 2 \times intercept \end{aligned}$$
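
A quick numerical sanity check of this worked derivative (not part of the original walkthrough): a central finite difference should agree with the derived result $3.8 + 2 \times intercept$.

```python
# Numerical check of the worked derivative above: verifies that
# d/d(intercept) of (1.3 - (intercept + 3.2))^2 equals 3.8 + 2 * intercept.

def loss(intercept):
    # Squared residual for the single point in the example.
    return (1.3 - (intercept + 3.2)) ** 2

def analytic_slope(intercept):
    # The result derived above via the chain rule.
    return 3.8 + 2 * intercept

h = 1e-6
for intercept in [-3.0, 0.0, 2.5]:
    numeric = (loss(intercept + h) - loss(intercept - h)) / (2 * h)
    print(f"intercept={intercept:+.1f}  numeric={numeric:.4f}  analytic={analytic_slope(intercept):.4f}")
```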
      • In the previous example, finding the point where the slope equals 0 was straightforward algebra. In gradient descent, however, we approximate the optimal value step by step, which is especially useful when the slope can never reach exactly 0.

        • By changing the intercept and calculating the slope at each point using the derivative above, we can see that the slope approaches 0 as we near the optimal solution. Therefore, the step size should shrink as we approach the minimum.
        • Gradient descent updates the intercept by subtracting the step size (the slope multiplied by a learning rate, a small number) from the previous intercept. Approaching from the left, the slope is negative, so subtracting the step size moves the intercept to the right, and the step size shrinks as the slope approaches 0.
        • Gradient descent stops when the step size is very close to 0 (e.g., below 0.001) or after a certain number of iterations (e.g., 1000). A code sketch of this loop follows below.
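
A minimal sketch of this one-parameter loop, assuming the fixed slope 0.64 from the worked example; the three (weight, height) points are illustrative stand-ins, not the exact data from the figures.

```python
# 1-D gradient descent as described above: the slope is fixed at 0.64
# and only the intercept is optimized.

data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]  # illustrative (x, observed) points
slope = 0.64          # assumed known, as in the example
intercept = 0.0       # initial guess
learning_rate = 0.1

for step in range(1000):                      # cap on iterations
    # d(SSR)/d(intercept): sum of -2 * (observed - predicted) over the points
    grad = sum(-2 * (obs - (intercept + slope * x)) for x, obs in data)
    step_size = learning_rate * grad
    if abs(step_size) < 0.001:                # stop when steps get tiny
        break
    intercept -= step_size                    # new value = old value - step size

print(f"optimized intercept ≈ {intercept:.3f} after {step + 1} steps")
```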
    • How does gradient descent optimize both the slope and the intercept simultaneously?

      • Similar to before, we use the sum of squared residuals as the loss function, but now we also include the slope in our formula, such as $(1.4 - (intercept + slope \times 0.5))^2$. We can then plot this function as a 3D surface and differentiate it with respect to both the intercept and the slope.

        • When differentiating with respect to one variable (using the chain rule), treat the other variable (intercept/slope) as a constant.
        • This gives us the gradient: the pair of partial derivatives with respect to the intercept and the slope. Gradient descent uses the gradient to find the minimum of the loss function (sum of squared residuals), which is why it's called "gradient descent."
        • We start by selecting random values for the intercept and slope (e.g., 0 and 1):
        • Substitute these initial values into the gradient (the derivative of the sum of squared residuals with respect to both intercept and slope) to get two slopes.
        • Use the slopes to calculate step sizes (using a small learning rate). The optimization algorithm can adjust the learning rate automatically.
        • Calculate new intercept and slope values based on the step sizes.
        • Repeat until the step sizes are very small (e.g., 0.001) or after a certain number of iterations (e.g., 1000).
      • In summary:

        1. Differentiate the loss function (i.e., compute the gradient).
        2. Choose random initial values for the parameters.
        3. Calculate the step sizes using the learning rate and the slope.
        4. Update the parameters using the step sizes.
        5. Stop when the step size is very small or the maximum number of iterations is reached, giving you the optimal point! (A code sketch of this recipe follows.)
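
A sketch of the summarized recipe with two parameters, reusing the same illustrative data points as before; the partial derivatives follow the chain rule, treating the other parameter as a constant.

```python
# Gradient descent on both the intercept and the slope at once.

data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]  # illustrative (x, observed) points
intercept, slope = 0.0, 1.0   # step 2: initial values (e.g., 0 and 1)
learning_rate = 0.01

for step in range(1000):
    # step 1: the gradient = both partial derivatives of the SSR
    d_intercept = sum(-2 * (obs - (intercept + slope * x)) for x, obs in data)
    d_slope     = sum(-2 * x * (obs - (intercept + slope * x)) for x, obs in data)
    # step 3: step sizes = learning rate * slope(s)
    step_i = learning_rate * d_intercept
    step_s = learning_rate * d_slope
    # step 5: stop when both step sizes are tiny
    if abs(step_i) < 0.001 and abs(step_s) < 0.001:
        break
    # step 4: update the parameters using the step sizes
    intercept -= step_i
    slope     -= step_s

print(f"intercept ≈ {intercept:.3f}, slope ≈ {slope:.3f}")
```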
    • In real-world scenarios there may be many observed data points, and computing the residuals (the loss function) over all of them at every step is expensive. In this case, we can use Stochastic Gradient Descent, which selects a random subset of the data at each step instead of using the entire dataset (see the sketch below).
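
A sketch of stochastic gradient descent under the same setup; the synthetic dataset (100 noisy points around a line with slope 0.64 and intercept 0.95) and the batch size of 8 are assumptions for illustration.

```python
import random

# Stochastic gradient descent: each update uses a random mini-batch
# instead of the full dataset.

random.seed(0)
data = [(i / 100, 0.64 * (i / 100) + 0.95 + random.gauss(0, 0.05))
        for i in range(100)]                  # synthetic (x, observed) points

intercept, slope = 0.0, 0.0
learning_rate = 0.01
batch_size = 8

for step in range(2000):
    batch = random.sample(data, batch_size)   # random subset of the data
    d_i = sum(-2 * (obs - (intercept + slope * x)) for x, obs in batch)
    d_s = sum(-2 * x * (obs - (intercept + slope * x)) for x, obs in batch)
    intercept -= learning_rate * d_i
    slope     -= learning_rate * d_s

# Should roughly recover intercept ≈ 0.95 and slope ≈ 0.64, up to SGD noise.
print(f"intercept ≈ {intercept:.2f}, slope ≈ {slope:.2f}")
```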

Trained Weights

  • Multiple curved segments (each a piece of the activation function) are added together to produce the final fitted curve:
    • Can be Soft-Plus, ReLU, or Sigmoid, which are activation functions.
    • The input layer (which has specific weights calculated through backpropagation):
      • First-round calculation of weights and biases.
      • Pass through the activation function.
        • In the Soft-Plus activation function, the input (e.g., Dosage) is passed into the activation function (serving as the $x$ input for the hidden layer). The result is stored. For each input, plot a new point representing the output of the activation function.
        • The output is multiplied by a specific number (scaling), and another point is generated for the next input in the hidden layer.
        • Repeat for other connections to the hidden layer to obtain a series of points representing the activation function and a series of operations. These points always follow the trend of the activation function and are determined by the weights and biases.
      • Plot these two curves (partial activation functions) on the graph, aligning them based on their respective outputs. Adding them together yields a new curve!
      • This new curve is then shifted down by a specific value (the final bias) to produce the final fitted curve.
      • Advanced fitting machine:
        • Multiplications represent weights.
        • Additions represent biases. ![[Pasted image 20240906064619.png]]
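
A sketch of this fitting machine with two Soft-Plus hidden nodes; all weight and bias values below are illustrative assumptions chosen to produce a bump-shaped curve, not values trained on the example's data.

```python
import math

# Two hidden nodes with Soft-Plus activation, whose scaled outputs are
# summed and shifted to form the final fitted curve.

def softplus(x):
    return math.log(1 + math.exp(x))

def fitted_curve(dosage,
                 w1=-34.4, b1=2.14,    # input -> hidden node 1 (assumed values)
                 w2=-2.52, b2=1.29,    # input -> hidden node 2 (assumed values)
                 w3=-1.30, w4=2.28,    # hidden -> output scaling (assumed)
                 b3=-0.58):            # final bias (the shift-down step)
    node1 = softplus(dosage * w1 + b1)   # first round: weight, bias, activation
    node2 = softplus(dosage * w2 + b2)
    return node1 * w3 + node2 * w4 + b3  # scale, sum, shift

for i in range(5):
    dosage = i / 4
    print(f"dosage={dosage:.2f} -> output={fitted_curve(dosage):.3f}")
```

Multiplications here are the weights and additions are the biases, matching the "advanced fitting machine" description above.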

Backpropagation

  • How do we estimate the corresponding weights and biases?

  • In this example, we only need to estimate the final bias $b_3$:

    • Start with an initial value for the bias, typically 0. We can measure the difference between the two curves at each point (Observed is the original data; Predicted is the raw output of the neural network) using the residual (as mentioned in the gradient descent section).

    • Square the residuals for each point individually, then sum them to get the Sum of Squared Residuals (SSR), $\sum^{n=3}_{i=1}(Observed_i - Predicted_i)^2$, which corresponds to one point on the SSR-vs-$b_3$ graph (for a given value of $b_3$, what is the SSR?).

    • Adjust $b_3$ multiple times, recalculating the corresponding SSR (loss function), and plot these values to obtain a curve (the pink curve). The lowest point represents the smallest residual (a sketch of this sweep follows the image). ![[Pasted image 20240906064806.png]]
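
A sketch of how that sweep is traced out in code; the observed values and the per-point (blue + orange) sums are assumed numbers for illustration, not the example's actual data.

```python
# Tracing the SSR-vs-b3 curve: for each candidate b3, form
# Predicted = (blue + orange) + b3 and sum the squared residuals.

observed = [0.0, 1.0, 0.0]             # stand-ins for the three data points
blue_plus_orange = [0.57, 1.61, 0.58]  # assumed hidden-layer sums per point

def ssr(b3):
    return sum((obs - (s + b3)) ** 2 for obs, s in zip(observed, blue_plus_orange))

for b3 in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    print(f"b3={b3:+.1f}  SSR={ssr(b3):.3f}")
```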

    • Now we can run gradient descent using $\frac{dSSR}{db_3}$, the derivative of the loss function, as we did earlier. Although SSR does not depend on $b_3$ directly, they are connected via Predicted ($Sum + b_3$), so we need the chain rule to differentiate SSR with respect to $b_3$:

      $$\frac{dSSR}{db_3} = \frac{dSSR}{dPredicted} \times \frac{dPredicted}{db_3}$$

      $\frac{dSSR}{dPredicted}$:

      • $\frac{d}{dPredicted} \sum^{n=3}_{i=1}(Observed_i - Predicted_i)^2$

      • Power rule on the outer square → $\sum^{n=3}_{i=1} 2 \times (Observed_i - Predicted_i) \times \frac{d}{dPredicted}(Observed_i - Predicted_i)$

      • Inner derivative → $\frac{d}{dPredicted}(Observed_i - Predicted_i) = 0 + (-1) = -1$

      • Merge → $\sum^{n=3}_{i=1} -2 \times (Observed_i - Predicted_i)$

        $\frac{dPredicted}{db_3}$:

      • $\frac{d}{db_3} (blue + orange + b_3) = 0 + 0 + 1 = 1$

        The total result is: $\frac{dSSR}{db_3} = \sum^{n=3}_{i=1} -2 \times (Observed_i - Predicted_i) \times 1$

    • Next, we apply gradient descent (with a single parameter, the gradient is just this one derivative). Substituting the observed data into the residual equation gives a loss function that depends only on Predicted; inserting the Predicted values into the loss function (sum of squared residuals) yields the total loss.

      • We then adjust $b_3$, substitute it into the derivative to calculate the slope, and use that to determine the step size. This process is repeated to update $b_3$. Once gradient descent stops, we have the optimized $b_3$ bias.
    • Summary: unlike pure gradient descent, this approach requires using the chain rule to derive the indirect relationship between SSR and $b_3$, allowing us to perform gradient descent to optimize the bias (see the sketch below).
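
A sketch of that whole loop, plugging the chain-rule result $\frac{dSSR}{db_3} = \sum -2 \times (Observed_i - Predicted_i) \times 1$ into the gradient descent update; the observed values and hidden-layer sums are the same assumed numbers as in the previous sketch.

```python
# Optimizing b3 with the derivative derived above.

observed = [0.0, 1.0, 0.0]             # stand-ins for the three data points
blue_plus_orange = [0.57, 1.61, 0.58]  # assumed fixed earlier-layer output per point

b3 = 0.0                               # initial bias, typically 0
learning_rate = 0.1

for step in range(1000):
    predicted = [s + b3 for s in blue_plus_orange]
    # dSSR/db3 = sum(-2 * (observed - predicted)) * 1
    slope = sum(-2 * (obs - pred) for obs, pred in zip(observed, predicted))
    step_size = learning_rate * slope
    if abs(step_size) < 0.001:         # stop when the step size is tiny
        break
    b3 -= step_size                    # update the bias

print(f"optimized b3 ≈ {b3:.3f}")
```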

ReLU Activation Function

  • In the previous example, we used the Soft-Plus activation function. Now, let's replace it with the ReLU (Rectified Linear Unit) function.
  • Let's revisit weight calculations (similar to the Soft-Plus section):
    • Before producing the final output, we add an additional activation function.
    • As before, after passing through the first round of weights and biases, we input the result into the activation function. The ReLU function is $f(x) = \max(0, x)$: the output is $x$ if $x > 0$, and 0 otherwise.
    • We plot the output of the activation function against the input. The curve is part of the ReLU function.
    • The output of the activation function is multiplied by the next weight, and the process is repeated for the next node. Eventually, we obtain graphs for two sets of ReLU activation function outputs.
    • Adding the results of these two ReLU activations gives us a new piecewise-linear shape:
    • Since we added an activation function before the output, the result is passed through the ReLU function again, yielding a new curve.
    • Even though the ReLU function may appear less complex than other activation functions, through scaling and shifting by weights and biases, multiple ReLU functions can interact and sum together to create a shape that fits the data (see the sketch below).
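
A sketch of two scaled, shifted ReLU pieces summing into a bent shape, with the result passed through a final ReLU as described above; every weight and bias here is a made-up illustrative value (chosen so the output forms a tent peaking at dosage 0.5).

```python
# Two ReLU pieces, scaled and shifted, summed into a new piecewise-linear shape.

def relu(x):
    return max(0.0, x)

def network(dosage,
            w1=2.0, b1=0.0,    # input -> hidden node 1 (made-up values)
            w2=2.0, b2=-1.0,   # input -> hidden node 2 (made-up values)
            w3=1.0, w4=-2.0,   # hidden -> output scaling (made-up values)
            b_out=0.0):        # final output bias (made-up value)
    node1 = relu(dosage * w1 + b1)       # first ReLU piece
    node2 = relu(dosage * w2 + b2)       # second ReLU piece, shifted
    # scale, sum, shift, then pass through ReLU once more before the output
    return relu(node1 * w3 + node2 * w4 + b_out)

for i in range(5):
    dosage = i / 4
    print(f"dosage={dosage:.2f} -> output={network(dosage):.2f}")
# Prints a tent shape: 0.00, 0.50, 1.00, 0.50, 0.00
```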
  • The ReLU function is piecewise and is not differentiable over its entire domain (differentiability matters for gradient descent); specifically, its derivative is undefined at $x = 0$. However, we can treat it as separate cases:
    • When $x < 0$: the ReLU function is always 0 in the negative range, so the derivative is 0.
    • When $x > 0$: the ReLU function is $f(x) = x$ in the positive range, so the derivative is 1.
    • When $x = 0$: the derivative is undefined, so by convention it is simply defined as 0 or 1.

This simple derivative makes ReLU highly efficient during backpropagation, which is why it is so widely used. A sketch of ReLU and its derivative follows.
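
A minimal sketch of ReLU and its piecewise derivative as just described; defining the derivative at $x = 0$ to be 0 is one common convention (an assumption here, not the only choice).

```python
# ReLU and its piecewise derivative, as used during backpropagation.

def relu(x):
    return max(0.0, x)

def relu_derivative(x):
    # 0 for x < 0, 1 for x > 0; at x == 0 we define it as 0 by convention.
    return 1.0 if x > 0 else 0.0

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"x={x:+.1f}  relu={relu(x):.1f}  d/dx={relu_derivative(x):.0f}")
```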