1. Cost Function for Linear Regression Problems
The cost function measures the error between predicted values and actual target values in a regression model. The goal of training the regression model is to minimize this cost function.
Mean Squared Error (MSE)
The most commonly used cost function for regression is the Mean Squared Error (MSE):
$$ J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 $$
where:
- \( J(\theta) \): The cost function to minimize.
- \( m \): The number of training examples.
- \( h_\theta(x^{(i)}) \): The predicted value (hypothesis) for the \(i\)-th training example.
- \( y^{(i)} \): The actual target value for the \(i\)-th training example.
- The factor \( \frac{1}{2} \) is included so that it cancels the 2 produced when differentiating the squared term, giving a cleaner gradient.
Other Cost Functions
- Mean Absolute Error (MAE): \[ J(\theta) = \frac{1}{m} \sum_{i=1}^m \left| h_\theta(x^{(i)}) - y^{(i)} \right| \]
- Huber Loss: Behaves like MSE for small errors and like MAE for large ones, making it more robust to outliers than MSE alone.
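As a concrete illustration, here is a minimal NumPy sketch of how the MSE and MAE costs could be computed for a linear hypothesis \( h_\theta(x) = \theta^T x \). The function names, the tiny dataset, and the leading column of ones for the intercept are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def mse_cost(theta, X, y):
    """MSE cost J(theta) with the 1/(2m) convention used above."""
    m = len(y)
    errors = X @ theta - y                 # h_theta(x^(i)) - y^(i) for every example
    return (errors @ errors) / (2 * m)

def mae_cost(theta, X, y):
    """MAE cost: mean of the absolute errors."""
    return np.mean(np.abs(X @ theta - y))

# Illustrative usage on a tiny synthetic dataset
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # leading column of ones = intercept term
y = np.array([2.0, 2.5, 3.5])
theta = np.zeros(2)
print(mse_cost(theta, X, y), mae_cost(theta, X, y))
```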
2. Optimization in Regression Problems
Optimization is the process of finding the best parameters \( \theta \) that minimize the cost function. In regression, we often use Gradient Descent to perform optimization.
Gradient Descent
Gradient Descent is an iterative optimization algorithm used to minimize the cost function by adjusting model parameters in the direction of the steepest descent of the cost function.
Key Equation
The parameters are updated as follows:
\[ \theta_j := \theta_j - \eta \frac{\partial}{\partial \theta_j} J(\theta) \]
- \( \theta_j \): Parameter to be updated.
- \( \eta \): Learning rate, a hyperparameter that controls the size of each step.
- \( \frac{\partial}{\partial \theta_j} J(\theta) \): Partial derivative of the cost function w.r.t. the parameter \( \theta_j \).
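For the MSE cost function from Section 1 with a linear hypothesis \( h_\theta(x) = \theta^T x \), this partial derivative has a simple closed form:

\[ \frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \]

Each update therefore moves \( \theta_j \) in proportion to the average prediction error weighted by the \( j \)-th feature.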
Steps of Gradient Descent
- Initialize the parameters \( \theta \) with random values.
- Compute the gradient of the cost function w.r.t. each parameter.
- Update the parameters using the update rule.
- Repeat until convergence (i.e., until the change in the cost function falls below a predefined threshold).
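The following is a minimal NumPy sketch of these four steps for linear regression with the MSE cost. The learning rate, iteration limit, tolerance, and small random initialization are illustrative choices, not prescribed values.

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, max_iters=10_000, tol=1e-8):
    """Batch gradient descent for linear regression with the MSE cost."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    theta = 0.01 * rng.standard_normal(n)    # step 1: small random initialization
    prev_cost = np.inf
    for _ in range(max_iters):
        errors = X @ theta - y               # h_theta(x^(i)) - y^(i) for all examples
        cost = (errors @ errors) / (2 * m)
        if abs(prev_cost - cost) < tol:      # step 4: stop once the cost stabilizes
            break
        prev_cost = cost
        grad = (X.T @ errors) / m            # step 2: gradient of J w.r.t. each theta_j
        theta -= eta * grad                  # step 3: apply the update rule
    return theta
```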
Stochastic Gradient Descent (SGD)
In SGD, the gradient is computed and the parameters are updated using one training example at a time instead of the entire dataset. This yields faster but noisier updates. The update rule has the same form, with \( J(\theta) \) evaluated on a single example:
\[ \theta_j := \theta_j - \eta \frac{\partial}{\partial \theta_j} J(\theta) \]
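A minimal sketch of one SGD epoch for linear regression with the MSE cost is shown below; the per-example gradient follows directly from the derivative given earlier, and the shuffling and default learning rate are illustrative choices.

```python
import numpy as np

def sgd_epoch(theta, X, y, eta=0.01, rng=None):
    """One pass of stochastic gradient descent: update theta one example at a time."""
    rng = rng or np.random.default_rng()
    for i in rng.permutation(len(y)):        # visit the examples in random order
        error = X[i] @ theta - y[i]          # h_theta(x^(i)) - y^(i) for a single example
        theta = theta - eta * error * X[i]   # single-example gradient step
    return theta
```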
Regularization in Optimization
To prevent overfitting, regularization techniques add a penalty term to the cost function, encouraging simpler models by penalizing large parameter values.
L1 Regularization (Lasso)
Adds the absolute value of the coefficients to the cost function:
\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^n |\theta_j| \]
- \( \lambda \): Regularization parameter that controls the penalty strength.
- \( |\theta_j| \): Absolute value of the coefficients.
L2 Regularization (Ridge)
Adds the squared value of the coefficients to the cost function:
\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^n \theta_j^2 \]
- \( \lambda \): Regularization parameter that controls the penalty strength.
- \( \theta_j^2 \): Squared value of the coefficients.
Elastic Net Regularization
Combines L1 and L2 regularization techniques:
\[ J(\theta) = \frac{1}{2m} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda_1 \sum_{j=1}^n |\theta_j| + \lambda_2 \sum_{j=1}^n \theta_j^2 \]
- \( \lambda_1, \lambda_2 \): Regularization parameters for L1 and L2 penalties.
Effect of Regularization
- L1 Regularization: Encourages sparsity by setting some coefficients to zero, leading to feature selection.
- L2 Regularization: Shrinks all coefficients toward zero (but rarely exactly to zero), reducing model complexity without eliminating features.
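If scikit-learn is available, these three penalties correspond to its Ridge, Lasso, and ElasticNet estimators. Note that scikit-learn parameterizes the penalty with alpha (and l1_ratio for ElasticNet) and scales it somewhat differently from the \( \lambda \) in the formulas above, so the sketch below is illustrative rather than a one-to-one translation.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Tiny synthetic problem: 5 features, two of which are irrelevant (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=100)

print(Ridge(alpha=1.0).fit(X, y).coef_)                     # all coefficients shrunk toward zero
print(Lasso(alpha=0.1).fit(X, y).coef_)                     # some coefficients driven exactly to zero
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)  # a mix of both behaviors
```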
Convergence in Optimization
Convergence is achieved when the cost function stabilizes or changes below a certain threshold across iterations. Factors affecting convergence include:
- Learning Rate (\( \eta \)): Too small causes slow convergence, while too large may overshoot the minimum.
- Initial Parameter Values: Poor initialization can lead to suboptimal solutions or slow convergence.
- Gradient Descent Variant: Batch, Stochastic, or Mini-batch Gradient Descent can influence speed and stability.
3. Closed-Form Solution for Linear Regression
When the dataset and the number of features are small enough for the required matrix operations to be practical, ordinary least squares (OLS) provides a closed-form solution for linear regression (the normal equation):
\[ \theta = \left( X^T X \right)^{-1} X^T y \]
Explanation:
- \( X \): The design matrix (input data).
- \( y \): The target vector.
- \( \left( X^T X \right)^{-1} \): The inverse of the matrix product \( X^T X \).
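A minimal NumPy sketch of the normal equation is shown below. Rather than forming the explicit inverse, it solves the linear system \( X^T X \, \theta = X^T y \), which is faster and more numerically stable; if \( X^T X \) is singular (e.g., with collinear features), a pseudoinverse can be used instead.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form OLS solution: theta = (X^T X)^{-1} X^T y."""
    # Solving X^T X theta = X^T y avoids computing the inverse explicitly.
    # For a singular X^T X, fall back to the pseudoinverse: np.linalg.pinv(X) @ y
    return np.linalg.solve(X.T @ X, X.T @ y)
```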
Summary of Mathematical Flow
- Define the cost function \( J(\theta) \).
- Compute its gradient \( \frac{\partial J(\theta)}{\partial \theta} \).
- Use gradient descent or a similar optimization algorithm to minimize \( J(\theta) \).
- Regularize to reduce overfitting (if necessary).