 Jul03

### Related Posts

Gradient descent is an optimization algorithm which finds a local minimum of a function by taking steps proportional to the negative of the gradient of the function at the current point. In a machine learning problem, the function usually is the cost function you want to minimize.

$\beta_j^{i+1} \leftarrow \beta_j^{i} – \alpha \frac{dJ(\beta^i)}{d\beta_j}$

The step size, between the current position and the next, parametered by alpha in the previous formula, is usually predefined. If it’s too big the algorithm might not converge while if it’s too small it will converge too slowly to a local minimum. At each step you might want to check that the current position renders a smaller cost value than the previous position, if not your step size is too big. If the function you want to optimize has multiple minimums the local minimum you end to will depend of your starting position. In that case, it might make sense to try multiple random values for the starting positions.

Gradient descent is not a fast algorithm because it needs you to process the entire dataset at each step to calculate the gradient.

There are close alternative to the gradient descent method. On one side, they accelerate the optimization because they do not use the entire dataset and because they are parallelizable but they are more complex to handle due to the fact that the cost function is now allowed to increase at a single step:
• The stochastic gradient descent picks one random data point at each step and calculates the gradient only from this point.
• The batch gradient descent clusters your dataset in random smaller subsets of data points and at each step only takes into account one subset to calculate the gradient.