July 15, 2020
Title: What Is the Gradient Descent Algorithm and How Does It Work?
Gradient descent is an optimization algorithm that helps us train neural networks and many other machine learning models. This article explores how the algorithm actually works, its variants, and its significance in the real world.
Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art deep learning library contains implementations of various algorithms to optimize gradient descent (e.g. lasagne’s, caffe’s, and keras’ documentation).
The reason we’re talking about it here is not merely theoretical. The gradient descent algorithm is much more than it seems: it is used time and again by ML practitioners, data scientists, and students to optimize their models.
Gradient descent is a way to minimize an objective function parameterized by a model’s parameters by updating the parameters in the opposite direction of the gradient of the objective function w.r.t. the parameters. The learning rate determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
Now that you’ve gotten a basic sense of the algorithm, let’s dig deeper into it in this post. We will cover some important aspects: how it works, worked examples, its variants, and a conclusion to tie it all together.
Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost).
Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.
Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. To find a local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. But if we instead take steps proportional to the positive of the gradient, we approach a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent was originally proposed by Cauchy in 1847.
Gradient descent is also known as steepest descent; but gradient descent should not be confused with the method of steepest descent for approximating integrals.
There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.
The procedure starts off with initial values for the coefficient or coefficients for the function. These could be 0.0 or a small random value.
coefficient = 0.0
The cost of the coefficients is evaluated by plugging them into the function and calculating the cost.
cost = f(coefficient)
or
cost = evaluate(f(coefficient))
The derivative of the cost is calculated. The derivative is a concept from calculus and refers to the slope of the function at a given point. We need to know the slope so that we know the direction (sign) to move the coefficient values in order to get a lower cost on the next iteration.
delta = derivative(cost)
Now that the derivative tells us which direction is downhill, we can update the coefficient values. A learning rate parameter (alpha) must be specified that controls how much the coefficients can change on each update.
coefficient = coefficient - (alpha * delta)
This process is repeated until the cost of the coefficients (cost) is 0.0 or close enough to zero to be good enough.
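Putting the steps above together, here is a minimal Python sketch of the procedure. The quadratic example cost and the values chosen for the starting point, learning rate, tolerance, and iteration cap are illustrative assumptions, not part of the algorithm itself.

# A minimal sketch of the gradient descent procedure described above.
# The cost function and its derivative are illustrative stand-ins.

def cost(coefficient):
    return coefficient ** 2      # example cost, minimized at 0.0

def derivative(coefficient):
    return 2 * coefficient       # slope of the example cost

coefficient = 4.0   # initial value (could also be 0.0 or a small random value)
alpha = 0.1         # learning rate: controls the step size

for step in range(100):
    delta = derivative(coefficient)             # direction (sign) of the slope
    coefficient = coefficient - alpha * delta   # move downhill
    if cost(coefficient) < 1e-8:                # close enough to zero is good enough
        break

print(coefficient)  # approaches 0.0, the minimum of the example cost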
You can see how simple gradient descent is. It does require you to know the gradient of your cost function or the function you are optimizing, but besides that, it’s very straightforward. Next we will see the math behind it and how we can use this in machine learning algorithms.
Suppose we have the following given:
Hypothesis: h_θ(x) = θ^T x = θ_0 x_0 + θ_1 x_1 + … + θ_n x_n
Parameters: θ_0, θ_1, θ_2, …, θ_n
Cost function: J(θ) = J(θ_0, θ_1, θ_2, …, θ_n)
Consider the gradient descent algorithm, which starts with some initial θ, and repeatedly performs the update:
θ_j := θ_j − α ∂J(θ)/∂θ_j
(This update is simultaneously performed for all values of j = 0,…,n.) Here, α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J.
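To make the update concrete, here is the partial derivative worked out for a single training example (x, y) with the squared-error cost J(θ) = ½(h_θ(x) − y)², a standard derivation shown here for completeness:

\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
&= \frac{\partial}{\partial \theta_j} \, \frac{1}{2}\bigl(h_\theta(x) - y\bigr)^2 \\
&= \bigl(h_\theta(x) - y\bigr) \cdot \frac{\partial}{\partial \theta_j} \Bigl(\sum_{k=0}^{n} \theta_k x_k - y\Bigr) \\
&= \bigl(h_\theta(x) - y\bigr)\, x_j
\end{aligned}

So for a single example the update is θ_j := θ_j − α (h_θ(x) − y) x_j.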
This single-example update is the LMS (least mean squares) rule. There are two ways to modify it for a training set of more than one example. The first is to replace it with the following algorithm, which sums the per-example terms over the whole training set:

Repeat until convergence {
θ_j := θ_j − α Σ (h_θ(x^(i)) − y^(i)) x_j^(i)
(for every j, with Σ summing over i = 1 to m)
}
The reader can easily verify that the quantity in the summation in the update rule above is just ∂J(θ)/∂θ_j (for the original definition of J). So this is simply gradient descent on the original cost function J. Because it looks at every example in the entire training set on every step, this method is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges to the global minimum (assuming the learning rate α is not too large). Indeed, J is a convex quadratic function.
Variables used: let m be the number of training examples and n be the number of features.
Algorithm for batch gradient descent: Let h_θ(x) be the hypothesis for linear regression. Then the cost function is given by the following, where Σ denotes the sum over all training examples from i = 1 to m:
J_train(θ) = (1/2m) Σ (h_θ(x^(i)) − y^(i))²
Repeat {
θ_j := θ_j − (α/m) Σ (h_θ(x^(i)) − y^(i)) x_j^(i)
(for every j = 0, …, n)
}
Here x_j^(i) represents the j-th feature of the i-th training example. If m is very large (e.g. 5 million training examples), each update must touch every example, and converging to the global minimum can take hours or even days. That is why batch gradient descent is not recommended for large datasets: it slows down learning.
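For a concrete picture, here is a minimal NumPy sketch of batch gradient descent for linear regression. The synthetic data, learning rate, and iteration count are assumptions made for this example, not recommended settings.

import numpy as np

# Synthetic data for illustration: m examples, n features, plus an
# intercept column x_0 = 1 so that theta_0 acts as the bias term.
rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # shape (m, n + 1)
true_theta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=m)

theta = np.zeros(n + 1)   # initial parameters
alpha = 0.1               # learning rate

for _ in range(1000):
    predictions = X @ theta                   # h_theta(x^(i)) for all i at once
    gradient = X.T @ (predictions - y) / m    # (1/m) * sum_i (h - y) * x_j
    theta = theta - alpha * gradient          # one update per full pass over the data

print(theta)   # approaches true_theta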
Algorithm for stochastic gradient descent:
In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent).
Hence,
Let (x^(i), y^(i)) be the i-th training example. Then, with Σ summing over i = 1 to m:
Cost(θ, (x^(i), y^(i))) = (1/2) (h_θ(x^(i)) − y^(i))²
J_train(θ) = (1/m) Σ Cost(θ, (x^(i), y^(i)))
Repeat {
For i = 1 to m {
θ_j := θ_j − α (h_θ(x^(i)) − y^(i)) x_j^(i)
(for every j = 0, …, n)
}
}
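Reusing X, y, m, n, and rng from the batch sketch above, a minimal stochastic version updates the parameters one example at a time. The smaller learning rate, epoch count, and shuffling are illustrative choices.

theta = np.zeros(n + 1)
alpha = 0.01   # per-example updates usually want a smaller step

for epoch in range(20):                  # several passes over the training set
    for i in rng.permutation(m):         # visit examples in a shuffled order
        error = X[i] @ theta - y[i]      # h_theta(x^(i)) - y^(i), a scalar
        theta = theta - alpha * error * X[i]   # update from this one example only

print(theta)   # hovers near true_theta rather than settling exactly on it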
Algorithm for mini-batch gradient descent: Let b be the number of examples in one batch, where b < m. Assume b = 10 and m = 100.
Note: the batch size can be adjusted, and if b == m, mini-batch gradient descent behaves exactly like batch gradient descent. b is generally kept as a power of 2, because some hardware, such as GPUs, achieves better run times with such batch sizes.
Repeat {
For i = 1, 11, 21, …, 91 {
(let Σ denote the sum over k from i to i + 9)
θ_j := θ_j − (α/b) Σ (h_θ(x^(k)) − y^(k)) x_j^(k)
(for every j = 0, …, n)
}
}
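Again reusing the data from the batch sketch above, a mini-batch version with b = 10 looks like the following; the epoch count is arbitrary.

theta = np.zeros(n + 1)
alpha = 0.1
b = 10   # batch size, with m = 100 as in the text

for epoch in range(200):
    indices = rng.permutation(m)             # shuffle once per pass
    for start in range(0, m, b):
        batch = indices[start:start + b]     # one mini-batch of b examples
        errors = X[batch] @ theta - y[batch]           # h_theta - y on the batch
        theta = theta - (alpha / b) * (X[batch].T @ errors)   # averaged batch update

print(theta)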
To choose the learning rate α, try values on a logarithmic scale, e.g. …, 0.001, 0.01, 0.1, 1, …, and pick the largest value for which the cost still decreases steadily.
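As a quick way to compare candidates, one can run a short budget of batch updates for each rate and watch the final cost. This reuses X, y, m, and n from the sketches above; the candidate list and iteration budget are arbitrary.

for alpha in [0.001, 0.01, 0.1, 1.0]:
    theta = np.zeros(n + 1)
    for _ in range(200):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))   # batch update
    final_cost = float((X @ theta - y) @ (X @ theta - y)) / (2 * m)
    print(alpha, final_cost)   # too small barely moves; too large may diverge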
Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if m is large), stochastic gradient descent can start making progress right away and continues to make progress with each example it looks at. Often, stochastic gradient descent gets θ “close” to the minimum much faster than batch gradient descent. (Note, however, that it may never “converge” to the minimum, and the parameters θ will keep oscillating around the minimum of J(θ); in practice, though, most of the values near the minimum will be reasonably good approximations to the true minimum.) For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.
This section lists some tips and tricks for getting the most out of the gradient descent algorithm for machine learning.
Convergence trends in the different variants of gradient descent:
In the case of batch gradient descent, the algorithm follows a smooth, direct path toward the minimum. If the cost function is convex, it converges to the global minimum; if the cost function is not convex, it converges to a local minimum. Here the learning rate is typically held constant.
In the case of stochastic gradient descent and mini-batch gradient descent, the algorithm does not settle at the minimum but keeps fluctuating around it. To make it converge, we have to slowly decrease the learning rate over time. Of the two, the convergence of stochastic gradient descent is noisier, since each iteration processes only a single training example.
In this post you discovered gradient descent for machine learning. You learned that:
- Gradient descent is an optimization algorithm for finding the parameter values that minimize a cost function.
- Batch, stochastic, and mini-batch gradient descent differ in how much of the training data is used to compute each update, trading update accuracy against update speed.
- The learning rate α controls the step size and should be tuned, for example by trying values such as 0.001, 0.01, 0.1, and 1.
Do you have any questions about gradient descent for machine learning or this post? Leave a comment and ask your question and I will do my best to answer it.