Which of the following statements is false about gradient descent algorithms?
A.
Each time global gradient descent updates the weights, all training samples must be processed.
B.
When GPUs are used for parallel computing, mini-batch gradient descent (MBGD) takes less time than stochastic gradient descent (SGD) to complete an epoch.
C.
Global gradient descent is relatively stable, which helps the model converge to the global extremum.
D.
When there are too many samples and GPUs are not used for parallel computing, the convergence process of global gradient descent is time-consuming.
Suggested answer: C
Explanation:
Statement C is the false one. Global (full-batch) gradient descent computes the gradient over all training samples, so its updates are stable and low in noise, but that stability does not help the model reach the global extremum. On a non-convex loss surface, the algorithm follows the exact gradient and can settle into a local extremum or stall at a saddle point. The noise in SGD and MBGD updates is one of the things that can help a model escape such points.
The other statements hold:
Statement A: global gradient descent uses every training sample to compute each weight update.
Statement B: SGD updates the weights after every individual sample, so the updates within an epoch run one after another and cannot be parallelized across samples. MBGD computes each update from a batch of samples, which a GPU can process in parallel, so an epoch needs far fewer sequential steps and typically finishes sooner.
Statement D: without parallel hardware, computing the gradient over a very large sample set for every single update makes convergence slow.
Therefore, the correct answer is C.
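To make the difference between the three variants concrete, here is a minimal sketch that counts how many weight updates one epoch takes. It assumes a plain least-squares model on synthetic data; the function name `run_epoch`, the data shapes, and the learning rate are illustrative choices, not part of the original question.

```python
import numpy as np

def run_epoch(X, y, w, batch_size, lr=0.01):
    """Run one epoch of gradient descent on a least-squares loss.

    batch_size=1 gives SGD, batch_size=len(X) gives global (full-batch)
    gradient descent, and anything in between gives MBGD.
    Returns the updated weights and the number of weight updates performed.
    """
    n = len(X)
    updates = 0
    for start in range(0, n, batch_size):
        Xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        # Gradient of the mean squared error over this batch; the whole
        # batch is handled as one vectorized matrix product.
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(Xb)
        w = w - lr * grad
        updates += 1
    return w, updates

# Synthetic data (assumed for illustration): 1024 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w0 = np.zeros(4)
_, sgd_updates = run_epoch(X, y, w0, batch_size=1)
_, mbgd_updates = run_epoch(X, y, w0, batch_size=32)
_, bgd_updates = run_epoch(X, y, w0, batch_size=len(X))

print(sgd_updates, mbgd_updates, bgd_updates)  # 1024 32 1
```

The counts only show how many sequential update steps an epoch needs. Actual wall-clock time depends on the hardware and on how well each batch is vectorized.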
HCIA AI
AI Development Framework: Discussion of gradient descent algorithms and their efficiency on different hardware architectures like GPUs.