Question 207 - Professional Machine Learning Engineer discussion


You have deployed a scikit-learn model to a Vertex AI endpoint using a custom model server. You enabled autoscaling; however, the deployed model fails to scale beyond one replica, which led to dropped requests. You notice that CPU utilization remains low even during periods of high load. What should you do?

A. Attach a GPU to the prediction nodes.
B. Increase the number of workers in your model server.
C. Schedule scaling of the nodes to match expected demand.
D. Increase the minReplicaCount in your DeployedModel configuration.
Suggested answer: B

Explanation:

Autoscaling is a feature that automatically adjusts the number of prediction nodes based on the traffic and load of your deployed model [1]. However, autoscaling is driven by the CPU utilization of your prediction nodes, that is, the percentage of CPU resources used by your model server [1]. If CPU utilization stays low even during periods of high load, your model server is not fully using the available CPU resources, so autoscaling will not add replicas [2].

One common reason for low CPU utilization is that the model server runs a single worker process [3]. A worker process is a subprocess that runs your model code and handles prediction requests; with only one worker, the server can handle only one request at a time, which leads to dropped requests under high traffic [3]. Increasing the number of worker processes lets the server handle multiple requests in parallel, raising both throughput and CPU utilization [3].

To increase the number of workers, modify your custom model server's startup command to specify the number of worker processes, for example with Gunicorn's --workers flag [3]. The following command starts a Gunicorn server with four worker processes:

gunicorn --bind :$PORT --workers 4 --threads 1 --timeout 60 main:app

By increasing the number of workers in your model server, you can increase the CPU utilization of your prediction nodes, and thus enable auto scaling to scale beyond one replica.
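To pick a concrete worker count, a widely used heuristic (not taken from the cited docs, so treat it as a rule of thumb) is roughly 2 x vCPUs + 1 for CPU-bound workloads such as scikit-learn inference. A minimal sketch; the machine-type example in the comment is illustrative:

```python
# Sketch of the common "2 * cores + 1" Gunicorn worker-sizing heuristic.
# This is a rule of thumb, not an official Vertex AI recommendation.
import multiprocessing


def recommended_workers(cpu_count=None):
    """Return a worker count that keeps all vCPUs busy under load."""
    cores = cpu_count or multiprocessing.cpu_count()
    return 2 * cores + 1


# e.g. on a 4-vCPU prediction node (such as an n1-standard-4):
print(recommended_workers(4))  # 9
```

You would then pass this value to Gunicorn's --workers flag in your container's startup command.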

The other options do not address the root cause. Attaching a GPU (A) does not help a CPU-bound scikit-learn server, and scheduled scaling (C) does not fix the low CPU utilization that prevents autoscaling from triggering. Increasing the minReplicaCount (D) sets a fixed minimum number of nodes that always run regardless of traffic; it masks the symptom at extra cost rather than making autoscaling work [1].
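For context, the CPU utilization target that drives autoscaling is set at deploy time. A hedged sketch of such a deployment with the gcloud CLI; the endpoint ID, model ID, region, and target value below are placeholders:

```shell
# Sketch: deploy a model with CPU-based autoscaling between 1 and 5 replicas.
# ENDPOINT_ID, MODEL_ID, and the region are placeholders for your own values.
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name=sklearn-model \
  --machine-type=n1-standard-4 \
  --min-replica-count=1 \
  --max-replica-count=5 \
  --autoscaling-metric-specs=cpu-usage=60
```

With a target like cpu-usage=60, Vertex AI adds replicas when average CPU utilization exceeds 60%, which is exactly why a single-worker server that never drives CPU that high fails to scale.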

References:
[1] Scaling prediction nodes | Vertex AI | Google Cloud
[2] Troubleshooting | Vertex AI | Google Cloud
[3] Using a custom prediction routine with online prediction | Vertex AI | Google Cloud

asked 18/09/2024 by Amir Arefi