Question 73 - Professional Machine Learning Engineer discussion


You are experimenting with a built-in distributed XGBoost model in Vertex AI Workbench user-managed notebooks. You use BigQuery to split your data into training and validation sets using the following queries:

CREATE OR REPLACE TABLE `myproject.mydataset.training` AS
(SELECT * FROM `myproject.mydataset.mytable` WHERE RAND() <= 0.8);

CREATE OR REPLACE TABLE `myproject.mydataset.validation` AS
(SELECT * FROM `myproject.mydataset.mytable` WHERE RAND() <= 0.2);

After training the model, you achieve an area under the receiver operating characteristic curve (AUC ROC) value of 0.8, but after deploying the model to production, you notice that your model performance has dropped to an AUC ROC value of 0.65. What problem is most likely occurring?

A. There is training-serving skew in your production environment.

B. There is not a sufficient amount of training data.

C. The tables that you created to hold your training and validation records share some records, and you may not be using all the data in your initial table.

D. The RAND() function generated a number that is less than 0.2 in both instances, so every record in the validation table will also be in the training table.
Suggested answer: C

Explanation:

The most likely problem is that the training and validation tables share some records, and some records from the initial table end up in neither set. Because RAND() is evaluated independently in each query, every row gets two independent draws: it lands in the training table with probability 0.8 and in the validation table with probability 0.2. The probability that a row ends up in both tables is therefore 0.8 * 0.2 = 0.16, which is not negligible, and the probability that it ends up in neither table is 0.2 * 0.8 = 0.16, so a sizable share of the data is never used at all. Records that appear in both sets leak training information into validation, which inflates the validation AUC ROC; on genuinely unseen production data the model's true performance (0.65) shows through. A short simulation after the corrected queries below illustrates these rates. A better way to split the data is to use a deterministic hash of a unique identifier column, for example:

CREATE OR REPLACE TABLE `myproject.mydataset.training` AS
(SELECT * FROM `myproject.mydataset.mytable`
 WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) < 8);

CREATE OR REPLACE TABLE `myproject.mydataset.validation` AS
(SELECT * FROM `myproject.mydataset.mytable`
 WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) >= 8);

Because FARM_FINGERPRINT of a given id is deterministic, each row is assigned to exactly one table: roughly 80% of rows land in the training table and the remaining ~20% in the validation table, with no overlap and no omitted rows. (The ABS() is needed because FARM_FINGERPRINT returns an INT64 that can be negative, and the CAST ensures the function receives the STRING input it expects.)
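As a quick, informal check of the failure mode described above (this simulation is not part of the original explanation), two independent uniform draws per row reproduce the roughly 16% overlap and 16% omission rates:

import random

random.seed(42)
n = 100_000
overlap = 0
omitted = 0
for _ in range(n):
    # Each CREATE TABLE query evaluates RAND() independently for the same source row.
    in_training = random.random() <= 0.8
    in_validation = random.random() <= 0.2
    overlap += in_training and in_validation
    omitted += not in_training and not in_validation

print(f"fraction of rows in both tables:   {overlap / n:.3f}")  # ~0.16
print(f"fraction of rows in neither table: {omitted / n:.3f}")  # ~0.16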
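For a rough sketch of the deterministic alternative outside BigQuery, the snippet below uses Python's hashlib as a stand-in for FARM_FINGERPRINT and a hypothetical row_id value; because the hash of a given identifier never changes, the same row always lands in the same split, so the two sets cannot overlap:

import hashlib

def assign_split(row_id: str) -> str:
    """Deterministically assign a row to the training or validation split."""
    # Hash the identifier and bucket it into 0-9: buckets 0-7 (~80%) train, 8-9 (~20%) validate.
    digest = hashlib.sha256(row_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10
    return "training" if bucket < 8 else "validation"

# The assignment is stable across runs: the same id always maps to the same split.
assert assign_split("customer-00042") == assign_split("customer-00042")
print(assign_split("customer-00042"))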

