
Snowflake DSA-C02 Practice Test - Questions Answers, Page 6


Which of the following cross validation versions may not be suitable for very large datasets with hundreds of thousands of samples?

A. k-fold cross-validation
B. Leave-one-out cross-validation
C. Holdout method
D. All of the above
Suggested answer: B

Explanation:

Leave-one-out cross-validation (LOO cross-validation) is not suitable for very large datasets because it requires one model to be trained and evaluated for every sample in the training set.

Cross validation

Cross-validation is a technique for evaluating a machine learning model and is the basis for a whole class of model evaluation methods. The goal of cross-validation is to test the model's ability to predict new data that was not used in estimating it. It works by splitting the dataset into a number of subsets, holding one subset aside, training the model on the remaining data, and testing it on the held-out subset.
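As a minimal illustration with scikit-learn (the breast-cancer dataset, logistic regression model, and k = 5 are assumptions chosen only for this sketch):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Example dataset and model (assumptions for this sketch).
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold cross-validation: each fold is held out once while the model
# is trained on the remaining folds and scored on the held-out fold.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())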

Leave-one-out cross validation

Leave-one-out cross-validation is k-fold cross-validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means the function approximator is trained N separate times, each time on all the data except one point, and a prediction is made for that point. As before, the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross-validation is therefore very expensive to compute, since it requires fitting N models.
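A small scikit-learn sketch makes that cost concrete; the iris dataset and logistic regression model are assumptions for the example, and the point is that LeaveOneOut produces one train/test split, and therefore one model fit, per sample.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)              # 150 samples -> 150 model fits
model = LogisticRegression(max_iter=5000)

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)  # one fit and one prediction per sample
print(len(scores), scores.mean())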

Which of the following cross-validation versions is a quicker form of cross-validation suitable for very large datasets with hundreds of thousands of samples?

A. k-fold cross-validation
B. Leave-one-out cross-validation
C. Holdout method
D. All of the above
Suggested answer: C

Explanation:

The holdout method is suitable for very large datasets because it is the simplest and quickest-to-compute version of cross-validation.

Holdout method

In this method, the dataset is divided into two sets, namely the training set and the test set, with the basic property that the training set is bigger than the test set. The model is then trained on the training dataset and evaluated on the test dataset.
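A minimal holdout sketch with scikit-learn, assuming an 80/20 split and an example dataset, trains and evaluates only one model, which is why it scales to very large datasets.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Single split: 80% training, 20% test (the proportions are an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # only one model is trained and evaluated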

Which of the following is a common evaluation metric for binary classification?

A. Accuracy
B. F1 score
C. Mean squared error (MSE)
D. Area under the ROC curve (AUC)
Suggested answer: D

Explanation:

The area under the ROC curve (AUC) is a common evaluation metric for binary classification, which measures the performance of a classifier at different threshold values for the predicted probabilities. Other common metrics include accuracy, precision, recall, and F1 score, which are based on the confusion matrix of true positives, false positives, true negatives, and false negatives.
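For instance, a short scikit-learn sketch (the labels and predicted probabilities below are made up for illustration) computes AUC directly from predicted probabilities.

from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# AUC summarizes ranking quality across all probability thresholds.
print(roc_auc_score(y_true, y_prob))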

The most widely used metrics and tools to assess a classification model are:

A. Confusion matrix
B. Cost-sensitive accuracy
C. Area under the ROC curve
D. All of the above
Suggested answer: D

How do you handle missing or corrupted data in a dataset?

A. Drop missing rows or columns
B. Replace missing values with mean/median/mode
C. Assign a unique category to missing values
D. All of the above
Suggested answer: D

Which of the following is a useful tool for gaining insights into the relationship between features and predictions?

A. numpy plots
B. sklearn plots
C. Partial dependence plots (PDP)
D. FULL dependence plots (FDP)
Suggested answer: C

Explanation:

Partial dependence plots (PDPs) are a useful tool for gaining insight into the relationship between features and predictions. They help us understand how different values of a particular feature impact the model's predictions.
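A minimal scikit-learn sketch, assuming scikit-learn 1.x (for PartialDependenceDisplay), matplotlib, and the bundled diabetes dataset, shows how a PDP is produced.

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True)
model = GradientBoostingRegressor().fit(X, y)

# Plot how the model's average prediction changes as features 0 and 2 vary,
# marginalizing over the observed values of the other features.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 2])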

Which of the following are correct rules when using a data science model created via an external function in Snowflake?

A. External functions return a value. The returned value can be a compound value, such as a VARIANT that contains JSON.
B. External functions can be overloaded.
C. An external function can appear in any clause of a SQL statement in which other types of UDF can appear.
D. External functions can accept Model parameters.
Suggested answer: A, B, C, D

Explanation:

From the perspective of a user running a SQL statement, an external function behaves like any other UDF. External functions follow these rules:

External functions return a value.

External functions can accept parameters.

An external function can appear in any clause of a SQL statement in which other types of UDF can appear. For example:

select my_external_function_2(column_1, column_2)
from table_1;

select col1
from table_1
where my_external_function_3(col2) < 0;

create view view1 (col1) as
select my_external_function_5(col1)
from table9;

An external function can be part of a more complex expression:

select upper(zipcode_to_city_external_function(zipcode))
from address_table;

The returned value can be a compound value, such as a VARIANT that contains JSON.

External functions can be overloaded; two different functions can have the same name but different signatures (different numbers or data types of input parameters).

All Snowpark ML modeling and preprocessing classes are in the ________ namespace?

A. snowpark.ml.modeling
B. snowflake.sklearn.modeling
C. snowflake.scikit.modeling
D. snowflake.ml.modeling
Suggested answer: D

Explanation:

All Snowpark ML modeling and preprocessing classes are in the snowflake.ml.modeling namespace. The Snowpark ML modules have the same names as the corresponding modules in the sklearn namespace. For example, the Snowpark ML module corresponding to sklearn.calibration is snowflake.ml.modeling.calibration.

The xgboost and lightgbm modules correspond to snowflake.ml.modeling.xgboost and snowflake.ml.modeling.lightgbm, respectively.

Not all of the classes from scikit-learn are supported in Snowpark ML.
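For illustration, a sketch of the import path and fit/predict flow; the feature and label column names and the training_df/test_df Snowpark DataFrames are assumptions made for this example.

from snowflake.ml.modeling.xgboost import XGBClassifier

clf = XGBClassifier(
    input_cols=["FEATURE_1", "FEATURE_2"],   # hypothetical column names
    label_cols=["LABEL"],
    output_cols=["PREDICTION"],
)
clf.fit(training_df)                # training_df: an existing Snowpark DataFrame
predictions = clf.predict(test_df)  # test_df: another Snowpark DataFrame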

Which type of Python UDFs let you define Python functions that receive batches of input rows as Pandas DataFrames and return batches of results as Pandas arrays or Series?

A. MPP Python UDFs
B. Scalar Python UDFs
C. Vectorized Python UDFs
D. Hybrid Python UDFs
Suggested answer: C

Explanation:

Vectorized Python UDFs let you define Python functions that receive batches of input rows as Pandas DataFrames and return batches of results as Pandas arrays or Series. You call vectorized Python UDFs the same way you call other Python UDFs.

Advantages of using vectorized Python UDFs compared to the default row-by-row processing pattern include:

The potential for better performance if your Python code operates efficiently on batches of rows.

Less transformation logic required if you are calling into libraries that operate on Pandas DataFrames or Pandas arrays.

When you use vectorized Python UDFs:

You do not need to change how you write queries using Python UDFs. All batching is handled by the UDF framework rather than your own code.

As with non-vectorized UDFs, there is no guarantee of which instances of your handler code will see which batches of input.
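A minimal handler sketch, assuming the function is registered separately via CREATE FUNCTION ... LANGUAGE PYTHON with two numeric arguments and this function named as the handler:

import pandas
from _snowflake import vectorized   # available inside Snowflake's Python runtime

@vectorized(input=pandas.DataFrame)
def add_inputs(df: pandas.DataFrame) -> pandas.Series:
    # df holds a batch of rows; columns 0 and 1 are the UDF's two arguments.
    # The returned Series supplies one result per input row in the batch.
    return df[0] + df[1]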

Which of the following metrics are used to evaluate classification models?

A. Area under the ROC curve
B. F1 score
C. Confusion matrix
D. All of the above
Suggested answer: D

Explanation:

Evaluation metrics are tied to machine learning tasks. There are different metrics for the tasks of classification and regression, and some metrics, like precision-recall, are useful for multiple tasks. Classification and regression are examples of supervised learning, which constitutes the majority of machine learning applications. By using different metrics for performance evaluation, we can improve the model's overall predictive power before rolling it out for production on unseen data. Relying only on accuracy, without a proper evaluation of the machine learning model using different evaluation metrics, can lead to problems when the model is deployed on unseen data and may result in poor predictions.

Classification metrics are evaluation measures used to assess the performance of a classification model. Common metrics include accuracy (proportion of correct predictions), precision (true positives over total predicted positives), recall (true positives over total actual positives), F1 score (harmonic mean of precision and recall), and area under the receiver operating characteristic curve (AUC-ROC).

Confusion Matrix

A confusion matrix is a performance measurement for machine learning classification problems where the output can be two or more classes. It is a table of the combinations of predicted and actual values.

It is extremely useful for deriving recall, precision, accuracy, and the AUC-ROC curve.
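A small scikit-learn sketch with made-up labels shows the layout for the binary case.

from sklearn.metrics import confusion_matrix

# Hypothetical predicted vs. actual labels for a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))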

The four commonly used metrics for evaluating classifier performance are:

1. Accuracy: The proportion of correct predictions out of the total predictions.

2. Precision: The proportion of true positive predictions out of the total positive predictions (precision = true positives / (true positives + false positives)).

3. Recall (Sensitivity or True Positive Rate): The proportion of true positive predictions out of the total actual positive instances (recall = true positives / (true positives + false negatives)).

4. F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics (F1 score = 2 * ((precision * recall) / (precision + recall))).

These metrics help assess the classifier's effectiveness in correctly classifying instances of different classes.
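A short scikit-learn sketch, again with hypothetical labels, computes all four metrics.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))           # 2PR / (P + R)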

Understanding how well a machine learning model will perform on unseen data is the main purpose behind working with these evaluation metrics. Metrics like accuracy, precision, and recall are good ways to evaluate classification models on balanced datasets, but if the data is imbalanced, methods such as ROC/AUC do a better job of evaluating model performance.

The ROC curve isn't a single number; it is a whole curve that provides nuanced detail about the behavior of the classifier. It is also hard to quickly compare many ROC curves to one another.
