ExamGecko

Snowflake DSA-C02 Practice Test - Questions Answers, Page 5

Which of the following is a Python-based web application framework for visualizing data and analyzing results in a more efficient and flexible way?

A. StreamBI

B. Streamlit

C. Streamsets

D. Rapter
Suggested answer: B

Explanation:

Streamlit is a Python-based web application framework for visualizing data and analyzing results in a more efficient and flexible way. It is an open-source library that helps data scientists and academics develop Machine Learning (ML) visualization dashboards in a short period of time. You can build and deploy powerful data applications with just a few lines of code.

Why Streamlit?

Currently, real-world applications are in high demand, and developers keep producing new libraries and frameworks to make on-the-go dashboards easier to build and deploy. Streamlit is a library that reduces dashboard development time from days to hours. Following are some reasons to choose Streamlit:

It is a free and open-source library.

Installing Streamlit is as simple as installing any other Python package.

It is easy to learn: you won't need any web development experience; a basic understanding of Python is enough to build a data application.

It is compatible with almost all machine learning frameworks, including TensorFlow, PyTorch, and scikit-learn, and visualization libraries such as Seaborn, Altair, Plotly, and many others.

Which is the visual depiction of data through the use of graphs, plots, and informational graphics?

A. Data Interpretation

B. Data Virtualization

C. Data visualization

D. Data Mining
Suggested answer: C

Explanation:

Data visualization is the visual depiction of data through the use of graphs, plots, and informational graphics. Its practitioners use statistics and data science to convey the meaning behind data in ethical and accurate ways.

Which method is used for detecting data outliers in Machine learning?

A. Scaler

B. Z-Score

C. BOXI

D. CMIYC
Suggested answer: B

Explanation:

What are outliers?

Outliers are values that look different from the other values in the data; they can appear at both extremes of a distribution.

Reasons for outliers in data

Errors during data entry or a faulty measuring device (a faulty sensor may result in extreme readings).

Natural occurrence (salaries of junior level employees vs C-level employees)

Problems caused by outliers

Outliers in the data may cause problems during model fitting (especially for linear models).

Outliers may inflate error metrics that give higher weight to large errors (for example, mean squared error and RMSE).

The Z-score method is one of the methods for detecting outliers. It is generally used when a variable's distribution looks close to Gaussian. The Z-score is the number of standard deviations a value is away from the variable's mean.

Z-Score = (X-mean) / Standard deviation

The IQR method and box plots are further examples of methods used to detect outliers in data science.
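As a concrete illustration, both detection methods can be sketched in plain Python. This is a minimal sketch: the thresholds (3 standard deviations for the Z-score, 1.5 × IQR for the box-plot rule) are common conventions rather than fixed rules, and the sample data is invented for demonstration.

```python
from statistics import mean, stdev, quantiles

def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute Z-score exceeds `threshold`.

    Z-score = (X - mean) / standard deviation; the method assumes the
    variable's distribution is roughly Gaussian.
    """
    mu, sigma = mean(values), stdev(values)
    return [x for x in values if abs((x - mu) / sigma) > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the box-plot rule)."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]   # 95 is an obvious outlier
print(iqr_outliers(data))                     # [95]
print(zscore_outliers(data, threshold=2.5))   # [95]
# With the default threshold of 3, the single extreme value inflates the
# standard deviation enough to mask itself; a known weakness on small samples.
```

The IQR rule is more robust here because quartiles, unlike the mean and standard deviation, are barely affected by the extreme value.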

Mark the correct steps for saving the contents of a DataFrame to a Snowflake table as part of Moving Data from Spark to Snowflake?

A. Step 1. Use the PUT() method of the DataFrame to construct a DataFrameWriter. Step 2. Specify SNOWFLAKE_SOURCE_NAME using the NAME() method. Step 3. Use the dbtable option to specify the table to which data is written. Step 4. Specify the connector options using either the option() or options() method. Step 5. Use the save() method to specify the save mode for the content.

B. Step 1. Use the PUT() method of the DataFrame to construct a DataFrameWriter. Step 2. Specify SNOWFLAKE_SOURCE_NAME using the format() method. Step 3. Specify the connector options using either the option() or options() method. Step 4. Use the dbtable option to specify the table to which data is written. Step 5. Use the save() method to specify the save mode for the content.

C. Step 1. Use the write() method of the DataFrame to construct a DataFrameWriter. Step 2. Specify SNOWFLAKE_SOURCE_NAME using the format() method. Step 3. Specify the connector options using either the option() or options() method. Step 4. Use the dbtable option to specify the table to which data is written. Step 5. Use the mode() method to specify the save mode for the content.

D. Step 1. Use the writer() method of the DataFrame to construct a DataFrameWriter. Step 2. Specify SNOWFLAKE_SOURCE_NAME using the format() method. Step 3. Use the dbtable option to specify the table to which data is written. Step 4. Specify the connector options using either the option() or options() method. Step 5. Use the save() method to specify the save mode for the content.
Suggested answer: C

Explanation:

Moving Data from Spark to Snowflake

The steps for saving the contents of a DataFrame to a Snowflake table are similar to writing from Snowflake to Spark:

1. Use the write() method of the DataFrame to construct a DataFrameWriter.

2. Specify SNOWFLAKE_SOURCE_NAME using the format() method.

3. Specify the connector options using either the option() or options() method.

4. Use the dbtable option to specify the table to which data is written.

5. Use the mode() method to specify the save mode for the content.

Examples

df.write
  .format(SNOWFLAKE_SOURCE_NAME)
  .options(sfOptions)
  .option("dbtable", "t2")
  .mode(SaveMode.Overwrite)
  .save()

Select the Data Science Tools which are known to provide native connectivity to Snowflake?

A. Denodo

B. DvSUM

C. DiYotta

D. HEX
Suggested answer: D

Explanation:

Hex --- collaborative data science and analytics platform

Denodo --- data virtualization and federation platform

DvSum --- data catalog and data intelligence platform

Diyotta --- data integration and migration

Which one of the following is not the key component while designing External functions within Snowflake?

A. Remote Service

B. API Integration

C. UDF Service

D. Proxy Service
Suggested answer: C

Explanation:

What is an External Function?

An external function calls code that is executed outside Snowflake.

The remotely executed code is known as a remote service.

Information sent to a remote service is usually relayed through a proxy service.

Snowflake stores security-related external function information in an API integration.

External Function:

An external function is a type of UDF. Unlike other UDFs, an external function does not contain its own code; instead, the external function calls code that is stored and executed outside Snowflake.

Inside Snowflake, the external function is stored as a database object that contains information that Snowflake uses to call the remote service. This stored information includes the URL of the proxy service that relays information to and from the remote service.

Remote Service:

The remotely executed code is known as a remote service.

The remote service must act like a function. For example, it must return a value.

Snowflake supports scalar external functions; the remote service must return exactly one row for each row received.

Proxy Service:

Snowflake does not call a remote service directly. Instead, Snowflake calls a proxy service, which relays the data to the remote service.

The proxy service can increase security by authenticating requests to the remote service.

The proxy service can support subscription-based billing for a remote service. For example, the proxy service can verify that a caller to the remote service is a paid subscriber.

The proxy service also relays the response from the remote service back to Snowflake.

Examples of proxy services include:

Amazon API Gateway.

Microsoft Azure API Management service.

API Integration:

An integration is a Snowflake object that provides an interface between Snowflake and third-party services. An API integration stores information, such as security information, that is needed to work with a proxy service or remote service.

An API integration is created with the CREATE API INTEGRATION command.

Users can write and call their own remote services, or call remote services written by third parties. These remote services can be written using any HTTP server stack, including cloud serverless compute services such as AWS Lambda.

Which ones are the known limitations of using External function?

A. Currently, external functions cannot be shared with data consumers via Secure Data Sharing.

B. Currently, external functions must be scalar functions. A scalar external function returns a single value for each input row.

C. External functions have more overhead than internal functions (both built-in functions and internal UDFs) and usually execute more slowly.

D. An external function accessed through an AWS API Gateway private endpoint can be accessed only from a Snowflake VPC (Virtual Private Cloud) on AWS and in the same AWS region.
Suggested answer: A, B, C, D

What is the risk with tuning hyper-parameters using a test dataset?

A. Model will overfit the test set

B. Model will underfit the test set

C. Model will overfit the training set

D. Model will perform balanced
Suggested answer: A

Explanation:

The model will not generalize well to unseen data because it overfits the test set. Tuning model hyper-parameters to a test set means that the hyper-parameters may overfit to that test set. If the same test set is used to estimate performance, it will produce an overestimate. The test set should be used only for testing, not for parameter tuning.

Using a separate validation set for tuning and test set for measuring performance provides unbiased, realistic measurement of performance.

What are hyper-parameters?

Hyper-parameters are parameters whose values control the learning process and determine the values of model parameters that a learning algorithm ends up learning. We can't calculate their values from the data.

Example: Number of clusters in clustering, number of hidden layers in a neural network, and depth of a tree are some of the examples of hyper-parameters.

What is the hyper-parameter tuning?

Hyper-parameter tuning is the process of choosing the combination of hyper-parameters that maximizes model performance. It works by running multiple trials in a single training process. Each trial is a complete execution of your training application with values for your chosen hyper-parameters, set within the limits you specify. Once finished, this process gives you the set of hyper-parameter values best suited for the model to give optimal results.
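The tuning loop described above can be sketched in plain Python. This is a toy example, not a production implementation: the 1-D kNN classifier, the dataset, and the candidate values of k are all invented for illustration. The key point is that accuracy is measured on a held-out validation set, never on the test set.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training
    points (1-D Euclidean distance, for simplicity)."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def tune_k(train, valid, candidates):
    """Return the k with the best accuracy on the validation set.

    The test set is deliberately not involved here; it is reserved for
    the final, unbiased performance estimate.
    """
    def accuracy(k):
        hits = sum(knn_predict(train, x, k) == y for x, y in valid)
        return hits / len(valid)
    return max(candidates, key=accuracy)

# Invented toy data: (feature value, class label).
train = [(1.0, "a"), (1.2, "a"), (1.4, "a"), (3.0, "b"), (3.2, "b"), (3.4, "b")]
valid = [(1.1, "a"), (3.1, "b"), (1.3, "a")]
best_k = tune_k(train, valid, candidates=[1, 3, 5])
print(best_k)  # 1 (all candidates tie at 100% on this toy set; max keeps the first)
```

In practice the same pattern is applied via grid or random search over many hyper-parameters at once, but the train/validation/test separation stays the same.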

Select the correct mappings:

I) Weights or coefficients of independent variables in the linear regression model --> Model Parameter

II) K in the K-Nearest Neighbour algorithm --> Model Hyperparameter

III) Learning rate for training a neural network --> Model Hyperparameter

IV) Batch Size --> Model Parameter

A. I,II

B. I,II,III

C. III,IV

D. II,III,IV
Suggested answer: B

Explanation:

Hyperparameters in Machine learning are those parameters that are explicitly defined by the user to control the learning process. These hyperparameters are used to improve the learning of the model, and their values are set before starting the learning process of the model.

What are hyperparameters?

In Machine Learning/Deep Learning, a model is represented by its parameters. In contrast, a training process involves selecting the best/optimal hyperparameters that are used by learning algorithms to provide the best result. So, what are these hyperparameters? The answer is, 'Hyperparameters are defined as the parameters that are explicitly defined by the user to control the learning process.'

Here the prefix 'hyper' suggests that these are top-level parameters used in controlling the learning process. The value of a hyperparameter is selected and set by the machine learning engineer before the learning algorithm begins training the model. Hence, these are external to the model, and their values cannot be changed during the training process.

Some examples of Hyperparameters in Machine Learning

* The k in kNN or K-Nearest Neighbour algorithm

* Learning rate for training a neural network

* Train-test split ratio

* Batch Size

* Number of Epochs

* Branches in Decision Tree

* Number of clusters in Clustering Algorithm

Model Parameters:

Model parameters are configuration variables that are internal to the model, and a model learns them on its own. Examples include the weights or coefficients of independent variables in a linear regression model or an SVM, the weights and biases of a neural network, and the cluster centroids in clustering. Some key points for model parameters are as follows:

* They are used by the model for making predictions.

* They are learned by the model from the data itself.

* They are usually not set manually.

* They are part of the model and key to a machine learning algorithm.

Model Hyperparameters:

Hyperparameters are those parameters that are explicitly defined by the user to control the learning process. Some key points for model hyperparameters are as follows:

These are usually defined manually by the machine learning engineer.

One cannot know the exact best value of a hyperparameter for a given problem. The best value can be determined either by rules of thumb or by trial and error.

Some examples of Hyperparameters are the learning rate for training a neural network, K in the KNN algorithm.

Performance metrics are a part of every machine learning pipeline, Which ones are not the performance metrics used in the Machine learning?

A. R² (R-Squared)

B. Root Mean Squared Error (RMSE)

C. AU-ROC

D. AUM
Suggested answer: D

Explanation:

Every machine learning task can be broken down to either Regression or Classification, and the performance metrics can be grouped the same way.

Metrics are used to monitor and measure the performance of a model (during training and testing), and do not need to be differentiable.

Regression metrics

Regression models have continuous output. So, we need a metric based on calculating some sort of distance between predicted and ground truth.

In order to evaluate Regression models, we'll discuss these metrics in detail:

* Mean Absolute Error (MAE),

* Mean Squared Error (MSE),

* Root Mean Squared Error (RMSE),

* R² (R-Squared).

Mean Squared Error (MSE)

Mean squared error is perhaps the most popular metric used for regression problems. It essentially finds the average of the squared difference between the target value and the value predicted by the regression model.

Few key points related to MSE:

* It's differentiable, so it can be optimized better.

* It penalizes even small errors by squaring them, which essentially leads to an overestimation of how bad the model is.

* Error interpretation has to be done with the squaring factor (scale) in mind. For example, in our Boston Housing regression problem, we got MSE = 21.89, which corresponds to (Prices)², not Prices.

* Due to the squaring factor, it's fundamentally more prone to outliers than other metrics.

Mean Absolute Error (MAE)

Mean Absolute Error is the average of the difference between the ground truth and the predicted values.

Few key points for MAE

* It's more robust towards outliers than MSE, since it doesn't exaggerate errors.

* It gives us a measure of how far the predictions were from the actual output. However, since MAE uses absolute value of the residual, it doesn't give us an idea of the direction of the error, i.e. whether we're under-predicting or over-predicting the data.

* Error interpretation needs no second thoughts, as it perfectly aligns with the original degree of the variable.

* MAE is non-differentiable as opposed to MSE, which is differentiable.

Root Mean Squared Error (RMSE)

Root Mean Squared Error corresponds to the square root of the average of the squared difference between the target value and the value predicted by the regression model.

Few key points related to RMSE:

* It retains the differentiable property of MSE.

* Taking the square root moderates MSE's heavy penalization of errors, bringing the metric back to the scale of the target.

* Error interpretation can be done smoothly, since the scale is now the same as the random variable.

* Since scale factors are essentially normalized, it's less prone to struggle in the case of outliers.

R² (Coefficient of determination)

R², the coefficient of determination, actually works as a post metric, meaning it's a metric that's calculated using other metrics.

The point of even calculating this coefficient is to answer the question "How much (what %) of the total variation in Y (target) is explained by the variation in X (regression line)?"

Few intuitions related to R² results:

If the sum of squared errors of the regression line is small, R² will be close to 1 (ideal), meaning the regression was able to capture almost all of the variance in the target variable.

Conversely, if the sum of squared errors of the regression line is high, R² will be close to 0, meaning the regression wasn't able to capture any variance in the target variable.

You might think that the range of R² is (0, 1), but it's actually (-∞, 1], because the ratio of the squared error of the regression line to the squared error of the mean can surpass 1 if the regression line's squared error is high enough (greater than the squared error of the mean).
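All four regression metrics discussed above can be computed from scratch in a few lines of Python. A minimal sketch with invented numbers (real projects would typically use a library such as scikit-learn instead):

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R-squared from scratch."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n          # robust to outliers
    mse = sum(r * r for r in residuals) / n           # penalizes large errors
    rmse = sqrt(mse)                                  # back on the target's scale
    mean_y = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)            # squared error of the regression
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # squared error of the mean
    r2 = 1 - ss_res / ss_tot                          # drops below 0 when ss_res > ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
print(metrics["MAE"], metrics["MSE"], metrics["R2"])  # 0.25 0.125 0.975
```

Note how the R² line makes the (-∞, 1] range visible: if ss_res exceeds ss_tot, the ratio exceeds 1 and R² goes negative.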

Classification metrics

Classification problems are one of the world's most widely researched areas. Use cases are present in almost all production and industrial environments. Speech recognition, face recognition, text classification -- the list is endless.

Classification models have discrete output, so we need a metric that compares discrete classes in some form. Classification Metrics evaluate a model's performance and tell you how good or bad the classification is, but each of them evaluates it in a different way.

So in order to evaluate Classification models, we'll discuss these metrics in detail:

Accuracy

Confusion Matrix (not a metric but fundamental to others)

Precision and Recall

F1-score

AU-ROC

Accuracy

Classification accuracy is perhaps the simplest metric to use and implement and is defined as the number of correct predictions divided by the total number of predictions, multiplied by 100.

We can implement this by comparing ground truth and predicted values in a loop or simply utilizing the scikit-learn module to do the heavy lifting for us (not so heavy in this case).

Confusion Matrix

Confusion Matrix is a tabular visualization of the ground-truth labels versus model predictions. Each row of the confusion matrix represents the instances in a predicted class and each column represents the instances in an actual class. Confusion Matrix is not exactly a performance metric but sort of a basis on which other metrics evaluate the results.

Each cell in the confusion matrix represents an evaluation factor. Let's understand these factors one by one:

* True Positive(TP) signifies how many positive class samples your model predicted correctly.

* True Negative(TN) signifies how many negative class samples your model predicted correctly.

* False Positive(FP) signifies how many negative class samples your model predicted incorrectly. This factor represents Type-I error in statistical nomenclature. This error positioning in the confusion matrix depends on the choice of the null hypothesis.

* False Negative(FN) signifies how many positive class samples your model predicted incorrectly. This factor represents Type-II error in statistical nomenclature. This error positioning in the confusion matrix also depends on the choice of the null hypothesis.

Precision

Precision is the ratio of true positives to the total number of predicted positives.

Recall/Sensitivity/Hit-Rate

A Recall is essentially the ratio of true positives to all the positives in ground truth.

Precision-Recall tradeoff

To improve your model, you can prioritize either precision or recall, but improving one typically comes at the cost of the other. If you try to reduce cases of non-cancerous patients being labeled as cancerous (FP/Type-I errors), this has no direct effect on cancerous patients being labeled as non-cancerous (FN/Type-II errors).

F1-score

The F1-score metric uses a combination of precision and recall. In fact, the F1 score is the harmonic mean of the two.
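The classification metrics above (accuracy, precision, recall, and F1) can all be derived from the four confusion-matrix cells. A minimal Python sketch for the binary case, with invented labels for demonstration:

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Derive accuracy, precision, recall, and F1 from the four
    confusion-matrix cells (binary classification)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)  # Type-I error
    fn = sum(t == positive and p != positive for t, p in pairs)  # Type-II error
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are real
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Invented example: 2 TP, 2 TN, 1 FP, 1 FN.
m = classification_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(m["precision"], m["recall"])  # both 2/3
```

The zero-division guards matter in practice: a model that never predicts the positive class has an undefined precision, which this sketch reports as 0.0.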

AUROC (Area under Receiver operating characteristics curve)

Better known as the AUC-ROC score/curve. It makes use of true positive rates (TPR) and false positive rates (FPR).

Total 65 questions