ExamGecko
Question list
Search
Search

List of questions

Search

Related questions











Question 18 - Professional Machine Learning Engineer discussion

Report
Export

Your team is working on an NLP research project to predict political affiliation of authors based on articles they have written. You have a large training dataset that is structured like this:

You followed the standard 80%-10%-10% data distribution across the training, testing, and evaluation subsets. How should you distribute the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion?

A)

B)

C)

D)

A.
Option A
Answers
A.
Option A
B.
Option B
Answers
B.
Option B
C.
Option C
Answers
C.
Option C
D.
Option D
Answers
D.
Option D
Suggested answer: C

Explanation:

The best way to distribute the training examples across the train-test-eval subsets while maintaining the 80-10-10 proportion is to use option C. This option ensures that each subset contains a balanced and representative sample of the different classes (Democrat and Republican) and the different authors. This way, the model can learn from a diverse and comprehensive set of articles and avoid overfitting or underfitting. Option C also avoids the problem of data leakage, which occurs when the same author appears in more than one subset, potentially biasing the model and inflating its performance. Therefore, option C is the most suitable technique for this use case.

asked 18/09/2024
Elliott Fields
34 questions
User
Your answer:
0 comments
Sorted by

Leave a comment first