Databricks Certified Machine Learning Associate exam dumps questions are the best material for you to test all the related Databricks exam topics. By using the Databricks Machine Learning Associate exam dumps questions and practicing your skills, you can increase your confidence and your chances of passing the Databricks Machine Learning Associate exam.
Features of Dumpsinfo’s products
Instant Download
Free Update in 3 Months
Money back guarantee
PDF and Software
24/7 Customer Support
In addition, Dumpsinfo provides unlimited access, so you can get all Dumpsinfo files at the lowest price.
Free Databricks Certified Machine Learning Associate exam dumps questions are available below for you to study.
Full version: Databricks Machine Learning Associate Exam Dumps Questions
1.A machine learning engineer is using the following code block to scale the inference of a single-
node model on a Spark DataFrame with one million records:
Assuming the default Spark configuration is in place, which of the following is a benefit of using an
Iterator?
A. The data will be limited to a single executor preventing the model from being loaded multiple times
B. The model will be limited to a single executor preventing the data from being distributed
C. The model only needs to be loaded once per executor rather than once per batch during the
inference process
D. The data will be distributed across multiple executors during the inference process
Answer: C
Explanation:
Using an iterator in the pandas_udf ensures that the model only needs to be loaded once per
executor rather than once per batch. This approach reduces the overhead associated with repeatedly
loading the model during the inference process, leading to more efficient and faster predictions. The
data will be distributed across multiple executors, but each executor will load the model only once,
optimizing the inference process.
Reference: Databricks documentation on pandas UDFs: Pandas UDFs
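For illustration, a minimal sketch of an Iterator-based pandas UDF is shown below; the load_model helper, the "features" column name, and the return type are assumptions used only to make the pattern concrete.

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

def load_model():
    # Hypothetical helper: load the single-node model (e.g. from MLflow).
    ...

@pandas_udf("double")
def predict_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = load_model()  # runs once when the UDF starts, not once per batch
    for batch in batches:
        # Each batch is a pandas Series holding a chunk of the input column.
        yield pd.Series(model.predict(batch.to_frame()))

# Usage sketch (assuming df has a "features" column):
# predictions = df.withColumn("prediction", predict_udf("features"))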
2.A team is developing guidelines on when to use various evaluation metrics for classification
problems. The team needs to provide input on when to use the F1 score over accuracy.
Which of the following suggestions should the team include in their guidelines?
A. The F1 score should be utilized over accuracy when the number of actual positive cases is
identical to the number of actual negative cases.
B. The F1 score should be utilized over accuracy when there are greater than two classes in the
target variable.
C. The F1 score should be utilized over accuracy when there is significant imbalance between
positive and negative classes and avoiding false negatives is a priority.
D. The F1 score should be utilized over accuracy when identifying true positives and true negatives
are equally important to the business problem.
Answer: C
Explanation:
The F1 score is the harmonic mean of precision and recall and is particularly useful in situations
where there is a significant imbalance between positive and negative classes. When there is a class
imbalance, accuracy can be misleading because a model can achieve high accuracy by simply
predicting the majority class. The F1 score, however, provides a better measure of the test's accuracy
in terms of both false positives and false negatives. Specifically, the F1 score should be used over
accuracy when:
There is a significant imbalance between positive and negative classes.
Avoiding false negatives is a priority, meaning recall (the ability to detect all positive instances) is
crucial.
In this scenario, the F1 score balances both precision (the ability to avoid false positives) and recall,
providing a more meaningful measure of a model’s performance under these conditions.
Reference: Databricks documentation on classification metrics: Classification Metrics
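As a quick illustration of this point, the sketch below (assuming scikit-learn is available) shows a majority-class predictor scoring high accuracy but a zero F1 score on an imbalanced dataset.

from sklearn.metrics import accuracy_score, f1_score

# 95 negatives and 5 positives; the model predicts the majority class for every record.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))             # 0.95 -- looks strong
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- no positives were detected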
3.A machine learning engineer wants to parallelize the training of group-specific models using the
Pandas Function API. They have developed the train_model function, and they want to apply it to
each group of DataFrame df.
They have written the following incomplete code block:
Which of the following pieces of code can be used to fill in the above blank to complete the task?
A. applyInPandas
B. mapInPandas
C. predict
D. train_model
E. groupedApplyIn
Answer: A
Explanation:
The Pandas Function API's applyInPandas method applies a user-defined function to each group produced by groupby, passing each group to the function as its own pandas DataFrame. By contrast, mapInPandas applies a function to each partition of a DataFrame rather than to each group, so it does not preserve the group boundaries needed here. Since the code snippet uses groupby and the intent is to run train_model once per group, applyInPandas is the correct choice: it ensures that each group generated by groupby is processed through the train_model function, preserving the grouping integrity.
Reference: PySpark documentation on applying functions to grouped data:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html
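The applyInPandas pattern can be sketched as follows; the grouping column, the toy data, and the simplified train_model body are assumptions for illustration, not the code from the original question.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical training data: one row per (device_id, reading).
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 5.0)],
    ["device_id", "reading"],
)

def train_model(pdf: pd.DataFrame) -> pd.DataFrame:
    # Receives all rows for one device_id as a pandas DataFrame;
    # a real implementation would fit a group-specific model here.
    return pd.DataFrame({"device_id": [pdf["device_id"].iloc[0]],
                         "mean_reading": [pdf["reading"].mean()]})

result = (
    df.groupBy("device_id")
      .applyInPandas(train_model, schema="device_id string, mean_reading double")
)
result.show()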
4.Which of the following describes the relationship between native Spark DataFrames and pandas
API on Spark DataFrames?
A. pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional
metadata
B. pandas API on Spark DataFrames are more performant than Spark DataFrames
C. pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata
D. pandas API on Spark DataFrames are less mutable versions of Spark DataFrames
Answer: C
Explanation:
The pandas API on Spark DataFrames are made up of Spark DataFrames with additional metadata.
The pandas API on Spark aims to provide the pandas-like experience with the scalability and
distributed nature of Spark. It allows users to work with pandas functions on large datasets by
leveraging Spark’s underlying capabilities.
Reference: Databricks documentation on pandas API on Spark: pandas API on Spark
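The relationship can be seen directly in code. The sketch below (column names are assumptions; pandas_api() requires Spark 3.2 or later) wraps a Spark DataFrame with the pandas API and unwraps it again.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

psdf = sdf.pandas_api()      # pandas-like API backed by the same Spark DataFrame
print(psdf.head())           # pandas-style call, executed by Spark underneath

sdf_again = psdf.to_spark()  # unwrap back to a native Spark DataFrame
sdf_again.show()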
5.A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline.
They then set up a cross-validation process with pipeline as the estimator in the following code block:
Which of the following is a negative consequence of including pipeline as the estimator in the cross-
validation process rather than rfr as the estimator?
A. The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model
B. The process will leak data from the training set to the test set during the evaluation phase
C. The process will be unable to parallelize tuning due to the distributed nature of pipeline
D. The process will leak data prep information from the validation sets to the training sets for each
model
Answer: A
Explanation:
Including the entire pipeline as the estimator in the cross-validation process means that all stages of
the pipeline, including data preprocessing steps like string indexing and vector assembling, will be
refit or retransformed for each fold of the cross-validation. This results in a longer runtime because
each fold requires re-execution of these preprocessing steps, which can be computationally
expensive.
If only the random forest regressor (rfr) were included as the estimator, the preprocessing steps
would be performed once, and only the model fitting would be repeated for each fold, significantly
reducing the computational overhead.
Reference: Databricks documentation on cross-validation: Cross Validation
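The two setups can be contrasted with the sketch below; the feature and label column names, the grid values, and the pipeline stages are assumptions rather than the code from the original question.

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
rfr = RandomForestRegressor(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, rfr])

grid = ParamGridBuilder().addGrid(rfr.numTrees, [20, 50]).build()
evaluator = RegressionEvaluator(labelCol="label")

# estimator=pipeline: every stage, including the assembler, is refit per fold,
# which is the longer-runtime behavior described in the answer.
cv_pipeline = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                             evaluator=evaluator, numFolds=3)

# estimator=rfr: only the model is refit per fold; preprocessing would have to
# be applied to the data once, up front.
cv_rfr = CrossValidator(estimator=rfr, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)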
6.A data scientist is using Spark SQL to import their data into a machine learning pipeline. Once the data is imported, the data scientist performs machine learning tasks using Spark ML.
Which of the following compute tools is best suited for this use case?
A. Single Node cluster
B. Standard cluster
C. SQL Warehouse
D. None of these compute tools support this task
Answer: B
Explanation:
For a data scientist using Spark SQL to import data and then performing machine learning tasks
using Spark ML, the best-suited compute tool is a Standard cluster. A Standard cluster in Databricks
provides the necessary resources and scalability to handle large datasets and perform distributed
computing tasks efficiently, making it ideal for running Spark SQL and Spark ML operations.
Reference: Databricks documentation on clusters: Clusters in Databricks
7.Which of the following tools can be used to parallelize the hyperparameter tuning process for single-
node machine learning models using a Spark cluster?
A. MLflow Experiment Tracking
B. Spark ML
C. Autoscaling clusters
D. Delta Lake
Answer: B
Explanation:
Spark ML (part of Apache Spark's MLlib) is designed to handle machine learning tasks across
multiple nodes in a cluster, effectively parallelizing tasks like hyperparameter tuning. It supports
various machine learning algorithms that can be optimized over a Spark cluster, making it suitable for
parallelizing hyperparameter tuning for single-node machine learning models when they are adapted
to run on Spark.
Reference
Apache Spark MLlib Guide: https://spark.apache.org/docs/latest/ml-guide.html
Spark ML is a library within Apache Spark designed for scalable machine learning. It provides tools to
handle large-scale machine learning tasks, including parallelizing the hyperparameter tuning process
for single-node machine learning models using a Spark cluster. Here’s a detailed explanation of how
Spark ML can be used:
Hyperparameter Tuning with CrossValidator: Spark ML includes the CrossValidator and
TrainValidationSplit classes, which are used for hyperparameter tuning. These classes can evaluate
multiple sets of hyperparameters in parallel using a Spark cluster.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define the model
model = ...

# Create a parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(model.hyperparam1, [value1, value2]) \
    .addGrid(model.hyperparam2, [value3, value4]) \
    .build()

# Define the evaluator
evaluator = BinaryClassificationEvaluator()

# Define the CrossValidator
crossval = CrossValidator(estimator=model,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)
Parallel Execution: Spark distributes the tasks of training models with different hyperparameters
across the cluster’s nodes. Each node processes a subset of the parameter grid, which allows
multiple models to be trained simultaneously.
Scalability: Spark ML leverages the distributed computing capabilities of Spark. This allows for
efficient processing of large datasets and training of models across many nodes, which speeds up the
hyperparameter tuning process significantly compared to single-node computations.
Reference
Apache Spark MLlib Documentation
Hyperparameter Tuning in Spark ML
8.A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine
Learning.
Which of the following steps will the data scientist need to perform outside of their AutoML
experiment?
A. Model tuning
B. Model evaluation
C. Model deployment
D. Exploratory data analysis
Answer: C
Explanation:
Databricks AutoML automates most stages of the model development workflow: it prepares the dataset, generates an exploratory data analysis notebook, trains and tunes candidate models, and evaluates them, producing editable notebooks for the best trials. What it does not do is deploy the resulting model. Model deployment, for example registering the model and serving it for inference, must be performed outside of the AutoML experiment, typically with the MLflow Model Registry and Model Serving.
Reference
Databricks documentation on AutoML: https://docs.databricks.com/applications/machine-learning/automl.html
9.Which statement describes a Spark ML transformer?
A. A transformer is an algorithm which can transform one DataFrame into another DataFrame
B. A transformer is a hyperparameter grid that can be used to train a model
C. A transformer chains multiple algorithms together to transform an ML workflow
D. A transformer is a learning algorithm that can use a DataFrame to train a model
Answer: A
Explanation:
In Spark ML, a transformer is an algorithm that can transform one DataFrame into another
DataFrame. It takes a DataFrame as input and produces a new DataFrame as output. This
transformation can involve adding new columns, modifying existing ones, or applying feature
transformations. Examples of transformers in Spark MLlib include feature transformers like
StringIndexer, VectorAssembler, and StandardScaler.
Reference: Databricks documentation on transformers: Transformers in Spark ML
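For illustration, the sketch below (column names and toy values are assumptions) shows a typical transformer, VectorAssembler, taking a DataFrame in and producing a new DataFrame out without learning anything from the data.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["x1", "x2"])

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
transformed = assembler.transform(df)  # DataFrame in, new DataFrame out
transformed.show()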
10.A data scientist is using the following code block to tune hyperparameters for a machine learning
model:
Which change can they make to the above code block to improve the likelihood of a more accurate
model?
A. Increase num_evals to 100
B. Change fmin() to fmax()
C. Change sparkTrials() to Trials()
D. Change tpe.suggest to random.suggest
Answer: A
Explanation:
To improve the likelihood of a more accurate model, the data scientist can increase num_evals to
100. Increasing the number of evaluations allows the hyperparameter tuning process to explore a
larger search space and evaluate more combinations of hyperparameters, which increases the
chance of finding a more optimal set of hyperparameters for the model.
Reference: Databricks documentation on hyperparameter tuning: Hyperparameter Tuning
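Because the original code block is not reproduced here, the following is only a sketch of the Hyperopt pattern the question describes, with the number of evaluations raised to 100; the objective function and search space are assumptions, and SparkTrials assumes Hyperopt with Spark support as provided on Databricks.

from hyperopt import fmin, tpe, hp, SparkTrials

def objective(params):
    # A real objective would train a model with `params` and return its loss.
    return (params["x"] - 3) ** 2

search_space = {"x": hp.uniform("x", -10, 10)}

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,                    # Tree-structured Parzen Estimator search
    max_evals=100,                       # more evaluations explore a larger search space
    trials=SparkTrials(parallelism=4),   # distribute trials across the cluster
)
print(best)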
11.Which of the following approaches can be used to view the notebook that was run to create an
MLflow run?
A. Open the MLmodel artifact in the MLflow run page
B. Click the "Models" link in the row corresponding to the run in the MLflow experiment page
C. Click the "Source" link in the row corresponding to the run in the MLflow experiment page
D. Click the "Start Time" link in the row corresponding to the run in the MLflow experiment page
Answer: C
Explanation:
To view the notebook that was run to create an MLflow run, you can click the "Source" link in the row
corresponding to the run in the MLflow experiment page. The "Source" link provides a direct
reference to the source notebook or script that initiated the run, allowing you to review the code and
methodology used in the experiment. This feature is particularly useful for reproducibility and for
understanding the context of the experiment.
Reference: MLflow Documentation (Viewing Run Sources and Notebooks).
12.A data scientist is working with a feature set with the following schema:
The customer_id column is the primary key in the feature set. Each of the columns in the feature set
has missing values. They want to replace the missing values by imputing a common value for each
feature.
Which of the following lists all of the columns in the feature set that need to be imputed using the
most common value of the column?
A. customer_id, loyalty_tier
B. loyalty_tier
C. units
D. spend
E. customer_id
Answer: B
Explanation:
For the feature set schema provided, the columns that need to be imputed using the most common
value (mode) are typically the categorical columns. In this case, loyalty_tier is the only categorical
column that should be imputed using the most common value. customer_id is a unique identifier and
should not be imputed, while spend and units are numerical columns that should typically be imputed
using the mean or median values, not the mode.
Reference: Databricks documentation on missing value imputation: Handling Missing Data
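A small pandas sketch of the contrasting strategies described above is shown below; the toy values are assumptions, while the column names follow the schema referenced in the question.

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "loyalty_tier": ["gold", None, "silver", "gold"],
    "spend": [120.0, None, 80.0, 95.0],
    "units": [3, 5, None, 2],
})

# Categorical column: impute with the most common value (mode).
df["loyalty_tier"] = df["loyalty_tier"].fillna(df["loyalty_tier"].mode()[0])

# Numerical columns: impute with the mean or median instead of the mode.
df["spend"] = df["spend"].fillna(df["spend"].mean())
df["units"] = df["units"].fillna(df["units"].median())

print(df)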