Text Material Preview
Databricks Certified Data Engineer Professional
Exam Name: Databricks Certified Data Engineer Professional Exam
Full version: 337 Q&As
Full version of Databricks Certified Data Engineer Professional Dumps:
https://www.certqueen.com/Databricks-Certified-Data-Engineer-Professional.html
Share some Databricks Certified Data Engineer Professional exam dumps below.
1. UNION
2. Which of the following is true, when building a Databricks SQL dashboard?
A. A dashboard can only use results from one query
B. Only one visualization can be developed with one query result
C. A dashboard can only connect to one schema/Database
D. More than one visualization can be developed using a single query result
E. A dashboard can only have one refresh schedule
Answer: D
Explanation:
The answer is: More than one visualization can be developed using a single query result. In the SQL query editor pane, the + Add visualization tab can be used to create many visualizations from a single query result.
3. INNER JOIN CUSTOMERS_2020 C2
4. You notice that a job cluster is taking 6 to 8 minutes to start, which is delaying your job from finishing on time. What steps can you take to reduce the cluster startup time?
A. Set up a second job ahead of the first job to start the cluster, so the cluster is ready with resources when the first job starts
B. Use an all-purpose cluster instead to reduce cluster startup time
C. Reduce the size of the cluster; the smaller the cluster size, the shorter it takes to start the cluster
D. Use cluster pools to reduce the startup time of the jobs
E. Use SQL endpoints to reduce the startup time
Answer: D
Explanation:
The answer is: Use cluster pools to reduce the startup time of the jobs.
Cluster pools allow you to reserve VMs ahead of time; when a new job cluster is created, VMs are grabbed from the pool. Note: while the VMs are idle in the pool waiting to be used by a cluster, the only cost incurred is the cloud provider (e.g., Azure) cost; the Databricks runtime cost is only billed once a VM is allocated to a cluster.
Here is a demo of how to set this up along with some best practices:
https://www.youtube.com/watch?v=FVtITxOabxg&ab_channel=DatabricksAcademy
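For reference, a minimal sketch (not from the original material) of how a job's new-cluster spec can draw from a pool via the Jobs API; the runtime version, pool ID, and worker count are hypothetical placeholders.
# Hypothetical job-cluster spec that grabs VMs from an existing cluster pool.
# "instance_pool_id" must point at a pool you have already created; the value here is a placeholder.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",   # example runtime version
    "instance_pool_id": "pool-1234-abcd",  # placeholder pool ID
    "num_workers": 4,
}
# This dict would be passed as the "new_cluster" field of a task when creating the job
# (for example via the Jobs API or the Databricks SDK).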
5. FROM json.`/path/to/json/file.json`;
The data engineer asks a colleague for help to convert this query for use in a Delta Live Tables
(DLT) pipeline. The query should create the first table in the DLT pipeline.
Which of the following describes the change the colleague needs to make to the query?
A. They need to add a CREATE LIVE TABLE table_name AS line at the beginning of the query
B. They need to add the cloud_files(...) wrapper to the JSON file path
C. They need to add a CREATE DELTA LIVE TABLE table_name AS line at the beginning of
the query
D. They need to add a live. prefix prior to json. in the FROM line
E. They need to add a COMMENT line at the beginning of the query
Answer: A
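As a side note (not part of the question), a rough Python-API sketch of the same first DLT table, with an illustrative table name:
import dlt

@dlt.table(name="raw_json_data")   # illustrative table name
def raw_json_data():
    # Reads the same JSON file the ad-hoc SELECT pointed at and registers it
    # as the first (bronze) table of the pipeline.
    return spark.read.format("json").load("/path/to/json/file.json")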
6. Which of the following developer operations in a CI/CD flow can be implemented in Databricks Repos?
A. Merge when code is committed
B. Pull request and review process
C. Trigger Databricks Repos API to pull the latest version of code into production folder
D. Resolve merge conflicts
E. Delete a branch
Answer: C
Explanation:
See the diagram below to understand the roles Databricks Repos and the Git provider play when building a CI/CD workflow.
All the steps highlighted in yellow can be done in Databricks Repos; all the steps highlighted in gray are done in a Git provider like GitHub or Azure DevOps.
7. You worked with the data analyst team to set up a SQL Endpoint (SQL warehouse) so they can easily query and analyze data in the gold layer. Once they started consuming the SQL Endpoint (SQL warehouse), you noticed that during peak hours, as the number of users increases, queries are taking longer to finish. Which of the following steps can be taken to resolve the issue?
*Please note: Databricks recently renamed SQL endpoint to SQL warehouse.
A. They can turn on the Serverless feature for the SQL endpoint (SQL warehouse).
B. They can increase the maximum bound of the SQL endpoint (SQL warehouse)'s scaling range.
C. They can increase the cluster size from 2X-Small to 4X-Large of the SQL endpoint (SQL warehouse).
D. They can turn on the Auto Stop feature for the SQL endpoint (SQL warehouse).
E. They can turn on the Serverless feature for the SQL endpoint (SQL warehouse) and change the Spot Instance Policy from "Cost optimized" to "Reliability Optimized."
Answer: B
Explanation:
The answer is: They can increase the maximum bound of the SQL endpoint's scaling range. When you increase the maximum bound, more clusters can be added to the warehouse, which can then run the additional queries that are waiting in the queue. Focus on the explanation below that talks about scale-out.
The question is testing your ability to scale a SQL Endpoint (SQL Warehouse); you have to look for cue words and understand whether the queries are running sequentially or concurrently. If the queries are running sequentially, scale up (increase the cluster size from 2X-Small up to 4X-Large); if the queries are running concurrently or with more users, scale out (add more clusters).
SQL Endpoint (SQL Warehouse) overview: (please read all of the points below and review the diagram to understand)
8. get_source_dataframe(tablename):
9. #Execute code
10. .format("delta")
11. USING SQL
12. SELECT count(*) FROM my_table VERSION AS OF 5238
13. ELSE (temp - 33) * 5/9 END
Answer: D
Explanation:
The answer is
14. FROM raw_table;
E. 1. SELECT cart_id, explode(items) AS item_id
15. query = f"select * from {schema_name}.{table_name}"
D. 1.table_name = "sales"
16. LOCATION DELTA
Answer: D
Explanation:
Answer is
17. CONSTRAINT valid_timestamp EXPECT (timestamp > '2012-01-01')
Drop invalid records:
Use the expect or drop operator to prevent the processing of invalid records. Records that
violate the expectation are dropped from the target dataset:
Python
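(The Python snippet referenced above is not reproduced; below is a minimal sketch, assuming an upstream dataset named raw_events and a target table named valid_events.)
import dlt

@dlt.table(name="valid_events")                                      # illustrative table name
@dlt.expect_or_drop("valid_timestamp", "timestamp > '2012-01-01'")   # rows violating this are dropped
def valid_events():
    return dlt.read_stream("raw_events")                             # illustrative upstream dataset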
18. A data engineering team has created a series of tables using Parquet data stored in an external system. The team is noticing that after appending new rows to the data in the external system, their queries within Databricks are not returning the new rows. They identify the caching of the previous data as the cause of this issue.
Which of the following approaches will ensure that the data returned by queries is always up-to-date?
A. The tables should be updated before the next query is run
B. The tables should be converted to the Delta format
C. The tables should be refreshed in the writing cluster before the next query is run
D. The tables should be altered to include metadata to not cache
E. The tables should be stored in a cloud-based external system
Answer: B
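Answer B points at converting the tables to Delta; a minimal sketch of the conversion (table and path names are illustrative, and the PARTITIONED BY clause applies only if the directory is partitioned):
# Convert a Parquet table registered in the metastore ...
spark.sql("CONVERT TO DELTA sales_db.parquet_events")
# ... or convert a raw Parquet directory in place
spark.sql("CONVERT TO DELTA parquet.`/mnt/raw/events` PARTITIONED BY (event_date DATE)")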
19. A particular job seems to be performing slower and slower over time. The team thinks this started to happen when a recent production change was implemented. You were asked to take a look at the job history to see if you can identify trends and the root cause. Where in the workspace UI can you perform this analysis?
A. Under the Jobs UI, select the job you are interested in; under Runs you can see current active runs and the last 60 days of historical runs
B. Under the Jobs UI, select the job cluster; under the Spark UI, select the application job logs, then you can access the last 60 days of historical runs
C. Under Workspace logs, select Job logs and select the job you want to monitor to view the last 60 days of historical runs
D. Under the Compute UI, select Job cluster and select the job cluster to see the last 60 days of historical runs
E. Historical job runs can only be accessed by the REST API
E. Historical job runs can only be accessed by REST API
Answer: A
Explanation:
The answer is: Under the Jobs UI, select the job you are interested in; under Runs you can see current active runs and the last 60 days of historical runs.
20. Which of the following commands results in the successful creation of a view on top of the
delta stream (stream on delta table)?
A. Spark.read.format("delta").table("sales").createOrReplaceTempView("streaming_vw")
B. Spark.readStream.format("delta").table("sales").createOrReplaceTempView("streaming_vw")
C. Spark.read.format("delta").table("sales").mode("stream").createOrReplaceTempView("streaming_vw")
D. Spark.read.format("delta").table("sales").trigger("stream").createOrReplaceTempView("streaming_vw")
E. Spark.read.format("delta").stream("sales").createOrReplaceTempView("streaming_vw")
F. You cannot create a view on a streaming data source.
Answer: B
Explanation:
The answer is
spark.readStream.table("sales").createOrReplaceTempView("streaming_vw")
When you load a Delta table as a stream source and use it in a streaming query, the query processes all of the data present in the table as well as any new data that arrives after the stream is started.
You can load both paths and tables as a stream; you also have the ability to ignore deletes and changes (updates, merges, overwrites) on the Delta table. Here is more information:
https://docs.databricks.com/delta/delta-streaming.html#delta-table-as-a-source
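A minimal runnable sketch of the pattern in option B (assuming a Delta table named sales already exists):
# Load the Delta table as a streaming source and expose it as a temporary view
spark.readStream.format("delta").table("sales").createOrReplaceTempView("streaming_vw")

# The view can now be queried with streaming semantics, e.g.
streaming_df = spark.sql("SELECT count(*) AS order_count FROM streaming_vw")
query = (streaming_df.writeStream
         .format("memory")            # in-memory sink, handy for quick testing
         .queryName("sales_counts")
         .outputMode("complete")
         .start())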
21. "temp":[25,28,49,54,38,25]
22. Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?
A. spark.sql.files.maxPartitionBytes
B. spark.sql.autoBroadcastJoinThreshold
C. spark.sql.files.openCostInBytes
D. spark.sql.adaptive.coalescePartitions.minPartitionNum
E. spark.sql.adaptive.advisoryPartitionSizeInBytes
Answer: A
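A small sketch of adjusting this setting; 64 MB is just an illustrative value (the default is 128 MB), and the path is a placeholder.
# Lower the max bytes packed into a single partition when reading files,
# which produces more (smaller) Spark partitions on ingestion.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))  # 64 MB, illustrative
df = spark.read.format("parquet").load("/mnt/raw/events")                   # placeholder path
print(df.rdd.getNumPartitions())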
23. Which of the following SQL statements can be used to query a table while eliminating duplicate rows from the query results?
A. SELECT DISTINCT * FROM table_name
B. SELECT DISTINCT * FROM table_name HAVING COUNT(*) > 1
C. SELECT DISTINCT_ROWS (*) FROM table_name
D. SELECT * FROM table_name GROUP BY * HAVING COUNT(*) < 1
E. SELECT * FROM table_name GROUP BY * HAVING COUNT(*) > 1
Answer: A
Explanation:
The answer is SELECT DISTINCT * FROM table_name
24. The research team has put together a funnel analysis query to monitor the customer traffic on the e-commerce platform. The query takes about 30 minutes to run on a small SQL endpoint cluster with max scaling set to 1 cluster.
What steps can be taken to improve the performance of the query?
A. They can turn on the Serverless feature for the SQL endpoint.
B. They can increase the maximum bound of the SQL endpoint's scaling range anywhere from 1 to 100 to review the performance and select the size that meets the required SLA.
C. They can increase the cluster size anywhere from X-Small to 3XL to review the performance and select the size that meets the required SLA.
D. They can turn off the Auto Stop feature for the SQL endpoint to more than 30 mins.
E. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy from "Cost optimized" to "Reliability Optimized."
Answer: C
Explanation:
The answer is: They can increase the cluster size (scale up), anywhere from 2X-Small to 4X-Large, to review the performance and select the size that meets your SLA. If you are trying to improve the performance of a single query, additional memory and additional worker nodes mean that more tasks can run in the cluster, which will improve the performance of that query.
The question is testing your ability to scale a SQL Endpoint (SQL Warehouse); you have to look for cue words and understand whether the queries are running sequentially or concurrently. If the queries are running sequentially, scale up (increase the cluster size from 2X-Small up to 4X-Large); if the queries are running concurrently or with more users, scale out (add more clusters).
SQL Endpoint (SQL Warehouse) overview: (please read all of the points below and review the diagram to understand)
25. AS SELECT * FROM table
A. INSERT OVERWRITE replaces data by default; CREATE OR REPLACE replaces data and schema by default
B. INSERT OVERWRITE replaces data and schema by default; CREATE OR REPLACE replaces data by default
C. INSERT OVERWRITE maintains historical data versions by default; CREATE OR REPLACE clears the historical data versions by default
D. INSERT OVERWRITE clears historical data versions by default; CREATE OR REPLACE maintains the historical data versions by default
E. Both are the same and result in identical outcomes
Answer: A
Explanation:
The main difference between INSERT OVERWRITE and CREATE OR REPLACE TABLE (CRAS) is that CRAS can modify the schema of the table, i.e., it can add new columns or change the data types of existing columns. By default, INSERT OVERWRITE only overwrites the data.
INSERT OVERWRITE can also be used to overwrite the schema, but only when spark.databricks.delta.schema.autoMerge.enabled is set to true; if this option is not enabled and there is a schema mismatch, the command will fail.
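A small sketch contrasting the two statements (table and column names are illustrative):
# CREATE OR REPLACE TABLE rewrites both the data and the schema of the target
spark.sql("""
  CREATE OR REPLACE TABLE sales_summary
  AS SELECT region, amount, CURRENT_DATE() AS load_date FROM sales_staging
""")

# INSERT OVERWRITE replaces only the data; the target schema must already match
spark.sql("""
  INSERT OVERWRITE TABLE sales_summary
  SELECT region, amount, CURRENT_DATE() AS load_date FROM sales_staging
""")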
26. When investigating a data issue you realized that a process accidentally updated the table. You want to query the same table with yesterday's version of the data so you can review what the prior version looks like. What is the best way to query historical data so you can do your analysis?
A. SELECT * FROM TIME_TRAVEL(table_name) WHERE time_stamp = 'timestamp'
B. TIME_TRAVEL FROM table_name WHERE time_stamp = date_sub(current_date(), 1)
C. SELECT * FROM table_name TIMESTAMP AS OF date_sub(current_date(), 1)
D. DESCRIBE HISTORY table_name AS OF date_sub(current_date(), 1)
E. SHOW HISTORY table_name AS OF date_sub(current_date(), 1)
Answer: C
Explanation:
The answer is SELECT * FROM table_name TIMESTAMP AS OF date_sub(current_date(), 1).
FYI, time travel supports two ways: one using a timestamp and the other using a version number.
Timestamp:
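(The original snippets are not reproduced here; a brief sketch of both forms, with an illustrative table name:)
# Time travel using a timestamp (yesterday's version)
spark.sql("SELECT * FROM sales_table TIMESTAMP AS OF date_sub(current_date(), 1)")

# Time travel using a version number
spark.sql("SELECT * FROM sales_table VERSION AS OF 5238")

# The DataFrame reader supports the same idea via options (on recent runtimes)
df = spark.read.format("delta").option("versionAsOf", 5238).table("sales_table")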
27. .withColumn("avg_price", col("sales") / col("units"))
28. AS
29. You have AUTO LOADER processing millions of files a day and noticed slowness in the load process, so you scaled up the Databricks cluster, but the performance of Auto Loader is still not improving. What is the best way to resolve this?
A. AUTO LOADER is not suitable to process millions of files a day
B. Setup a second AUTO LOADER process to process the data
C. Increase the maxFilesPerTrigger option to a sufficiently high number
D. Copy the data from cloud storage to local disk on the cluster for faster access
E. Merge files to one large file
Answer: C
Explanation:
The default value of maxFilesPerTrigger is 1000; it can be increased to a much higher number, but that will require much larger compute to process the files.
https://docs.databricks.com/ingestion/auto-loader/options.html
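A hedged sketch of where the option is set (format and paths are placeholders):
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")                     # placeholder source format
          .option("cloudFiles.maxFilesPerTrigger", "10000")        # raised from the default of 1000
          .option("cloudFiles.schemaLocation", "/mnt/chk/schema")  # placeholder path
          .load("/mnt/landing/events"))                            # placeholder landing path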
30. pass
Answer: A
Explanation:
The answer is,
31. You are trying to calculate the total sales made by all the employees by parsing a complex struct data type that stores employee and sales data. How would you approach this in SQL?
Table definition: batchId INT, performance ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>, insertDate TIMESTAMP
Sample data of the performance column:
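(The sample data is not reproduced here; as a hedged sketch of the general approach, assuming the table is named sales_batches: explode the performance array, then sum the sales field of each struct.)
total_sales = spark.sql("""
  SELECT sum(emp.sales) AS total_sales
  FROM (
    SELECT explode(performance) AS emp
    FROM sales_batches            -- illustrative table name
  ) AS exploded
""")
total_sales.show()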
32. A junior data engineer has been asked to develop a streaming data pipeline with a grouped
aggregation using DataFrame df. The pipeline needs to calculate the average humidity and
average temperature for each non-overlapping five-minute interval. Events are recorded once
per minute per device.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
Choose the response that correctly fills in the blank within the code block to complete this task.
A. to_interval("event_time", "5 minutes").alias("time")
B. window("event_time", "5 minutes").alias("time")
C. "event_time"
D. window("event_time", "10 minutes").alias("time")
E. lag("event_time", "10 minutes").alias("time")
Answer: B
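A minimal sketch of the completed aggregation, using the column names from the stated schema:
from pyspark.sql.functions import window, avg

agg_df = (df.groupBy(window("event_time", "5 minutes").alias("time"))
            .agg(avg("humidity").alias("avg_humidity"),
                 avg("temp").alias("avg_temp")))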
33. The operations team is interested in monitoring the recently launched product; the team wants to set up an email alert when the number of units sold increases by more than 10,000 units. They want to monitor this every 5 minutes.
Fill in the blanks below to finish the steps we need to take:
• Create ___ query that calculates total units sold
• Setup ____ with query on trigger condition Units Sold > 10,000
• Setup ____ to run every 5 mins
• Add destination ______
A. Python, Job, SQL Cluster, email address
B. SQL, Alert, Refresh, email address
C. SQL, Job, SQL Cluster, email address
D. SQL, Job, Refresh, email address
E. Python, Job, Refresh, email address
Answer: B
Explanation:
The answer is: SQL, Alert, Refresh, email address.
Here are the steps from the Databricks documentation:
Create an alert
Follow these steps to create an alert on a single column of a query.
34. At the end of the inventory process, a file gets uploaded to cloud object storage. You are asked to build a process to ingest the data. Which of the following methods can be used to ingest the data incrementally? The schema of the file is expected to change over time, and the ingestion process should be able to handle these changes automatically. Below is the Auto Loader command to load the data; fill in the blanks for successful execution of the code below.
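(The blanks themselves are not reproduced here; purely as a general illustration, with placeholder paths, format, and table name, an incremental Auto Loader ingest that tracks and evolves the schema automatically looks roughly like this:)
query = (spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "csv")                                # placeholder format
         .option("cloudFiles.schemaLocation", "/mnt/chk/inventory_schema")  # where the inferred schema is tracked
         .option("cloudFiles.schemaEvolutionMode", "addNewColumns")         # handle new columns automatically
         .load("/mnt/landing/inventory")                                    # placeholder landing path
         .writeStream
         .option("checkpointLocation", "/mnt/chk/inventory")                # placeholder checkpoint path
         .option("mergeSchema", "true")                                     # let the sink table evolve too
         .trigger(availableNow=True)
         .table("bronze_inventory"))                                        # placeholder target table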
35. CREATE TABLE <jdbcTable>
36. SELECT * FROM CUSTOMERS_2020
C. 1. SELECT * FROM CUSTOMERS_2021 C1
37. table_name = "sales"
38. item_id STRING
Which of the following commands should the junior data engineer run to complete this task?
A. 1. SELECT cart_id, flatten(items) AS item_id
39. CREATE VIEW sales_redacted AS
40. table("uncleanedSales")
Answer: B
Explanation:
The answer is
41. Data quality and governance
a. Difficult to monitor and enforce data quality
b. Impossible to trace data lineage
42. There are 5000 differently colored balls, out of which 1200 are pink.
What is the maximum likelihood estimate for the proportion of "pink" items in the test set of colored balls?
A. 2.4
B. 24.0
C. .24
D. .48
E. 4.8
Answer: C
Explanation:
Given no additional information, the MLE for the probability of an item in the test set is exactly
its frequency in the training set. The method of maximum likelihood corresponds to many well-
known estimation methods in statistics. For example, one may be interested in the heights of
adult female penguins, but be unable to measure the height of every single penguin in a
population due to cost or time constraints. Assuming that the heights are normally (Gaussian)
distributed with some unknown mean and variance, the mean and variance can be estimated
with MLE while only knowing the heights of some sample of the overall population. MLE would
accomplish this by taking the mean and variance as parameters and finding particular
parametric values that make the observed results the most probable (given the model).
In general, for a fixed set of data and underlying statistical model the method of maximum
likelihood selects the set of values of the model parameters that maximizes the likelihood
function. Intuitively, this maximizes the "agreement" of the selected model with the observed
data, and for discrete random variables it indeed maximizes the probability of the observed data
under the resulting distribution. Maximum-likelihood estimation gives a unified approach to
estimation, which is well-defined in the case of the normal distribution and many other problems.
However in some complicated problems, difficulties do occur: in such problems, maximum-
likelihood estimators are unsuitable or do not exist.
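A quick worked check of the arithmetic:
# MLE for the proportion of pink balls = observed frequency
pink, total = 1200, 5000
mle = pink / total
print(mle)   # 0.24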
43. from_json('[{ "employeeId":1235,"sales" : 10500 },{ "employeeId":3233,"sales" : 32000 }]',
44. SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
45. format("cloudfiles")
46. option("checkpointLocation", checkpointPath)
47. where on_hand = 0
A. Turn on the Serverless feature for the SQL endpoint.
B. Increase the maximum bound of the SQL endpoint’s scaling range.
C. Increase the cluster size of the SQL endpoint.
D. Turn on the Auto Stop feature for the SQL endpoint.
E. Turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized."
Answer: C
Explanation:
The answer is to increase the cluster size of the SQL Endpoint. Here the queries are running sequentially, and since a single query cannot span more than one cluster, adding more clusters won't improve the query; rather, increasing the cluster size will improve performance so the query can use additional compute in the warehouse.
In the exam, please note that additional context will not be given; instead you have to look for cue words and understand whether the queries are running sequentially or concurrently. If the queries are running sequentially, scale up (more nodes); if the queries are running concurrently (more users), scale out (more clusters).
Below is the snippet from Azure; as you can see, by increasing the cluster size you are able to add more worker nodes.
A SQL endpoint scales horizontally (scale-out) and vertically (scale-up); you have to understand when to use which.
Scale-up -> Increase the size of the cluster, from X-Small to Small, to Medium, to X-Large, and so on. If you are trying to improve the performance of a single query, having additional memory, nodes, and CPU in the cluster will improve the performance.
Scale-out -> Add more clusters (change the maximum number of clusters). If you are trying to improve throughput, being able to run as many queries as possible, then having additional cluster(s) will improve the performance.
48. SELECT country,
49. Which of the following statements describes Delta Lake?
A. Delta Lake is an open source platform to help manage the complete machine learning
lifecycle
B. Delta Lake is an open format storage layer that delivers reliability, security, and performance
C. Delta Lake is an open source data storage format for distributed data
D. Delta Lake is an open source analytics engine used for big data workloads
E. Delta Lake is an open format storage layer that processes data
Answer: B
50. Which of the below SQL commands creates a session scoped temporary view?
A. 1. CREATE OR REPLACE TEMPORARY VIEW view_name
51. Create a schema called bronze using location ‘/mnt/delta/bronze’, and check if the schema
exists before creating.
A. CREATE SCHEMA IF NOT EXISTS bronze LOCATION '/mnt/delta/bronze'
B. CREATE SCHEMA bronze IF NOT EXISTS LOCATION '/mnt/delta/bronze'
C. if IS_SCHEMA('bronze'): CREATE SCHEMA bronze LOCATION '/mnt/delta/bronze'
D. Schema creation is not available in metastore, it can only be done in Unity catalog UI
E. Cannot create schema without a database
Answer: A
Explanation:
https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-schema.html
52. Which of the following scenarios is the best fit for AUTO LOADER?
A. Efficiently process new data incrementally from cloud object storage
B. Efficiently move data incrementally from one delta table to another delta table
C. Incrementally process new data from streaming data sources like Kafka into delta lake
D. Incrementally process new data from relational databases like MySQL
E. Efficiently copy data from one data lake location to another data lake location
Answer: A
Explanation:
The answer is: Efficiently process new data incrementally from cloud object storage. AUTO LOADER only supports ingesting files stored in cloud object storage. Auto Loader cannot process streaming data sources like Kafka or Delta streams; use Structured Streaming for these data sources.
Auto Loader and Cloud Storage Integration
Auto Loader supports a couple of ways to ingest data incrementally
53.
54. unitsSold int)
E. 1. CREATE TABLE transactions (
55. A junior data engineer on your team has implemented the following code block.
The view new_events contains a batch of records with the same schema as the events Delta
table. The event_id field serves as a unique key for this table.
When this query is executed, what will happen with new records that have the same event_id as
an existing record?
A. They are merged.
B. They are ignored.
C. They are updated.
D. They are inserted.
E. They are deleted.
Answer: B
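(The code block itself is not reproduced here; answer B corresponds to the common insert-only merge pattern, sketched below with the names from the question:)
spark.sql("""
  MERGE INTO events
  USING new_events
  ON events.event_id = new_events.event_id
  WHEN NOT MATCHED THEN INSERT *
""")
# Because there is no WHEN MATCHED clause, incoming records whose event_id already exists are ignored.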
56. spark.sql("select * from table_name")
B. 1.%sql
57. Which of the following sections in the UI can be used to manage permissions and grants to tables?
A. User Settings
B. Admin UI
C. Workspace admin settings
D. User access control lists
E. Data Explorer
Answer: E
Explanation:
The answer is Data Explorer
58. query = f"select * from + schema_name +"."+table_name"
Answer: C
Explanation:
Answer is
table_name = "sales"
query = f"select * from {schema_name}.{table_name}"
f-strings can be used to format a string, e.g. f"This is a string {python_variable}".
https://realpython.com/python-f-strings/
59. as (
60. The data engineering team has configured a Databricks SQL query and alert to monitor the
values in a Delta Lake table. The recent_sensor_recordings table contains an identifying
sensor_id alongside the timestamp and temperature for the most recent 5 minutes of
recordings.
The below query is used to create the alert:
The query is set to refresh each minute and always completes in less than 10 seconds. The
alert is set to trigger when mean (temperature) > 120. Notifications are triggered to be sent at
most every 1 minute.
If this alert raises notifications for 3 consecutive minutes and then stops, which statement must
be true?
A. The total average temperature across all sensors exceeded 120 on three consecutive
executions of the query
B. The recent_sensor_recordings table was unresponsive for three consecutive runs of the
query
C. The source query failed to update properly for three consecutive minutes and then restarted
D. The maximum temperature recording for at least one sensor exceeded 120 on three
consecutive executions of the query
E. The average temperature recordings for at least one sensor exceeded 120 on three
consecutive executions of the query
Answer: E
61. ' as raw
62. .outputMode("complete")
63. outputMode("complete")
64. format("cloudfiles")
65. A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE.
Three datasets are defined against Delta Lake table sources using LIVE TABLE. The pipeline is configured to run in Development mode using the Triggered Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected
outcome after clicking Start to update the pipeline?
A. All datasets will be updated once and the pipeline will shut down. The compute resources will
persist to allow for additional testing
B. All datasets will be updated once and the pipeline will shut down. The compute resources will
be terminated
C. All datasets will be updated continuously and the pipeline will not shut down. The compute
resources will persist with the pipeline
D. All datasets will be updated at set intervals until the pipeline is shut down. The compute
resources will persist after the pipeline is stopped to allow for additional testing
E. All datasets will be updated at set intervals until the pipeline is shut down. The compute
resources will be deployed for the update and terminated when the pipeline is stopped
Answer: A
66. END
B. 1.CREATE UDF FUNCTION udf_convert(temp DOUBLE, measure STRING)
67. url = "jdbc:sqlite:/sqmple_db",
68. org.apache.spark.sql.parquet OPTIONS (PATH "storage-path");
B. 1. CREATE TABLE my_table (id STRING, value STRING) USING DBFS;
C. 1. CREATE TABLE my_table (id STRING, value STRING) USING
69. {"date":"01-03-2021",
70. outputMode("____")
71. {TIMESTAMP AS OF timestamp_expression |
72. @dlt.expect_or_drop("valid_current_page", "current_page_id IS NOT NULL AND current_page_title IS NOT NULL")
SQL
73. pass
E. 1. if department is None:
74. Which of the following locations in Databricks product architecture hosts jobs/pipelines and
queries?
A. Data plane
B. Control plane
C. Databricks Filesystem
D. JDBC data source
E. Databricks web application
Answer: B
Explanation:
The answer is Control plane.
Databricks operates most of its services out of a control plane and a data plane. Please note that serverless features like SQL Endpoints and DLT compute use shared compute in the control plane.
Control plane: stored in the Databricks cloud account
• The control plane includes the backend services that Databricks manages in its own Azure account. Notebook commands and many other workspace configurations are stored in the control plane and encrypted at rest.
Data plane: stored in the customer cloud account
• The data plane is managed by your Azure account and is where your data resides. This is also where data is processed. You can use Azure Databricks connectors so that your clusters can connect to external data sources outside of your Azure account to ingest data or for storage.
Here is the product architecture diagram with the relevant areas highlighted.
75. One of the team members, Steve, who has the ability to create views, created a new view called regional_sales_vw on the existing table called sales, which is owned by John. A second team member, Kevin, who works with regional sales managers, wanted to query the data in regional_sales_vw, so Steve granted the permission to Kevin using the command
GRANT VIEW, USAGE ON regional_sales_vw TO kevin@company.com
but Kevin is still unable to access the view. Why?
A. Kevin needs select access on the table sales
B. Kevin needs owner access on the view regional_sales_vw
C. Steve is not the owner of the sales table
D. Kevin is not the owner of the sales table
E. Table access control is not enabled on the table and view
Answer: C
Explanation:
Ownership determines whether or not you can grant privileges on derived objects to other users. Since Steve is not the owner of the underlying sales table, he cannot grant access to the table, or to the data in the table indirectly, through the view.
Only the owner (user or group) can grant access to an object.
https://docs.microsoft.com/en-us/azure/databricks/security/access-control/table-acls/object-privileges#a-user-has-select-privileges-on-a-view-of-table-t-but-when-that-user-tries-to-select-from-that-view-they-get-the-error-user-does-not-have-privilege-select-on-table (Data object privileges - Azure Databricks | Microsoft Docs)
76. A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that
the Min, Median, and Max Durations for tasks in a particular stage show the minimum and
median time to complete a task as roughly the same, but the max duration for a task to be
roughly 100 times as long as the minimum.
Which situation is causing increased duration of the overall job?
A. Task queueing resulting from improper thread pool assignment.
B. Spill resulting from attached volume storage being too small.
C. Network latency due to some cluster nodes being in different regions from the source data
D. Skew caused by more data being assigned to a subset of spark-partitions.
E. Credential validation errors while pulling data from an external system.
Answer: D
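Beyond diagnosis, a common mitigation (not part of the question) is to let Adaptive Query Execution split oversized partitions during joins; a minimal sketch:
# Enable AQE skew-join handling (available in Spark 3.x)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")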
77. unitsSold int)
B. 1. CREATE OR REPLACE TABLE IF EXISTS transactions (
78. Select * from table_name
C. Both A & B
(Correct)
D. 1.%python
79. agg(sum("sales"))
80. table_name = "sales"
81. table("cleanedSales"))
D. 1.(spark.readStream.load(rawSalesLocation)
82. outputMode("append")
83. A junior data engineer has ingested a JSON file into a table raw_table with the following
schema:
84. option("checkpointLocation", checkpointPath)
85. def check_input(x,y):
86. The number of worker nodes in a cluster is determined by the size of the cluster (2X-Small -> 1 worker, X-Small -> 2 workers, ... up to 4X-Large -> 128 workers); this is called scale up.
87. When writing streaming data, Spark Structured Streaming supports the below output modes:
A. Append, Delta, Complete
B. Delta, Complete, Continuous
C. Append, Complete, Update
D. Complete, Incremental, Update
E. Append, overwrite, Continuous
Answer: C
Explanation:
The answer is Append, Complete, Update
• Append mode (default) - This is the default mode, where only the new rows added to the Result Table since the last trigger will be outputted to the sink. This is supported only for those queries where rows added to the Result Table are never going to change. Hence, this mode guarantees that each row will be output only once (assuming a fault-tolerant sink). For example, queries with only select, where, map, flatMap, filter, join, etc. will support Append mode.
• Complete mode - The whole Result Table will be outputted to the sink after every trigger.
This is supported for aggregation queries.
• Update mode - (Available since Spark 2.1.1) Only the rows in the Result Table that were
updated since the last trigger will be outputted to the sink. More information to be added in
future releases.
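A minimal sketch showing where the output mode is set (source table, column, checkpoint path, and target table are placeholders; note that the Delta sink accepts the append and complete modes):
events = spark.readStream.table("bronze_events")          # placeholder streaming source table
counts = events.groupBy("device_id").count()              # an aggregation, so append does not apply

query = (counts.writeStream
         .outputMode("complete")                          # whole result table rewritten each trigger
         .format("delta")
         .option("checkpointLocation", "/mnt/chk/device_counts")  # placeholder path
         .table("device_counts"))                         # placeholder target table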
88. format("cloudfiles") # Returns a stream data source, reads data as it arrives based on the
trigger.
89. Which of the following describes how Databricks Repos can help facilitate CI/CD workflows
on the Databricks Lakehouse Platform?
A. Databricks Repos can facilitate the pull request, review, and approval process before
merging branches
B. Databricks Repos can merge changes from a secondary Git branch into a main Git branch
C. Databricks Repos can be used to design, develop, and trigger Git automation pipelines
D. Databricks Repos can store the single-source-of-truth Git repository
E. Databricks Repos can commit or push code changes to trigger a CI/CD process
Answer: E
Explanation:
The answer is: Databricks Repos can commit or push code changes to trigger a CI/CD process. See the diagram below to understand the roles Databricks Repos and the Git provider play when building a CI/CD workflow.
All the steps highlighted in yellow can be done in Databricks Repos; all the steps highlighted in gray are done in a Git provider like GitHub or Azure DevOps.
90. The data engineering team has a job currently set up to run a task that loads data into a reporting table every day at 8:00 AM; it takes about 20 minutes. The operations team is planning to use that data to run a second job, so they need access to the latest complete set of data.
What is the best way to orchestrate this job setup?
A. Add the Operations reporting task in the same job and set the Data Engineering task to depend on the Operations reporting task
B. Set up a second job to run at 8:20 AM in the same workspace
C. Add the Operations reporting task in the same job and set the Operations reporting task to depend on the Data Engineering task
D. Use Auto Loader to run every 20 mins to read the initial table, set the trigger to once, and create a second job
E. Set up a Delta Live Table based on the first table and set the job to run in continuous mode
Answer: C
Explanation:
The answer is: Add the Operations reporting task in the same job and set the Operations reporting task to depend on the Data Engineering task.
91. AS SELECT * FROM table_name
D. 1. CREATE OR REPLACE VIEW view_name
92. _____
93. No schema is applied at this layer
Exam focus: Please review the image below and understand the role of each layer (bronze, silver, gold) in the medallion architecture; you will see varying questions targeting each layer and its purpose.
Purpose of each layer in the medallion architecture:
94. Reduces strain on production systems
95. load(data_source)
96. else
97. A SQL Warehouse should have at least one cluster
98. option("_______",”csv)
99. ON C1.CUSTOMER_ID = C2.CUSTOMER_ID
D. 1. SELECT * FROM CUSTOMERS_2021
100. --- Get order summary
101. Which of the following statements are true about a lakehouse?
A. Lakehouse only supports Machine learning workloads and Data warehouses support BI
workloads
B. Lakehouse only supports end-to-end streaming workloads and Data warehouses support
Batch workloads
C. Lakehouse does not support ACID
D. Lakehouse does not support SQL
E. Lakehouse supports Transactions
Answer: E
Explanation:
What Is a Lakehouse? - The Databricks Blog
102. FROM raw_table;
D. 1. SELECT cart_id, filter(items) AS item_id
103. In which phase of the data analytics lifecycle do Data Scientists spend the most time in a
project?
A. Discovery
B. Data Preparation
C. Model Building
D. Communicate Results
Answer: B
104. A SQL Warehouse should have at least one cluster
105. AS
106. pass
C. 1. if department is not None:
107. When you drop an external DELTA table using the SQL Command DROP TABLE
table_name, how does it impact metadata (delta log, history), and data stored in the storage?
A. Drops table from metastore, metadata (delta log, history) and data in storage
B. Drops table from metastore, data but keeps metadata (delta log, history) in storage
C. Drops table from metastore, metadata (delta log, history) but keeps the data in storage
D. Drops table from metastore, but keeps metadata (delta log, history) and data in storage
E. Drops table from metastore and data in storage but keeps metadata (delta log, history)
Answer: D
Explanation:
The answer is: Drops the table from the metastore but keeps metadata and data in storage. When an external table is dropped, only the table definition is dropped from the metastore; everything else, including data and metadata (Delta transaction log, time travel history), remains in storage.
The Delta log is considered part of the metadata: if you drop a column in a Delta table (managed or external), the column is not physically removed from the parquet files; rather, the change is recorded in the Delta log. The Delta log is a key metadata layer for a Delta table to work.
Please see the image below to compare an external Delta table with a managed Delta table and how they differ in how they are created and what happens when you drop the table.
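(The comparison image is not reproduced; a rough sketch of the external-table behavior, with illustrative names and paths:)
# Create an external Delta table over an existing storage location
spark.sql("""
  CREATE TABLE ext_sales (id INT, amount DOUBLE)
  USING DELTA
  LOCATION '/mnt/delta/ext_sales'
""")

# Dropping it removes only the metastore entry; the parquet data files and the
# _delta_log directory remain at /mnt/delta/ext_sales and can be re-registered later.
spark.sql("DROP TABLE ext_sales")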
108. transactionId int,
109. transactionDate timestamp,
110. items ARRAY<item_id:STRING>
The junior data engineer would like to unnest the items column in raw_table to result in a new
table with the following schema:
111. as
112. from_json('[{ "employeeId":1234,"sales" : 10000 },{ "employeeId":3232,"sales" : 30000 }]',
113. transactionDate timestamp,
114. "temp":[25,28,49,58,38,25]
115. RETURNS DOUBLE
116. A data engineer has set up two Jobs that each run nightly. The first Job starts at 12:00 AM,
and it usually completes in about 20 minutes. The second Job depends on the first Job, and it
starts at 12:30 AM. Sometimes, the second Job fails when the first Job does not complete by
12:30 AM.
Which of the following approaches can the data engineer use to avoid this problem?
A. They can set up a retry policy on the first Job to help it run more quickly
B. They can use cluster pools to help the Jobs run more efficiently
C. They can limit the size of the output in the second Job so that it will not fail as easily
D. They can set up the data to stream from the first Job to the second Job
E. They can utilize multiple tasks in a single job with a linear dependency
Answer: E
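Not shown in the question, but as a rough sketch of how answer E looks in a Jobs API 2.1-style task list (task keys and notebook paths are placeholders), the second task declares a dependency on the first:
tasks = [
    {
        "task_key": "ingest",                                # placeholder name
        "notebook_task": {"notebook_path": "/Jobs/ingest"},  # placeholder path
    },
    {
        "task_key": "report",                                # placeholder name
        "depends_on": [{"task_key": "ingest"}],              # runs only after "ingest" succeeds
        "notebook_task": {"notebook_path": "/Jobs/report"},  # placeholder path
    },
]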
117. .load(dataSource)
118. outputMode("complete")
119. You are currently working with the marketing team to set up a dashboard for ad campaign analysis. Since the team is not sure how often the dashboard should be refreshed, they have decided to do a manual refresh on an as-needed basis.
Which of the following steps can be taken to reduce the overall cost of the compute when the team is not using the compute?
*Please note that Databricks recently changed the name of SQL Endpoint to SQL Warehouse.
A. They can turn on the Serverless feature for the SQL endpoint(SQL Warehouse).
B. They can decrease the maximum bound of the SQL endpoint(SQL Warehouse) scaling
range.
C. They can decrease the cluster size of the SQL endpoint(SQL Warehouse).
D. They can turn on the Auto Stop feature for the SQL endpoint(SQL Warehouse).
E. They can turn on the Serverless feature for the SQL endpoint(SQL Warehouse) and change
the Spot Instance Policy from “Reliability Optimized” to “Cost optimized”
Answer: D
Explanation:
The answer is, They can turn on the Auto Stop feature for the SQL endpoint(SQL Warehouse).
Use auto stop to automatically terminate the cluster when you are not using it.
120. AS SELECT * FROM table_name
Answer: A
Explanation:
The answer is
121. Consider flipping a coin for which the probability of heads is p, where p is unknown, and our goal is to estimate p. The obvious approach is to count how many times the coin came up heads and divide by the total number of coin flips. If we flip the coin 1000 times and it comes up heads 367 times, it is very reasonable to estimate p as approximately 0.367. However, suppose we flip the coin only twice and we get heads both times.
Is it reasonable to estimate p as 1.0? Intuitively, given that we only flipped the coin twice, it seems a bit rash to conclude that the coin will always come up heads, and ____________ is a way of avoiding such rash conclusions.
A. Naive Bayes
B. Laplace Smoothing
C. Logistic Regression
D. Linear Regression
Answer: B
Explanation:
Smooth the estimates: consider flipping a coin for which the probability of heads is p, where p is
unknown, and our goal is to estimate p. The obvious approach is to count how many times the
coin came up heads and divide by the total number of coin flips. If we flip the coin 1000 times
and it comes up heads 367 times, it is very reasonable to estimate p as approximately 0.367.
However, suppose we flip the coin only twice and we get heads both times. Is it reasonable to
estimate p as 1.0? Intuitively, given that we only flipped the coin twice, it seems a bit rash to
conclude that the coin will always come up heads, and smoothing is a way of avoiding such
rash conclusions. A simple smoothing method, called Laplace smoothing (or Laplace's law of
succession or add-one smoothing in R&N), is to estimate p by (one plus the number of heads) /
(two plus the total number of flips). Said differently, if we are keeping count of the number of
heads and the number of tails, this rule is equivalent to starting each of our counts at one, rather
than zero. Another advantage of Laplace smoothing is that it avoids estimating any probabilities
to be zero, even for events never observed in the data.
Laplace add-one smoothing now assigns too much probability to unseen words
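A quick worked example of the add-one rule for the two-flip case:
heads, flips = 2, 2
mle_estimate = heads / flips                      # 1.0 -- the rash conclusion
laplace_estimate = (heads + 1) / (flips + 2)      # 0.75 -- the smoothed estimate
print(mle_estimate, laplace_estimate)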
122. #Execute code
123. In Refresh, set a refresh schedule. An alert’s refresh schedule is independent of the
query’s refresh schedule.
• If the query is a Run as owner query, the query runs using the query owner's credential on the alert's refresh schedule.
• If the query is a Run as viewer query, the query runs using the alert creator's credential on the alert's refresh schedule.
124. option("cloudFiles.schemaLocation", checkpoint_directory)\
125. One of the queries in a Databricks SQL dashboard takes a long time to refresh. Which of the below steps can be taken to identify the root cause of this issue?
A. Restart the SQL endpoint
B. Select the SQL endpoint cluster, spark UI, SQL tab to see the execution plan and time spent
in each step
C. Run optimize and Z ordering
D. Change the Spot Instance Policy from “Cost optimized” to “Reliability Optimized.”
E. Use Query History to view queries, select the query, and check the query profile to see the time spent in each step
Answer: E
Explanation:
The answer is: Use Query History to view queries, select the query, and check the query profile to see the time spent in each step.
Here is the view of the query profile; for more info use the link https://docs.microsoft.com/en-us/azure/databricks/sql/admin/query-profile
As you can see, the Databricks SQL query profile is quite different from the Spark UI and provides much clearer information on how time is being spent on different queries and on each step.
126. The data governance team is reviewing code used for deleting records for compliance with
GDPR. They note the following logic is used to delete records from the Delta Lake table named
users.
Assuming that user_id is a unique identifying key and that delete_requests contains all users
that have requested deletion, which statement describes whether successfully executing the
above logic guarantees that the records to be deleted are no longer accessible and why?
A. Yes; Delta Lake ACID guarantees provide assurance that the DELETE command succeeded
fully and permanently purged these records.
B. No; the Delta cache may return records from previous versions of the table until the cluster is
restarted.
C. Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.
D. No; the Delta Lake DELETE command only provides ACID guarantees when combined with
the MERGE INTO command.
E. No; files containing deleted records may still be accessible with time travel until a VACUUM
command is used to remove invalidated data files.
Answer: E
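(The reviewed code block is not reproduced; a hedged sketch of the kind of delete logic described, plus the VACUUM that physically removes the invalidated files — table names are taken from the question, retention shown at the 7-day default:)
# Delete the requested users (a logical delete: old files are only marked invalid)
spark.sql("""
  DELETE FROM users
  WHERE user_id IN (SELECT user_id FROM delete_requests)
""")

# Physically remove files no longer referenced by the current table version.
# With the default 168-hour retention, deleted data stays reachable via time travel for 7 days.
spark.sql("VACUUM users RETAIN 168 HOURS")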
127. SELECT * FROM CUSTOMERS_2020
Answer: D
Explanation:
Answer is,
128. CASE WHEN is_member('auditors') THEN email ELSE 'REDACTED' END AS email,
129. outputMode("append")
130. {"date":"01-02-2021",
131. ALTER TABLE table_name SET TBLPROPERTIES (property_key [ = ] property_val [, ...] )
TBLPROPERTIES allow you to set key-value pairs
Table properties and table options (Databricks SQL) | Databricks on AWS
132. Which of the following Auto loader structured streaming commands successfully performs
a hop from the landing area into Bronze?
A. 1.spark\
133.
134. The view updates represents an incremental batch of all newly ingested data to be inserted
or updated in the customers table.
The following logic is used to process these records.
Which statement describes this implementation?
A. The customers table is implemented as a Type 3 table; old values are maintained as a new
column alongside the current value.
B. The customers table is implemented as a Type 2 table; old values are maintained but marked
as no longer current and new values are inserted.
C. The customers table is implemented as a Type 0 table; all writes are append only with no
changes to existing values.
D. The customers table is implemented as a Type 1 table; old values are overwritten by new
values and no history is maintained.
E. The customers table is implemented as a Type 2 table; old values are overwritten and new
customers are appended.
Answer: B
135. The Delta Live Tables pipeline is configured to run in Production mode using the Continuous Pipeline Mode.
What is the expected outcome after clicking Start to update the pipeline?
A. All datasets will be updated once and the pipeline will shut down. The compute resources will
be terminated
B. All datasets will be updated at set intervals until the pipeline is shut down. The compute
resources will be deployed for the update and terminated when the pipeline is stopped
C. All datasets will be updated at set intervals until the pipeline is shut down. The compute
resources will persist after the pipeline is stopped to allow for additional testing
D. All datasets will be updated once and the pipeline will shut down. The compute resources will persist to allow for additional testing
E. All datasets will be updated continuously and the pipeline will not shut down. The compute resources will persist with the pipeline
Answer: E
Explanation:
The answer is:
All datasets will be updated continuously and the pipeline will not shut down. The compute resources will persist with the pipeline until it is shut down, since the execution mode chosen is continuous. It does not matter whether the pipeline mode is development or production; pipeline mode only matters during pipeline initialization.
A DLT pipeline supports two modes, Development and Production; you can switch between the two based on the stage of your development and deployment lifecycle.
Development and production modes
Development:
When you run your pipeline in development mode, the Delta Live Tables system:
• Reuses a cluster to avoid the overhead of restarts.
• Disables pipeline retries so you can immediately detect and fix errors.
Production:
In production mode, the Delta Live Tables system:
• Restarts the cluster for specific recoverable errors, including memory leaks and stale credentials.
• Retries execution in the event of specific errors, for example, a failure to start a cluster.
Use the buttons in the Pipelines UI to switch between development and production modes. By default, pipelines run in development mode.
Switching between development and production modes only controls cluster and pipeline execution behavior. Storage locations must be configured as part of pipeline settings and are not affected when switching between modes.
Delta Live Tables supports two different modes of execution:
Triggered pipelines update each table with whatever data is currently available and then stop the cluster running the pipeline. Delta Live Tables automatically analyzes the dependencies between your tables and starts by computing those that read from external sources. Tables within the pipeline are updated after their dependent data sources have been updated.
Continuous pipelines update tables continuously as input data changes. Once an update is started, it continues to run until manually stopped. Continuous pipelines require an always-running cluster but ensure that downstream consumers have the most up-to-date data.
Please review additional DLT concepts using the link below:
https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-concepts.html#delta-live-tables-concepts
136. Which of the following is a correct statement on how the data is organized in storage when managing a DELTA table?
A. All of the data is broken down into one or many parquet files, log files are broken down into one or many JSON files, and each transaction creates new data file(s) and a log file.
B. All of the data and the log are stored in a single parquet file
C. All of the data is broken down into one or many parquet files, but the log is stored as a single JSON file, and every transaction creates new data file(s) and appends to the log file.
D. All of the data is broken down into one or many parquet files; the log file is removed once the transaction is committed.
E. All of the data is stored in one parquet file; log files are broken down into one or many JSON files.
Answer: A
Explanation:
The answer is:
All of the data is broken down into one or many parquet files, log files are broken down into one or many JSON files, and each transaction creates new data file(s) and a log file. Here is a sample layout of how a Delta table might look:
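(The original image is not shown here; an illustrative layout with made-up file names:)
/mnt/delta/sales/
  _delta_log/
    00000000000000000000.json          <- transaction log entry for the first commit
    00000000000000000001.json          <- one new JSON log file per transaction
  part-00000-3f1a.snappy.parquet       <- data files; each transaction adds new one(s)
  part-00001-8b2d.snappy.parquet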
137. The current ELT pipeline receives data from the operations team once a day, so you set up an AUTO LOADER process to run once a day using trigger(once=True) and scheduled a job to run once a day. The operations team recently rolled out a new feature that allows them to send data every 1 minute. What changes do you need to make to AUTO LOADER to process the data every 1 minute?
A. Convert AUTO LOADER to structured streaming
B. Change AUTO LOADER trigger to .trigger(ProcessingTime = "1 minute")
C. Setup a job cluster run the notebook once a minute
D. Enable stream processing
E. Change AUTO LOADER trigger to ("1 minute")
Answer: B
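A minimal sketch of the change, assuming autoloader_df is the existing cloudFiles streaming DataFrame (checkpoint path and table name are placeholders):
# Change the write trigger from a one-shot run to a 1-minute micro-batch cadence:
# before: .trigger(once=True)
query = (autoloader_df.writeStream
         .option("checkpointLocation", "/mnt/chk/bronze")   # placeholder path
         .trigger(processingTime="1 minute")
         .table("bronze_events"))                           # placeholder table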
138. writeStream
139. USING org.apache.spark.sql.jdbc
140. You noticed that a team member started using an all-purpose cluster to develop a notebook and used the same all-purpose cluster to set up a job that runs every 30 minutes so they can update underlying tables which are used in a dashboard.
What would you recommend for reducing the overall cost of this approach?
A. Reduce the size of the cluster
B. Reduce the number of nodes and enable auto scale
C. Enable auto termination after 30 mins
D. Change the cluster all-purpose to job cluster when scheduling the job
E. Change the cluster mode from all-purpose to single-mode
Answer: D
Explanation:
While using an all-purpose cluster is okay during development, anytime you don't need to interact with a notebook, especially for a scheduled job, it is less expensive to use a job cluster. Using an all-purpose cluster can be twice as expensive as a job cluster.
Please note: the compute cost you pay the cloud provider for the same cluster type and size is the same between an all-purpose cluster and a job cluster; the only difference is the DBU cost.
Total cost of a cluster = total cost of VM compute (Azure, AWS, or GCP) + cost per DBU
The per-DBU cost varies between all-purpose and job clusters.
Here is a recent cost estimate from AWS comparing a jobs cluster and an all-purpose cluster: for jobs compute it is $0.15 per DBU vs. $0.55 per DBU for all-purpose.
How do I check the DBU cost for my cluster?
When you click on an existing cluster or look at the cluster details, you will see this in the top right corner.
141. ELSE (temp - 33) * 5/9
142. A warehouse can have more than one cluster; this is called scale out. If a warehouse is configured with an X-Small cluster size and cluster scaling (Min 1, Max 2), Databricks spins up an additional cluster if it detects queries waiting in the queue. For example, if a warehouse is configured to run 2 clusters (Min 1, Max 2) and a user submits 20 queries, 10 queries will start running while the rest are held in the queue, and Databricks will automatically start the second cluster and redirect the 10 queries waiting in the queue to it.
143. sum(sales) from cte
Sample data with create table syntax for the data:
144. Replaces traditional data lake
145. SELECT * FROM CUSTOMERS_2020
E. 1. SELECT * FROM CUSTOMERS_2021
146. select product_id, sum(order_count) order_count
147. SELECT * FROM CUSTOMERS_2021
148. You are working on a dashboard that takes a long time to load in the browser because each visualization contains a lot of data to populate. Which of the following approaches can be taken to address this issue?
A. Increase size of the SQL endpoint cluster
B. Increase the scale of maximum range of SQL endpoint cluster
C. Use Databricks SQL Query filter to limit the amount of data in each visualization
D. Remove data from Delta Lake
E. Use Delta cache to store the intermediate results
Answer: C
Explanation:
Note: the question may sound misleading, but these are the types of questions the exam tries to ask.
A query filter lets you interactively reduce the amount of data shown in a visualization, similar to a query parameter but with a few key differences. A query filter limits data after it has been loaded into your browser. This makes filters ideal for smaller datasets and environments where query executions are time-consuming, rate-limited, or costly.
This query filter is different from a filter applied at the data level; this filter works at the visualization level, so you can toggle how much data you want to see.
149. else:
150. Which of the following functions can be used to convert JSON string to Struct data type?
A. TO_STRUCT (json value)
B. FROM_JSON (json value)
C. FROM_JSON (json value, schema of json)
D. CONVERT (json value, schema of json)
E. CAST (json value as STRUCT)
Answer: C
Explanation:
Syntax:
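(The copied syntax is not reproduced here; a minimal runnable sketch, with an illustrative JSON literal and schema:)
from pyspark.sql.functions import from_json, col

df = spark.createDataFrame([('{"employeeId": 1234, "sales": 10000}',)], ["raw"])
parsed = df.select(from_json(col("raw"), "employeeId BIGINT, sales INT").alias("employee"))
parsed.select("employee.employeeId", "employee.sales").show()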
151. What is the output of the below function when executed with input parameters 1, 3:
152. Where in the Spark UI can one diagnose a performance problem induced by not
leveraging predicate push-down?
A. In the Executor’s log file, by grepping for "predicate push-down"
B. In the Stage’s Detail screen, in the Completed Stages table, by noting the size of data read
from the Input column
C. In the Storage Detail screen, by noting which RDDs are not stored on disk
D. In the Delta Lake transaction log. by noting the column statistics
E. In the Query Detail screen, by interpreting the Physical Plan
Answer: E
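Outside the Spark UI, the same check can be done programmatically: for file-based sources the pushed-down predicates show up under PushedFilters in the scan node of the physical plan. A sketch with an illustrative table and predicate:
df = spark.table("sales").filter("region = 'EMEA'")   # illustrative table and predicate
df.explain(mode="formatted")                          # look for PushedFilters in the scan node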
153. user = "<jdbcUsername>",
154. A data analyst has noticed that their Databricks SQL queries are running too slowly. They
claim that this issue is affecting all of their sequentially run queries. They ask the data
engineering team for help. The data engineering team notices that each of the queries uses the
same SQL endpoint, but the SQL endpoint is not used by any other user.
Which of the following approaches can the data engineering team use to improve the latency of
the data analyst's queries?
A. They can increase the maximum bound of the SQL endpoint's scaling range
B. They can increase the cluster size of the SQL endpoint
C. They can turn on the Auto Stop feature for the SQL endpoint
D. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to "Reliability Optimized"
E. They can turn on the Serverless feature for the SQL endpoint
Answer: B
155. -- get on hand based on orders summary and supply summary
156. Create a sales database using the DBFS location 'dbfs:/mnt/delta/databases/sales.db/'
A. CREATE DATABASE sales FORMAT DELTA LOCATION 'dbfs:/mnt/delta/databases/sales.db/'
B. CREATE DATABASE sales USING LOCATION 'dbfs:/mnt/delta/databases/sales.db/'
C. CREATE DATABASE sales LOCATION 'dbfs:/mnt/delta/databases/sales.db/'
D. The sales database can only be created in Delta lake
E. CREATE DELTA DATABASE sales LOCATION 'dbfs:/mnt/delta/databases/sales.db/'
Answer: C
Explanation:
The answer is
CREATE DATABASE sales LOCATION 'dbfs:/mnt/delta/databases/sales.db/'
Note: with the introduction of Unity Catalog and the three-level namespace, the usage of SCHEMA and DATABASE is interchangeable.
157. withColumn("avgPrice", col("sales") / col("units"))
158. 'ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>') as performance,
159. writeStream
160. city STRING,
161. if x <y:
More Hot Exams are available:
350-401 ENCOR Exam Dumps: https://www.certqueen.com/350-401.html
350-801 CLCOR Exam Dumps: https://www.certqueen.com/350-801.html
200-301 CCNA Exam Dumps: https://www.certqueen.com/200-301.html
https://www.certqueen.com/promotion.asp