2024 New Databricks-Certified-Professional-Data-Engineer Dumps - Real Databricks Exam Questions
Dependable Databricks-Certified-Professional-Data-Engineer Exam Dumps to Become Databricks Certified
The Databricks Certified Professional Data Engineer exam is a valuable certification for professionals who want to showcase their expertise in big data processing using Databricks. The certification demonstrates that the candidate has the skills and knowledge needed to design and implement scalable data pipelines using Databricks. It also provides a competitive advantage in the job market and opens up new career opportunities in big data engineering.
NEW QUESTION # 53
While investigating a data issue, you want to review yesterday's version of a table using the command below. When you query the previous version with time travel, you find that the historical data is no longer available, even though the table history (DESCRIBE HISTORY table_name) shows that the table was updated yesterday. What could be the reason you cannot access this data?
SELECT * FROM table_name TIMESTAMP AS OF date_sub(current_date(), 1)
- A. By default, historical data is cleaned every 180 days in DELTA
- B. Time travel must be enabled before you query previous data
- C. The command VACUUM table_name RETAIN 0 HOURS was run on the table
- D. Time travel is disabled
- E. You currently do not have access to view historical data
Answer: C
Explanation:
The answer is: the command VACUUM table_name RETAIN 0 HOURS was run on the table.
The VACUUM command recursively vacuums the directories associated with the Delta table and removes data files that are no longer in the latest state of the transaction log for the table and are older than a retention threshold. The default threshold is 7 days.
When VACUUM table_name RETAIN 0 HOURS is run, the data files behind all historical versions are deleted, so time travel to earlier versions fails and only the current state of the table can be read.
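The retention rule can be sketched in plain Python. This is only a toy model under stated assumptions: the function name `files_removed_by_vacuum`, the file names, and the dictionary of "stale" files are hypothetical and stand in for what Delta Lake tracks in its transaction log.

```python
from datetime import datetime, timedelta


def files_removed_by_vacuum(stale_files, now, retain_hours):
    """Return names of stale files older than the retention threshold.

    stale_files maps a file name to the time the file stopped being part
    of the latest table state (a simplification of the transaction log).
    """
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(name for name, removed_at in stale_files.items() if removed_at <= cutoff)


now = datetime(2024, 1, 10, 12, 0)
stale = {
    "part-000.parquet": now - timedelta(days=10),  # replaced 10 days ago
    "part-001.parquet": now - timedelta(days=1),   # replaced yesterday
}

# Default threshold of 7 days: yesterday's file survives, so time travel still works.
default_run = files_removed_by_vacuum(stale, now, retain_hours=7 * 24)
# Threshold of 0 hours: every stale file is removed, so no earlier version can be read.
zero_run = files_removed_by_vacuum(stale, now, retain_hours=0)
print(default_run)
print(zero_run)
```

With the default threshold only the ten-day-old file is eligible for removal. With a threshold of 0 hours the file replaced yesterday is removed as well, which matches the scenario in the question.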
NEW QUESTION # 54
Which of the following results in the creation of an external table?
- A. CREATE TABLE transactions (id int, desc string) LOCATION '/mnt/delta/transactions'
- B. CREATE TABLE transactions (id int, desc string) USING DELTA LOCATION EXTERNAL
- C. CREATE TABLE transactions (id int, desc string)
- D. CREATE EXTERNAL TABLE transactions (id int, desc string)
- E. CREATE TABLE transactions (id int, desc string) TYPE EXTERNAL
Answer: A
Explanation:
The answer is CREATE TABLE transactions (id int, desc string) LOCATION '/mnt/delta/transactions'.
Any time a table is created with a LOCATION clause, it is considered an external table. The general syntax is shown below.
Syntax
CREATE TABLE table_name (column column_data_type, ...) USING format LOCATION "dbfs:/"
NEW QUESTION # 55
You have noticed that Databricks SQL queries are running slowly. You are asked to find the reason and identify steps to improve performance. When you look into the issue, you notice that all the queries run in parallel and use a SQL endpoint (SQL warehouse) with a single cluster. Which of the following steps can be taken to improve the performance/response times of the queries?
*Please note Databricks recently renamed SQL endpoint to SQL warehouse.
- A. They can turn on the Serverless feature for the SQL endpoint(SQL warehouse).
- B. They can increase the warehouse size from 2X-Small to 4X-Large of the SQL endpoint(SQL warehouse).
- C. They can turn on the Auto Stop feature for the SQL endpoint(SQL warehouse).
- D. They can increase the maximum bound of the SQL endpoint(SQL warehouse)'s scaling range
- E. They can turn on the Serverless feature for the SQL endpoint(SQL warehouse) and change the Spot Instance Policy to "Reliability Optimized."
Answer: D
Explanation:
The answer is: they can increase the maximum bound of the SQL endpoint's scaling range. When you increase the maximum of the scaling range, more clusters can be added, so queries start running on the available clusters instead of waiting in the queue. See below for more explanation.
The question tests whether you know how to scale a SQL endpoint (SQL warehouse). Look for cue words that tell you whether the queries run sequentially or concurrently. If the queries run sequentially, scale up (increase the cluster size, for example from 2X-Small to 4X-Large). If the queries run concurrently or come from more users, scale out (add more clusters).
SQL Endpoint(SQL Warehouse) Overview: (Please read all of the below points and the below diagram to understand )
1. A SQL warehouse must have at least one cluster.
2. A cluster comprises one driver node and one or more worker nodes.
3. The number of worker nodes in a cluster is determined by the cluster size (2X-Small -> 1 worker, X-Small -> 2 workers, ... up to 4X-Large -> 256 workers). Increasing the size is called scale up.
4. A single cluster, irrespective of its size (2X-Small to 4X-Large), can run only 10 queries at any given time. If a user submits 20 queries all at once to a warehouse with 3X-Large cluster size and cluster scaling (min 1, max 1), 10 queries start running and the remaining 10 wait in a queue for those to finish.
5. Increasing the warehouse cluster size can improve the performance of a query. For example, a query that runs for 1 minute on a 2X-Small warehouse may run in 30 seconds on an X-Small warehouse, because 2X-Small has 1 worker node and X-Small has 2, so the query's tasks run with more parallelism. (Note: this is an ideal case; how query performance scales depends on many factors and is not always linear.)
6. A warehouse can have more than one cluster; this is called scale out. If a warehouse is configured with X-Small cluster size and cluster scaling (min 1, max 2), Databricks spins up an additional cluster when it detects queries waiting in the queue. For example, if a user submits 20 queries to this warehouse, 10 queries start running and the rest are held in the queue; Databricks then automatically starts the second cluster and redirects the 10 queued queries to it.
7. A single query never spans more than one cluster. Once a query is submitted to a cluster, it stays on that cluster until execution finishes, irrespective of how many clusters are available for scaling.
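The queueing behaviour in points 4 and 6 can be sketched as simple arithmetic. This is a toy model, not Databricks code: the function `schedule` is hypothetical, and the limit of 10 concurrent queries per cluster is taken from the text above.

```python
QUERIES_PER_CLUSTER = 10  # per-cluster concurrency limit as described in the text


def schedule(submitted, clusters):
    """Split submitted queries into (running, queued) for a given cluster count."""
    capacity = clusters * QUERIES_PER_CLUSTER
    running = min(submitted, capacity)
    return running, submitted - running


# 20 queries on one cluster: 10 run, 10 wait, regardless of cluster size.
one_cluster = schedule(20, clusters=1)
# After scaling out to two clusters, all 20 queries can run.
two_clusters = schedule(20, clusters=2)
print(one_cluster, two_clusters)
```

Note that the cluster size does not appear in the calculation at all, which is why scaling up does not help with queued concurrent queries while scaling out does.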
Please review the below diagram to understand the above concepts:
A SQL endpoint (SQL warehouse) scales horizontally (scale-out) and vertically (scale-up); you have to understand when to use which.
Scale-out -> add more clusters to a SQL endpoint by changing the maximum number of clusters. If you are trying to improve throughput, that is, to run as many queries as possible at once, additional clusters will improve performance.
Databricks SQL scales automatically as soon as it detects queries in a queued state. In this example scaling is set to min 1 and max 3, which means the warehouse can grow to as many as three clusters if it detects queries waiting.
During warehouse creation or afterwards, you can change the warehouse size (2X-Small to 4X-Large) to improve query performance, and raise the maximum of the scaling range to add more clusters to a SQL endpoint (SQL warehouse) for scale-out. If you are changing an existing warehouse, you may have to restart it for the changes to take effect.
How do you know how many clusters you need (how to set the maximum number of clusters)?
When you click on an existing warehouse and select the monitoring tab, you can see warehouse utilization information (see below). Two graphs show how the warehouse is being utilized. If you see queries being queued, your warehouse can benefit from additional clusters. Please review the additional DBU cost associated with adding clusters so you can make a well-balanced decision between cost and performance.
NEW QUESTION # 56
To facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directory, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.
The function is displayed below with a blank:
Which response correctly fills in the blank to meet the specified requirements?
- A. Option C
- B. Option A
- C. Option B
- D. Option E
- E. Option D
Answer: C
Explanation:
Option B correctly fills in the blank. It sets the "cloudFiles.schemaLocation" option, which Auto Loader requires for schema detection and evolution; it sets the "mergeSchema" option, which lets the schema of the target table evolve when new fields are detected; and it uses the "writeStream" method, which is needed to process JSON files incrementally as they arrive in the source directory. The other options are incorrect because they omit the required options, use the wrong method, or use the wrong format.
References:
Configure schema inference and evolution in Auto Loader:
https://docs.databricks.com/en/ingestion/auto-loader/schema.html
Write streaming data:
https://docs.databricks.com/spark/latest/structured-streaming/writing-streaming-data.html
NEW QUESTION # 57
A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.
The user_ltv table has the following schema:
email STRING, age INT, ltv INT
The following view definition is executed:
An analyst who is not a member of the marketing group executes the following query:
SELECT * FROM email_ltv
Which statement describes the results returned by this query?
- A. Only the email and ltv columns will be returned; the email column will contain all null values.
- B. The email and ltv columns will be returned with the values in user_ltv.
- C. Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.
- D. Three columns will be returned, but one column will be named "redacted" and contain only null values.
- E. The email, age, and ltv columns will be returned with the values in user_ltv.
Answer: C
Explanation:
The code creates a view called email_ltv that selects the email and ltv columns from the user_ltv table, which has the schema email STRING, age INT, ltv INT. The view uses a CASE WHEN expression to replace the email values with the string "REDACTED" when the user is not a member of the marketing group. The analyst who executes the query is not a member of the marketing group, so they see only the email and ltv columns, and the email column contains the string "REDACTED" in each row.
Verified References: [Databricks Certified Data Engineer Professional], under "Lakehouse" section; Databricks Documentation, under "CASE expression" section.
NEW QUESTION # 58
A Delta Lake table representing metadata about content posts from users has the following schema:
user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE
This table is partitioned by the date column. A query is run with the following filter:
longitude < 20 & longitude > -20
Which statement describes how data will be filtered?
- A. Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.
- B. No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.
- C. The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.
- D. Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.
- E. The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.
Answer: A
Explanation:
Statistics in the Delta Log are used to identify data files that might include records in the filtered range. Delta Lake records file-level statistics in the transaction log, including the minimum and maximum values of columns in each data file. Because the filter is on longitude rather than on the partition column date, partition pruning does not apply, but the min and max longitude values let Delta Lake skip data files whose range cannot match the filter, which improves query performance and reduces I/O costs. Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Data skipping" section.
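The file-skipping idea can be illustrated with a small pure-Python sketch. Everything here is hypothetical (the `FileStats` class, the file paths, and the statistics values); it only shows how per-file min/max values decide which files need to be read for a filter such as longitude > -20 and longitude < 20.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    min_longitude: float
    max_longitude: float


def files_to_scan(files, low, high):
    """Keep files whose [min, max] range could contain a value v with low < v < high."""
    return [f.path for f in files if f.max_longitude > low and f.min_longitude < high]


files = [
    FileStats("date=2024-01-01/part-0.parquet", -170.0, -60.0),  # entirely below the range
    FileStats("date=2024-01-01/part-1.parquet", -30.0, 15.0),    # overlaps the range
    FileStats("date=2024-01-02/part-0.parquet", 25.0, 140.0),    # entirely above the range
    FileStats("date=2024-01-02/part-1.parquet", 10.0, 90.0),     # overlaps the range
]
selected = files_to_scan(files, low=-20.0, high=20.0)
print(selected)
```

The selected files only might contain matching rows; the statistics rule files out, they never confirm individual rows. Both date partitions still contribute a file, which is why the partition column does not help with this filter.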
NEW QUESTION # 59
Which statement characterizes the general programming model used by Spark Structured Streaming?
- A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
- B. Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.
- C. Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.
- D. Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
- E. Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
Answer: E
Explanation:
Spark Structured Streaming treats a live data stream as a table that is being continuously appended. This leads to a stream processing model that is very similar to a batch processing model, where users express their streaming computation using the same Dataset/DataFrame API as they would use for static data. The Spark SQL engine takes care of running the streaming query incrementally and continuously, and of updating the final result as streaming data continues to arrive. Verified References: [Databricks Certified Data Engineer Professional], under "Structured Streaming" section; Databricks Documentation, under "Overview" section.
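As a rough illustration of the unbounded-table model, the sketch below appends each micro-batch to a growing table and updates a running count from the new rows only. The class `UnboundedTable` and its fields are hypothetical and do not correspond to any Spark API.

```python
from collections import Counter


class UnboundedTable:
    """Toy model: each micro-batch appends rows and a running count is updated."""

    def __init__(self):
        self.rows = []
        self.counts = Counter()

    def append_batch(self, batch):
        self.rows.extend(batch)
        # Only the newly arrived rows are processed; earlier results are reused.
        self.counts.update(row["device"] for row in batch)
        return dict(self.counts)


table = UnboundedTable()
first = table.append_batch([{"device": "a"}, {"device": "b"}])
second = table.append_batch([{"device": "a"}])
print(first, second)
```

The result after the second batch is the same as a batch count over all three rows, which is the sense in which the streaming query behaves like a query over a continuously growing table.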
NEW QUESTION # 60
Create a sales database using the DBFS location 'dbfs:/mnt/delta/databases/sales.db/'
- A. CREATE DATABASE sales LOCATION 'dbfs:/mnt/delta/databases/sales.db/'
- B. CREATE DELTA DATABASE sales LOCATION 'dbfs:/mnt/delta/databases/sales.db/'
- C. CREATE DATABASE sales FORMAT DELTA LOCATION 'dbfs:/mnt/delta/databases/sales.db/'
- D. CREATE DATABASE sales USING LOCATION 'dbfs:/mnt/delta/databases/sales.db/'
- E. The sales database can only be created in Delta lake
Answer: A
Explanation:
The answer is
CREATE DATABASE sales LOCATION 'dbfs:/mnt/delta/databases/sales.db/'
Note: with the introduction of Unity Catalog and the three-level namespace, SCHEMA and DATABASE can be used interchangeably.
NEW QUESTION # 61
A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01')
What is the expected behaviour when a batch of data containing records that violate this constraint is processed?
- A. Records that violate the expectation cause the job to fail
- B. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table
- C. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log
- D. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset
- E. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log
Answer: C
NEW QUESTION # 62
You are working on a dashboard that takes a long time to load in the browser because each visualization contains a lot of data to populate. Which of the following approaches can be taken to address this issue?
- A. Use Delta cache to store the intermediate results
- B. Increase size of the SQL endpoint cluster
- C. Use Databricks SQL Query filter to limit the amount of data in each visualization
- D. Increase the scale of maximum range of SQL endpoint cluster
- E. Remove data from Delta Lake
Answer: C
Explanation:
Note*: The question may sound misleading but these are types of questions the exam tries to ask.
A query filter lets you interactively reduce the amount of data shown in a visualization, similar to a query parameter but with a few key differences. A query filter limits data after it has been loaded into your browser.
This makes filters ideal for smaller datasets and environments where query executions are time-consuming, rate-limited, or costly.
A query filter is different from a filter applied at the data level: it works at the visualization level, so you can toggle how much data you want to see.
SELECT action AS `action::filter`, COUNT(0) AS "actions count"
FROM events
GROUP BY action
When queries have filters, you can also apply filters at the dashboard level. Select the Use Dashboard Level Filters checkbox to apply the filter to all queries.
Dashboard filters
Query filters | Databricks on AWS
NEW QUESTION # 63
You are looking to process the data based on two variables: one checks whether the department is supply chain, and the second checks whether the process flag is set to True. Which of the following if statements is correct?
- A. if department == "supply chain" && process:
- B. if department == "supply chain" & if process == TRUE:
- C. if department == "supply chain" and process:
- D. if department = "supply chain" & process:
- E. if department == "supply chain" & process == TRUE:
Answer: C
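The correct option can be checked with a short Python sketch. The function name `should_process` is hypothetical; the condition inside it is the one from the correct answer.

```python
def should_process(department, process):
    # `and` is Python's logical operator. `&&` is a syntax error,
    # and a single `=` inside an if condition is also a syntax error.
    if department == "supply chain" and process:
        return True
    return False


print(should_process("supply chain", True))
print(should_process("supply chain", False))
print(should_process("finance", True))
```

The options that use `&` fail for a different reason: `&` is the bitwise operator and binds more tightly than `==`, so `department == "supply chain" & process` tries to evaluate `"supply chain" & process` first, which raises a TypeError for a string and a boolean.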
NEW QUESTION # 64
A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources.
Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.
Given the current implementation, which method can be used?
- A. Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.
- B. Execute a query to calculate the difference between the new version and the previous version using Delta Lake's built-in versioning and time travel functionality.
- C. Parse the Delta Lake transaction log to identify all newly written data files.
- D. Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.
Answer: B
Explanation:
Delta Lake provides built-in versioning and time travel capabilities, allowing users to query previous snapshots of a table. This feature is particularly useful for understanding changes between different versions of the table. In this scenario, where the table is overwritten nightly, you can use Delta Lake's time travel feature to execute a query comparing the latest version of the table (the current state) with its previous version. This approach effectively identifies the differences (such as new, updated, or deleted records) between the two versions. The other options do not provide a straightforward or efficient way to directly compare different versions of a Delta Lake table.
References:
* Delta Lake Documentation on Time Travel: Delta Time Travel
* Delta Lake Versioning: Delta Lake Versioning Guide
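The comparison described above amounts to a set difference between two snapshots of the table. The sketch below models that in plain Python with hypothetical rows; on Databricks the same idea would typically be expressed in SQL by combining the current table with a `VERSION AS OF` query on the previous version, for example with EXCEPT.

```python
def diff_versions(previous, current):
    """Return (rows added, rows removed) between two table snapshots."""
    prev_set, curr_set = set(previous), set(current)
    return sorted(curr_set - prev_set), sorted(prev_set - curr_set)


# Hypothetical snapshots of (customer_id, churn_risk) before and after the nightly overwrite.
version_1 = [(1, "low"), (2, "high"), (3, "low")]
version_2 = [(1, "low"), (2, "medium"), (4, "high")]

added, removed = diff_versions(version_1, version_2)
print(added)
print(removed)
```

A changed row shows up in both lists: customer 2 appears as removed with its old value and as added with its new value.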
NEW QUESTION # 65
A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.
Which kind of test does the above line exemplify?
- A. Integration
- B. Manual
- C. Unit
- D. Functional
Answer: C
Explanation:
A unit test is designed to verify the correctness of a small, isolated piece of code, typically a single function.
Testing a mathematical function that calculates the area under a curve is an example of a unit test because it tests a specific, individual function to ensure it operates as expected.
References:
* Software Testing Fundamentals: Unit Testing
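A minimal sketch of such a unit test is shown below. The function `area_under_curve` and the trapezoidal rule it uses are assumptions made for illustration; the question does not show the engineer's actual implementation.

```python
import math


def area_under_curve(f, a, b, steps=1000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width


def test_area_under_line():
    # The integral of f(x) = x from 0 to 2 is exactly 2, and the
    # trapezoidal rule is exact for straight lines up to rounding error.
    assert math.isclose(area_under_curve(lambda x: x, 0.0, 2.0), 2.0, rel_tol=1e-9)


test_area_under_line()
print("unit test passed")
```

The test exercises one function in isolation with a known input and expected output, with no database, cluster, or other component involved, which is what distinguishes it from an integration test.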
NEW QUESTION # 66
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
Choose the response that correctly fills in the blank within the code block to complete this task.
- A. to_interval("event_time", "5 minutes").alias("time")
- B. lag("event_time", "10 minutes").alias("time")
- C. window("event_time", "10 minutes").alias("time")
- D. "event_time"
- E. window("event_time", "5 minutes").alias("time")
Answer: E
Explanation:
This is the correct answer because the window function groups streaming data by time intervals. It takes a time column and a window duration; when no slide duration is given, the windows do not overlap. Here the window duration is "5 minutes", so each window covers a non-overlapping five-minute interval. The window function returns a struct column with two fields, start and end, which represent the start and end time of each window. The alias function renames the struct column to "time". Verified References: [Databricks Certified Data Engineer Professional], under "Structured Streaming" section; Databricks Documentation, under "WINDOW" section.
https://www.databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-struc
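The grouping that a five-minute window performs can be sketched in plain Python. This is not Spark code: the helpers `window_start` and `average_by_window` are hypothetical, and the flooring logic assumes a window length that divides 60 minutes evenly.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def window_start(ts, minutes=5):
    """Floor a timestamp to the start of its non-overlapping window."""
    return ts - timedelta(
        minutes=ts.minute % minutes, seconds=ts.second, microseconds=ts.microsecond
    )


def average_by_window(events, minutes=5):
    """Average the temperature of (timestamp, temp) events per window start."""
    sums = defaultdict(lambda: [0.0, 0])
    for ts, temp in events:
        bucket = sums[window_start(ts, minutes)]
        bucket[0] += temp
        bucket[1] += 1
    return {start: total / count for start, (total, count) in sums.items()}


events = [
    (datetime(2024, 1, 1, 12, 1), 20.0),
    (datetime(2024, 1, 1, 12, 4), 22.0),
    (datetime(2024, 1, 1, 12, 5), 30.0),  # falls into the next window
]
averages = average_by_window(events)
print(averages)
```

Each event belongs to exactly one window, so the 12:05 reading starts a new bucket instead of being counted twice. With a 10-minute duration, as in the incorrect options, all three readings would land in the same window.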
NEW QUESTION # 67
The data engineering team noticed that one of the jobs normally finishes in 15 minutes but randomly gets stuck when reading remote databases because of network packet drops. Which of the following steps can be used to improve the stability of the job?
- A. Use Databricks REST API to monitor long running jobs and issue a kill command
- B. Use Jobs runs, active runs UI section to monitor and kill long running job
- C. Use Spark job time out setting in the Spark UI
- D. Use Cluster timeout setting in the Job cluster UI
- E. Modify the task, to include a timeout to kill the job if it runs more than 15 mins.
Answer: E
Explanation:
The answer is: modify the task to include a timeout that kills the job if it runs longer than 15 minutes.
https://docs.microsoft.com/en-us/azure/databricks/data-engineering/jobs/jobs#timeout
NEW QUESTION # 68
A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.
If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?
- A. Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.
- B. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.
- C. Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
- D. All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.
- E. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.
Answer: B
Explanation:
Tasks in a Databricks job are not wrapped in a single transaction. Tasks A and B completed successfully, so all logic in their notebooks ran and any changes they made are committed. Task C failed partway through, so any operations in its notebook that finished before the failure may already have been committed, and those changes are not rolled back automatically.
NEW QUESTION # 69
A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.
If task A fails during a scheduled run, which statement describes the results of this run?
- A. Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
- B. Tasks B and C will be skipped; task A will not commit any changes because of stage failure.
- C. Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.
- D. Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.
- E. Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.
Answer: C
Explanation:
When a Databricks job runs multiple tasks with dependencies, the tasks are executed as a dependency graph. If a task fails, the downstream tasks that depend on it are skipped and marked as Upstream failed. However, the failed task may have already committed some changes to the Lakehouse before the failure occurred, and those changes are not rolled back automatically. Therefore, the job run may result in a partial update of the Lakehouse. Delta Lake's transactional writes make each individual write atomic, so a single write is either fully committed or not committed at all, but they do not span an entire job run. You can also use the Run if condition to configure tasks to run even when some or all of their dependencies have failed, allowing your job to recover from failures and continue running. References:
* transactional writes: https://docs.databricks.com/delta/delta-intro.html#transactional-writes
* Run if: https://docs.databricks.com/en/workflows/jobs/conditional-tasks.html
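The behaviour described above can be modelled with a toy job runner. This is a sketch under stated assumptions, not the Databricks scheduler: `run_job`, the status strings, and the `committed` list standing in for writes to the Lakehouse are all hypothetical.

```python
def run_job(tasks, dependencies):
    """Run tasks in the given order and skip any task whose upstream did not succeed.

    tasks maps a task name to a callable; dependencies maps a task name
    to the list of upstream task names it depends on.
    """
    status = {}
    for name, action in tasks.items():
        if any(status.get(up) != "succeeded" for up in dependencies.get(name, [])):
            status[name] = "skipped"
            continue
        try:
            action()
            status[name] = "succeeded"
        except Exception:
            status[name] = "failed"
    return status


committed = []  # stands in for changes already written to the Lakehouse


def task_a():
    committed.append("a: first write")  # this write is not undone by the later error
    raise RuntimeError("task A failed midway")


status = run_job(
    {"A": task_a, "B": lambda: committed.append("b"), "C": lambda: committed.append("c")},
    {"B": ["A"], "C": ["A"]},
)
print(status)
print(committed)
```

Task A's first write stays in `committed` even though the task ends as failed, and B and C never run, which mirrors the partial update described in the explanation.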
NEW QUESTION # 70
When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?
- A. Cluster: Existing All-Purpose Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: 1
- B. Cluster: New Job Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: Unlimited
- C. Cluster: New Job Cluster;
Retries: None;
Maximum Concurrent Runs: 1
- D. Cluster: Existing All-Purpose Cluster;
Retries: None;
Maximum Concurrent Runs: 1
- E. Cluster: Existing All-Purpose Cluster;
Retries: Unlimited;
Maximum Concurrent Runs: 1
Answer: C
Explanation:
This is the best configuration for scheduling Structured Streaming jobs for production, as it automatically recovers from query failures and keeps costs low. A new job cluster is created for each run of the job and terminated when the job completes, which saves costs and avoids resource contention. Retries are not needed for Structured Streaming jobs, as they can automatically recover from failures using checkpointing and write-ahead logs. Maximum concurrent runs should be set to 1 to avoid duplicate output or data loss. Verified References: Databricks Certified Data Engineer Professional, under "Monitoring & Logging" section; Databricks Documentation, under "Schedule streaming jobs" section.
NEW QUESTION # 71
A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.
Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?
- A. Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.
- B. The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.
- C. Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.
- D. The logic defined in the referenced notebook will be executed three times on the referenced existing all purpose cluster.
- E. One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.
Answer: C
Explanation:
This is the correct answer because the JSON posted to the Databricks REST API endpoint 2.0/jobs/create defines a new job with a name, an existing cluster id, and a notebook task. However, it does not specify any schedule or trigger for the job execution. Therefore, three new jobs with the same name and configuration will be created in the workspace, but none of them will be executed until they are manually triggered or scheduled.
Verified References: [Databricks Certified Data Engineer Professional], under "Monitoring & Logging" section; [Databricks Documentation], under "Jobs API - Create" section.
NEW QUESTION # 72
Which of the following is true of Delta Lake and the Lakehouse?
- A. Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.
- B. Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.
- C. Delta Lake automatically collects statistics on the first 32 columns of each table which are leveraged in data skipping based on query filters.
- D. Z-order can only be applied to numeric values stored in Delta Lake tables
- E. Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.
Answer: C
Explanation:
By default, Delta Lake automatically collects statistics, such as minimum and maximum values, on the first 32 columns of each table and uses them for data skipping based on query filters. The other statements are false. Delta Lake uses Parquet as the underlying storage format for data files, and Parquet is a columnar format that compresses data column by column rather than row by row. Views do not maintain a cache of their source tables; the query behind a view is evaluated when the view is queried. Z-ordering is not limited to numeric columns. Primary and foreign key constraints in Databricks are informational and are not enforced, so they cannot prevent duplicate values from being entered.
Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "Data skipping" section.
NEW QUESTION # 73
A dataset has been defined using Delta Live Tables and includes an expectations clause:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW
What is the expected behavior when a batch of data containing records that violate this constraint is processed?
- A. Records that violate the expectation cause the job to fail.
- B. Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.
- C. Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.
- D. Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
- E. Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.
Answer: D
Explanation:
The answer is: records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.
Delta Live Tables supports three types of expectations for handling bad data in DLT pipelines: retain invalid records (the default, EXPECT with no ON VIOLATION clause), drop invalid records (ON VIOLATION DROP ROW), and fail the update (ON VIOLATION FAIL UPDATE).
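The difference between the three behaviours can be sketched with a small Python model. This is not the Delta Live Tables API: `apply_expectation`, the mode names "warn", "drop" and "fail", and the sample rows are hypothetical, and the returned violation count stands in for what the event log would record.

```python
def apply_expectation(rows, predicate, on_violation="warn"):
    """Return (kept_rows, violation_count) for one expectation.

    on_violation: "warn" keeps invalid rows, "drop" removes them,
    and "fail" raises if any row is invalid.
    """
    invalid = [r for r in rows if not predicate(r)]
    if on_violation == "fail" and invalid:
        raise ValueError(f"{len(invalid)} rows violated the expectation")
    kept = rows if on_violation == "warn" else [r for r in rows if predicate(r)]
    return kept, len(invalid)


rows = [{"timestamp": "2019-06-01"}, {"timestamp": "2021-03-15"}]


def is_valid(row):
    # ISO-formatted dates compare correctly as strings.
    return row["timestamp"] > "2020-01-01"


warned = apply_expectation(rows, is_valid, "warn")   # like EXPECT alone
dropped = apply_expectation(rows, is_valid, "drop")  # like ON VIOLATION DROP ROW
print(warned)
print(dropped)
```

In both cases the violation is counted, but only the "drop" mode removes the 2019 record from the output, which corresponds to the answer for the clause in this question.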
