felipelemes committed on
Commit 1d1d48c · verified · 1 Parent(s): 0473a7c

Upload 922 files

This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50)
  1. scraped_kb_articles/%25run-magic-command-not-working-as-expected-with-%25python-listed-first.json +5 -0
  2. scraped_kb_articles/403-error-when-attempting-to-synchronize-users-through-an-azure-databricks-scim-provisioning-connector.json +5 -0
  3. scraped_kb_articles/404-client-error-not-found-for-url-while-running-delta-sharing-client-load_as_spark-command.json +5 -0
  4. scraped_kb_articles/404-error-when-installing-krb5-user-module.json +5 -0
  5. scraped_kb_articles/502-error-when-trying-to-access-the-spark-ui.json +5 -0
  6. scraped_kb_articles/abfs-client-hang.json +5 -0
  7. scraped_kb_articles/access-adls1-from-sparklyr.json +5 -0
  8. scraped_kb_articles/access-blob-fails-wasb.json +5 -0
  9. scraped_kb_articles/access-blobstore-odbc.json +5 -0
  10. scraped_kb_articles/acquire-app-only-access-token-failed-error-when-trying-to-connect-to-sharepoint-online-from-an-on-premises-databricks-instance.json +5 -0
  11. scraped_kb_articles/activate-or-deactivate-a-user-in-the-account-console-or-workspace-using-the-api.json +5 -0
  12. scraped_kb_articles/active-vs-dead-jobs.json +5 -0
  13. scraped_kb_articles/add-custom-tags-to-a-delta-live-tables-pipeline.json +5 -0
  14. scraped_kb_articles/add-libraries-to-a-job-cluster-to-reduce-idle-time.json +5 -0
  15. scraped_kb_articles/addressing-performance-issues-with-over-partitioned-delta-tables.json +5 -0
  16. scraped_kb_articles/adls-gen1-firewall-access.json +5 -0
  17. scraped_kb_articles/adls-gen1-mount-problem.json +5 -0
  18. scraped_kb_articles/allow-spaces-and-special-characters-in-nested-column-names-with-delta-tables.json +5 -0
  19. scraped_kb_articles/alter-table-command-only-reordering-on-metadata-level.json +5 -0
  20. scraped_kb_articles/alter-table-drop-partition-error-in-unity-catalog-external-tables.json +5 -0
  21. scraped_kb_articles/analysisexception-error-due-to-a-schema-mismatch.json +5 -0
  22. scraped_kb_articles/analysisexception-incompatible-format-detected-error-when-writing-to-opensearch-.json +5 -0
  23. scraped_kb_articles/analyze-statement-not-working-on-delta-live-tables-dlt.json +5 -0
  24. scraped_kb_articles/analyze-your-overall-unity-catalog-resource-quota-usage-using-spark-sql.json +5 -0
  25. scraped_kb_articles/ansi-compliant-decimal-precision-and-scale.json +5 -0
  26. scraped_kb_articles/apache-airflow-triggered-jobs-terminating-before-completing.json +5 -0
  27. scraped_kb_articles/apache-spark-driver-failing-with-unexpected-stop-and-restart-message.json +5 -0
  28. scraped_kb_articles/apache-spark-is-configured-to-suppress-info-statements-but-they-overwhelm-logs-anyway.json +5 -0
  29. scraped_kb_articles/apache-spark-job-failing-with-gc-overhead-limit-exceeded-error.json +5 -0
  30. scraped_kb_articles/apache-spark-job-failing-with-sparkexception-job-aborted-due-to-stage-failure-error-on-dedicated-compute.json +5 -0
  31. scraped_kb_articles/apache-spark-job-output-only-giving-the-first-json-object-instead-of-all-records.json +5 -0
  32. scraped_kb_articles/apache-spark-jobs-fail-with-environment-directory-not-found-error.json +5 -0
  33. scraped_kb_articles/apache-spark-jobs-failing-due-to-stage-failure-when-using-spot-instances-in-a-cluster.json +5 -0
  34. scraped_kb_articles/apache-spark-pyspark-job-using-a-python-threading-api-function-taking-hours-instead-of-minutes.json +5 -0
  35. scraped_kb_articles/apache-spark-submit-job-clusters-do-not-terminate-after-scstop.json +5 -0
  36. scraped_kb_articles/apache-spark-ui-task-logs-intermittently-return-http-500-error.json +5 -0
  37. scraped_kb_articles/append-a-row-to-rdd-or-dataframe.json +5 -0
  38. scraped_kb_articles/append-output-not-supported-no-watermark.json +5 -0
  39. scraped_kb_articles/applyinpandaswithstate-fails-with-a-modulenotfounderror-when-used-with-delta-live-tables.json +5 -0
  40. scraped_kb_articles/arcgis-library-installation-fails-with-subprocess-exited-with-error.json +5 -0
  41. scraped_kb_articles/attempting-to-connect-to-sharepoint-from-a-databricks-notebook-gives-a-404-not-found-error.json +5 -0
  42. scraped_kb_articles/attributeerror-exportmetricsresponse-when-retrieving-serving-endpoint-metrics.json +5 -0
  43. scraped_kb_articles/authorization-error-when-trying-to-retrieve-subnet-information-after-saving-locally.json +5 -0
  44. scraped_kb_articles/auto-loader-does-not-pick-up-files-for-processing-when-uploading-via-an-azure-function.json +5 -0
  45. scraped_kb_articles/auto-loader-fails-to-pick-up-new-files-when-using-directory-listing-mode.json +5 -0
  46. scraped_kb_articles/auto-loader-failures-with-javaiofilenotfoundexception-for-sst-and-log-files.json +5 -0
  47. scraped_kb_articles/auto-loader-file-notification-mode-fails-to-identify-new-files-from-the-cloud-queue-service.json +5 -0
  48. scraped_kb_articles/auto-loader-streaming-job-failure-with-schema-inference-error.json +5 -0
  49. scraped_kb_articles/auto-loader-streaming-query-failure-with-unknownfieldexception-error.json +5 -0
  50. scraped_kb_articles/autoloader-job-fails-with-a-urisyntaxexception-error-due-to-invalid-characters-in-filenames.json +5 -0
scraped_kb_articles/%25run-magic-command-not-working-as-expected-with-%25python-listed-first.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/notebooks/%25run-magic-command-not-working-as-expected-with-%25python-listed-first",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nWhen trying to use a\n%run\ncommand, you list it after\n%python\nin a notebook as in the following code snippet. You then notice the command behaves unexpectedly or inconsistently. However, when you run notebooks exported in file formats like\n.py\nor\n.ipybn\nfiles, they successfully run.\n%python\r\n%run\r\n...\nCause\nThe\n%run\ncommand must be the first line in a command cell to execute properly in Databricks notebooks.\nWhen it is not the first command, and a command like\n%python\nis written first, the iPython version of the\n%run\ncommand is called, leading to the observed different behavior from the Databricks version of this command.\nSolution\nEnsure that the\n%run\ncommand is the first line in the command cell when invoking a notebook from another notebook.\nFor more information using\n%run\nto import a notebook, review the\nOrchestrate notebooks and modularize code in notebooks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+ }
scraped_kb_articles/403-error-when-attempting-to-synchronize-users-through-an-azure-databricks-scim-provisioning-connector.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/cloud/403-error-when-attempting-to-synchronize-users-through-an-azure-databricks-scim-provisioning-connector",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nYou want to ensure your Databricks account portal is inaccessible from the internet while still allowing a connection from Microsoft Entra ID to Databricks. You follow the steps to allowlist Microsoft Entra ID IPs as recommended in the\nTutorial: Develop and plan provisioning for a SCIM endpoint in Microsoft Entra ID\ndocumentation. The current requirement is to add all the Microsoft Entra ID IPs to the Databricks IP access list, so you can establish a connection from the Microsoft Entra ID service for SCIM. You have IP access control lists (ACLs) enabled at the account level.\nDuring this setup, you encounter a 403 error when you attempt to synchronize users through an Azure Databricks SCIM provisioning connector.\nCause\nThe IP addresses Microsoft Entra ID service uses are dynamic. In order for Microsoft Entra ID SCIM provisioning to work with an IP ACL in place, the IP addresses with the AzureActiveDirectory tag need to be allowlisted. Their dynamic nature means it is difficult to maintain an updated IP ACL manually.\nSolution\nImplement an automation script to manage dynamic IP addresses. The script should do the following.\nRetrieve the current list of public IP address prefixes for EntraID using the Python requests API.\nUpdate the Databricks IP access list using the accounts IP access lists API (using\nPATCH\n).\nBest practices while implementing your script include the following.\nRegularly download the updated IP ranges file from Microsoft and update the IP access list accordingly. This file is updated weekly, and new ranges will not be used in Azure for at least one week.\nTest the automation script rigorously in a lower environment before implementing it in production.\nTake a backup of the current IP access list before the first implementation.\nNote\nUpdating an IP access list requires the admin role.\nHow to run the script\nFirst, preview the changes that the script makes.\npython <your-script-name>.py\nor\npython <your-script-name>.py --dry-run=true\nThe script outputs a list of IP ranges that will be added to the Databricks IP access list, and a list of IP ranges to be removed from the Databricks IP access list.\nThen, apply the changes to the Databricks IP access list.\npython <your-script-name>.py --dry-run=false\nExample script\nYou can use the following example script for\n<your-script-name>.py\n. It sets constants, gets the latest service tags and address prefixes for Azure Active Directory. Then it gets the IP access list by ID name and updates the IP access list. 
Last, it determines changes and updates the IP access list with new prefixes.\nimport requests\r\nimport json\r\nimport re\r\nimport argparse\r\n\r\n# Constants\r\nMICROSOFT_DOWNLOAD_PAGE = \"https://www.microsoft.com/en-us/download/details.aspx?id=56519\"\r\naccount_id = '<your-Databricks-account-id>'\r\nDATABRICKS_INSTANCE = 'https://accounts.azuredatabricks.net/'\r\nDATABRICKS_TOKEN = '<set-authentication-mechanism-following-your-account-config>'\r\nIP_ACCESS_LIST_NAME = \"AzureActiveDirectory\"\r\n\r\n# Function to get the latest service tags JSON URL\r\ndef get_latest_service_tags_url():\r\n    headers = {\"User-Agent\": \"curl/7.81.0\", \"Accept\": \"*/*\"}\r\n    response = requests.get(MICROSOFT_DOWNLOAD_PAGE, headers=headers)\r\n    match = re.search(r'ServiceTags_Public_\\d+\\.json', response.text)\r\n    if not match:\r\n        raise ValueError(\"Could not find the latest ServiceTags_Public JSON file.\")\r\n    filename = match.group(0)\r\n    return f\"https://download.microsoft.com/download/7/1/D/71D86715-5596-4529-9B13-DA13A5DE5B63/{filename}\"\r\n\r\n# Function to get the address prefixes for Azure Active Directory\r\ndef get_azure_ad_prefixes(url):\r\n    response = requests.get(url)\r\n    data = response.json()\r\n    azure_ad_prefixes = []\r\n    for item in data['values']:\r\n        if item['name'] == \"AzureActiveDirectory\" and item['id'] == \"AzureActiveDirectory\":\r\n            azure_ad_prefixes = item['properties']['addressPrefixes']\r\n            azure_ad_prefixes = [prefix for prefix in azure_ad_prefixes if \":\" not in prefix]\r\n            break\r\n    return azure_ad_prefixes\r\n\r\n# Function to get the IP access list ID by name\r\ndef get_ip_access_list_id(access_list_name):\r\n    url = f\"{DATABRICKS_INSTANCE}/api/2.0/accounts/{account_id}/ip-access-lists\"\r\n    headers = {\"Authorization\": f\"Bearer {DATABRICKS_TOKEN}\"}\r\n    response = requests.get(url, headers=headers)\r\n    response.raise_for_status()\r\n    access_lists = response.json()['ip_access_lists']\r\n    for acl in access_lists:\r\n        if acl['label'] == access_list_name:\r\n            return acl['list_id'], acl['ip_addresses']\r\n    raise ValueError(f\"Access list with label {access_list_name} not found.\")\r\n\r\n# Function to update the IP access list\r\ndef update_ip_access_list(access_list_id, ip_prefixes):\r\n    url = f\"{DATABRICKS_INSTANCE}/api/2.0/accounts/{account_id}/ip-access-lists/{access_list_id}\"\r\n    headers = {\r\n        \"Authorization\": f\"Bearer {DATABRICKS_TOKEN}\",\r\n        \"Content-Type\": \"application/json\"\r\n    }\r\n    payload = {\r\n        \"label\": IP_ACCESS_LIST_NAME,\r\n        \"list_type\": \"ALLOW\",\r\n        \"ip_addresses\": ip_prefixes,\r\n        \"enabled\": True\r\n    }\r\n    response = requests.patch(url, headers=headers, data=json.dumps(payload))\r\n    response.raise_for_status()\r\n    return response.json()\r\n\r\n# Main script logic\r\ndef main(dry_run=True):\r\n    try:\r\n        # Get latest service tags URL\r\n        latest_url = get_latest_service_tags_url()\r\n        print(f\"Using Service Tags URL: {latest_url}\")\r\n\r\n        # Get Azure AD prefixes\r\n        azure_ad_prefixes = get_azure_ad_prefixes(latest_url)\r\n        \r\n        # Get the IP access list ID and current IPs for the given name\r\n        ip_access_list_id, current_ip_prefixes = get_ip_access_list_id(IP_ACCESS_LIST_NAME)\r\n        \r\n        # Determine changes\r\n        new_prefixes = set(azure_ad_prefixes) - 
set(current_ip_prefixes)\r\n        removed_prefixes = set(current_ip_prefixes) - set(azure_ad_prefixes)\r\n        \r\n        if dry_run:\r\n            print(\"Dry Run: Changes that will be made\")\r\n            print(\"New prefixes to add:\", new_prefixes)\r\n            print(\"Prefixes to remove:\", removed_prefixes)\r\n            return\r\n\r\n        # Update the IP access list with the new prefixes\r\n        updated_ips = list(set(azure_ad_prefixes))\r\n        response = update_ip_access_list(ip_access_list_id, updated_ips)\r\n        print(\"IP access list updated successfully:\", response)\r\n\r\n    except Exception as e:\r\n        print(f\"An error occurred: {e}\")\r\n\r\nif __name__ == \"__main__\":\r\n    \r\n    parser = argparse.ArgumentParser(description=\"Update Databricks IP Access List with Azure AD IPs\")\r\n    parser.add_argument(\"--dry-run\", type=bool, default=True, help=\"Perform a dry run to show changes without applying them (default: True)\")\r\n    args = parser.parse_args()\r\n    \r\n    main(dry_run=args.dry_run)\nFor more information, refer to the\nMicrosoft Entra on-premises application provisioning to SCIM-enabled apps\ndocumentation.\nFor quick access to the IP address list file with the AzureActiveDirectory tag, refer to the\nAzure IP Ranges and Service Tags – Public Cloud\nwebpage."
+ }
scraped_kb_articles/404-client-error-not-found-for-url-while-running-delta-sharing-client-load_as_spark-command.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/notebooks/404-client-error-not-found-for-url-while-running-delta-sharing-client-load_as_spark-command",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nWhile attempting to query a shared table using the Delta Sharing client in a notebook, you try to execute the\ndelta_sharing.load_as_spark(table_url)\nmethod.\nYou receive a 404 error.\nrequests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://<region>.<cloud-vendor>/delta_sharing/retrieve_config.html/shares\nCause\nThe endpoint URL used by the Delta Sharing client is incorrectly configured. It uses “\nHTTPS\n” in the path to the external location, which is not supported in the\nload_as_spark\nmethod.\nSolution\nCheck your endpoint URL and use the correct external location URI.\nConfirm that the endpoint URL in the share file matches the expected format. The call to the\ndelta_sharing.load_as_spark(table_url)\nmethod from the\ndelta-sharing\nPython library expects a valid URI that can be accessed remotely by the Spark driver, in the following format.\nAWS S3 (s3a://)\nAzure Blob Storage (abfss://)\nGCP GCS (gs://your-bucket/path/to/data)\nSelect and replace\n<your-URI>\nin the following code snippet with your URI, based on your cloud platform in the previous step.\nshare_file_path = '<cloud storage>://<storage-region>/delta-sharing/share/open-datasets.share'\r\n\r\ntable_url = f\"{share_file_path}#delta_sharing.default.file_name\"\r\n\r\nshared_df = delta_sharing.load_as_spark(table_url)\r\n\r\ndisplay(shared_df)\nFor more information, refer to the\nDelta Sharing Receiver Quickstart\nnotebook."
+ }
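A minimal sketch of the corrected call described in the article above, assuming the delta-sharing Python client is installed and the share file lives in cloud storage; the bucket, path, and share/schema/table names are placeholders, not values from the article.

import delta_sharing

# Use a cloud-storage URI for the share/profile file, not an https:// URL.
share_file_path = "s3a://<bucket>/<path>/open-datasets.share"

# <share-name>.<schema-name>.<table-name> identifies the shared table.
table_url = f"{share_file_path}#<share-name>.<schema-name>.<table-name>"

# load_as_spark returns a Spark DataFrame backed by the shared table.
shared_df = delta_sharing.load_as_spark(table_url)
display(shared_df)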
scraped_kb_articles/404-error-when-installing-krb5-user-module.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/clusters/404-error-when-installing-krb5-user-module",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nWhen trying to install the\nkrb5-user\npackage on your cluster’s Ubuntu OS, you receive a 404 error.\nUnable to correct missing packages.\r\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/k/krb5/libgssrpc4_1.17-6ubuntu4.1_amd64.deb 404 Not Found [IP: 185.125.190.39 80]\r\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/k/krb5/libkadm5clnt-mit11_1.17-6ubuntu4.1_amd64.deb 404 Not Found [IP: 185.125.190.39 80]\r\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/k/krb5/libkdb5-9_1.17-6ubuntu4.1_amd64.deb 404 Not Found [IP: 185.125.190.39 80]\r\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/k/krb5/libkadm5srv-mit11_1.17-6ubuntu4.1_amd64.deb 404 Not Found [IP: 185.125.190.39 80]\r\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/universe/k/krb5/krb5-user_1.17-6ubuntu4.1_amd64.deb 404 Not Found [IP: 185.125.190.39 80]\r\nE: Aborting install.;\nCause\nWhen there is a package update in the\nkrb5-user Ubuntu repository\n, the previous version is removed.\nIn the Ubuntu OS, package references saved within\n/var/lib/apt/lists/\nare cached, causing the package info to point to the previous, now no longer available, version.\nNote\nThe command\napt-get clean\ndoesn't clear the\n/var/lib/apt/lists/\ndirectory. This is a common OS behavior.\nSolution\nManually remove the\n/var/lib/apt/lists/\npath to refresh the cached data.\nsudo rm -rf /var/lib/apt/lists/*\r\nsudo apt-get -y update\r\nsudo apt-get install -y libkrb5-user"
+ }
scraped_kb_articles/502-error-when-trying-to-access-the-spark-ui.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/clusters/502-error-when-trying-to-access-the-spark-ui",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nWhen you try to access the Apache Spark UI in Databricks, you encounter a 502 error even though your Spark job is running without issues.\n502 Bad Gateway: The server returned an invalid or incomplete response.\nThis error can occur in environments such as Data Engineering and Machine Learning, specifically when working with large Delta Lake tables.\nWhen you check the driver logs, you notice the following error in the driver logs.\njava.lang.StackOverflowError\nCause\nThe Spark UI stores data in memory by default. When a Spark job generates a large amount of data, that data can overflow memory, forcing the driver to address the overflow. The HTTP server, which is within the driver, cannot then respond to HTTP requests properly and throws a 502 error.\nSolution\nEnable the configuration to store Spark UI data on disk instead of in memory. This helps prevent the Spark UI from running out of memory.\nTo enable this configuration, add the following line to your Spark configuration in your cluster settings. Edit your cluster, scroll to\nAdvanced options > Spark\ntab, then in the\nSpark config\nfield, input the following setting.\nspark.ui.store.path /databricks/driver/sparkuirocksdb\nIf you prefer to enable the configuration using a notebook, you can use the following Python code.\nspark.conf.set(\"spark.ui.store.path\", \"/databricks/driver/sparkuirocksdb\")"
+ }
scraped_kb_articles/abfs-client-hang.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/data-sources/abfs-client-hang",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nYou are using Azure Data Lake Storage (ADLS) Gen2. When you try to access an Azure Blob File System (ABFS) path from a Databricks cluster, the command hangs.\nEnable the debug log and you can see the following stack trace in the driver logs:\nCaused by: java.io.IOException: Server returned HTTP response code: 400 for URL: https://login.microsoftonline.com/b9b831a9-6c10-40bf-86f3-489ed83c81e8/oauth2/token\r\n  at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1894)\r\n  at sun.net.www.protocol.http.HttpURLConnection.access$200(HttpURLConnection.java:91)\r\n  at sun.net.www.protocol.http.HttpURLConnection$9.run(HttpURLConnection.java:1484)\r\n  at sun.net.www.protocol.http.HttpURLConnection$9.run(HttpURLConnection.java:1482)\r\n  at java.security.AccessController.doPrivileged(Native Method)\r\n  at java.security.AccessController.doPrivilegedWithCombiner(AccessController.java:782)\r\n  at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1481)\r\n  at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)\r\n  at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)\r\n  at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.oauth2.AzureADAuthenticator.getTokenSingleCall(AzureADAuthenticator.java:254)\r\n  ... 31 more\nCause\nIf ABFS is configured on a cluster with a wrong value for property\nfs.azure.account.oauth2.client.id\n, or if you try to access an explicit path of the form\nabfss://myContainer@myStorageAccount.dfs.core.windows.net/...\nwhere\nmyStorageAccount\ndoes not exist, then the ABFS driver ends up in a retry loop and becomes unresponsive. The command will eventually fail, but because it retries so many times, it appears to be a hung command.\nIf you try to access an incorrect path with an existing storage account, you will see a 404 error message. The system does not hang in this case.\nSolution\nYou must verify the accuracy of all credentials when accessing ABFS data. You must also verify the ABFS path you are trying to access exists. If either of these are incorrect, the problem occurs."
+ }
scraped_kb_articles/access-adls1-from-sparklyr.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/data-sources/access-adls1-from-sparklyr",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nWhen using a cluster with Azure AD Credential Passthrough enabled, commands that you run on that cluster are able to read and write your data in Azure Data Lake Storage Gen1 without requiring you to configure service principal credentials for access to storage.\nFor example, you can directly access data using\n%python\r\n\r\nspark.read.csv(\"adl://myadlsfolder.azuredatalakestore.net/MyData.csv\").collect()\nHowever, when you try to access data directly using Sparklyr:\n%r\r\n\r\nspark_read_csv(sc, name = \"air\", path = \"adl://myadlsfolder.azuredatalakestore.net/MyData.csv\")\nIt fails with the error:\ncom.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen1 Token\nCause\nThe\nspark_read_csv\nfunction in Sparklyr is not able to extract the ADLS token to enable authentication and read data.\nSolution\nA workaround is to use an Azure\napplication id\n,\napplication key\n, and\ndirectory id\nto mount the ADLS location in DBFS:\n%python\r\n\r\n# Get credentials and ADLS URI from Azure\r\napplicationId= <application-id>\r\napplicationKey= <application-key>\r\ndirectoryId= <directory-id>\r\nadlURI=<adl-uri>\r\nassert adlURI.startswith(\"adl:\"), \"Verify the adlURI variable is set and starts with adl:\"\r\n\r\n# Mount ADLS location to DBFS\r\ndbfsMountPoint=<mount-point-location>\r\ndbutils.fs.mount(\r\n  mount_point = dbfsMountPoint,\r\n  source = adlURI,\r\n  extra_configs = {\r\n    \"dfs.adls.oauth2.access.token.provider.type\": \"ClientCredential\",\r\n    \"dfs.adls.oauth2.client.id\": applicationId,\r\n    \"dfs.adls.oauth2.credential\": applicationKey,\r\n    \"dfs.adls.oauth2.refresh.url\": \"\nhttps://login.microsoftonline.com/\n{}/oauth2/token\".format(directoryId)\r\n  })\nThen, in your R code, read data using the mount point:\n%r\r\n\r\n# Install Sparklyr\r\n%r\r\ninstall.packages(\"sparklyr\")\r\nlibrary(sparklyr)\r\n# Create a sparklyr connection\r\nsc <- spark_connect(method = \"databricks\")\r\n\r\n# Read Data\r\n%r\r\nmyData = spark_read_csv(sc, name = \"air\", path = \"dbfs:/<mount-point-location>/myData.csv\")"
+ }
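The mount snippet embedded in the scraped content above contains stray line breaks inside the OAuth refresh URL; a cleaned-up sketch of that same Python mount call follows. It runs only in a Databricks notebook where dbutils is available, and every angle-bracket value is a placeholder.

# Cleaned-up version of the article's ADLS Gen1 mount call (fill in all placeholders).
application_id = "<application-id>"
application_key = "<application-key>"
directory_id = "<directory-id>"
adl_uri = "adl://<your-adls-gen1-account>.azuredatalakestore.net/"
mount_point = "<mount-point-location>"

dbutils.fs.mount(
    mount_point=mount_point,
    source=adl_uri,
    extra_configs={
        "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
        "dfs.adls.oauth2.client.id": application_id,
        "dfs.adls.oauth2.credential": application_key,
        # The refresh URL is one continuous string.
        "dfs.adls.oauth2.refresh.url":
            "https://login.microsoftonline.com/{}/oauth2/token".format(directory_id),
    },
)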
scraped_kb_articles/access-blob-fails-wasb.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/data-sources/access-blob-fails-wasb",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nWhen you try to access an already created mount point or create a new mount point, it fails with the error:\nWASB: Fails with java.lang.NullPointerException\nCause\nThis error can occur when the root mount path (such as\n/mnt/\n) is also mounted to blob storage. Run the following command to check if the root path is also mounted:\n%python\r\n\r\ndbutils.fs.mounts()\nCheck if\n/mnt\nappears in the list.\nSolution\nUnmount the\n/mnt/\nmount point using the command:\n%python\r\n\r\ndbutils.fs.unmount(\"/mnt\")\nNow you should be able to access your existing mount points and create new ones."
+ }
scraped_kb_articles/access-blobstore-odbc.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/data-sources/access-blobstore-odbc",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nDelete\nInfo\nIn general, you should use Databricks Runtime 5.2 and above, which include a built-in Azure Blob File System (ABFS) driver, when you want to access Azure Data Lake Storage Gen2 (ADLS Gen2). This article applies to users who are accessing ADLS Gen2 storage using JDBC/ODBC instead.\nWhen you run a SQL query from a JDBC or ODBC client to access ADLS Gen2, the following error occurs:\ncom.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: No value for dfs.adls.oauth2.access.token.provider found in conf file.\r\n\r\n18/10/23 21:03:28 ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING,\r\njava.util.concurrent.ExecutionException: java.io.IOException: There is no primary group for UGI (Basic token)chris.stevens+dbadmin (auth:SIMPLE)\r\n  at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)\r\n  at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)\r\n  at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)\r\n  at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)\r\n  at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)\r\n  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)\r\n  at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)\r\n  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)\r\n  at com.google.common.cache.LocalCache.get(LocalCache.java:3932)\r\n  at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4721)\r\n  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getCachedPlan(SessionCatalog.scala:158)\r\n  at org.apache.spark.sql.execution.datasources.FindDataSourceTable.org$apache$spark$sql$execution$datasources$FindDataSourceTable$$readDataSourceTable(DataSourceStrategy.scala:257)\r\n  at org.apache.spark.sql.execution.datasources.FindDataSourceTable$$anonfun$apply$2.applyOrElse(DataSourceStrategy.scala:313)\r\n  at\r\n  at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)\r\n  at scala.collection.immutable.List.foldLeft(List.scala:84)\r\n  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:87)\r\n  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:79)\nWhen you run the query from the SQL client, you get the following error:\nAn error occurred when executing the SQL command:\r\nselect * from test_databricks limit 50\r\n\r\n[Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: com.google.common.util.concurrent.UncheckedExecutionException: com.databricks.backend.daemon.data.common.InvalidMountException: Error while using path /mnt/crm_gen2/phonecalls for resolving path '/phonecalls' within mount at '/mnt/crm_gen2'., Query: SELECT * FROM `default`.`test_databricks` `default_test_databricks` LIMIT 50. [SQL State=HY000, DB Errorcode=500051]\r\n\r\nWarnings:\r\n[Simba][SparkJDBCDriver](500100) Error getting table information from database.\nCause\nThe root cause is incorrect configuration settings to create a JDBC or ODBC connection to ABFS via ADLS Gen2, which cause queries to fail.\nSolution\nSet\nspark.hadoop.hive.server2.enable.doAs\nto\nfalse\nin the cluster configuration settings."
+ }
scraped_kb_articles/acquire-app-only-access-token-failed-error-when-trying-to-connect-to-sharepoint-online-from-an-on-premises-databricks-instance.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/security/acquire-app-only-access-token-failed-error-when-trying-to-connect-to-sharepoint-online-from-an-on-premises-databricks-instance",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nWhen you attempt to connect to SharePoint Online from an on-premises Databricks instance to pull content hosted in SharePoint Online, you receive an\nAcquire app-only access token failed\nerror.\nYou’re using Azure AD credentials for authentication, and the connection succeeds from your local computer but fails when attempted from Databricks.\nCause\nYour network is blocking outbound traffic to\nlogin.microsoftonline.com\non port 443. This domain is essential for authenticating with Azure AD and acquiring an app-only access token, which is required for accessing SharePoint resources.\nSolution\nConfirm network connectivity using the following command to verify if the proxy allows traffic to\nlogin.microsoftonline.com\non port 443.\n%sh\r\nnc -zv -x xxxxx:443 login.microsoftonline.com 443\nIf you encounter a disconnection error, work with your network/firewall team to ensure that access to\nlogin.microsoftonline.com\non port 443 is allowed from the on-premises Databricks environment."
+ }
scraped_kb_articles/activate-or-deactivate-a-user-in-the-account-console-or-workspace-using-the-api.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/administration/activate-or-deactivate-a-user-in-the-account-console-or-workspace-using-the-api",
+ "title": "Título do Artigo Desconhecido",
+ "content": ""
+ }
scraped_kb_articles/active-vs-dead-jobs.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/jobs/active-vs-dead-jobs",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nOn clusters where there are too many concurrent jobs, you often see some jobs stuck in the Spark UI without any progress. This complicates identifying which are the active jobs/stages versus the dead jobs/stages.\nCause\nWhenever there are too many concurrent jobs running on a cluster, there is a chance that the Spark internal\neventListenerBus\ndrops events. These events are used to track job progress in the Spark UI. Whenever the event listener drops events you start seeing dead jobs/stages in Spark UI, which never finish. The jobs are actually finished but not shown as completed in the Spark UI.\nYou observe the following traces in driver logs:\n18/01/25 06:37:32 WARN LiveListenerBus: Dropped 5044 SparkListenerEvents since Thu Jan 25 06:36:32 UTC 2018\nSolution\nThere is no way to remove dead jobs from the Spark UI without restarting the cluster. However, you can identify the active jobs and stages by running the following commands:\n%scala\r\n\r\nsc.statusTracker.getActiveJobIds()  // Returns an array containing the IDs of all active jobs.\r\nsc.statusTracker.getActiveStageIds() // Returns an array containing the IDs of all active stages."
+ }
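For PySpark users, a sketch of the equivalent status-tracker calls to the Scala commands in the article above, assuming sc is the active SparkContext as in a Databricks notebook; the method names are from the PySpark StatusTracker API, not taken verbatim from the article.

# Identify genuinely active jobs/stages even when the Spark UI shows "dead" entries.
tracker = sc.statusTracker()

active_job_ids = tracker.getActiveJobsIds()     # IDs of all currently active jobs
active_stage_ids = tracker.getActiveStageIds()  # IDs of all currently active stages

print("Active jobs:", active_job_ids)
print("Active stages:", active_stage_ids)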
scraped_kb_articles/add-custom-tags-to-a-delta-live-tables-pipeline.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/jobs/add-custom-tags-to-a-delta-live-tables-pipeline",
+ "title": "Título do Artigo Desconhecido",
+ "content": "When managing Delta Live Tables pipelines on your clusters, you may want to use custom tags for internal tracking. For example, you may want to use tags to allocate cost across different departments. Or your organization might have a global cluster policy that requires tags on the instances. Failure to comply with a cluster policy can result in cluster start up failures.\nInstructions\nThe Create Pipeline UI does not have an option to add additional tags. Instead, you must add custom tags manually, by editing the JSON configuration.\nClick\nWorkflows\nin the left sidebar menu.\nClick\nDelta Live Tables\n.\nClick\nCreate Pipeline\n.\nEnter your pipeline information in the UI.\nClick\nJSON\nin the upper right to switch to the JSON view.\nAdd your custom tags to the\nclusters\nsection of the JSON file. The\ncustom_tags\nblock should be placed right below the\nlabel\nblock.\n{\r\n    \"clusters\": [\r\n        {\r\n            \"label\": \"default\",\r\n            \"custom_tags\": {\r\n                \"<custom-tag-name1>\": \"<custom-tag-value1>\",\r\n                \"<custom-tag-name2>\": \"<custom-tag-value2>\",\r\n                \"<custom-tag-name3>\": \"<custom-tag-value3>\"\r\n            },\r\n            \"autoscale\": {\r\n                \"min_workers\": 1,\r\n                \"max_workers\": 5,\r\n                \"mode\": \"ENHANCED\"\r\n            }\r\n        }\r\n    ],\r\n    \"development\": true,\r\n    \"continuous\": false,\r\n    \"channel\": \"CURRENT\",\r\n    \"edition\": \"ADVANCED\",\r\n    \"photon\": false,\r\n    \"libraries\": []\r\n}\nExample\n3. For the existing pipelines: click on Delta Live Tables tables UI and select the desired pipeline. Click on settings tab and switch to JSON view as show above.\nDelete\nInfo\nYou can add custom tags to an existing pipeline by editing the settings and using the JSON view to add the tag information. The JSON view can be used to add or edit any cluster property without using the UI."
+ }
scraped_kb_articles/add-libraries-to-a-job-cluster-to-reduce-idle-time.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/libraries/add-libraries-to-a-job-cluster-to-reduce-idle-time",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem:\nYou have an automated job that requires the use of external Maven libraries.\nYou created a separate cluster with the libraries installed, but it incurs idle time, resulting in unnecessary costs.\nSolution:\nTo add libraries to a job cluster, follow these steps:\nCreate a job in Databricks.\nClick\nAdd\nnext to dependent libraries.\nIn the pop-up window, add the required libraries.\nTo reduce idle time in a job cluster, you have two options:\nOpt out of auto termination by clearing the\nAuto Termination\ncheckbox.\nSpecify an inactivity period of 0.\nDatabricks recommends running jobs on a job cluster, rather than an interactive cluster with auto termination.\nJob clusters automatically terminate once the job completes, ensuring efficient resource utilization."
+ }
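A hedged sketch of what a corresponding Jobs API 2.1 create payload might look like when dependent Maven libraries are attached directly to a job cluster, as the article above recommends; the coordinates, notebook path, and cluster sizing are illustrative assumptions, not values from the article.

# Illustrative Jobs API 2.1 payload: the job cluster is created per run, the Maven
# library is installed on it, and the cluster terminates when the job completes.
job_payload = {
    "name": "example-job-with-maven-libraries",
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Workspace/<path-to-notebook>"},
            "new_cluster": {
                "spark_version": "<runtime-version>",
                "node_type_id": "<node-type>",
                "num_workers": 2,
            },
            "libraries": [
                {"maven": {"coordinates": "com.example:example-lib:1.0.0"}}
            ],
        }
    ],
}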
scraped_kb_articles/addressing-performance-issues-with-over-partitioned-delta-tables.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/data/addressing-performance-issues-with-over-partitioned-delta-tables",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Info\nThis article applies to Databricks Runtime 15.2 and above.\nProblem\nWhen working with Delta tables, you notice that your\nDESCRIBE HISTORY\n,\nDESCRIBE FORMATTED\n, and\nDESCRIBE EXTENDED\nqueries execute slowly. You may also see bloated Delta logs or driver out-of-memory (OOM) errors.\nCause\nYour Delta tables are over-partitioned: you have less than 1 GB of data in a given partition, whether from a single file or multiple small files, but the table can accommodate more.\nWhen a Delta table is divided into too many partitions, each containing a small amount of data, the system's performance can degrade trying to manage the increased number of files and associated overhead.\nSolution\nImplement liquid clustering to simplify data layout decisions and optimize query performance. Liquid clustering helps distribute data more efficiently and reduce the overhead associated with managing a large number of small partitions.\nFor more information, please review the\nUse liquid clustering for Delta tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nYou can also optimize the table partitioning layout to ensure that each partition contains approximately 1 GB of data or more. Reduce the number of partitions and merge smaller files.\nFor more information on the ideal partition size, please refer to the\nWhen to partition tables on Databricks\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+ }
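A sketch of the liquid-clustering approach suggested above, issued through Spark SQL from Python; the catalog, table, and clustering columns are hypothetical.

# Rebuild the over-partitioned table as a liquid-clustered table, then optimize it.
spark.sql("""
    CREATE OR REPLACE TABLE main.default.events_clustered
    CLUSTER BY (event_date)
    AS SELECT * FROM main.default.events
""")

# Clustering keys can later be changed without redefining the schema.
spark.sql("ALTER TABLE main.default.events_clustered CLUSTER BY (event_date, user_id)")

# OPTIMIZE lays out existing data files according to the clustering keys.
spark.sql("OPTIMIZE main.default.events_clustered")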
scraped_kb_articles/adls-gen1-firewall-access.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/data-sources/adls-gen1-firewall-access",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nWhen you have a firewall enabled on your Azure virtual network (VNet) and you try to access ADLS using the ADLS Gen1 connector, it fails with the error:\n328 format(target_id, \".\", name), value) 329 else: 330 raise Py4JError(Py4JJavaError:\r\nAn error occurred while calling o196.parquet.: java.lang.RuntimeException:\r\nCould not find ADLS Token at com.databricks.backend.daemon.data.client.adl.AdlCredentialContextTokenProvider$$anonfun$get Token$1.apply(AdlCredentialContextTokenProvider.scala:18)\r\nat com.databricks.backend.daemon.data.client.adl.AdlCredentialContextTokenProvider$$anonfun$get\r\nToken$1.apply(AdlCredentialContextTokenProvider.scala:18)\r\nat scala.Option.getOrElse(Option.scala:121)\r\nat com.databricks.backend.daemon.data.client.adl.AdlCredentialContextTokenProvider.getToken(AdlCredentialContextTokenProvider.scala:18)\r\nat com.microsoft.azure.datalake.store.ADLStoreClient.getAccessToken(ADLStoreClient.java:1036)\r\nat com.microsoft.azure.datalake.store.HttpTransport.makeSingleCall(HttpTransport.java:177)\r\nat com.microsoft.azure.datalake.store.HttpTransport.makeCall(HttpTransport.java:91)\r\nat com.microsoft.azure.datalake.store.Core.getFileStatus(Core.java:655)\r\nat com.microsoft.azure.datalake.store.ADLStoreClient.getDirectoryEntry(ADLStoreClient.java:735)\r\nat com.microsoft.azure.datalake.store.ADLStoreClient.getDirectoryEntry(ADLStoreClient.java:718)\r\nat com.databricks.adl.AdlFileSystem.getFileStatus(AdlFileSystem.java:423)\r\nat org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)\r\nat org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:94)\nCause\nThis is a known issue with the ADLS Gen1 connector. Connecting to ADLS Gen1 when a firewall is enabled is unsupported.\nSolution\nUse\nADLS Gen2\ninstead."
+ }
scraped_kb_articles/adls-gen1-mount-problem.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/cloud/adls-gen1-mount-problem",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nWhen you try to mount an Azure Data Lake Storage (ADLS) Gen1 account on Databricks, it fails with the error:\ncom.microsoft.azure.datalake.store.ADLException: Error creating directory /\r\nError fetching access token\r\nOperation null failed with exception java.io.IOException : Server returned HTTP response code: 401 for URL: https://login.windows.net/18b0b5d6-b6eb-4f5d-964b-c03a6dfdeb22/oauth2/token\r\nLast encountered exception thrown after 5 tries. [java.io.IOException,java.io.IOException,java.io.IOException,java.io.IOException,java.io.IOException]\r\n [ServerRequestId:null]\r\nat com.microsoft.azure.datalake.store.ADLStoreClient.getExceptionFromResponse(ADLStoreClient.java:1169)\r\nat com.microsoft.azure.datalake.store.ADLStoreClient.createDirectory(ADLStoreClient.java:589)\r\nat com.databricks.adl.AdlFileSystem.mkdirs(AdlFileSystem.java:533)\r\nAt com.databricks.backend.daemon.data.client.DatabricksFileSystemV2$$anonfun$mkdirs$1$$anonfun$apply$mcZ$sp$7$$anonfun$apply$mcZ$sp$8.apply$mcZ$sp(DatabricksFileSystemV2.scala:638)\nCause\nThis error can occur if the ADLS Gen1 account was previously mounted in the workspace, but not unmounted, and the credential used for that mount subsequently expired. When you try to mount the same account with a new credential, there is a conflict between the expired and new credentials.\nSolution\nYou need to unmount all existing mounts, and then create a new mount with a new, unexpired credential.\nFor more information, see\nMount Azure Data Lake Storage Gen1 with DBFS (AWS)\nand\nMount Azure Data Lake Storage Gen1 with DBFS (Azure)\n."
+ }
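A minimal sketch of the "unmount everything stale, then remount" step described above, assuming it runs in a Databricks notebook; filtering on the adl:// prefix is an assumption to limit the unmount to ADLS Gen1 mounts.

# List current mounts and unmount any ADLS Gen1 mounts that hold expired credentials.
for m in dbutils.fs.mounts():
    if m.source.startswith("adl://"):
        print(f"Unmounting {m.mountPoint} (source: {m.source})")
        dbutils.fs.unmount(m.mountPoint)

# Afterwards, re-create the mount with a fresh, unexpired credential.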
scraped_kb_articles/allow-spaces-and-special-characters-in-nested-column-names-with-delta-tables.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/delta/allow-spaces-and-special-characters-in-nested-column-names-with-delta-tables",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nIt is common for JSON files to contain nested struct columns. Nested column names in a JSON file can have spaces between the names.\nWhen you use Apache Spark to read or write JSON files with spaces in the nested column names, you get an\nAnalysisException\nerror message.\nFor example, if you try to read a JSON file, evaluate the DataFrame, and then write it out to a Delta table on DBR 10.2 or below it returns an error.\n%scala\r\n\r\nval df = spark.read.json(\"<path-to-JSON-file>\") \r\ndf.write.format(\"delta\").mode(\"overwrite\").save(\"<path-to-delta-table>\")\nThe expected error message is visible in the stack trace.\nAnalysisException: Attribute name \"stage_info.Accumulables.Count Failed Values\" contains invalid character(s) among \" ,;{}()\\n\\t=\". Please use alias to rename it.\nCause\nOne of the nested column names in the DataFrame contains spaces, which is preventing you from writing the output to the Delta table.\nSolution\nIf your source files are straightforward, you can use\nwithColumnRenamed\nto rename multiple columns and remove spaces. However, this can quickly get complicated with a nested schema.\nwithColumn\ncan be used to flatten nested columns and rename the existing column (with spaces) to a new column name (without spaces). In case of a large schema, flattening all of the nested columns in the DataFrame can be a tedious task.\nIf your clusters are using Databricks Runtime 10.2 or above you can avoid the issue entirely by enabling column mapping mode. Column mapping mode allows the use of spaces as well as\n, ; { } ( ) \\n \\t =\ncharacters in table column names.\nSet the Delta table property\ndelta.columnMapping.mode\nto\nname\nto enable column mapping mode.\nThis sample code sets up a Delta table that can support nested column names with spaces, however it does require a cluster running Databricks Runtime 10.2 or above.\n%scala\r\n\r\nimport io.delta.tables.DeltaTable\r\n\r\nval df = spark.read.json(\"<path-to-JSON-file>\") \r\nDeltaTable.create()\r\n .addColumns(df.schema)\r\n .property(\"delta.minReaderVersion\", \"2\")\r\n .property(\"delta.minWriterVersion\", \"5\")\r\n .property(\"delta.columnMapping.mode\", \"name\")\r\n .location(\"<path-to-delta-table>\")\r\n .execute()\r\n\r\ndf.write.format(\"delta\").mode(\"append\").save(\"<path-to-delta-table>\")"
+ }
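A PySpark sketch equivalent to the Scala snippet in the article above, assuming a Databricks Runtime 10.2+ cluster (or the delta-spark package) so column mapping mode is available; both paths are placeholders.

from delta.tables import DeltaTable

# Read the JSON source whose nested column names contain spaces.
df = spark.read.json("<path-to-JSON-file>")

# Create the Delta table with column mapping mode enabled so the
# space-containing nested column names are allowed.
(
    DeltaTable.create(spark)
    .addColumns(df.schema)
    .property("delta.minReaderVersion", "2")
    .property("delta.minWriterVersion", "5")
    .property("delta.columnMapping.mode", "name")
    .location("<path-to-delta-table>")
    .execute()
)

df.write.format("delta").mode("append").save("<path-to-delta-table>")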
scraped_kb_articles/alter-table-command-only-reordering-on-metadata-level.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/data/alter-table-command-only-reordering-on-metadata-level",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nYou have a pre-existing Delta table and want to add a new column,\n'<new-column>'\n. When you use the following ALTER TABLE command, the reordering is only adjusted at the metadata level.\nALTER TABLE <table> ALTER COLUMN <new-column> AFTER <existing-column>;\nCause\nALTER TABLE\naffects the metadata level only. It does not change the physical layout of the parquet files underneath. Limiting its impact ensure metadata-only changes to the schema, such as renaming or dropping columns, do not incur costly rewrites of the underlying data files.\nFor more information, refer to the\nUpdate Delta Lake table schema\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nTo rewrite files with new column order taken into account, use either\nCREATE OR REPLACE TABLE\nor\nINSERT OVERWRITE TABLE\ninstead of\nALTER TABLE\n.\nCREATE OR REPLACE TABLE\nCREATE OR REPLACE TABLE <table> AS\r\nSELECT <existing-column>, <new-column>, <other-existing-column>, ... -- desired column order\r\nFROM <table>\nINSERT OVERWRITE TABLE\nINSERT OVERWRITE TABLE <table>\r\nSELECT <existing-column>, <new-column>, <other-existing-column>, ...\r\nFROM <table>"
+ }
scraped_kb_articles/alter-table-drop-partition-error-in-unity-catalog-external-tables.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/data-sources/alter-table-drop-partition-error-in-unity-catalog-external-tables",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nWhen you try to run\nALTER TABLE <catalog>.<schema>.<table-name> DROP PARTITION <partition column=partition value>\non a Unity Catalog external table, you encounter an error.\nSQL query error : [UC_COMMAND_NOT_SUPPORTED.WITHOUT_RECOMMENDATION] ALTER TABLE (drop partition) are not supported in Unity Catalog.\nCause\nUnity Catalog does not store table partition information, so\nDROP PARTITION\nis not supported on Unity Catalog tables.\nSolution\nFor Unity Catalog external tables with CSV, JSON, ORC, or Parquet data formats, use partition metadata logging to resolve the issue.\nSet the\nnonDelta.partitionLog.enabled\nto\ntrue\nfor the Apache Spark session while creating the table.\nSET spark.databricks.nonDelta.partitionLog.enabled = true;\nRe-create or create the table.\nRe-run\nALTER TABLE <catalog>.<schema>.<table-name> DROP PARTITION <partition column=partition value>\n.\nExample\nSet the config for partition metadata logging.\n%sql\r\n\r\nSET spark.databricks.nonDelta.partitionLog.enabled = true;\nCreate the table.\n%sql\r\n\r\nCREATE OR REPLACE TABLE <catalog>.<schema>.<table-name>\r\nUSING <format>\r\nPARTITIONED BY (<partition-column-list>)\r\nLOCATION 's3://<bucket-path>/<table-directory>';\nRe-run the command.\n%sql\r\n\r\nALTER TABLE <catalog>.<schema>.<table-name> DROP PARTITION <partition column=partition value>\nFor more information, please refer to the\nPartition discovery for external tables\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+ }
scraped_kb_articles/analysisexception-error-due-to-a-schema-mismatch.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/delta/analysisexception-error-due-to-a-schema-mismatch",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nYou are writing to a Delta table when you get an\nAnalysisException\nerror indicating a schema mismatch.\n'AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: bc10as3e-e12va-4f325-av10e-4s38f17vr3dd3)'. input_df.write.format(\"delta\").mode(\"overwrite\").save(target_delta_table_path)\nCause\nThe scheme mismatch is due to a change in the source schema and target schema. This usually happens when you introduce new columns to the target table during the write operation. This change in schema is a case of schema evolution. If these changes are expected, you should enable the\nmergeSchema\nproperty.\nSolution\nModify the write command and set the\nmergeSchema\nproperty to true.\ninput_df.write.format(\"delta\").mode(\"overwrite\").option(\"mergeSchema\", \"true\").save(target_delta_table_path)\nFor more information, please review the\nEnable schema evolution\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+ }
scraped_kb_articles/analysisexception-incompatible-format-detected-error-when-writing-to-opensearch-.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/delta/analysisexception-incompatible-format-detected-error-when-writing-to-opensearch-",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nYou are attempting to write dataframes into OpenSearch indices using the\norg.opensearch.client:opensearch-spark-30_2.12\nlibrary when you encounter the error\nAnalysisException: Incompatible format detected\n.\nCause\nThis error is caused by the presence of a\n_delta_log\nfolder in the root directory of the Databricks File System (DBFS).\nWhen Apache Spark detects this folder, it assumes the target path is a Delta table, leading to the\nAnalysisException\nerror when attempting to write non-Delta data using the OpenSearch format. The folder's presence causes Spark to misinterpret the target path.\nSolution\nVerify the presence of the\n_delta_log\nfolder in the root directory.\ndbutils.fs.ls(\"dbfs:/_delta_log/\")\nIf the folder is present, delete it to remove the conflict. Ensure you take a backup if necessary before deletion.\ndbutils.fs.rm(\"dbfs:/_delta_log/\", True)\nAfter deleting the\n_delta_log\nfolder, re-run the job to write the dataframe into OpenSearch using the specified function. Please adjust the parameters as required for your environment. The following is an example for an upsert condition in us-east-1 region.\ndef saveToElasticSearch(df):\r\n  df.write.format(\"org.opensearch.spark.sql\")\\\r\n    .option(\"opensearch.nodes\", openSearchDomainPath)\\\r\n    .option(\"opensearch.nodes.wan.only\", True)\\\r\n    .option(\"opensearch.port\",\"443\")\\\r\n    .option(\"opensearch.net.ssl\", \"true\")\\\r\n    .option(\"opensearch.aws.sigv4.enabled\", \"true\")\\\r\n    .option(\"opensearch.aws.sigv4.region\", \"us-east-1\")\\\r\n    .option(\"opensearch.batch.size.entries\",200)\\\r\n    .option(\"opensearch.mapping.id\",\"id\")\\\r\n    .option(\"opensearch.write.operation\", \"upsert\")\\\r\n    .mode(\"append\")\\\r\n .save(index)\nConfirm that the job now completes successfully.\nNote\nAs a preventive measure, avoid creating Delta tables at the root location in DBFS to prevent similar conflicts in the future. Regularly check and clean up any unintended folders that may interfere with your workflows.\nFor more information, please review the\nWhat is Delta Lake?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
+ }
scraped_kb_articles/analyze-statement-not-working-on-delta-live-tables-dlt.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/delta-live-tables/analyze-statement-not-working-on-delta-live-tables-dlt",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nWhen you use the Delta Live Tables (DLT) service to load data into streaming tables, in order to execute\nANALYZE\nstatements on Unity Catalog tables, you receive an error.\n\"PERMISSION_DENIED: MV/ST internal properties can only be updated through DLT. SQLSTATE: 42501\".\nCause\nANALYZE\nstatements are not supported for streaming tables.\nSolution\nWith Delta Live Tables, running\nANALYZE TABLE\nis not necessary. A DLT pipeline has a process that runs every 24 hours, called the maintenance job. This job performs optimization and vacuum on Delta tables that are part of the DLT pipeline."
+ }
scraped_kb_articles/analyze-your-overall-unity-catalog-resource-quota-usage-using-spark-sql.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/administration/analyze-your-overall-unity-catalog-resource-quota-usage-using-spark-sql",
+ "title": "Título do Artigo Desconhecido",
+ "content": ""
+ }
scraped_kb_articles/ansi-compliant-decimal-precision-and-scale.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/sql/ansi-compliant-decimal-precision-and-scale",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nYou are trying to cast a value of one or greater as a\nDECIMAL\nusing equal values for both precision and scale. A null value is returned instead of the expected value.\nThis sample code:\n%sql\r\n\r\nSELECT CAST (5.345 AS DECIMAL(20,20))\nReturns:\nCause\nThe\nDECIMAL\ntype (\nAWS\n|\nAzure\n|\nGCP\n) is declared as\nDECIMAL(precision, scale)\n, where precision and scale are optional.\nPrecision represents the total number of digits in the value of the variable. This includes the whole number part and the fractional part.\nThe scale represents the number of fractional digits in the value of the variable. Put simply, this is the number of digits to the right of the decimal point.\nFor example, the number 123.45 has a\nprecision\nof five (as there are five total digits) and a\nscale\nof two (as only two digits are on the right-hand side of the decimal point).\nWhen the precision and scale are equal, it means the value is less than one, as all digits are used for the fractional part of the number. For example,\nDECIMAL(20, 20)\ndefines a value with 20 digits and 20 digits to the right of the decimal point. All 20 digits are used to represent the fractional part of the number, with no digits used for the whole number.\nIf the precision in the value overflows the precision defined in the datatype declaration, null is returned instead of the fractional decimal value.\nSolution\nSet\nspark.sql.ansi.enabled\nto\ntrue\nin your cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n).\nThis enables Spark SQL ANSI compliance.\nFor more information, review the\nANSI compliance\ndocumentation.\nDelete\nInfo\nYou can also set this value at the notebook level using\nspark.conf.set(\"spark.sql.ansi.enabled\", \"True\")\nin a Python cell if you don't have the ability to edit the cluster's\nSpark config\n.\nOnce ANSI compliance is enabled, passing incorrect precision and scale values returns an error indicating the correct value.\nFor example, this sample code:\n%sql\r\n\r\nSELECT CAST (5.345 AS DECIMAL(20,20))\nReturns this error message:\nError in SQL statement: SparkArithmeticException: [CANNOT_CHANGE_DECIMAL_PRECISION] Decimal(expanded, 5.345, 4, 3) cannot be represented as Decimal(20, 20).\nHowever, this sample code:\nSELECT CAST (5.345 AS DECIMAL(4,3))\nReturns the expected result:"
+ }
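A short Python reproduction of the behaviour described above, assuming a Spark session is available; with ANSI mode off the overflowing cast returns NULL, and with a precision/scale that fits the value the expected result comes back.

# With ANSI compliance disabled, DECIMAL(20,20) cannot hold 5.345, so NULL is returned.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST(5.345 AS DECIMAL(20,20)) AS v").show()

# Enable ANSI compliance so an overflowing cast raises an error instead of returning NULL.
spark.conf.set("spark.sql.ansi.enabled", "true")

# A precision/scale that fits the value returns the expected result: 5.345.
spark.sql("SELECT CAST(5.345 AS DECIMAL(4,3)) AS v").show()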
scraped_kb_articles/apache-airflow-triggered-jobs-terminating-before-completing.json ADDED
@@ -0,0 +1,5 @@
+ {
+ "url": "https://kb.databricks.com/en_US/jobs/apache-airflow-triggered-jobs-terminating-before-completing",
+ "title": "Título do Artigo Desconhecido",
+ "content": "Problem\nYou have a job triggered through Apache Airflow using the\nDatabricksRunNowOperator\nthat cancels after running for X hours (where X is the timeout value set through Airflow) even though the job is not complete.\nCause\nThe Airflow\nDatabricksRunNowOperator\noperator uses the X hours configuration to determine the job’s length and when to stop, regardless of whether the job is complete.\nIf you have access to audit logs, you can see the cancellation request is sent by the Airflow operator, confirming the issue lies in the Airflow configuration rather than the Databricks job settings.\nAudit logs snippet\n{\"version\":\"2.0\",\"auditLevel\":\"WORKSPACE_LEVEL\",\"timestamp\":1746505827573,\"orgId\":\"<org-id>\",\"shardName\":\"<shard-name>\",\"accountId\":\"xxxxxxxxxxxxxxxx\",\"sourceIPAddress\":\"<source-ip-address>\",\"userAgent\":\"databricks-airflow/6.7.0 _/0.0.0 python/3.11.11 os/linux airflow/2.9.3+astro.11 operator/DatabricksRunNowOperator\",\"sessionId\":null,\"userIdentity\":{\"email\":\"<email>\",\"subjectName\":null},\"principal\":{\"resourceName\":\"accounts/xxxxxxxxxxxxxxxx\"/users/<user>\",\"uniqueName\":\"<email>\",\"contextId\":\"<context-id>\",\"displayName\":\"Data Engineering\"},\"authorizeAs\":{\"resourceName\":\"accounts/xxxxxxxxxxxxxxxx\"/users/<user>\",\"uniqueName\":\"<email>\",\"displayName\":\"Data Engineering\",\"activatingResourceName\":null},\"serviceName\":\"jobs\",\"actionName\":\"cancel\",\"requestId\":\"<request-id>\",\"requestParams\":{\"run_id\":\"<run-id>\"},\"response\":{\"statusCode\":200,\"errorMessage\":null,\"result\":\"{}\"}}\nNote\nf you do not have audit logs configured for your workspace and you are on a premium plan or above, you can follow the instructions in the\nAudit log reference\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation to configure them.\nSolution\nIncrease the job timeout threshold on the Airflow side.\nReview the Airflow Directed Acyclic Graph (DAG) that triggers the Databricks job. Look for the\nDatabricksRunNowOperator\ntask and check its configuration.\nAdjust the parameter in\nDatabricksRunNowOperator\nthat controls the timeout to a value beyond four hours.\nUpdate your Airflow DAG with the adjusted timeout parameter and deploy the changes.\nAfter updating the DAG, trigger a new run and monitor the job to ensure it runs beyond the previous four-hour limit without being terminated.\nFor more information, review the Airflow\nTasks\ndocumentation."
5
+ }
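As a hedged illustration of the fix above, the sketch below shows where the Airflow-side timeout lives on the operator. The DAG id, job id, connection id, and the eight-hour value are placeholders rather than values from the article; one common way to raise the limit is the BaseOperator-level execution_timeout shown here.

# Hedged sketch (not from the article): raise the Airflow-side timeout so the
# Databricks run is not cancelled prematurely. All identifiers are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="databricks_run_now_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    run_job = DatabricksRunNowOperator(
        task_id="run_databricks_job",
        databricks_conn_id="databricks_default",
        job_id=12345,  # hypothetical Databricks job ID
        # BaseOperator-level timeout: when this fires, Airflow kills the task and the
        # operator cancels the Databricks run, as seen in the audit log snippet above.
        execution_timeout=timedelta(hours=8),
    )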
scraped_kb_articles/apache-spark-driver-failing-with-unexpected-stop-and-restart-message.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/cloud/apache-spark-driver-failing-with-unexpected-stop-and-restart-message",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nTypically to accommodate a memory-intensive workload and avoid out-of-memory (OOM) errors, you scale up the cluster node’s memory.\nNote\nIf you’re looking for more information on scaling up, review the knowledge base article,\nSpark job fails with Driver is temporarily unavailable\n.\nAfter scaling up, you notice the driver still fails with an unexpected stop and restart message.\nThe spark driver has stopped unexpectedly and is restarting.\nWhile investigating, you notice a high frequency of garbage collection (GC) events which can be verified on the driver's log4j. ​\n24/11/07 00:32:45 WARN DBRDebuggerEventReporter: Driver/10.XX.XX.XX paused the JVM process 81 seconds during the past 120 seconds (67.76%) because of GC. We observed 3 such issue(s) since 2024-11-07T00:26:26.301Z.\nAdditionally, the driver's\nstdout\nmay show\nfull GC\nmessages.\n[Full GC (Metadata GC Threshold) [PSYoungGen: 38397K->0K(439808K)] [ParOldGen: 351239K->108115K(1019904K)] 389636K->108115K(1459712K), [Metaspace: 252946K->252852K(1290240K)], 46.2830875 secs] [Times: user=0.74 sys=0.76, real=46.28 secs]\nCause\nWhen scaling up a cluster’s memory doesn’t solve the memory issue, it rules out available memory and instead becomes a GC issue, which is often silent.\nGC causes the driver to pause Java virtual machine (JVM) applications. If there is more memory potentially available to use, the GC takes more time to scan all objects and free that memory. These long pauses can lead to a forced restart of the machine.\nSolution\nRefactor your code to use less memory at once. You can use the Apache Spark parallelization pipeline, which helps to distribute the load between the cluster's nodes and avoid memory issues.\nIf your workload doesn’t use the Spark API, Databricks recommends partitioning non-API execution types. Consider iterating over objects and reusing references instead of instantiating many heavy objects at the same time during processing."
5
+ }
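As a rough illustration of the refactoring suggested above, the sketch below contrasts driver-side accumulation with a distributed transformation. The table, column, and expression names are placeholders; the point is that the executors, not the driver, hold and process the data.

# Hedged sketch (placeholders throughout): keep heavy per-row work on the executors.

# Anti-pattern: everything is pulled into the driver process, where the resulting
# object graph drives up GC time.
#   rows = spark.table("<your-table>").collect()
#   results = [do_heavy_work(r) for r in rows]

# Distributed alternative: express the work as DataFrame transformations so each
# executor only materializes its own partition.
from pyspark.sql import functions as F

result_df = (
    spark.table("<your-table>")
         .withColumn("result", F.upper(F.col("<your-column>")))  # stand-in for the real logic
)
result_df.write.mode("overwrite").saveAsTable("<your-output-table>")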
scraped_kb_articles/apache-spark-is-configured-to-suppress-info-statements-but-they-overwhelm-logs-anyway.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/clusters/apache-spark-is-configured-to-suppress-info-statements-but-they-overwhelm-logs-anyway",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nYou receive\nINFO\nstatements despite configuring the Apache Spark settings to suppress\nINFO\nand give specific\nWARN\nstatements. This issue is observed even after setting the '\npy4j'\nlogger to\nWARN\nand configuring the logging in the Spark config in the Databricks UI to\nWARN\n.\nThe problem persists, leading to an overflow of\nINFO\nlogs, which can be problematic when integrating with monitoring tools like DataDog.\nCause\nConfiguring Spark settings to suppress\nINFO\nlogs does not override the default\nlog4j2\nsettings in the Databricks cluster, which control logging behavior at a more granular level. These default\nlog4j2\nsettings may still allow\nINFO\nlog generation.\nAdditionally, the integration with DataDog may not respect the Spark configuration settings, leading to the continued generation of\nINFO\nlogs.\nSolution\nModify the\nlog4j2\nconfiguration file directly within the Databricks environment.\n1. Use an init script that updates the\nlog4j2.xml\nfile to suppress\nINFO\nlogs.\n#!/bin/bash\r\nset -e  # Exit script on any error\r\n# Define the log4j2 configuration file path (modify if needed)\r\nLOG4J2_PATH=\"/databricks/spark/dbconf/log4j/driver/log4j2.xml\"\r\n# Modify the log4j2 configuration file\r\necho \"Updating log4j2.xml to suppress INFO logs\"\r\nsed -i 's/level=\"INFO\"/level=\"WARN\"/g' $LOG4J2_PATH\r\necho \"Completed log4j2 config changes at `date`\"\n2. Upload the init script to the\nWorkspace Files\n. (You can create a\n.sh\nfile in the workspace files folder, add the contents of the script to the\n.sh\nfile and use the init script on the cluster)\n3. Configure the cluster to use the init script by setting it in the\nInit Scripts\ntab.\n\"destination\": \"Workspace\"\r\n\"/Users/<your-workspace-folder>/log4j_warn.sh\"\n4. Restart the cluster to apply the changes."
5
+ }
scraped_kb_articles/apache-spark-job-failing-with-gc-overhead-limit-exceeded-error.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/scala/apache-spark-job-failing-with-gc-overhead-limit-exceeded-error",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nWhen you try to execute an Apache Spark job, it fails with a\nGC overhead limit exceeded\nerror.\njava.lang.OutOfMemoryError: GC overhead limit exceeded. The excessive GC triggered by join conditions results in the job being aborted with the error: Exception: Job aborted due to stage failure: Task in stage failed X times, most recent failure: Lost task in stage (TID) (executor): java.lang.OutOfMemoryError: GC overhead limit exceeded\nCause\nNULL values in join keys can result in a high number of unmatched rows, and a high number of unmatched rows in turn increases memory. Duplicate records in join keys can increase the amount of data that needs to be shuffled and sorted during a sort-merge join.\nThese factors can lead to excessive memory usage and extended garbage collection (GC) activity, which ultimately triggers the\njava.lang.OutOfMemoryError: GC overhead limit exceeded\nerror.\nSolution\nFirst, analyze join columns. Check for NULL values and duplicates in your join keys to identify potential sources of the error.\nSELECT COUNT(*) FROM <your-dataset-1> WHERE <your-join-key-1> IS NULL OR <your-join-key-2> IS NULL;\r\nSELECT COUNT(*) FROM <your-dataset-2> WHERE <your-join-key-1> IS NULL OR <your-join-key-2> IS NULL;\nThen, deduplicate join keys. Minimize data size during joins by removing duplicate rows based on the join keys.\ndf1 = df1.dropDuplicates([\"<your-join-key-1>\", \"<your-join-key-2>\"])\r\ndf2 = df2.dropDuplicates([\"<your-join-key-1>\", \"<your-join-key-2>\"])"
5
+ }
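In addition to the deduplication shown above, rows whose join keys are NULL can be dropped before the join. The sketch below reuses the article's placeholder names; whether dropping those rows is acceptable depends on the join semantics you need.

# Hedged sketch: remove NULL join keys (they cannot match in an inner join anyway)
# and deduplicate before the sort-merge join. Names are placeholders from the article.
df1_clean = (
    df1.dropna(subset=["<your-join-key-1>", "<your-join-key-2>"])
       .dropDuplicates(["<your-join-key-1>", "<your-join-key-2>"])
)
df2_clean = (
    df2.dropna(subset=["<your-join-key-1>", "<your-join-key-2>"])
       .dropDuplicates(["<your-join-key-1>", "<your-join-key-2>"])
)

joined = df1_clean.join(
    df2_clean,
    on=["<your-join-key-1>", "<your-join-key-2>"],
    how="inner",
)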
scraped_kb_articles/apache-spark-job-failing-with-sparkexception-job-aborted-due-to-stage-failure-error-on-dedicated-compute.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/jobs/apache-spark-job-failing-with-sparkexception-job-aborted-due-to-stage-failure-error-on-dedicated-compute",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nWhen you try to run a job on a dedicated compute, it fails with the following error.\nSparkException: Job aborted due to stage failure: Total size of serialized results of 2817 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.\nThe job fails even after increasing the\nspark.driver.maxResultSize\nand\ndriver memory\nto higher value.\nStacktrace\nat scala.collection.AbstractIterator.foreach(Iterator.scala:1431)\r\n\r\nat com.databricks.sql.execution.arrowcollect.RDDBatchCollector.runSparkJobs(RDDBatchCollector.scala:261)\r\n\r\nat com.databricks.sql.execution.arrowcollect.RDDBatchCollector.collect(RDDBatchCollector.scala:347)\r\n\r\nat com.databricks.sql.execution.arrowcollect.CloudStoreCollector$.hybridCollect(CloudStoreCollector.scala:159)\r\n\r\nat com.databricks.sql.execution.arrowcollect.CloudStoreCollector$.hybridCollect(CloudStoreCollector.scala:206)\r\n\r\nat org.apache.spark.sql.execution.qrc.CompressedHybridCloudStoreFormat.collect(cachedSparkResults.scala:170)\r\n\r\nat org.apache.spark.sql.execution.qrc.CompressedHybridCloudStoreFormat.collect(cachedSparkResults.scala:160)\r\n\r\nat org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.processAsRemoteBatches(SparkConnectPlanExecution.scala:475)\r\n\r\nat org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.handlePlan(SparkConnectPlanExecution.scala:141)\nCause\nWhen Fine-Grained Access Control (FGAC) is enabled, queries involving restricted data, such as those protected by row-level security, column masking, or secure views, are offloaded to serverless compute for enforcement. The resulting data must then be fully materialized and transferred back to the dedicated cluster’s driver.\nWhen the query spans a large number of small partitions, Apache Spark triggers an optimized execution path where executors send results directly to the driver, which aggregates and uploads them to cloud storage. If the total serialized result exceeds Spark’s internal 4 GiB driver-side limit, the job fails deterministically, regardless of driver memory or\nspark.driver.maxResultSize\nsettings.\nFor details, refer to the\nFine-grained access control on dedicated comput\ne (\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nExecute queries involving FGAC on standard compute, where data filtering and access control enforcement are handled within the same compute environment."
5
+ }
scraped_kb_articles/apache-spark-job-output-only-giving-the-first-json-object-instead-of-all-records.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/execution/apache-spark-job-output-only-giving-the-first-json-object-instead-of-all-records",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nIn your Apache Spark jobs, you notice some JSON files are processed incorrectly, leading to output containing only the first JSON object instead of all the records in the file.\nCause\nYou’re missing newline characters to separate each JSON record. Without them, Spark reads only the first JSON record from a file.\nExample of single-line JSON records without newline separators\n{\"col_1\":\"us-east-1\",\"col_3\":\"prod\"}{\"col_1\":\"us-east-2\",\"col_3\":\"dev\"}{\"col_1\":\"us-east-3\",\"col_3\":\"stage\"}\nExample of JSON records with newline separators\n{\"col_1\":\"us-east-1\",\"col_3\":\"prod\"}  \r\n{\"col_1\":\"us-east-2\",\"col_3\":\"dev\"}  \r\n{\"col_1\":\"us-east-3\",\"col_3\":\"stage\"}\nSolution\nAdd newline characters in your source file to separate each JSON object and make sure Spark can read them correctly.\nAlternatively, enable Photon runtime to leverage its ability to handle single-line JSON objects without newlines."
5
+ }
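One way to add the missing separators is a small pre-processing step before Spark reads the files. The sketch below is illustrative only: the local paths are hypothetical, and the simple regex assumes no "}{" sequence occurs inside string values.

# Hedged sketch: rewrite concatenated JSON objects as newline-delimited JSON,
# then let Spark read one record per line. Paths are placeholders.
import re

with open("/tmp/records.json") as f:            # hypothetical copy of the source file
    raw = f.read()

ndjson = re.sub(r"\}\s*\{", "}\n{", raw)         # {"a":1}{"b":2} -> one object per line

with open("/tmp/records_ndjson.json", "w") as f:
    f.write(ndjson)

df = spark.read.json("file:/tmp/records_ndjson.json")
df.show()                                        # now returns every record, not just the first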
scraped_kb_articles/apache-spark-jobs-fail-with-environment-directory-not-found-error.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/libraries/apache-spark-jobs-fail-with-environment-directory-not-found-error",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nAfter you install a Python library (via the cluster UI or by using\npip\n), your Apache Spark jobs fail with an\nEnvironment directory not found\nerror message.\norg.apache.spark.SparkException: Environment directory not found at\r\n/local_disk0/.ephemeral_nfs/cluster_libraries/python\nCause\nLibraries are installed on a Network File System (NFS) on the cluster's driver node. If any security group rules prevent the workers from communicating with the NFS server, Spark commands cannot resolve the Python executable path.\nSolution\nYou should make sure that your security groups are configured with appropriate security rules (\nAWS\n|\nAzure\n|\nGCP\n)."
5
+ }
scraped_kb_articles/apache-spark-jobs-failing-due-to-stage-failure-when-using-spot-instances-in-a-cluster.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/execution/apache-spark-jobs-failing-due-to-stage-failure-when-using-spot-instances-in-a-cluster",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nWhen using spot instances in your cluster, your Apache Spark jobs fail due to stage failures.\n\"org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ResultStage 2923 to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again.\"\nCause\nSpot instances can be preempted, leading to the loss of nodes in the cluster. When nodes are lost, the shuffle map stage fails and Spark cannot rollback the\nResultStage\nto re-process input data.\nSolution\nUse on-demand nodes instead of spot instances. In the cluster configuration, navigate to the\nAdvanced\ntab and slide the slider to the extreme right to select on-demand nodes for workers.\nAlternatively, if you want to continue to use spot instances, you can decrease the chance of data loss by enabling Spark decommissioning. Decommissioning allows migration of data before spot node preemption.\nImportant\nDecommissioning is a best effort and does not guarantee that all data can be migrated before final preemption. Decommissioning cannot guarantee against shuffle fetch failures when running tasks are fetching shuffle data from the executor.\nTo decommission, add the following configurations to the cluster configuration under\nAdvanced options > Spark\n.\nspark.decommission.enabled true\r\nspark.storage.decommission.enabled true\r\nspark.storage.decommission.shuffleBlocks.enabled true\r\nspark.storage.decommission.rddBlocks.enabled true\r\n\r\nAdditionally, under Advanced options > Environment, add:\r\nSPARK_WORKER_OPTS=\"-Dspark.decommission.enabled=true\""
5
+ }
scraped_kb_articles/apache-spark-pyspark-job-using-a-python-threading-api-function-taking-hours-instead-of-minutes.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/python/apache-spark-pyspark-job-using-a-python-threading-api-function-taking-hours-instead-of-minutes",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nYour Apache Spark PySpark job using the following Python threading API function\nThreadPoolExecutor()\ntakes over an hour to complete, instead of minutes.\nwith ThreadPoolExecutor(max_workers=MAX_THREAD_NUM) as executor:\r\n    executor.map(thread_process_partition, cid_partitions)\nCause\nWhen using Python threads, the driver node becomes overwhelmed, leading to inefficient task distribution and underutilization of worker nodes.\nSolution\nUse the Databricks Spark connector instead of threading. For more information, review the\nConnect to external systems\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nEnsure that the cluster configuration is optimized for the workload. This includes adjusting the number of worker nodes and their specifications to match the job requirements. For cluster sizing guidance, review the\nCompute configuration recommendations\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nPreventative measures\nImplement best practices for Spark job optimization, such as caching intermediate results, using broadcast variables, and avoiding shuffles where possible. For more information, review the\nComprehensive Guide to Optimize Databricks, Spark and Delta Lake Workloads\n.\nMonitor and adjust the job's execution plan, using the Spark UI to identify and address any bottlenecks. Refer to the\nDebugging with the Apache Spark UI\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation for details."
5
+ }
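The sketch below is one hedged way to express the per-cid work as a Spark operation instead of driver-side threads. The source table, the cid column, and the body of the processing function are placeholders standing in for whatever thread_process_partition did.

# Hedged sketch (placeholders throughout): let Spark distribute the per-cid work
# across executors instead of fanning it out from the driver with ThreadPoolExecutor.
import pandas as pd

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the logic previously run per partition in thread_process_partition.
    return pd.DataFrame({"cid": [pdf["cid"].iloc[0]], "row_count": [len(pdf)]})

result = (
    spark.table("<your-source-table>")
         .groupBy("cid")
         .applyInPandas(process_group, schema="cid long, row_count long")
)
result.write.mode("overwrite").saveAsTable("<your-output-table>")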
scraped_kb_articles/apache-spark-submit-job-clusters-do-not-terminate-after-scstop.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/python/apache-spark-submit-job-clusters-do-not-terminate-after-scstop",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nYour Apache Spark Submit jobs remain active even after invoking\nsc.stop()\n, and the underlying job cluster does not shut down as expected.\nCause\nThe cluster still has non-daemon threads running, preventing shutdown.\nContext\nThe\nsc.stop()\ncommand in Spark stops the SparkContext, which is responsible for managing the resources for a Spark application. However, it does not necessarily terminate the job cluster.\nThe termination of a job cluster depends on various factors including the specific configurations and the termination policies set for the cluster.\nIn Spark Submit jobs, the job will not exit until the Spark Submit JVM shuts down. This shutdown normally occurs in one of two ways.\nExplicitly, when the JVM’s\nSystem.exit()\nmethod is explicitly invoked somewhere in the code.\nWhen all non-daemon threads have exited. A non-daemon thread ensures that the JVM waits for its completion before exiting, making it suitable for critical tasks that must be finished before the application terminates. This type of thread is crucial for maintaining data integrity and ensuring that important operations are fully executed.\nSome non-daemon threads can only be stopped when\nSparkContext.stop()\nis called.\nSome non-daemon threads are only cleaned up in a JVM shutdown hook.\nSolution\nTo ensure the Spark Submit job terminates properly, explicitly invoke\nSystem.exit(0)\nafter\nSparkContext.stop()\n.\nPython\nimport sys\r\nfrom pyspark.sql import SparkSession\r\nsc = SparkSession.builder.getOrCreate().sparkContext  # Or otherwise obtain handle to SparkContext\r\nrunTheRestOfTheUserCode()\r\n# Fall through to exit with code 0 in case of success, since failure will throw an uncaught exception\r\n# and won't reach the exit(0) and thus will trigger a non-zero exit code that will be handled by\r\n# PythonRunner\r\nsc._gateway.jvm.System.exit(0)\nScala\ndef main(args: Array[String]): Unit = {\r\n  try {\r\n    runTheRestOfTheUserCode() // The actual application logic\r\n  } catch {\r\n    case t: Throwable =>\r\n      try {\r\n        // Log the throwable or error here\r\n      } finally {\r\n        System.exit(1)\r\n      }\r\n  }\r\n  System.exit(0)\r\n}"
5
+ }
scraped_kb_articles/apache-spark-ui-task-logs-intermittently-return-http-500-error.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/clusters/apache-spark-ui-task-logs-intermittently-return-http-500-error",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nUsers of Shared access mode clusters experience intermittent HTTP 500 errors when trying to view task logs in the Apache Spark UI. This also applies to admins.\nErrorCaused by:java.lang.Exception: Log viewing is disabled on this cluster\r\n    at org.apache.spark.deploy.worker.ui.LogPage.render(LogPage.scala:65)\r\n    at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:100)\r\n    at org.apache.spark.ui.WebUI$$anonfun$3.apply(WebUI.scala:100)\r\n    at org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90)\r\n    at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)\r\n    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)\r\n    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)\r\n    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)\r\n    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)\r\n    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)\r\n    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)\r\n    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\r\n    at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)\r\n    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\r\n    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\r\n    at org.eclipse.jetty.server.Server.handle(Server.java:534)\r\n    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)\r\n    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)\nCause\nThis specific exception is controlled by the\nspark.databricks.ui.logViewingEnabled\nSpark property. When this value is set to\nfalse\n, log viewing is disabled. When Spark log viewing is disabled on the cluster, the Spark UI generates an error when you attempt to view the logs.\nThe\nspark.databricks.ui.logViewingEnabled\nproperty defaults to\ntrue\n, however sometimes other Spark configurations (such as\nspark.databricks.acl.dfAclsEnabled\n) can alter its value and set it to\nfalse\n.\nSolution\nSet\nspark.databricks.ui.logViewingEnabled\nto\ntrue\nin the cluster's\nSpark config\n(\nAWS\n|\nAzure\n|\nGCP\n).\nspark.databricks.ui.logViewingEnabled true\nThis restores the default configuration in case it is accidentally overwritten."
5
+ }
scraped_kb_articles/append-a-row-to-rdd-or-dataframe.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/data/append-a-row-to-rdd-or-dataframe",
3
+ "title": "Unknown Article Title",
4
+ "content": "To append to a DataFrame, use the\nunion\nmethod.\n%scala\r\n\r\nval firstDF = spark.range(3).toDF(\"myCol\")\r\nval newRow = Seq(20)\r\nval appended = firstDF.union(newRow.toDF())\r\ndisplay(appended)\n%python\r\n\r\nfirstDF = spark.range(3).toDF(\"myCol\")\r\nnewRow = spark.createDataFrame([[20]])\r\nappended = firstDF.union(newRow)\r\ndisplay(appended)"
5
+ }
scraped_kb_articles/append-output-not-supported-no-watermark.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/streaming/append-output-not-supported-no-watermark",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nYou are performing an aggregation using append mode and an exception error message is returned.\nAppend output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark\nCause\nYou cannot use append mode on an aggregated DataFrame without a watermark. This is by design.\nSolution\nYou must apply a watermark to the DataFrame if you want to use append mode on an aggregated DataFrame.\nThe aggregation must have an event-time column, or a window on the event-time column.\nGroup the data by window and word and compute the count of each group.\n.withWatermark()\nmust be called on the same column as the timestamp column used in the aggregation. The example code shows how this can be done.\nReplace the value\n<type>\nwith the type of element you are processing. For example, you would use Row if you are processing by row.\nReplace the value\n<words>\nwith the streaming DataFrame of schema { timestamp: Timestamp, word: String }.\n%java\r\n\r\nDataset<type> windowedCounts = <words>\r\n    .withWatermark(\"timestamp\", \"10 minutes\")\r\n    .groupBy(\r\n        functions.window(words.col(\"timestamp\"), \"10 minutes\", \"5 minutes\"),\r\n        words.col(\"word\"))\r\n    .count();\n%python\r\n\r\nwindowedCounts = <words> \\\r\n    .withWatermark(\"timestamp\", \"10 minutes\") \\\r\n    .groupBy(\r\n        window(words.timestamp, \"10 minutes\", \"5 minutes\"),\r\n        words.word) \\\r\n    .count()\n%scala\r\n\r\nimport spark.implicits._\r\n\r\nval windowedCounts = <words>\r\n    .withWatermark(\"timestamp\", \"10 minutes\")\r\n    .groupBy(\r\n        window($\"timestamp\", \"10 minutes\", \"5 minutes\"),\r\n        $\"word\")\r\n    .count()\nYou must call\n.withWatermark()\nbefore you perform the aggregation. Attempting otherwise fails with an error message. For example,\ndf.groupBy(\"time\").count().withWatermark(\"time\", \"1 min\")\nreturns an exception.\nPlease refer to the Apache Spark documentation on\nconditions for watermarking to clean the aggregation slate\nfor more information."
5
+ }
scraped_kb_articles/applyinpandaswithstate-fails-with-a-modulenotfounderror-when-used-with-delta-live-tables.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/delta-live-tables/applyinpandaswithstate-fails-with-a-modulenotfounderror-when-used-with-delta-live-tables",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nYou are trying to use\napplyInPandasWithState\nwith Delta Live Tables but execution fails with a\nModuleNotFoundError: No module named 'helpers'\nerror message.\nExample error\nTraceback (most recent call last):\r\n File \"/databricks/spark/python/pyspark/worker.py\", line 1964, in main\r\n   func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r\n File \"/databricks/spark/python/pyspark/worker.py\", line 1770, in read_udfs\r\n   arg_offsets, f = read_single_udf(\r\n                    ^^^^^^^^^^^^^^^^\r\n File \"/databricks/spark/python/pyspark/worker.py\", line 802, in read_single_udf\r\n   f, return_type = read_command(pickleSer, infile)\r\n                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r\n File \"/databricks/spark/python/pyspark/worker_util.py\", line 70, in read_command\r\n   command = serializer._read_with_length(file)\r\n             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r\n File \"/databricks/spark/python/pyspark/serializers.py\", line 196, in _read_with_length\r\n   raise SerializationError(\"Caused by \" + traceback.format_exc())\r\npyspark.serializers.SerializationError: Caused by Traceback (most recent call last):\r\n File \"/databricks/spark/python/pyspark/serializers.py\", line 192, in _read_with_length\r\n   return self.loads(obj)\r\n          ^^^^^^^^^^^^^^^\r\n File \"/databricks/spark/python/pyspark/serializers.py\", line 572, in loads\r\n   return cloudpickle.loads(obj, encoding=encoding)\r\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r\nModuleNotFoundError: No module named 'helpers'\nCause\nApplyInPandasWithState\ndoes not work correctly when used with Delta Live Tables if you define the function you want to use outside of your notebook.\nIn this case, we are trying to use the\ncount_fn\nfunction that is defined in the\nhelpers.streaming.functions\nmodule. It is imported at the start of the example code block and then called as part of\napplyInPandasWithState\n. This results in an error.\nExample code\n%python\r\n\r\nimport pandas as pd\r\nfrom pyspark.sql.functions import col\r\nfrom pyspark.sql.streaming.state import GroupStateTimeout\r\nfrom helpers.streaming.functions import count_fn\r\nfrom pyspark.sql.functions import udf\r\n\r\ndf = (\r\n    spark.readStream.format(\"rate\")\r\n    .option(\"rowsPerSecond\", \"100\")\r\n    .load()\r\n    .withColumn(\"id\", col(\"value\"))\r\n    .groupby(\"id\")\r\n    .applyInPandasWithState(\r\n        func=count_fn,\r\n        outputStructType=\"id long, countAsString string\",\r\n        stateStructType=\"len long\",\r\n        outputMode=\"append\",\r\n        timeoutConf=GroupStateTimeout.NoTimeout,\r\n    )\r\n)\r\nimport dlt\r\nimport time\r\n\r\n@dlt.table(name=f\"random_{int(time.time())}\")\r\ndef a():\r\n  return df\nSolution\nYou should define the function you want to use within the notebook, reimporting the function you want to call as part of your custom function. Call the function you defined and it completes as expected.\nThis custom function imports\ncount_fn\nand runs it. 
By adding this to the sample code, and calling\nmy_func\ninstead of calling\ncount_fn\ndirectly, the example code successfully completes.\ndef my_func(*args):\r\n    from helpers.streaming.functions import count_fn\r\n    return count_fn(*args)\nExample code\n%python\r\n\r\nimport pandas as pd\r\nfrom pyspark.sql.functions import col\r\nfrom pyspark.sql.streaming.state import GroupStateTimeout\r\nfrom helpers.streaming.functions import count_fn\r\nfrom pyspark.sql.functions import udf\r\n\r\ndef my_func(*args):\r\n    from helpers.streaming.functions import count_fn\r\n    return count_fn(*args)\r\n\r\ndf = (\r\n    spark.readStream.format(\"rate\")\r\n    .option(\"rowsPerSecond\", \"100\")\r\n    .load()\r\n    .withColumn(\"id\", col(\"value\"))\r\n    .groupby(\"id\")\r\n    .applyInPandasWithState(\r\n        func=my_func,\r\n        outputStructType=\"id long, countAsString string\",\r\n        stateStructType=\"len long\",\r\n        outputMode=\"append\",\r\n        timeoutConf=GroupStateTimeout.NoTimeout,\r\n    )\r\n)\r\nimport dlt\r\nimport time\r\n\r\n@dlt.table(name=f\"random_{int(time.time())}\")\r\ndef a():\r\n  return df"
5
+ }
scraped_kb_articles/arcgis-library-installation-fails-with-subprocess-exited-with-error.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/libraries/arcgis-library-installation-fails-with-subprocess-exited-with-error",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nYou are trying to pip install the ArcGIS library on a cluster and get the following error.\nline 159, in _get_build_requires\r\n          self.run_setup()\r\n        File \"/databricks/python/lib/python3.9/site-packages/setuptools/build_meta.py\", line 174, in run_setup\r\n          exec(compile(code, __file__, 'exec'), locals())\r\n        File \"setup.py\", line 109, in <module>\r\n          link_args = shlex.split(get_output(f\"{kc} --libs gssapi\"))\r\n        File \"setup.py\", line 22, in get_output\r\n          res = subprocess.check_output(*args, shell=True, **kwargs)\r\n        File \"/usr/lib/python3.9/subprocess.py\", line 424, in check_output\r\n          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,\r\n        File \"/usr/lib/python3.9/subprocess.py\", line 528, in run\r\n          raise CalledProcessError(retcode, process.args,\r\n      subprocess.CalledProcessError: Command 'krb5-config --libs gssapi' returned non-zero exit status 127.\r\n      [end of output]\r\n  \r\n  note: This error originates from a subprocess, and is likely not a problem with pip.\r\nerror: subprocess-exited-with-error\r\n\r\n× Getting requirements to build wheel did not run successfully.\r\n│ exit code: 1\r\n╰─> See above for output.\nCause\nThe ArcGIS package requires native components that link against Kerberos (GSSAPI). During installation, the ArcGIS package invokes\nkrb5-config\n, which is missing from the Databricks environment.\nkrb5-config\nis provided by the\nlibkrb5-dev\npackage, and its absence causes the installation to fail with a\nsubprocess-exited-with-error\n.\nSolution\nFrom the UI, create a cluster-scoped init script to install the required\nlibkrb5-dev\nbinary package. The script is stored in your workspace filesystem, though you may optionally store it in S3 or a Unity Catalog volume.\nFrom your workspace landing page, navigate to your home folder.\nClick\nCreate\nin the top-right corner of the page. Then select\nFile\n.\nAdd the following script to the editor and save the file as\narcgis_requirements.sh\n#!/bin/bash\r\n\r\n# Install all the subdependencies packages required for ArcGIS\r\n# Remove all cached package lists to ensure a fresh update.\r\nsudo rm -rf /var/lib/apt/lists/*\r\n# Update the local package index to get the latest package information.\r\nsudo apt-get -y update\r\n\r\n# Install required packages for Kerberos\r\nsudo apt install -y libkrb5-dev\nNext, attach the init script to your cluster.\nNavigate to\nCompute\n> your cluster and click\nEdit\nto edit the cluster.\nExpand the\nAdvanced options\nand click the\nInit scripts\ntab.\nIn the\nSource\ndrop-down, select\nWorkspace\n.\nIn the\nFile\npath\n, select the path to the script\narcgis_requirements.sh\nClick\nAdd\n, then\nConfirm\nand restart\n.\nLast, install ArcGIS from the cluster UI.\nOn the cluster page, click\nLibraries\n. Then click\nInstall new\n.\nUnder\nLibrary Source\n, select\nPyPi\n.\nProvide the package name as\narcgis==<version>\nClick\nInstall\n.\nFor additional information on init scripts, review the\nCluster-scoped init scripts\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nFor additional information on ArcGIS dependencies, refer to the esri Developer\nSystem requirements\ndocumentation."
5
+ }
scraped_kb_articles/attempting-to-connect-to-sharepoint-from-a-databricks-notebook-gives-a-404-not-found-error.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/security/attempting-to-connect-to-sharepoint-from-a-databricks-notebook-gives-a-404-not-found-error",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nWhen attempting to connect to SharePoint from a Databricks notebook, you encounter a 404 'Not Found' error, even though you have the correct permissions and a valid folder path.\nThis error specifically occurs when trying to access a file from SharePoint, for example,\nhttps://xxxx.sharepoint.com/<path-to-the-file>\n, suggesting that the issue is more likely related to connectivity or URL formatting rather than permissions.\nCause\nYour code uses URL-encoded characters (for example, %20 for spaces) in the file path.\nWhile URL encoding is typically correct for web URLs, in the specific case of SharePoint integration with Databricks, the SharePoint API expects native filesystem-style paths rather than web-encoded URLs when accessing documents through the\nFile.open_binary\nmethod.\nSolution\nRemove URL encoding for spaces (%20) from the path.\nUse regular spaces in the file path.\nMaintain the relative path structure as is.\nExample\nThe following code shows an example of %20 for spaces, which causes the error.\nresponse = File.open_binary(ctx, \"/sites/itfinance/Shared%20Documents/Evolve%20IT%20Finance/Database/...\")\nThe corrected code uses spaces instead of %20s.\nresponse = File.open_binary(ctx, \"/sites/itfinance/Shared Documents/Evolve IT Finance/Database/...\")"
5
+ }
scraped_kb_articles/attributeerror-exportmetricsresponse-when-retrieving-serving-endpoint-metrics.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/machine-learning/attributeerror-exportmetricsresponse-when-retrieving-serving-endpoint-metrics",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nYou are trying to use a Python script to retrieve your serving endpoint metrics and save them to a file in the Prometheus format, when you encounter an error.\nAttributeError: 'ExportMetricsResponse' object has no attribute 'metrics'\r\nsuggests that the ExportMetricsResponse object returned by w.serving_endpoints.export_metrics(name=endpoint_name) does not contain an attribute called metrics.\nCause\nYou can request metrics from the serving endpoint API in the Prometheus format, but if you are using alternate methods (such as a Python script to retrieve and save the data to a file and then use it in another job) the JSON response is not guaranteed to be in the Prometheus format.\nSolution\nYou can use an example Python script as a base to retrieve the metrics data from a serving endpoint, convert it to Prometheus format, and write it out to a file.\nExample code\nThis python script fetches the data from the\nserving endpoint metrics API\n(\nAWS\n|\nAzure\n|\nGCP\n) and writes the metrics from\nmetrics_output.prom\nout to a file.\nThe\nmetrics_output.prom\nfile can be read and referred to in any way you require. For example, you can ingest the response into\nsignalfx\n->\nprometheus-exporter\n.\nBefore running this example code, you need to replace:\n<workspace-url>\n- URL of the workspace\n<authentication-token>\n- Your PAT/OAuth token\n<serving-endpointname>\n- The serving endpoint you want to collect metrics from\nimport requests\r\nfrom databricks.sdk import WorkspaceClient\r\n\r\n# Initialize Databricks client\r\nworkspace = WorkspaceClient(host=\"<workspace-url>\", token=\"<authentication-token>\")\r\n\r\n# Define the endpoint URL\r\nendpoint = \"<workspace-url>/api/2.0/serving-endpoints/<serving-endpointname>/metrics\"\r\n\r\n# Set headers with correct Content-Type\r\nheaders = {\r\n\"Authorization\": f\"Bearer {workspace.config.token}\",\r\n\"Content-Type\": \"text/plain; version=0.0.4; charset=utf-8\"\r\n}\r\n\r\n# Make the GET request\r\nresponse = requests.get(endpoint, headers=headers)\r\n\r\n# Check if the request was successful\r\nif response.status_code == 200:\r\n# Save response to a file\r\nwith open(\"metrics_output.prom\", \"w\") as file:\r\nfile.write(response.text)\r\n\r\n# Print the contents of the file\r\nwith open(\"metrics_output.prom\", \"r\") as file:\r\nprint(file.read())\r\nelse:\r\nprint(f\"Error: {response.status_code}, {response.text}\")"
5
+ }
scraped_kb_articles/authorization-error-when-trying-to-retrieve-subnet-information-after-saving-locally.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/security/authorization-error-when-trying-to-retrieve-subnet-information-after-saving-locally",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nWhen you save your subnet information, you store it locally. When you later try to retrieve the subnet information again, you encounter an authorization error.\nThis Azure storage request is not authorized. The storage account's 'Firewalls and virtual networks' settings may be blocking access to storage services. Please verify your Azure storage credentials or firewall exception settings.\nCause\nSubnet information is dynamic because its serverless. Storing subnet information locally creates a static version, which becomes outdated when you try to use it later.\nSolution\nUse a script that leverages the Network Configuration Controller (NCC) API to fetch subnet information dynamically, and then allowlist the subnets. This approach integrates with serverless and automates Azure Data Lake Storage (ADLS) firewall allowlisting.\nCreate an NCC object.\nEndpoint:\nPOST https://accounts.azuredatabricks.net/api/2.0/accounts/{{accountId}}/network-connectivity-configs\nPermission: Databricks Account Admin\nPayload:\nname (<string>), region (<string>)\nAttach the NCC object to one or more workspaces.\nEndpoint:\nPATCH https://accounts.azuredatabricks.net/api/2.0/accounts/{{accountId}}/workspaces/{{<workspace-id>}}\nPermission: Databricks Account Admin\nPayload:\nnetwork_connectivity_config_id (<string>)\nObtain the NCC object details and record the subnet IDs.\nEndpoint:\nGET https://accounts.azuredatabricks.net/api/2.0/accounts/{{accountId}}/workspaces/{{workspaceId}}/network-connectivity-configs\nPermission: Databricks Workspace Admin\nAdd the subnet IDs to the firewall of the Azure storage accounts.\nEndpoint: (Azure CLI command)\naz storage account network-rule add\nParameters:\n--resource-group, --account-name, --subscription, --subnet\nVerify connectivity.\nNo API endpoint required. Verify by creating a serverless cluster and accessing the desired storage accounts."
5
+ }
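Step 3 above can be scripted. The following hedged sketch only calls the endpoint quoted in the article; the account ID, workspace ID, and token are placeholders, and the shape of the returned JSON (and therefore where the subnet IDs sit in it) should be verified against the actual response rather than assumed.

# Hedged sketch: fetch the NCC details for a workspace and print the payload so the
# subnet IDs can be recorded. All identifiers are placeholders.
import requests

ACCOUNT_ID = "<account-id>"
WORKSPACE_ID = "<workspace-id>"
TOKEN = "<account-admin-token>"

resp = requests.get(
    "https://accounts.azuredatabricks.net/api/2.0/accounts/"
    f"{ACCOUNT_ID}/workspaces/{WORKSPACE_ID}/network-connectivity-configs",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # inspect this output and note the subnet IDs it reports

# Step 4 is then the Azure CLI call from the article, with placeholder values:
#   az storage account network-rule add --resource-group <rg> --account-name <storage-account> \
#       --subscription <subscription-id> --subnet <subnet-id>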
scraped_kb_articles/auto-loader-does-not-pick-up-files-for-processing-when-uploading-via-an-azure-function.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/delta-live-tables/auto-loader-does-not-pick-up-files-for-processing-when-uploading-via-an-azure-function",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nWhen you upload files to a source location using an Azure function in Auto Loader, Auto Loader does not pick up the files for processing. The files are also not available in the queue, which Auto Loader sets automatically. The process does work, however, with manual intervention.\nCause\nAuto Loader listens for the ‘\nFlushWithClose\n’ event to process a file in file notification mode. The Azure function used for uploading files to a source location does not set the\nClose\nproperty in File Flush options, so EventBridge does not send the necessary event.\nSolution\nModify your Azure function to set the File Flush option with the\n'\nCLOSE\n' parameter.\nVerify that the\nFlushWithClose\nevent is generated in the Azure Queue after uploading the files to the source location.\nUse Directory listing mode until the file notification issue is resolved, then switch to file notification.\nMonitor the diagnostic logging on the storage Blob and Queue to identify any issues with the file upload process.\nFor more information, please review the\nWhat is Auto Loader file notification mode?\nand\nDataLakeFileFlushOptions.Close Property\ndocumentation."
5
+ }
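If the Azure Function writes through the Python azure-storage-file-datalake SDK, the change amounts to flushing with close=True. The sketch below is an assumption-laden illustration, since the article does not show the function's code: the account, container, path, and payload are all placeholders.

# Hedged sketch: upload a file so the final flush emits the FlushWithClose event
# that Auto Loader listens for. All connection details are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential="<account-key-or-credential>",
)
file_client = (
    service.get_file_system_client("<container>")
           .get_file_client("<landing-folder>/my_file.json")
)

data = b'{"col_1": "us-east-1", "col_3": "prod"}'
file_client.create_file()
file_client.append_data(data, offset=0, length=len(data))
file_client.flush_data(len(data), close=True)   # close=True is what generates FlushWithClose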
scraped_kb_articles/auto-loader-fails-to-pick-up-new-files-when-using-directory-listing-mode.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/streaming/auto-loader-fails-to-pick-up-new-files-when-using-directory-listing-mode",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nYou may encounter an issue where Auto Loader does not pick up new files in\ndirectory listing mode\n(\nAWS\n|\nAzure\n|\nGCP\n) in scenarios where the source\ncloudFiles\nfile naming convention has changed.\nCause\nThis is related to the way lexical ordering works when using directory listing mode in Auto Loader. New files with different naming conventions are not being recognized as new.\nExample\nThis Python example demonstrates a possible scenario that could occur.\n# List of sample filenames in source cloudfiles location\r\nfilenames = [\r\n    \"MYAPP_1970-01-01.parquet\",  # <-- older file\r\n    \"MYAPPX_1970-01-02.parquet\", # <-- newer file with slightly modified naming convention (added X character before the _)\r\n]\r\n# Sort the filenames lexicographically\r\nfor filename in sorted(filenames):\r\n    print(filename)\r\n#MYAPPX_1970-01-02.parquet # new file listed first (considered oldest)\r\n#MYAPP_1970-01-01.parquet  # old file listed last (considered newest)\nTo better understand how lexical ordering works, please review the\nLexical ordering of files\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nSolution\nUse file notification mode\nUse\nfile notification mode\n(\nAWS\n|\nAzure\n|\nGCP\n) instead of directory listing mode. File notification mode is lower-latency, can be more cost-effective, and helps avoid lexical ordering issues.\nDisable incremental listing\nIf you cannot use file notification mode, you should disable incremental listing by setting the Apache Spark option\ncloudFiles.useIncrementalListing\nto\nfalse\n.\nThis allows new files to be picked up, although it may increase the time spent listing files.\nNote\nIncremental listing mode is deprecated and should not be used. For more information, review the\nIncremental Listing (deprecated)\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nFor more information and best practices on using Auto Loader, review the\nConfigure Auto Loader for production workloads\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
5
+ }
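For the second option, the sketch below shows where the setting goes on the stream; the file format, paths, and table name are placeholders.

# Hedged sketch: directory listing mode with incremental listing disabled, so files
# whose names sort "before" already-seen files are still discovered. Placeholders throughout.
df = (
    spark.readStream.format("cloudFiles")
         .option("cloudFiles.format", "parquet")
         .option("cloudFiles.useIncrementalListing", "false")   # full listing on each batch
         .load("<source-path>")
)

(
    df.writeStream
      .option("checkpointLocation", "<checkpoint-path>")
      .trigger(availableNow=True)
      .toTable("<target-table>")
)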
scraped_kb_articles/auto-loader-failures-with-javaiofilenotfoundexception-for-sst-and-log-files.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/streaming/auto-loader-failures-with-javaiofilenotfoundexception-for-sst-and-log-files",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nYou notice that jobs which previously ran successfully start failing without any recent changes to the code or environment. You discover subsequently that pipelines using Auto Loader to ingest Delta changes for multiple tables fail suddenly with an error message,\njava.io.FileNotFoundException\n, for SST and log files.\nCause\nYour table’s checkpoint and Delta paths are misconfigured. When the\nVACUUM\ncommand is executed on the table, it deletes all files in the directory that are not tracked by\n_delta_log\n, including the checkpoint files. This can happen if the checkpoint path is set to the same location as the Delta path of the table.\nThe issue can also occur if multiple streams or jobs use the same checkpoint directory, which leads to conflicts and file deletions that can cause pipeline failures.\nSolution\nFirst, create a separate checkpoint folder outside of the Delta directory. This ensures that the checkpoint files are not deleted during the\nVACUUM\noperation.\nNext, for successful runs, copy the checkpoint files to this new path. For failed runs, start with a new checkpoint and use the\nmodifiedAfter\noption to the stream to ingest files that have a modification timestamp after a specific timestamp.\nFor more detail on the\nmodifiedAfter\noption, refer to the\nAuto Loader options\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation.\nLast, run a\nVACUUM DRY RUN\nto validate which files will be deleted in the next vacuum execution. This helps ensure that no necessary files are deleted."
5
+ }
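The recovery path for a failed run can look roughly like the sketch below. The timestamp, paths, and table names are placeholders; the new checkpoint location must live outside the table's Delta directory so a later VACUUM does not remove it.

# Hedged sketch: preview what VACUUM would delete, then restart the stream from a new
# checkpoint, only ingesting files modified after a chosen cutoff. Placeholders throughout.
spark.sql("VACUUM <your-table> DRY RUN")   # validate nothing required is scheduled for deletion

df = (
    spark.readStream.format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("modifiedAfter", "2025-01-01 00:00:00.000000 UTC+0")   # hypothetical cutoff
         .load("<source-path>")
)

(
    df.writeStream
      .option("checkpointLocation", "<new-checkpoint-path-outside-the-delta-directory>")
      .toTable("<target-table>")
)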
scraped_kb_articles/auto-loader-file-notification-mode-fails-to-identify-new-files-from-the-cloud-queue-service.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/streaming/auto-loader-file-notification-mode-fails-to-identify-new-files-from-the-cloud-queue-service",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nWhen using Auto Loader in file notification mode, Auto Loader does not ingest your files to the target location as expected. This may happen even in situations where the referenced cloud queue service (AWS SQS, Azure Queue Storage, or Google Pub/Sub) contains messages with valid object storage URIs you expect Auto Loader to ingest. You receive\nWARN\nmessages in log4j.\nThe following is an example from AWS SQS.\nWARN S3Event: Ignoring unexpected message received in SQS queue\nExample configuration\nspark.readStream.format(\"cloudFiles\")\r\n  .option(\"cloudFiles.format\", \"json\")\r\n  .option(\"cloudFiles.useNotifications\", True)\r\n  .option(\"cloudFiles.queueUrl\", \"https://<cloud-queue-url>\")\r\n  .option(\"<other-cloudFiles-options>\",\"<other-values>\")\r\n  ...\nCause\nMessages in your cloud queue service do not conform to the expected format. Messages that don’t comply with Auto Loader's expectations cause warnings like\nWARN S3Event: Ignoring unexpected message received in SQS queue\nfor AWS.\nSolution\nEnsure that your cloud queue messages conform to the expected format of the service consuming them.\nWhen using Auto Loader in AWS, the message format should comply with the\nObjectCreated\nevents in the AWS SQS Queue.\nAuto Loader on Azure expects messages that relate to the\nFlushWithClose\nevents.\nGCP expects messages that relate to the\nOBJECT_FINALIZE\nevent.\nIf the messages conform to the expected format, the Auto Loader stream should then progress as expected.  For further information on Auto Loader File Notification mode, refer to the\nWhat is Auto Loader file notification mode?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
5
+ }
scraped_kb_articles/auto-loader-streaming-job-failure-with-schema-inference-error.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/streaming/auto-loader-streaming-job-failure-with-schema-inference-error",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nYou have an Apache Spark streaming job using Auto Loader encounter an error stating:\nSchema inference for the 'parquet' format from the existing files in the input path <Root Folder> has failed\nCause\nOne possible cause for this issue is having multiple types of files in the child directories.\nThe input directory structure includes a root folder containing nested directories such as folder A and folder B, each containing various file formats.\nRoot Folder -> Folder A -> Folder B -> Avro files (*.avro)\r\nRoot Folder -> Folder A -> Folder C -> Parquet files (*.parquet)\nSolution\nTo selectively read a specific type of file using Auto Loader from a directory with diverse file formats, use the\npathGlobFilter\noption.\nFor example, you can use\n.option(\"pathGlobfilter\", \"*.parquet\")\nto set a suffix pattern for Parquet files, ensuring that only Parquet files are processed.\nFor more information, review the\nFiltering directories or files using glob patterns\n(\nAWS\n|\nAzure\n|\nGCP\n)."
5
+ }
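A hedged sketch of that option on an Auto Loader stream is shown below; the root path, schema location, and target are placeholders.

# Hedged sketch: only Parquet files under the root folder are considered, so schema
# inference is not confused by the Avro files in sibling directories. Placeholders throughout.
df = (
    spark.readStream.format("cloudFiles")
         .option("cloudFiles.format", "parquet")
         .option("cloudFiles.schemaLocation", "<schema-location-path>")
         .option("pathGlobFilter", "*.parquet")
         .load("<root-folder-path>")
)

(
    df.writeStream
      .option("checkpointLocation", "<checkpoint-path>")
      .toTable("<target-table>")
)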
scraped_kb_articles/auto-loader-streaming-query-failure-with-unknownfieldexception-error.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/streaming/auto-loader-streaming-query-failure-with-unknownfieldexception-error",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nYour Auto Loader streaming job fails with an\nUnknownFieldException\nerror when a new column is added to the source file of the stream.\nException: org.apache.spark.sql.catalyst.util.UnknownFieldException: Encountered unknown field(s) during parsing: <column name>\nCause\nAn\nUnknownFieldException\nerror occurs when Auto Loader detects the addition of new columns as it processes incoming data.\nThe addition of a new column causes the stream to stop and generates an\nUnknownFieldException\nerror.\nSolution\nSet your Auto Loader stream to use schema evolution to avoid this issue.\nFor more information, review the\nHow does Auto Loader schema evolution work?\n(\nAWS\n|\nAzure\n|\nGCP\n) documentation."
5
+ }
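As a hedged illustration, an Auto Loader stream set up for schema evolution can look like the sketch below; paths and names are placeholders. Note that with addNewColumns, the default evolution mode, the stream still stops once when a new column first appears, records the evolved schema, and then succeeds on restart.

# Hedged sketch: Auto Loader with schema evolution enabled so new source columns are
# absorbed rather than breaking the pipeline permanently. Placeholders throughout.
df = (
    spark.readStream.format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "<schema-location-path>")
         .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
         .load("<source-path>")
)

(
    df.writeStream
      .option("checkpointLocation", "<checkpoint-path>")
      .option("mergeSchema", "true")   # allow the Delta sink to add the new columns
      .toTable("<target-table>")
)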
scraped_kb_articles/autoloader-job-fails-with-a-urisyntaxexception-error-due-to-invalid-characters-in-filenames.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "url": "https://kb.databricks.com/en_US/streaming/autoloader-job-fails-with-a-urisyntaxexception-error-due-to-invalid-characters-in-filenames",
3
+ "title": "Unknown Article Title",
4
+ "content": "Problem\nYou have an Autoloader job configured in Directory listing mode and are encountering a failure with a\nURISyntaxException\nerror.\njava.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: [masked_uri]\nCause\nThe error message indicates an issue with the URI (Uniform Resource Identifier) used in the Autoloader job configuration. This problem occurs when the source folder contains files with names containing colons (\"\n:\n\"). The Autoloader, in Directory listing mode, relies on the Hadoop library for file listing. Filenames with colons violate the library's naming limitations.\nFor more information on these limitations please review the\nHadoop Documentation\n.\nThe Apache community has acknowledged this issue in\nHDFS-14762\n.\nSolution\nYou have a few options to work around this naming limitation. Choose the most appropriate resolution based on your specific use case and requirements.\nAvoid filenames with colons:\nEnsure that filenames within the source path do not contain colons to comply with the Hadoop library naming constraints.\nSwitch to File notification mode:\nTransition from Directory listing mode to File notification mode in Autoloader.\nDisable incremental listing:\nIf Directory listing mode is required, disable incremental listing by setting\noption(\"cloudFiles.useIncrementalListing\", \"false\")\non\nreadStream\n. Note that this may degrade read performance.\nClear the checkpoint (temporary mitigation):\nIf the issue is infrequent, clearing the checkpoint may provide temporary relief. This should be considered a last resort,as it may result in duplicate processing in stateful streaming queries."
5
+ }